
Copyright © Schmied Enterprises LLC, 2024.

Professional venture capital firms usually provide advisors and training to keep their investment safe.

Stanford research suggests that the startups that succeed are the ones with strong demand in their market; demand matters more than other factors like founder skills or financial backing.

We are in a new century. Microsoft, Google, Apple, and Amazon grew large on principles shared with nearby universities over decades. Those methodologies are widely known, and they can be followed easily.

There are four factors that eventually matter when measuring success. Timing is important. Product and price have to match the target customer base. Design and promotion must be appealing.

What happens next? The product needs to scale. Here are the rules for building such a product, explained for non-technical people.

Scaling has come in waves throughout history. IBM mainframes scaled quite well with terminals in the same building. Personal computing leveraged the strength of semiconductors to give each user at home a far better experience.

What does technology need to support any demand?

You may be surprised: your solution's architecture does not matter much anymore. If it handles five customers on a cloud instance, it can be scaled easily.

The workload may be quite isolated, in which case more instances handle the load very well, with a linear cost increase per customer, which is the usual requirement.

Facebook was one of the earliest companies that had to deal with the network effects of communication. If instances are interconnected, some engineering decisions can help. Each additional customer may add communication channels in proportion to the size of the existing cluster; n customers can form up to n(n-1)/2 channels. This is a polynomial increase, but it fades away once each customer reaches a few hundred to a thousand connections.

You scale a cluster from ten to a thousand nodes to handle from a hundred to ten thousand customers. Your data becomes scattered by design. Isolated workloads can use load balancers with simple logic: hash the identity of each customer to direct their requests to the same place.
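
As a sketch, that routing can look like the following. The node names and customer IDs are illustrative; the point is that the target node is a pure function of the customer's identity, so every load balancer replica computes the same answer without shared state.

```python
import hashlib

# Illustrative backend pool; in practice this comes from service discovery.
NODES = ["node-0", "node-1", "node-2", "node-3"]

def node_for_customer(customer_id: str) -> str:
    """Map a customer to a node as a pure function of their identity."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return NODES[int.from_bytes(digest[:8], "big") % len(NODES)]

print(node_for_customer("customer-42"))  # same node on every call
```

A plain modulo like this reshuffles most keys when the node count changes; production load balancers usually use consistent hashing to limit that movement.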

Some network effects, and software with heavy communication like TikTok, require a bit of what is called sharding. Incoming requests are redirected by customer, but the heavier workloads of video traffic and machine learning require spreading the data evenly. Multiple customers can use the same video stream, so if a place gets crowded it can stream in parallel to a new place that can handle more requests. Hadoop, Kubernetes, and Nomad are systems that have traditionally handled such workloads. Extra usage generates extra copies, which in turn become the source of a multiplied number of streams. A scheduled term cleanup can save on idle time at night.
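
A minimal sketch of that spill-over idea follows, with hypothetical node names and thresholds rather than any real platform's mechanism: when all replicas of a popular stream are crowded, the stream is copied to one more node and viewers are spread across the copies.

```python
# Hypothetical sketch of spilling a crowded stream to a new replica.
MAX_VIEWERS_PER_REPLICA = 1000  # illustrative capacity per node

class StreamPlacement:
    def __init__(self, video_id: str):
        self.video_id = video_id
        self.replicas = ["stream-node-0"]  # the first copy of the video
        self.viewers = 0

    def assign_viewer(self) -> str:
        """Admit one viewer; copy the stream to a new node when crowded."""
        self.viewers += 1
        if self.viewers > MAX_VIEWERS_PER_REPLICA * len(self.replicas):
            # The crowded place streams in parallel to one more node.
            self.replicas.append(f"stream-node-{len(self.replicas)}")
        # Spread viewers evenly across the existing replicas.
        return self.replicas[self.viewers % len(self.replicas)]
```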

Be careful about requirements that centralize data. Collecting traces or processing videos will require heavy hardware, possibly GPUs. Streaming systems like Kafka, Flink, or Spark support these workloads. However, the insights rarely justify the expense. If your solution requires only polynomial interconnections, like friends or limited-size groups, it can be handled. Full centralization, however, may require polynomial or even exponential resources.

So, how do you write your app to scale? Use model, view, and controller logic. Write your view for your platform: Android, iOS, or Web. Separate any logic into code running in Docker as JavaScript, Java, or Python. Keep your model separately in a database such as MySQL, PostgreSQL, MongoDB, Redis, or Oracle. Professional IT companies can then scale out your view and controller logic with serverless solutions like AWS Lambda. The model can scale up using a database like Snowflake, PingCap, or AWS Redshift.
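
A minimal sketch of that separation in Python, assuming a reachable Redis instance holds the model; the key scheme is illustrative. The controller functions keep no state of their own, so any number of instances can run behind a load balancer.

```python
import json
import redis  # assumed: a reachable Redis instance holds the model

r = redis.Redis(host="localhost", port=6379)

def load_profile(customer_id: str) -> dict:
    """Controller: fetch the model over the network, never from local memory."""
    raw = r.get(f"profile:{customer_id}")
    return json.loads(raw) if raw else {}

def save_profile(customer_id: str, profile: dict) -> None:
    """Controller: write the model back so any instance can serve the next call."""
    r.set(f"profile:{customer_id}", json.dumps(profile))
```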

Here are the right steps to scale out.

1. Write your code to avoid any long-term memory use except caching. All data is fetched and stored over the network on another system. We emulated this with our solution tig, inspired by Redis.

2. Keep your design and logic short-running, less than half a second per call. This allows others to run it on serverless containers like AWS Lambda. We emulated this in storm. A minimal handler shaped this way is sketched after this list.

3. A professional company can simply scale it out at this point.
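
Here is a sketch of a handler shaped for these rules, using the AWS Lambda handler convention; the half-second guard and the function names are illustrative, not part of any framework.

```python
import time

DEADLINE_SECONDS = 0.5  # step 2: each call should finish within half a second

def do_logic(event):
    # Pure logic only; all state lives in the cache tier (see step 1).
    return {"echo": event}

def handler(event, context):
    """Entry point in the AWS Lambda convention; holds no long-term memory."""
    start = time.monotonic()
    result = do_logic(event)
    elapsed = time.monotonic() - start
    if elapsed > DEADLINE_SECONDS:
        # Illustrative guard: flag work that should move to a background job.
        print(f"warning: call took {elapsed:.3f}s, split it or snapshot it")
    return result
```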

Following these three steps gives your solution value. The market value increases if it can handle a hundred thousand customers just by wrapping it with the right scale-out logic, with no code changes. Rewrites, cross-customer calls, and heavy caching would limit scalability and resale value. SQL queries that last longer than half a second may also need streaming and materialized views.

Here is what we will do.

1. Any design and logic code that runs within half a second is scheduled onto serverless handlers.

2. Any logic that needs specific data of a customer finds that data in a memory cache database by sharding of its hash.

3. Auto-scaling of the cluster is done by listening to demand. Every storage node is temporary, with a term of an hour or a month. Data not touched by then is deleted. Important data is backed up every two weeks, before cleanup, to another container kept for a year or five years. A sketch of this expiry and backup pass follows this list.

4. Serverless runners and temporary storage scale up and down using an algorithm resembling the mitosis of cells.

5. Long-running queries, like summing up cash flows, run in the background, creating snapshots.
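
The expiry and backup pass from step 3 can be sketched like this, with in-memory dictionaries standing in for the storage node and the backup container; the term length and the data layout are illustrative.

```python
import time

TERM_SECONDS = 3600  # each storage node lives for a fixed term, here one hour

store = {}   # key -> (value, last_touched, important); stands in for a node
backup = {}  # stands in for the long-kept container: always written, rarely read

def touch(key, value, important=False):
    store[key] = (value, time.time(), important)

def end_of_term_cleanup():
    """Delete untouched data; snapshot important data to the backup first."""
    now = time.time()
    for key, (value, last_touched, important) in list(store.items()):
        if now - last_touched > TERM_SECONDS:
            if important:
                backup[key] = value  # back up before the node is torn down
            del store[key]
```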

Problem. Scaling out will spread the data across nodes.

Solution. Indeed, a magnitude-larger workload cannot be held by each node anymore. Simply copying and caching everything to all nodes no longer works. Sharding with a good hash is the right approach in this case.

Problem. Sharding by data hash is the optimal way to scale out storage.

Solution. Indeed. If the location is a direct function of the key or value, the search can be repeated deterministically. Such a shard function is a hash algorithm by definition, like SHA256. Refer to our tig example.

Problem. Flexible supply of compute, called auto-scaling, is hard to solve.

Solution. Not anymore. A fixed term to terminate nodes helps to plan ahead. If the workload increases or decreases, the number of new nodes replacing the ones whose term expires simply follows the demand, without extra code. Refer to our tig example.

Problem. The mitosis algorithm is the optimal way to follow demand.

Solution. Indeed. Mitosis originally happens in cells, which split when enough nutrients like glucose are collected. Mitosis in clusters tracks how fast a node fills up compared to its expected lifetime. Extra demand triggers adding a new node; less demand delays replacement. An example is a node that runs for an hour to handle a thousand requests. If seven hundred requests have been handled after half an hour, mitosis creates an extra node.
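
The check fits in a few lines. This sketch reproduces the example above: the node is ahead of its pro-rated share of work, so it splits. The function name and parameters are illustrative.

```python
def should_split(handled: int, capacity: int, elapsed: float, term: float) -> bool:
    """Mitosis check: split when demand runs ahead of the node's lifetime.

    A node is expected to absorb `capacity` requests spread evenly over
    `term` seconds; handling more than the pro-rated share by now means
    demand is high, so a new node is started, like a cell dividing.
    """
    return handled > capacity * (elapsed / term)

# The example from the text: 700 of 1000 requests after half of a one-hour term.
print(should_split(handled=700, capacity=1000, elapsed=1800, term=3600))  # True
```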

Problem. Mitosis and any scaling are to be done with a planned tear-down.

Solution. Indeed. Planned termination of nodes ensures that streaming out useful data is the main code path, and therefore well tested. Data cannot be expected to stay anywhere for long this way. Graceful degradation used to be a less tested way of doing such logic on demand; testing it is expensive, and failures are more frequent.

Problem. Fault tolerance is outdated.

Solution. Our suggestion of term nodes, reliably torn down after a fixed period, is optimal. The term can be adjusted based on expected fault rates. Redundancy becomes the main path: an expected shutdown streams data out to a new node or backup and releases the data no longer marked as used. Refer to our tig example.

Problem. Separation of storage and compute is solved by term nodes and streaming.

Solution. Indeed. Classically, compute used to be coupled with storage on less flexible off-the-shelf hardware using Hadoop. However, tight coupling is only necessary for very interactive workloads. Streaming larger buffers can handle processing on nearby nodes without degrading cluster throughput in non-interactive cases.

Problem. Temporary term-storage or cache containers solve the issue of permanent storage on Kubernetes.

Solution. Traditionally there was no economically efficient way to handle permanent storage on clusters running Kubernetes workloads designed for compute. Our approach of term storage in tig can run on Kubernetes. Temporary data is deleted automatically after a while, without needing development, test, and operations resources. Important data can be streamed out as a snapshot. Eventually the only permanent storage is a backup that is ideally always written and never read.

Problem. The statistically ideal replication count is two.

Solution. Indeed. Systems tend to scale out, but hardware errors multiply as more machines are in use. Replicating to a single extra node can handle rare data loss by switching to the replica. This is as safe as running productivity software on DOS and saving occasionally. More replication would require more software and testing time. The network usage of keeping multiple replicas in sync can be expensive, and hardware utilization starts to drop compared to the economic value created.

Problem. I do not trust data stored in the cloud.

Solution. Use more cloud providers: store a random key on one and its XOR with your real key on another. Use the same logic to scale up to more providers. Any tampering can be caught almost immediately this way. It is almost impossible to coordinate attacks across cloud providers unless a government is involved, and that would require local court permissions anyway.
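
A sketch of the XOR splitting, using Python's standard secrets module; each share alone is indistinguishable from random noise, so neither provider learns the key.

```python
import secrets

def split_key(key: bytes) -> tuple[bytes, bytes]:
    """Split a key into two shares; each alone is just random noise."""
    share_a = secrets.token_bytes(len(key))               # store at provider A
    share_b = bytes(x ^ y for x, y in zip(key, share_a))  # store at provider B
    return share_a, share_b

def join_key(share_a: bytes, share_b: bytes) -> bytes:
    """XOR the shares back together; altering either share changes the result."""
    return bytes(x ^ y for x, y in zip(share_a, share_b))

key = secrets.token_bytes(32)
a, b = split_key(key)
assert join_key(a, b) == key
```

Actually detecting tampering needs one extra assumption beyond the XOR itself: keeping a checksum of the original key so a corrupted recombination is noticed immediately.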

References:

Tig storage

Storm load balancing

Sat consistent databases

This article was revised on May 4, 2024.