As a developer who writes web applications, I have watched the growth of cloud computing, and specifically Microsoft Azure, with great interest. It can’t be denied that cloud applications can potentially scale and perform much better, because the cloud fabric is elastic in a way that on-premises environments are not. However, scalability doesn’t come by default in your application: if you deploy a simple web application to Azure, it doesn’t magically become more scalable.
This article was published in the DNC Magazine for Developers and Architects.
In this article, I will explain why and how to build for scalability, mentioning Azure components that are useful in that effort.
Why Scalability Matters
If we look it up in a dictionary, scalability is described as scale-ability: the ability to scale, or to accommodate a growing amount of work in a graceful way. This will be our goal throughout this article.
It is important to highlight that “a graceful way” can mean many things for many applications, but usually it refers to a smooth and progressive degradation of service that allows the application to keep working, albeit with somewhat increased response times. Between returning an error to the user and making the user wait a little longer, we should always choose the latter.
The process of building a scalable application begins with the design of the application. According to scalability experts Martin Abbott and Michael Fisher, who wrote an excellent little book called “Scalability Rules”, scalability should be designed for in three phases: the design itself, the implementation, and the deployment.
Figure 1: Scalability has to be designed for, beginning with the application design
When we design for scalability in the first place, we can address many potential issues without compromising an existing code base, because there isn’t one yet. On paper, we can quickly iterate over candidate architectures until we find one that might be right. In this phase, you should be able to identify architectures that can give you at least 20 times your desired workload.
In the next phase, implementation, you build your solution according to the architecture designed previously. Here, our goal is to scale to between 3 and 20 times our envisioned workload, using techniques that I will outline in this article. The cost of implementing changes is higher at this point, as there is already an existing codebase to take into account.
Once the application is deployed, we can still streamline the deployment architecture to optimize for scalability, but we can’t expect too much from it: a sound deployment architecture can yield from 50% to 300% more scalability than we initially targeted. Leveraging elasticity, by provisioning extra instances during high load and shutting them down when the load is reduced, is also very important in order to maximize the use of our resources (and money).
If scalability is so great, why don’t we always build for it? That’s a good question, and the answer lies in the cost versus benefit equation. Scalable applications are usually more complex and more expensive to build than applications designed with no explicit high scalability in mind, so when we build an application, we must ensure that the benefits of high scalability are worth the added complexity and cost.
Enemies of Scalability
There are features of web applications that make it difficult to scale past their intrinsic user loads. Some of them are introduced by a design choice, but many are just “default” features of a normal-looking web application. I will briefly mention four of them: bottlenecks, round trips and latency, blocking calls, and single points of failure.
Figure 2: A typical Azure web application with the “enemies of scale” highlighted in red
Bottlenecks
A bottleneck is a single component in the application that all the communication must pass through. Bottlenecks are easy to create, because they are natural points of communication between the layers in our architecture. For example, there is usually a Data Access Layer of some kind in almost any application. If this layer is centralized, all the communication to and from the database passes through it, and we have a bottleneck.
Bottlenecks put a limit on the communication flow in the application, which can’t accommodate more requests and calls than the bottleneck allows for.
Round trips and Latency
Unnecessary round trips and communication in our application are also an enemy of scalability. When our application runs in a local network, communication is very quick, as the latency introduced by the network is almost negligible. But when our application runs in the cloud and is accessed by thousands of users, latency increases because there are many network components between the communicating parties. The clearest case is a request from the browser, or the client side, to the server side of the application. Each request has to travel across the Internet to the datacenter, passing through routers, switches and other network devices, all of which introduce some delay or latency.
Blocking Calls
A web application is by nature an example of the request-response model. A page in the application issues a request to the back-end, and the back-end responds. From the moment the request is made until the response is received, the calling page and its executing thread are blocked, but they aren’t doing anything useful; they just sit idle waiting for the response to come. This limits our application’s maximum throughput, as there is a limited number of threads in the web server available to make requests to the back-end. If all of them are blocked waiting for responses, our application will put any additional request on hold until one of these threads is free, which adds to the average response time as well. If the wait is too long, we will see timeouts in our application.
Single Points of Failure (SPOF)
Finally, almost every component of our application can act as a single point of failure. A single point of failure, or SPOF, is a component that breaks the application if it goes down. In a single-node deployment with no load balancing, the entire application is one huge single point of failure. We can spin up more nodes running our application, but that usually just shifts the failure point to the database, as all the nodes typically share a common one. We must also make the database redundant to avoid a single point of failure.
Components of Highly Scalable Applications
Now that I have shared with you the enemies of scale, it’s time to see their heroic counterparts. These are the components of highly scalable applications, architectural patterns and techniques that aid in the pursuit of great scalability.
Minimizing Locks in Storage
In most deployments there is a centralized storage location, such as a database, where the “single source of truth” is stored. Applications read and write data from and to this centralized location, and the dynamics of these actions cause locks that prevent incoherent data. We can’t go faster than our database throughput, can we?
Yes, we can! But before doing so, we must prune our storage operations to avoid locking, waits and bottlenecks. By having a streamlined storage layer, we can build our application layers on solid ground.
Relational databases suffer from having to serve reads and writes at the same time. Read operations need fast response times, while write operations need high throughput. To shield read operations from unconfirmed write operations (and keep the data consistent), relational databases offer varying isolation levels. The higher the isolation level, the more locks and waits in database operations and the less concurrency, but the higher the data consistency.
There are multiple techniques for minimizing storage locking. We can split or partition the storage into different parts, gaining throughput because multiple components work in parallel and the total throughput is the sum of the individual partitions. We can split our database into read and write parts (a pattern called Command-Query Responsibility Segregation, or CQRS) to eliminate the locking caused by the transaction isolation levels of our database. We can even dispense with a relational database and use a NoSQL database, or an in-memory database. Choosing the best storage mechanism for every data storage need in our application, instead of settling on one compromise, is called Polyglot Persistence. In Azure we can use Table Storage, SQL Azure, DocumentDB or any IaaS storage technology that’s available.
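As a rough illustration of the partitioning idea, here is a minimal C# sketch (all names are hypothetical, and a real partitioned store would be spread across servers, not kept in one process): each key is routed by a stable hash to one of N independent partitions, so the total throughput approaches the sum of what each partition can sustain.

```csharp
using System.Collections.Generic;

// Hypothetical sketch of hash-based partitioning: the store is split
// into N partitions, and each key is always routed to the same one.
public class PartitionedStore
{
    private readonly List<Dictionary<string, string>> _partitions;

    public PartitionedStore(int partitionCount)
    {
        _partitions = new List<Dictionary<string, string>>();
        for (int i = 0; i < partitionCount; i++)
            _partitions.Add(new Dictionary<string, string>());
    }

    // A stable, non-negative hash of the key decides the owning partition.
    private int PartitionFor(string key) =>
        (key.GetHashCode() & int.MaxValue) % _partitions.Count;

    public void Put(string key, string value) =>
        _partitions[PartitionFor(key)][key] = value;

    public string Get(string key) =>
        _partitions[PartitionFor(key)].TryGetValue(key, out var value) ? value : null;
}
```

Because requests for different keys land on different partitions, they no longer contend for the same locks.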
We can also dispense with the immediate consistency of relational databases and embrace eventual consistency, which ensures that the data will eventually be updated to the latest version, just not immediately. Almost every scalable architecture uses eventual consistency in one way or another.
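To make the idea concrete, here is a purely illustrative, single-process C# sketch of eventual consistency (all names are hypothetical): writes go to a primary copy, and a replica only catches up when a background synchronization runs, so reads served by the replica may briefly return stale data.

```csharp
using System.Collections.Generic;

// Hypothetical sketch of eventual consistency: the replica lags behind
// the primary until a synchronization pass copies the latest data over.
public class EventuallyConsistentStore
{
    private readonly Dictionary<string, string> _primary = new Dictionary<string, string>();
    private readonly Dictionary<string, string> _replica = new Dictionary<string, string>();

    // Writes are acknowledged as soon as the primary has them.
    public void Write(string key, string value) => _primary[key] = value;

    // Reads served by the replica may not yet see the latest writes.
    public string ReadFromReplica(string key) =>
        _replica.TryGetValue(key, out var value) ? value : null;

    // Stand-in for background replication: afterwards the replica has caught up.
    public void Sync()
    {
        foreach (var pair in _primary)
            _replica[pair.Key] = pair.Value;
    }
}
```

The window between Write and Sync is exactly the “eventual” part: the data is temporarily stale, but never lost.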
Caching
The second component of scalable applications is caching. Caching is undoubtedly the cheapest way to reduce unnecessary round trips and latency: the data read from the server is copied and stored locally. By using caching, we basically trade increased RAM consumption for lower response times. All caching mechanisms are prone to stale data, that is, data that does not reflect the latest state. Stale data is the inevitable consequence of caching, and can be mitigated at least in part by using multiple caching levels and distributing the cache across different servers.
In Azure, the recommended caching solution is Redis, a very performant, dedicated key-value cache.
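As an illustration of the caching trade-off, here is a minimal in-memory, time-to-live cache sketched in C# (hypothetical names; in Azure you would use Redis rather than rolling your own): entries are served from RAM until their TTL expires, after which they are considered stale and reloaded from the source.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical TTL cache sketch: each entry carries an expiry timestamp,
// trading RAM for fewer round trips to the underlying data source.
public class TtlCache<TKey, TValue>
{
    private readonly ConcurrentDictionary<TKey, (TValue Value, DateTime Expiry)> _entries =
        new ConcurrentDictionary<TKey, (TValue, DateTime)>();
    private readonly TimeSpan _ttl;

    public TtlCache(TimeSpan ttl) => _ttl = ttl;

    public TValue GetOrAdd(TKey key, Func<TKey, TValue> load)
    {
        if (_entries.TryGetValue(key, out var entry) && entry.Expiry > DateTime.UtcNow)
            return entry.Value;        // fresh hit: no round trip to the server

        var value = load(key);         // miss or stale entry: pay the round trip once
        _entries[key] = (value, DateTime.UtcNow.Add(_ttl));
        return value;
    }
}
```

Choosing the TTL is exactly the staleness trade-off from above: a longer TTL saves more round trips but serves older data.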
Non-Blocking Asynchronous Requests
The third superhero of highly scalable apps is the use of asynchronous, or non-blocking, requests. We have seen that an application thread issuing a request is essentially blocked until the response returns from the back-end. If the programming language allows us to use asynchronous calls, and C# certainly does, we can free the thread from idly waiting for the response. This can have an enormous impact on the throughput of the application, as we use the existing threads much more efficiently: the same threads can be reused to serve other work while the existing requests to the back-end are still pending.
Luckily, .NET 4.5 or .NET Core with the async/await keywords, together with the asynchronous methods of the Azure APIs, helps us accommodate many asynchronous calls in our application, minimizing the useless blocking of the calling threads on the web server.
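The following self-contained C# sketch shows the pattern (the names are hypothetical, and Task.Delay stands in for a real back-end call): the calls are started together and awaited with Task.WhenAll, so the awaiting thread is released back to the pool and three 200 ms calls overlap instead of running sequentially.

```csharp
using System.Threading.Tasks;

public static class AsyncDemo
{
    // Simulates a back-end call. While the Task.Delay is pending, the
    // awaiting thread is returned to the pool instead of blocking.
    private static async Task<int> CallBackendAsync(int id)
    {
        await Task.Delay(200); // stand-in for network/database latency
        return id * 2;
    }

    // Starts all calls up front and awaits them together: the three
    // 200 ms waits overlap instead of taking 600 ms end to end, and no
    // web server thread sits idle while the "back-end" works.
    public static async Task<int[]> CallAllAsync()
    {
        Task<int>[] calls = { CallBackendAsync(1), CallBackendAsync(2), CallBackendAsync(3) };
        return await Task.WhenAll(calls);
    }
}
```

The same shape applies to real I/O: replace Task.Delay with any awaitable call, such as an asynchronous HTTP or database request.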
Queues
With asynchronous calls we optimize the use of threads, but in the end we still wait for the response to arrive. By using queues, we can decouple from the request-response model: we just send a message to the queue and return immediately. The sender is then free to attend to other requests, and the receiver is free to take messages from the queue and process them at its own pace. This complicates things a little, but brings a whole set of improvements. We can offload long operations to other nodes of the application and have their results sent back through another queue. We also get fault tolerance, as the queue mechanism usually allows multiple retries if a processing node fails. And we can throttle the requests to the back-end to minimize the pressure on it, which comes in handy to mitigate Distributed Denial of Service (DDoS) attacks.
In Azure we have two main queuing mechanisms: Storage Queues and Service Bus queues. You can think of them as fast, low-level queues and transactional, higher-level queues, respectively.
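Here is a minimal, in-process C# sketch of queue-based decoupling (the names are hypothetical, and in Azure the BlockingCollection would be replaced by a Storage Queue or a Service Bus queue): the sender enqueues a message and returns immediately, while a worker drains the queue at its own pace.

```csharp
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Hypothetical sketch of sender/receiver decoupling through a queue.
public class WorkQueue
{
    private readonly BlockingCollection<string> _queue = new BlockingCollection<string>();
    public readonly ConcurrentBag<string> Processed = new ConcurrentBag<string>();

    // Sender side: returns as soon as the message is enqueued,
    // without waiting for it to be processed.
    public void Send(string message) => _queue.Add(message);

    // Signals that no more messages will arrive, letting the worker finish.
    public void CompleteSending() => _queue.CompleteAdding();

    // Receiver side: a worker drains the queue at its own pace.
    public Task StartWorker() => Task.Run(() =>
    {
        foreach (var message in _queue.GetConsumingEnumerable())
            Processed.Add("processed:" + message);
    });
}
```

The sender’s throughput is now bounded by how fast it can enqueue, not by how fast the back-end can process, which is exactly the throttling effect described above.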
Redundancy
The solution to the single point of failure is applying redundancy at all levels of our solution. Redundancy means having more than one instance of each component in our application. Strictly speaking, redundancy helps attain high availability: the capability of the system to withstand the loss of a component, such as a service or a node that crashes. However, scalable applications usually experience heavy user loads, which make their components more prone to crashing; that’s why redundancy is so closely related to scalability.
Redundancy also introduces a whole range of problems, such as detecting dead nodes, routing requests to an appropriate healthy node (load balancing, for example with Azure Traffic Manager), and dealing with repeated operations. The atomic operations in our application have to be idempotent, which means that even if they are repeated many times, the end result will be the same. Usually, this idempotency is achieved by adding a unique operation ID, such as a GUID, which lets us check whether an operation has already been executed before.
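A minimal C# sketch of GUID-based idempotency (hypothetical names) might look like this: each operation carries a unique ID, and IDs that have already been seen are skipped, so a redelivered or retried message leaves the end result unchanged.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical sketch of idempotency via operation IDs: applying the
// same operation twice has the same effect as applying it once.
public class IdempotentProcessor
{
    private readonly ConcurrentDictionary<Guid, bool> _seen =
        new ConcurrentDictionary<Guid, bool>();

    public int Balance { get; private set; }

    // Returns true only the first time a given operation ID is applied;
    // duplicate deliveries are detected and ignored.
    public bool ApplyDeposit(Guid operationId, int amount)
    {
        if (!_seen.TryAdd(operationId, true))
            return false;   // already executed: skip, end result unchanged
        Balance += amount;
        return true;
    }
}
```

In a real system the set of seen IDs would live in durable storage shared by all redundant nodes, not in process memory.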
Figure 3: The same application as before, with the components of scalability highlighted in green
I hope this article has convinced you that scalability can’t be an afterthought. On the contrary, it has to be designed in from the very beginning if we really want something that’s going to endure high user loads.
Scalability adds complexity to our solution. There’s no way out of it, but now that we know the enemies of scale and the components of scalability, we are armed with the knowledge to tame this complexity.
If you are interested in an in-depth explanation of how scalable applications are built, with a complete makeover of a web application in Azure, you can check out my Pluralsight course, which is included in the MSDN subscription.