The Azure SLA is a critical concept for organizations using Azure to build applications. When you are using multiple Azure services (and what app doesn’t?), what really matters is the Azure Composite SLA.
So what does the Composite SLA involve, how is it calculated, and why does it matter?
Understanding the Azure SLA and the Azure Composite SLA
From a legal perspective, the Azure SLA is simply a contractual agreement that “guarantees” the availability of a specific service. For example, traditional Azure Virtual Machines have a 99.9% monthly SLA, meaning they are guaranteed to be up and running 99.9% of the time, with some exceptions (Azure Service Level Agreement Summary). Other services, such as Azure SQL Database, offer higher SLAs, like 99.99% for Premium tiers, depending on the configuration.
A fun fact: the only Azure service offering a 100% uptime SLA is still Azure DNS.
Understanding the SLA for a single Azure service is easy: each service is listed in the official Microsoft SLA.
The term “Composite SLA” refers to the overall reliability or availability of an application that depends on several Azure services, each with its own SLA. The Composite SLA is calculated by combining the individual SLAs, typically using a “likelihood-based method” where the overall availability is the product of the individual availabilities, assuming the services are in series (i.e., the application fails if any critical service fails).
Example of a Composite SLA
For instance, consider an e-commerce application using an Azure App Service with a 99.9% SLA.
It uses an Azure Function which has a 99.5% SLA and also an Azure Kubernetes Service (Basic tier) with a 99% SLA.
Since the app requires all three services to function (they’re in series), the Composite SLA is the product of their individual SLAs: 99.9% × 99.5% × 99% = 0.999 × 0.995 × 0.99 = 0.98406495
That’s approximately 98.41% uptime.
In a 30-day month (720 hours), 98.41% uptime means about 11.5 hours of potential downtime, a steep increase from the 43 minutes you might expect with just the App Service’s 99.9% SLA.
This calculation highlights how the overall reliability can be considerably lower than the individual SLAs, which might make your architecture unsuitable for the business needs.
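The arithmetic in this example can be reproduced with a few lines of Python. This is only a minimal sketch: the function names `composite_availability` and `monthly_downtime_hours` are made up for illustration, and the calculation assumes every service is required (a series configuration) and that a month has 720 hours.

```python
from math import prod


def composite_availability(availabilities):
    """Multiply the availabilities of services that are all required (in series)."""
    return prod(availabilities)


def monthly_downtime_hours(availability, hours_in_month=720):
    """Hours of potential downtime per month at the given availability."""
    return (1 - availability) * hours_in_month


# SLAs from the example, as fractions: App Service, Function, AKS.
slas = [0.999, 0.995, 0.99]
composite = composite_availability(slas)

print(f"Composite availability: {composite:.4%}")
print(f"Potential downtime: {monthly_downtime_hours(composite):.1f} hours per month")
```

Swapping in the SLAs of your own critical services gives a quick first estimate before you look at redundancy or graceful degradation.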
Importance for application reliability
Managing the Composite SLA is crucial for ensuring application availability, which directly impacts business outcomes. Downtime can lead to lost revenue, decreased user satisfaction, and potential reputational damage. That is why designing for resiliency matters, as outlined in the Azure Well-Architected Framework’s Reliability pillar, which includes strategies like redundancy, failover mechanisms, and error handling to mitigate the risk of service failures.
For example, if a critical service like SQL Database experiences an outage, the application might be unable to serve data, even if other services like App Service remain operational. This underscores the need to understand and manage the Composite SLA to meet business requirements for uptime, especially for mission-critical systems.
Calculating and designing for the Composite SLA
The method for calculating the Composite SLA, as mentioned, is to multiply the availabilities of critical services. However, this approach assumes a series configuration, which may not always be true. In practice, applications can be designed to handle failures gracefully, reducing the impact on overall availability. For instance, implementing retry logic for transient failures or using load balancers to distribute traffic across multiple instances can improve reliability. This is critical when building applications to run in the cloud.
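As an illustration of retry logic for transient failures, here is a minimal sketch using exponential backoff with jitter. `TransientError`, `call_with_retry`, and `flaky` are hypothetical names invented for this example; real Azure SDK clients generally ship their own retry policies, which are usually the better choice.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, throttling)."""


def call_with_retry(operation, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call operation(), retrying on TransientError with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            # The delay doubles on each attempt; jitter avoids synchronized retries.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 2))


# Usage: a simulated dependency that fails twice, then succeeds.
attempts = {"count": 0}


def flaky():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TransientError("simulated transient failure")
    return "ok"


# sleep is replaced with a no-op so the demo finishes instantly.
result = call_with_retry(flaky, sleep=lambda seconds: None)
print(result, "after", attempts["count"], "attempts")
```

Only errors known to be transient should be retried; retrying a permanent failure just delays the error and adds load to a service that is already struggling.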
To enhance the Composite SLA, you can:
- Select services with higher SLAs: Choose higher-tier services for critical components. For example, Azure SQL Database offers 99% for Basic, 99.9% for Standard, and 99.99% for Premium tiers, allowing for better reliability at a higher cost (Azure SQL Database SLA).
- Implement redundancy: Use zone-redundant or geo-redundant configurations for services like Storage or Compute so the application stays available when a single zone or region fails.
- Minimize the blast radius: Design the application so that the failure of a non-critical service does not affect the core functionality, effectively reducing the number of services considered in the composite calculation.
These strategies align with the Reliability pillar of the Well-Architected Framework, emphasizing proactive design to handle failures and maintain availability.
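To get a feel for what redundancy does to the numbers, the sketch below uses the common approximation that N independent copies of a service are unavailable only when all N fail at once, giving 1 − (1 − A)^N. The independence assumption is a strong one, and the sketch ignores the availability of whatever performs the failover (a load balancer, for example), so treat the result as an upper bound. The function name `redundant_availability` is made up for illustration.

```python
from math import prod


def redundant_availability(availability, copies):
    """Availability of `copies` independent replicas where any one suffices."""
    return 1 - (1 - availability) ** copies


# The three services from the earlier example, all required (in series).
single = prod([0.999, 0.995, 0.99])

# Same chain, but the weakest service (99%) is deployed twice.
with_redundancy = prod([0.999, 0.995, redundant_availability(0.99, 2)])

print(f"Without redundancy: {single:.4%}")
print(f"With a second copy of the weakest service: {with_redundancy:.4%}")
```

Adding a copy of the weakest link usually moves the composite figure more than upgrading a service that is already highly available.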
Recent SLA updates and their impact
Azure receives frequent updates that can affect service SLAs and reliability. There have been over 1,200 service enhancements in the past year alone, including new features, region expansions, and SLA improvements.
For example, if a service like Azure Cosmos DB introduces a new SLA tier with 99.999% availability, applications using that service could see an improved composite reliability, especially if it’s a critical component. So staying informed about these updates is important for engineers and architects to optimize their application’s design and ensure alignment with the latest capabilities.
Potential challenges
Managing the Composite SLA can be challenging. One difficulty is prioritizing which services to invest in for higher SLAs, as costs can rise significantly. For instance, upgrading to a Premium tier for SQL Database might improve the SLA from 99.9% to 99.99%, but at a much higher cost.
Additionally, overcomplicating the design with excessive redundancy can increase complexity and overhead, which might offset the reliability gains.
Practical tips for managing the Composite SLA
To effectively manage the Composite SLA, consider the following top tips:
- Identify critical services: Determine which services are essential for your application’s functionality and focus on their SLAs.
- Calculate and monitor: Regularly calculate your application’s composite SLA based on the services it uses and monitor their performance using Azure Monitor and other tools.
- Design for resilience: Build your application to be resilient to service failures through redundancy, error handling, and failover mechanisms.
- Test for failures: Regularly test your application under failure scenarios to ensure it behaves as expected and meets the desired availability.
And of course follow changes to Azure services that might affect their SLAs (or offer new ways to improve reliability).
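A failure test does not have to be elaborate to be useful. The sketch below injects a simulated outage into a page renderer and checks that a non-critical dependency degrades gracefully while a critical one still surfaces the error. All names here (`ServiceUnavailable`, `render_product_page`, `failing_dependency`) are hypothetical and stand in for your own code; in a real environment you would typically inject faults at the infrastructure level as well.

```python
class ServiceUnavailable(Exception):
    """Hypothetical error raised when a dependency cannot be reached."""


def render_product_page(get_product, get_recommendations):
    """Build a page; the product lookup is critical, recommendations are optional."""
    page = {"product": get_product()}  # critical: a failure here propagates
    try:
        page["recommendations"] = get_recommendations()
    except ServiceUnavailable:
        page["recommendations"] = []  # degrade gracefully instead of failing
    return page


def failing_dependency():
    """Injected fault standing in for an outage of a dependency."""
    raise ServiceUnavailable("simulated outage")


# Scenario 1: the optional dependency is down, the page still renders.
page = render_product_page(lambda: {"id": 1}, failing_dependency)
assert page == {"product": {"id": 1}, "recommendations": []}

# Scenario 2: the critical dependency is down, the error surfaces.
try:
    render_product_page(failing_dependency, lambda: [])
except ServiceUnavailable:
    critical_failure_surfaced = True
else:
    critical_failure_surfaced = False
assert critical_failure_surfaced
```

Keeping optional dependencies behind this kind of fallback is also what lets you leave them out of the composite calculation for the core functionality.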
Takeaway
The Azure Composite SLA is an important concept for understanding and managing the overall reliability of applications built on top of multiple Azure services. By calculating the composite availability, designing for resiliency, and staying updated with Azure service improvements, your organization can ensure its applications meet availability requirements and deliver business value.
Be sure to visit the Azure SLA Board https://azurecharts.com/sla to see the uptime of your favorite Azure service, and check the Azure status history to get insights into previous failures.