Continuous reinvention: A brief history of block storage at AWS

Marc Olson has been part of the team shaping Elastic Block Store (EBS) for over a decade. In that time, he's helped to drive the dramatic evolution of EBS from a simple block storage service relying on shared drives to a massive network storage system that delivers over 140 trillion daily operations.

In this post, Marc provides a fascinating insider's perspective on the journey of EBS. He shares hard-won lessons in areas such as queueing theory, the importance of comprehensive instrumentation, and the value of incrementalism versus radical changes. Most importantly, he emphasizes how constraints can often breed creative solutions. It's an insightful look at how one of AWS's foundational services has evolved to meet the needs of our customers (and the pace at which they're innovating).

–W


Continuous reinvention: A brief history of block storage at AWS

I've built system software for most of my career, and before joining AWS it was mostly in the networking and security space. When I joined AWS nearly 13 years ago, I entered a new domain—storage—and stepped into a new challenge. Even back then the scale of AWS dwarfed anything I had worked on, but many of the same techniques I had picked up until that point remained applicable—distilling problems down to first principles, and using successive iteration to incrementally solve problems and improve performance.

If you look around at AWS services today, you'll find a mature set of core building blocks, but it wasn't always this way. EBS launched on August 20, 2008, nearly two years after EC2 became available in beta, with a simple idea to provide network attached block storage for EC2 instances. We had one or two storage experts, a few distributed systems folks, and a solid knowledge of computer systems and networks. How hard could it be? In retrospect, if we had known at the time how much we didn't know, we might not have even started the project!

Since I've been at EBS, I've had the opportunity to be part of the team that has evolved EBS from a product built using shared hard disk drives (HDDs) to one that is capable of delivering hundreds of thousands of IOPS (IO operations per second) to a single EC2 instance. It's remarkable to reflect on this, because EBS is capable of delivering more IOPS to a single instance today than it could deliver to an entire Availability Zone (AZ) in the early years on top of HDDs. Even more amazingly, today EBS in aggregate delivers over 140 trillion operations daily across a distributed SSD fleet. But we definitely didn't do it overnight, or in one big bang, or even perfectly. When I started on the EBS team, I initially worked on the EBS client, which is the piece of software responsible for converting instance IO requests into EBS storage operations. Since then I've worked on almost every component of EBS and have been delighted to have had the opportunity to participate so directly in its evolution and growth.

As a storage system, EBS is a bit unique. It's unique because our primary workload is system disks for EC2 instances, motivated by the hard disks that used to sit inside physical datacenter servers. A lot of storage services place durability as their primary design goal, and are willing to degrade performance or availability in order to protect bytes. EBS customers care about durability, and we provide the primitives to help them achieve high durability with io2 Block Express volumes and volume snapshots, but they also care a lot about the performance and availability of EBS volumes. EBS is so closely tied as a storage primitive for EC2 that the performance and availability of EBS volumes tends to translate almost directly to the performance and availability of the EC2 experience, and by extension the experience of running applications and services that are built using EC2. The story of EBS is the story of understanding and evolving performance in a very large-scale distributed system that spans layers from guest operating systems at the top, all the way down to custom SSD designs at the bottom. In this post I'd like to tell you about the journey that we've taken, including some memorable lessons that may be applicable to your systems. After all, systems performance is a complex and really challenging area, and its language cuts across many domains.

Queueing theory, briefly

Before we dive too deep, let's take a step back and look at how computer systems interact with storage. The high-level basics haven't changed through the years—a storage device is connected to a bus which is connected to the CPU. The CPU queues requests that travel the bus to the device. The storage device either retrieves the data from CPU memory and (eventually) places it onto a durable substrate, or retrieves the data from the durable media and then transfers it to the CPU's memory.

Architecture with direct attached disk
High-level computer architecture with direct attached disk (c. 2008)

You can think of this like a bank. You walk into the bank with a deposit, but first you have to traverse a queue before you can speak with a bank teller who can help you with your transaction. In a perfect world, patrons enter the bank at the exact rate at which their requests can be handled, and you never have to stand in a queue. But the real world isn't perfect. The real world is asynchronous. It's more likely that a few people enter the bank at the same time. Perhaps they arrived on the same streetcar or train. When a group of people all walk into the bank at the same time, some of them are going to have to wait for the teller to process the transactions ahead of them.

As we think about the time to complete each transaction, and to empty the queue, the average time waiting in line (latency) across all customers may look acceptable, but the first person in the queue had the best experience, while the last had a much longer delay. There are a number of things the bank can do to improve the experience for all customers. The bank could add more tellers to process more requests in parallel, it could rearrange the teller workflows so that each transaction takes less time, lowering both the total time and the average time, or it could create different queues for latency insensitive customers or consolidate transactions that may be faster, to keep the queue low. But each of these options comes at an additional cost—hiring more tellers for a peak that may never occur, or adding more real estate to create separate queues. While imperfect, unless you have infinite resources, queues are necessary to absorb peak load.
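
To make the gap between average and tail latency concrete, here is a minimal simulation I wrote for this post (it is not EBS code, and every constant is invented): a single "teller" serving customers who arrive in bursts. The long-run arrival rate always matches the service rate; only the burstiness changes.

```python
import statistics

SERVICE_TIME_MS = 5.0   # time the "teller" needs per transaction
CUSTOMERS = 10_000

def simulate(burst_size: int) -> list:
    """Single-teller FIFO queue where customers arrive in bursts of burst_size.

    Returns each customer's wait + service time in milliseconds.
    """
    latencies = []
    teller_free_at = 0.0
    arrival = 0.0
    for i in range(CUSTOMERS):
        if i % burst_size == 0:
            # the next "streetcar" arrives and the whole burst walks in at once
            arrival += burst_size * SERVICE_TIME_MS
        start = max(teller_free_at, arrival)
        teller_free_at = start + SERVICE_TIME_MS
        latencies.append(teller_free_at - arrival)
    return latencies

for burst in (1, 10, 50):
    lat = simulate(burst)
    print(f"burst={burst:2d}  avg={statistics.mean(lat):6.1f} ms  "
          f"p99={statistics.quantiles(lat, n=100)[98]:6.1f} ms")
```

With the same average arrival rate and the same throughput, larger bursts push the average up and the tail up much faster—the last customer off the streetcar always has the worst day.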

Simple diagram of EC2 and EBS queueing from 2012
Simplified diagram of EC2 and EBS queueing (c. 2012)

In network storage systems, we have multiple queues in the stack, including those between the operating system kernel and the storage adapter, the host storage adapter to the storage fabric, the target storage adapter, and the storage media. In legacy network storage systems, there may be different vendors for each component, and different ways that they think about servicing the queue. You may be using a dedicated, lossless network fabric like Fibre Channel, or using iSCSI or NFS over TCP, either with the operating system network stack or a custom driver. In either case, tuning the storage network often takes specialized knowledge, separate from tuning the application or the storage media.

When we first built EBS in 2008, the storage market was largely HDDs, and the latency of our service was dominated by the latency of this storage media. Last year, Andy Warfield went in-depth about the fascinating mechanical engineering behind HDDs. As an engineer, I still marvel at everything that goes into a hard drive, but at the end of the day they are mechanical devices and physics limits their performance. There's a stack of platters that are spinning at high velocity. These platters have tracks that contain the data. Relative to the size of a track (<100 nanometers), there's a large arm that swings back and forth to find the right track to read or write your data. Because of the physics involved, the IOPS performance of a hard drive has remained relatively constant for the last few decades at approximately 120-150 operations per second, or 6-8 ms average IO latency. One of the biggest challenges with HDDs is that tail latencies can easily drift into the hundreds of milliseconds with the impact of queueing and command reordering in the drive.
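
As a rough sanity check on those numbers (my own back-of-the-envelope math, using typical figures for a 7,200 RPM drive rather than anything EBS-specific), the service time is dominated by waiting for the platter to rotate and the arm to seek:

```python
# Back-of-the-envelope HDD math (illustrative numbers for a 7,200 RPM drive).
rpm = 7_200
rotational_latency_ms = 0.5 * (60_000 / rpm)   # wait half a revolution on average ≈ 4.2 ms
average_seek_ms = 4.0                          # typical average seek for this class of drive
service_time_ms = rotational_latency_ms + average_seek_ms

print(f"average service time ≈ {service_time_ms:.1f} ms")      # ≈ 8.2 ms
print(f"random IOPS ceiling  ≈ {1_000 / service_time_ms:.0f}")  # ≈ 122 IOPS
```

Which lands right in the 6-8 ms and 120-150 IOPS range above—and that's before any queueing in front of the drive.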

We didn't have to worry much about the network getting in the way since end-to-end EBS latency was dominated by HDDs and measured in the 10s of milliseconds. Even our early data center networks were beefy enough to handle our users' latency and throughput expectations. The addition of 10s of microseconds on the network was a small fraction of overall latency.

Compounding this latency, hard drive performance is also variable depending on the other transactions in the queue. Smaller requests that are scattered randomly on the media take longer to find and access than several large requests that are all next to each other. This random performance led to wildly inconsistent behavior. Early on, we knew that we needed to spread customers across many disks to achieve reasonable performance. This had a benefit: it dropped the peak outlier latency for the hottest workloads, but unfortunately it spread the inconsistent behavior out so that it impacted many customers.

When one workload impacts another, we call this a “noisy neighbor.” Noisy neighbors turned out to be a critical problem for the business. As AWS evolved, we learned that we had to focus ruthlessly on a high-quality customer experience, and that inevitably meant that we needed to achieve strong performance isolation to avoid noisy neighbors causing interference with other customer workloads.

At the scale of AWS, we often run into challenges that are hard and complex due to the scale and breadth of our systems, and our focus on maintaining the customer experience. Surprisingly, the fixes are often quite simple once you deeply understand the system, and have enormous impact due to the scaling factors at play. We were able to make some improvements by changing scheduling algorithms to the drives and balancing customer workloads across even more spindles. But all of this only resulted in small incremental gains. We weren’t really hitting the breakthrough that truly eliminated noisy neighbors. Customer workloads were too unpredictable to achieve the consistency we knew they needed. We needed to explore something completely different.

Set long term goals, but don’t be afraid to improve incrementally

Around the time I started at AWS in 2011, solid state disks (SSDs) became more mainstream, and were available in sizes that started to make them attractive to us. In an SSD, there is no physical arm to move to retrieve data—random requests are nearly as fast as sequential requests—and there are multiple channels between the controller and NAND chips to get to the data. If we revisit the bank example from earlier, replacing an HDD with an SSD is like building a bank the size of a football stadium and staffing it with superhumans that can complete transactions orders of magnitude faster. A year later we started using SSDs, and haven’t looked back.

We started with a small, but meaningful milestone: we built a new storage server type built on SSDs, and a new EBS volume type called Provisioned IOPS. Launching a new volume type is no small task, and it also limits the workloads that can take advantage of it. For EBS, there was an immediate improvement, but it wasn’t everything we expected.

We thought that just dropping SSDs in to replace HDDs would solve almost all of our problems, and it certainly did address the problems that came from the mechanics of hard drives. But what surprised us was that the system didn’t improve nearly as much as we had hoped and noisy neighbors weren’t automatically fixed. We had to turn our attention to the rest of our stack—the network and our software—that the improved storage media suddenly put a spotlight on.

Even though we needed to make these changes, we went ahead and launched in August 2012 with a maximum of 1,000 IOPS, 10x better than existing EBS standard volumes, and ~2-3 ms average latency, a 5-10x improvement with significantly improved outlier control. Our customers were excited for an EBS volume that they could begin to build their mission critical applications on, but we still weren’t satisfied and we realized that the performance engineering work in our system was really just beginning. But to do that, we had to measure our system.

If you can’t measure it, you can’t manage it

At this point in EBS’s history (2012), we only had rudimentary telemetry. To know what to fix, we had to know what was broken, and then prioritize those fixes based on effort and rewards. Our first step was to build a method to instrument every IO at multiple points in every subsystem—in our client initiator, network stack, storage durability engine, and in our operating system. In addition to monitoring customer workloads, we also built a set of canary tests that run continuously and allowed us to monitor impact of changes—both positive and negative—under well-known workloads.
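
A minimal sketch of that kind of per-IO instrumentation is below. The stage names and helper functions are mine, invented for illustration—they are not EBS internals—but the idea is the same: stamp each IO at every subsystem boundary, then aggregate per-hop latencies so the worst hop stands out.

```python
import time
from collections import defaultdict

class IoTrace:
    """Record a timestamp each time a single IO crosses a subsystem boundary."""

    def __init__(self, io_id: str):
        self.io_id = io_id
        self.marks = [("submitted", time.monotonic())]

    def mark(self, stage: str) -> None:
        self.marks.append((stage, time.monotonic()))

    def stage_latencies_ms(self) -> dict:
        """Time spent between consecutive boundaries, in milliseconds."""
        out = {}
        for (prev, t0), (cur, t1) in zip(self.marks, self.marks[1:]):
            out[f"{prev}->{cur}"] = (t1 - t0) * 1000.0
        return out

# Aggregate per-hop latencies across many IOs (from customer and canary workloads alike).
per_hop_samples = defaultdict(list)

def record(trace: IoTrace) -> None:
    for hop, ms in trace.stage_latencies_ms().items():
        per_hop_samples[hop].append(ms)

def p99(samples: list) -> float:
    return sorted(samples)[int(len(samples) * 0.99)] if samples else 0.0

# Hypothetical usage at each hop:
#   t = IoTrace("vol-123/seq-42"); t.mark("client_sent"); t.mark("server_received")
#   t.mark("durable"); t.mark("completion_returned"); record(t)
```

The payoff isn't any single measurement—it's being able to compare the same hop across averages, tails, and well-known canary workloads before and after every change.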

With our new telemetry we identified a few major areas for initial investment. We knew we needed to reduce the number of queues in the entire system. Additionally, the Xen hypervisor had served us well in EC2, but as a general-purpose hypervisor, it had different design goals and many more features than we needed for EC2. We suspected that with some investment we could reduce complexity of the IO path in the hypervisor, leading to improved performance. Moreover, we needed to optimize the network software, and in our core durability engine we needed to do a lot of work organizationally and in code, including on-disk data layout, cache line optimization, and fully embracing an asynchronous programming model.

A really consistent lesson at AWS is that system performance issues almost universally span a lot of layers in our hardware and software stack, but even great engineers tend to have jobs that focus their attention on specific narrower areas. While the much celebrated ideal of a “full stack engineer” is valuable, in deep and complex systems it’s often even more valuable to create cohorts of experts who can collaborate and get really creative across the entire stack and all their individual areas of depth.

By this point, we already had separate teams for the storage server and for the client, so we were able to focus on these two areas in parallel. We also enlisted the help of the EC2 hypervisor engineers and formed a cross-AWS network performance cohort. We started to build a blueprint of both short-term, tactical fixes and longer-term architectural changes.

Divide and conquer

Whiteboard showing how the team removed the control plane from the IO path with Physalia
Removing the control plane from the IO path with Physalia

When I was an undergraduate student, while I loved most of my classes, there were a couple that I had a love-hate relationship with. "Algorithms" was taught at a graduate level at my university for both undergraduates and graduates. I found the coursework intense, but I eventually fell in love with the topic, and Introduction to Algorithms, commonly known as CLR, is one of the few textbooks I retained, and still occasionally reference. What I didn't realize until I joined Amazon, and seems obvious in hindsight, is that you can design an organization much the same way you can design a software system. Different algorithms have different benefits and tradeoffs in how your organization functions. Where practical, Amazon chooses a divide and conquer approach, and keeps teams small and focused on a self-contained component with well-defined APIs.

This works well when applied to components of a retail website and control plane systems, but it's less intuitive how you could build a high-performance data plane this way, and at the same time improve performance. In the EBS storage server, we reorganized our monolithic development team into small teams focused on specific areas, such as data replication, durability, and snapshot hydration. Each team focused on their unique challenges, dividing the performance optimization into smaller sized bites. These teams are able to iterate and commit their changes independently—made possible by the rigorous testing that we've built up over time. It was important for us to make continuous progress for our customers, so we started with a blueprint for where we wanted to go, and then began the work of separating out components while deploying incremental changes.

The best part of incremental delivery is that you can make a change and observe its impact before making the next change. If something doesn't work like you expected, then it's easy to unwind it and go in a different direction. In our case, the blueprint that we laid out in 2013 ended up looking nothing like what EBS looks like today, but it gave us a direction to start moving toward. For example, back then we never would have imagined that Amazon would someday build its own SSDs, with a technology stack that could be tailored specifically to the needs of EBS.

Always question your assumptions!

Challenging our assumptions led to improvements in every single part of the stack.

We started with software virtualization. Until late 2017 all EC2 instances ran on the Xen hypervisor. With devices in Xen, there is a ring queue setup that allows guest instances, or domains, to share information with a privileged driver domain (dom0) for the purposes of IO and other emulated devices. The EBS client ran in dom0 as a kernel block device. If we follow an IO request from the instance, just to get off of the EC2 host there are many queues: the instance block device queue, the Xen ring, the dom0 kernel block device queue, and the EBS client network queue. In most systems, performance issues are compounding, and it's helpful to focus on components in isolation.

One of the first things that we did was to write several "loopback" devices so that we could isolate each queue to gauge the impact of the Xen ring, the dom0 block device stack, and the network. We were almost immediately surprised that with almost no latency in the dom0 device driver, when multiple instances tried to drive IO, they would interact with each other enough that the goodput of the entire system would slow down. We had found another noisy neighbor! Embarrassingly, we had launched EC2 with the Xen defaults for the number of block device queues and queue entries, which had been set years prior based on the limited storage hardware that was available to the Cambridge lab building Xen. This was very unexpected, especially when we realized that it limited us to only 64 outstanding IO requests for an entire host, not per device—certainly not enough for our most demanding workloads.
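
A quick Little's Law estimate shows why a host-wide cap like that is so punishing once the media gets fast. The numbers below are my own, purely illustrative choices (the latency and tenancy are hypothetical), but the shape of the math is the point:

```python
# Little's Law: sustained throughput ≈ outstanding requests / average latency.
# A host-wide cap means every instance on the box shares the same slots.

host_outstanding_cap = 64     # the inherited default, per host rather than per device
io_latency_ms = 1.0           # hypothetical per-IO latency once the media is no longer the bottleneck
instances_on_host = 10        # hypothetical tenancy

host_iops_ceiling = host_outstanding_cap / (io_latency_ms / 1_000)
fair_share = host_iops_ceiling / instances_on_host

print(f"host ceiling: {host_iops_ceiling:,.0f} IOPS")   # 64,000 IOPS, however fast the devices are
print(f"fair share:   {fair_share:,.0f} IOPS per instance")
# And the sharing isn't fair: one IO-heavy neighbor can hold most of the slots.
```

The ceiling is set entirely by the queue configuration, not the hardware—and because the slots are shared, a single busy instance can starve everyone else on the host.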

We fixed the main issues with software virtualization, but even that wasn't enough. In 2013, we were well into the development of our first Nitro offload card dedicated to networking. With this first card, we moved the processing of VPC, our software defined network, from the Xen dom0 kernel, into a dedicated hardware pipeline. By isolating the packet processing data plane from the hypervisor, we no longer needed to steal CPU cycles from customer instances to drive network traffic. Instead, we leveraged Xen's ability to pass a virtual PCI device directly to the instance.

This was a fantastic win for latency and efficiency, so we decided to do the same thing for EBS storage. By moving more processing to hardware, we removed several operating system queues in the hypervisor, even if we weren’t ready to pass the device directly to the instance just yet. Even without passthrough, by offloading more of the interrupt driven work, the hypervisor spent less time servicing the requests—the hardware itself had dedicated interrupt processing functions. This second Nitro card also had hardware capability to handle EBS encrypted volumes with no impact to EBS volume performance. Leveraging our hardware for encryption also meant that the encryption key material is kept separate from the hypervisor, which further protects customer data.

Diagram showing experiments in network tuning to improve throughput and reduce latency
Experimenting with network tuning to improve throughput and reduce latency

Moving EBS to Nitro was a huge win, but it almost immediately shifted the overhead to the network itself. Here the problem seemed simple on the surface. We just needed to tune our wire protocol with the latest and greatest data center TCP tuning parameters, while choosing the best congestion control algorithm. There were a few shifts that were working against us: AWS was experimenting with different data center cabling topology, and our AZs, once a single data center, were growing beyond those boundaries. Our tuning would be beneficial, as in the example above, where adding a small amount of random latency to requests to storage servers counter-intuitively reduced the average latency and the outliers due to the smoothing effect it has on the network. These changes were ultimately short lived as we continuously increased the performance and scale of our system, and we had to continually measure and monitor to make sure we didn’t regress.
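
To see why a touch of random delay can help, here's a toy incast-style model I put together (not EBS or AWS network code—every constant is invented): when many packets converge on a shallow-buffered port at once, drops and retransmission timeouts dominate; a small sender-side jitter spreads the burst out and avoids the drops entirely.

```python
import random
import statistics

random.seed(1)

FANOUT = 128              # packets converging on one congested switch port
SERVICE_US = 10           # time for the port to drain one packet
BUFFER_PKTS = 32          # shallow switch buffer
RETRANSMIT_US = 10_000    # a drop costs a retransmission timeout

def simulate(jitter_us: float) -> list:
    """Completion time of each packet when senders add up to jitter_us of random delay."""
    arrivals = sorted(random.uniform(0, jitter_us) for _ in range(FANOUT))
    port_free_at = 0.0
    completions = []
    for t in arrivals:
        queued = max(0.0, port_free_at - t) / SERVICE_US
        if queued >= BUFFER_PKTS:
            t += RETRANSMIT_US            # buffer overflow: dropped, retried much later
        start = max(t, port_free_at)
        port_free_at = start + SERVICE_US
        completions.append(port_free_at)
    return completions

for jitter in (0, 500, 2_000):
    c = simulate(jitter)
    print(f"jitter={jitter:5d}us  avg={statistics.mean(c):8.0f}us  max={max(c):8.0f}us")
```

In this toy, the fully synchronized burst loses most of its packets to the retransmission timeout, while a couple of milliseconds of deliberate jitter leaves every packet cheaper on average and at the tail—the same counter-intuitive effect described above.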

Knowing that we would need something better than TCP, in 2014 we started laying the foundation for Scalable Reliable Datagram (SRD) with "A Cloud-Optimized Transport Protocol for Elastic and Scalable HPC". Early on we set a few requirements, including a protocol that could improve our ability to recover and route around failures, and we wanted something that could be easily offloaded into hardware. As we were investigating, we made two key observations: 1/ we didn't have to design for the general internet, but we could focus specifically on our data center network designs, and 2/ in storage, the execution of IO requests that are in flight could be reordered. We didn't have to pay the penalty of TCP's strict in-order delivery guarantees, but could instead send different requests down different network paths and execute them upon arrival. Any barriers could be handled at the client before they were sent on the network. What we ended up with is a protocol that's useful not just for storage, but for networking, too. When used in Elastic Network Adapter (ENA) Express, SRD improves the performance of your TCP stacks in your guest. SRD can drive the network at higher utilization by taking advantage of multiple network paths and reducing the overflow and queues in the intermediate network devices.
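
The core idea is easy to sketch. The following is my own illustration, not the SRD implementation (the path names and delays are made up): spray independent IOs across several paths, complete each one as it arrives, and only enforce ordering at the client where a barrier actually requires it.

```python
import asyncio
import itertools
import random

# Hypothetical paths through the fabric, each with its own variable latency.
PATH_BASE_DELAY = {"path-a": 0.001, "path-b": 0.002, "path-c": 0.001, "path-d": 0.003}

async def send_on(path: str, io_id: int) -> int:
    # Stand-in for putting a request on the wire on a particular path.
    await asyncio.sleep(PATH_BASE_DELAY[path] + random.uniform(0, 0.002))
    return io_id

async def submit_batch(io_ids: list) -> None:
    """Spray independent IOs across all paths and complete them in arrival order."""
    paths = itertools.cycle(PATH_BASE_DELAY)
    tasks = [asyncio.create_task(send_on(next(paths), i)) for i in io_ids]
    for done in asyncio.as_completed(tasks):   # completion order != submission order
        io_id = await done
        print(f"completed IO {io_id}")

async def main() -> None:
    await submit_batch(list(range(8)))   # these IOs are independent: no ordering needed
    # A barrier (e.g. a flush) is enforced here, at the client, before the next batch goes out.
    await submit_batch(list(range(8, 12)))

asyncio.run(main())
```

Because ordering is the client's problem rather than the transport's, a slow or lossy path only delays the requests that happened to use it, instead of stalling everything behind a single in-order stream.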

Performance improvements are never about a single focus. It's a discipline of continuously challenging your assumptions, measuring and understanding, and shifting focus to the most meaningful opportunities.

Constraints breed innovation

We weren't satisfied that only a relatively small number of volumes and customers had better performance. We wanted to bring the benefits of SSDs to everyone. This is an area where scale makes things difficult. We had a large fleet of thousands of storage servers running millions of non-provisioned IOPS customer volumes. Some of those same volumes still exist today. It would have been an expensive proposition to throw away all of that hardware and replace it.

There was empty space in the chassis, but the only location that didn't cause disruption in the cooling airflow was between the motherboard and the fans. The nice thing about SSDs is that they are typically small and light, but we couldn't have them flopping around loose in the chassis. After some trial and error—and help from our material scientists—we found heat resistant, industrial strength hook and loop fastening tape, which also let us service these SSDs for the remaining life of the servers.

An SSD in one of our servers
Yes, we manually put an SSD into every server!

Armed with this knowledge, and a lot of human effort, over the course of a few months in 2013, EBS was able to put a single SSD into each and every one of those thousands of servers. We made a small change to our software that staged new writes onto that SSD, allowing us to return completion back to your application, and then flushed the writes to the slower hard disk asynchronously. And we did this with no disruption to customers—we were converting a propeller aircraft to a jet while it was in flight. The thing that made this possible is that we designed our system from the start with non-disruptive maintenance events in mind. We could retarget EBS volumes to new storage servers, and update software or rebuild the empty servers as needed.
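
In spirit, the staging change looks something like this sketch (a heavy simplification with invented names, not EBS code): acknowledge the write once it's on the fast SSD stage, and drain it to the slower disk in the background.

```python
import queue
import threading

class StagedWriter:
    """Acknowledge writes once they hit the fast SSD stage; drain to the HDD asynchronously."""

    def __init__(self, ssd, hdd):
        self.ssd = ssd                    # fast staging device (assumed to expose write(offset, data))
        self.hdd = hdd                    # slower backing device with the same interface
        self.backlog = queue.Queue()
        threading.Thread(target=self._drain, daemon=True).start()

    def write(self, offset: int, data: bytes) -> None:
        self.ssd.write(offset, data)      # fast path: data lands on the SSD stage
        self.backlog.put((offset, data))  # remember to move it to the HDD later
        # returning here is what lets completion go back to the application quickly

    def _drain(self) -> None:
        while True:
            offset, data = self.backlog.get()
            self.hdd.write(offset, data)  # slow path runs off the latency-critical path
            self.backlog.task_done()

    def flush(self) -> None:
        self.backlog.join()               # wait until everything staged has reached the HDD
```

A real system also has to serve reads from the staging area, bound how much can be staged, and survive crashes without losing acknowledged writes—which is where most of the actual engineering effort goes.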

This ability to migrate customer volumes to new storage servers has come in handy several times throughout EBS's history as we've identified new, more efficient data structures for our on-disk format, or brought in new hardware to replace the old hardware. There are volumes still active from the first few months of EBS's launch in 2008. These volumes have likely been on hundreds of different servers and multiple generations of hardware as we've updated and rebuilt our fleet, all without impacting the workloads on those volumes.

Reflecting on scaling performance

There's one other journey over this time that I'd like to share, and that's a personal one. Most of my career prior to Amazon had been in either early startup or similarly small company cultures. I had built managed services, and even distributed systems out of necessity, but I had never worked on anything close to the scale of EBS, even the EBS of 2011, both in technology and organization size. I was used to solving problems on my own, or maybe with one or two other similarly motivated engineers.

I really enjoy going super deep into problems and attacking them until they're complete, but there was a pivotal moment when a colleague that I trusted pointed out that I was becoming a performance bottleneck for our organization. As an engineer who had grown to be an expert in the system, but also one who cared really, really deeply about all aspects of EBS, I found myself on every escalation and also wanting to review every commit and every proposed design change. If we were going to be successful, then I had to learn how to scale myself—I wasn't going to solve this with just ownership and bias for action.

This led to a lot more experimentation, but not in the code. I knew I was working with other smart folks, but I also needed to take a step back and think about how to make them effective. One of my favorite tools to come out of this was peer debugging. I remember a session with a handful of engineers in one of our lounges, with code and a few terminals projected on a wall. One of the engineers exclaimed, "Uhhhh, there's no way that's right!" and we had found something that had been nagging us for a while. We had overlooked where and how we were locking updates to critical data structures. Our design didn't usually cause issues, but occasionally we would see slow responses to requests, and fixing this removed one source of jitter. We don't always use this technique, but the neat thing is that we are able to combine our shared systems knowledge when things get really tricky.

Through all of this, I realized that empowering people, giving them the ability to safely experiment, can often lead to results that are even better than what was expected. I've spent a large portion of my career since then focusing on ways to remove roadblocks, but leave the guardrails in place, pushing engineers out of their comfort zone. There's a bit of psychology to engineering leadership that I hadn't appreciated. I never expected that one of the most rewarding parts of my career would be encouraging and nurturing others, watching them own and solve problems, and most importantly celebrating the wins with them!

Conclusion

Reflecting back on where we started, we knew we could do better, but we weren't sure how much better. We chose to approach the problem not as one big monolithic change, but as a series of incremental improvements over time. This allowed us to deliver customer value sooner, and course correct as we learned more about changing customer workloads. We've improved the shape of the EBS latency experience from one averaging more than 10 ms per IO operation to consistent sub-millisecond IO operations with our highest performing io2 Block Express volumes. We accomplished all this without taking the service offline to deliver a new architecture.

We know we're not done. Our customers will always want more, and that challenge is what keeps us motivated to innovate and iterate.