Okay, picture this: you're moving house, right? Massive undertaking. Boxes everywhere, chaos reigns supreme. Now, imagine Uber decided to move its entire house – like, all its data, all its systems – not just to a new location, but to the cloud. Big move, right? But wait, there’s more! They weren’t just moving to the cloud; they decided to completely renovate the foundation while they were at it! That renovation? Swapping out the usual tech guts – think of them as the pipes and wires of their data centers – for something kinda new and definitely cooler: Arm-based computers.

Sounds a bit wild, doesn't it? Well, that’s exactly what Uber did back in February 2023. They kicked off this epic journey to move everything to the cloud with Oracle Cloud and Google Cloud – and at the same time, they decided to bring in Arm chips, which are normally found in your phone, not massive data centers. Why on earth would they do that? Two big reasons: first, to save a ton of cash and get better bang for their buck – price-performance, as the tech folks say. And second, to have more options when it comes to hardware. You know, in case the usual stuff becomes hard to get, like when everyone was scrambling for, well, everything not too long ago.

So, Uber wasn't just doing a simple cloud migration; they were diving headfirst into a multi-architecture world. Think of it as deciding to build your new house with both brick and super eco-friendly bamboo, just to see which works best and be ready for anything. This whole thing was a massive tech puzzle, and it involved pretty much every team at Uber working together.

Now, this blog post – and hey, this is the first part, cliffhanger alert! – it’s gonna walk us through the nitty-gritty of that journey, focusing on the tech headaches they bumped into when they decided to mix and match computer architectures.

First things first, why Arm in the cloud in the first place? Let's zoom out and look at why cloud companies like Oracle are even using these Ampere processors, which are Arm-based. Basically, it boils down to energy. These massive data centers? They guzzle power like crazy. Arm chips are famous for being super energy-efficient – think about how long your phone battery lasts. Ampere took that low-power magic and brought it to data centers. For Oracle, this means huge energy savings. Less juice used, less money spent on electricity. Makes sense, right? Plus, and this is kinda cool, these Arm chips are so efficient you can pack more computing power into the same physical space. Think of it like getting a super-efficient engine for your car – you can go faster and further on the same tank of gas, and maybe even fit a bigger engine in without making the car bigger!

Okay, so Oracle’s reasons are clear. But what about Uber? Why did they jump on the Arm bandwagon? Well, for Uber, it’s all about not putting all their eggs in one basket – hardware diversity, they call it. Plus, Uber is serious about going green, becoming a zero-emissions company. Using energy-sipping Arm chips is a big step towards that. And guess what? All those energy and space savings that Oracle gets? Uber gets to enjoy those benefits too, in the form of better prices and lower costs. Sustainability and saving money? Win-win!

Alright, so how did Uber actually do this? It wasn't like just swapping out a lightbulb. It was a whole process, broken down into seven phases, like leveling up in a game. Let’s run through them real quick:

Phase one: Host Readiness. Gotta make sure the basic software on the computers themselves – the operating system and all that – can even work with Arm. Think of it as checking if your new house foundation can even support the walls.

Phase two: Build Readiness. Uber needs to build all their software, right? They had to update their “build pipeline” – that’s like the factory where they make their software – to make software that works on both Arm and the old x86 stuff. Imagine your factory now needs to make both brick and bamboo building materials.

Phase three: Platform Readiness. Their deployment systems – the guys who actually put the software onto the computers – needed to be smarter. They had to learn how to place software on the right kind of computer – Arm or x86 – and have safety nets in place. Like a construction crew that knows exactly where to use brick and where to use bamboo, and has backup plans if things go wrong.

Phase four: SKU Qualification. SKU is just tech-speak for “specific type of hardware.” Uber had to test out these new Arm computers to make sure they were reliable and fast enough. Basically, kicking the tires on the new hardware to see if it’s any good.

Phase five: Workload Readiness. This is where they had to go through all their actual services – you know, the things Uber does, like routing rides and processing payments – and make sure they could run on Arm. Imagine checking if all your furniture actually fits in the new house and works with the bamboo floors.

Phase six: Adoption Readiness. Setting up all the tests and monitoring to make sure everything runs smoothly on Arm. Like hiring inspectors to check if the new house is up to code and everything’s working as it should.

And finally, phase seven: Adoption. The big move! Actually shifting services, one by one, over to the Arm-based computers. Moving in, room by room, into the new house.

They did try to do some of these phases at the same time to speed things up, because, you know, nobody wants a house move to drag on forever. But let’s dig into the really fun part: the challenges of getting their infrastructure ready for Arm.

So, it all started with a seemingly simple goal: build one service for Arm, and get it running on these new Arm computers. Easy peasy, right? Wrong! Turns out, their entire system, from the computers themselves to the software that builds their software, was all built for one type of computer: x86. It was like realizing your whole house is built for only one type of furniture, and now you want to use a completely different kind.

First up: Host Readiness. Before they could even think about running services, they had to make sure the computers themselves were ready for Arm. This meant starting from scratch and building a whole new “host image.” Think of it as creating a brand-new operating system image – the kernel and all the essential software that makes Uber tick – built specifically for Arm. Every single piece of that software had to be rebuilt, tested, and double-checked to make sure it played nice with Arm hardware. Once they had this Arm-ready foundation, they could start bringing in Arm computers and get ready for the next step: building actual Uber services for Arm.

Now, Building Services for Arm – this sounded straightforward at first too. Just build the software for Arm, right? Nope, again! As they started digging, they realized their software-building system was also deeply, deeply tied to x86. For years, Uber used a system called “Makisu” to build their container images. Container images? Think of them as pre-packaged software bundles, ready to run on any machine – any machine with the matching architecture, that is, and that catch is the whole problem here. Makisu was super-fast and efficient, but it had a big limitation: it could only build images for the architecture it was running on, which at Uber meant x86. It couldn’t “cross-compile” for Arm. Imagine your factory can only make brick, but now you need bamboo. You can't just flip a switch and suddenly make bamboo!

And here’s the kicker: Uber has over 5,000 services! And all their build processes were tangled up with Makisu. Plus, many of these services had custom steps in their build process that were also tied to how Makisu worked. So, ditching Makisu was a huge deal.

Instead of completely ripping out Makisu – which would be like demolishing half your factory – they decided to get clever and Evolve the Build Pipeline. The plan: bring in a new container image builder that could build for Arm, and use it to create an Arm-compatible version of Makisu itself. Once they had an Arm-native Makisu, it could build Arm versions of everything else. Think of it as building a machine that can make bamboo, and then using that machine to make all the bamboo you need for your house.

They chose “Bazel” as this new builder. Why Bazel? Because Bazel can cross-compile: it can build software for one architecture while running on a completely different one. It’s like having a factory that can make both brick and bamboo, no matter whether the factory itself is made of brick or bamboo. And, as a lucky bonus, Uber already used Bazel for some of their other software, so they already had some know-how in-house.
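
Just to make that cross-compiling idea concrete – and this is my own toy example, not Uber's actual Bazel setup – here's what it looks like in Go, which Uber uses heavily. The same source file can be built into an x86 binary or an Arm binary from one build machine; the target architecture is just a knob you turn at build time (the file name and commands are made up for illustration):

```go
// toy_arch.go - a minimal sketch of the cross-compilation idea (not Uber's
// actual Bazel setup). The same source can be built for x86 or Arm from a
// single build machine; the target architecture is a build-time choice.
//
// Build natively:          go build -o hello toy_arch.go
// Cross-compile for Arm:   GOOS=linux GOARCH=arm64 go build -o hello-arm64 toy_arch.go
// Cross-compile for x86:   GOOS=linux GOARCH=amd64 go build -o hello-amd64 toy_arch.go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	// runtime.GOARCH is baked in at compile time, so each binary reports
	// the architecture it was built for, e.g. "arm64" or "amd64".
	fmt.Printf("built for %s/%s\n", runtime.GOOS, runtime.GOARCH)
}
```

Bazel does essentially the same trick at much bigger scale, keeping one build setup that can spit out artifacts for either architecture.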

Okay, so they had Bazel to build Arm-Makisu. Problem solved? Nope! Time for some Breaking the Circular Dependency drama! See, Makisu runs on Buildkite, Uber’s main system for automating software builds. And Buildkite itself runs on their “Stateful Platform” called Odin. And Odin? Odin relied on a bunch of essential “host agents” – little software helpers that do things like logging, tracking metrics, networking – basically, the behind-the-scenes crew that keeps everything running smoothly on every computer at Uber. And guess what? All these critical host agents were built using… you guessed it… Makisu!

It was a classic chicken-and-egg problem. Or maybe a better analogy is a chain reaction. Before they could fully bootstrap Makisu for Arm, they had to untangle and rebuild every single piece of this puzzle. Each component – host agents, Odin, Buildkite, and finally Makisu itself – had to be switched from being built by Makisu to being built by Bazel. It was a cascade of dependencies! Like having to rebuild the foundation, then the walls, then the roof of your factory, all while still trying to produce building materials!

But they tackled it step-by-step. First, they migrated the host agents to Bazel, then the Odin components, then Buildkite, and finally, Makisu itself. Slowly, systematically, using Bazel’s multi-architecture magic, they transformed their infrastructure, piece by piece.
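
If it helps to see the cascade written down, here's a tiny, purely illustrative Go sketch – the component names come from the post, but the graph and the ordering logic are mine, not Uber's tooling. The point is simply that each piece can only move to Bazel after everything underneath it already has:

```go
// bootstrap_order.go - a toy sketch of the Makisu-for-Arm bootstrapping
// problem (component names from the post; the graph and ordering are
// illustrative, not Uber's actual tooling).
package main

import "fmt"

// dependsOn maps each component to the things it needs in place first:
// Makisu runs on Buildkite, Buildkite runs on Odin, Odin relies on the
// host agents.
var dependsOn = map[string][]string{
	"host agents": {},
	"Odin":        {"host agents"},
	"Buildkite":   {"Odin"},
	"Makisu":      {"Buildkite"},
}

// migrationOrder returns the components so that every dependency comes
// before the things that rely on it (a simple depth-first topological sort).
func migrationOrder() []string {
	var order []string
	visited := map[string]bool{}
	var visit func(string)
	visit = func(c string) {
		if visited[c] {
			return
		}
		visited[c] = true
		for _, dep := range dependsOn[c] {
			visit(dep)
		}
		order = append(order, c)
	}
	for c := range dependsOn {
		visit(c)
	}
	return order
}

func main() {
	fmt.Println("migrate to Bazel in this order:", migrationOrder())
}
```

Run it and the answer falls out: host agents, then Odin, then Buildkite, then Makisu – exactly the bottom-up order they followed.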

Once Makisu and the whole Buildkite stack were running on Arm, they moved onto the next big step: Distributing the Build Process. Instead of just building software on one type of machine, they set up a “distributed build pipeline.” This new pipeline could spread out the build process across both Arm and x86 computers. Makisu would run natively on each type of computer, building software specifically for that architecture. Then, in a final step, the pipeline would combine the x86 and Arm software versions into a single, unified “multi-architecture container image.” Think of it as having two factories, one for brick and one for bamboo, both working at the same time, and then combining their outputs into a single, ready-to-use building material package.
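
Uber's pipeline is internal, but to give a flavor of what that final "combine" step can look like with everyday tooling, here's a rough Go sketch that stitches two per-architecture images into one multi-architecture image using the standard docker manifest commands. The image names are invented, and it assumes the per-architecture images have already been built and pushed to a registry:

```go
// merge_manifest.go - a rough sketch of the final "combine" step, using the
// standard `docker manifest` CLI rather than Uber's internal tooling. The
// image names are made up; it assumes the per-architecture images have
// already been built and pushed.
package main

import (
	"log"
	"os/exec"
)

// run executes a docker CLI command and fails loudly if it doesn't work.
func run(args ...string) {
	cmd := exec.Command("docker", args...)
	if out, err := cmd.CombinedOutput(); err != nil {
		log.Fatalf("docker %v failed: %v\n%s", args, err, out)
	}
}

func main() {
	const (
		multiArch = "registry.example.com/my-service:1.2.3"       // hypothetical multi-arch tag
		amd64Img  = "registry.example.com/my-service:1.2.3-amd64" // built natively on x86
		arm64Img  = "registry.example.com/my-service:1.2.3-arm64" // built natively on Arm
	)

	// Create a manifest list that points at both per-architecture images...
	run("manifest", "create", multiArch, amd64Img, arm64Img)
	// ...and push it. A host pulling the multi-arch tag automatically gets
	// the image that matches its own architecture.
	run("manifest", "push", multiArch)

	log.Println("pushed multi-architecture image", multiArch)
}
```

The nice part is that everyone downstream only ever sees the one tag; the registry hands each host the right variant.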

This distributed approach was a smart move. It meant they didn’t have to rewrite every single software build process from Makisu to Bazel – which, remember, would have been a massive undertaking. Plus, building software natively for both Arm and x86 meant they could support software that couldn’t be cross-compiled. And, native builds were faster, cutting down build times and keeping things snappy.

But, like everything, there were trade-offs. The big one was cost. Building software for both architectures basically doubled their build costs. With over 400,000 software builds per week (back when they wrote this!), that extra cost added up fast. But, even with the doubled build costs, switching to Arm was still worth it in the long run, because of the energy savings and better performance. Plus, multi-architecture builds made the whole transition smoother. They could use the same software package on both Arm and x86 computers, which made deployment way easier.

Finally, after all that groundwork, they were ready to Deploy the First Services. Uber is all about taking things slow and steady, especially when it comes to their live systems. So, for Arm, they extended their deployment systems to handle “architecture-specific placement.” This gave them super-fine control over where a service ran – on Arm or x86. They could carefully move services to Arm, one at a time, like cautiously moving furniture into the new bamboo rooms. They even built in a safety net! If something went wrong with a service on Arm, the system would automatically switch it back to x86. Smart, right?
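
Their actual deployment system isn't public, so here's a hypothetical little Go sketch of the decision it has to make for every service: prefer Arm once a service is qualified, and snap straight back to x86 the moment the Arm side looks unhealthy. The service names and fields are made up:

```go
// placement.go - a hypothetical sketch of architecture-specific placement
// with an automatic fallback to x86. This is not Uber's deployment system,
// just an illustration of the decision it makes for every service.
package main

import "fmt"

type Service struct {
	Name         string
	ArmQualified bool // has the service passed its Arm readiness checks?
	ArmHealthy   bool // is the Arm rollout currently healthy?
}

// placeOn decides which hardware pool a service should run on. Arm is
// preferred once a service is qualified, but any sign of trouble sends it
// back to the known-good x86 pool.
func placeOn(s Service) string {
	if s.ArmQualified && s.ArmHealthy {
		return "arm64"
	}
	return "amd64" // the safety net: fall back to x86
}

func main() {
	services := []Service{
		{Name: "rider-api", ArmQualified: true, ArmHealthy: true},
		{Name: "payments", ArmQualified: true, ArmHealthy: false}, // regression on Arm
		{Name: "legacy-batch", ArmQualified: false, ArmHealthy: false},
	}
	for _, s := range services {
		fmt.Printf("%-12s -> %s\n", s.Name, placeOn(s))
	}
}
```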

And then… success! The first services were built, deployed, and running on Arm-based computers! It was a moment of celebration, a proof-of-concept that Arm could actually work alongside x86 in Uber’s massive infrastructure.

But, as they say, the journey of a thousand miles begins with a single step. This was just the beginning. Adapting 5,000 services to run on this multi-architecture platform? That’s a whole different ball game.

And that, my friends, is where part one ends! In the next part of this blog series, they promise to dive deeper into the actual adoption process and the strategies they used to migrate all those services. So, stay tuned for part two to see how they tackled the even bigger challenge!

And a quick shout-out to all the folks who made this happen – the Uber teams, Oracle Cloud, Google Cloud, Ampere, and Arm. It was a team effort, for sure!

So, that’s the Arm story so far at Uber. Pretty cool, huh? Who knew moving to the cloud could involve so much… architecture? Anyway, hope you enjoyed this little tech deep dive! Catch you in the next one!