September 17, 2015

Thoughts on Azure

This is a blog post that's been a long time in the making. There's a lot here to discuss, and I may not be able to get to everything, but this is a review of Azure that needs to be written. I hope Microsoft sees it and learns from it, and I hope it helps some of the engineers out there evaluating Azure against other cloud services.

Disclaimer: these views are in no way representative of the views of my employer; this is simply one person's take from the perspective of an engineer.

First things first: I do want to say that Azure is actually a pretty solid platform. It has been very reliable since day one, and I have no real complaints about IaaS reliability. You will have no trouble setting up a highly available service: not only are there multiple regions and availability zones, there are also "availability sets" that let you partition failures within an AZ (if the failures happen to be planned maintenance events, that is).
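For instance, in the v1 (service management) PowerShell module, spreading VMs across an availability set is just a matter of passing -AvailabilitySetName when building each VM config. A minimal sketch - the cloud service name, image name, and credentials here are all placeholders:

    # Two VMs in the same availability set ("web-as"), so planned
    # maintenance events won't take both down at once.
    $imageName = "your-image-name"

    foreach ($name in @("web1", "web2")) {
        New-AzureVMConfig -Name $name -InstanceSize Small -ImageName $imageName `
                          -AvailabilitySetName "web-as" |
            Add-AzureProvisioningConfig -Linux -LinuxUser "azureuser" -Password "placeholder" |
            New-AzureVM -ServiceName "my-cloud-service"
    }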

However, the other aspects of Azure that make up the rest of the customer experience - the APIs, the documentation, the portal(s), etc. - have been less than stellar. In fact, a lot of recent interactions and experiences have led me to believe that they can be quite negligent in many respects. My services will be up and running, sure, but there is a good chance the rug will be pulled out from under them at any time, forcing me to crawl through sparse documentation and do massive migrations, simply because Microsoft wants to change things - and not necessarily for the better.

APIs and Powershell cmdlets

The APIs are documented reasonably well, with the exception of Azure Resource Manager, which is what Microsoft wants everyone to start adopting as part of their v2 resource migration.

It took a long time before I was able to find out about their schemas (tip: https://resources.azure.com/ will help you figure out what parameters can be tweaked on any resource you've already created), and even then it was confusing, because of the split between 'classic' resources and 'v2' resources.

Not only that, but there's also currently an issue with their Powershell cmdlets: if you want to interact with v2 resources, you need to use Switch-AzureMode to switch over to Azure Resource Manager mode. This causes a few issues:

  1. Because a lot of resources are common between v2 and classic, you end up with identical cmdlet names in the AzureServiceManagement and AzureResourceManager modes, and those cmdlets don't even share the same parameters
  2. As a result, you now have a very stateful command-line suite that may end up switching around your global environment variables, affecting not only your current session, but all sessions on that system (https://github.com/Azure/azure-powershell/wiki/Deprecation-of-Switch-AzureMode-in-Azure-PowerShell)
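To make the statefulness concrete, here's roughly how the collision looks in the pre-1.0 (0.9.x) Azure PowerShell releases - the same cmdlet name resolves to two entirely different commands depending on the current mode:

    # Stateful mode switch: New-AzureVM means different things
    # (with different parameters) before and after.
    Switch-AzureMode -Name AzureResourceManager
    Get-Command New-AzureVM | Select-Object ModuleName   # AzureResourceManager

    Switch-AzureMode -Name AzureServiceManagement
    Get-Command New-AzureVM | Select-Object ModuleName   # Azure (classic)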

This is also the case with xplat-cli, though I don't think switching over to ARM mode will modify every session on your machine.

There isn't a Powershell cmdlet for every service interaction, either. You may have to peek into their .NET libraries from time to time in order to do certain things - adding Event Hubs to a Service Bus namespace, for example.
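For the record, "peeking into the .NET libraries" looks something like the sketch below, using the NamespaceManager class from the Service Bus SDK (Microsoft.ServiceBus.dll, available via NuGet). The assembly path and connection string are placeholders:

    # No cmdlet for this: create an Event Hub through the .NET SDK.
    Add-Type -Path ".\Microsoft.ServiceBus.dll"

    $connectionString = "Endpoint=sb://my-namespace.servicebus.windows.net/;..."
    $ns = [Microsoft.ServiceBus.NamespaceManager]::CreateFromConnectionString($connectionString)
    $ns.CreateEventHubIfNotExists("my-event-hub") | Out-Null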

This has been a gigantic thorn in our sides. We've probably spent at least three or four weeks unpacking undocumented gotchas, researching as much as we can from various GitHub examples and blog posts, and banging our heads against the wall because there's simply no documentation for many of the options in ARM.

Not only that, but they're already starting to deprecate the v1 resources and APIs. I don't have a timeline on this, but everyone's been saying it's either this year or early next year. v1 resources, by the way, are not compatible with v2 resources: you cannot put a v1 VM on a v2 virtual network, or vice versa. So you will likely have to invest a lot of resources in setting up a "DR site" in v2, then flip everything over in one go.

This is simply unacceptable. The main competitor, Amazon, has thorough documentation, a thorough understanding of how to handle APIs and resources, and does not pull the rug out from under people when it comes to production deployments. They've been trying to get people to migrate over to VPC for the past three or four years, and... well, they still have EC2-Classic. I believe Amazon has been much more careful about what API options they expose and how they version their APIs when adding new features.

The last couple of times we've interacted with Microsoft about some issues we've had that happen to involve v1 APIs and resources, we've basically been told to go pound sand and to migrate our resources over to v2. This isn't how you interact with your customers.

Another thing about their APIs: take a look at how you do CRUD operations on virtual networks, and tell me if I'm crazy for thinking they decided to induce a race condition for no reason whatsoever, when the rest of the APIs for interacting with IaaS resources are actually sane. The entire virtual network configuration for a subscription is a single document that you fetch, edit, and replace wholesale, so two concurrent updates will silently clobber each other.
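In v1 PowerShell terms, the dance looks like this - there's no per-VNet create or update, just a whole-subscription export and import:

    # Read-modify-write of the ENTIRE subscription's network config.
    # Anyone else doing the same between your Get and Set loses their changes.
    Get-AzureVNetConfig -ExportToFile "C:\temp\netcfg.xml"

    # ... hand-edit netcfg.xml to add or remove a VNet or subnet ...

    Set-AzureVNetConfig -ConfigurationPath "C:\temp\netcfg.xml"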

Documentation

It's okay, but heavily biased toward interactive creation of assets. Mostly non-existent for Azure Resource Manager. I hope they improve this soon.

The Portal(s)

They seem user friendly enough at first glance. In reality, they're a mess. There have been numerous times I've had things completely hang in the preview portal and get stuck in a bad state, and numerous times in the classic portal where I've been massively confused about where to find certain things: the icons don't tell the whole story about what the underlying service actually is, and the services aren't alphabetized, just jumbled together in one damned list.

The preview portal also has this weird horizontal scrolling paradigm going, which is completely awkward with a regular mouse and a bit annoying with a Mac trackpad in Chrome (it'll sometimes go back a page if I try to scroll back). Vertically scrolling pages and a sidebar, like the classic portal has, would be much better.

What I would absolutely love to see is a combination of the two: the alphabetical ordering of resources and immediate listing of service names from the preview portal, paired with the much simpler, less confusing, and not needlessly complex interface of the classic portal.

By the way: the preview portal and the classic portal have the exact same weird division as the v1 and v2 APIs/resources. You can only see v2 resources in the preview portal, and there are some v1 resources you can only see in the classic portal (Service Bus, for example).

I hope they unify the portals soon, and I really hope they get their design paradigms straight.

VMs

Nothing to write home about. You can create Windows and Linux machines at will, and they seem fairly performant, though I haven't run any extensive performance tests. There are a few things that bother me, however:

  1. No Azure Metadata Service. This makes it hard to import out-of-band attributes that you need to access later on. Chef is a great example: if you bootstrap using knife, the plugin has full knowledge of the outside world, so it can set attributes relating to things like the Azure cloud service the VM is being deployed to. If you're deploying via ARM (via the Chef extension), you have no such introspection, and there are no parameters that allow you to set custom Ohai hints or JSON attributes - so you need to jump through a bunch of hoops, modifying node attributes via knife exec.
  2. Windows VM images vended by Microsoft do not have WinRM properly configured by default. You need to go through a number of steps, such as generating an SSL certificate and opening up the firewall on the machine, in order to actually use WinRM (see the sketch after this list).
  3. Completely different paradigms of handling VMs between v1 and v2. In v1, you have what's called a cloud service, which you can think of as a one-to-many NAT with a DNS record and a publicly accessible IP address attached to it. You can have every one of your VMs have a cloud service attached to it (so each of your VMs are individually addressable from the outside, a la EC2 Classic), or have multiple VMs behind a single cloud service. In v2, you don't have that anymore - you instead use load balancers and public IP addresses attached to each VM. This could make migrations between v1 and v2 somewhat difficult.
  4. Vendor images and custom created images occupy different 'spaces' - you cannot refer to one or the other in a consistent manner, and you need to have a 'storage account' to store your custom images in.
  5. Vendor images from Microsoft are updated every few months, and the oldest image is deleted. This can cause havoc in your automation scripts every few months. I highly recommend creating custom images in order to keep the image in a semi-stable state, but then you have to deal with (4), which adds even more complexity...
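On point (2), the WinRM setup on a vendor image looks roughly like the following, run on the VM itself. This is a sketch, not a hardened configuration, and you'll still need to expose port 5986 through your cloud service endpoints or load balancer:

    # Generate a self-signed cert, bind an HTTPS listener to it,
    # and open the local firewall for WinRM over HTTPS (5986).
    $cert = New-SelfSignedCertificate -DnsName $env:COMPUTERNAME `
                                      -CertStoreLocation Cert:\LocalMachine\My

    New-Item -Path WSMan:\localhost\Listener -Transport HTTPS -Address * `
             -CertificateThumbPrint $cert.Thumbprint -Force

    New-NetFirewallRule -DisplayName "WinRM HTTPS" -Direction Inbound `
                        -LocalPort 5986 -Protocol TCP -Action Allow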

Point (1) has been a huge thorn in our sides, since we use Chef fairly extensively with Chef Search, and the lack of metadata makes it incredibly hard to divide up clusters based on what cloud service they're in. We've run into race conditions and numerous other problems trying to work around the fact that there is no such thing as an Azure Metadata Service (a sketch of the workaround is below). You may have more luck setting up something like etcd or consul, but YMMV.
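The hoop-jumping looks something like this - the node name and attribute here are hypothetical, but this is the shape of it: because the node can't discover its own cloud service, you stamp the attribute on from the outside after the fact:

    # Hypothetical: tag a node with the cloud service it lives in,
    # since nothing on the node itself can tell it.
    knife exec -E 'nodes.transform("name:web1") { |n| n.normal["cloud_service"] = "prod-web-cs" }'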

Interactions with Microsoft

Abysmal. Let me just get that out there without any polish. Multiple times, we've been told to go pound sand; multiple times, we've been told to just go RTFM; and multiple times, we've sat on conference calls just to discuss when the next conference call should be. All of this could have been avoided if they had:

  1. Not deprecated APIs so quickly
  2. Adequately documented all of their APIs and schemas, or at least linked to where we could find more information
  3. Kept a mindset of customer satisfaction over reaching feature parity with the biggest contender on the block
  4. Recognized that there are significant production deployments in Azure, and that they cannot just completely drop support for a huge class of resources without a good amount of backlash and resentment

If I were running a web service and needed to choose between Azure and whoever else, I'd probably go somewhere else. While their services are rock solid, their handling of APIs, documentation, and customer interaction has been so dismal that I have little confidence my service would actually stay running: they change things all the damn time, and they want people on the newest, shiniest thing as much as possible. For that reason, Azure is, in reality, not a very stable platform at all - and unfortunately, that has nothing to do with power outages or network faults.