Signs that your Cloud Architecture is Unhealthy

If you work at a tech company and these apply to you, your architecture and probably tech culture could use a renovation:

1. Deployments are Untrusted during "Big Events"

Maybe you're showing the product off to a potential investor, or you're at a trade show or convention, and you've been told "not to touch prod" until the buzz dies down.

2. You have to do a lot of night or off-hours Maintenance

Similar to 1: because deployments and maintenance are not trusted to just work, engineers are expected to work nights and weekends to deploy their applications or perform trivial updates.

3. Everyone has Access to Production

Depending on what company you're at, this might not be a severe problem, but it's a problem nonetheless. Usually it starts out where no one has access. Then the request tickets start coming in, and one by one engineers are granted access to production resources and databases, probably without any automatic expiration of credentials. When access is revoked, hell is raised.

Usually this happens because the development or staging environments have diverged so much from production that they are no longer useful environments, and are more or less just there for show.
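
As a rough illustration of the "automatic expiration" that is usually missing, here is a minimal sketch of access grants that lapse on their own. Everything in it (the AccessGrant record, the grant_access helper, the 4-hour default) is hypothetical and not tied to any real IAM product; a real setup would lean on whatever short-lived credential mechanism your cloud provider offers.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class AccessGrant:
    """Hypothetical record of one engineer's temporary production access."""
    engineer: str
    resource: str
    expires_at: datetime

    def is_active(self, now: datetime) -> bool:
        # Access is valid only strictly before the expiry time.
        return now < self.expires_at


def grant_access(engineer: str, resource: str, now: datetime,
                 ttl: timedelta = timedelta(hours=4)) -> AccessGrant:
    """Issue access that ends after `ttl` (the 4-hour default is arbitrary)."""
    return AccessGrant(engineer, resource, expires_at=now + ttl)


def active_grants(grants: list[AccessGrant], now: datetime) -> list[AccessGrant]:
    """Return only the grants that have not yet expired."""
    return [g for g in grants if g.is_active(now)]


if __name__ == "__main__":
    start = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)
    grant = grant_access("alice", "orders-db", now=start)
    print(grant.is_active(start + timedelta(hours=1)))  # True
    print(grant.is_active(start + timedelta(hours=5)))  # False
```

The point of the design is that nobody has to remember to revoke anything: the default outcome is that access goes away.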

4. Any one particular server dying could result in a catastrophe

Despite being on a public cloud, people tend to forget that at any moment a server could just die. Usually it's not the provider's fault, but sometimes it is, and is therefore totally out of your hands. Of course, this server was running a stateful application like a database, or was running a really important job that couldn't be interrupted, or was in the middle of serving a very long, un-retryable request to an important client.

5. It is extremely difficult to re-create any particular environment

Maybe someone played with Terraform or CloudFormation way back when, but at this point your environment is maintained entirely by hand. Pray no one asks you to set up a QA environment.

Fun fact: I have worked in a place where I was given a fully manually configured environment, and the powers that be could not fathom spending time and effort on putting it into code. I spent free time importing as many resources as I could into Terraform.

6. Running your configuration management tool (Ansible, Puppet, Chef, etc.) is dangerous and avoided

Due to the number of manual changes made to servers, and how long it's been since configuration management was last executed on a particular server, you have no idea what it's going to do when you run it next, so you just avoid it and make your change manually.
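
The fear here is really about not knowing the gap between what the tool wants and what the server has. A minimal sketch of a drift report, assuming each server's settings can be flattened into a simple key/value mapping (the diff_config function and the sample settings are made up for illustration):

```python
def diff_config(desired: dict[str, str],
                actual: dict[str, str]) -> dict[str, tuple]:
    """Return {setting: (actual_value, desired_value)} for every difference.

    A setting missing on either side is reported as None on that side.
    """
    changes = {}
    for key in sorted(desired.keys() | actual.keys()):
        have, want = actual.get(key), desired.get(key)
        if have != want:
            changes[key] = (have, want)
    return changes


if __name__ == "__main__":
    # What configuration management thinks the server should look like.
    desired = {"nginx_version": "1.24", "max_connections": "1024"}
    # What the server looks like after months of manual changes.
    actual = {"nginx_version": "1.24", "max_connections": "4096",
              "debug_flag": "on"}
    for setting, (have, want) in diff_config(desired, actual).items():
        print(f"{setting}: {have!r} -> {want!r}")
```

Real tools ship their own version of this: Ansible has a check mode (--check) and Puppet has --noop, both of which report what would change without changing it. Running those regularly is a much cheaper habit than avoiding the tool.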

7. Multiple microservices must be deployed "at the same time" to avoid downtime

This one is pretty self-explanatory, but in case it's not: there is no such thing as microservices going out at the same time. Especially when there are multiple replicas of the same service, one service will always be upgraded before the other, so there will be downtime until the other service comes up.
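
To make the "no such thing as the same time" point concrete, here is a toy model of a rolling upgrade that replaces one replica per step. The rollout_states and mixed_steps functions are invented for this sketch and do not model any particular orchestrator.

```python
def rollout_states(replicas: int, old: str = "v1",
                   new: str = "v2") -> list[list[str]]:
    """List the versions running after each step of a one-at-a-time upgrade."""
    return [[new] * done + [old] * (replicas - done)
            for done in range(replicas + 1)]


def mixed_steps(states: list[list[str]]) -> list[list[str]]:
    """Return the steps in which more than one version is serving traffic."""
    return [state for state in states if len(set(state)) > 1]


if __name__ == "__main__":
    states = rollout_states(3)
    for state in states:
        print(state)
    # With 3 replicas, 2 of the 4 steps have both versions live at once.
    print(len(mixed_steps(states)))  # 2
```

During every mixed step a caller can land on either version, so a service that only works against the new version of its neighbour is broken for that whole window, however carefully the deploys are timed.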

8. The development of in-house solutions is considered hostile, and purchasing off-the-shelf products is always favored

There is always a time and place to purchase off-the-shelf products, but developing in-house solutions should always be considered a valid option. If you're afraid to write software, then it's time to pack up the ol' keyboard because you're in the wrong business.

Off-the-shelf software saves you a fixed startup cost, say 40 hours, meaning you can get up and running quicker than if you were writing something from scratch. However, the planet keeps on spinning and tomorrow tends to always come, and because that software wasn't tailor-made for your needs, you might find yourself spending more and more time wrangling it to do what you want. Eventually those 40 saved hours will be gone, and you'll be in the red in both time and money.
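
The arithmetic is simple enough to sketch. The 40 hours comes from the example above; the 2 hours per week of wrangling is a number I made up, and a constant weekly cost is itself a simplifying assumption (in practice it tends to grow).

```python
def break_even_weeks(hours_saved_upfront: float,
                     extra_hours_per_week: float) -> float:
    """Weeks until ongoing wrangling has used up the up-front saving."""
    if extra_hours_per_week <= 0:
        raise ValueError("with no ongoing overhead the saving is never used up")
    return hours_saved_upfront / extra_hours_per_week


def net_hours_saved(hours_saved_upfront: float,
                    extra_hours_per_week: float, weeks: float) -> float:
    """Hours still ahead (positive) or in the red (negative) after `weeks`."""
    return hours_saved_upfront - extra_hours_per_week * weeks


if __name__ == "__main__":
    print(break_even_weeks(40, 2))     # 20.0
    print(net_hours_saved(40, 2, 30))  # -20
```

Under those assumed numbers the purchase stops paying for itself after 20 weeks, and by week 30 you are 20 hours behind where building it yourself would have put you.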

Solutions

Build for failures or they will find you

Netflix's Chaos Monkey is well known at this point. Every single request, database query and job should be built with the understanding that it could and probably will fail at some point. This shouldn't catch anyone by surprise. Don't point fingers over whose fault it is when a failure happens, because you should be expecting failures at any given moment.

Code should automatically retry requests. Databases should be on at least 2 and ideally more than 3 instances, automatically failing over when necessary.
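
A minimal sketch of what "automatically retry" can look like, using exponential backoff with jitter. The retry helper, its parameter names and its defaults are my own choices for illustration, not a particular library's API, and it assumes the operation is safe to repeat.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(operation: Callable[[], T], attempts: int = 5,
          base_delay: float = 0.5, max_delay: float = 30.0,
          sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `operation` until it succeeds or `attempts` calls have failed."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception:
            # Catching Exception keeps the sketch short; real code should
            # catch only errors it knows to be transient.
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            # Double the ceiling each time, cap it, then pick a random delay
            # below it so many clients do not retry in lockstep.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))
    raise AssertionError("unreachable")


if __name__ == "__main__":
    calls = {"count": 0}

    def flaky() -> str:
        # Hypothetical operation that fails twice, then succeeds.
        calls["count"] += 1
        if calls["count"] < 3:
            raise ConnectionError("transient failure")
        return "ok"

    print(retry(flaky, base_delay=0.01))  # ok
```

Only wrap operations that are idempotent. Blindly retrying a payment or any other non-repeatable write trades one failure for a worse one.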

Prioritize good coding practices

There's nothing new to say here: good code makes for a product that is safe to change and resilient to bugs, errors, and unexpected cases. Don't be afraid to refactor, despite what Hacker News tells you. Writing tests and decoupling unwieldy code should be treated as part of the normal software development workflow and not some weird thing you do after the fact when you have some extra time (there is no such thing as extra time).

This means building for the long haul. Avoid getting bitten in the ass 2 months from now because of a decision to get something out the door slightly quicker. Those 2 months will always come.

Use your automation as much as possible

There's a place for doing manual work, especially when you're just testing out ideas or new products, but the moment that manual changes make their way to prod, your automation begins to rot. You should be able to make a change to 100 servers as easily as you can to 1, and you should know exactly how each server is configured.
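
Here is a toy sketch of what "100 servers as easily as 1" means in code: one declared desired state, applied the same way to every host, and safe to run again. The inventory is just an in-memory dictionary, and apply_desired_state plus the host names and NTP addresses are invented for the example. Real tooling would talk to actual machines.

```python
def apply_desired_state(inventory: dict[str, dict[str, str]],
                        desired: dict[str, str]) -> list[str]:
    """Bring every host in line with `desired`; return the hosts that changed."""
    changed = []
    for host, settings in inventory.items():
        # Only touch hosts that actually differ, so a second run is a no-op.
        if any(settings.get(key) != value for key, value in desired.items()):
            settings.update(desired)
            changed.append(host)
    return changed


if __name__ == "__main__":
    # 100 hypothetical servers, all still pointing at an old NTP server.
    fleet = {f"web-{i:03d}": {"ntp_server": "old.example.internal"}
             for i in range(100)}
    desired = {"ntp_server": "time.example.internal"}

    print(len(apply_desired_state(fleet, desired)))  # 100
    print(len(apply_desired_state(fleet, desired)))  # 0
```

The second run changing nothing is the property that matters. It means the declared state, not somebody's memory, is the record of how each server is configured.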

Automation, like good programming practices, is a startup cost that pays off in the future. It can be really easy to just say "we'll tackle that another time and just get this done now," but this inevitably comes back to bite you in the ass and could take down the whole company with it.