
Understanding Trust in Your Infrastructure

Engineering for Site Reliability · 1,608 words · 8 minutes to read

Only a tiny fraction of the code your application runs was written by you or your team. How do you know you can trust the code that was written by other people? Where would you even start?

[Image: Trust dial]

What do I mean by “trust”?

Movies and TV shows have given us a version of trust that essentially boils down to “Do you trust me?”, one character holding out a hand to another. In the movies, things generally work out in the end, even if the characters run into a little more trouble along the way. This is the kind of trust that teenagers, newly in love, place in each other.

This is also the kind of trust that most engineers place in their software dependencies. That is not what trust is, and it is a high-risk way to build applications.

[Image: Teen Titans]

If you’ve ever been spurned by an ex-lover, or have grown up around shady people, you’ll likely have a different definition of trust. A marriage counselor may say something like “trust, but verify.” A person who has grown up in a bad neighborhood, or around shady people, may take the view that trust is earned, not given. A certain amount of paranoia is a good thing.

However, as with everything, you can also have too much paranoia. These are the teams who ship an application and, if it isn’t broken, never touch it again. Their curse is that they fall so far behind the security and maintenance curves that their applications become ticking time bombs, defeating the very purpose they think their paranoia addresses.

The point that I’d like you to take away from this is that trust is earned, not given. When you come from this perspective, you make better technical decisions.

What is my application?

Depending on the type of engineer you are (front-end, back-end, ops), you may look at the applications you work on through different lenses.

  • Some see the client-side, browser code they’re writing.
  • Some see the Golang, Node.js, Python, or PHP code they’re writing.
  • Some see the package dependencies, and their package dependencies, and so on…
  • Some see code like the Docker runtime, OpenSSL, cURL, or the Linux kernel.

In truth, all of these answers are correct. The best engineers know how important it is to look at the entire stack — from the application, to the runtime, to the hypervisor, to the kernel.

Reusable layers, and understanding trust

It’s a common (and extremely sensible) pattern to re-use and build atop existing technology layers. By leveraging this powerful foundation, we can build bigger, better, and more powerful applications and services! But we also need to understand how core concepts like trust work between all of these layers.

[Image: Wrong-way sign]

Let me give a few examples of anti-patterns that are also very commonplace in many organizations (mostly due to ignorance, as opposed to malice):

NOTE: I’m speaking from a context of applications which run on popular cloud infrastructure services like AWS, GCP, or Azure, and have sane processes in place like actively-supported system images (e.g., AMIs).

  • Fetching application dependencies live from upstream sources (e.g., the internet is ephemeral; is your app?).

  • Running package manager updates when spinning up a new machine (e.g., modifying the underlying system image at boot-time; yum -y update).

  • Running package manager updates when deploying to Production (e.g., picking up potentially untested software without a testing stage in-between).

  • Adding new package manager repositories from random places on the internet (e.g., taking candy from strangers).

  • Relying exclusively on a single availability zone or region from their cloud infrastructure provider.

“These aren’t anti-patterns,” you say. “They’re just how development is done.”

Thank you for your thoughts, hypothetical reader. But consider the following:

An unpublished package broke the internet

In case you forgot, in early 2016, one package broke the entire Node.js ecosystem.

[Image: Broken collarbone]

David Haney writes in his piece “NPM & left-pad: Have We Forgotten How To Program?”:

Okay developers, time to have a serious talk. As you are probably already aware, this week React, Babel, and a bunch of other high-profile packages on NPM broke. The reason they broke is rather astounding:

A simple NPM package called left-pad that was a dependency of their code.

left-pad, at the time of writing this, has 11 stars on GitHub. The entire package is 11 simple lines that implement a basic left-pad string function. […]

What concerns me here is that so many packages and projects took on a dependency for a simple left padding string function, rather than their developers taking 2 minutes to write such a basic function themselves.

Each and every application team that was hit by this issue, and allowed it to impact a Production-facing deployment, failed to understand trust.

In this case, they should have implemented a package caching system, which can fetch a dependency on the first request, then cache that version for all subsequent requests. That way, if there is an issue with an upstream source, you will not be impacted.
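
Here’s a rough sketch of what that looks like in practice, assuming an internal caching proxy such as Verdaccio or Nexus (the registry hostname below is hypothetical): point your package manager at the proxy instead of the public registry, and commit a lockfile so every build resolves the exact same versions.

    # Point npm at an internal caching proxy instead of the public registry.
    # The proxy fetches from npmjs.org the first time a package is requested,
    # then serves its cached copy on every subsequent request, so an upstream
    # unpublish cannot break your builds.
    # (registry.internal.example.com is a hypothetical hostname.)
    npm config set registry https://registry.internal.example.com/

    # Commit the lockfile so every environment resolves identical versions.
    npm install
    git add package-lock.json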

Crashing the entire stack

I was working at Amazon Web Services back in 2010 when AWS Elastic Beanstalk was still in development. The team was working to build an easy-to-use solution around the idea of “application containers” (back before Docker was spun out of dotCloud, an early PaaS provider). At the time, I was helping them add PHP + Apache support to Elastic Beanstalk in time for launch, as I was the de facto “PHP guy” at AWS.

[Image: /etc/passwd]

Development was running on a pre-release version of what would become Amazon Linux. The original configuration was designed to run yum -y update on boot, which essentially means “pick up the latest versions of all installed packages.” The team was thinking about system security (and avoiding outdated packages), but everything broke on the day the Amazon Linux team published a new version of Apache with backwards-incompatible changes. The development team failed to understand trust.

Fortunately, this was a little before the public launch, so only a few internal beta customers and developers were impacted. But watching that incident was the day I learned that you don’t arbitrarily install all system updates. You apply them in your development environment first, work out the issues, then roll out to Production something that has been tested and works as expected.
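
Here is a minimal sketch of the difference, assuming a RHEL-family image (the package version strings below are illustrative, not real releases):

    # Anti-pattern: mutate the system image at boot or deploy time.
    yum -y update

    # Instead: run the update inside a development image build, test it, then
    # bake and promote an image that pins the exact versions you verified.
    yum -y install httpd-2.4.6-93.el7 openssl-1.0.2k-19.el7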

Dev/Prod parity

If you’ve never heard of the 12-factor app methodology, you are absolutely missing out. One of the chapters is entitled “Dev/prod parity”, which essentially boils down to keeping development, staging, and production as similar as possible.

One thing I’ve seen bite a team: they were deploying an application by pushing the source code from Git to the production instances, then resolving their package dependencies directly on those instances. (To be fair, this was back in the days when Capistrano was hot, and we’ve come a long way since then.)

But even in the world of Docker and continuous integration, I still see similar things happen. A team will build a Docker image in their CI pipeline, push it to their Docker registry, then deploy it to dev. Then they build the image again when deploying to staging, and again when they deploy to Prod. This is the same problem! The dependencies are not being tested in the earlier environments before they progress to the production environment.

With Docker, some teams have figured out how to make the exact same mistakes even faster! Those teams have failed to understand trust.

Instead, you should build the production-ready Docker image once, then promote that same image up to each environment as the requisite confidence is built.
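
A minimal sketch of that flow, with a hypothetical registry name and tags:

    # Build the image exactly once, keyed to the commit being shipped.
    IMAGE=registry.example.com/myapp
    SHA=$(git rev-parse --short HEAD)

    docker build -t "$IMAGE:$SHA" .
    docker push "$IMAGE:$SHA"

    # Promotion is re-tagging the same bytes; you never rebuild.
    docker tag "$IMAGE:$SHA" "$IMAGE:staging"    && docker push "$IMAGE:staging"
    docker tag "$IMAGE:$SHA" "$IMAGE:production" && docker push "$IMAGE:production"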

“But how do I include my development dependencies inside my Docker container?”

[Image: Batman slaps Robin]

Docker images that are built should be the exact same bytes, regardless of the environment. Your dev build should write out logs in the same way as your Production app would (although perhaps to a local location). You should be able to pass things like environment variables at container launch, or mount a local volume containing configuration information. But the insides of the Docker image should always be completely identical between environments.
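
For example (the image name, variable names, and paths below are hypothetical), the same image runs in every environment and only the runtime configuration changes:

    # Development: verbose logging, no mounted configuration.
    docker run -e APP_ENV=dev -e LOG_LEVEL=debug \
      registry.example.com/myapp:abc1234

    # Production: the same image, different environment variables, plus a
    # read-only volume carrying the production configuration.
    docker run -e APP_ENV=prod -e LOG_LEVEL=info \
      -v /etc/myapp/prod.conf:/app/config.conf:ro \
      registry.example.com/myapp:abc1234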

Guidelines for trust

When you’re provisioning software onto a machine that will run in Production, you don’t want to be pulling software from just anywhere. You need to know that you can trust the source of the software before you ship it into Production.

In my case, I tend to work on teams which run servers with a blend of RedHat Enterprise Linux (RHEL), CentOS, and Amazon Linux. Containers are commonly Ubuntu, Debian, or Alpine. I work with applications written in nearly every major programming language. These are my criteria for determining whether or not to trust a package or Docker image.

  1. Packages are maintained by CentOS, RedHat, Amazon, Ubuntu, Debian, Alpine, etc., directly.

  2. Packages are maintained by the vendor of the software directly (e.g., Docker, Amazon, PHP, Node Foundation, Angular, Kubernetes); see the example after this list.

  3. Packages are maintained by a reputable third-party source (as few of these as possible; e.g., NodeSource).

  4. Packages are maintained by us. That is, we compile them from source ourselves (into .rpm, .deb, or .apk packages), or we write the software packages ourselves (e.g., composer, pip, npm, dep).
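
As a concrete example of the second criterion, installing Docker on CentOS from the vendor’s own repository rather than from a random mirror looks roughly like this (the commands follow Docker’s published install steps; adjust for your distribution):

    # Add the vendor-maintained repository, then install from it.
    yum -y install yum-utils
    yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
    yum -y install docker-ce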

Your criteria may look different, and that’s OK. Some engineering teams are better at this, while others are still maturing.

If you don’t have criteria, and generally just install software from anywhere, I have two pieces of advice.

  1. Stop it.

  2. Our criteria have been very good to us. Feel free to borrow them.
