Continued from previous post. Written in public, but not edited for public consumption.
Ultimately, a list of desiderata isn't useful without understanding the use cases that drive them.
Who is this for? What is their goal? What are the core use cases I want to enable?
Concretely, I have four goals:
- For the experimenting engineer, I want time-to-new-API to approach zero.
- For the developing/testing engineer, I want time-to-peer-testable-API to approach zero.
- For the deploying engineer, I want to minimize rollout risk.
- For the operating engineer, I want to maximize instrumentation and operability.
In a healthy devops culture, where a team owns their product top to bottom, these engineers are all the same set of people.
It's worth unpacking each of these statements.
The Experimenting Engineer
Anecdotally, projects which have short time-to-proof-of-life gather momentum, support, and funding much more quickly than those that don't. Stories sell, and stories with working prototypes sell better.
Indeed, the most successful projects are those which never go through a formal approval process; they instead iterate quickly and quietly, gathering feedback from users with each pass and improving until they become indispensably useful.
Obvious: all other things being equal, a system that encourages experimentation and iteration in a safe, scalable, maintainable way will always beat one that doesn't.
Specifically, the experimenting engineer is one writing software that interacts with some control plane: primarily SDKs, CLIs, and web UIs. This software is distributed (if only ephemerally, in the case of web UIs) and runs on client machines, introducing compatibility, concurrency, and trust challenges.
"Time to New API"
When experimenting, time-to-task is an indirect measure of frustration. The longer a task takes, the more likely the user will be frustrated by it.
"New API" means both a brand new API (new product area) and an evolution of an existing API. Creating either should be equally fast and easy.
The Developing/Testing Engineer
It's not enough that a developer can get something running on their own machine. They must also be able to get it running on someone else's.
Most healthy development teams have a process for reviewing and testing code. At my last company, reviewers regularly ran through test plans on posted code reviews.
The most troublesome changes happen when a client change relies on an API change. You need to test several combinations to have confidence that the change is safe:
- Old API, old clients (baseline)
- New API, new clients (the thing you actually care about)
- New API, old clients (ensure new API is backwards compatible, and can support gradual rollouts and rollbacks)
- Old API, new clients (ensure client is backwards compatible)
And, this is just against a local environment. In many cases, you also want to test new clients against multiple staging and production environments. The latter requires the ability to perform shadow deployments, and selectively enable access/visibility to them.
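The four combinations above lend themselves to an automated matrix. Here's a minimal sketch; `spawn_api` and `request_ok` are hypothetical stand-ins for whatever actually launches an API server at a given version and runs the verification suite against it:

```python
import itertools

def spawn_api(version):
    """Hypothetical: launch an API server at `version`; returns a handle."""
    return {"api_version": version}

def request_ok(client_version, api):
    """Hypothetical: run the verification suite for one client/API pairing."""
    return client_version in ("old", "new") and api["api_version"] in ("old", "new")

def run_matrix():
    """Exercise every (API version, client version) combination."""
    results = {}
    for api_version, client_version in itertools.product(["old", "new"], repeat=2):
        api = spawn_api(api_version)
        results[(api_version, client_version)] = request_ok(client_version, api)
    return results
```

The point is less the code than the shape: if spinning up an environment at an arbitrary API version is cheap, the whole matrix becomes one parametrized test run instead of four manual sessions.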
The Deploying Engineer
The deploying engineer cares about predictable rollout and rollback mechanisms. They need to be able to quickly understand the nature of a change and its probable impact, both ahead of time (e.g., in preparation for a deployment) and in retrospect (e.g., while investigating potential causes of a livesite issue).
During the deployment itself, the engineer needs confidence that the rollout isn't adversely impacting customers. That means being able to gradually roll out and monitor the change. If an issue is detected, the operator should be able to cancel the deployment, and the system should provide an easy means of getting back to a known-good state. In practice, this implies:
- Persistent health checks
- Gradual deployments
- Change verification tests
- Cancellable rollouts
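Those four mechanisms compose into a single loop. A hedged sketch, where `deploy_to_fraction`, `passes_health_checks`, and `roll_back` are hypothetical stand-ins for real deployment-tool operations:

```python
def deploy_to_fraction(fraction, state):
    """Hypothetical: push the new version to this slice of the fleet."""
    state["deployed"] = fraction

def passes_health_checks(state):
    """Hypothetical: consult persistent health checks and verification tests.
    Here we just simulate a failure past a configured threshold."""
    return state["deployed"] <= state["breaks_at"]

def roll_back(state):
    """Hypothetical: restore the previous known-good state."""
    state["deployed"] = 0.0

def gradual_rollout(state, stages=(0.01, 0.10, 0.50, 1.00)):
    """Deploy in stages; cancel and roll back on the first failed check."""
    for fraction in stages:
        deploy_to_fraction(fraction, state)
        if not passes_health_checks(state):
            roll_back(state)
            return "rolled back"
    return "complete"
```

Nothing here is specific to API definitions, which is exactly the observation that follows.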
This ends up looking a lot like a general-purpose deployment tool, which raises the question: can we use an existing general-purpose tool? Perhaps the same one used to deploy the gateway itself?
Doing so requires that we decouple application configuration distribution from feature activation. That decoupling turns out to be desirable anyway, as it supports the shadow deployment, feature flag, and whitelist scenarios mentioned earlier.
It also allows us to manage application code and application configuration as a single deployable unit (part of the same dependency closure), which is necessary for creating byte-for-byte clones of an environment: vital for building test environments and reproducing issues.
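To make the decoupling concrete, here's a sketch: the distributed config (including a new, still-dark API) ships with the deployable unit, while a separate flag store controls what is actually live. All names here (`widgets.v2`, `FLAGS`) are hypothetical.

```python
# Configuration travels with the deployment, as part of the dependency closure.
DISTRIBUTED_CONFIG = {
    "apis": {
        "widgets.v1": {"handler": "widgets_v1"},
        "widgets.v2": {"handler": "widgets_v2"},  # shipped, but not yet active
    }
}

# Activation flags flip independently of any deployment.
FLAGS = {"widgets.v2": False}

def active_apis(config, flags):
    """Return only the APIs whose activation flag is on (default on)."""
    return {
        name: spec
        for name, spec in config["apis"].items()
        if flags.get(name, True)
    }
```

Rolling out `widgets.v2` is then two independently reversible steps: deploy the config everywhere, then flip the flag (perhaps only for a whitelist of testers).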
The Operating Engineer
The operating engineer cares about issues that come up after the software is already in production. The system should help the operator answer the following questions:
- "A thing happened that should not have happened. Why?"
- "What do I do about it?"
During an ongoing issue/outage, where there is a strong correlation between "deployment time" and "time that bad things started happening", these questions are typically answered in reverse order, with the initial resolution being, "Roll back the change."
But where no such correlation exists, or when performing a Root Cause Analysis for a permanent fix, the system must actively aid in answering these questions.
Practically speaking, this means multiple layers of monitoring and alarming:
- Self-reported application telemetry provides the highest level of granularity and visibility, but is the first to fail when application processes do.
- Agent- and host-reported application and host telemetry provides redundancy when application-reported telemetry fails, and can provide an alternate view to help triangulate issues.
- Externally reported application and host telemetry, often in the form of pings and "telemetry drop detection", provides a view of the application from outside the host but still inside the production network. This can be especially important for detecting network issues between the load balancer and the host.
- POP-reported telemetry, gathered at points of presence near your users, provides a true "outsider's view" for detecting spikes in client-perceived latencies, dropped connections, etc. This is vital when your customers are globally distributed.
To help answer the two core questions above, the system should contribute metrics and debugging information to each layer in some way.
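The "telemetry drop detection" from the third layer, for example, can be sketched as an external monitor that flags hosts whose self-reported telemetry has gone stale. `last_seen` is a hypothetical map that a real telemetry pipeline would feed:

```python
STALE_AFTER_SECONDS = 60  # assumed threshold; tune to your reporting interval

def dropped_hosts(last_seen, now):
    """Return hosts whose most recent telemetry is older than the threshold.

    `last_seen` maps host name -> timestamp of its last reported metric.
    """
    return sorted(h for h, t in last_seen.items() if now - t > STALE_AFTER_SECONDS)
```

The silence itself is the signal: a host that stops reporting is precisely the host whose self-reported telemetry can no longer be trusted.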