What is this?
I'm using my blog to take notes for a project I'm noodling on. It's being done in the open, but isn't written for public consumption.
What does it mean to have a cloud-native, container-native, web-native API gateway?
Why This Question?
Every large company I've worked for has eventually needed to build an API gateway, whether for their own internal use, or as a customer-facing product. I've seen... five? ...different interpretations of an API gateway.
That seems like a lot of wasted engineering. Can we do better? What might an extensible API gateway platform enabling the following desiderata look like?
Table-stakes. Most UI clients and all non-UI clients expect REST as the primary control plane API interface.
Native WebSocket Support
In particular, WebSocket support includes the ability for a stateless backend client to 'address' a message by clientId and have it arrive at the server holding that client's connection.
WebSockets and mailbox support are needed for GraphQL subscriptions.
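A minimal sketch of the addressing idea, assuming an in-memory registry. The names ConnectionRegistry, register, send, and drain are hypothetical, and a real deployment would presumably back this with a shared store rather than a dict.

```python
from collections import defaultdict, deque


class ConnectionRegistry:
    """Hypothetical registry: maps each clientId to the node holding its WebSocket."""

    def __init__(self):
        self._node_for_client = {}
        # node name -> queue of (client_id, message) waiting to be delivered
        self._mailboxes = defaultdict(deque)

    def register(self, client_id, node):
        self._node_for_client[client_id] = node

    def unregister(self, client_id):
        self._node_for_client.pop(client_id, None)

    def send(self, client_id, message):
        """Address a message by clientId; return the node it was routed to, or None."""
        node = self._node_for_client.get(client_id)
        if node is None:
            return None
        self._mailboxes[node].append((client_id, message))
        return node

    def drain(self, node):
        """Called by a gateway node to collect messages for its own connections."""
        pending = list(self._mailboxes[node])
        self._mailboxes[node].clear()
        return pending
```

The backend caller never learns which node holds the socket; it only knows the clientId.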
Streaming IO By Default
Data plane APIs push a lot of bytes. To effectively support these APIs while also allowing plugin-provided middleware, we need to be doing streaming IO by default. This increases the complexity of middleware APIs.
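To make the middleware complexity concrete, here is one possible shape for streaming middleware, assuming each middleware is a function from an iterator of byte chunks to an iterator of byte chunks. The names uppercase, count_bytes, and compose are made up for illustration.

```python
def uppercase(chunks):
    """Transforming middleware: rewrites each chunk as it passes through."""
    for chunk in chunks:
        yield chunk.upper()


def count_bytes(stats):
    """Observing middleware: tallies bytes without buffering the body."""
    def middleware(chunks):
        for chunk in chunks:
            stats["bytes"] = stats.get("bytes", 0) + len(chunk)
            yield chunk
    return middleware


def compose(*middlewares):
    """Chain middlewares left to right into a single lazy pipeline."""
    def pipeline(chunks):
        stream = iter(chunks)
        for mw in middlewares:
            stream = mw(stream)
        return stream
    return pipeline


stats = {}
pipeline = compose(uppercase, count_bytes(stats))
body = pipeline([b"ab", b"cde"])  # nothing has been read yet
first = next(body)                # pulls exactly one chunk through both stages
```

Because every stage is a generator, no stage ever holds the full body, which is the property the gateway needs for large data plane payloads.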
Native GraphQL Support
Not obvious. We consider UI clients to be API clients. Increasingly, callers expect to have control over the shape of the data returned.
Especially useful: subscription and mutation support, which are hard to get right without the right underlying primitives (message routing, workflows, etc).
Native Workflow Support
Not obvious. Needed for asynchronous requests, scheduled requests, triggered requests. Especially useful for complex GraphQL Mutations that may need to span multiple backend systems.
Native Monotonically Increasing ID Generation
Not super obvious. TBD, since this quickly becomes a scaling bottleneck. Useful for easy coordination across systems. Might be needed as basis for distributed lease library (again, useful for GraphQL Mutations support), and Workflow support.
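A single-process sketch of what such a generator could look like, assuming a millisecond clock with a few low bits reserved for IDs minted within the same millisecond. MonotonicIdGenerator and its parameters are hypothetical, and this does nothing to address the cross-node scaling bottleneck noted above.

```python
import threading
import time


class MonotonicIdGenerator:
    """Hypothetical generator: strictly increasing IDs within one process."""

    def __init__(self, now_ms=lambda: int(time.time() * 1000), counter_bits=16):
        self._now_ms = now_ms              # injectable clock, so tests can control time
        self._counter_bits = counter_bits  # low bits left free for same-millisecond IDs
        self._last = 0
        self._lock = threading.Lock()

    def next_id(self):
        with self._lock:
            candidate = self._now_ms() << self._counter_bits
            # If the clock stalled or went backwards, bump the previous ID instead,
            # so the sequence never repeats or decreases.
            self._last = max(candidate, self._last + 1)
            return self._last
```

Taking the clock as a parameter also lines up with the virtual time requirement further down.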
Native Functions Support
A cloud-native API gateway should support cloud-native primitives. There's increasing demand for true serverless development.
At minimum, this means that developers should be able to use the same Functions-as-a-Service tooling and infrastructure to write their request transformation middleware as they use to implement their serverless control plane.
We'll have to make some opinionated decisions here. This also potentially flies in the face of the World-In-A-Box Execution requirement (unless all modern FaaS providers provide local simulation environments; TBD).
As Functional As Possible
The Java world is littered with state. It is therefore littered with explicit state management, and thus state management bugs.
As much as possible, we strive to model computation as stateless transformations on simple data, rather than as stateful interaction models.
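A small illustration of the style, assuming requests are modeled as immutable records and middleware as pure functions that return new records. Request, add_header, strip_prefix, and apply_all are illustrative names, not an existing API.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Request:
    method: str
    path: str
    headers: tuple = ()  # tuple of (name, value) pairs, so the record stays immutable


def add_header(name, value):
    """Build a transform that returns a copy of the request with one more header."""
    def transform(request):
        return replace(request, headers=request.headers + ((name, value),))
    return transform


def strip_prefix(prefix):
    """Build a transform that removes a leading path prefix, if present."""
    def transform(request):
        if request.path.startswith(prefix):
            return replace(request, path=request.path[len(prefix):] or "/")
        return request
    return transform


def apply_all(request, transforms):
    """Fold the transforms over the request; the input is never mutated."""
    for transform in transforms:
        request = transform(request)
    return request
```

Since nothing is mutated, any intermediate request can be logged, replayed, or compared in a test without defensive copying.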
All components can run in a single executable binary, without IPC overhead. Among other reasons, this is incredibly useful for testing and simulation.
Entire system can run as a single executable binary. In particular, this requires that all boundary components communicate via well-defined code interfaces whose implementations can vary (e.g., no explicit coupling via REST API calls, etc).
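One way to read "well-defined code interfaces" is sketched below, assuming a hypothetical Backend interface with an in-process implementation. A networked implementation would satisfy the same interface but is not shown.

```python
from typing import Protocol


class Backend(Protocol):
    """The boundary: the gateway only ever sees this interface."""

    def handle(self, path: str) -> str: ...


class InProcessBackend:
    """Runs inside the same binary as the gateway, so no IPC is involved."""

    def __init__(self, routes):
        self._routes = routes

    def handle(self, path):
        return self._routes.get(path, "404")


class Gateway:
    def __init__(self, backends):
        self._backends = backends  # namespace -> Backend

    def dispatch(self, namespace, path):
        backend = self._backends.get(namespace)
        if backend is None:
            return "unknown namespace"
        return backend.handle(path)
```

The Gateway class has no idea whether a backend is local or remote, which is what lets the whole system collapse into one process for tests.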
Pervasive Virtual Time
North of the network stack, all time is simulated. All components requiring time use a centrally configurable clock, and use it to derive all timing information. This allows us to use logical clocks to simulate and test complicated race conditions.
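A minimal sketch of a simulated clock, assuming components schedule work through it instead of calling the system clock or sleeping. VirtualClock, call_later, and advance are hypothetical names.

```python
import heapq
import itertools


class VirtualClock:
    """Hypothetical simulated clock: time only moves when the test says so."""

    def __init__(self, start=0.0):
        self._now = start
        self._timers = []
        self._seq = itertools.count()  # tie-breaker keeps same-deadline timers FIFO

    def now(self):
        return self._now

    def call_later(self, delay, callback):
        heapq.heappush(self._timers, (self._now + delay, next(self._seq), callback))

    def advance(self, seconds):
        """Move time forward, firing due timers in deadline order."""
        deadline = self._now + seconds
        while self._timers and self._timers[0][0] <= deadline:
            when, _, callback = heapq.heappop(self._timers)
            self._now = when  # callbacks observe the time they were scheduled for
            callback()
        self._now = deadline


clock = VirtualClock()
events = []
clock.call_later(5, lambda: events.append(("timeout", clock.now())))
clock.call_later(2, lambda: events.append(("response", clock.now())))
clock.advance(10)  # the response at t=2 fires before the timeout at t=5
```

Swapping the two delays reproduces the opposite ordering deterministically, which is the kind of race that is painful to trigger against wall-clock time.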
Pervasive Virtual IO
Likewise, network and file IO should be done against code interfaces, rather than directly against network and file APIs.
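As a rough sketch for the file side, assume a hypothetical FileStore interface with one in-memory and one real implementation. load_route_table stands in for any gateway code that needs to read a file, and its "path service" line format is invented for the example.

```python
from typing import Protocol


class FileStore(Protocol):
    def read(self, path: str) -> bytes: ...
    def write(self, path: str, data: bytes) -> None: ...


class InMemoryFileStore:
    """Test and simulation implementation: nothing touches the disk."""

    def __init__(self):
        self._files = {}

    def read(self, path):
        return self._files[path]

    def write(self, path, data):
        self._files[path] = bytes(data)


class LocalFileStore:
    """Production-style implementation backed by the real file system."""

    def read(self, path):
        with open(path, "rb") as f:
            return f.read()

    def write(self, path, data):
        with open(path, "wb") as f:
            f.write(data)


def load_route_table(store, path):
    """Parse 'path service' lines; works against any FileStore."""
    lines = store.read(path).decode("utf-8").splitlines()
    return dict(line.split(" ", 1) for line in lines if line.strip())
```

The same pattern would apply to network IO, with a socket-like interface in place of FileStore.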
Network IO and File IO Operations Modeled as Commands
Not obvious. Wherever possible, model network and file IO operations as time-independent, location-parameterizable operations.
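One possible reading, sketched under the assumption that a "command" is a plain data record describing the IO, with a separate interpreter that actually performs it. ReadFile, HttpGet, plan_config_refresh, and FakeInterpreter are all hypothetical, as are the paths used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadFile:
    location: str  # which host or region to run against; a parameter, not baked in
    path: str


@dataclass(frozen=True)
class HttpGet:
    location: str
    path: str


def plan_config_refresh(location):
    """Pure function: decides what IO is needed but performs none of it."""
    return [ReadFile(location, "/etc/gateway/routes"), HttpGet(location, "/health")]


class FakeInterpreter:
    """Executes commands against canned data; a real one would do actual IO."""

    def __init__(self, files, responses):
        self._files = files
        self._responses = responses

    def run(self, command):
        if isinstance(command, ReadFile):
            return self._files[(command.location, command.path)]
        if isinstance(command, HttpGet):
            return self._responses[(command.location, command.path)]
        raise TypeError(f"unsupported command: {command!r}")
```

Because the plan is just data, it can be inspected, logged, retried, or run against a different location without changing the code that produced it.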
Many teams reside on the same logical "API Gateway". The gateway is the source of truth for "the publicly accessible surface area."
Pervasive, live metrics allow us to wire the API gateway to an appropriate system for processing, indexing, and search/visualization.
End-to-End Request Tracing
We should be able to trace a single request through all layers of the system, including plugin-provided middleware.
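A toy version of the idea, assuming every middleware (built-in or plugin-provided) receives the same trace object alongside the request. Trace, traced, and handle are made-up names; a real system would more likely propagate a standard trace context.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Trace:
    trace_id: str
    spans: list = field(default_factory=list)

    def record(self, layer, detail):
        self.spans.append((layer, detail))


def traced(name, middleware):
    """Wrap a middleware so its entry and exit are recorded on the trace."""
    def wrapper(request, trace):
        trace.record(name, "enter")
        try:
            return middleware(request, trace)
        finally:
            trace.record(name, "exit")  # recorded even if the middleware raises
    return wrapper


def handle(request, middlewares, trace=None):
    """Run the request through each layer, carrying one trace end to end."""
    if trace is None:
        trace = Trace(trace_id=uuid.uuid4().hex)
    for mw in middlewares:
        request = mw(request, trace)
    return request, trace
```

Wrapping plugins at registration time means plugin authors get tracing without having to opt in.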
Customer Request Isolation
An operator should be able to segment and directly observe specific requests from specific customers or clients, without the request needing to filter through the metric pipeline. This is important for live-site issues.
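A very rough sketch of such a tap, assuming requests carry a customer identifier and the operator toggles a watch list at runtime. IsolationTap and its methods are hypothetical.

```python
class IsolationTap:
    """Hypothetical tap: copies requests from watched customers to a side channel."""

    def __init__(self):
        self._watched = set()
        self.captured = []  # stand-in for whatever the operator is looking at

    def watch(self, customer_id):
        self._watched.add(customer_id)

    def unwatch(self, customer_id):
        self._watched.discard(customer_id)

    def observe(self, customer_id, request):
        """Sits in the request path; always passes the request through unchanged."""
        if customer_id in self._watched:
            self.captured.append((customer_id, request))
        return request
```

The capture happens inline, so the operator sees the request immediately rather than waiting for it to surface in aggregated metrics.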
Pervasive, Audit-Ready Event Logging
All systems dump to a virtual service interface that we can hook up to a SIEM.
Independently Deployable Applications/Namespaces
A single Gateway instance handles application namespaces potentially owned by many teams.
We should be able to stage, preview, test, and approve configuration and data-plane changes. This includes:
- New middleware
- New API schemas
- New live-load configuration (including support-created customer isolation requests)
The system needs to support multiple concurrent production versions. This is needed for segmented rollouts, gradual rollouts, etc.
We should be able to select specific sets of customers and quietly point them at new API versions for testing, request isolation, etc.
All traffic changes (whether full-fleet or segmented) should support gradual rollout with one-click rollback mechanisms.
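The segmented and gradual rollout requirements could be sketched as follows, assuming customers are assigned to one of 100 stable buckets by hashing their ID. bucket and choose_version are hypothetical helpers, and pinned represents the hand-picked customers quietly pointed at a specific version.

```python
import hashlib


def bucket(customer_id, buckets=100):
    """Deterministically map a customer to a bucket in [0, buckets)."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def choose_version(customer_id, stable, candidate, percent, pinned=None):
    """Pick the API version for a customer.

    pinned:  explicit customer -> version overrides (segmented rollout).
    percent: share of buckets, 0 to 100, sent to the candidate (gradual rollout).
    """
    pinned = pinned or {}
    if customer_id in pinned:
        return pinned[customer_id]
    return candidate if bucket(customer_id) < percent else stable
```

Raising percent only ever adds customers to the candidate, so nobody flaps between versions mid-rollout, and rollback amounts to setting percent back to 0.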