Cloud-Native API Gateway Brainstorming

What is this?

I'm using my blog to take notes for a project I'm noodling on. It's being done in the open, but isn't written for public consumption.

The Question

What does it mean to have a cloud-native, container-native, web-native API gateway?

Why This Question?

Every large company I've worked for has eventually needed to build an API gateway, whether for their own internal use, or as a customer-facing product. I've seen... five? ...different interpretations of an API gateway.

That seems like a lot of wasted engineering. Can we do better? What might an extensible API gateway platform enabling the following desiderata look like?


Native REST

Table stakes. Most UI clients and all non-UI clients expect REST as the primary control-plane API interface.

Native WebSocket Support

In particular, WebSocket support includes the ability for a stateless backend client to 'address' a message by clientId and have it arrive at the server holding that client's connection.

WebSockets and mailbox support are needed for GraphQL subscriptions.
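
A rough sketch of what addressing by clientId could look like, mostly to pin down the idea. Everything here is hypothetical: `GatewayNode`, `ConnectionRegistry`, and the in-memory dict are stand-ins, and a real registry would presumably live in shared storage rather than in one process.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class GatewayNode:
    """Hypothetical gateway server that holds live WebSocket connections."""
    name: str
    # client_id -> messages pushed to that client (stand-in for a real socket)
    delivered: Dict[str, List[str]] = field(default_factory=dict)

    def deliver(self, client_id: str, message: str) -> None:
        self.delivered.setdefault(client_id, []).append(message)


class ConnectionRegistry:
    """Hypothetical map from client_id to the node holding its connection."""

    def __init__(self) -> None:
        self._owner: Dict[str, GatewayNode] = {}

    def register(self, client_id: str, node: GatewayNode) -> None:
        # Called when a client connects to a given node.
        self._owner[client_id] = node

    def send(self, client_id: str, message: str) -> bool:
        # A stateless backend only needs the client_id, not the node.
        node = self._owner.get(client_id)
        if node is None:
            return False  # no live connection; a mailbox could queue it instead
        node.deliver(client_id, message)
        return True
```

The `False` branch is where mailbox support would come in: instead of dropping the message, queue it until the client reconnects.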

Streaming IO By Default

Data plane APIs push a lot of bytes. To effectively support these APIs while also allowing plugin-provided middleware, we need to be doing streaming IO by default. This increases the complexity of middleware APIs.
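
One possible shape for streaming middleware, sketched with Python generators. The assumption (mine, not a settled design) is that a middleware is just a function from a stream of byte chunks to a stream of byte chunks; `uppercase`, `limit_bytes`, and `compose` are illustrative names.

```python
from typing import Callable, Iterable, Iterator

Chunk = bytes
# Assumed middleware shape: consume a chunk stream, produce a chunk stream.
Middleware = Callable[[Iterable[Chunk]], Iterator[Chunk]]


def uppercase(chunks: Iterable[Chunk]) -> Iterator[Chunk]:
    """Transform each chunk as it passes through, never buffering the body."""
    for chunk in chunks:
        yield chunk.upper()


def limit_bytes(max_bytes: int) -> Middleware:
    """Build a middleware that truncates the stream after max_bytes."""
    def apply(chunks: Iterable[Chunk]) -> Iterator[Chunk]:
        remaining = max_bytes
        for chunk in chunks:
            if remaining <= 0:
                break
            piece = chunk[:remaining]
            remaining -= len(piece)
            yield piece
    return apply


def compose(*middlewares: Middleware) -> Middleware:
    """Chain middlewares left to right into a single middleware."""
    def apply(chunks: Iterable[Chunk]) -> Iterator[Chunk]:
        stream = chunks
        for middleware in middlewares:
            stream = middleware(stream)
        return iter(stream)
    return apply
```

This is also where the added complexity shows up: a plugin author can no longer assume the whole body is in hand, so anything like "inspect the full JSON payload" needs explicit buffering.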

Native GraphQL Support

Not obvious. We consider UI clients API clients. Increasingly, callers expect to have control over the shape of the data returned.

Especially useful: subscription and mutation support, which are hard to get right without the right underlying primitives (message routing, workflows, etc).

Native Workflow Support

Not obvious. Needed for asynchronous requests, scheduled requests, triggered requests. Especially useful for complex GraphQL Mutations that may need to span multiple backend systems.

Native Monotonically Increasing ID Generation

Not super obvious. TBD, since this quickly becomes a scaling bottleneck. Useful for easy coordination across systems. Might be needed as the basis for a distributed lease library (again, useful for GraphQL Mutations support) and for Workflow support.
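
A minimal single-process sketch of one common way to soften the bottleneck: reserve IDs in blocks so the shared counter is touched once per block rather than once per ID. `BlockIdAllocator` is a hypothetical name, and the "shared counter" here is just an in-memory integer standing in for whatever durable store would really back it.

```python
import threading


class BlockIdAllocator:
    """Hands out strictly increasing IDs, reserving them in blocks."""

    def __init__(self, block_size: int = 100) -> None:
        self._lock = threading.Lock()
        self._block_size = block_size
        self._next_block_start = 0  # stand-in for a durable shared counter
        self._next_id = 0
        self._block_end = 0
        self.reservations = 0  # how many times the shared counter was touched

    def _reserve_block(self) -> None:
        # In a real system this would be a single write to the shared counter.
        self._next_id = self._next_block_start
        self._block_end = self._next_id + self._block_size
        self._next_block_start = self._block_end
        self.reservations += 1

    def next_id(self) -> int:
        with self._lock:
            if self._next_id >= self._block_end:
                self._reserve_block()
            value = self._next_id
            self._next_id += 1
            return value
```

Caveat: with several allocators each holding their own block, IDs stay unique but are only monotonic per allocator, not globally. Whether that is good enough for leases and workflows is part of the TBD.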

Native Functions Support

A cloud-native API gateway should support cloud-native primitives. There's increasing demand for true serverless development.

At minimum, this means that developers should be able to use the same Functions-as-a-Service tooling and infrastructure to write their request transformation middleware as they use to implement their serverless control plane.

We'll have to make some opinionated decisions here. This also potentially flies in the face of the World-In-A-Box Execution requirement (unless all modern FaaS providers provide local simulation environments; TBD).

Development Characteristics

As Functional As Possible

The Java world is littered with state. It is therefore littered with explicit state management, and thus state management bugs.

As much as possible, we strive to model computation as stateless transformations on simple data, rather than as stateful interaction models.

World-In-A-Box Execution

All components can run in a single executable binary, without IPC overhead. Among other reasons, this is incredibly useful for testing and simulation.

Testing Characteristics

World-In-A-Box Testing

Entire system can run as a single executable binary. In particular, this requires that all boundary components communicate via well-defined code interfaces whose implementations can vary (e.g., no explicit coupling via REST API calls, etc).
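
To make "communicate via well-defined code interfaces" concrete, here is a small sketch. The names (`UserDirectory`, `InProcessUserDirectory`, `GreetingHandler`) are made up for illustration; the point is only that the handler depends on an interface, so an in-process implementation and a remote one are interchangeable.

```python
from typing import Dict, Protocol


class UserDirectory(Protocol):
    """Boundary interface; implementations may be in-process or remote."""

    def lookup(self, user_id: str) -> str: ...


class InProcessUserDirectory:
    """World-in-a-box implementation: a plain dict, no network hop."""

    def __init__(self, users: Dict[str, str]) -> None:
        self._users = users

    def lookup(self, user_id: str) -> str:
        return self._users.get(user_id, "unknown")


class GreetingHandler:
    """Depends only on the interface, never on a REST client directly."""

    def __init__(self, directory: UserDirectory) -> None:
        self._directory = directory

    def handle(self, user_id: str) -> str:
        return f"hello, {self._directory.lookup(user_id)}"
```

A production build would presumably swap in an implementation that makes the remote call behind the same `lookup` method.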

Pervasive Virtual Time

North of the network stack, all time is simulated. All components requiring time use a centrally configurable clock, and use it to derive all timing information. This allows us to use logical clocks to simulate and test complicated race conditions.
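
A sketch of what a centrally configurable clock might look like in tests. `VirtualClock` and its methods are hypothetical; the assumption is that components ask the clock for the time and for timers instead of calling the system clock or sleeping.

```python
import heapq
from typing import Callable, List, Tuple


class VirtualClock:
    """Simulated clock: time moves only when a test calls advance()."""

    def __init__(self, start: float = 0.0) -> None:
        self._now = start
        self._seq = 0  # tie-breaker so equal deadlines fire in insertion order
        self._timers: List[Tuple[float, int, Callable[[], None]]] = []

    def now(self) -> float:
        return self._now

    def call_later(self, delay: float, callback: Callable[[], None]) -> None:
        heapq.heappush(self._timers, (self._now + delay, self._seq, callback))
        self._seq += 1

    def advance(self, seconds: float) -> None:
        """Move time forward, firing due timers in deadline order."""
        deadline = self._now + seconds
        while self._timers and self._timers[0][0] <= deadline:
            when, _, callback = heapq.heappop(self._timers)
            self._now = when  # callbacks observe the time they were due
            callback()
        self._now = deadline
```

With this, a test can schedule two competing timeouts and step through them deterministically, which is what makes race conditions reproducible.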

Pervasive Virtual IO

Likewise, network and file IO should be done against code interfaces, rather than directly against network and file APIs.

Network and File IO Operations Modeled as Commands

Not obvious. Wherever possible, model network and file IO operations as time-independent, location-parameterizable operations.
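
One way to read "modeled as commands", sketched below: business logic returns plain data describing the IO it wants, and a separate interpreter performs it. All names (`WriteFile`, `HttpPost`, `plan_audit_export`, `RecordingInterpreter`) and the `mem://` locations are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass(frozen=True)
class WriteFile:
    location: str  # where to write; supplied by the caller, not hard-coded
    data: bytes


@dataclass(frozen=True)
class HttpPost:
    location: str
    body: bytes


Command = Union[WriteFile, HttpPost]


def plan_audit_export(record: bytes, archive: str, sink: str) -> List[Command]:
    """Pure function: decides what IO should happen without performing it."""
    return [WriteFile(archive, record), HttpPost(sink, record)]


class RecordingInterpreter:
    """Test interpreter: records effects instead of touching disk or network."""

    def __init__(self) -> None:
        self.files: Dict[str, bytes] = {}
        self.posts: List[HttpPost] = []

    def run(self, commands: List[Command]) -> None:
        for command in commands:
            if isinstance(command, WriteFile):
                self.files[command.location] = command.data
            elif isinstance(command, HttpPost):
                self.posts.append(command)
            else:
                raise TypeError(f"unknown command: {command!r}")
```

Because the commands are just values, they can be compared in tests, logged, replayed, or handed to a different interpreter that does the real IO.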

Operational Characteristics

Multitenant System

Many teams reside on the same logical "API Gateway". The gateway is the source of truth for "the publicly accessible surface area."

Live Metrics

Pervasive, live metrics allow us to wire the API gateway to an appropriate system for processing, indexing, and search/visualization.

End-to-End Request Tracing

We should be able to trace a single request through all layers of the system, including plugin-provided middleware.

Customer Request Isolation

An operator should be able to segment and directly observe specific requests from specific customers/clients, without the requests needing to filter through the metric pipeline. This is important for live-site issues.

Pervasive, Audit-Ready Event Logging

All systems dump to a virtual service interface that we can hook up to a SIEM.

Deployment Characteristics

Independently Deployable Applications/Namespaces

A single Gateway instance handles application namespaces potentially owned by many teams.

Change-Managed Deployments

We should be able to stage, preview, test, and approve configuration and data-plane changes. This includes:

  • New middleware
  • New API schemas
  • New live-load configuration (including support-created customer isolation requests)

Red/Green Deployments

The system needs to support multiple concurrent production versions. This is needed for segmented rollouts, gradual rollouts, etc.

Segmented Rollouts

We should be able to select specific sets of customers and quietly point them at new API versions for testing, request isolation, etc.

Gradual Rollouts

All traffic changes (whether full-fleet or segmented) should support gradual rollout with one-click rollback mechanisms.
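
A sketch of how segmented and gradual rollouts could share one routing decision. This is an assumption about mechanism, not a design: customers are hashed into 100 stable buckets, a percentage dial admits buckets to the new version, and an explicit pin list covers the segmented case. `RolloutPolicy`, `bucket`, and `choose_version` are hypothetical names.

```python
import hashlib
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class RolloutPolicy:
    new_version: str
    old_version: str
    percent: int = 0  # share of customers on new_version, 0..100
    pinned_customers: FrozenSet[str] = frozenset()  # segmented rollout


def bucket(customer_id: str) -> int:
    """Stable bucket in [0, 100) so a customer does not flap between versions."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def choose_version(policy: RolloutPolicy, customer_id: str) -> str:
    if customer_id in policy.pinned_customers:
        return policy.new_version
    if bucket(customer_id) < policy.percent:
        return policy.new_version
    return policy.old_version
```

Raising `percent` only ever adds customers to the new version, and rollback amounts to publishing a policy with `percent=0` and an empty pin list.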

About Mason Graye

Mase writes code, reads books, and lifts weights. And that's about it.
  • Seattle, WA