Securing Supergraphs
Best practices for securing federated GraphQL APIs
Supergraphs benefit from the same standard methods you use to reduce the attack surface of any API. The OWASP Top Ten list provides a helpful summary of some of the most common risks.
In addition, there are GraphQL-specific actions you must take to limit your supergraph's exposure to potential threats. These threats are mostly related to denial-of-service (DoS) attacks, and they fall under the categories of API discoverability and malicious queries.
API discoverability
One of the most important ways to protect a GraphQL API against would-be attackers is to limit the API's discoverability in production. Although the inherent discoverability of a GraphQL API enhances developer experience when working locally, we typically don't want to offer these same capabilities in a production environment for non-public APIs. The following sections explore some of the key ways to limit API discoverability.
Turn off introspection in production
GraphQL's built-in introspection query is the fastest way for bad actors to learn about your schema. To prevent this, turn off introspection in your production supergraph, and also limit access to any staging environments where introspection is enabled.
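If your subgraphs are built with Apollo Server, a minimal sketch might look like the following (assuming Apollo Server 4 and an environment-variable check; the schema is a placeholder):

import { ApolloServer } from '@apollo/server';
import { startStandaloneServer } from '@apollo/server/standalone';

// Placeholder schema for illustration only.
const typeDefs = `#graphql
  type Query {
    health: String
  }
`;
const resolvers = { Query: { health: () => 'ok' } };

const server = new ApolloServer({
  typeDefs,
  resolvers,
  // Allow introspection during local development, never in production.
  introspection: process.env.NODE_ENV !== 'production',
});

await startStandaloneServer(server, { listen: { port: 4001 } });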
ⓘ NOTE
When using the Apollo Router, introspection is turned off by default.
This video about GraphQL API abuse provides a deep dive into how introspection can facilitate API exploitation and why it's important to layer additional security measures on top of turning off introspection to limit discoverability.
Obfuscate error details in production
Many GraphQL servers improve their developer experience by providing detailed error information in operation responses when something goes wrong. In your production supergraph, make sure that verbose error details are removed from API responses.
For example, by default an Apollo-Server-based subgraph provides the exception.stacktrace property under the errors key in a response. This value is useful while developing and debugging your server, but you should not expose stack trace details to public-facing clients.
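As one possible approach for an Apollo Server subgraph, the formatError hook can redact debug-only details before a response leaves the server (a sketch; the exact extension keys depend on your server version):

import { ApolloServer } from '@apollo/server';

const server = new ApolloServer({
  typeDefs: `#graphql
    type Query {
      hello: String
    }
  `,
  resolvers: { Query: { hello: () => 'world' } },
  formatError: (formattedError) => {
    if (process.env.NODE_ENV !== 'production') return formattedError;
    // Drop stack traces and other debug-only details in production.
    const extensions: Record<string, unknown> = { ...formattedError.extensions };
    delete extensions.stacktrace; // Apollo Server 4 location
    delete extensions.exception;  // Apollo Server 3 location (exception.stacktrace)
    return { ...formattedError, extensions };
  },
});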
ⓘ NOTE
The Apollo Router omits all error data by default.
You might want to selectively expose error details to clients in your production environment. You can do this with a combination of the Apollo Router's include_subgraph_errors option and Rhai scripts for response manipulation.
Avoid autogenerating schemas
Another strategy for reducing your GraphQL API's discoverability is to avoid autogenerating schemas whenever possible, especially the fields on the root operation types. There are many developer tools that enable you to autogenerate a GraphQL schema based on a set of initial object type definitions in a schema or some existing database tables. Although these approaches to schema generation can speed up initial API development, they also make it easier for bad actors to guess what kind of generic CRUD-related fields were autogenerated based on commonly used patterns.
An autogenerated schema also increases the risk that you expose sensitive data unintentionally via your GraphQL API. And as a schema design best practice, you should design your schema deliberately to serve client use cases and product requirements. This helps your entire organization get the most out of your supergraph.
Only allow the router to query subgraphs directly
As a best practice for every supergraph, only the supergraph's router should query individual subgraphs directly. The Apollo Federation subgraph specification outlines that each subgraph schema includes _entities and _service root fields on the Query type to assist with composition and query planning. These fields expose the subgraph to additional security concerns if accessed directly by a client.
The Query._service object includes an sdl field, which contains the full SDL representation of the subgraph's schema. This field exposes as much data about a subgraph's schema as a standard introspection query, which means it should not be accessible in production.
The Query._entities field enables the router to resolve fields of any type that's marked with @key by providing the id for that entity. If this field is exposed publicly, it means that any client can circumvent internal resolver logic and fetch any entity data by mimicking the router. Because the resolver for _entities is automatically provided by your subgraph library, you can't modify that logic either. That means you would have to manually check all operations that include the _entities field and block any malicious queries.
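To make the risk concrete, here's a hypothetical request that a client could send straight to a subgraph to mimic the router's entity fetches (the subgraph URL and the User entity are assumptions for illustration):

// Hypothetical direct-to-subgraph request that mimics the router.
const response = await fetch('http://users-subgraph.internal:4001/', {
  method: 'POST',
  headers: { 'content-type': 'application/json' },
  body: JSON.stringify({
    query: `
      query ($representations: [_Any!]!) {
        _entities(representations: $representations) {
          ... on User {
            id
            email
          }
        }
      }
    `,
    // Any client that can guess a key can fetch the entity's fields.
    variables: { representations: [{ __typename: 'User', id: '1' }] },
  }),
});
console.log(await response.json());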
Another reason to restrict access to subgraphs is related to the collection of field-level traces. This tracing data is included in the extensions key of the response from a subgraph to the router, where the data is aggregated into a trace shape based on the query plan and then sent to GraphOS. That means that any client that can query your subgraphs directly can view this data in the operation response and make inferences about a subgraph based on it.
In addition to the above security concerns, preventing the outside world from accessing subgraphs directly also helps you ensure that clients (even well-meaning ones) route operations to the consolidated graph only and don't invent unintended use cases for the types and fields in a subgraph schema that are solely meant for executing the router's query plan.
Malicious queries
After you implement measures to limit API discoverability in public-facing environments, the next step in protecting a GraphQL API is to guard it against both intentionally and unintentionally malicious queries. Again, many GraphQL-related vulnerabilities have to do with how an unprotected API may be exploited in DoS attacks, but there are other considerations as well.
In the sections that follow, we will explore measures that can help mitigate the impact of malicious queries for any GraphQL API: limiting query depth and breadth, paginating list fields where appropriate, validating and sanitizing data, setting timeouts, enforcing authentication and authorization in the router, and guarding against batched query abuse. For GraphQL APIs with third-party clients, we will also explore using query cost analysis to support rate limiting.
Limit query depth
GraphQL enables clients to traverse through a graph and express complex relationships between the nodes in an operation's selection set. But as far as backing data sources are concerned, this can quickly turn into too much of a good thing when there are no guardrails to restrict how deeply queries can be nested. For example:
query DeepBlogQuery {
  author(id: 42) {
    posts {
      author {
        posts {
          author {
            posts {
              author {
                # and so on...
              }
            }
          }
        }
      }
    }
  }
}
One of the most straightforward protections against deeply nested queries such as this one is to set a maximum query depth. And because an operation can specify multiple root fields, you may also consider limiting query breadth at the root level as well.
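For example, with an Apollo-Server-based server you might enforce a maximum depth with the graphql-depth-limit package (a sketch; the limit of 6 is an arbitrary value to tune for your schema):

import { ApolloServer } from '@apollo/server';
import depthLimit from 'graphql-depth-limit';

const server = new ApolloServer({
  typeDefs: `#graphql
    type Query {
      hello: String
    }
  `,
  resolvers: { Query: { hello: () => 'world' } },
  // Reject operations nested more than 6 levels deep during validation,
  // before any resolvers run.
  validationRules: [depthLimit(6)],
});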
Paginate fields where appropriate
Paginating fields is another important mechanism to control how many items a client can request at once. For example, a Posts subgraph service might have no problem resolving a thousand total Post objects in this request:
query {
  authors(first: 10) {
    name
    posts(last: 100) {
      title
      content
    }
  }
}
What will happen when the orders of magnitude increase for each field argument, and a hundred-thousand Post objects are requested?
query {
  authors(first: 100) {
    name
    posts(last: 1000) {
      title
      content
    }
  }
}
When paginating fields, it's important to set a maximum number of items that can be returned in a single response. In the example above, you might want to return a GraphQL error when executing the posts field resolver instead of attempting to return a thousand posts for each of the hundred authors.
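A sketch of such a guardrail in a posts resolver (the 100-item cap and the fetchPostsForAuthor helper are assumptions):

import { GraphQLError } from 'graphql';

const MAX_PAGE_SIZE = 100;

// Hypothetical data-source call, implemented elsewhere.
declare function fetchPostsForAuthor(
  authorId: string,
  last: number,
): Promise<unknown[]>;

const resolvers = {
  Author: {
    posts: (author: { id: string }, { last }: { last: number }) => {
      if (last > MAX_PAGE_SIZE) {
        // Fail fast instead of attempting an unbounded fetch.
        throw new GraphQLError(
          `Cannot request more than ${MAX_PAGE_SIZE} posts at once.`,
          { extensions: { code: 'BAD_USER_INPUT' } },
        );
      }
      return fetchPostsForAuthor(author.id, last);
    },
  },
};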
Validate and sanitize data
Validating and sanitizing client-submitted data is important for any API, and a supergraph is no exception. In general, the usual rules for validation and sanitization of untrusted inputs apply to GraphQL when resolving fields based on user-provided inputs. And as previously discussed, when users supply invalid values as operation arguments, the resulting errors should provide as few details as possible in production environments.
A well-designed GraphQL schema can also help guard against injection attacks by codifying validation and sanitization directly into types. For example, enum values can limit the range of what can be submitted for argument values, and custom scalars or directives can also help to validate, escape, or normalize values. However, custom scalars should be handled with care, because misusing them might create other vulnerabilities, such as a JSON scalar type enabling a NoSQL injection attack.
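For instance, an enum can constrain an argument that might otherwise flow into a database query as an arbitrary string (a sketch; the field and values are assumptions):

const typeDefs = `#graphql
  enum PostSort {
    NEWEST_FIRST
    OLDEST_FIRST
  }

  type Post {
    title: String
    content: String
  }

  type Query {
    # Validation rejects any value other than the two enum cases,
    # so nothing user-controlled reaches the underlying data source.
    posts(orderBy: PostSort = NEWEST_FIRST): [Post]
  }
`;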
Set timeouts
Timeouts are another useful tool for stopping GraphQL operations that consume more server resources than expected. In a supergraph, timeouts are commonly applied at any combination of three different levels:
- At the highest level, you can set a timeout on the router's HTTP server (or an idle timeout on a load balancer in front of it).
- At an intermediate level, you can set a timeout on the router's requests to individual subgraphs. You can configure timeouts at both the HTTP and subgraph levels using the Apollo Router's traffic shaping configuration.
- At the most granular level, subgraphs can set a timeout for individual operations. The duration of the request can be checked against this timeout as each field resolver function is called. You might accomplish this using resolver middleware or an Apollo Server plugin in a subgraph, as sketched below.
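A sketch of the most granular approach, storing a deadline in the per-request context and checking it in resolvers (the timeout value and the fetchPosts helper are assumptions):

import { GraphQLError } from 'graphql';

const OPERATION_TIMEOUT_MS = 5_000;

interface RequestContext {
  deadline: number;
}

// Build the context when each request starts.
const makeContext = async (): Promise<RequestContext> => ({
  deadline: Date.now() + OPERATION_TIMEOUT_MS,
});

// Call this at the top of resolvers (or from shared middleware).
function checkDeadline(ctx: RequestContext): void {
  if (Date.now() > ctx.deadline) {
    throw new GraphQLError('Operation exceeded its time budget.', {
      extensions: { code: 'OPERATION_TIMEOUT' },
    });
  }
}

// Hypothetical data-source call.
declare function fetchPosts(): Promise<unknown[]>;

const resolvers = {
  Query: {
    posts: async (_: unknown, __: unknown, ctx: RequestContext) => {
      checkDeadline(ctx);
      return fetchPosts();
    },
  },
};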
Use rate limiting as needed
Particularly for GraphQL APIs that are consumed by third-party clients, depth and breadth limiting and paginated fields may not provide enough demand control. For these cases, rate-limiting API requests may be warranted. Enforcing rate limits for a GraphQL API is more complicated than for a REST API because GraphQL operations can vary widely in size and complexity, so the rate limit can't be based on the number of requests alone. Instead, we have to think about how much of the graph an operation may traverse in the context of a single request.
There's no one-size-fits-all approach to implementing rate limits for a GraphQL API. For example, the GitHub GraphQL API sets a maximum node limit, along with a point score based on the field connections in a query. It then counts this score against a maximum of 5,000 points per hour.
The Shopify API, on the other hand, assigns different point values to various types and connection fields (also considering the number of items returned by the connection field), while assigning mutation operations a higher value due to the server resources they typically consume. They then use a leaky bucket algorithm that allocates 50 points per second (up to a maximum of 1,000 points) to accommodate sudden bursts in API traffic from a client.
Both the GitHub and Shopify rate-limiting approaches concern the complex topic of query cost analysis (also known as query complexity analysis). As we can see from these examples, assigning a "cost" to a query is complicated and nuanced, and it should be done in a way that suits the API in question. There are several query cost-related packages on npm that can be added to a GraphQL server, but before using any of them, make sure that the assumptions these libraries make on your behalf hold true for your API.
For example, you might want to set fixed costs for different kinds of nodes, or you might manually set costs on a per-type or per-field basis by annotating them with directives (or do some combination of both). You might also have different considerations for how type complexity (a cost based on the number and kinds of fields requested) and response complexity (the cost of providing responses for the requested fields) are handled. Or for a completely different approach that doesn't explicitly count types and fields, you could set and iterate query costs based on field tracing data and set a maximum time budget per query.
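The directive-based approach might look like the following sketch, which uses a hypothetical @cost directive in the spirit of IBM's specification (the directive, names, and values are assumptions for illustration):

const typeDefs = `#graphql
  directive @cost(value: Int!, multipliers: [String!]) on FIELD_DEFINITION

  type Comment {
    body: String
  }

  type Post {
    title: String
    # Cost scales with the requested page size.
    comments(first: Int = 10): [Comment] @cost(value: 1, multipliers: ["first"])
  }

  type Query {
    posts(first: Int = 10): [Post] @cost(value: 1, multipliers: ["first"])
  }
`;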
Given the potential scope of developing a bespoke query cost analysis solution, you should first verify that your API actually needs one. For graphs that are consumed by first-party clients only, other demand control mechanisms might suffice. If you do need to add comprehensive query cost analysis to your GraphQL API, then the work that IBM has done in this area to develop the GraphQL Cost Directives specification may be instructive. Their work in this area was originally published in a paper (with a supplemental video to highlight some of the key concepts) and is further explored in this series of blog posts.
Authentication and authorization in the router
Enforcing authentication and authorization in the router protects your underlying APIs from malicious queries. Dropping unauthenticated, unauthorized queries at the entry point of your supergraph frees up your downstream graphs to process only valid requests, thereby reducing load and enhancing performance.
Hardening access to your supergraph at the router also adds another layer of security when implementing zero-trust and defense-in-depth strategies. The router centralizes authentication and authorization logic, which downstream services can reinforce with their own checks.
To enforce authentication and authorization in the Apollo Router:
- Enable JSON Web Token (JWT) authentication.
- Control access to fields and types with authorization directives.
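As one form of the reinforcing downstream checks mentioned above, a subgraph might re-verify credentials when building its per-request context (a sketch using the jsonwebtoken package; the header format, secret source, and claim names are assumptions):

import jwt from 'jsonwebtoken';
import { GraphQLError } from 'graphql';
import type { IncomingMessage } from 'http';

interface AuthContext {
  userId: string;
}

// Re-verify the JWT that the router already validated upstream.
async function makeContext({ req }: { req: IncomingMessage }): Promise<AuthContext> {
  const token = req.headers.authorization?.replace('Bearer ', '');
  if (!token) {
    throw new GraphQLError('Missing credentials.', {
      extensions: { code: 'UNAUTHENTICATED' },
    });
  }
  const claims = jwt.verify(token, process.env.JWT_SECRET!) as { sub: string };
  return { userId: claims.sub };
}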
Batched requests
Batched requests are another potential attack vector for malicious queries. There are two different flavors of batching attacks to consider. The first threat is related to GraphQL's inherent ability to "batch" requests by allowing multiple root fields in an operation document:
query {
  astronaut(id: "1") {
    name
  }
  second: astronaut(id: "2") {
    name
  }
  third: astronaut(id: "3") {
    name
  }
}
Without any restrictions in place, clients could effectively enumerate through all nodes in a single request like the one above while slipping past other brute force protections. Limiting query breadth or using query cost analysis can help protect a GraphQL API from this type of abuse.
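For instance, a custom validation rule can cap the number of root-level selections, aliases included (a sketch; the limit of 10 is an arbitrary assumption):

import { GraphQLError } from 'graphql';
import type { ASTVisitor, ValidationContext } from 'graphql';

const MAX_ROOT_FIELDS = 10;

// Counts top-level selections in each operation, so aliased duplicates
// of the same root field still count toward the limit.
function maxRootFieldsRule(context: ValidationContext): ASTVisitor {
  return {
    OperationDefinition(node) {
      const count = node.selectionSet.selections.length;
      if (count > MAX_ROOT_FIELDS) {
        context.reportError(
          new GraphQLError(
            `Operations may select at most ${MAX_ROOT_FIELDS} root fields.`,
          ),
        );
      }
    },
  };
}

// Usage: pass it alongside your server's other validation rules, such as
// validationRules: [maxRootFieldsRule] in Apollo Server.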
Another form of batching occurs when a client sends batches of full operations in a single request, which can be helpful for performance reasons in some scenarios. The Apollo Router does not support this form of batching. When operations are batched, Apollo Server receives an array of operations and sends back an array of responses to be parsed by the client (there's a batch link directly available in Apollo Client to facilitate this):
[
  {
    "operationName": "FirstAstronaut",
    "variables": {},
    "query": "query FirstAstronaut {\n astronaut(id: \"1\") {\n name\n }\n}\n"
  },
  {
    "operationName": "SecondAstronaut",
    "variables": {},
    "query": "query SecondAstronaut {\n astronaut(id: \"2\") {\n name\n }\n}\n"
  },
  {
    "operationName": "ThirdAstronaut",
    "variables": {},
    "query": "query ThirdAstronaut {\n astronaut(id: \"3\") {\n name\n }\n}\n"
  }
]
With batched operations, it's important to consider how an entire batch might impact rate limit calculations and query cost analysis to ensure that clients can't cheat rate limits through race conditions.
Finally, beyond batching of fields and operations, some forms of GraphQL-related batching can help mitigate DoS attacks and generally make your API more performant overall. Even with depth limiting in place, GraphQL queries can easily lead to exponential growth of requests to backing data sources. DataLoaders are one way to help make as few requests as possible to backing data sources from resolver functions within the context of a single operation.
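A sketch of the pattern with the dataloader package (fetchAuthorsByIds is an assumed batched data-source call; in practice you'd create the loader per request, typically in your context function, so its cache doesn't leak across users):

import DataLoader from 'dataloader';

// Hypothetical batched data-source call: one query for many IDs.
declare function fetchAuthorsByIds(
  ids: readonly string[],
): Promise<{ id: string; name: string }[]>;

const authorLoader = new DataLoader(async (ids: readonly string[]) => {
  const authors = await fetchAuthorsByIds(ids);
  const byId = new Map(authors.map((author) => [author.id, author]));
  // DataLoader requires results in the same order as the requested keys.
  return ids.map((id) => byId.get(id) ?? null);
});

const resolvers = {
  Post: {
    // Many posts resolving their author in one operation produce a
    // single batched fetch instead of one query per post.
    author: (post: { authorId: string }) => authorLoader.load(post.authorId),
  },
};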
Security with managed federation
Apart from protecting your GraphQL API from bad actors and locking down private data, you also need a window into how your API is being used (and by whom) to harden your GraphQL security posture. This is where a schema registry and observability tooling (such as those provided by GraphOS) come into play to help control who makes changes to your API and also monitor API usage and send alerts when something isn't right.
Know who's using your graph (and how)
To enhance the utility of traces collected in your observability tooling, it's a best practice to require every client to identify itself and assign a name to every operation it executes. The web and mobile versions of Apollo Client provide straightforward APIs for setting custom headers for a client's name and version. These help you segment traces and metrics in GraphOS by client. Other API clients can set the apollographql-client-name and apollographql-client-version request headers manually to provide client awareness. (As a bonus, client awareness also helps you identify which clients might be impacted by a proposed breaking change to your API when running schema checks.)
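For example, with Apollo Client for the web (the endpoint, name, and version values are placeholders):

import { ApolloClient, InMemoryCache } from '@apollo/client';

const client = new ApolloClient({
  uri: 'https://graph.example.com/graphql', // placeholder endpoint
  cache: new InMemoryCache(),
  // Sent as the apollographql-client-name and
  // apollographql-client-version headers on every request.
  name: 'web-storefront',
  version: '1.4.0',
});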
Additionally, tracing data in GraphOS can help you monitor API performance and errors. You can configure alerts to push notifications to your team when something goes wrong, whether it's an increase in requests per minute, changes in your p50, p95, or p99 response times, or errors in operations run against your graph. For example, a notification about a sudden increase in the error percentage might indicate that a bad actor is trying to circumvent introspection that's been turned off and learn about a graph's schema by rapidly guessing and testing different field names. And if you want to leverage error data outside of GraphOS as well, you can also use the Apollo Router's support for OpenTelemetry to integrate with other APM tools.
Restrict write access to your graph
You should manage internal access to your supergraph as thoughtfully as you manage communication from external clients. GraphOS provides both graph API keys and personal API keys to restrict access to the graphs within an organization. It also supports SSO integration and different member roles so that team members can be assigned appropriate permissions when contributing to the graph.
Beyond member roles, GraphOS also allows certain variants to be designated as protected variants to further restrict who can make changes to their schemas, which is especially important in production environments.
Additional resources
For further reading on GraphQL-related security concerns, the OWASP GraphQL Cheat Sheet is an excellent resource to help you review the security posture of your graph.