
The Hunt for Prod Bugtober: Apollo GraphQL Edition

Andrew Kallem
Nov 15, 2022

Could there really be a prod-only bug for over 6 months in one of the most popular GraphQL client libraries...?

That was the question on my mind as I debugged an inexplicable regression in Constructor. It turned out the answer was yes.

This is the play-by-play of how I uncovered a material bug in Apollo Client that caused it to return corrupt data to callers. And it only happened in production.

Along the way I'll touch on topics in GraphQL, deploy previews, cache normalization, optimistic UI, memoization, git, and debugging. Even if you're well-versed in all of those, you may find this story interesting because it has some surprising twists.

So that you can experience this story as I did, the rest of this post is written in the present tense.


Bug Bite

As I watch a session recording of one of our users, I'm alarmed and perplexed by an inexplicable bug: after he deletes a task in a checklist, it reappears just a few moments later! Worse, as the user makes some additional edits, a different task is displayed twice, once checked off and once not! How could the same task be in two different states?

The Hunt Begins

I check the database. The deleted task is definitely deleted, the checked off task is definitely marked as complete, and there is definitely only one record for the task being displayed twice. I shift my attention to the frontend codebase.

We use LogRocket for session recordings. It's a brilliant tool: not only can you watch a session recording, you can traverse a complete snapshot of the DOM at any point.

I pause the recording at a moment when the deleted task is still displayed and that other task is displayed twice. Sure enough, the DOM is consistent with what's being shown.

I consider a few hypotheses:

  • Is this purely an artifact in LogRocket, and the user didn't actually experience what I'm seeing in the session recording? No. In the recording the user tries to delete the same task again. They must have seen it reappear, just as I saw in the recording.
  • Does the user have a browser extension installed that manipulates the DOM and is acting erroneously? It seems unlikely, but I can't rule it out.
  • Our team recently released several significant improvements to Constructor's checklists. Did we introduce a bug? Maybe. But surely we couldn't have missed a regression this obvious during QA?

Conclusion: None of these hypotheses seem especially likely. The best path forward is to try to reproduce the bug.

Nothing Works

I spin up a local development instance of Constructor and play around with checklists. I try all sorts of things. Nothing I do triggers the bug. Not even when I replicate the exact state and sequence of events in the recording.

I begin to suspect that what I saw in the recording is the fault of a rogue browser extension or other quirk on that particular user's system.

There's not much else I can do without a reproduction. But I'm reassured: if I can't reproduce it, the bug must be very rare, so it's unlikely to affect many other users.

Conclusion: I drop it and move on to other work.

Not Staying Local

Fast forward a week. I'm using Constructor, as I do every day, and I happen to delete a task. It reappears! 😲

I treat that browser tab like a crime scene. I don't know whether I'll be able to trigger the bug again, so I want to preserve as much state as possible in the tab where it occurred. I open a separate tab and try deleting another task. It happens again! I try once more, and again, I see the bug. After a bit more experimentation, I'm also able to trigger a state where the same task appears twice, just like in the session recording.

Apparently, it's extremely easy to trigger the bug! Or rather, it's extremely easy to trigger it only in production. 😱

(As for how we missed the bug in QA and our team's day-to-day use of Constructor... I'll come back to that later. For now, my concern is understanding and fixing the bug, or finding a cheap mitigation. Simply rolling back the recent release isn't a practical option, for reasons beyond the scope of this post.)

Conclusion: The bug is real, and it occurs only in production.

Sneak Preview

Constructor is a single-page React app, and we generate a Netlify deploy preview for all PRs. These aren't full-stack ephemeral environments, but they're still great for previewing frontend changes on production data before the changes land. Since I still can't reproduce the bug locally, my best option for debugging is to use these preview deploys.

To begin, I open a PR with a console log that dumps out the data being passed to the checklist React component. When I trigger the bug — deleting a task in the preview and seeing it reappear shortly after, or making an edit that causes a task to be displayed twice — the logs reveal that the component is being passed bad data, data that for some reason includes the deleted task and that lists the other task twice. I check the network traffic in the browser developer tools. There's no problem there; the backend responded with valid data that was consistent with the database.

Conclusion: The bug is a purely frontend issue.

Can I Check Your id?

Constructor's frontend uses Apollo Client to interact with our backend GraphQL API and to provide caching and state management for the frontend. The bug presents in a way that reminds me of something I'd previously encountered with Apollo's normalized cache. Could it be happening again?

(If you're already familiar with GraphQL and Apollo Client's normalized cache, feel free to skip this section.)

What is a normalized cache? Put simply, and ignoring some of the finer details, each entity retrieved from the backend GraphQL API is stored in a frontend cache once under a standard key. By default the entity's cache key is its GraphQL __typename concatenated with its id field, like ticket:100. If that entity refers to some other entity, the relationship is stored as a reference, not as a duplicate copy of that other entity.

For example, let's say the frontend issued a GraphQL query like this:

query {
    ticket_by_id(id: 100) {
        id
        title

        checklists {
            id
            title

            tasks {
                id
                title
                is_complete
                is_deleted
            }
        }
    }
}

The data portion of the response might look like this:

{
    "ticket_by_id": {
        "__typename": "ticket",
        "id": 100,
        "title": "The ticket",
        "checklists": [
            {
                "__typename": "checklist",
                "id": 500,
                "title": "A checklist",
                "tasks": [
                    {
                        "__typename": "task",
                        "id": 2000,
                        "title": "Foo",
                        "is_complete": true,
                        "is_deleted": false
                    },
                    {
                        "__typename": "task",
                        "id": 2001,
                        "title": "Bar",
                        "is_complete": false,
                        "is_deleted": false
                    }
                ]
            }
        ]
    }
}

(If you're wondering why the response objects include __typename even though it's not specified in the query, that's because Apollo Client and other GraphQL clients have an option to automatically add it to all queries.)

The normalized cache might look like this, tracking the relationships using a special __ref key:

{
    "ticket:100": {
        "__typename": "ticket",
        "id": 100,
        "title": "The ticket",
        "checklists": [{ "__ref": "checklist:500" }]
    },
    "checklist:500": {
        "__typename": "checklist",
        "id": 500,
        "title": "A checklist",
        "tasks": [
            { "__ref": "task:2000" },
            { "__ref": "task:2001" }
        ]
    },
    "task:2000": {
        "__typename": "task",
        "id": 2000,
        "title": "Foo",
        "is_complete": true,
        "is_deleted": false
    },
    "task:2001": {
        "__typename": "task",
        "id": 2001,
        "title": "Bar",
        "is_complete": false,
        "is_deleted": false
    }
}

The frontend code reads data exclusively from the normalized cache. Why is that good? It provides a single source of truth for each entity. If a field on an entity changes (say, it gets a new title), the new value is stored on the entity in the normalized cache, and by virtue of Apollo Client's useQuery React hook, the new value is immediately reflected everywhere in the UI that it's displayed. The UI can't go out of sync, displaying a different value in one part versus another, because it's all reading from the same source.
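To make that concrete, here's a minimal sketch (not Constructor's actual code) of a component reading the ticket from the cache via useQuery. The query and component names here are made up for illustration.

import { gql, useQuery } from "@apollo/client";

// Illustrative query, shaped like the example above.
const GET_TICKET = gql`
    query GetTicket($id: Int!) {
        ticket_by_id(id: $id) {
            id
            title
        }
    }
`;

function TicketTitle({ id }: { id: number }) {
    // useQuery reads from the normalized cache. If ticket:<id> changes
    // (say, it gets a new title), this component re-renders automatically.
    const { data, loading } = useQuery(GET_TICKET, { variables: { id } });
    if (loading || !data) return null;
    return <h1>{data.ticket_by_id.title}</h1>;
}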

Or can it?

One way this can go wrong is if a GraphQL query omits the id field. Without an id, Apollo can't construct a key to store the entity under in the normalized cache.

For example, let's say our earlier query omitted id on tasks, like this:

query {
    ticket_by_id(id: 100) {
        id
        title

        checklists {
            id
            title

            tasks {
                title
                is_complete
                is_deleted
            }
        }
    }
}

The response object would, obviously, no longer include id for each task. But what would the normalized cache look like? It would look like this:

{
    "ticket:100": {
        "__typename": "ticket",
        "id": 100,
        "title": "The ticket",
        "checklists": [{ "__ref": "checklist:500" }]
    },
    "checklist:500": {
        "__typename": "checklist",
        "id": 500,
        "title": "A checklist",
        "tasks": [
            {
                "__typename": "task",
                "title": "Foo",
                "is_complete": true,
                "is_deleted": false
            },
            {
                "__typename": "task",
                "title": "Bar",
                "is_complete": false,
                "is_deleted": false
            }
        ]
    }
}

The task entities are no longer normalized at the top level; instead, they're nested underneath the parent checklist.

Why is that bad? Well, imagine the app had two queries, one that did include id (the first example) and one that didn't (the second example). The task entities would appear in the normalized cache twice, once at the top level (as in the first example) and again underneath the parent checklist (as in the second example). We no longer have a single source of truth for each task. As a consequence, if a task is modified, it's not hard to imagine the UI displaying inconsistent or wrong values.

All that said, I'm skeptical it's the cause of the bug. (Don't worry, it's knowledge that will be relevant later.) I'm skeptical because:

  • I see no reason why the cache normalization would behave differently in development versus production.
  • I had already added a test to the codebase that checks all GraphQL queries include the id field where appropriate (roughly sketched below).
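That test is conceptually along these lines. This is a simplified, hypothetical sketch using graphql-js's visit; the real test also has to account for fragments and legitimate exceptions.

import { parse, visit, Kind, DocumentNode } from "graphql";

// Walk a query document and report every nested selection set that doesn't
// also select the id field (fragments and exceptions are ignored in this sketch).
function fieldsMissingId(doc: DocumentNode): string[] {
    const missing: string[] = [];
    visit(doc, {
        Field(node) {
            if (node.selectionSet) {
                const hasId = node.selectionSet.selections.some(
                    (sel) => sel.kind === Kind.FIELD && sel.name.value === "id"
                );
                if (!hasId) missing.push(node.name.value);
            }
        },
    });
    return missing;
}

console.log(fieldsMissingId(parse("{ ticket_by_id(id: 100) { id checklists { title } } }")));
// ["checklists"]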

I entertain the possibility that the test itself could be broken. I manually check that the id field is being included where it should be. It is.

Conclusion: This is a dead-end. It isn't causing the bug.

It's Not Looking Optimistic

I know from the console log I added earlier that the checklist React component is being passed bad data. But why?

The bad data is coming from Apollo's useQuery React hook. useQuery reads from the normalized cache. It stands to reason that the normalized cache must contain the bad data. But how?

My instinct is that it's our app's fault, that the app is somehow injecting bad data into Apollo's normalized cache. I hypothesize it's due to how the app implements optimistic UI.

Optimistic UI is a simple idea. Whenever a user makes a change, the effect is made visible in the UI immediately without waiting for a backend response. The app feels snappy, and the network latency communicating with the backend is hidden from the user.

To implement optimistic UI with Apollo Client, whenever you execute a GraphQL mutation, you can (optionally) declare an optimistic response. It's essentially a prediction of what the actual backend response will look like.

Apollo uses the optimistic response to update the normalized cache immediately, so the user sees their change reflected in the UI instantly. Later when the actual backend response is received, Apollo rolls back the optimistic response and applies the actual response. If the optimistic response was right, the UI is already correct, so the user sees no additional change.
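For example, completing a task might look something like this. It's a sketch with a made-up mutation and field names, not Constructor's actual code.

import { gql, useMutation } from "@apollo/client";

// Hypothetical mutation for toggling a task's completion state.
const SET_TASK_COMPLETE = gql`
    mutation SetTaskComplete($id: Int!, $is_complete: Boolean!) {
        update_task(id: $id, is_complete: $is_complete) {
            id
            is_complete
        }
    }
`;

function useSetTaskComplete() {
    const [updateTask] = useMutation(SET_TASK_COMPLETE);

    return (id: number, is_complete: boolean) =>
        updateTask({
            variables: { id, is_complete },
            // Apollo applies this prediction to the normalized cache immediately,
            // then rolls it back and applies the real response when it arrives.
            optimisticResponse: {
                update_task: {
                    __typename: "task",
                    id,
                    is_complete,
                },
            },
        });
}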

Maybe when deleting or changing a task, the codebase specifies an incorrect optimistic response, and that incorrect data is somehow getting permanently stuck in the normalized cache (rather than being rolled back). To test whether that's a factor, I open a PR where all optimistic responses throughout the app are disabled. In the deploy preview I try to trigger the bug. The bug is still there.

Conclusion: Another dead-end. The app's handling of optimistic UI is not causing the bug.

Cache Me If You Can

I begin to suspect there's a bug in Apollo Client itself. It's time to examine the normalized cache more closely.

The normalized cache is implemented by a class called InMemoryCache. There are several methods for interacting with it, which I'm already familiar with. I open a PR with additional console logs (sketched after this list) for:

  • A call to readFragment() to read the checklist data from the cache "directly" (rather than via the useQuery hook).
  • A call to extract() to view the entire cache contents in as raw a form as Apollo Client can give me.
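The logging is roughly along these lines. This is a sketch: client is assumed to be the app's ApolloClient instance, and the fragment and cache id are illustrative.

import { gql } from "@apollo/client";
import { client } from "./apolloClient"; // hypothetical module exporting the ApolloClient instance

// Read one checklist straight out of the normalized cache, bypassing useQuery.
const checklist = client.readFragment({
    id: "checklist:500",
    fragment: gql`
        fragment ChecklistDebug on checklist {
            id
            title
            tasks {
                id
                title
                is_complete
                is_deleted
            }
        }
    `,
});
console.log("readFragment:", checklist);

// Dump the entire cache contents in as raw a form as Apollo Client exposes.
console.log("extract:", client.cache.extract());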

In the deploy preview, I trigger the bug. To my amazement, the logs for both readFragment() and extract() indicate they're returning correct data! Yet the log from earlier still shows that the checklist component is receiving bad data from the useQuery hook!

Conclusion: It now seems overwhelmingly likely that there's a bug in Apollo Client. But bizarrely, it only affects some cache methods. And again, it's only in production... 🤔

Returning Home

It's time to figure out how to reproduce the bug locally.

I search the Apollo Client repo for NODE_ENV and quickly find a file which sets a __DEV__ flag to true in development environments. I patch my local version of this file in node_modules to set the flag to false, and I try to trigger the bug locally. Success 🎉🎉🎉! Now I can easily reproduce it locally just like in production.

Conclusion: I have no doubt that this is a production-only bug in Apollo Client.

It's Freezing Out

What does the __DEV__ flag control? I search the repository and find only a few references. There are so few, I decide to quickly test each one manually.

I patch my local version to re-enable __DEV__. Then, one at a time, I replace each usage of __DEV__ with false (emulating production for just that usage) and try reproducing the bug. Only one of these patches triggers it: the usage in maybeDeepFreeze.ts.

Deep freezing... that rings a bell. I remember Apollo's documentation mentioning a key difference between development and production regarding the data returned by useQuery. Callers are supposed to treat the data as immutable, since it may contain direct references to Apollo's internal structures. But immutability is only enforced in development.

Specifically, in development Apollo returns objects that have been frozen using Object.freeze. (Well, a deep version of it.) If the caller tries to modify the object, it'll fail loudly. Presumably any such issue would be quickly caught during development. But in production, object freezing is disabled to improve performance. The name maybeDeepFreeze.ts is apt.
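Conceptually, it behaves like this. (This is my own simplified sketch, not Apollo's actual source.)

declare const __DEV__: boolean; // set by Apollo's build tooling

// Recursively freeze an object graph so attempted mutations fail loudly.
function deepFreeze<T>(obj: T): T {
    Object.getOwnPropertyNames(obj).forEach((name) => {
        const value = (obj as any)[name];
        if (value && typeof value === "object" && !Object.isFrozen(value)) {
            deepFreeze(value);
        }
    });
    return Object.freeze(obj) as T;
}

// In development, freeze results so accidental mutation is caught immediately.
// In production, skip freezing entirely for performance.
function maybeDeepFreeze<T>(obj: T): T {
    return __DEV__ ? deepFreeze(obj) : obj;
}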

Conclusion: The bug is reproducible only when Apollo does not deep freeze the data it returns, as in production. The bug doesn't manifest when the data is frozen.

The Apollo Missions

Now that I can easily reproduce the bug locally, I want to pinpoint when it was introduced.

I install a few different Apollo Client versions locally and try reproducing the bug. It doesn't take long to find that the bug was introduced in v3.5.0. The previous version, v3.4.17, is fine.

I check the diff between these two versions. Unfortunately it's too big to easily spot which change introduced the bug. I want to find the precise commit. It's time for a git bisect.

If you've never used git bisect, you should learn it. Given two commits, one "bad" (say, exhibiting a bug) and one "good", git checks out a commit roughly halfway between them. You test it and mark it as "bad" or "good". Then the process repeats, with git shepherding you through a binary search to home in on the first "bad" commit.

I clone the Apollo Client repo and configure my local Constructor app to use that clone as a dependency. This turns out not to be hard to set up: I just change the @apollo/client dependency in my local package.json to file:path/to/apollo-client/dist and run npm link path/to/constructor/node_modules/react from within the Apollo Client clone.

I initiate a git bisect, marking the commit for v3.4.17 as "good" and the commit for v3.5.0 as "bad". The result? Commit 6ec967d introduced the bug.

That commit doesn't have many changes (ignoring docs and tests), so I feel like I have a chance at spotting the bug by carefully examining the diff.

But my hopes are quickly dashed. It's a merge commit. I check each of its two parents (34b41a0 and ffaf81d), and I can't reproduce the bug in either one!

Conclusion: The Apollo Client codebase landed two separate changes that are each fine independently but produce the bug when combined. 🙁

Committed to Something

I run git merge-base 34b41a0 ffaf81d to find the common ancestor for the two commits that introduced the bug when combined. It's 9214160. I review the commit histories for 34b41a0 and ffaf81d back to that common ancestor. There aren't many changes in the latter, and one of them catches my eye: a configuration option called canonizeResults that previously defaulted to true was changed to default to false.

In my local clone of the repo, I check out the other parent (34b41a0). As expected, canonizeResults still defaults to true there. I change it to false and run Constructor locally. Previously this commit didn't exhibit the bug, but with canonizeResults now set to false, it does!

I've identified how one side of the merge contributes to the bug. It's time to figure out the other side.

I again scan through the commit history for 34b41a0, along with the diff between v3.4.17 and v3.5.0. Eventually I zero in on commit 756ab87, which interests me because it changes readFromStore.ts, the module that reads query results out of the normalized cache (more on that module shortly).

I revert that commit locally and try reproducing the bug again. Now I can't!

Conclusion: The changes in 756ab87 introduced the bug (in combination with canonizeResults being false).

(At this point, we released a temporary fix to Constructor by enabling the canonizeResults option. But I continued the investigation because I wasn't positive leaving canonizeResults enabled would be safe, and I wanted to understand the root cause.)
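The mitigation itself is essentially a one-line change where the cache is constructed, something along these lines:

import { InMemoryCache } from "@apollo/client";

// Restore the pre-3.5 default by explicitly turning result canonization back on.
const cache = new InMemoryCache({
    canonizeResults: true,
});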

Something Deep Inside

I examine readFromStore.ts and the changes introduced in 756ab87 more closely.

The module has just one substantial public method called diffQueryAgainstStore(). The comment describing it is helpful:

Given a store and a query, return as much of the result as possible and identify if any data was missing from the store.

I interpret that to mean it takes a GraphQL query and reads as much as is available from the normalized cache (the "store").

I add a console log dumping diffQueryAgainstStore()'s return value and trigger the bug again. The log shows that, indeed, diffQueryAgainstStore() is returning the bad data.

diffQueryAgainstStore() delegates most of its work to a call to executeSelectionSet(). In the GraphQL spec, a selection set is essentially just a list of scalar fields to be retrieved on an object, as well as fields that point to other objects or arrays which in turn have their own selection sets. A selection set can also reference fragments, which are reusable definitions of selection sets, and which themselves can reference other fragments. This is all just a formal grammar for the intuitive nested structure of a GraphQL query. The grammar is recursive, and not surprisingly, the implementation of executeSelectionSet() is recursive too.

I see that executeSelectionSet() recursively builds up a nested object to satisfy the given selection set, and it merges the intermediate results from sub-selections into a running result object. In development, the result object is frozen before being returned.

I also notice that executeSelectionSet() is memoized (via this call to a wrap utility). The actual implementation is executeSelectionSetImpl(). When executeSelectionSet() is called, if it hasn't been called with those arguments before, it delegates to executeSelectionSetImpl(). Otherwise it returns the already computed value from the prior call, which is stored in a memoization cache managed by the wrap utility.
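As a toy illustration of that pattern (this is not Apollo's optimism-based wrap utility, just the general idea): a memoized function hands back the exact same cached object, by reference, on repeated calls, which is precisely why that object must never be mutated after the fact.

// Toy memoizer: repeated calls with the same key return the same object reference.
function memoize<R>(fn: (key: string) => R): (key: string) => R {
    const cache = new Map<string, R>();
    return (key) => {
        if (!cache.has(key)) {
            cache.set(key, fn(key));
        }
        return cache.get(key)!;
    };
}

const getResult = memoize((key) => ({ key, fields: {} as Record<string, unknown> }));
console.log(getResult("task:2000") === getResult("task:2000")); // true: same reference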

So where's the bug?

I look again at the problem commit. It changes how the intermediate results are merged together. Previously, each recursive call to executeSelectionSet() merged intermediate results in isolation. But now the entire recursive call tree, starting from the initial call by diffQueryAgainstStore() to executeSelectionSet(), shares a single instance of a class called DeepMerger.

The change looks innocuous. But could it be relevant?

DeepMerger is defined here. There's a helpful comment at the top:

These mergeDeep and mergeDeepArray utilities merge any number of objects together, sharing as much memory as possible with the source objects, while remaining careful to avoid modifying any source objects.

What does that mean? I examine the unit tests and see that it's basically a deep version of Object.assign, but with an option to provide a custom reconciler if there are conflicting values for the same key.
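Based on those tests, usage looks roughly like this (mergeDeep is exported from @apollo/client/utilities, if I'm reading the package layout right; the objects here are illustrative):

import { mergeDeep } from "@apollo/client/utilities";

const a = { ticket: { id: 100, title: "The ticket" } };
const b = { ticket: { checklists: [{ id: 500 }] } };

// Nested objects are merged key by key rather than replaced wholesale,
// sharing memory with the sources wherever nothing changed.
const merged = mergeDeep(a, b);
console.log(merged);
// { ticket: { id: 100, title: "The ticket", checklists: [ { id: 500 } ] } }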

I read the DeepMerger code and see a private helper that shallow copies a given object or array. Interestingly, it behaves differently depending on whether the given object is frozen! That's suspicious.

I again try reproducing the bug locally, mimicking production by patching __DEV__ to false, but this time I also remove the suspicious line that checks whether an object is frozen. The bug disappears!

But why? What does that line do? Let's step back. The comment at the top of DeepMerger claims that it doesn't modify the source objects. How does it implement that? Simple. Before modifying an object (e.g., adding or replacing a field), it makes a shallow copy. Obviously, this copy is safe to modify because no other code can have a reference to it. However, there's an optimization. If we later want to further modify this copy, it would be inefficient and unnecessary to make another copy of it. After all, the point of the original copy was to be able to safely modify it. To that end, DeepMerger tracks previous copies in a Set called pastCopies. It only copies objects that aren't themselves copies.

By removing that line, DeepMerger makes copies of copies, and that fixes the bug.

Conclusion: Using a single DeepMerger instance to build up the final result is causing the bug, but only if it doesn't make copies of copies, as in production.

Eureka!

I feel like I'm quite close to nailing down this bug, but I still don't see why my insights into DeepMerger are causing a problem.

I know diffQueryAgainstStore() is returning bad data. I need to find exactly where in the recursive call tree that bad data is being introduced.

I patch my local readFromStore.ts to console log the arguments and return values of all of the functions, and I also log all of the major logic branches. (I prefer console logs for this sort of debugging, since I can jump around the logs as I see fit. Stepping through breakpoints is too slow and painful in a recursive call tree.)

Importantly, whenever I log one of the intermediate results, I also log a deep copy of it. That way, I'll see not only what the object looks like when I view the log in the browser console but also a snapshot of what it looked like at the time of the log. I decide to make the deep copies with a simple JSON.parse(JSON.stringify(obj)), because I know the result objects from a GraphQL query are JSON-serializable, and this approach has nothing fancy that could go wrong.
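The logging idiom is nothing fancier than this sketch:

// Log both the live object (inspectable later, but mutable) and a point-in-time
// snapshot, since the live object may be mutated after this line runs.
function logWithSnapshot(label: string, value: unknown): void {
    console.log(label, value, JSON.parse(JSON.stringify(value)));
}

logWithSnapshot("executeSelectionSet result", { id: 2000, title: "Foo" });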

I trigger the bug, review the logs, and I see it. There's a call to executeSelectionSet() that returns a memoized result. I can tell it's a memoized result because there's no corresponding call to executeSelectionSetImpl(). From the deep copy snapshot I logged, I see that the result now has a bunch of extra fields on it that aren't in the selection set — including the bad data — but these extra fields weren't present at the time the result was returned. As the intermediate recursive results are merged together into the final output, these extra fields with the bad data happen to survive (aren't overwritten).

Conclusion: The bad data is coming from an intermediate result being modified after it was returned and stored in the memoization cache.

It All Comes Together

Remember earlier when I said this about DeepMerger?

If we later want to further modify this copy, it would be inefficient and unnecessary to make another copy of it. After all, the point of the original copy was to be able to safely modify it.

I was wrong about that! And that's the crux of the bug.

It goes like this:

  • The first time the app issues a particular GraphQL query to Apollo, none of the intermediate results are memoized. Every call to executeSelectionSet() falls through to the implementation executeSelectionSetImpl(), which invokes DeepMerger to merge intermediate results. The return value of each call is stored in the memoization cache as it's returned.
  • Because a single DeepMerger instance is shared through the entire recursion, its internal Set of pastCopies contains all past copies from any previous executeSelectionSetImpl() invocation.
  • At some point in the recursion, DeepMerger is invoked with an object. It recognizes this object as a copy, so it thinks it's safe to modify it (e.g., to add fields to it). But this copy is in fact already in the memoization cache from an earlier part of the recursion. Further modifying it corrupts it with additional fields that weren't output by the memoized computation.
  • Sometime later, the app issues the same query to Apollo. The corrupt (in fact, stale) data is still in the memoization cache, and it gets included in the final return value to the app.

This only happens in production because, in development, return values are frozen, and DeepMerger never reuses past copies that are frozen.
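To see the failure mode in miniature, here's a toy sketch. It's my own simplification, not Apollo's actual DeepMerger or memoization code, but it shows why reusing a "past copy" is dangerous once that copy has escaped into a memoization cache.

const pastCopies = new Set<object>();

// Copy an object before mutating it, but skip the copy if the object is
// itself a previous copy, on the theory that it's already safe to mutate.
function shallowCopyForMerge(value: Record<string, unknown>): Record<string, unknown> {
    if (!pastCopies.has(value)) {
        value = { ...value };
        pastCopies.add(value);
    }
    return value;
}

const memoCache = new Map<string, Record<string, unknown>>();

// Step 1: build an intermediate result and memoize it.
const result = shallowCopyForMerge({ id: 2000, title: "Foo" });
memoCache.set("task:2000", result);

// Step 2: a later merge reuses the same "past copy" and mutates it in place...
const reused = shallowCopyForMerge(result);
reused.is_deleted = true;

// ...which silently corrupts the memoized value for every future reader.
console.log(memoCache.get("task:2000"));
// { id: 2000, title: "Foo", is_deleted: true }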

Loose Ends

Let's tidy up a few things.

What does canonizeResults even do, and why does the bug only occur when it's disabled?

Alas, I'll table that question for another day.

Why wasn't this bug reported earlier? Surely it affects other users of Apollo Client?

The bug only manifests in a specific combination of circumstances:

  • The normalized cache has to be configured with the default values for canonizeResults as well as another option, resultCaching (which controls memoization). Some fraction of Apollo Client users will have overridden the defaults, preventing the bug.
  • The corrupted memoized data usually gets overwritten by correct data in the recursive call tree that builds up the final result. It survives only when the GraphQL query has a combination of non-fragment and fragment selection sets in a particular pattern. I suspect most Apollo Client users happen not to use that pattern.

Why does DeepMerger have different logic depending on whether the given object was frozen?

Here's an educated guess:

  • canonizeResults was defaulted to false on one branch, and the new readFromStore behavior was introduced on a separate branch in 756ab87.
  • Those branches were merged, introducing the bug.
  • Someone noticed that apps were crashing in development because DeepMerger was trying to modify frozen objects. That could/should have been a clue that the previous changes introduced a regression.
  • It wasn't obvious why it was unsafe for DeepMerger to modify previous copies. So, rather than stopping DeepMerger from modifying objects altogether, a "fix" was introduced which did so only in development (when the objects are frozen).

Why didn't we notice this when we QA'd the improvements to Constructor's checklists or used Constructor ourselves?

I can't say for sure, but I suspect we made a late change to one of our GraphQL fragments causing it to conform to the pattern required by this bug, and we didn't retest the entire UI after that change because it seemed innocuous. And day-to-day, we rarely delete tasks, so we simply didn't hit the bug ourselves!

Aftermath

I filed a detailed bug report with a simple reproduction and my analysis, which was well-received. The Apollo Client developers quickly implemented a fix.

Epilogue

It took me several days to get to the bottom of this, but it wouldn't have been possible had Apollo Client not been open source. After spending some time in the Apollo Client codebase, I came away with the impression that it's professional and of high quality. We continue to rely on it in our product. Constructor is built on open source software, and we couldn't be happier to give back to the community in this way.

Moral: When it's cheap and trivial to do, always try to reproduce a bug that appears in production... in production! 😆

Andrew Kallem is lead engineer and co-founder at Constructor.