Building an NFT Indexer

The Goal

One of the primary features of Gallery is enabling users to display their NFTs. To facilitate this, we need a quick way to retrieve all NFTs that belong to any given wallet address. Unfortunately, the Ethereum blockchain was not designed with a magical "get me every token that person owns" command, let alone other ways to slice the data: handling NFTs that are staked or escrowed, retrieving NFTs that were created by a specific address, retrieving NFTs an address once owned in the past, etc.

Thankfully, however, the blockchain does keep track of every transaction that has ever occurred in sequential order. This allows us to start from the beginning and track every token transfer, as well as the sender and recipient of those transfers. When the whole process has caught up to the current block, we should have a good idea of who still owns which tokens. From then on we simply need to listen for new blocks and see which wallets just paper-handed their Gallery Membership Card—and other NFTs, of course.

The process of accumulating token balances by reading the sequential data produced by the blockchain is what most would call indexing, and the super villain software that performs the indexing we'll call The Indexer. Don't worry though, The Indexer may be a boss battle to build but it sure saves the day in the end. The Indexer is the Darth Vader to our original trilogy, if you will.

There are countless indexer archetypes for unearthing various tidbits from the blockchain. For the purposes of Gallery, we needed one that could retrieve all NFTs that have ever been minted, support ERC-721 and ERC-1155 tokens, and ascertain the media content represented by those NFTs. Our indexer is now running in production and is fully open source: you could run your own instance of it today, and all you’d need is a node provider like Alchemy.

Here is how we did it.

Indexing 101: The Path to the Dark Side

The goal is clear for what our indexer needs to accomplish but there are infinite ways to go about designing it. We spent some time experimenting with a few models, each with their own limitations and constraints, until we landed on a sweet spot. In order to better understand our current indexer, let's start from the beginning.

Backstory

The backend at Gallery is written entirely in the Go programming language. Golang is an elegant language that is easy to learn and helps us iterate quickly in a cloud environment. The most widely used Ethereum client today, Geth, is also written in Go, so we were able to reuse the open-source code behind Geth in our own codebase, saving time on low-level primitives and giving us clean, high-level abstractions for interacting with the blockchain. The Geth contributors put together a fantastic codebase!

On top of having sweet language support from the Ethereum community, Go also has a beautiful concurrency model that allows the indexer to run quickly without worrying about many of the inconsistencies generally associated with concurrency.

We chose PostgreSQL for our persistence layer, as it’s our database of choice for our existing backend. The interesting challenge with a relational database is that NFTs are very inconsistent in schema (devs: do something!). Some NFTs have no token URI, some do. Some token URIs point to JSON metadata, some point to SVGs. Some NFTs refer to images, videos, or GIFs, while others point to full snippets of browser code that generate unique and interactive pieces of art on the fly. We ended up with the following schema, which keeps the pieces of information we can generally expect across all NFTs in top-level columns and reserves the more schema-less data (metadata) for JSONB, Postgres’s binary JSON type:

type Token struct {
	Version      NullInt32       `json:"version"`
	ID           DBID            `json:"id" binding:"required"`
	CreationTime CreationTime    `json:"created_at"`
	Deleted      NullBool        `json:"-"`
	LastUpdated  LastUpdatedTime `json:"last_updated"`

	// Cached media content for the NFT (image, video, GIF, HTML, etc.).
	Media Media `json:"media"`

	// ERC-721 or ERC-1155.
	TokenType TokenType `json:"token_type"`

	Chain Chain `json:"chain"`

	Name        NullString `json:"name"`
	Description NullString `json:"description"`

	TokenURI         TokenURI                 `json:"token_uri"`
	TokenID          TokenID                  `json:"token_id"`
	Quantity         HexString                `json:"quantity"` // balance, relevant for ERC-1155
	OwnerAddress     EthereumAddress          `json:"owner_address"`
	OwnershipHistory []EthereumAddressAtBlock `json:"previous_owners"`
	TokenMetadata    TokenMetadata            `json:"metadata"` // stored as JSONB
	ContractAddress  EthereumAddress          `json:"contract_address"`

	ExternalURL NullString `json:"external_url"`

	// Block at which this token's state was last updated by the indexer.
	BlockNumber BlockNumber `json:"block_number"`
}

The blockchain does not have one entry point for retrieving information. There are boatloads of Ethereum clients all running at once right now, each storing all or most of what’s represented by the blockchain. Instead of spinning up and managing our own cluster of Ethereum clients, we decided that it would make more sense as a small team to outsource this task to a provider that offers this exact service: Alchemy. Alchemy runs a fully managed node service so you can focus on retrieving information without thinking about how to make that information available. You can think of the service itself as a kind of generalized Ethereum activity indexer, on top of which we’re building an NFT indexer. Alchemy also makes it super easy to bounce between testnets and mainnet, and provides fantastic analytics on node traffic.

Of course, all of this code has to live somewhere. Continuing the running theme of using things we already had set up for another purpose, we decided that deploying the indexer to an App Engine instance on Google Cloud Platform was the best move (other options included AWS, Azure, etc.). App Engine makes deployment trivial and lets us focus on the code we're writing instead of how to get a machine to run it in production. However, App Engine also comes with some pretty strict memory limits: the indexer can be quite memory intensive, and the RAM allotted to a Go 1.16 instance was capped at 2 GB.

Young Anakin Skywalker

The first step to indexing the blockchain is figuring out how to separate the data we want from the other data that is meaningless to us, like when Jimmy converted his 10 USDC into 0.0002 ETH (he used the wrong contract). One way we can do this is by determining which event signatures match those that are emitted by ERC-721 and ERC-1155 compliant smart contracts.

Retrieving Event Logs

These logs are easy to grab with the JSON-RPC methods exposed by Ethereum nodes, specifically eth_getLogs, which lets us pull the event logs matching given event signatures over a specified range of blocks. For an ERC-721 token, the transfer event is made up of a from address, a to address, and a token ID. For ERC-1155 tokens there are two possible events that could represent a token transfer: the transfer of a single token type, and the transfer of multiple token types at once. The first includes the from address, to address, and token ID, but also an operator address (the address making the transaction) and the amount of that token type being transferred. The second is very similar, except the token IDs and amounts are equal-length arrays mapped together by index, representing the different amounts of each token type transferred.
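To make this concrete, here is a minimal sketch (not our production code) of pulling those logs with go-ethereum's ethclient. The node URL is a placeholder, and the block range is arbitrary; the three event signatures are the standard ERC-721 Transfer and ERC-1155 TransferSingle/TransferBatch events.

package main

import (
	"context"
	"log"
	"math/big"

	"github.com/ethereum/go-ethereum"
	"github.com/ethereum/go-ethereum/common"
	"github.com/ethereum/go-ethereum/crypto"
	"github.com/ethereum/go-ethereum/ethclient"
)

func main() {
	// Placeholder endpoint; any JSON-RPC provider such as Alchemy works here.
	client, err := ethclient.Dial("https://eth-mainnet.alchemyapi.io/v2/YOUR-KEY")
	if err != nil {
		log.Fatal(err)
	}

	// The standard transfer event signatures for ERC-721 and ERC-1155.
	transfer := crypto.Keccak256Hash([]byte("Transfer(address,address,uint256)"))
	transferSingle := crypto.Keccak256Hash([]byte("TransferSingle(address,address,address,uint256,uint256)"))
	transferBatch := crypto.Keccak256Hash([]byte("TransferBatch(address,address,address,uint256[],uint256[])"))

	// Pull every log in a range of blocks whose first topic matches one of the
	// three signatures. Note that ERC-20 Transfer shares a signature with
	// ERC-721 Transfer; ERC-721 logs carry a fourth topic (the indexed token
	// ID), which is how the two can be told apart when parsing.
	query := ethereum.FilterQuery{
		FromBlock: big.NewInt(5_000_000),
		ToBlock:   big.NewInt(5_000_049),
		Topics:    [][]common.Hash{{transfer, transferSingle, transferBatch}},
	}
	logs, err := client.FilterLogs(context.Background(), query)
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("found %d candidate transfer logs", len(logs))
}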

Using the eth_getLogs JSON-RPC method, we decided that the indexer would start at block ~5 million on the Ethereum blockchain, which is around the time the ERC-721 specification was standardized. In the first iteration of the indexer, logs were retrieved in groupings of 200 blocks. This interval worked fine for the first couple million blocks processed, but we soon realized that a lot can happen in 200 blocks, especially as NFTs began picking up in popularity.

Distilling Logs into Transfers

As each set of logs comes in, it is processed into a transfer model that stores who the transfer occurred between, what contract the transfer came from, what kind of contract it was, when the transfer occurred, the token ID transferred, and finally the amount transferred, which can be greater than one for ERC-1155 transactions. Retrieving this data from the logs was especially simple for ERC-721 tokens because of indexed event parameters. Any event on the blockchain can have at most three indexed parameters; these parameters can be used to search through logs and are therefore very easy to find. Fortunately, every parameter of an ERC-721 transfer event is indexed, and there are only three parameters total! For ERC-1155 tokens, however, the token ID and the amount being transferred are not indexed parameters in either event type. They are instead included in the log's data field, which is an ABI-encoded representation of the non-indexed values in an event. After doing some research on the ABI specification, we were able to parse this field to fill in the transfer model.
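For illustration, here is a hedged sketch of pulling the non-indexed token ID and amount out of a TransferSingle log's data field with go-ethereum's abi package; the production parsing also handles TransferBatch and more error cases.

import (
	"math/big"

	"github.com/ethereum/go-ethereum/accounts/abi"
)

// decodeTransferSingleData unpacks the non-indexed fields of an ERC-1155
// TransferSingle log: the token ID and the amount transferred, both ABI
// encoded as uint256 values in the log's data payload.
func decodeTransferSingleData(data []byte) (id, amount *big.Int, err error) {
	uint256Ty, err := abi.NewType("uint256", "", nil)
	if err != nil {
		return nil, nil, err
	}
	args := abi.Arguments{{Type: uint256Ty}, {Type: uint256Ty}}
	vals, err := args.UnpackValues(data)
	if err != nil {
		return nil, nil, err
	}
	return vals[0].(*big.Int), vals[1].(*big.Int), nil
}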

Once each incoming log is converted into the transfer model, the transfers are pushed onto a channel, a concurrency primitive in Go, which allows lightweight threads called goroutines to communicate data instead of sharing data with locking. When something is pushed onto a channel, it is expected (in most cases) that some other part of the application will be receiving from the channel on a different goroutine. For the purpose of our service, we have a separate goroutine that is receiving transfers and filling in the necessary data such as finding token URIs for NFTs, using the token URIs to find metadata, etc., as well as keeping track of the current owner of a token, an ordered list of previous owners, and balances for ERC-1155 tokens. All of this information at the time was sent across more channels into a goroutine that stored the data in maps that were shared across goroutines, which we later discovered led to inconsistencies. Something else to note is that each piece of data needed to be tied to a block number to ensure the sequential nature of the transaction history, as well as the token ID and contract address combination that uniquely identified each NFT.
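The shape of that hand-off looks roughly like this; it is a simplified sketch where the Transfer struct, its field set, and the enrich callback are illustrative stand-ins rather than our exact production types.

package indexer

import "math/big"

// Transfer is a simplified stand-in for the transfer model described above.
type Transfer struct {
	From, To, Contract string
	TokenID            *big.Int
	Amount             *big.Int
	Block              uint64
}

// pipe wires two stages together: one goroutine pushes parsed transfers onto
// a channel, while the caller receives them and enriches each token (URI,
// metadata, ownership history) on its own goroutine.
func pipe(parsed []Transfer, enrich func(Transfer)) {
	transfers := make(chan Transfer)

	go func() {
		defer close(transfers) // closing tells the receiver the batch is done
		for _, t := range parsed {
			transfers <- t
		}
	}()

	for t := range transfers {
		enrich(t) // e.g. fetch the token URI, then the metadata behind it
	}
}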

Making and Storing the Tokens

Every thirty seconds or so, access to the maps that stored data on each token was blocked while a separate thread processed each map into a token model to be inserted into the database. This step aggregated the data in each map into a single data structure, letting us free up everything we were holding in memory and put that weight onto the persistence layer. The token data structure was the conglomeration of everything we had discovered so far: the most recent known owner of a token, the ownership history we currently knew for it, the token's URI and metadata, what type of token it is, and possibly a balance if the token is an ERC-1155 token. Since we also need the data to stay sequentially accurate, we check whether the database already holds data on this NFT, and if so, we merge in the ownership history we previously discovered, make sure balances are accurate, and finally make sure the current owner really is the most recent owner of the token.
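That reconciliation step can be sketched like this, reusing the Token struct from earlier; mergeToken is illustrative rather than our exact code, and it assumes OwnershipHistory is stored oldest-first and BlockNumber is an integer-like type that can be compared.

// mergeToken reconciles a freshly processed token with the row already in the
// database, so ownership history isn't lost and the newest owner wins.
func mergeToken(existing, incoming Token) Token {
	merged := incoming
	// Prepend what we already knew about previous owners.
	merged.OwnershipHistory = append(existing.OwnershipHistory, incoming.OwnershipHistory...)
	// Keep whichever owner was observed at the higher block.
	if existing.BlockNumber > incoming.BlockNumber {
		merged.OwnerAddress = existing.OwnerAddress
		merged.BlockNumber = existing.BlockNumber
	}
	return merged
}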

At a certain point, the indexer does catch up to the most recent block. It is at this point when the indexer switches to listening on a web socket for newly mined blocks. Each incoming block is processed exactly the same way as a block from a get logs request would be processed.

The Problem

After running the indexer over a couple hundred blocks, we found thousands of NFTs already appearing in the database! When digging deeper, however, we noticed that it was far from perfect. Many NFTs had missing token URI fields, missing metadata fields, or wrongly ordered ownership histories, and many NFTs that we expected to appear within the interval of blocks we were testing simply did not appear at all! Most of these inconsistencies were attributable to our use of concurrency. But a larger problem loomed ahead: how were we going to retrieve and store the massive amounts of media content represented by the plethora of NFTs on the blockchain?

Jedi Knight Anakin Skywalker

Knowing how many 10K PFP projects there were out there, we were concerned about the sheer amount of pictures, videos, gifs, and scripts that were waiting to be stored on our servers. It would easily cost us millions to store the memes necessary to satiate the NFT community's FOMO.

We decided instead to retrieve media content with a lazy caching model. Imagine a library, initially empty, with an all-knowing librarian. Once you ask the librarian for some information, they’ll write a book with the answers you need and place it on a shelf, so that if someone else asks for the same information the librarian has the book ready to go. Over time, as more people request resources from the librarian, the library begins to fill up, increasing the odds that the book you’re looking for has already been requested by someone else.

In the case of Gallery's indexer, the books are the media content. The tradeoff is that the initial writing of the book can take some time, especially if its contents are large – but there are a variety of ways to optimistically pre-populate the cache if we can detect that certain NFTs are more likely to be accessed.

Finding The Missing Media

When a Gallery user makes a request for an NFT, the backend service scans the database for the token and checks whether we already have the media content available for that NFT. If not, we try to use the token URI and token metadata to find the media and store your 8K, ultra-hi-res video of a cartoon cat in our GCS bucket. Sometimes the media can be retrieved directly from the metadata, as with the Animal Coloring Book or Loot collections, which beautifully store their media directly on the blockchain as base64-encoded SVGs.

Token metadata for Animal Coloring Book
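In code, the lazy path looks roughly like this; it is only a sketch, and findToken, resolveMedia, storeInGCS, saveToken, and the MediaURL field are hypothetical helpers standing in for the real backend and indexer calls.

import "context"

// getTokenMedia sketches the lazy caching flow: serve cached media if we have
// it, otherwise resolve it from the token URI/metadata, cache it in GCS, and
// record where it now lives.
func getTokenMedia(ctx context.Context, contract, tokenID string) (Media, error) {
	token, err := findToken(ctx, contract, tokenID) // hypothetical DB lookup
	if err != nil {
		return Media{}, err
	}
	if token.Media.MediaURL != "" {
		return token.Media, nil // cache hit: the "book" is already on the shelf
	}
	raw, err := resolveMedia(ctx, token.TokenURI, token.TokenMetadata) // hypothetical fetch
	if err != nil {
		return Media{}, err
	}
	url, err := storeInGCS(ctx, raw) // hypothetical upload to our GCS bucket
	if err != nil {
		return Media{}, err
	}
	token.Media = Media{MediaURL: url}
	return token.Media, saveToken(ctx, token) // hypothetical DB write
}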

Otherwise, retrieving the media for an NFT requires an HTTP request to the internet or an IPFS request to a pin. Most NFTs fall into one of a few buckets: the token URI or metadata contains an ipfs:// or https:// link, the token URI starts with data:application/json;base64, or the file header when initiating the download denotes a PNG or MP4 file. When all else fails, it is probably because OpenSea is rate-limiting us, since a lot of NFT media content is created and hosted on OS! In case the initial attempt fails, or the media itself gets updated (see: post-drop reveals), we expose an endpoint that lets our backend service manually call indexer functions to fill in missing token URIs and metadata.
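A rough sketch of that bucketing is below; the categories mirror the cases above (the base64 SVG case is one extra common prefix we include for illustration), and the real handling also falls back to sniffing the downloaded file's header to catch PNGs, MP4s, and so on.

import "strings"

// uriKind buckets a token URI before we decide how to fetch the media.
func uriKind(uri string) string {
	switch {
	case strings.HasPrefix(uri, "ipfs://"):
		return "ipfs"
	case strings.HasPrefix(uri, "data:application/json;base64,"):
		return "base64-json"
	case strings.HasPrefix(uri, "data:image/svg+xml;base64,"):
		return "base64-svg"
	case strings.HasPrefix(uri, "https://"), strings.HasPrefix(uri, "http://"):
		return "http"
	default:
		return "unknown" // fall back to sniffing the content itself
	}
}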

Supporting Innovators

Occasionally, in the name of innovation, creators will take risks by defining entirely new formats for their NFT projects. A great example is the Autoglyphs project by Larva Labs, which specifies, within the comments of the contract, how to turn the token URI into a piece of unique art. Supporting projects like this requires indexer engineers to handle their schemas separately from the more common ones; indeed, we have a specific handler for Autoglyphs that interprets and generates the encoded art. This also requires us to be active on Twitter and make a call on which projects will GMI vs. NGMI, so that we can prioritize which contracts to support. On a more serious note, given that our indexer is open source, we look to support any and all projects, and engineers are welcome to open a pull request to handle specific collections.

Luckily, most projects care more about conforming to commonly used standards, like OpenSea's metadata standard, than about being genre-defining.

Pay Your AWS Bills!

A sad truth is that many NFTs point to media content that is broken or no longer available, because they reference centralized storage solutions that went down over time. Throughout the indexing process we encountered hundreds of thousands of NFTs with broken APIs, DNS resolution issues, and/or AWS S3 links for projects that stopped paying their bills. This goes to show how important the storage method is for an NFT project. On a personal note, we love seeing creators innovate and find new ways to keep everything related to an NFT entirely on-chain.

Low-Ground Anakin Skywalker

As we mentioned previously, the purpose of channels in Go is to replace sharing data across concurrent workers with communicating it. We observed that the maps storing data were being shared across goroutines and required manual locking and unlocking. We decided instead to use channels to communicate the data that makes up these maps and to isolate the maps to a single thread: the same one that turns the data into token data structures and inserts them into the database. To ensure that no incoming data interferes with the maps while they are being processed into tokens, we replaced the arbitrary thirty-second interval we had previously used to decide when to move in-memory data into the database with a block interval, letting the indexer completely process a set of logs before inserting them all into the database.
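The core of the refactor can be sketched like this: the maps live in exactly one goroutine, and workers communicate updates over a channel instead of locking a shared map. The types and names here are illustrative, not our exact code.

// ownerUpdate is one observation derived from a processed transfer.
type ownerUpdate struct {
	key   string // contract address + token ID
	owner string
	block uint64
}

// trackOwners is the only goroutine that ever touches the maps. It drains the
// updates channel and, once the channel is closed (the block window is fully
// processed), returns a consistent snapshot ready to be turned into tokens.
func trackOwners(updates <-chan ownerUpdate) map[string]string {
	owners := map[string]string{}
	latest := map[string]uint64{}
	for u := range updates {
		if u.block >= latest[u.key] { // only newer blocks may overwrite
			owners[u.key] = u.owner
			latest[u.key] = u.block
		}
	}
	return owners
}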

Unifying Timing

Replacing the thirty-second interval with a block interval and processing entire sets of logs at once, instead of letting each step run independently, had the unintended, positive side effect of removing a bottleneck. Because discovering information such as token metadata often required an HTTP request to the internet, the step where we used incoming transfers to discover data about a token often took a lot longer than grabbing logs from the blockchain. When each step ran independently, some steps got so far ahead of others that data fell through the cracks. Now that the process is guaranteed to run from start to finish for a set of logs, we do not have to worry about dropping NFTs throughout the indexing process.

Speeding Things Up

In order to meet deadlines from my boss, Mike, the indexer can't take years to finish. Well, it could, because Mike is forgiving and benevolent, but I am not going to take advantage of him. If the indexer waits for five thousand logs to be converted into transfers and then turned into tokens after we've scrounged up everything we can on each NFT, the whole process might legitimately take years to finish. To solve this, we have each processing pipeline running in its own thread, with a maximum of ~10 threads running at any given time, to ensure that the application does not run out of memory or get too far ahead of itself. We also reduced the interval of blocks the indexer uses to pull event logs from 200 blocks to 50. Knowing that only ten pipelines can run at once, the application never looks at more than 500 blocks at a time and only picks up new incoming blocks once one of the sets of 50 blocks has been fully processed and inserted into the database.
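A hedged sketch of that throttling, using a counting semaphore so that at most ten 50-block pipelines are ever in flight; runPipeline stands in for the full get-logs → transfers → tokens → database flow.

import "sync"

const (
	blocksPerWindow = 50
	maxPipelines    = 10
)

// indexRange walks the chain in 50-block windows, never running more than ten
// pipelines at once, so the indexer looks at at most ~500 blocks at a time.
func indexRange(start, latest uint64, runPipeline func(from, to uint64)) {
	sem := make(chan struct{}, maxPipelines) // counting semaphore
	var wg sync.WaitGroup

	for from := start; from <= latest; from += blocksPerWindow {
		sem <- struct{}{} // blocks while ten windows are already in flight
		wg.Add(1)
		go func(from uint64) {
			defer func() { <-sem; wg.Done() }()
			runPipeline(from, from+blocksPerWindow-1)
		}(from)
	}
	wg.Wait()
}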

Cost Considerations

Our main cost sources for the indexer are the node providers (Alchemy), cloud storage (GCS), and cloud servers that run the indexer itself (GCP). While we do store a lot of images and videos from the blockchain as the indexer is running 24/7, storage is cheap, so we pay a relatively minimal cost for GCS and GCP. Our node provider, Alchemy, has made up the bulk of our costs. There is a lot of data to be downloaded to make up every transfer log on the blockchain, and once the indexer is caught up to the most recent block, there are still new logs coming in all of the time. If you’re running your own node, we’d love to hear about your experience and the maintenance required to keep it up.

One problem we ran into was that using the geth library to subscribe to incoming logs from the blockchain was costing us a lot more than expected, because it made many more calls to our node provider than necessary. This prompted us to switch from subscribing over a websocket with geth to simply polling every few minutes for new blocks. Given that geth is one of the standard ways to interact with the blockchain, we were surprised by this issue. It forced us to make an explicit tradeoff: delay the indexer’s accuracy by a few minutes in order to mitigate costs, rather than having it operate perfectly in real time.
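The polling loop that replaced the subscription looks roughly like this; the interval is illustrative, and index stands in for the same pipeline used during backfill.

import (
	"context"
	"time"

	"github.com/ethereum/go-ethereum/ethclient"
)

// pollNewBlocks checks the chain head every few minutes and runs the normal
// pipeline over any blocks we have not indexed yet.
func pollNewBlocks(ctx context.Context, client *ethclient.Client, lastIndexed uint64, index func(from, to uint64) error) {
	ticker := time.NewTicker(3 * time.Minute)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			head, err := client.BlockNumber(ctx)
			if err != nil || head <= lastIndexed {
				continue
			}
			if err := index(lastIndexed+1, head); err == nil {
				lastIndexed = head
			}
		}
	}
}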

Schema Mistakes

At first, we indexed our database with three columns that uniquely identify every NFT, in order to know whether we were inserting a new token into the database or updating a token that had previously been indexed. This unique index was on the contract address, token ID, and owner address columns. It is true that every NFT can be uniquely identified by these three fields, but the downside is that for ERC-721 tokens the database could accidentally end up with multiple rows claiming different owners. If bob.eth owns the token from contract 0x21 with token ID 5 and sells it to jim.eth, there will not yet be a row in our database with contract address 0x21, token ID 5, and owner address jim.eth, so a new row is inserted, leaving the old row with bob.eth as the owner in a broken state. We could just go back and delete the other one, but that left too much room for developer error. To solve this, we instead created separate indexes for ERC-1155 and ERC-721 tokens to ensure that we can never accidentally have two rows for the same ERC-721 token.
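A sketch of the split uniqueness rules as Postgres partial unique indexes, created from Go with database/sql; the table, column, and token-type values here are illustrative, not our exact schema.

import (
	"context"
	"database/sql"
)

// createTokenIndexes creates one uniqueness rule per token standard: ERC-721
// tokens are unique per (contract, token ID), while ERC-1155 tokens can have
// many owners, so the owner address stays part of the key.
func createTokenIndexes(ctx context.Context, db *sql.DB) error {
	stmts := []string{
		`CREATE UNIQUE INDEX IF NOT EXISTS tokens_erc721_idx
		   ON tokens (contract_address, token_id)
		   WHERE token_type = 'ERC-721'`,
		`CREATE UNIQUE INDEX IF NOT EXISTS tokens_erc1155_idx
		   ON tokens (contract_address, token_id, owner_address)
		   WHERE token_type = 'ERC-1155'`,
	}
	for _, stmt := range stmts {
		if _, err := db.ExecContext(ctx, stmt); err != nil {
			return err
		}
	}
	return nil
}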

Special Collections

The indexer not only keeps track of tokens, but also the smart contracts that mint those tokens. For the most part, a contract can also be referenced as the collection for those tokens. For example, all of the Azuki NFTs were minted from one contract and no other NFTs were minted from it. The contract even has some metadata attached to it that labels it as the Azuki NFT smart contract. This makes it easy for Gallery to group tokens into collections based on the contract they came from, which in turn makes it easy to find owners of other tokens from the collections you hold tokens from.

Not every NFT is minted this way, however. There are many smart contracts that mint a wide variety of tokens. The most famous examples are shared contracts like the OpenSea Shared Storefront or the ArtBlocks main token contract. Each of these handles the separation of collections differently, so, much like the way we handle various token metadata differently, we have to build special handlers for these cases so that users on Gallery can differentiate the holders of Squiggles from the holders of Fidenzas on a collection page.

Out of Memory

We mentioned before that the indexer was running on a server that is limited to a maximum of 2 GB of RAM. This can be an issue when you are working with relatively unknown data. A token URI could be very large and when you are concurrently grabbing thousands of token URIs from the blockchain, memory can be eaten up quite fast. Another big unknown is how many tokens are transferred in a block. There are points in time on the blockchain with a massive influx of token transfers that can cause the indexer to give out trying to process all of them.

The hard part about running out of memory is that the error doesn't spawn from our code: there was no panic that pointed us to a specific line, and no log we could trace. The indexer would simply stop, start over, attempt to re-index the problematic block, crash, and so on. To figure out the root of the problem, we used Go's profiling tool, pprof, which can create charts that show exactly where memory is being used in your application as it runs.
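pprof ships with the Go standard library; exposing it takes only a few lines, after which heap profiles can be pulled from the running indexer (the port here is arbitrary).

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof handlers on the default mux
)

func init() {
	go func() {
		// Heap profile: go tool pprof http://localhost:6060/debug/pprof/heap
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()
}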

With this tool, we were able to determine that the majority of memory use was caused by copies in the libraries we were using for database interaction at the insert/update step of the indexing process. This step would try to insert or update every token that it had processed in 50 blocks. By batching the tokens into consistent groups, we were able to control how big the copies were and free up memory more consistently.
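The fix itself is plain chunking: hand the database layer fixed-size batches instead of everything a 50-block window produced at once. In this sketch, upsert stands in for the real insert/update call.

import "context"

// upsertInBatches splits a window's worth of tokens into fixed-size chunks so
// the copies made by the database layer stay small and predictable.
func upsertInBatches(ctx context.Context, tokens []Token, batchSize int, upsert func(context.Context, []Token) error) error {
	for start := 0; start < len(tokens); start += batchSize {
		end := start + batchSize
		if end > len(tokens) {
			end = len(tokens)
		}
		if err := upsert(ctx, tokens[start:end]); err != nil {
			return err
		}
	}
	return nil
}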

We also decided to split the endpoints that let clients interact with the indexer out into their own service, freeing up memory and isolating the indexer from certain points of failure to boost reliability. This service sits as an API layer on top of the indexer for retrieving token and contract data, as well as updating token media content and metadata.

Darth Vader

To sum up, here is how the indexer works from start to finish:

  1. The indexer spins up ten workers, each running concurrently and each starting by sending a get-logs request to Alchemy for an interval of 50 blocks, filtered to only the events related to ERC-721 and ERC-1155 tokens.
  2. In each worker, incoming logs get parsed into a transfer data structure that stores who the transfer was between, along with the token ID, amount, contract address, and block number.
  3. Transfers get pushed onto a channel, where a separate thread in each worker receives them, finds token URIs and metadata from the blockchain and the internet, and keeps a running list of ownership histories, balances, and current owners for each token. Each of these snippets of data gets pushed onto its own channel, which is received from on another thread in each worker.
  4. In the third and final step of a worker's pipeline, that thread receives the data from the previous step and, once all of it has arrived (all channels have been closed), uses it to create token data structures that are then inserted into the database.
  5. Once a worker completes, a new worker is spawned and runs through the pipeline for the next set of 50 blocks, until the indexer has reached the most recent block.
  6. Once the most recent block has been reached by the indexing workers, the indexer polls for new blocks every few minutes to find more transfer logs. Every incoming log is then pushed down the same pipeline used to process logs from the get-logs requests.

Not Our Final Form

The indexer will most likely never be a finished project. There are still many areas we can fine-tune that are non-blocking in terms of usage but would be helpful nonetheless: an easy way to backfill data that was indexed improperly, better monitoring so that we can detect bugs before they are reported, and more ways to handle new contract and token schemas.

Other solutions exist. For those just getting started, we suggest using the Zora or OpenSea API. We ultimately decided to roll our own in the short term because it gave us the most flexibility and control once we understood our own data requirements; as our app evolves, we can continue to refine our engine, rather than submitting a request to a third-party provider to support a use case that’s specific to Gallery and crossing our fingers that it gets prioritized.

Although this project is entirely open source, there are still cloud services costs associated with running an indexer like this that may act as a barrier to developers looking to level up their projects quickly without worrying about all of the blockchain internals. In the future, we may consider opening up our indexer for public use, and collaborating with more platforms to standardize a lot of the topics covered.

Thank you – and We’re Hiring!

This has been How to Build an Indexer (and become Darth Vader) – thanks for coming to my Ted Talk!

A huge thank you to my fellow engineers, Ezra, Jarrel, Robin, and Kaito for helping this indexer come to life with me.

We’re also hiring engineers who’d be down to nerd out over systems like this and get in on the ground floor of building the next social, discovery, and self-expression layer around NFTs. Please reach out to jobs@gallery.so with your GitHub profile if you’re interested!

Written by Benny Conn, Blockchain Engineer @ Gallery. For more stuff like this, follow me on Twitter.
