
How We Store and Search 30 Billion Faces

Overview of a vector database: search and add flows.

Our team of dedicated engineers has been working around the clock to enhance our product’s capabilities, and we have learned a lot along the way. We look forward to sharing some of those challenges and insights with a wider audience. This article is just the start of an exciting series of engineering write-ups. We hope you find it interesting!


The Clearview AI platform has evolved significantly over the past few years, with our database growing from a few million face images to an astounding 30 billion today. As the database has grown, so has the complexity of efficiently processing, storing, and searching this massive amount of data. Our team has embraced this challenge, developing innovative solutions to tackle these technical issues on the deca-billion scale.


The process of generating an embedding vector for a given face.

Face search is a complex process that involves several steps. First, a face image is uploaded as a probe. The image is then preprocessed, which includes alignment and cropping. Next, the preprocessed image is turned into a numeric vector through embedding extraction; we do this using neural networks, which perform significantly better than conventional methods. Finally, the probe’s embedding vector is compared against relevant parts of a database, and the database returns a set of potential matches, if any are present, ranked by their similarity to the input face. Based on similarity thresholds and a human-in-the-loop investigative process, the results can then be used to assist in identifying an individual or verifying their identity.
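To make that flow concrete, here is a minimal sketch of the steps in Python. The helper names (detect_and_align, embed, ann_index) are hypothetical stand-ins rather than our actual APIs; only the ordering of steps mirrors the description above.

```python
import numpy as np

def search_probe(image, detect_and_align, embed, ann_index, k=20, threshold=0.5):
    """Return candidate matches for a probe image, ranked by similarity.
    detect_and_align, embed, and ann_index are hypothetical stand-ins for the
    preprocessing step, the embedding network, and the vector index."""
    face_crop = detect_and_align(image)          # preprocessing: alignment and cropping
    vector = embed(face_crop)                    # neural network -> embedding vector
    vector = vector / np.linalg.norm(vector)     # unit length, so inner product = cosine similarity
    scores, ids = ann_index.search(vector[None, :], k)   # compare against the database index
    # keep only candidates above the search threshold; a human reviews the final results
    return [(int(i), float(s)) for i, s in zip(ids[0], scores[0]) if s >= threshold]
```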


One definition of similarity: “cosine similarity”. The inner product of vectors of unit length can range from -1 to 1, indicating the least to the most similarity. A vector is most similar to itself.
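As a quick worked example of that definition: after normalizing vectors to unit length, cosine similarity is just their inner product.

```python
import numpy as np

def cosine_similarity(a, b):
    """Inner product of unit-length vectors: -1 is least similar, 1 is most similar."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return float(np.dot(a, b))

v = np.array([0.3, -1.2, 0.8])
print(cosine_similarity(v, v))    # ~1.0  -> a vector is most similar to itself
print(cosine_similarity(v, -v))   # ~-1.0 -> opposite direction, least similar
```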

The challenge of scale comes from the comparison step: how do we retrieve and operate on only the small portions of the database that are highly likely to contain potential matches? After all, we don’t want to exhaustively search through the entire database for each probe image. Think about visiting the library for a book such as Alice in Wonderland. You would not walk through the entire library to look for it; instead, you would use the first letters of the title or the author’s name as an index to narrow your search. Of course, libraries today have more sophisticated indexing methods. Fittingly, in the database world, the concept of an index is also widely used.


In our case, the database consists of embedding vectors, and matches are not literal or exact, but instead based on similarity scores and a search threshold, i.e. how close they are to the query vector. Therefore, the index we are looking for is called a nearest-neighbor index. This methodology poses new challenges to the database world and is at the forefront of research and development. Certain trade-offs have to be made for the index to be suitable for production and scalable use, so technically the more appropriate term is an “approximate nearest-neighbor” (ANN) index. This type of index, and the databases that use it, are attracting increasing attention, because embedding vectors are a common form of output from neural networks, whether for computer vision (CV) tasks or natural language processing (NLP) with large language models (LLMs).
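To illustrate the trade-off, here is a minimal sketch using the open-source Faiss library (which appears later in this article), comparing an exhaustive index with an ANN index; the dimensionality, database size, and parameters are illustrative only.

```python
import numpy as np
import faiss

d, n = 128, 100_000                          # illustrative dimensionality and database size
xb = np.random.rand(n, d).astype("float32")
faiss.normalize_L2(xb)                       # unit length, so inner product = cosine similarity

# Exact search: the probe is compared against every vector in the database.
exact = faiss.IndexFlatIP(d)
exact.add(xb)

# Approximate search: vectors are grouped into nlist buckets at train time,
# and only nprobe buckets are scanned per query, trading a little recall for speed.
nlist = 1024
quantizer = faiss.IndexFlatIP(d)
ann = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_INNER_PRODUCT)
ann.train(xb)
ann.add(xb)
ann.nprobe = 16

xq = xb[:1]                                  # reuse a database vector as the probe
print(exact.search(xq, 5))                   # ground-truth nearest neighbors
print(ann.search(xq, 5))                     # approximate result, usually overlapping heavily
```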


Within this nascent and exciting domain, most existing technological solutions tackle problems at the sub-billion scale in terms of data points, typically ranging from a couple million to a hundred million. Embedding vectors are stored in computers’ precious CPU memory rather than on disk, which is far more affordable but much slower. As the field of AI keeps growing, problems above the billion scale have more recently pushed storage and compute limits and, more importantly, the way vectors are indexed in the first place.



UNIQUENESS OF FACE DATA

Embedding vectors representing faces have unique properties, unlike those derived in other, more general contexts. To begin with, they have definitive ground-truth labels: a face either belongs to a certain unique person or it does not; there is no middle ground. In comparison, for general vision or language tasks, an object can belong to multiple conceptual categories: a “chihuahua” is a “dog”, also an “animal”, and oftentimes a “pet”. The facial recognition algorithm tries to group vectors belonging to the same person close together and push those belonging to different people far apart. The high-dimensional space therefore exhibits a very clustered statistical distribution, and the number of clusters is bounded by the finite number of people in the world. We could make use of this property.


Consider all the different face images of the same person. With a good facial recognition algorithm, most of them cluster close together. This insight led us to ask: what if we stored only a few representative vectors in memory, and the rest of the data on fast disks such as solid-state drives (SSDs)? RAM usage could then, in theory, stay bounded while indexing hundreds of billions of face images, instead of growing with every image collected.
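As a toy illustration of the idea (not our production design), imagine keeping one representative vector per cluster in RAM while the member vectors live in cheaper storage; a plain dict stands in for the SSD store here, and the probe is assumed to be unit-normalized.

```python
import numpy as np

representatives = {}   # RAM: grows with the number of clusters (bounded by people)
bucket_store = {}      # "disk": grows with the number of face images collected

def add_cluster(cluster_id, vectors):
    """Keep one normalized mean vector in memory; park the member vectors on disk."""
    mean = vectors.mean(axis=0)
    representatives[cluster_id] = mean / np.linalg.norm(mean)
    bucket_store[cluster_id] = vectors

def candidate_buckets(probe, top_buckets=3):
    """Scan only the small in-memory set, then load just the most promising buckets."""
    ranked = sorted(representatives, key=lambda c: -float(probe @ representatives[c]))
    return {c: bucket_store[c] for c in ranked[:top_buckets]}
```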


The facial recognition algorithm works impressively well in practice, but it can only do so much with a small face crop. As a result, if we want to retrieve 90%+ of all potential matches in the database for a probe face image, we need to cast a very wide net, leading to very expensive resource utilization. Our engineering team faced a significant challenge: supporting a growing vector database approaching 30 billion faces, and potentially many more, with no simple solution available in research or on the market. The institutions that had made progress on such problems were typically large technology companies with extensive resources and human capital. We needed to be creative and achieve this in a cost-effective and strategic way. Later in this article, we will see how our new system tackles this problem.



OUR SYSTEM BEFORE

Before we amassed 30 billion face vectors, we were using existing state-of-the-art solutions, with several modifications and bindings to meet our technological needs. We called the component NNDB, short for Nearest Neighbor DataBase. It was built upon an in-memory ANN index core, orchestrating multiple index shards on a single node, and many nodes behind a “proxy”. Each search query went through the proxy and was executed on the nodes in parallel. Each node then launched a parallel group of searches into its index shards. Results were aggregated at the node stage, and then at the proxy stage, to derive the global nearest neighbors over the full database. To recover the great majority of potential matches for a given probe face image, we had to exhaustively search through 1–2% of the entire database (hundreds of millions of vector comparisons!), occupying hundreds of CPU cores for an extended period of time. Fortunately, our target use case was officers and investigators searching faces for potential leads and completing their due diligence, so latency demands were not as pressing. However, we needed to scale up throughput to support more concurrent users.
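Conceptually, the fan-out and aggregation looked something like the sketch below; the shard objects and their search signature are simplified stand-ins, not the old NNDB’s actual interface.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def search_shards(shards, query, k):
    """Fan the query out to index shards in parallel and merge their top-k results.
    Each shard is assumed to expose search(query, k) -> [(similarity, vector_id), ...]."""
    with ThreadPoolExecutor(max_workers=max(1, len(shards))) as pool:
        partial_results = pool.map(lambda shard: shard.search(query, k), shards)
    # global top-k across all shards, highest similarity first
    return heapq.nlargest(k, (hit for hits in partial_results for hit in hits))
```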


Our NNDB before: index shards were loaded in memory across many expensive nodes. Read nodes and write nodes were maintained separately.

More challenging was that all the vector data had to be stored in memory. Despite heavy numerical compression, the vector data still took up multiple TBs of space. In addition, due to read-write locking, we needed to add new data to a write-only node until it was full before bringing it online. This not only made loading and unloading during maintenance cycles very tedious, but also prevented us from treating the component like an actual database that would support frequent read and write operations.


Once our database surpassed a few billion vectors, the difficulties started to compound and quickly became painful. Needless to say, we needed a better alternative!



PICKING THE NEXT SOLUTION

In the past three years, a few solutions have emerged that aim to tackle billion-scale vector search problems. At this greater scale, CPU memory as a precious resource had to give way to cheaper SSDs. However, the same algorithms suitable for in-memory computation at smaller scales struggled with the new paradigm, because retrieving data from SSDs incurs significant random-access and throughput overhead, on top of a vector comparison cost that was already growing prohibitively. To put it plainly, if the vast majority of vector data is moved out of memory to become part of a proper on-disk database, there needs to be a more effective and manageable indexing mechanism that reduces the search space substantially, so that only a very small portion of the entire database is visited during each search.


In their quest, researchers studying information retrieval discovered the power of graph-based approaches. Notably, Hierarchical Navigable Small World (HNSW) graphs form an index structure that remains very accurate at scale. It does come at the cost of extra space, and it needs to be kept in memory due to its access pattern. However, once built on a given dataset and having learned its statistical distribution, it can serve as a compass that drives the on-disk placement and search of far more vectors. Building a huge database can then be divided into two stages.


Stage 1: the in-memory graph index assigns each incoming vector to a bucket.


Stage 2: the actual vectors are stored in their assigned buckets on disk.


Searching follows a similar path. By using a graph index as the assigner (or coarse quantizer, in other conventions) that is representative enough of the underlying data, the requirements mentioned earlier can be met.
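A minimal sketch of this pattern with open-source Faiss follows; the factory string builds an inverted file whose coarse quantizer is an HNSW graph, and all sizes and parameters are illustrative. Note that Faiss keeps the bucket contents in memory by default; moving them to disk is exactly the job of the storage backend described in the next section.

```python
import numpy as np
import faiss

d, nlist = 128, 4096                          # illustrative dimensionality and bucket count
train = np.random.rand(200_000, d).astype("float32")

# "IVF4096_HNSW32,PQ16": an inverted file with 4096 buckets whose assigner is an
# HNSW graph (32 links per node), storing 16-byte compressed codes per vector.
index = faiss.index_factory(d, "IVF4096_HNSW32,PQ16")
index.train(train)   # learn the bucket centroids (the assigner) and the compression

# Stage 1 + 2: each added vector is routed by the in-memory assigner to a bucket,
# where its compressed code is stored (in memory here, on disk in the real system).
index.add(train)

# Searching follows the same path: the assigner suggests nprobe buckets to scan.
faiss.extract_index_ivf(index).nprobe = 64
distances, ids = index.search(train[:1], 10)
```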



ANNOUNCING THE NEW NNDB

So we solved our deca-billion scale face search problem by building our new vector database, the new NNDB! We designed our new system around the concept of an “inverted file” serving as the on-disk storage backend, which we call BucketDB. Note that “inverted file” here refers to the broad indexing data structure, rather than the specific keyword-based document retrieval method. It comes paired with a very large graph index serving as the in-memory assigner. This index was carefully constructed to represent the demographic diversity of human faces. The assigner’s job is to suggest the nearest K buckets, out of many more, to visit during a search request, and to designate a bucket to store an embedding vector during an add request. These requests, along with their assigned buckets, are passed to BucketDB for the actual reads and writes.


Our new NNDB: shards are on fast SSDs on only a few nodes. There is a separate assigner index to narrow down the search. Reads and writes can be combined with minimal locking.

This new architecture lets us trim compute cost by 80% while improving throughput by 10x, with even more room to save as more data is added. Moreover, it is more accurate in matching and vector reconstruction, thanks to the lighter compression afforded by abundant disk storage, and it handles long-tail queries gracefully. Reads and writes are unified as well, with minimal locking. Because of its on-disk nature, it can be managed more like a proper database for maintenance purposes.


Notably, the BucketDB was made possible by the combination of two open source projects, Faiss and RocksDB, both of which are brilliant in their own design. Faiss is a powerful library for efficient similarity search and clustering of high-dimensional vectors. It provides the inverted file structure, various SIMD-enabled (de-)compression/comparison algorithms, and parallel bucket scanning for nearest neighbors. On the other hand, RocksDB is a high-performance embedded key-value store that provides fast, scalable, and reliable storage and access of data. We created the binding layer that enabled Faiss’s inverted file structure to read from and write to RocksDB in parallel. By leveraging the strengths of both open source projects, we were able to create a scalable vector database, a winning duo forged into one.
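To give a flavor of what such a binding layer does, the conceptual sketch below maps each bucket to key-value entries holding its vector ids and compressed codes. This is not our actual implementation, which lives at the C++ level behind Faiss’s InvertedLists interface; a plain dict stands in for RocksDB.

```python
import numpy as np

class KVInvertedLists:
    """Conceptual sketch: store each bucket's ids and compressed codes as
    key-value entries. A dict stands in for the key-value store (RocksDB)."""

    def __init__(self, kv_store, code_size):
        self.kv = kv_store            # a RocksDB handle in the real system
        self.code_size = code_size    # bytes per compressed vector

    def add_entries(self, bucket_id, ids, codes):
        # append to the bucket's two entries (ids and codes grow together)
        self.kv[f"ids/{bucket_id}"] = self.kv.get(f"ids/{bucket_id}", b"") + ids.tobytes()
        self.kv[f"codes/{bucket_id}"] = self.kv.get(f"codes/{bucket_id}", b"") + codes.tobytes()

    def get_entries(self, bucket_id):
        # read a whole bucket back for scanning during a search
        ids = np.frombuffer(self.kv.get(f"ids/{bucket_id}", b""), dtype=np.int64)
        codes = np.frombuffer(self.kv.get(f"codes/{bucket_id}", b""), dtype=np.uint8)
        return ids, codes.reshape(-1, self.code_size)

# usage: one bucket, one vector id, one 16-byte code
lists = KVInvertedLists({}, code_size=16)
lists.add_entries(42, np.array([7], dtype=np.int64), np.zeros((1, 16), dtype=np.uint8))
print(lists.get_entries(42))
```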


The use of open source projects like Faiss and RocksDB highlights the importance of collaboration and innovation in the engineering field and development of new technologies. We are grateful to the creators and contributors of these projects for their dedication and expertise, which have enabled us to build a database that adapts to our needs of efficiently storing and searching faces at the deca-billion scale and beyond. At Clearview AI, we are committed to making a difference in the world by providing groundbreaking facial recognition technology. We believe that the innovation in this space has the potential to save lives, exonerate the innocent, and make our communities safer. We are excited about the possibilities that lie ahead and look forward to continuing our journey.


___


TERENCE LIU

Vice President, Machine Learning & Research

Terence Liu is an expert in building and deploying machine learning models, neural vector search engines, and distributed computing/data platforms. In 2021, he created the facial recognition algorithm at Clearview AI that ranked #1 in the US and #2 in the world among top companies and research institutions. In 2022, he drove the research and implementation of a vector search engine to power Clearview AI’s ever-growing data volume. Along the way, he designed multiple compute-intensive pipelines and inference platforms. Terence previously worked at Bloomberg LP and earned a Ph.D. in Computational Physics from the University of Toledo.
