If you are a developer interested in AI, there is a good chance that you may have started working with Large Language Models such as OpenAI’s GPT-4, Google’s PaLM-2 or Meta’s LLaMa to build intelligence into your application, especially semantic search. The search space has been completely upended by the power LLMs offer to find exactly what users want based on the meaning of their queries versus matching keywords.
I’ve spent the last decade building semantic search for skills with OpenEd and Empath, and have also built a free, open source semantic legal search engine. To handle problems of this scale, it is essential to be able to find the closest matching vectors for your queries (or other text) within a large corpus of embedding vectors. Relational databases can be extended for this problem, but they have limitations in search speed and are not necessarily optimized for vector data, in either their storage mechanisms or their indexing approaches.
As a result, vector databases have exploded in popularity. But I don’t think how they compare to each other is widely understood yet. To enable a more meaningful comparison of products, I put together a small open source project around a semantic search problem that is small enough to be easily usable by others, but large enough to make performance characteristics meaningful. The problem is finding the skills related to jobs based on job descriptions, where the skills also have descriptions. Empath does something similar, albeit in a more complex way (our core model infers skills for employees based on each employee’s digital footprint). This project simply uses vector distance to find the best skills for each job, yet it still needs the power of a vector database to perform sufficiently fast and accurately. The code is available here.
We will use this project to evaluate vector databases in four areas: overall feature set and capabilities, code flavor of the API, load and indexing performance, and search performance.
Products Tested
We chose the following products to test although the application could easily be extended to other ones: Pinecone, Weaviate, Milvus and Postgres pgvector. We included Pinecone because of its apparent popularity. We included Weaviate because of its robust unique feature set including GraphQL and a console featuring a query DSL. Milvus bills itself as a high performance product with extreme scalability. Finally, we included the pgvector add-on for Postgres to explore the pros and cons of extending a relational database to handle vector data.
Overview of Application
We are going to load a set of skill descriptions into a vector database. Then we will process a set of job descriptions and create a set of skills associated with each job, based on the skills’ similarity as measured by vector cosine distance. We will measure the loading time of each database, and then the search time to find relevant skills for each job. Note that we are not generating embedding vectors in this application. It is assumed that you will use your LLM of choice to generate those vectors and supply them in files as shown in the last section. The performance conclusions we draw do not depend on the method used to generate the embedding vectors (as long as the dimensionality is close to the 512-dimension vectors used here).
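As a point of reference for what every database in this comparison is approximating, the core operation can be written as a brute-force search in NumPy. This is only a sketch and not the project’s code: the function name top_k_by_cosine and the tiny 3-dimensional toy data are assumptions made for illustration.

```python
import numpy as np

def top_k_by_cosine(query, corpus, k=10):
    """Return (indices, similarities) of the k corpus rows closest to query.

    This is an exhaustive (exact) search: every row is compared with the query.
    """
    query = np.asarray(query, dtype=float)
    corpus = np.asarray(corpus, dtype=float)
    # Cosine similarity = dot product divided by the product of the norms.
    sims = corpus @ query / (np.linalg.norm(corpus, axis=1) * np.linalg.norm(query))
    # Highest similarity first; cosine distance is 1 - similarity.
    order = np.argsort(-sims)[:k]
    return order, sims[order]

# Toy stand-ins for skill vectors and one job vector (3 dimensions, not 512).
skills = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [1.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0]])
job = np.array([1.0, 0.1, 0.0])
indices, similarities = top_k_by_cosine(job, skills, k=2)
```

A vector database replaces the exhaustive comparison with an index so that the same question can be answered without touching every vector.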
Load Performance
To load each vector database, we have a set of scripts entitled load_skill_vectors_<product name>.py that we execute individually as shown below.
Pinecone
Here are the results of the skill vector loading with Pinecone. To improve load time, vectors can be loaded in batches of up to 1000 vectors.
$ python load_skill_vectors_pinecone.py
3680 vectors uploaded to the skills index in 15.825818061828613 seconds
The resulting load performance is 4.3 milliseconds per vector.
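The batching idea can be sketched independently of any client library. The helper name batched and the placeholder records below are assumptions for illustration; in a real loader, each batch would be handed to the database client’s upsert call.

```python
def batched(items, batch_size=1000):
    """Yield consecutive slices of items, each at most batch_size long."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# 3680 placeholder records standing in for (id, vector) pairs.
records = [(f"skill-{i}", [0.0] * 4) for i in range(3680)]
batch_sizes = [len(batch) for batch in batched(records)]

# Per-vector load time for the run above: total seconds / vector count.
ms_per_vector = 15.825818061828613 / 3680 * 1000
```

With 3680 vectors and a batch size of 1000, this yields three full batches and one final batch of 680.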
Weaviate
Similarly, Weaviate can be loaded in batches of 100 vectors, but Weaviate handles the actual batching of the data for you. The resulting performance is similar: 5 milliseconds per vector.
Milvus
Milvus has a separate step for creating the index, which the other products do for you.
$ python load_skill_vectors_milvus.py
Create collection: skills
Result of data insert: (insert count: 3680, delete count: 0, upsert count: 0, timestamp: 445502039079780354, success count: 3680, err count: 0) in 36.308228969573975 seconds
3680 vectors uploaded in 37.53485727310181 seconds
Milvus load performance is roughly 10 milliseconds per vector. Note that Milvus as hosted by Zilliz, as of this writing, does not allow use of cosine distance, only L2 (Euclidean distance). All the other products use cosine distance by default, and for my applications cosine distance performs better. If you need cosine distance, you will need to normalize the vectors before insertion, which will add some time to the load.
UPDATE: Milvus now has cosine distance supported.
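The normalization workaround relies on the fact that, for unit-length vectors a and b, the squared L2 distance equals 2 - 2 * cos(a, b), so ranking by L2 distance gives the same order as ranking by cosine similarity. Here is a minimal sketch of that, assuming NumPy; the function name l2_normalize and the random test data are illustrative only.

```python
import numpy as np

def l2_normalize(vectors):
    """Scale each row to unit length; all-zero rows are left unchanged."""
    vectors = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero
    return vectors / norms

rng = np.random.default_rng(0)
corpus = rng.normal(size=(50, 8))
query = rng.normal(size=8)

unit_corpus = l2_normalize(corpus)
unit_query = l2_normalize(query[None, :])[0]

# Ranking by L2 distance on the unit vectors...
l2_order = np.argsort(np.linalg.norm(unit_corpus - unit_query, axis=1))
# ...matches ranking by cosine similarity on the original vectors.
cosine = corpus @ query / (np.linalg.norm(corpus, axis=1) * np.linalg.norm(query))
cosine_order = np.argsort(-cosine)
```

Note that the query vector has to be normalized in the same way as the stored vectors.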
pgvector
pgvector performance in loading vectors and indexing them is roughly 95.5 milliseconds per vector (about 351.5 seconds for 3680 vectors). Unsurprisingly, Postgres is the laggard for loading data, as Postgres has not been optimized for storing or indexing high dimensional vector data. Postgres also requires explicit index creation.
$ python load_skill_vectors_pg.py
3680 vectors uploaded in 351.4945948123932 seconds
Index created in 0.277188777923584 seconds
Load Performance Summary
Weaviate and Pinecone lead in load performance. For my scenarios load performance is not the primary criterion, but for some of you it may be a significant factor.
Search Performance
This is the most important factor for my applications, and I suspect for many of you. We will execute a single script, skills_for_jobs.py, which performs all of the searches. We iterate through all of the jobs and search for the top ten vectors closest to the job description’s embedding vector. The quality of the search is evaluated by comparing the vectors’ cosine similarity to that of the best skill vector as determined by Exact Nearest Neighbor (ENN) search. Fortunately, pgvector has an option for ENN search to help with this. So we do two pgvector searches: one for Approximate Nearest Neighbors (as we do with all the other vector DBs) and one to get the “best skill vector”.
Pinecone has an average search time of 0.88 seconds. Our adopted search quality measure is the average cosine similarity of the top ten skills retrieved. This is 0.03 in our example. Weaviate’s query time is much better at 0.12 seconds per search. The quality is similar at an average of 0.03 per vector.
Milvus searches take an average of 0.95 seconds, which is a bit surprising. It could no doubt be improved by dedicating more compute resources to search. But we are comparing default options here. The quality is unsurprisingly a bit lower, 0.028, no doubt due to the Milvus indexes using L2 not cosine distance, as discussed. Postgres search time is 0.9 seconds and the quality is the highest at 0.08.
UPDATE: Zilliz (Milvus) claims that the search time is completely dependent on network latency. I don’t believe this explanation holds, given that the tests were run from the same location over thousands of searches. Also, search quality may be improved by using cosine distance as a metric, which is now supported.
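To make the two measurements concrete, here is one way the per-search time and the average cosine similarity of the retrieved vectors could be computed. This is a sketch of one reading of the method described above, not the code in skills_for_jobs.py; all function names are hypothetical, and fake_search stands in for a real database query.

```python
import math
import time

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length sequences of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def average_similarity(job_vector, retrieved_vectors):
    """Mean cosine similarity between a job vector and the vectors a search returned."""
    total = sum(cosine_similarity(job_vector, v) for v in retrieved_vectors)
    return total / len(retrieved_vectors)

def timed(search, job_vector):
    """Run search(job_vector) and return (results, elapsed_seconds)."""
    start = time.perf_counter()
    results = search(job_vector)
    return results, time.perf_counter() - start

# Hypothetical stand-in for a database query: returns two fixed vectors.
def fake_search(job_vector):
    return [[1.0, 0.0], [0.0, 1.0]]

job = [1.0, 0.0]
results, seconds = timed(fake_search, job)
quality = average_similarity(job, results)
```

Averaging both numbers over all jobs gives one search time and one quality figure per database.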
Coding Experience
I encourage you to peruse the scripts mentioned to get a flavor of each of the different APIs, especially the skills_for_jobs.py script (which incorporates each product’s search calls). Pinecone’s API is quite simple and straightforward for the most common and simple scenarios. Weaviate’s approach of chaining methods together to set options is quite powerful, especially for moving beyond the default options. It should also appeal to JavaScript developers.
Milvus is straightforward enough. However to move between various hosting options different authentication code is required. This is not acceptable devops practice. They need to address this.
pgvector’s extensions of SQL to perform insertion or search will appeal to Postgres developers and perhaps database-oriented developers as well. But the typical data scientist may find it awkward, since it takes more coding effort to get simple tasks done. The Python package pgvector-python can alleviate some of the pain. I chose not to use it, in order to stay close to the “native pgvector experience”.
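To give a flavor of what the SQL looks like, here is a sketch of pgvector statements held as Python strings. It is not the project’s code. The table name, column names and the choice of an IVFFlat index are assumptions, and the %s placeholders assume a psycopg-style driver. As far as I know, pgvector performs exact search when no index is used and approximate search once an index exists.

```python
def to_pgvector_literal(vector):
    """Format a sequence of numbers as pgvector's text form, e.g. '[1.0,2.0]'."""
    return "[" + ",".join(str(float(x)) for x in vector) + "]"

# Hypothetical schema: table and column names are illustrative only.
CREATE_TABLE = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS skills (
    id SERIAL PRIMARY KEY,
    name TEXT NOT NULL,
    embedding VECTOR(512)
);
"""

# Approximate index over cosine distance (assumed IVFFlat; lists is a tuning knob).
CREATE_INDEX = """
CREATE INDEX ON skills USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
"""

# <=> is pgvector's cosine distance operator; smaller means closer.
INSERT_SKILL = "INSERT INTO skills (name, embedding) VALUES (%s, %s::vector);"
TOP_SKILLS = "SELECT name FROM skills ORDER BY embedding <=> %s::vector LIMIT 10;"

# Parameters for one insert, ready to pass to cursor.execute(INSERT_SKILL, params).
params = ("Python programming", to_pgvector_literal([0.25, 0.5, 1]))
```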
Features
The unique capabilities of Weaviate stand out here. It has a GraphQL query language for more complex queries and for retrieving the data attached to the vectors. It allows defining classes of data to be attached to vectors, which makes it more likely that it can be used in lieu of a relational database, instead of alongside one. Weaviate supports the Hierarchical Navigable Small Worlds (HNSW) indexing method, as well as Product Quantization along with HNSW, which can reduce storage requirements. Weaviate also supports “hybrid search” based on sparse embeddings (such as TF/IDF from keywords) and dense embeddings (such as those from LLMs). Hybrid search supports scenarios where having some weight on keyword matches (not just semantic similarity from LLM embedding vectors) is helpful. Weaviate can list all the contents of an index, but does not offer Exact Nearest Neighbor search.
Pinecone supports parallel upserts which can be useful in some scenarios, though these loading scripts shown here could be parallelized easily enough with other databases at the devops level (running multiple scripts). Pinecone supports namespaced data, which can be useful, but presumably is used most often where Weaviate uses classes for different types of data. Pinecone recently added support for hybrid search.
Pinecone also has some significant limitations. Most importantly, it is not open source. It is the only product of the ones listed that is closed source. For some of my scenarios, I need to host the vector database myself for data privacy reasons. Also, you cannot enumerate the contents of the entire index. This can make it difficult to validate the results coming back, at least without another database alongside it to retrieve all the vectors (as we have here). Pinecone also has no Exact Nearest Neighbor search (though only Postgres has this among these products). Pinecone also offers no index choice and does not even describe its indexing method (although it is implied from their various blog posts that it is HNSW).
Milvus’ biggest unique feature is the fine-grained scalability. You can configure how much processing is devoted to loading data versus indexing data versus searching. This is certainly very powerful for scaling under large loads and guaranteeing acceptable times for load, index and query, while not spending unnecessarily on compute resources. It is also the only product that supports the new DiskANN indexing method, which should yield better results at large scale. To keep the comparison similar we stayed with the default HNSW indexing method. I did limited testing of DiskANN though out of curiosity and the results were not better. This is something I plan to investigate with larger data sizes. Milvus now has the ability to search for matches to multiple vectors which is interesting. But the primary use case for this is not quite apparent to me (RAG Fusion?) and does not appear in my example application.
Milvus has some limitations as well. It requires an explicit index creation step, whereas Pinecone and Weaviate create the index implicitly. Also, as mentioned, having to change authentication code between different hosting options isn’t really acceptable. The documentation and examples are a bit scattered; you actually have to download a zip file to get all the examples. Finally, Milvus does not have hybrid search, only the ability to filter vectors based on annotated attributes of those vectors, which is definitely not the same thing (though they have a blog post that implies that they think it is an acceptable substitute). This lack of hybrid search is probably the most serious issue in general, and certainly for my applications.
Postgres has the obvious advantage of allowing a single store for all of your data, vectors and otherwise. You can do this with Weaviate by defining classes of objects and annotating the vectors with them. But that is not the same as allowing relational search on that data. Milvus and Pinecone allow adding attributes to vectors. But this is even further from having all of your data in a relational database. Postgres also effectively supports hybrid search, because you can perform TF/IDF search (using tsvector as described here) and then weight the TF/IDF results along with your dense embedding results as you deem appropriate for your application.
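The weighting step can be as simple as a linear blend of the two scores. The sketch below is one possible approach and not something prescribed by Postgres or pgvector: the min-max rescaling, the alpha parameter and the example scores are all assumptions for illustration.

```python
def min_max(scores):
    """Rescale a {doc_id: score} mapping to the range [0, 1]."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - low) / (high - low) for doc, s in scores.items()}

def hybrid_rank(keyword_scores, dense_scores, alpha=0.5):
    """Rank documents by alpha * dense score + (1 - alpha) * keyword score."""
    keyword = min_max(keyword_scores)
    dense = min_max(dense_scores)
    docs = set(keyword) | set(dense)
    combined = {
        d: alpha * dense.get(d, 0.0) + (1 - alpha) * keyword.get(d, 0.0)
        for d in docs
    }
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)

# Made-up scores: keyword relevance from full-text search, similarity from embeddings.
keyword_scores = {"sql": 0.9, "python": 0.2, "excel": 0.1}
dense_scores = {"sql": 0.60, "python": 0.80, "excel": 0.40}
ranking = hybrid_rank(keyword_scores, dense_scores, alpha=0.7)
```

With alpha at 0.7 the dense score dominates, so "python" ranks above "sql" even though "sql" has the stronger keyword match. Lowering alpha shifts weight toward the keyword results.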
Postgres also allows Exact Nearest Neighbor search and enumerating the contents of an index. Postgres has the slowest load time, and does not have the fastest search. It also requires explicit index creation, like Milvus does. But that is the general expectation with Postgres, so it might be inconsistent to have implicit index creation in this environment.
Cost
We hosted the Pinecone index on their lowest compute instance type, which they call a “pod”. It is referred to as s1.x1 and costs $70 per month. We hosted Weaviate on their “Standard — Performance-Optimized” instance, for $25 per month. We hosted Milvus on Zilliz’s “Standard — Performance-Optimized” instance, for $99 per month. We hosted Postgres and pgvector on StackHero’s Heroku add-on at the 10GB disk, 2GB RAM level for $19 per month. Weaviate wins the price/performance comparison rather handily here.
Conclusions
For most of my applications, I plan to run Weaviate due to its fast search, robust features such as GraphQL queries, and price/performance. For applications that need to scale to huge volumes of vectors and/or users, I will investigate tuning Milvus’ options on a case by case basis. When I need to mix vector data and structured data in an application that isn’t heavily sensitive to load and query speed, I will use Postgres. I believe these guidelines should transfer well to most medium to large volume vector semantic search applications.
Running the Application Yourself
The skills and skill vectors are contained in the files skill_list.csv and skill_vectors.npy. We do not supply these files in the project. If you want to reproduce the results, you can make up your own skill list, grab some open source skill descriptions from the Open Skills Network or elsewhere, or use ChatGPT to generate skill descriptions for your skills of interest. My test used Empath’s skills taxonomy, known as the Empath Proficiency Library, which is not open source. Once you have your skill list, store it in skill_list.csv. Then generate the embedding vectors for the skills using your LLM of choice and store them as a NumPy array in the file skill_vectors.npy. Both of these files should be stored in the ./data subdirectory of the directory that you cloned this project into. If there is interest, I would consider repeating this experiment with the OSN skills content and their job descriptions.
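Writing the two files might look like the sketch below. The CSV column layout (name, description) and the function name save_skill_data are assumptions, since the expected format is not spelled out here, so check the loading scripts before relying on it. The random vectors are placeholders for real embeddings.

```python
import csv
import os

import numpy as np

def save_skill_data(skills, vectors, data_dir="data"):
    """Write skill_list.csv and skill_vectors.npy into data_dir."""
    vectors = np.asarray(vectors, dtype=np.float32)
    if len(skills) != vectors.shape[0]:
        raise ValueError("need exactly one vector per skill")
    os.makedirs(data_dir, exist_ok=True)
    csv_path = os.path.join(data_dir, "skill_list.csv")
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["name", "description"])  # assumed column layout
        writer.writerows(skills)
    # Row i of the array is the embedding for row i of the CSV.
    np.save(os.path.join(data_dir, "skill_vectors.npy"), vectors)

skills = [("Python", "Writing programs in Python."),
          ("SQL", "Querying relational databases.")]
# Random placeholders; real vectors would come from your embedding model.
vectors = np.random.default_rng(0).normal(size=(len(skills), 512))
```

Calling save_skill_data(skills, vectors) from the project directory would then place both files under ./data.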
Once you have these files, we will load the skill vectors into each vector database. You will want to sign up for each of these products: Pinecone at pinecone.io, Weaviate at weaviate.com, and Milvus with a Milvus hosting provider such as Zilliz. We used a Heroku add-on called StackHero to get a hosted instance of Postgres with pgvector support. You will get URLs for each server and credentials to log in. The application expects the following environment variables to be configured with these values:
- PINECONE_API_KEY
- PINECONE_ENV
- PINECONE_SKILLS_INDEX
- WEAVIATE_USER
- WEAVIATE_PASSWORD
- WEAVIATE_CLUSTER
- MILVUS_URL
- MILVUS_PASSWORD
- MILVUS_API_KEY
- STACKHERO_POSTGRESQL_HOST
- STACKHERO_POSTGRESQL_ADMIN_PASSWORD
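Before running the scripts, it can save time to confirm that all of these are set. The following is a small convenience sketch and not part of the project; the helper name missing_variables is made up for illustration.

```python
import os

REQUIRED_VARS = [
    "PINECONE_API_KEY",
    "PINECONE_ENV",
    "PINECONE_SKILLS_INDEX",
    "WEAVIATE_USER",
    "WEAVIATE_PASSWORD",
    "WEAVIATE_CLUSTER",
    "MILVUS_URL",
    "MILVUS_PASSWORD",
    "MILVUS_API_KEY",
    "STACKHERO_POSTGRESQL_HOST",
    "STACKHERO_POSTGRESQL_ADMIN_PASSWORD",
]

def missing_variables(env=os.environ):
    """Return the required variable names that are unset or empty in env."""
    return [name for name in REQUIRED_VARS if not env.get(name)]

missing = missing_variables()
if missing:
    print("Missing environment variables: " + ", ".join(missing))
```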
Next Steps
I plan to add some more vector databases to the mix, including Couchbase. Let me know if there are any of interest. I also may add specific skills and jobs data, such as those from the Open Skills Network. I also plan to increase the vector corpus to see if DiskANN improves performance in that scenario.
Please send any comments or suggestions to adam@empath.net. Pull requests implementing features are appreciated.