New tool brings vector search to the scientific literature
Lauren Smith
Feb 20, 2025
[Image: Papers being flipped through on a background of columns of 0s and 1s.]
In fast-moving fields like machine learning, carbon reduction, and climate science, hundreds of papers can be published each week, and many researchers find it difficult to stay on top of new developments. John Kitchin created a tool to better manage the scientific literature.
Kitchin, a professor of chemical engineering, wanted a way to find relevant papers more easily than with available search engines. His tool, named litdb, is a Python package that enables users to curate a database of papers and then search it with a variety of text and semantic search options, including full text and vector search.
"Litdb helps you curate and use your collection of scientific literature," says Kitchin. It can be used to collect older articles as well as keep up with new articles.
To get started, Kitchin recommends using the Digital Object Identifier (DOI) of one paper to find other papers. References, citing papers, and related papers can be downloaded into the database. Users can also add articles from an author or run a query against OpenAlex, a free, open-source, global index of scientific papers and authors. Litdb stores the title and abstract for each paper and then generates an embedding vector for each one.
"Your litdb starts out empty. You have to add articles that are relevant to you," says Kitchin. He adds that the best way to build the database depends on your goal. "You have to compromise on breadth and depth with the database size."
Users can also include local files in their litdb: any file that can be converted to text can be added.
Once a user has curated their database, they can search within it and retrieve references and related papers for each of the top results. The ollama GPT integration allows users to search and interact through natural language queries.
A user's query is converted to a high-dimensional embedding vector, and then a vector search identifies entries in the database that are similar to the query. The system downloads the citations, references, and related entries for the initial results, and then repeats the query. The iterative search continues until the user instructs it to stop or the system doesn't find any new results.
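The iterative search described above can be illustrated with a small sketch. This is not litdb's code: the corpus, the paper IDs, and the function names are hypothetical, a word-count vector stands in for a learned embedding model, and an in-memory dictionary stands in for OpenAlex and the local database. It only shows the loop of searching, pulling in related entries for the top results, and repeating until nothing new turns up.

```python
import math
from collections import Counter

# Hypothetical toy index standing in for an external literature source.
# The IDs, texts, and "related" links are invented for illustration.
CORPUS = {
    "p1": {"text": "catalyst design with machine learning", "related": ["p2", "p3"]},
    "p2": {"text": "machine learning potentials for catalyst surfaces", "related": ["p4"]},
    "p3": {"text": "climate policy and carbon markets", "related": []},
    "p4": {"text": "neural network catalyst screening", "related": ["p1"]},
}


def embed(text):
    """Toy embedding (assumption): word counts instead of a learned model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def top_k(query_vec, db, k):
    """Return the k database IDs most similar to the query vector."""
    return sorted(db, key=lambda pid: cosine(query_vec, db[pid]), reverse=True)[:k]


def iterative_search(query, seed_ids, k=2, max_rounds=10):
    """Search, add related entries of the top hits, and repeat until no new entries appear."""
    db = {pid: embed(CORPUS[pid]["text"]) for pid in seed_ids}
    query_vec = embed(query)
    for _ in range(max_rounds):
        hits = top_k(query_vec, db, k)
        # Collect related entries of the current hits that are not yet stored.
        new_ids = list(
            dict.fromkeys(
                rel for pid in hits for rel in CORPUS[pid]["related"] if rel not in db
            )
        )
        if not new_ids:
            break  # nothing new was found, so the search has converged
        for rel in new_ids:
            db[rel] = embed(CORPUS[rel]["text"])
    return top_k(query_vec, db, k), sorted(db)


hits, stored = iterative_search("machine learning catalyst", ["p1"])
print(hits, stored)
```

Starting from a database that holds only the hypothetical entry p1, the loop pulls in the other three toy entries over successive rounds and then stops once a round adds nothing new. The `max_rounds` cap plays the role of the user telling the search to stop.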
Litdb uses OpenAlex for searching the scientific literature and libSQL to store results in a local database. One of Kitchin's motivations for building litdb was an interest in working locally. He has since created a litellm integration for users who want to use a cloud large language model (LLM).
From their search results, a user can export citation strings or BibTeX entries to produce reference lists.
Kitchin notes that one limitation to litdb is that the search can only return literature and documents that are in a user's database: "If you haven't added it to your litdb, you won't find it." Kitchin suggests a mesh of approaches to cover the most likely papers and minimize the risk of missing applicable work. He follows authors; adds references, related papers, and citing papers from the most relevant papers; uses text search filters; and adds interesting papers he finds on social media, along with their references, related papers, and citing papers.
Kitchin is continuing to add features to litdb, including audio search and image search that accepts both text and image queries. He provides video tutorials on his YouTube channel.
For media inquiries, please contact Lauren Smith at lsmith2@andrew.cmu.edu.