📝 Guest Post: I Built a Deep Research with Open Source – and So Can You!
In this guest post, Stefan Webb, Developer Advocate at Zilliz, builds a lightweight "Deep Research" clone using open-source tools. With Milvus, DeepSeek, and LangChain, he prototypes an agent that can reason, plan, retrieve from Wikipedia, and write a basic report - all in a few hours, no API calls needed.

Well, actually, a minimally scoped agent that can reason, plan, use tools, etc. to perform research using Wikipedia. Still, not bad for a few hours of work…

Unless you reside under a rock, in a cave, or in a remote mountain monastery, you will have heard about OpenAI's release of Deep Research on Feb 2, 2025. This new product promises to revolutionize how we answer questions that require synthesizing large amounts of diverse information. You type in your query, select the Deep Research option, and the platform autonomously searches the web, reasons over what it discovers, and synthesizes multiple sources into a coherent, fully cited report. It takes several orders of magnitude longer to produce its output than a standard chatbot, but the result is more detailed, more informed, and more nuanced.

How does it work?

But how does this technology work, and why is Deep Research a noticeable improvement over previous attempts (like Google's Deep Research - incoming trademark dispute alert)? We'll leave the latter for a future post. As for the former, there is no doubt much "secret sauce" underlying Deep Research. We can glean a few details from OpenAI's release post, which I summarize here:

- Deep Research exploits recent advances in foundation models specialized for reasoning tasks.
- Deep Research makes use of a sophisticated agentic workflow with planning, reflection, and memory.
- Deep Research is trained on proprietary data, using several types of fine-tuning, which is likely a key component in its performance.
The exact design of the agentic workflow is a secret; however, we can build something ourselves based on well-established ideas about how to structure agents.

One note before we begin: it is easy to be swept away by Generative AI fever, especially when a new product that seems like a step improvement is released. However, Deep Research, as OpenAI acknowledges, has limitations common to Generative AI technology. We should think critically about the output: it may contain false facts ("hallucinations") and incorrect formatting or citations, and its quality can vary significantly with the random seed.

Can I build my own?

Why certainly! Let's build our own "Deep Research", running locally and with open-source tools. We'll be armed with just a basic knowledge of Generative AI, common sense, a couple of spare hours, a GPU, and the open-source Milvus, DeepSeek R1, and LangChain. We cannot hope to replicate OpenAI's performance, of course, but our prototype will minimally demonstrate some of the key ideas likely underlying their technology, combining advances in reasoning models with advances in agentic workflows. Importantly, and unlike OpenAI, we will be using only open-source tools and will be able to deploy our system locally - open source certainly gives us great flexibility! We will make a few simplifying assumptions to reduce the scope of our project.
We will use Milvus as our vector database, DeepSeek R1 as our reasoning model, and LangChain to implement RAG. Let's get started!

We will use our mental model of how humans conduct research to design the agentic workflow:

Define/Refine Question

Research starts by defining a question. We take the question to be the user's query; however, we use our reasoning model to ensure the question is expressed in a way that is specific, clear, and focused. That is, our first step is to rewrite the prompt and extract any subqueries or subquestions. We make effective use of our foundation model's specialization for reasoning, along with a simple method for JSON structured output. For example, DeepSeek produces a long reasoning trace as it refines the question "How has the cast changed over time?".
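A minimal sketch of this refinement step, assuming DeepSeek R1 is served locally through Ollama; the prompt wording and JSON schema here are our own illustrative choices:

```python
import json
import re

import ollama  # assumes DeepSeek R1 is pulled locally, e.g. `ollama pull deepseek-r1`

REFINE_PROMPT = """You are a research assistant. Rewrite the question below so it is
specific, clear, and focused, and extract any subquestions it contains.
Respond with JSON only, in the form:
{{"refined_question": "...", "subquestions": ["...", "..."]}}

Question: {question}"""

def refine_question(question: str) -> dict:
    # Ask the reasoning model to rewrite the query and list subquestions
    response = ollama.chat(
        model="deepseek-r1",
        messages=[{"role": "user", "content": REFINE_PROMPT.format(question=question)}],
    )
    content = response["message"]["content"]
    # DeepSeek R1 emits its reasoning inside <think>...</think> tags;
    # strip them before parsing the JSON answer
    content = re.sub(r"<think>.*?</think>", "", content, flags=re.DOTALL)
    return json.loads(re.search(r"\{.*\}", content, flags=re.DOTALL).group(0))

subquestions = refine_question("How has the cast changed over time?")["subquestions"]
```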
Search

Next, we conduct a "literature review" of Wikipedia articles. For now, we read a single article and leave navigating links to a future iteration; we discovered during prototyping that link exploration can become very expensive if each link requires a call to the reasoning model. We parse the article and store its data in our vector database, Milvus - akin to taking notes. Here is a code snippet showing how we store our Wikipedia page in Milvus using its LangChain integration:
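This is a minimal sketch: we assume a local Milvus Lite database, and the loader, chunking parameters, and embedding model are illustrative choices.

```python
from langchain_community.document_loaders import WikipediaLoader
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_milvus import Milvus
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load a single Wikipedia article (link navigation is left for a future iteration)
docs = WikipediaLoader(query="The Simpsons", load_max_docs=1).load()

# Chunk the article so each piece fits comfortably in the model's context window
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(docs)

# Embed the chunks and store them in a local Milvus Lite database
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = Milvus.from_documents(
    documents=chunks,
    embedding=embeddings,
    connection_args={"uri": "./milvus_demo.db"},  # Milvus Lite: a local file, no server
)
```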
Analyze

The agent returns to its questions and answers them based on the relevant information in the document. We will leave a multi-step analysis/reflection workflow for future work, as well as any critical thinking about the credibility and bias of our sources. Here is a code snippet illustrating how we construct a RAG chain with LangChain and answer each subquestion separately:
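A minimal sketch, reusing the `vectorstore` from the previous step and again assuming DeepSeek R1 is served through Ollama; the prompt wording and retriever settings are illustrative:

```python
# Define the RAG chain for response generation
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_ollama import ChatOllama

llm = ChatOllama(model="deepseek-r1")  # assumes the Ollama model tag "deepseek-r1"
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

def format_docs(docs):
    # Concatenate the retrieved chunks into a single context string
    return "\n\n".join(doc.page_content for doc in docs)

prompt = ChatPromptTemplate.from_template(
    "Answer the question using only the context below.\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# Prompt the RAG for each question
# (`subquestions` comes from the refinement step sketched earlier)
answers = {q: rag_chain.invoke(q) for q in subquestions}
```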
Synthesize

After the agent has performed its research, it creates a structured outline - or rather, a skeleton - of its findings to summarize in a report. It then completes each section, filling it in with a section title and the corresponding content (a minimal sketch of this step appears below, after the summary). We leave a more sophisticated workflow with reflection, reordering, and rewriting for a future iteration. This part of the agent involves planning, tool usage, and memory. See the accompanying notebook for the full code and the saved report file for example output.

Results

Our query for testing is "How has The Simpsons changed over time?" and the data source is the Wikipedia article for "The Simpsons". An example section of the generated report can be found in the saved report file.

Summary: What we built and what's next

In just a few hours, we have designed a basic agentic workflow that can reason, plan, and retrieve information from Wikipedia to generate a structured research report. While this prototype is far from OpenAI's Deep Research, it demonstrates the power of open-source tools like Milvus, DeepSeek, and LangChain in building autonomous research agents. Of course, there's plenty of room for improvement. Future iterations could:

- Navigate links from the article and pull in multiple sources, rather than a single Wikipedia page
- Add a multi-step analysis/reflection workflow, including critical thinking about the credibility and bias of sources
- Refine the report with reflection, reordering, and rewriting
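And here is the minimal sketch of the Synthesize step promised above. It reuses `llm` and `answers` from the earlier snippets and assumes a refined `question` string; the `ask` helper and its prompts are illustrative, not the exact ones from the notebook.

```python
import json
import re

def ask(prompt: str) -> str:
    """Query DeepSeek R1 via the ChatOllama model and strip its <think> trace."""
    raw = llm.invoke(prompt).content
    return re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL).strip()

# Plan: derive a section outline (the report "skeleton") from the research question
outline_raw = ask(
    f"Create a section outline for a report answering: {question}\n"
    'Respond with JSON only, e.g. {"sections": ["Introduction", "..."]}'
)
sections = json.loads(re.search(r"\{.*\}", outline_raw, re.DOTALL).group(0))["sections"]

# Write: fill in each section, using the subquestion answers as the agent's memory
notes = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in answers.items())
report = ""
for title in sections:
    body = ask(f"Research notes:\n{notes}\n\nWrite the report section '{title}'.")
    report += f"{title}\n\n{body}\n\n"

with open("report.md", "w") as f:
    f.write(report)
```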
Open source gives us flexibility and control that closed source doesn't. Whether for academic research, content synthesis, or AI-powered assistance, building our own research agents opens up exciting possibilities. Stay tuned for the next post, where we explore adding real-time web retrieval, multi-step reasoning, and conditional execution flow!
*This post was written by Stefan Webb and originally published on Zilliz here. We thank Zilliz for their insights and ongoing support of TheSequence.*