Deep Codebase Indexer

I wanted to build my own advanced multi-agent coding system. This project was the first step in that process.

For an LLM to answer questions about a codebase, it needs the right context. But simple semantic RAG on its own doesn't work that well.

Here are the features I implemented in this project:

  • Incremental Indexing: We don't want to re-index the whole codebase every time something in it changes. For this, I implemented an incremental indexing strategy that tracks which files have changed (using SHA-256 hashing) and re-indexes only those files.
  • Lexical Indexing: This is keyword-based search over the codebase. It addresses the need for exact-word matches.
  • Semantic Indexing: This is meaning-based search over the codebase. It is done with an embedding model, and the resulting vectors are stored in a vector database.
  • Structural Indexing: The first two indexes treat the codebase as tokens or concepts. But since code has syntax and structure, we can take advantage of that too. We use ASTs (abstract syntax trees) for this.
  • Graph-Based Indexing (still in progress): This index captures the dynamic part of the codebase: the relationships and data flows that connect disparate parts into a functioning whole. I am trying to use CPGs (code property graphs) for this.
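
To make the incremental indexing idea concrete, here is a minimal sketch of hash-based change detection. It is not the project's actual code: the function names (`file_digest`, `plan_reindex`), the in-memory `manifest` dictionary, and the `*.py` file pattern are all assumptions chosen for illustration. The idea is simply to store one SHA-256 digest per file and re-index only the files whose digest differs from the stored one.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's raw bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def plan_reindex(root: Path, manifest: dict[str, str], pattern: str = "*.py"):
    """Compare current file hashes against a previously stored manifest.

    `manifest` maps relative file paths to the digest recorded at the last
    indexing run (hypothetical storage format, assumed for this sketch).

    Returns a tuple (changed, removed, new_manifest):
      changed      - new or modified files that need (re-)indexing
      removed      - files that disappeared and should be dropped from the index
      new_manifest - the digests to persist for the next run
    """
    current = {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in root.rglob(pattern)
        if p.is_file()
    }
    changed = sorted(name for name, digest in current.items()
                     if manifest.get(name) != digest)
    removed = sorted(name for name in manifest if name not in current)
    return changed, removed, current
```

On the first run the manifest is empty, so every file is reported as changed. On later runs only files whose content actually differs are returned, which is what keeps re-indexing cheap. A real implementation would also persist the manifest to disk and might read large files in chunks instead of all at once.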

Based on the user's query, we can give different weights to the different indexes to build good context for the LLM to answer the query and dig further into it.
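
One way such weighting could work is sketched below. This is an assumption, not the project's actual method: it uses weighted reciprocal rank fusion, where each index returns a ranked list of chunk IDs and every hit contributes `weight / (k + rank)` to that chunk's score. The function name `fuse_rankings`, the index names, the chunk IDs, and the weights are all hypothetical.

```python
from collections import defaultdict


def fuse_rankings(
    rankings: dict[str, list[str]],
    weights: dict[str, float],
    k: int = 60,
) -> list[tuple[str, float]]:
    """Merge per-index ranked results with weighted reciprocal rank fusion.

    rankings - index name -> chunk IDs ordered best-first
    weights  - index name -> weight for this query (missing index = 0.0)
    k        - smoothing constant; larger values flatten rank differences
    """
    scores: dict[str, float] = defaultdict(float)
    for index_name, ranked_ids in rankings.items():
        weight = weights.get(index_name, 0.0)
        for rank, chunk_id in enumerate(ranked_ids, start=1):
            scores[chunk_id] += weight / (k + rank)
    # Highest score first; ties are broken by chunk ID for stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical query that looks like an exact identifier lookup,
# so the lexical index gets the largest weight.
rankings = {
    "lexical": ["auth.py::login", "db.py::connect"],
    "semantic": ["session.py::refresh", "auth.py::login"],
    "structural": ["auth.py::login"],
}
weights = {"lexical": 0.5, "semantic": 0.3, "structural": 0.2}
fused = fuse_rankings(rankings, weights)
```

Working on ranks rather than raw scores avoids having to normalize scores that come from very different systems (keyword match scores versus embedding similarities). The weights themselves would be chosen per query, for example favoring the lexical index for identifier lookups and the semantic index for conceptual questions.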