What is CodeGraph
CodeGraph is a tool that transforms source code into a Property Graph. It parses source code into an Abstract Syntax Tree (AST), extracts entities and relationships from it, and builds a queryable code knowledge graph.
This architecture enables AI tools to understand code structure, call chains, and dependency relationships through graph queries.
Overall Architecture
Core Data Model
CodeGraph represents code with three tables:
1. Nodes Table — Code Entities (Graph Vertices)
| Field | Description |
|---|---|
| `id` | Unique identifier, in a format such as `function:hash`, `file:path`, or `class:hash` |
| `kind` | Node type: `file`, `class`, `function`, `method`, `constant`, `variable`, `interface`, `type_alias`, `import` |
| `name` / `qualified_name` | Name and fully qualified name |
| `file_path` | Source file path |
| `language` | Programming language |
| `start_line` / `end_line` | Source code location |
| `docstring` / `signature` | Documentation and function signature |
| `visibility` / `is_exported` / `is_async` / `is_static` / `is_abstract` | Modifier attributes |
| `decorators` / `type_parameters` | JSON arrays for decorators and generic parameters |
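As a rough illustration, the fields above could map onto a SQLite table like the one below. This is a minimal sketch using Python's built-in `sqlite3` module: only the column names come from the field table, while the column types, defaults, and the sample row (including the path `src/utils.ts`) are assumptions made for the example.

```python
import json
import sqlite3

# Hypothetical DDL: column names follow the field table, types are assumed.
NODES_DDL = """
CREATE TABLE nodes (
    id              TEXT PRIMARY KEY,   -- e.g. function:<hash>, file:<path>
    kind            TEXT NOT NULL,      -- file, class, function, method, ...
    name            TEXT NOT NULL,
    qualified_name  TEXT,
    file_path       TEXT NOT NULL,
    language        TEXT,
    start_line      INTEGER,
    end_line        INTEGER,
    docstring       TEXT,
    signature       TEXT,
    visibility      TEXT,
    is_exported     INTEGER DEFAULT 0,  -- booleans stored as 0/1
    is_async        INTEGER DEFAULT 0,
    is_static       INTEGER DEFAULT 0,
    is_abstract     INTEGER DEFAULT 0,
    decorators      TEXT,               -- JSON array
    type_parameters TEXT                -- JSON array
)
"""

conn = sqlite3.connect(":memory:")
conn.execute(NODES_DDL)

# Insert one illustrative function node.
conn.execute(
    "INSERT INTO nodes (id, kind, name, qualified_name, file_path, language,"
    " start_line, end_line, signature, is_exported, decorators)"
    " VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
    ("function:abc123", "function", "helper", "utils.helper", "src/utils.ts",
     "typescript", 1, 3, "function helper(): void", 1, json.dumps([])),
)

row = conn.execute(
    "SELECT kind, name, decorators FROM nodes WHERE id = ?",
    ("function:abc123",),
).fetchone()
```

Storing the array-valued fields as JSON text keeps the table flat while still allowing them to be decoded with `json.loads` when needed.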
2. Edges Table — Code Relationships (Graph Edges)
| Field | Description |
|---|---|
| `source` / `target` | IDs of the two related nodes |
| `kind` | Relationship type |
| `metadata` | JSON containing the confidence score and the `resolvedBy` method |
| `line` / `col` | Source location where the relationship occurs |
Relationship Types:
- `contains`: File contains nodes (hierarchical structure)
- `calls`: Function/method calls
- `references`: Variable/type references
- `imports`: Module imports
- `instantiates`: Class instantiation
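To make the edge layout concrete, here is a small sketch of an edges table with the five relationship types and a JSON `metadata` column. The `CHECK` constraint, the helper `add_edge`, the node IDs, and the confidence value `1.0` are all assumptions for illustration; the source only specifies the field names and the relationship kinds.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE edges (
        source   TEXT NOT NULL,
        target   TEXT NOT NULL,
        kind     TEXT NOT NULL
                 CHECK (kind IN ('contains', 'calls', 'references',
                                 'imports', 'instantiates')),
        metadata TEXT,      -- JSON: confidence score and resolvedBy method
        line     INTEGER,
        col      INTEGER
    )
    """
)

def add_edge(source, target, kind, line, col, **metadata):
    """Hypothetical helper: insert one edge, serializing metadata as JSON."""
    conn.execute(
        "INSERT INTO edges VALUES (?, ?, ?, ?, ?, ?)",
        (source, target, kind, json.dumps(metadata), line, col),
    )

add_edge("file:src/app.ts", "function:def456", "contains", 5, 0)
add_edge("function:def456", "function:abc123", "calls", 6, 2,
         confidence=1.0, resolvedBy="exact-match")

calls = conn.execute(
    "SELECT target, metadata FROM edges WHERE kind = 'calls'"
).fetchall()
resolved_by = json.loads(calls[0][1])["resolvedBy"]

# A relationship kind outside the listed set is rejected by the CHECK constraint.
try:
    add_edge("class:111", "class:222", "extends", 1, 0)
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```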
3. Files Table — File Tracking
Records each file's `content_hash`, `node_count`, `modified_at`, and related metadata so that incremental updates re-parse only the files that changed.
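The idea of comparing stored hashes against the files on disk can be sketched as follows. The functions `content_hash` and `classify` are hypothetical names, and SHA-256 is an assumed hash algorithm; the source does not state which one CodeGraph uses.

```python
import hashlib

def content_hash(data: bytes) -> str:
    """Hex digest of file content (SHA-256 is an assumed choice)."""
    return hashlib.sha256(data).hexdigest()

def classify(stored: dict[str, str], current: dict[str, bytes]):
    """Compare stored hashes with current contents.

    stored:  path -> content_hash recorded at the last indexing run
    current: path -> file bytes as they exist now
    Returns (added, modified, deleted) lists of paths.
    """
    added = sorted(p for p in current if p not in stored)
    deleted = sorted(p for p in stored if p not in current)
    modified = sorted(
        p for p in current
        if p in stored and content_hash(current[p]) != stored[p]
    )
    return added, modified, deleted

stored = {
    "a.ts": content_hash(b"old"),
    "b.ts": content_hash(b"same"),
    "c.ts": content_hash(b"gone"),
}
current = {"a.ts": b"new", "b.ts": b"same", "d.ts": b"fresh"}

added, modified, deleted = classify(stored, current)
```

Only the paths in `added` and `modified` would need to be parsed again; `b.ts` is skipped because its hash is unchanged.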
Parsing Workflow
Incremental Updates and Deletion Synchronization
CodeGraph automatically keeps the graph in sync with code changes in three scenarios: addition, modification, and deletion.
1. Adding Code
When adding new source files or code entities:
- Scan phase discovers new file → creates `file` node
- Parse AST → extracts new `function`, `class`, `variable` nodes
- Establish relationships → generates `contains`, `calls` edges
- Unresolved cross-file references stored in `unresolved_refs` table
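The steps above can be sketched in a few lines. Note that this example swaps in Python's standard `ast` module for Tree-sitter so that it runs without extra dependencies, and it indexes Python source rather than the languages CodeGraph targets. The function `index_file` and the simplified node IDs (path plus name instead of a hash) are assumptions for illustration.

```python
import ast

SOURCE = '''
def helper():
    return 1

def main():
    helper()
    external_call()
'''

def index_file(path: str, source: str):
    """Return (nodes, edges, unresolved_refs) for one file."""
    tree = ast.parse(source)
    file_id = f"file:{path}"
    nodes = [{"id": file_id, "kind": "file", "name": path}]
    edges, unresolved = [], []

    # Pass 1: one node per top-level function, plus a `contains` edge.
    functions = {}
    for stmt in tree.body:
        if isinstance(stmt, ast.FunctionDef):
            node_id = f"function:{path}:{stmt.name}"  # simplified ID, not a hash
            functions[stmt.name] = (node_id, stmt)
            nodes.append({"id": node_id, "kind": "function", "name": stmt.name})
            edges.append((file_id, node_id, "contains"))

    # Pass 2: call expressions become `calls` edges when the callee is known
    # in this file, otherwise they are recorded as unresolved references.
    for node_id, stmt in functions.values():
        for sub in ast.walk(stmt):
            if isinstance(sub, ast.Call) and isinstance(sub.func, ast.Name):
                callee = sub.func.id
                if callee in functions:
                    edges.append((node_id, functions[callee][0], "calls"))
                else:
                    unresolved.append((node_id, callee))
    return nodes, edges, unresolved

nodes, edges, unresolved = index_file("app.py", SOURCE)
```

Here `external_call` has no definition in the file, so it lands in `unresolved` and could be resolved later once other files are indexed.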
2. Modifying Code
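Incremental updates are based on content hash comparison: a file is re-parsed only when its current hash differs from the stored `content_hash`.
Update workflow: a changed hash triggers re-indexing of the file, which is re-parsed to generate new nodes and edges that replace the old ones.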
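A minimal sketch of this hash-gated re-indexing, under several assumptions: the two-table schema is reduced to the columns needed here, `parse` is a toy stand-in for the real parser, `sync_file` is a hypothetical name, and SHA-256 is an assumed hash algorithm.

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE files (path TEXT PRIMARY KEY, content_hash TEXT NOT NULL);
CREATE TABLE nodes (id TEXT PRIMARY KEY, name TEXT NOT NULL, file_path TEXT NOT NULL);
""")

def parse(source: str) -> list[str]:
    """Toy stand-in for the real parser: one function per 'def NAME(' line."""
    return [line.split()[1].split("(")[0]
            for line in source.splitlines() if line.startswith("def ")]

def sync_file(path: str, source: str) -> bool:
    """Re-index `path` only if its content hash changed. Returns True if re-indexed."""
    new_hash = hashlib.sha256(source.encode("utf-8")).hexdigest()
    row = conn.execute(
        "SELECT content_hash FROM files WHERE path = ?", (path,)
    ).fetchone()
    if row and row[0] == new_hash:
        return False  # unchanged: skip parsing entirely
    with conn:  # replace old nodes and record the new hash in one transaction
        conn.execute("DELETE FROM nodes WHERE file_path = ?", (path,))
        conn.executemany(
            "INSERT INTO nodes VALUES (?, ?, ?)",
            [(f"function:{path}:{name}", name, path) for name in parse(source)],
        )
        conn.execute("INSERT OR REPLACE INTO files VALUES (?, ?)", (path, new_hash))
    return True

first = sync_file("app.py", "def main():\n    pass\n")
second = sync_file("app.py", "def main():\n    pass\n")
third = sync_file("app.py", "def main():\n    pass\ndef helper():\n    pass\n")
names = sorted(r[0] for r in conn.execute("SELECT name FROM nodes"))
```

The second call returns without touching the `nodes` table, which is the saving that hash comparison is meant to provide.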
3. Deleting Code
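Deletion relies on SQLite's `ON DELETE CASCADE` constraints for automatic cleanup: when a node row is deleted, the edges that reference it are deleted with it, and unrelated nodes are left untouched.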
4. Practical Example
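Suppose a project defines two functions, `helper()` (node `function:abc123`) and `main()` (node `function:def456`), where `main()` calls `helper()`.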
Scenario A: Deleting helper() function
- Deletes node `function:abc123`
- Automatically deletes edge `function:def456 → function:abc123`
- `main()` function is preserved, but its call relationship to `helper` is gone
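Scenario A can be reproduced with a small in-memory database. This is a sketch under an assumed, reduced schema (only the columns needed for the demonstration); the node IDs are the ones used in the scenario. One detail worth knowing is that SQLite enforces foreign keys only after `PRAGMA foreign_keys = ON` has been issued on the connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite
conn.executescript("""
CREATE TABLE nodes (id TEXT PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE edges (
    source TEXT NOT NULL REFERENCES nodes(id) ON DELETE CASCADE,
    target TEXT NOT NULL REFERENCES nodes(id) ON DELETE CASCADE,
    kind   TEXT NOT NULL
);
INSERT INTO nodes VALUES ('function:abc123', 'helper'),
                         ('function:def456', 'main');
INSERT INTO edges VALUES ('function:def456', 'function:abc123', 'calls');
""")

# Deleting helper() removes its node; the cascade removes the calls edge.
conn.execute("DELETE FROM nodes WHERE id = 'function:abc123'")

remaining_nodes = [r[0] for r in conn.execute("SELECT name FROM nodes")]
remaining_edges = conn.execute("SELECT COUNT(*) FROM edges").fetchone()[0]
```

After the delete, only `main` remains and the edge table is empty, without any explicit cleanup query against `edges`.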
Scenario B: Modifying main() function
- `content_hash` changes → triggers re-indexing
- Re-parses to generate new nodes and edges
- Old relationships replaced with new ones
Key Technical Choices
1. SQLite + FTS5 Full-Text Search
The `nodes_fts` virtual table, based on FTS5, builds a full-text index on `name`, `qualified_name`, `docstring`, and `signature`, supporting fuzzy search for code symbols.
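A minimal sketch of such an index, assuming a standalone FTS5 table (the real table may be linked to `nodes` differently) and a SQLite build with FTS5 enabled, which is the case for most Python distributions. The sample rows are invented. FTS5 itself offers token and prefix matching rather than edit-distance matching, so the example shows a prefix query as one way to get forgiving symbol lookup.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE VIRTUAL TABLE nodes_fts "
    "USING fts5(name, qualified_name, docstring, signature)"
)
conn.executemany(
    "INSERT INTO nodes_fts VALUES (?, ?, ?, ?)",
    [
        ("parseConfig", "config.parseConfig",
         "Parse the YAML configuration file.",
         "parseConfig(path: string): Config"),
        ("renderPage", "site.renderPage",
         "Render a blog page to HTML.",
         "renderPage(slug: string): string"),
    ],
)

# Prefix query: matches any token starting with "pars" in any indexed column.
hits = [
    r[0]
    for r in conn.execute(
        "SELECT name FROM nodes_fts WHERE nodes_fts MATCH ? ORDER BY rank",
        ("pars*",),
    )
]
```

Ordering by the built-in `rank` column returns the best matches first.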
2. Content Hash-Based Incremental Updates
`files.content_hash` prevents redundant parsing of unmodified files, and `nodes.updated_at` records when each node was last updated.
3. Confidence Mechanism
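Each resolved reference carries a confidence score and a `resolvedBy` value in the edge `metadata`; for example, `exact-match` marks a reference that was resolved by exact matching. The scores make it possible to prioritize between different resolution strategies.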
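One way such prioritization could work is sketched below. Everything except the `confidence` and `resolvedBy` keys and the `exact-match` value is hypothetical: the second strategy name, the numeric scores, the threshold, and the function `pick_best` are invented for illustration.

```python
import json

# Hypothetical candidates for one reference to `helper`.
candidates = [
    {"target": "function:999aaa", "confidence": 0.6, "resolvedBy": "name-only"},
    {"target": "function:abc123", "confidence": 1.0, "resolvedBy": "exact-match"},
]

def pick_best(candidates, threshold=0.5):
    """Return the highest-confidence candidate at or above the threshold, or None."""
    viable = [c for c in candidates if c["confidence"] >= threshold]
    return max(viable, key=lambda c: c["confidence"]) if viable else None

best = pick_best(candidates)

# The winning strategy and score are what would be stored in the edge metadata.
metadata = json.dumps(
    {"confidence": best["confidence"], "resolvedBy": best["resolvedBy"]}
)
```

A reference with no viable candidate returns `None` and would stay unresolved.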
4. Node ID Generation Strategy
- File nodes: `file:relative_path`
- Other nodes: `type:content_hash` (e.g., `function:fcddbf01897d11636307b7ab1f47aa5c`)
- Import nodes: `import:hash`
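The ID scheme can be sketched as follows. The source does not say which hash function is used or what exactly is hashed, so both are assumptions here: the example hashes the file path, qualified name, and start line, and truncates a SHA-256 digest to 32 hex characters to match the length of the sample ID above. The function names are hypothetical.

```python
import hashlib

def file_node_id(relative_path: str) -> str:
    """File nodes are keyed directly by their relative path."""
    return f"file:{relative_path}"

def entity_node_id(kind: str, file_path: str, qualified_name: str,
                   start_line: int) -> str:
    """Assumed scheme: '<kind>:<32 hex chars>' derived from the entity's location."""
    payload = f"{file_path}:{qualified_name}:{start_line}".encode("utf-8")
    digest = hashlib.sha256(payload).hexdigest()[:32]
    return f"{kind}:{digest}"

fid = file_node_id("src/app.ts")
nid = entity_node_id("function", "src/app.ts", "app.main", 5)
same = entity_node_id("function", "src/app.ts", "app.main", 5)
other = entity_node_id("function", "src/app.ts", "app.helper", 1)
```

Deterministic IDs mean that re-indexing an unchanged entity yields the same node ID, so edges pointing at it remain valid.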
Query Examples
This graph structure supports rich code queries, such as finding the callers of a function or following a call chain across several functions.
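Two such queries are sketched below against a reduced, assumed schema with invented sample data: a join that finds direct callers, and a recursive common table expression that walks the call chain transitively.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE nodes (id TEXT PRIMARY KEY, kind TEXT, name TEXT);
CREATE TABLE edges (source TEXT, target TEXT, kind TEXT);
INSERT INTO nodes VALUES ('function:a', 'function', 'main'),
                         ('function:b', 'function', 'helper'),
                         ('function:c', 'function', 'format');
INSERT INTO edges VALUES ('function:a', 'function:b', 'calls'),
                         ('function:b', 'function:c', 'calls');
""")

# Direct callers of `format`.
callers = [r[0] for r in conn.execute("""
    SELECT caller.name
    FROM edges e
    JOIN nodes caller ON caller.id = e.source
    JOIN nodes callee ON callee.id = e.target
    WHERE e.kind = 'calls' AND callee.name = ?
""", ("format",))]

# Everything reachable from `main` through calls edges.
# UNION (not UNION ALL) deduplicates rows, so cyclic call graphs terminate.
chain = [r[0] for r in conn.execute("""
    WITH RECURSIVE reachable(id) AS (
        SELECT id FROM nodes WHERE name = ?
        UNION
        SELECT e.target
        FROM edges e JOIN reachable r ON e.source = r.id
        WHERE e.kind = 'calls'
    )
    SELECT n.name FROM reachable r JOIN nodes n ON n.id = r.id
    ORDER BY n.name
""", ("main",))]
```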
Real-World Data Example
Using the current blog project as an example, CodeGraph generates:
| Metric | Value |
|---|---|
| Files | 74 |
| Code Nodes | 167 |
| Relationship Edges | 289 |
| Unresolved References | 0 |
Node Type Distribution:
- `method` (50)
- `function` (41)
- `file` (19)
- `import` (16)
- `constant` (12)
- `variable` (11)
- `interface` (9)
- `class` (6)
- `type_alias` (3)
Summary
CodeGraph's core approach is to parse source code into a property graph stored in SQLite, using Tree-sitter for multi-language AST parsing, and to keep code comprehension and retrieval efficient through incremental updates and full-text indexing.
This architecture provides AI code assistants with powerful context understanding capabilities, enabling:
- Fast code definition and reference location
- Understanding of function call chains and dependency relationships
- Cross-file code navigation support
- Semantic-based code search