CodeGraph Technical Architecture: Transforming Source Code into a Queryable Knowledge Graph

What is CodeGraph

CodeGraph is a tool that transforms source code into a property graph. It parses each source file into an Abstract Syntax Tree (AST), extracts entities and relationships from that tree, and builds a queryable code knowledge graph.

This architecture enables AI tools to understand code structure, call chains, and dependency relationships through graph queries.

Overall Architecture

Source Files → Parser (Tree-sitter) → AST → Extractor → SQLite Database
                                                          ↓
                                                    Graph Query Engine

Core Data Model

CodeGraph represents code with three tables:

1. Nodes Table — Code Entities (Graph Vertices)

Field	Description
`id`	Unique identifier, format like `function:hash`, `file:path`, `class:hash`
`kind`	Node type: `file`, `class`, `function`, `method`, `constant`, `variable`, `interface`, `type_alias`, `import`
`name` / `qualified_name`	Name and fully qualified name
`file_path`	Source file path
`language`	Programming language
`start_line/end_line`	Source code location
`docstring/signature`	Documentation and function signature
`visibility/is_exported/is_async/is_static/is_abstract`	Modifier attributes
`decorators/type_parameters`	JSON arrays for decorators and generic parameters

2. Edges Table — Code Relationships (Graph Edges)

Field	Description
`source/target`	Two related node IDs
`kind`	Relationship type
`metadata`	JSON containing `confidence` score and `resolvedBy` method
`line/col`	Source location where the relationship occurs

Relationship Types:

contains — File contains nodes (hierarchical structure)
calls — Function/method calls
references — Variable/type references
imports — Module imports
instantiates — Class instantiation

3. Files Table — File Tracking

Records each file’s content_hash, node_count, modified_at, and related fields, so that incremental updates re-parse only the files that changed.
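The three tables could be declared roughly as follows. This is a minimal sketch using Python's built-in sqlite3 module. Only the column names listed in this article come from CodeGraph; the column types, the `path` key column of `files`, and the `create_schema` helper are assumptions made for illustration, and several modifier columns are omitted for brevity.

```python
import sqlite3

# Assumed DDL: column names follow the article, types are guesses.
SCHEMA = """
CREATE TABLE nodes (
    id             TEXT PRIMARY KEY,   -- e.g. 'function:<hash>' or 'file:<path>'
    kind           TEXT NOT NULL,
    name           TEXT NOT NULL,
    qualified_name TEXT,
    file_path      TEXT NOT NULL,
    language       TEXT,
    start_line     INTEGER,
    end_line       INTEGER,
    docstring      TEXT,
    signature      TEXT,
    updated_at     INTEGER
);
CREATE TABLE edges (
    source   TEXT NOT NULL,
    target   TEXT NOT NULL,
    kind     TEXT NOT NULL,            -- contains, calls, references, ...
    metadata TEXT,                     -- JSON with confidence / resolvedBy
    line     INTEGER,
    col      INTEGER,
    FOREIGN KEY (source) REFERENCES nodes(id) ON DELETE CASCADE,
    FOREIGN KEY (target) REFERENCES nodes(id) ON DELETE CASCADE
);
CREATE TABLE files (
    path         TEXT PRIMARY KEY,     -- hypothetical key column
    content_hash TEXT NOT NULL,
    node_count   INTEGER NOT NULL DEFAULT 0,
    modified_at  INTEGER NOT NULL,
    indexed_at   INTEGER NOT NULL
);
"""

def create_schema(conn):
    """Create the three tables on an open connection."""
    conn.execute("PRAGMA foreign_keys = ON")  # needed for cascades to fire
    conn.executescript(SCHEMA)

conn = sqlite3.connect(":memory:")
create_schema(conn)
tables = sorted(
    row[0] for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
print(tables)
```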

Parsing Workflow

1. Scan project directory, traverse all source files
2. Calculate content_hash for each file, compare with files table
   → Skip unmodified files (incremental indexing)
3. For files needing updates, use Tree-sitter to parse and generate AST
4. Extract from AST:
   - Nodes (classes, functions, methods, variables, constants, interfaces, etc.)
   - Edges (call relationships, references, containment, imports)
5. Store unresolved references in unresolved_refs table
6. Attempt to resolve unresolved_refs (cross-file reference resolution)
7. Write to SQLite database
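Steps 1 and 2 of the workflow can be sketched as below. The function names, the `extensions` filter, and the in-memory `stored_hashes` dictionary (standing in for the files table) are hypothetical; the sketch only shows how a content hash lets the indexer skip unmodified files.

```python
import hashlib
import os
import tempfile

def content_hash(path):
    """Return the SHA-256 hex digest of a file's bytes."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def files_to_reparse(root, stored_hashes, extensions=(".py",)):
    """Return {relative_path: new_hash} for files that are new or changed."""
    changed = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            if not filename.endswith(extensions):
                continue
            full = os.path.join(dirpath, filename)
            rel = os.path.relpath(full, root)
            digest = content_hash(full)
            # A missing or different stored hash means the file needs parsing.
            if stored_hashes.get(rel) != digest:
                changed[rel] = digest
    return changed

with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "utils.py"), "w") as f:
        f.write("def helper():\n    pass\n")
    first_run = files_to_reparse(root, {})          # nothing indexed yet
    second_run = files_to_reparse(root, first_run)  # hashes now match

print(sorted(first_run), second_run)
```

On the second run the stored hash matches, so no file is returned for re-parsing.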

Incremental Updates and Deletion Synchronization

CodeGraph automatically synchronizes code changes in three scenarios: addition, modification, and deletion.

1. Adding Code

When adding new source files or code entities:

Scan phase discovers new file → creates file node
Parse AST → extracts new function, class, variable nodes
Establish relationships → generates contains, calls edges
Unresolved cross-file references stored in unresolved_refs table
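To make the extraction step concrete, here is a small sketch that pulls function nodes and call edges out of a Python file. It deliberately swaps in Python's standard `ast` module for Tree-sitter so that it runs without extra dependencies; CodeGraph itself uses Tree-sitter, and the `extract` function and record layout are assumptions. Calls whose target is not defined in the same file are collected separately, mirroring the role of the unresolved_refs table.

```python
import ast

SOURCE = '''
def helper():
    pass

def main():
    helper()
    print("done")
'''

def extract(source, file_path):
    """Return (nodes, edges, unresolved) for top-level functions in one file."""
    tree = ast.parse(source)
    func_types = (ast.FunctionDef, ast.AsyncFunctionDef)
    nodes = {}
    for item in tree.body:
        if isinstance(item, func_types):
            nodes[item.name] = {
                "kind": "function",
                "name": item.name,
                "file_path": file_path,
                "start_line": item.lineno,
                "end_line": item.end_lineno,
                "is_async": isinstance(item, ast.AsyncFunctionDef),
            }
    edges, unresolved = [], []
    for item in tree.body:
        if not isinstance(item, func_types):
            continue
        for call in ast.walk(item):
            # Only plain-name calls such as helper(); attribute calls are ignored here.
            if isinstance(call, ast.Call) and isinstance(call.func, ast.Name):
                ref = {"source": item.name, "target": call.func.id,
                       "kind": "calls", "line": call.lineno}
                # Targets not defined in this file are left for later resolution.
                (edges if call.func.id in nodes else unresolved).append(ref)
    return list(nodes.values()), edges, unresolved

nodes, edges, unresolved = extract(SOURCE, "utils.py")
print([n["name"] for n in nodes], edges, unresolved)
```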

2. Modifying Code

Based on content hash comparison for incremental updates:

-- files table tracks hash for each file
content_hash TEXT NOT NULL,  -- SHA-256 hash of file content
modified_at INTEGER NOT NULL, -- File modification time
indexed_at INTEGER NOT NULL   -- Last indexing time

Update workflow:

1. Calculate current file's content_hash
2. Compare with stored hash in files table
3. If hash differs → file has been modified
4. Delete old nodes and edges for that file (cascade delete)
5. Re-parse file, generate new nodes and edges
6. Update files table with new hash and timestamp
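Steps 4 to 6 are best done in a single transaction so that a failed re-parse never leaves a file half-indexed. The sketch below assumes a reduced schema and a hypothetical `reindex_file` helper; the `path` column and the shape of `parsed_nodes` are illustrative, not CodeGraph's actual interface.

```python
import sqlite3
import time

def reindex_file(conn, path, new_hash, modified_at, parsed_nodes):
    """Replace all nodes of one file and record its new hash atomically."""
    with conn:  # commits on success, rolls back if any statement raises
        conn.execute("DELETE FROM nodes WHERE file_path = ?", (path,))
        conn.executemany(
            "INSERT INTO nodes (id, kind, name, file_path) VALUES (?, ?, ?, ?)",
            [(n["id"], n["kind"], n["name"], path) for n in parsed_nodes],
        )
        conn.execute(
            "INSERT OR REPLACE INTO files (path, content_hash, modified_at, indexed_at) "
            "VALUES (?, ?, ?, ?)",
            (path, new_hash, modified_at, int(time.time())),
        )

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE nodes (id TEXT PRIMARY KEY, kind TEXT, name TEXT, file_path TEXT);
CREATE TABLE files (path TEXT PRIMARY KEY, content_hash TEXT NOT NULL,
                    modified_at INTEGER NOT NULL, indexed_at INTEGER NOT NULL);
""")

# First index: two functions. Second index: helper() has been removed.
reindex_file(conn, "utils.py", "hash-v1", 1700000000, [
    {"id": "function:abc123", "kind": "function", "name": "helper"},
    {"id": "function:def456", "kind": "function", "name": "main"},
])
reindex_file(conn, "utils.py", "hash-v2", 1700000100, [
    {"id": "function:def456", "kind": "function", "name": "main"},
])

remaining = [r[0] for r in conn.execute("SELECT name FROM nodes ORDER BY name")]
stored_hash = conn.execute(
    "SELECT content_hash FROM files WHERE path = 'utils.py'").fetchone()[0]
print(remaining, stored_hash)
```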

3. Deleting Code

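Deletion relies on SQLite’s ON DELETE CASCADE constraints for automatic cleanup. Note that SQLite enforces foreign keys only when PRAGMA foreign_keys = ON is set on the connection: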

-- edges table foreign key constraints
FOREIGN KEY (source) REFERENCES nodes(id) ON DELETE CASCADE
FOREIGN KEY (target) REFERENCES nodes(id) ON DELETE CASCADE

-- unresolved_refs table foreign key constraints
FOREIGN KEY (from_node_id) REFERENCES nodes(id) ON DELETE CASCADE

Deletion workflow:

1. Scan discovers file deleted → remove from files table
2. Delete all nodes for that file (cascade deletes edges)
3. Or: a function in a file is deleted
   → Delete the function node
   → Automatically delete all edges containing that node
   → Automatically delete related unresolved_refs
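The cascade behaviour can be checked in isolation with a few lines. The reduced schema below is an assumption; the node IDs are the ones used in this article's example. The key detail is that SQLite leaves foreign key enforcement off by default, so the pragma must be set on each connection before the cascade takes effect.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite
conn.executescript("""
CREATE TABLE nodes (id TEXT PRIMARY KEY, name TEXT, file_path TEXT);
CREATE TABLE edges (
    source TEXT NOT NULL,
    target TEXT NOT NULL,
    kind   TEXT NOT NULL,
    FOREIGN KEY (source) REFERENCES nodes(id) ON DELETE CASCADE,
    FOREIGN KEY (target) REFERENCES nodes(id) ON DELETE CASCADE
);
INSERT INTO nodes VALUES ('function:abc123', 'helper', 'utils.py');
INSERT INTO nodes VALUES ('function:def456', 'main', 'utils.py');
INSERT INTO edges VALUES ('function:def456', 'function:abc123', 'calls');
""")

# Deleting the helper node removes every edge that points to or from it.
conn.execute("DELETE FROM nodes WHERE id = 'function:abc123'")

edge_count = conn.execute("SELECT COUNT(*) FROM edges").fetchone()[0]
node_names = [r[0] for r in conn.execute("SELECT name FROM nodes")]
print(edge_count, node_names)
```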

4. Practical Example

Suppose a project has the following code:

# utils.py (already indexed)
def helper():    # node_id: function:abc123
    pass

def main():      # node_id: function:def456
    helper()     # edge: function:def456 → function:abc123 (calls)

Scenario A: Deleting helper() function

# Edit utils.py to remove helper(), then re-index
# (running `git rm utils.py` instead would remove the whole file, including main())
codegraph --update

Deletes node function:abc123
Automatically deletes edge function:def456 → function:abc123
The main() node is preserved but no longer has a calls edge to helper()

Scenario B: Modifying main() function

def main():      # same node_id: function:def456
    print("updated")

content_hash changes → triggers re-indexing
Re-parses to generate new nodes and edges
Old relationships replaced with new ones

Key Technical Choices

1. SQLite + FTS5 Full-Text Search

The nodes_fts virtual table, based on FTS5, builds a full-text index on name, qualified_name, docstring, and signature, supporting fuzzy search for code symbols.
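A minimal sketch of such an index is shown below. It assumes a Python build whose SQLite includes the FTS5 extension, and the sample rows and the `search_symbols` helper are made up for illustration. FTS5 matches whole tokens by default, so partial matches are expressed as prefix queries (`help*`), and a query can be restricted to one column with the `column : term` syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE VIRTUAL TABLE nodes_fts USING fts5(name, qualified_name, docstring, signature);
INSERT INTO nodes_fts VALUES ('helper', 'utils.helper', 'Format a user name', 'def helper()');
INSERT INTO nodes_fts VALUES ('main', 'utils.main', 'Program entry point', 'def main()');
""")

def search_symbols(conn, query):
    """Return matching symbol names, best match first."""
    rows = conn.execute(
        "SELECT name FROM nodes_fts WHERE nodes_fts MATCH ? ORDER BY rank",
        (query,),
    )
    return [r[0] for r in rows]

prefix_hits = search_symbols(conn, "help*")               # prefix match on any column
docstring_hits = search_symbols(conn, "docstring : entry")  # restricted to one column
print(prefix_hits, docstring_hits)
```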

2. Content Hash-Based Incremental Updates

files.content_hash prevents redundant parsing of unmodified files, and nodes.updated_at timestamps track node update times.

3. Confidence Mechanism

{"confidence": 0.9, "resolvedBy": "exact-match"}

Each resolved reference records a confidence score and the strategy that produced it (here, exact-match), so that results from different resolution strategies can be prioritized.
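One way to use these scores is to keep only the highest-confidence candidate when several strategies propose a target for the same reference. The sketch below is an assumption about how that could work: `pick_best`, the `name-only` strategy, and its score are hypothetical, and only `exact-match` and the metadata shape come from this article.

```python
import json

def pick_best(candidates):
    """Return the candidate with the highest confidence, or None if there are none."""
    if not candidates:
        return None
    return max(candidates, key=lambda c: c["confidence"])

candidates = [
    {"target": "function:abc123", "confidence": 0.9, "resolvedBy": "exact-match"},
    # Hypothetical weaker strategy that matched on the bare name only.
    {"target": "function:zzz999", "confidence": 0.5, "resolvedBy": "name-only"},
]

best = pick_best(candidates)
# The edge's metadata column stores the winning score and strategy as JSON.
metadata = json.dumps(
    {"confidence": best["confidence"], "resolvedBy": best["resolvedBy"]}
)
print(best["target"], metadata)
```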

4. Node ID Generation Strategy

File nodes: file:relative_path
Other nodes: type:content_hash (e.g., function:fcddbf01897d11636307b7ab1f47aa5c)
Import nodes: import:hash
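The ID scheme can be illustrated with a short sketch. What exactly CodeGraph hashes is not stated here, so the input below (file path plus qualified name) and the truncation of SHA-256 to 32 hex characters are assumptions chosen so that an ID stays stable when only a function's body changes, as in Scenario B above. `make_node_id` is a hypothetical name.

```python
import hashlib

def make_node_id(kind, file_path, qualified_name=None):
    """Build a node ID in the 'kind:identifier' format described above."""
    if kind == "file":
        return f"file:{file_path}"
    # Assumed hash input: file path and qualified name, not the function body.
    raw = f"{file_path}:{qualified_name}".encode("utf-8")
    digest = hashlib.sha256(raw).hexdigest()[:32]
    return f"{kind}:{digest}"

file_id = make_node_id("file", "src/utils.py")
func_id = make_node_id("function", "src/utils.py", "utils.helper")
print(file_id, func_id)
```

Because the ID is derived deterministically, re-indexing an unchanged symbol yields the same ID, which lets existing edges from other files keep pointing at it.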

Query Examples

This graph structure supports rich code queries:

-- Find all functions called by a specific function
SELECT t.name FROM edges e JOIN nodes t ON e.target = t.id 
WHERE e.source = 'function:xxx' AND e.kind = 'calls'

-- Full-text search for code symbols
SELECT * FROM nodes_fts WHERE nodes_fts MATCH 'search'

-- Find all nodes contained in a file
SELECT n.name, n.kind FROM edges e JOIN nodes n ON e.target = n.id 
WHERE e.source = 'file:xxx' AND e.kind = 'contains'

-- Find who calls a specific method
SELECT s.name FROM edges e JOIN nodes s ON e.source = s.id 
WHERE e.target = 'method:xxx' AND e.kind = 'calls'
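The queries above follow a single hop. A full call chain needs a recursive query, which SQLite supports through WITH RECURSIVE. The sketch below runs one from Python on a reduced, assumed schema with made-up sample data; the depth limit keeps the recursion finite even when functions call each other in a cycle.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE nodes (id TEXT PRIMARY KEY, name TEXT);
CREATE TABLE edges (source TEXT, target TEXT, kind TEXT);
INSERT INTO nodes VALUES ('function:a', 'main'), ('function:b', 'helper'), ('function:c', 'log');
INSERT INTO edges VALUES
    ('function:a', 'function:b', 'calls'),
    ('function:b', 'function:c', 'calls'),
    ('function:c', 'function:b', 'calls');  -- cycle between helper and log
""")

CALL_CHAIN = """
WITH RECURSIVE reachable(id, depth) AS (
    SELECT target, 1 FROM edges WHERE source = ? AND kind = 'calls'
    UNION
    SELECT e.target, r.depth + 1
    FROM edges e JOIN reachable r ON e.source = r.id
    WHERE e.kind = 'calls' AND r.depth < ?
)
SELECT n.name, MIN(r.depth)
FROM reachable r JOIN nodes n ON n.id = r.id
GROUP BY n.id
ORDER BY MIN(r.depth), n.name
"""

def call_chain(conn, start_id, max_depth=5):
    """Return (name, shortest depth) for every function reachable from start_id."""
    return conn.execute(CALL_CHAIN, (start_id, max_depth)).fetchall()

chain = call_chain(conn, "function:a")
print(chain)
```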

Real-World Data Example

Using the current blog project as an example, CodeGraph generates:

Metric	Value
Files	74
Code Nodes	167
Relationship Edges	289
Unresolved References	0

Node Type Distribution:

method (50)
function (41)
file (19)
import (16)
constant (12)
variable (11)
interface (9)
class (6)
type_alias (3)

Summary

CodeGraph’s core approach is to parse source code with Tree-sitter, which handles multiple languages, into a property graph stored in SQLite. Incremental updates keep the index cheap to maintain, and the full-text index makes code symbols searchable.

This architecture provides AI code assistants with powerful context understanding capabilities, enabling:

Fast code definition and reference location
Understanding of function call chains and dependency relationships
Cross-file code navigation support
Full-text search over code symbols