CodeGraph Technical Architecture: Transforming Source Code into a Queryable Knowledge Graph

What is CodeGraph

CodeGraph is a tool that transforms source code into a Property Graph. It parses source code into an Abstract Syntax Tree (AST), extracts entities and relationships from that tree, and builds a queryable code knowledge graph.

This architecture enables AI tools to understand code structure, call chains, and dependency relationships through graph queries.


Overall Architecture

Source Files → Parser (Tree-sitter) → AST → Extractor → SQLite Database
                                                               ↓
                                                      Graph Query Engine

Core Data Model

CodeGraph represents code with three tables:

1. Nodes Table — Code Entities (Graph Vertices)

Field | Description
id | Unique identifier, in a format such as function:hash, file:path, or class:hash
kind | Node type: file, class, function, method, constant, variable, interface, type_alias, import
name / qualified_name | Name and fully qualified name
file_path | Source file path
language | Programming language
start_line / end_line | Source code location
docstring / signature | Documentation and function signature
visibility / is_exported / is_async / is_static / is_abstract | Modifier attributes
decorators / type_parameters | JSON arrays for decorators and generic parameters

2. Edges Table — Code Relationships (Graph Edges)

Field | Description
source / target | IDs of the two related nodes
kind | Relationship type
metadata | JSON containing the confidence score and the resolvedBy method
line / col | Source location where the relationship occurs

Relationship Types:

  • contains — File contains nodes (hierarchical structure)
  • calls — Function/method calls
  • references — Variable/type references
  • imports — Module imports
  • instantiates — Class instantiation
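As a rough illustration of how the two tables fit together, the sketch below builds a cut-down version of them in an in-memory SQLite database. The column subset, the column types, and the sample rows are assumptions made for the example, not CodeGraph's actual schema.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE nodes (
    id             TEXT PRIMARY KEY,   -- e.g. function:<hash>, file:<path>
    kind           TEXT NOT NULL,      -- file, class, function, method, ...
    name           TEXT NOT NULL,
    qualified_name TEXT,
    file_path      TEXT NOT NULL,
    start_line     INTEGER,
    end_line       INTEGER
);
CREATE TABLE edges (
    source   TEXT NOT NULL REFERENCES nodes(id) ON DELETE CASCADE,
    target   TEXT NOT NULL REFERENCES nodes(id) ON DELETE CASCADE,
    kind     TEXT NOT NULL,            -- contains, calls, references, ...
    metadata TEXT,                     -- JSON: confidence, resolvedBy
    line     INTEGER,
    col      INTEGER
);
""")

# Sample rows (illustrative values only).
conn.executemany(
    "INSERT INTO nodes VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("file:utils.py", "file", "utils.py", "utils.py", "utils.py", 1, 6),
        ("function:abc123", "function", "helper", "utils.helper", "utils.py", 1, 2),
        ("function:def456", "function", "main", "utils.main", "utils.py", 4, 5),
    ],
)
meta = json.dumps({"confidence": 0.9, "resolvedBy": "exact-match"})
conn.executemany(
    "INSERT INTO edges VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("file:utils.py", "function:abc123", "contains", None, 1, 0),
        ("file:utils.py", "function:def456", "contains", None, 4, 0),
        ("function:def456", "function:abc123", "calls", meta, 5, 4),
    ],
)

edge_kinds = [row[0] for row in conn.execute("SELECT kind FROM edges ORDER BY rowid")]
print(edge_kinds)
```

Storing edges as plain rows keyed by node ID means every relationship type shares one table, so a new edge kind needs no schema change.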

3. Files Table — File Tracking

Records each file’s content_hash, node_count, modified_at, and related fields, so that incremental updates re-parse only the files that changed.


Parsing Workflow

1. Scan project directory, traverse all source files
2. Calculate content_hash for each file, compare with files table
    Skip unmodified files (incremental indexing)
3. For files needing updates, use Tree-sitter to parse and generate AST
4. Extract from AST:
   - Nodes (classes, functions, methods, variables, constants, interfaces, etc.)
   - Edges (call relationships, references, containment, imports)
5. Store unresolved references in unresolved_refs table
6. Attempt to resolve unresolved_refs (cross-file reference resolution)
7. Write to SQLite database
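Step 4 can be sketched in a few lines. This example uses Python's built-in ast module as a stand-in for Tree-sitter, so it only handles Python source. The extract function and the name-based placeholder IDs are hypothetical; in the workflow above, a call target that has not been resolved yet would go to unresolved_refs rather than straight into an edge.

```python
import ast

SOURCE = '''
def helper():
    pass

def main():
    helper()
'''

def extract(source: str, file_path: str):
    """Return (nodes, edges) for top-level functions and the calls they make."""
    tree = ast.parse(source)
    file_id = f"file:{file_path}"
    nodes = [{"id": file_id, "kind": "file", "name": file_path}]
    edges = []
    for item in tree.body:
        if not isinstance(item, ast.FunctionDef):
            continue
        func_id = f"function:{item.name}"  # placeholder ID, not a real hash
        nodes.append({
            "id": func_id,
            "kind": "function",
            "name": item.name,
            "start_line": item.lineno,
            "end_line": item.end_lineno,
        })
        edges.append((file_id, func_id, "contains"))
        # Record every direct call to a plain name inside the function body.
        for sub in ast.walk(item):
            if isinstance(sub, ast.Call) and isinstance(sub.func, ast.Name):
                edges.append((func_id, f"function:{sub.func.id}", "calls"))
    return nodes, edges

nodes, edges = extract(SOURCE, "utils.py")
for edge in edges:
    print(edge)
```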

Incremental Updates and Deletion Synchronization

CodeGraph automatically synchronizes the graph with code changes in three scenarios: addition, modification, and deletion.

1. Adding Code

When adding new source files or code entities:

  • Scan phase discovers new file → creates file node
  • Parse AST → extracts new function, class, variable nodes
  • Establish relationships → generates contains, calls edges
  • Unresolved cross-file references stored in unresolved_refs table

2. Modifying Code

Modified files are detected by comparing content hashes:

-- files table tracks hash for each file
content_hash TEXT NOT NULL,  -- SHA-256 hash of file content
modified_at INTEGER NOT NULL, -- File modification time
indexed_at INTEGER NOT NULL   -- Last indexing time

Update workflow:

1. Calculate current file's content_hash
2. Compare with stored hash in files table
3. If hash differs → file has been modified
4. Delete old nodes and edges for that file (cascade delete)
5. Re-parse file, generate new nodes and edges
6. Update files table with new hash and timestamp
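The six steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the path column, the reduced nodes table, reindex_if_changed, and toy_parse (which just collects top-level def names) are all hypothetical stand-ins for the real schema and parser.

```python
import hashlib
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE files (
    path         TEXT PRIMARY KEY,
    content_hash TEXT NOT NULL,
    indexed_at   INTEGER NOT NULL
);
CREATE TABLE nodes (
    id        TEXT PRIMARY KEY,
    name      TEXT NOT NULL,
    file_path TEXT NOT NULL
);
""")

def toy_parse(content: str) -> list:
    """Toy parser: names of top-level `def` statements."""
    return [line.split("(")[0][4:].strip()
            for line in content.splitlines() if line.startswith("def ")]

def reindex_if_changed(path: str, content: str, parse=toy_parse) -> bool:
    """Re-index `path` only when its content hash differs from the stored one."""
    new_hash = hashlib.sha256(content.encode("utf-8")).hexdigest()
    row = conn.execute(
        "SELECT content_hash FROM files WHERE path = ?", (path,)
    ).fetchone()
    if row is not None and row[0] == new_hash:
        return False  # unchanged: skip parsing entirely
    with conn:  # delete, re-insert, and hash update commit together
        conn.execute("DELETE FROM nodes WHERE file_path = ?", (path,))
        conn.executemany(
            "INSERT INTO nodes (id, name, file_path) VALUES (?, ?, ?)",
            [(f"function:{name}", name, path) for name in parse(content)],
        )
        conn.execute(
            "INSERT OR REPLACE INTO files (path, content_hash, indexed_at) VALUES (?, ?, ?)",
            (path, new_hash, int(time.time())),
        )
    return True

v1 = "def helper():\n    pass\n"
v2 = "def helper():\n    pass\n\ndef main():\n    helper()\n"
first = reindex_if_changed("utils.py", v1)   # new file
second = reindex_if_changed("utils.py", v1)  # same content
third = reindex_if_changed("utils.py", v2)   # modified content
names = [r[0] for r in conn.execute("SELECT name FROM nodes ORDER BY name")]
print(first, second, third, names)
```

Running the delete and re-insert inside one transaction keeps the graph from being observed in a half-updated state if indexing fails partway through.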

3. Deleting Code

Leverages SQLite’s ON DELETE CASCADE constraints for automatic cleanup:

-- edges table foreign key constraints
FOREIGN KEY (source) REFERENCES nodes(id) ON DELETE CASCADE
FOREIGN KEY (target) REFERENCES nodes(id) ON DELETE CASCADE

-- unresolved_refs table foreign key constraints
FOREIGN KEY (from_node_id) REFERENCES nodes(id) ON DELETE CASCADE

Deletion workflow:

1. Scan discovers file deleted → remove from files table
2. Delete all nodes for that file (cascade deletes edges)
3. Or: a function in a file is deleted
   → Delete the function node
   → Automatically delete all edges containing that node
   → Automatically delete related unresolved_refs
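The cascade behavior is easy to check in isolation. One detail worth knowing is that SQLite does not enforce foreign keys unless PRAGMA foreign_keys = ON is set on the connection, so the constraints above have no effect without it. The reduced two-column nodes table here is an assumption made to keep the sketch short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default
conn.executescript("""
CREATE TABLE nodes (id TEXT PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE edges (
    source TEXT NOT NULL,
    target TEXT NOT NULL,
    kind   TEXT NOT NULL,
    FOREIGN KEY (source) REFERENCES nodes(id) ON DELETE CASCADE,
    FOREIGN KEY (target) REFERENCES nodes(id) ON DELETE CASCADE
);
INSERT INTO nodes VALUES ('function:abc123', 'helper'), ('function:def456', 'main');
INSERT INTO edges VALUES ('function:def456', 'function:abc123', 'calls');
""")

# Deleting the helper node also removes the edge that points at it.
conn.execute("DELETE FROM nodes WHERE id = 'function:abc123'")
conn.commit()

remaining_edges = conn.execute("SELECT COUNT(*) FROM edges").fetchone()[0]
remaining_nodes = [r[0] for r in conn.execute("SELECT name FROM nodes")]
print(remaining_edges, remaining_nodes)
```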

4. Practical Example

Suppose a project has the following code:

# utils.py (already indexed)
def helper():    # node_id: function:abc123
    pass

def main():      # node_id: function:def456
    helper()     # edge: function:def456 → function:abc123 (calls)

Scenario A: Deleting helper() function

# Remove helper() from utils.py, then re-index
codegraph --update

  • Deletes node function:abc123
  • Automatically deletes edge function:def456 → function:abc123
  • main() is preserved, but its calls edge to helper no longer exists

Scenario B: Modifying main() function

def main():      # same node_id: function:def456
    print("updated")

  • content_hash changes → triggers re-indexing
  • Re-parses the file to generate new nodes and edges
  • Old relationships are replaced with new ones

Key Technical Choices

1. FTS5 Full-Text Search

The nodes_fts virtual table, built on FTS5, maintains a full-text index over name, qualified_name, docstring, and signature, supporting fast keyword and prefix search for code symbols.
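A minimal sketch of such an index, assuming the Python build's SQLite was compiled with FTS5 (most are). The four indexed columns come from the description above; the sample rows, including the docstrings, are invented for illustration, and whether CodeGraph links nodes_fts to the nodes table as an external-content table is not something this sketch assumes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE VIRTUAL TABLE nodes_fts USING fts5(name, qualified_name, docstring, signature)"
)
conn.executemany(
    "INSERT INTO nodes_fts VALUES (?, ?, ?, ?)",
    [
        ("helper", "utils.helper", "Format a timestamp for display.", "def helper()"),
        ("main", "utils.main", "Entry point.", "def main()"),
    ],
)

# Keyword match on any indexed column.
exact = [r[0] for r in conn.execute(
    "SELECT name FROM nodes_fts WHERE nodes_fts MATCH ?", ("timestamp",))]

# Prefix match: a trailing * matches any token starting with "help".
prefix = [r[0] for r in conn.execute(
    "SELECT name FROM nodes_fts WHERE nodes_fts MATCH ?", ("help*",))]

print(exact, prefix)
```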

2. Content Hash-Based Incremental Updates

files.content_hash prevents redundant parsing of unmodified files, and nodes.updated_at timestamps track node update times.

3. Confidence Mechanism

{"confidence": 0.9, "resolvedBy": "exact-match"}

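Each resolved reference records a confidence score and the strategy that produced it. In the example above, resolvedBy is exact-match, meaning the reference matched its target exactly. The scores make it possible to prioritize among results from different resolution strategies.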
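One way such scores might be used is to keep only the highest-confidence candidate for each reference. The sketch below is purely illustrative: the name-only strategy, the second candidate, the 0.6 threshold, and pick_best are hypothetical and not taken from CodeGraph.

```python
import json

# Hypothetical candidates for one unresolved call to `helper`.
candidates = [
    {"target": "function:abc123", "confidence": 0.9, "resolvedBy": "exact-match"},
    {"target": "function:999999", "confidence": 0.5, "resolvedBy": "name-only"},
]

def pick_best(candidates, threshold=0.6):
    """Return (target, metadata JSON) for the best candidate, or None."""
    best = max(candidates, key=lambda c: c["confidence"], default=None)
    if best is None or best["confidence"] < threshold:
        return None  # nothing trustworthy enough: leave the reference unresolved
    metadata = json.dumps(
        {"confidence": best["confidence"], "resolvedBy": best["resolvedBy"]}
    )
    return best["target"], metadata

result = pick_best(candidates)
print(result)
```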

4. Node ID Generation Strategy

  • File nodes: file:relative_path
  • Other nodes: type:content_hash (e.g., function:fcddbf01897d11636307b7ab1f47aa5c)
  • Import nodes: import:hash
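A sketch of how IDs in this shape could be produced. The source does not say which inputs are hashed or which hash function is used, so hashing the file path plus qualified name with SHA-256 and truncating to 32 hex characters (the length of the example ID above) is an assumption, as is the node_id helper itself.

```python
import hashlib

def node_id(kind: str, file_path: str, qualified_name: str = "") -> str:
    """Build a node ID: file:<path> for files, <kind>:<hash> otherwise."""
    if kind == "file":
        return f"file:{file_path}"
    # Assumed hash input: file path plus qualified name.
    digest = hashlib.sha256(f"{file_path}:{qualified_name}".encode("utf-8")).hexdigest()
    return f"{kind}:{digest[:32]}"

file_node = node_id("file", "src/utils.py")
func_node = node_id("function", "src/utils.py", "utils.helper")
print(file_node)
print(func_node)
```

Deriving the ID from stable inputs rather than a counter means re-indexing the same entity yields the same ID, which lets edges from other files keep pointing at it.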

Query Examples

This graph structure supports rich code queries:

-- Find all functions called by a specific function
SELECT t.name FROM edges e JOIN nodes t ON e.target = t.id 
WHERE e.source = 'function:xxx' AND e.kind = 'calls'

-- Full-text search for code symbols
SELECT * FROM nodes_fts WHERE nodes_fts MATCH 'search'

-- Find all nodes contained in a file
SELECT n.name, n.kind FROM edges e JOIN nodes n ON e.target = n.id 
WHERE e.source = 'file:xxx' AND e.kind = 'contains'

-- Find who calls a specific method
SELECT s.name FROM edges e JOIN nodes s ON e.source = s.id 
WHERE e.target = 'method:xxx' AND e.kind = 'calls'
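The queries above each follow a single hop. A full call chain can be followed with a recursive common table expression, which SQLite supports. The sketch below runs one from Python against a tiny hand-built graph; the three sample functions and the depth cap of 10 (a simple guard against call cycles) are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE nodes (id TEXT PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE edges (source TEXT NOT NULL, target TEXT NOT NULL, kind TEXT NOT NULL);
INSERT INTO nodes VALUES
    ('function:a', 'main'), ('function:b', 'load'), ('function:c', 'parse');
INSERT INTO edges VALUES
    ('function:a', 'function:b', 'calls'),
    ('function:b', 'function:c', 'calls');
""")

# Everything reachable from one function via `calls` edges, with hop count.
CALL_CHAIN = """
WITH RECURSIVE reachable(id, depth) AS (
    SELECT target, 1 FROM edges WHERE source = ? AND kind = 'calls'
    UNION
    SELECT e.target, r.depth + 1
    FROM edges e JOIN reachable r ON e.source = r.id
    WHERE e.kind = 'calls' AND r.depth < 10
)
SELECT n.name, MIN(r.depth)
FROM reachable r JOIN nodes n ON n.id = r.id
GROUP BY n.id
ORDER BY MIN(r.depth)
"""

chain = conn.execute(CALL_CHAIN, ("function:a",)).fetchall()
print(chain)
```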

Real-World Data Example

Using the current blog project as an example, CodeGraph generates:

Metric | Value
Files | 74
Code Nodes | 167
Relationship Edges | 289
Unresolved References | 0

Node Type Distribution:

  • method (50)
  • function (41)
  • file (19)
  • import (16)
  • constant (12)
  • variable (11)
  • interface (9)
  • class (6)
  • type_alias (3)

Summary

CodeGraph’s core approach is to parse source code with Tree-sitter, which provides multi-language AST parsing, and to store the result as a property graph in SQLite. Incremental updates and full-text indexing keep code comprehension and retrieval efficient.

This architecture gives AI code assistants structured context about a codebase, enabling:

  • Fast code definition and reference location
  • Understanding of function call chains and dependency relationships
  • Cross-file code navigation support
  • Semantic-based code search