# NSG RAG Index System - Complete Guide
**Date:** 2026-01-29
**Purpose:** Enable Claude to search codebases before coding (Ready-Aim-Fire)
---
## Why This Exists
Without access to the codebase, Claude guesses at solutions and chases symptoms for hours. With RAG indexing, Claude can search the code first, understand the architecture, then make one correct fix.
**Example:** ATS2 phone number bug. Instead of 3 hours debugging, Claude searched `phone numericString substring`, found 4 files with the same code, and fixed all in 5 minutes.
---
## Architecture Overview
```
┌─────────────────────────────────────────────────────────────┐
│                        api.nsgia.com                        │
│                     (nginx → gunicorn)                      │
├─────────────────────────────────────────────────────────────┤
│                      api_proxy_app.py                       │
│                      (Flask main app)                       │
├─────────────────────────────────────────────────────────────┤
│                index_search_api.py (routes)                 │
│   ┌──────────────┬───────────────┬──────────────┐           │
│   │ index_db.py  │index_vector.py│ index_auth.py│           │
│   │ (FTS5 search)│ (semantic)    │ (magic keys) │           │
│   └──────────────┴───────────────┴──────────────┘           │
├─────────────────────────────────────────────────────────────┤
│            /indexes/ (SQLite DBs + registry.db)             │
│   ┌─────────┬─────────┬─────────┬─────────┬─────────┐       │
│   │ aburi.db│chats.db │ 2tbu.db │ ats2.db │2025dl.db│       │
│   └─────────┴─────────┴─────────┴─────────┴─────────┘       │
└─────────────────────────────────────────────────────────────┘
```
---
## Index Schema Types
The system supports three schema types, auto-detected by `index_vector.py`:
| Schema | Tables | Use Case |
|--------|--------|----------|
| **Standard (files)** | files, file_lookup, file_vectors, vector_map | Small codebases (<1000 files) |
| **Chunked** | files, file_lookup, file_chunks, chunk_vectors, chunk_map | Large archives, PDFs, documents |
| **Chats** | messages, conversations, message_vectors | Claude conversation exports |
**Schema detection logic:**
- `chunk_map` table exists OR `vector_table = 'chunk_vectors'` → chunked
- `messages` table exists → chats
- Otherwise → standard (files)
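The detection rules above can be sketched as a small helper. This is an illustrative sketch, not the actual code in `index_vector.py`; the `detect_schema` name and signature are assumptions:

```python
import sqlite3

def detect_schema(conn, vector_table=None):
    """Classify an index DB as 'chunked', 'chats', or 'standard'."""
    # Virtual tables (FTS5, vss0) also appear in sqlite_master as type 'table'
    tables = {row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")}
    if 'chunk_map' in tables or vector_table == 'chunk_vectors':
        return 'chunked'
    if 'messages' in tables:
        return 'chats'
    return 'standard'
```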
---
## Key Components
### 1. Registry Database
**Location:** `/var/www/api.nsgia.com/indexes/registry.db`
```sql
CREATE TABLE index_registry (
id TEXT PRIMARY KEY, -- 'ats2', 'aburi', '2025downloads'
name TEXT NOT NULL, -- 'ATS2 Applicant Tracking'
db_path TEXT NOT NULL, -- '/var/www/api.nsgia.com/indexes/ats2.db'
index_type TEXT, -- 'website', 'filesystem', 'conversations'
fts_table TEXT NOT NULL, -- 'files' or 'messages'
lookup_table TEXT, -- 'file_lookup' or 'conversations'
vector_table TEXT, -- 'file_vectors', 'chunk_vectors', 'message_vectors'
vector_enabled INTEGER, -- 0 or 1
vector_type TEXT, -- 'vss' (use this, not numpy)
status TEXT DEFAULT 'active',
owner TEXT,
created_at TEXT
);
```
**Key field for chunked indexes:** Set `vector_table = 'chunk_vectors'` to signal chunked schema.
### 2. Permissions (MySQL)
**Database:** `arubi`
**Tables:** `index_users`, `index_permissions`
```sql
-- Users with magic keys
CREATE TABLE index_users (
id INT PRIMARY KEY AUTO_INCREMENT,
username VARCHAR(50),
password_hash VARCHAR(255),
email VARCHAR(100),
magic_key VARCHAR(64) UNIQUE, -- 64-char hex key for API access
active BOOLEAN DEFAULT TRUE
);
-- Which indexes each user can access
CREATE TABLE index_permissions (
id INT PRIMARY KEY AUTO_INCREMENT,
user_id INT,
index_id VARCHAR(50), -- matches registry.id
granted_at DATETIME
);
```
**User IDs:** 1 = api (main), 3 = claude, 4 = jey
**Claude's magic key:**
```
c46478591f96d3197bd135340531f5d53738494eed2bd4b5cc16a1eaf6d3f384
```
### 3. Standard Index Schema
For codebases and smaller file collections:
```sql
-- FTS5 for keyword search
CREATE VIRTUAL TABLE files USING fts5(
filepath, filename, content,
tokenize="porter unicode61"
);
-- Fast filepath lookup
CREATE TABLE file_lookup (
rowid INTEGER PRIMARY KEY,
filepath TEXT
);
CREATE INDEX idx_filepath ON file_lookup(filepath);
-- Vector search (384 dim for all-MiniLM-L6-v2)
CREATE VIRTUAL TABLE file_vectors USING vss0(embedding(384));
CREATE TABLE vector_map (
vector_rowid INTEGER PRIMARY KEY,
filepath TEXT
);
```
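A keyword query against this schema looks like the following. This is a minimal sketch using only the FTS5 side with Python's stdlib `sqlite3`; the vector side additionally requires the `sqlite-vss` extension to be loaded:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE VIRTUAL TABLE files USING fts5('
             'filepath, filename, content, tokenize="porter unicode61")')
conn.execute("INSERT INTO files VALUES "
             "('app/phone.php', 'phone.php', 'strip non-digits from phone input')")

# FTS5 MATCH with bm25 ranking; snippet() highlights the hit in content (col 2)
rows = conn.execute(
    "SELECT filepath, snippet(files, 2, '[', ']', '...', 8) "
    "FROM files WHERE files MATCH ? ORDER BY rank LIMIT ?",
    ('phone', 10)).fetchall()
```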
### 4. Chunked Index Schema
For large archives with PDFs, documents, and mixed content:
```sql
-- FTS5 for keyword search (full content)
CREATE VIRTUAL TABLE files USING fts5(
filepath, filename, content,
tokenize="porter unicode61"
);
-- File metadata
CREATE TABLE file_lookup (
rowid INTEGER PRIMARY KEY,
filepath TEXT,
file_type TEXT, -- 'text', 'pdf', 'docx', 'xlsx', 'binary'
has_content INTEGER -- 1 if extracted, 0 for binary
);
CREATE INDEX idx_filepath ON file_lookup(filepath);
-- Chunk storage
CREATE TABLE file_chunks (
id INTEGER PRIMARY KEY,
filepath TEXT,
chunk_index INTEGER,
chunk_text TEXT
);
CREATE INDEX idx_chunk_filepath ON file_chunks(filepath);
-- VSS vectors for chunks
CREATE VIRTUAL TABLE chunk_vectors USING vss0(embedding(384));
-- Vector to chunk mapping
CREATE TABLE chunk_map (
vector_rowid INTEGER PRIMARY KEY,
filepath TEXT,
chunk_index INTEGER,
chunk_text TEXT
);
CREATE INDEX idx_chunk_map_filepath ON chunk_map(filepath);
```
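Because several chunks of one file can match the same query, chunk-level hits are typically collapsed to one result per file before display. A sketch of that dedup step in pure Python; the row shape is an assumption based on the `chunk_map` columns plus a vss distance score:

```python
def group_chunk_hits(hits):
    """hits: iterable of (filepath, chunk_index, chunk_text, distance)
    rows, where smaller distance = better match.
    Keeps each file's best-scoring chunk, best file first."""
    best = {}
    for filepath, chunk_index, chunk_text, distance in hits:
        if filepath not in best or distance < best[filepath][2]:
            best[filepath] = (chunk_index, chunk_text, distance)
    # one row per file, sorted by distance ascending
    return sorted(((fp,) + v for fp, v in best.items()), key=lambda r: r[3])
```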
### 5. Lightweight Metadata Table (Optional)
For fast browsing without loading full content:
```sql
CREATE TABLE file_meta AS
SELECT
fl.filepath,
fl.filename,
fl.file_type,
fl.has_content,
(SELECT COUNT(*) FROM file_chunks fc WHERE fc.filepath = fl.filepath) as chunk_count
FROM file_lookup fl;
CREATE INDEX idx_meta_filepath ON file_meta(filepath);
CREATE INDEX idx_meta_type ON file_meta(file_type);
```
---
## How To Add A New Index
### Step 1, Option A: Standard Index (Codebases)
For source code and small file collections:
```bash
cd /var/www/api.nsgia.com
source api_venv/bin/activate
python3 << 'EOF'
import sqlite3
import sqlite_vss
import json
import os
from sentence_transformers import SentenceTransformer
# CONFIGURE THESE
INDEX_ID = 'myindex'
INDEX_NAME = 'My Index Name'
DB_PATH = '/var/www/api.nsgia.com/indexes/myindex.db'
SOURCE = '/var/www/html/mysite.nsgia.com'
SKIP_DIRS = {'vendor', 'node_modules', '.git', 'storage', '__pycache__'}
FILE_EXTENSIONS = {'.php', '.js', '.css', '.py', '.html'}
print("Loading embedding model...")
model = SentenceTransformer('all-MiniLM-L6-v2')
conn = sqlite3.connect(DB_PATH)
conn.enable_load_extension(True)
sqlite_vss.load(conn)
# Create tables
conn.execute('CREATE VIRTUAL TABLE files USING fts5(filepath, filename, content, tokenize="porter unicode61")')
conn.execute('CREATE TABLE file_lookup (rowid INTEGER PRIMARY KEY, filepath TEXT)')
conn.execute('CREATE INDEX idx_filepath ON file_lookup(filepath)')
conn.execute('CREATE VIRTUAL TABLE file_vectors USING vss0(embedding(384))')
conn.execute('CREATE TABLE vector_map (vector_rowid INTEGER PRIMARY KEY, filepath TEXT)')
count = 0
for root, dirs, files in os.walk(SOURCE):
dirs[:] = [d for d in dirs if d not in SKIP_DIRS]
for f in files:
if any(f.endswith(e) for e in FILE_EXTENSIONS):
path = os.path.join(root, f)
rel = os.path.relpath(path, SOURCE)
try:
                with open(path, encoding='utf-8', errors='ignore') as fh:
                    content = fh.read()
conn.execute('INSERT INTO files (filepath, filename, content) VALUES (?,?,?)', (rel, f, content))
rowid = conn.execute('SELECT last_insert_rowid()').fetchone()[0]
conn.execute('INSERT INTO file_lookup (rowid, filepath) VALUES (?,?)', (rowid, rel))
count += 1
                emb = model.encode(content[:8000])
                conn.execute('INSERT INTO file_vectors(rowid, embedding) VALUES (?,?)', (rowid, json.dumps(emb.tolist())))
                conn.execute('INSERT INTO vector_map (vector_rowid, filepath) VALUES (?,?)', (rowid, rel))
if count % 50 == 0:
conn.commit()
print(f" {count} files...")
except Exception as e:
print(f"Error: {rel}: {e}")
conn.commit()
print(f'Indexed {count} files')
conn.close()
# Add to registry
reg = sqlite3.connect('/var/www/api.nsgia.com/indexes/registry.db')
reg.execute("""
INSERT OR REPLACE INTO index_registry
(id, name, db_path, index_type, fts_table, lookup_table, vector_table, vector_enabled, vector_type, status)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
""", (INDEX_ID, INDEX_NAME, DB_PATH, 'website', 'files', 'file_lookup', 'file_vectors', 1, 'vss', 'active'))
reg.commit()
reg.close()
EOF
```
### Step 1, Option B: Chunked Index (Large Archives)
For large file collections with PDFs and documents, see `/var/www/api.nsgia.com/index_2025downloads.py` for a complete example.
**Key differences:**
- Extracts text from PDFs (pdfplumber), DOCX (python-docx), XLSX (openpyxl)
- Chunks content into 4000-char segments with 500-char overlap
- Embeds each chunk separately for granular semantic search
- Binary files (images, video) indexed by path only
**Chunking parameters:**
```python
CHUNK_SIZE = 4000 # chars per chunk
CHUNK_OVERLAP = 500 # overlap for context continuity
MIN_CHUNK_SIZE = 200 # skip tiny fragments
```
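A minimal sketch of the chunking step under these parameters (the `chunk_text` name is an assumption; see `index_2025downloads.py` for the real implementation):

```python
CHUNK_SIZE = 4000       # chars per chunk
CHUNK_OVERLAP = 500     # overlap for context continuity
MIN_CHUNK_SIZE = 200    # skip tiny fragments

def chunk_text(text, size=CHUNK_SIZE, overlap=CHUNK_OVERLAP, min_size=MIN_CHUNK_SIZE):
    """Split text into overlapping fixed-size chunks."""
    chunks = []
    start = 0
    while start < len(text):
        chunk = text[start:start + size]
        # skip tiny trailing fragments, but always keep at least one chunk
        if len(chunk) >= min_size or not chunks:
            chunks.append(chunk)
        start += size - overlap  # step back by the overlap
    return chunks
```

Each returned chunk is embedded and inserted into `chunk_vectors`, with its text and position recorded in `chunk_map`.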
### Step 2: Add MySQL Permissions
```bash
mysql -u root -p arubi -e "
INSERT INTO index_permissions (user_id, index_id, granted_at)
VALUES
(1, 'NEW_INDEX_ID', NOW()),
(3, 'NEW_INDEX_ID', NOW()),
(4, 'NEW_INDEX_ID', NOW());
"
```
### Step 3: Restart API
```bash
pkill -f "gunicorn.*40003"
sleep 2
cd /var/www/api.nsgia.com
gunicorn --bind 127.0.0.1:40003 --workers 1 --timeout 90 api_proxy_app:app --daemon
```
### Step 4: Test
```bash
# List indexes
curl -s "https://api.nsgia.com/index/k/c46478591f96d3197bd135340531f5d53738494eed2bd4b5cc16a1eaf6d3f384/plain/list"
# Keyword search
curl -s "https://api.nsgia.com/index/k/c46478591f96d3197bd135340531f5d53738494eed2bd4b5cc16a1eaf6d3f384/plain/search/INDEX_ID/searchterm/10"
# Semantic search
curl -s "https://api.nsgia.com/index/k/c46478591f96d3197bd135340531f5d53738494eed2bd4b5cc16a1eaf6d3f384/plain/vector/search/INDEX_ID/searchterm/10"
# Get file content
curl -s "https://api.nsgia.com/index/k/c46478591f96d3197bd135340531f5d53738494eed2bd4b5cc16a1eaf6d3f384/content/INDEX_ID/path/to/file.php"
```
---
## API Endpoints
**Base URL:** `https://api.nsgia.com/index/k/{magic_key}/`
| Endpoint | Purpose |
|----------|---------|
| `plain/list` | List all indexes user can access |
| `plain/search/{query}/{limit}` | Search ALL indexes (FTS5) |
| `plain/search/{index_id}/{query}/{limit}` | Search specific index (FTS5) |
| `plain/vector/search/{query}/{limit}` | Semantic search ALL indexes |
| `plain/vector/search/{index_id}/{query}/{limit}` | Semantic search specific index |
| `content/{index_id}/{filepath}` | Get full file content |
| `vector/status` | Check vector status per index |
**Note:** Claude's web_fetch tool has permission restrictions. Use bash_tool with curl instead:
```bash
curl -s 'https://api.nsgia.com/index/k/c46478591f96d3197bd135340531f5d53738494eed2bd4b5cc16a1eaf6d3f384/plain/search/ats2/phone/10'
```
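For scripted access, the URL patterns above can be assembled with a small helper. This is a sketch: `search_url` is a hypothetical name, and it only builds the URL; pair it with curl or any HTTP client:

```python
from urllib.parse import quote

BASE = 'https://api.nsgia.com/index/k/{key}/'

def search_url(magic_key, query, limit=10, index_id=None, semantic=False):
    """Build a keyword or semantic search URL per the endpoint table."""
    parts = ['plain']
    if semantic:
        parts.append('vector')   # plain/vector/search/...
    parts.append('search')
    if index_id:
        parts.append(index_id)   # scope to a single index
    parts += [quote(query, safe=''), str(limit)]
    return BASE.format(key=magic_key) + '/'.join(parts)
```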
---
## Current Indexes (2026-01-29)
| ID | Name | Files | Schema | Vector | Size | Source |
|----|------|-------|--------|--------|------|--------|
| api | API Codebase | 31 | standard | vss | ~2MB | /var/www/api.nsgia.com |
| ats2 | ATS2 Applicant Tracking | 444 | standard | vss | ~50MB | /var/www/html/ats2.nsgia.com |
| aburi | Aburi Website | ~5K | standard | vss | ~100MB | /var/www/html/aburi.nsgia.com |
| chats | Claude Conversations | 71K msgs | chats | vss | ~500MB | Claude exports |
| 2tbu | 2TB USB Archive | 6M | standard | no | ~2GB | /media/mh/2TBU |
| 2025downloads | 2025 Downloads | ~100K+ | chunked | vss | ~1.5GB+ | /media/mh/dropbox-backup/Downloads |
---
## Key Files on Server
```
/var/www/api.nsgia.com/
├── api_proxy_app.py          # Main Flask app
├── index_search_api.py       # HTTP routes
├── index_db.py               # FTS5 search, schema detection
├── index_vector.py           # Semantic search (standard + chunked)
├── index_auth.py             # Magic key auth via MySQL
├── index_api.py              # Script: index API codebase
├── index_2025downloads.py    # Script: chunked index for Downloads
├── api_venv/                 # Python venv
│   └── (flask, gunicorn, sentence-transformers, sqlite-vss, pdfplumber, python-docx, openpyxl)
├── indexes/
│   ├── registry.db           # Index registry (SQLite)
│   ├── api.db                # API codebase index
│   ├── ats2.db               # ATS2 index
│   ├── aburi.db              # Aburi index
│   ├── claude_chats.db       # Conversations index
│   ├── 2tbu_v2.db            # 2TB USB index
│   └── 2025downloads.db      # Downloads chunked index
└── templates/
    └── index_search_key.html # Browser UI
```
---
## Troubleshooting
### "Access denied" on search
Index not in MySQL permissions:
```bash
mysql -u root -p arubi -e "SELECT * FROM index_permissions WHERE index_id='your_index';"
# If empty, add permissions (see Step 2)
```
### Index not appearing in list
Not in registry.db:
```bash
python3 -c "import sqlite3; print([r[0] for r in sqlite3.connect('/var/www/api.nsgia.com/indexes/registry.db').execute('SELECT id FROM index_registry')])"
```
API not restarted:
```bash
pkill -f "gunicorn.*40003" && sleep 2 && cd /var/www/api.nsgia.com && gunicorn --bind 127.0.0.1:40003 --workers 1 --timeout 90 api_proxy_app:app --daemon
```
### Venv corrupted
Symptom: `ModuleNotFoundError: No module named 'encodings'`
Fix:
```bash
cd /var/www/api.nsgia.com
rm -rf api_venv
python3 -m venv api_venv
source api_venv/bin/activate
pip install flask gunicorn requests mysql-connector-python sentence-transformers sqlite-vss pdfplumber python-docx openpyxl numpy
```
### Long-running indexing
Run with low priority:
```bash
nice -n 19 python3 index_script.py > indexing.log 2>&1 &
```
Monitor progress:
```bash
ls -lh /var/www/api.nsgia.com/indexes/yourindex.db
```
---
## Lesson Learned
**Before indexing:** 3+ hours debugging ATS2 document upload, chasing symptoms
**After indexing:** 5 minutes to find phone number bug across 4 files
**The pattern:** Index the codebase → Search before coding → Understand architecture → Fix once
This is Ready-Aim-Fire applied to unfamiliar codebases.