# NSG RAG Index System - Complete Guide
**Date:** 2026-01-29
**Purpose:** Enable Claude to search codebases before coding (Ready-Aim-Fire)
---
## Why This Exists
Without access to the codebase, Claude guesses at solutions and chases symptoms for hours. With RAG indexing, Claude can search the code first, understand the architecture, then make one correct fix.
**Example:** ATS2 phone number bug. Instead of 3 hours debugging, Claude searched `phone numericString substring`, found 4 files with the same code, and fixed all in 5 minutes.
---
## Architecture Overview
```
┌─────────────────────────────────────────────────────────────┐
│                        api.nsgia.com                        │
│                     (nginx → gunicorn)                      │
├─────────────────────────────────────────────────────────────┤
│                      api_proxy_app.py                       │
│                      (Flask main app)                       │
├─────────────────────────────────────────────────────────────┤
│                index_search_api.py (routes)                 │
│   ┌──────────────┬───────────────┬──────────────┐           │
│   │ index_db.py  │index_vector.py│ index_auth.py│           │
│   │ (FTS5 search)│ (semantic)    │ (magic keys) │           │
│   └──────────────┴───────────────┴──────────────┘           │
├─────────────────────────────────────────────────────────────┤
│            /indexes/ (SQLite DBs + registry.db)             │
│   ┌─────────┬─────────┬─────────┬─────────┬─────────┐       │
│   │ aburi.db│chats.db │ 2tbu.db │ ats2.db │2025dl.db│       │
│   └─────────┴─────────┴─────────┴─────────┴─────────┘       │
└─────────────────────────────────────────────────────────────┘
```
---
## Index Schema Types
The system supports three schema types, auto-detected by `index_vector.py`:
| Schema | Tables | Use Case |
|--------|--------|----------|
| **Standard (files)** | files, file_lookup, file_vectors, vector_map | Small codebases (<1000 files) |
| **Chunked** | files, file_lookup, file_chunks, chunk_vectors, chunk_map | Large archives, PDFs, documents |
| **Chats** | messages, conversations, message_vectors | Claude conversation exports |
**Schema detection logic:**
- `chunk_map` table exists OR `vector_table = 'chunk_vectors'` → chunked
- `messages` table exists → chats
- Otherwise → standard (files)
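The detection rules above can be sketched as a small helper. This is an illustrative sketch, not the actual code in `index_vector.py`; the `detect_schema` name and signature are assumptions:

```python
import sqlite3

def detect_schema(conn, vector_table=None):
    """Classify an index DB as 'chunked', 'chats', or 'standard'."""
    # Virtual tables (FTS5, vss0) also appear in sqlite_master as type 'table'
    tables = {row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")}
    if 'chunk_map' in tables or vector_table == 'chunk_vectors':
        return 'chunked'
    if 'messages' in tables:
        return 'chats'
    return 'standard'
```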
---
## Key Components
### 1. Registry Database
**Location:** `/var/www/api.nsgia.com/indexes/registry.db`
```sql
CREATE TABLE index_registry (
id TEXT PRIMARY KEY, -- 'ats2', 'aburi', '2025downloads'
name TEXT NOT NULL, -- 'ATS2 Applicant Tracking'
db_path TEXT NOT NULL, -- '/var/www/api.nsgia.com/indexes/ats2.db'
index_type TEXT, -- 'website', 'filesystem', 'conversations'
fts_table TEXT NOT NULL, -- 'files' or 'messages'
lookup_table TEXT, -- 'file_lookup' or 'conversations'
vector_table TEXT, -- 'file_vectors', 'chunk_vectors', 'message_vectors'
vector_enabled INTEGER, -- 0 or 1
vector_type TEXT, -- 'vss' (use this, not numpy)
status TEXT DEFAULT 'active',
owner TEXT,
created_at TEXT
);
```
**Key field for chunked indexes:** Set `vector_table = 'chunk_vectors'` to signal chunked schema.
### 2. Permissions (MySQL)
**Database:** `arubi`
**Tables:** `index_users`, `index_permissions`
```sql
-- Users with magic keys
CREATE TABLE index_users (
id INT PRIMARY KEY AUTO_INCREMENT,
username VARCHAR(50),
password_hash VARCHAR(255),
email VARCHAR(100),
magic_key VARCHAR(64) UNIQUE, -- 64-char hex key for API access
active BOOLEAN DEFAULT TRUE
);
-- Which indexes each user can access
CREATE TABLE index_permissions (
id INT PRIMARY KEY AUTO_INCREMENT,
user_id INT,
index_id VARCHAR(50), -- matches registry.id
granted_at DATETIME
);
```
**User IDs:** 1 = api (main), 3 = claude, 4 = jey
**Claude's magic key:**
```
c46478591f96d3197bd135340531f5d53738494eed2bd4b5cc16a1eaf6d3f384
```
### 3. Standard Index Schema
For codebases and smaller file collections:
```sql
-- FTS5 for keyword search
CREATE VIRTUAL TABLE files USING fts5(
filepath, filename, content,
tokenize="porter unicode61"
);
-- Fast filepath lookup
CREATE TABLE file_lookup (
rowid INTEGER PRIMARY KEY,
filepath TEXT
);
CREATE INDEX idx_filepath ON file_lookup(filepath);
-- Vector search (384 dim for all-MiniLM-L6-v2)
CREATE VIRTUAL TABLE file_vectors USING vss0(embedding(384));
CREATE TABLE vector_map (
vector_rowid INTEGER PRIMARY KEY,
filepath TEXT
);
```
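A keyword query against this schema looks like the following. This is a minimal sketch using only the FTS5 side with Python's stdlib `sqlite3`; the vector side additionally requires the `sqlite-vss` extension to be loaded:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE VIRTUAL TABLE files USING fts5('
             'filepath, filename, content, tokenize="porter unicode61")')
conn.execute("INSERT INTO files VALUES "
             "('app/phone.php', 'phone.php', 'strip non-digits from phone input')")

# FTS5 MATCH with bm25 ranking; snippet() highlights the hit in content (col 2)
rows = conn.execute(
    "SELECT filepath, snippet(files, 2, '[', ']', '...', 8) "
    "FROM files WHERE files MATCH ? ORDER BY rank LIMIT ?",
    ('phone', 10)).fetchall()
```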
### 4. Chunked Index Schema
For large archives with PDFs, documents, and mixed content:
```sql
-- FTS5 for keyword search (full content)
CREATE VIRTUAL TABLE files USING fts5(
filepath, filename, content,
tokenize="porter unicode61"
);
-- File metadata
CREATE TABLE file_lookup (
rowid INTEGER PRIMARY KEY,
filepath TEXT,
file_type TEXT, -- 'text', 'pdf', 'docx', 'xlsx', 'binary'
has_content INTEGER -- 1 if extracted, 0 for binary
);
CREATE INDEX idx_filepath ON file_lookup(filepath);
-- Chunk storage
CREATE TABLE file_chunks (
id INTEGER PRIMARY KEY,
filepath TEXT,
chunk_index INTEGER,
chunk_text TEXT
);
CREATE INDEX idx_chunk_filepath ON file_chunks(filepath);
-- VSS vectors for chunks
CREATE VIRTUAL TABLE chunk_vectors USING vss0(embedding(384));
-- Vector to chunk mapping
CREATE TABLE chunk_map (
vector_rowid INTEGER PRIMARY KEY,
filepath TEXT,
chunk_index INTEGER,
chunk_text TEXT
);
CREATE INDEX idx_chunk_map_filepath ON chunk_map(filepath);
```
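Because several chunks of one file can match the same query, chunk-level hits are typically collapsed to one result per file before display. A sketch of that dedup step in pure Python; the row shape is an assumption based on the `chunk_map` columns plus a vss distance score:

```python
def group_chunk_hits(hits):
    """hits: iterable of (filepath, chunk_index, chunk_text, distance)
    rows, where smaller distance = better match.
    Keeps each file's best-scoring chunk, best file first."""
    best = {}
    for filepath, chunk_index, chunk_text, distance in hits:
        if filepath not in best or distance < best[filepath][2]:
            best[filepath] = (chunk_index, chunk_text, distance)
    # one row per file, sorted by distance ascending
    return sorted(((fp,) + v for fp, v in best.items()), key=lambda r: r[3])
```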
### 5. Lightweight Metadata Table (Optional)
For fast browsing without loading full content:
```sql
CREATE TABLE file_meta AS
SELECT
fl.filepath,
fl.filename,
fl.file_type,
fl.has_content,
(SELECT COUNT(*) FROM file_chunks fc WHERE fc.filepath = fl.filepath) as chunk_count
FROM file_lookup fl;
CREATE INDEX idx_meta_filepath ON file_meta(filepath);
CREATE INDEX idx_meta_type ON file_meta(file_type);
```
---
## How To Add A New Index
### Step 1, Option A: Standard Index (Codebases)
For source code and small file collections:
```bash
cd /var/www/api.nsgia.com
source api_venv/bin/activate
python3 << 'EOF'
import sqlite3
import sqlite_vss
import json
import os
from sentence_transformers import SentenceTransformer
# CONFIGURE THESE
INDEX_ID = 'myindex'
INDEX_NAME = 'My Index Name'
DB_PATH = '/var/www/api.nsgia.com/indexes/myindex.db'
SOURCE = '/var/www/html/mysite.nsgia.com'
SKIP_DIRS = {'vendor', 'node_modules', '.git', 'storage', '__pycache__'}
FILE_EXTENSIONS = {'.php', '.js', '.css', '.py', '.html'}
print("Loading embedding model...")
model = SentenceTransformer('all-MiniLM-L6-v2')
conn = sqlite3.connect(DB_PATH)
conn.enable_load_extension(True)
sqlite_vss.load(conn)
# Create tables
conn.execute('CREATE VIRTUAL TABLE files USING fts5(filepath, filename, content, tokenize="porter unicode61")')
conn.execute('CREATE TABLE file_lookup (rowid INTEGER PRIMARY KEY, filepath TEXT)')
conn.execute('CREATE INDEX idx_filepath ON file_lookup(filepath)')
conn.execute('CREATE VIRTUAL TABLE file_vectors USING vss0(embedding(384))')
conn.execute('CREATE TABLE vector_map (vector_rowid INTEGER PRIMARY KEY, filepath TEXT)')
count = 0
for root, dirs, files in os.walk(SOURCE):
dirs[:] = [d for d in dirs if d not in SKIP_DIRS]
for f in files:
if any(f.endswith(e) for e in FILE_EXTENSIONS):
path = os.path.join(root, f)
rel = os.path.relpath(path, SOURCE)
try:
                with open(path, encoding='utf-8', errors='ignore') as fh:
                    content = fh.read()
conn.execute('INSERT INTO files (filepath, filename, content) VALUES (?,?,?)', (rel, f, content))
rowid = conn.execute('SELECT last_insert_rowid()').fetchone()[0]
conn.execute('INSERT INTO file_lookup (rowid, filepath) VALUES (?,?)', (rowid, rel))
count += 1
                emb = model.encode(content[:8000])
                conn.execute('INSERT INTO file_vectors(rowid, embedding) VALUES (?,?)', (rowid, json.dumps(emb.tolist())))
                conn.execute('INSERT INTO vector_map (vector_rowid, filepath) VALUES (?,?)', (rowid, rel))
if count % 50 == 0:
conn.commit()
print(f" {count} files...")
except Exception as e:
print(f"Error: {rel}: {e}")
conn.commit()
print(f'Indexed {count} files')
conn.close()
# Add to registry
reg = sqlite3.connect('/var/www/api.nsgia.com/indexes/registry.db')
reg.execute("""
INSERT OR REPLACE INTO index_registry
(id, name, db_path, index_type, fts_table, lookup_table, vector_table, vector_enabled, vector_type, status)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
""", (INDEX_ID, INDEX_NAME, DB_PATH, 'website', 'files', 'file_lookup', 'file_vectors', 1, 'vss', 'active'))
reg.commit()
reg.close()
EOF
```
### Step 1, Option B: Chunked Index (Large Archives)
For large file collections with PDFs and documents, see `/var/www/api.nsgia.com/index_2025downloads.py` for a complete example.
**Key differences:**
- Extracts text from PDFs (pdfplumber), DOCX (python-docx), XLSX (openpyxl)
- Chunks content into 4000-char segments with 500-char overlap
- Embeds each chunk separately for granular semantic search
- Binary files (images, video) indexed by path only
**Chunking parameters:**
```python
CHUNK_SIZE = 4000 # chars per chunk
CHUNK_OVERLAP = 500 # overlap for context continuity
MIN_CHUNK_SIZE = 200 # skip tiny fragments
```
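A minimal sketch of the chunking step under these parameters (the `chunk_text` name is an assumption; see `index_2025downloads.py` for the real implementation):

```python
CHUNK_SIZE = 4000       # chars per chunk
CHUNK_OVERLAP = 500     # overlap for context continuity
MIN_CHUNK_SIZE = 200    # skip tiny fragments

def chunk_text(text, size=CHUNK_SIZE, overlap=CHUNK_OVERLAP, min_size=MIN_CHUNK_SIZE):
    """Split text into overlapping fixed-size chunks."""
    chunks = []
    start = 0
    while start < len(text):
        chunk = text[start:start + size]
        # skip tiny trailing fragments, but always keep at least one chunk
        if len(chunk) >= min_size or not chunks:
            chunks.append(chunk)
        start += size - overlap  # step back by the overlap
    return chunks
```

Each returned chunk is embedded and inserted into `chunk_vectors`, with its text and position recorded in `chunk_map`.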
### Step 2: Add MySQL Permissions
```bash
mysql -u root -p arubi -e "
INSERT INTO index_permissions (user_id, index_id, granted_at)
VALUES
(1, 'NEW_INDEX_ID', NOW()),
(3, 'NEW_INDEX_ID', NOW()),
(4, 'NEW_INDEX_ID', NOW());
"
```
### Step 3: Restart API
```bash
pkill -f "gunicorn.*40003"
sleep 2
cd /var/www/api.nsgia.com
gunicorn --bind 127.0.0.1:40003 --workers 1 --timeout 90 api_proxy_app:app --daemon
```
### Step 4: Test
```bash
# List indexes
curl -s "https://api.nsgia.com/index/k/c46478591f96d3197bd135340531f5d53738494eed2bd4b5cc16a1eaf6d3f384/plain/list"
# Keyword search
curl -s "https://api.nsgia.com/index/k/c46478591f96d3197bd135340531f5d53738494eed2bd4b5cc16a1eaf6d3f384/plain/search/INDEX_ID/searchterm/10"
# Semantic search
curl -s "https://api.nsgia.com/index/k/c46478591f96d3197bd135340531f5d53738494eed2bd4b5cc16a1eaf6d3f384/plain/vector/search/INDEX_ID/searchterm/10"
# Get file content
curl -s "https://api.nsgia.com/index/k/c46478591f96d3197bd135340531f5d53738494eed2bd4b5cc16a1eaf6d3f384/content/INDEX_ID/path/to/file.php"
```
---
## API Endpoints
**Base URL:** `https://api.nsgia.com/index/k/{magic_key}/`
| Endpoint | Purpose |
|----------|---------|
| `plain/list` | List all indexes user can access |
| `plain/search/{query}/{limit}` | Search ALL indexes (FTS5) |
| `plain/search/{index_id}/{query}/{limit}` | Search specific index (FTS5) |
| `plain/vector/search/{query}/{limit}` | Semantic search ALL indexes |
| `plain/vector/search/{index_id}/{query}/{limit}` | Semantic search specific index |
| `content/{index_id}/{filepath}` | Get full file content |
| `vector/status` | Check vector status per index |
**Note:** Claude's web_fetch tool has permission restrictions. Use bash_tool with curl instead:
```bash
curl -s 'https://api.nsgia.com/index/k/c46478591f96d3197bd135340531f5d53738494eed2bd4b5cc16a1eaf6d3f384/plain/search/ats2/phone/10'
```
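For scripted access, the URL patterns above can be assembled with a small helper. This is a sketch: `search_url` is a hypothetical name, and it only builds the URL; pair it with curl or any HTTP client:

```python
from urllib.parse import quote

BASE = 'https://api.nsgia.com/index/k/{key}/'

def search_url(magic_key, query, limit=10, index_id=None, semantic=False):
    """Build a keyword or semantic search URL per the endpoint table."""
    parts = ['plain']
    if semantic:
        parts.append('vector')   # plain/vector/search/...
    parts.append('search')
    if index_id:
        parts.append(index_id)   # scope to a single index
    parts += [quote(query, safe=''), str(limit)]
    return BASE.format(key=magic_key) + '/'.join(parts)
```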
---
## Current Indexes (2026-01-29)
| ID | Name | Files | Schema | Vector | Size | Source |
|----|------|-------|--------|--------|------|--------|
| api | API Codebase | 31 | standard | vss | ~2MB | /var/www/api.nsgia.com |
| ats2 | ATS2 Applicant Tracking | 444 | standard | vss | ~50MB | /var/www/html/ats2.nsgia.com |
| aburi | Aburi Website | ~5K | standard | vss | ~100MB | /var/www/html/aburi.nsgia.com |
| chats | Claude Conversations | 71K msgs | chats | vss | ~500MB | Claude exports |
| 2tbu | 2TB USB Archive | 6M | standard | no | ~2GB | /media/mh/2TBU |
| 2025downloads | 2025 Downloads | ~100K+ | chunked | vss | ~1.5GB+ | /media/mh/dropbox-backup/Downloads |
---
## Key Files on Server
```
/var/www/api.nsgia.com/
├── api_proxy_app.py          # Main Flask app
├── index_search_api.py       # HTTP routes
├── index_db.py               # FTS5 search, schema detection
├── index_vector.py           # Semantic search (standard + chunked)
├── index_auth.py             # Magic key auth via MySQL
├── index_api.py              # Script: index API codebase
├── index_2025downloads.py    # Script: chunked index for Downloads
├── api_venv/                 # Python venv
│   └── (flask, gunicorn, sentence-transformers, sqlite-vss, pdfplumber, python-docx, openpyxl)
├── indexes/
│   ├── registry.db           # Index registry (SQLite)
│   ├── api.db                # API codebase index
│   ├── ats2.db               # ATS2 index
│   ├── aburi.db              # Aburi index
│   ├── claude_chats.db       # Conversations index
│   ├── 2tbu_v2.db            # 2TB USB index
│   └── 2025downloads.db      # Downloads chunked index
└── templates/
    └── index_search_key.html # Browser UI
```
---
## Troubleshooting
### "Access denied" on search
Index not in MySQL permissions:
```bash
mysql -u root -p arubi -e "SELECT * FROM index_permissions WHERE index_id='your_index';"
# If empty, add permissions (see Step 2)
```
### Index not appearing in list
Not in registry.db:
```bash
python3 -c "import sqlite3; print([r[0] for r in sqlite3.connect('/var/www/api.nsgia.com/indexes/registry.db').execute('SELECT id FROM index_registry')])"
```
API not restarted:
```bash
pkill -f "gunicorn.*40003" && sleep 2 && cd /var/www/api.nsgia.com && gunicorn --bind 127.0.0.1:40003 --workers 1 --timeout 90 api_proxy_app:app --daemon
```
### Venv corrupted
Symptom: `ModuleNotFoundError: No module named 'encodings'`
Fix:
```bash
cd /var/www/api.nsgia.com
rm -rf api_venv
python3 -m venv api_venv
source api_venv/bin/activate
pip install flask gunicorn requests mysql-connector-python sentence-transformers sqlite-vss pdfplumber python-docx openpyxl numpy
```
### Long-running indexing
Run with low priority:
```bash
nice -n 19 python3 index_script.py > indexing.log 2>&1 &
```
Monitor progress:
```bash
ls -lh /var/www/api.nsgia.com/indexes/yourindex.db
```
---
## Lesson Learned
**Before indexing:** 3+ hours debugging ATS2 document upload, chasing symptoms
**After indexing:** 5 minutes to find phone number bug across 4 files
**The pattern:** Index the codebase → Search before coding → Understand architecture → Fix once
This is Ready-Aim-Fire applied to unfamiliar codebases.