Research Document Processing
An enterprise-grade research document processing platform with semantic search, RAG-based Q&A, and hierarchical chunking for 1000+ page documents.
Key Features
Process large PDFs (1000+ pages) with batched extraction using PyMuPDF
Hierarchical chunking that preserves document structure (sections, subsections, paragraphs)
Semantic search using natural language queries with metadata filtering
RAG-based chat — ask questions and get answers with citations to source documents
Single-document and cross-paper summaries powered by GPT-4
Novelty analysis to identify what's new compared to the existing corpus
Real-time ingestion progress tracking with Celery background workers
Configurable embeddings — switch between local SentenceTransformers and OpenAI
ChromaDB vector store with persistence and metadata filtering
Full-stack with Next.js frontend and FastAPI backend
Docker Compose for one-command deployment of all services
Resumable processing — failed jobs can be retried without re-processing already-completed work
Architecture
┌─────────────────────────────────────────────────────────────────┐
│                    Frontend (Next.js + React)                   │
│           Upload | Library | Search | Chat | Summaries          │
└─────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│                        Backend (FastAPI)                        │
│           Routes → Services → Repositories → Database           │
└─────────────────────────────────────────────────────────────────┘
      │                        │                        │
      ▼                        ▼                        ▼
┌─────────┐          ┌─────────────────┐         ┌──────────────┐
│  Redis  │ ──────── │  Celery Workers │         │  PostgreSQL  │
└─────────┘          └─────────────────┘         └──────────────┘
                              │
                              ▼
                     ┌─────────────────┐
                     │    ChromaDB     │
                     │  (Vector Store) │
                     └─────────────────┘

API Endpoints
| Method | Endpoint | Description |
|---|---|---|
| POST | /api/v1/documents/upload | Upload PDF document |
| GET | /api/v1/documents | List all documents |
| GET | /api/v1/documents/:id | Get document details |
| POST | /api/v1/search | Semantic search across documents |
| POST | /api/v1/chat | RAG chat with citations |
| POST | /api/v1/summarize/document | Single document summary |
| POST | /api/v1/summarize/cross | Cross-paper summary |
| POST | /api/v1/summarize/analyze/last-document | Novelty analysis |
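As a quick sketch of calling these endpoints from Python — note that the base URL and the request-body field names are assumptions for illustration, not taken from the actual route models:

```python
BASE = "http://localhost:8000/api/v1"  # assumed host/port

def search_request(query, filters=None, top_k=5):
    """Build the URL and JSON body for POST /api/v1/search (field names assumed)."""
    return f"{BASE}/search", {"query": query, "top_k": top_k, "filters": filters or {}}

def chat_request(question):
    """Build the URL and JSON body for POST /api/v1/chat (field names assumed)."""
    return f"{BASE}/chat", {"question": question}
```

Each pair can then be sent with `requests.post(url, json=body)` and the JSON response inspected.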
Processing Pipeline
PDF Upload & Extraction
Documents are uploaded via the Next.js frontend. PyMuPDF extracts text while preserving page boundaries and structural elements.
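The batched extraction loop can be sketched as follows — a minimal sketch assuming PyMuPDF (`fitz`) is installed; the per-batch checkpoint comment marks where a resumable worker would persist progress:

```python
def iter_page_batches(num_pages, batch_size=50):
    """Yield (start, end) page ranges so 1000+ page PDFs are processed in batches."""
    for start in range(0, num_pages, batch_size):
        yield start, min(start + batch_size, num_pages)

def extract_batched(pdf_path, batch_size=50):
    """Extract text page by page, keeping page boundaries as metadata."""
    import fitz  # PyMuPDF

    doc = fitz.open(pdf_path)
    pages = []
    for start, end in iter_page_batches(doc.page_count, batch_size):
        for i in range(start, end):
            pages.append({"page": i + 1, "text": doc.load_page(i).get_text("text")})
        # a real Celery worker would checkpoint `end` here so a retry can resume
    doc.close()
    return pages
```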
Hierarchical Chunking
Text is split into hierarchical chunks (sections → subsections → paragraphs) with configurable overlap, preserving document structure and context.
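A minimal sketch of the section → paragraph chunking with overlap; the numbered-heading regex is a naive stand-in for the structural cues a real pipeline would take from the extraction step:

```python
import re

def chunk_paragraphs(text, max_chars=1000, overlap=200):
    """Greedily pack paragraphs into chunks, carrying trailing overlap forward."""
    paras = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, buf = [], ""
    for p in paras:
        if buf and len(buf) + len(p) + 2 > max_chars:
            chunks.append(buf)
            buf = buf[-overlap:]  # trailing context bridges chunk boundaries
        buf = f"{buf}\n\n{p}" if buf else p
    if buf:
        chunks.append(buf)
    return chunks

def chunk_sections(text):
    """Split on numbered headings, then paragraph-chunk each section with its heading."""
    parts = re.split(r"(?m)^(?=\d+(?:\.\d+)*\s+\S)", text)
    out = []
    for part in parts:
        if not part.strip():
            continue
        heading = part.splitlines()[0].strip()
        for c in chunk_paragraphs(part):
            out.append({"section": heading, "text": c})
    return out
```

Keeping the section heading on every chunk is what later enables metadata-filtered retrieval.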
Embedding Generation
Each chunk is embedded using SentenceTransformers (local) or OpenAI embeddings and stored in ChromaDB with rich metadata for filtered retrieval.
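A sketch of the configurable embedding step — the model names, metadata field names, and persistence path are assumptions, not taken from the project's config:

```python
def chunk_metadata(doc_id, section, page):
    """Flat metadata stored alongside each chunk (field names are an assumed schema)."""
    return {"doc_id": doc_id, "section": section, "page": page}

def embed_and_store(chunks, use_openai=False, collection_name="papers"):
    """Embed chunk texts and add them to a persistent ChromaDB collection."""
    import chromadb

    texts = [c["text"] for c in chunks]
    if use_openai:
        from openai import OpenAI  # higher quality, paid
        resp = OpenAI().embeddings.create(model="text-embedding-3-small", input=texts)
        vectors = [d.embedding for d in resp.data]
    else:
        from sentence_transformers import SentenceTransformer  # free, local
        model = SentenceTransformer("all-MiniLM-L6-v2")
        vectors = model.encode(texts).tolist()

    store = chromadb.PersistentClient(path="./chroma")
    coll = store.get_or_create_collection(collection_name)
    coll.add(
        ids=[f"{c['meta']['doc_id']}-{i}" for i, c in enumerate(chunks)],
        documents=texts,
        metadatas=[c["meta"] for c in chunks],
        embeddings=vectors,
    )
```

Because ChromaDB stores the metadata next to each vector, a later query can filter with a `where` clause (e.g. restrict to one `doc_id`) before similarity ranking.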
Semantic Search & RAG
Natural language queries retrieve the most relevant chunks via vector similarity. GPT-4 synthesizes answers with precise citations back to source documents.
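The retrieval and citation mechanics can be sketched in plain Python (in production the similarity ranking is done inside ChromaDB; the prompt wording here is illustrative, not the project's actual prompt):

```python
import math

def top_k(query_vec, chunk_vecs, k=5):
    """Rank chunk indices by cosine similarity to the query embedding."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0

    ranked = sorted(range(len(chunk_vecs)),
                    key=lambda i: cos(query_vec, chunk_vecs[i]), reverse=True)
    return ranked[:k]

def build_prompt(question, chunks):
    """Number each retrieved chunk so the model can cite sources as [n]."""
    ctx = "\n\n".join(f"[{i + 1}] ({c['section']}, p.{c['page']}) {c['text']}"
                      for i, c in enumerate(chunks))
    return ("Answer from the context only, citing sources as [n].\n\n"
            f"Context:\n{ctx}\n\nQuestion: {question}")
```

The numbered context blocks are what lets the model's `[n]` citations be mapped back to exact sections and pages of the source documents.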
Summarization & Analysis
Generate single-document summaries, cross-paper comparisons, or novelty analysis to identify new contributions relative to the existing corpus.
Design Decisions
ChromaDB over FAISS
Built-in persistence, metadata filtering, and a simpler API for production use.
Hierarchical Chunking
Preserves document structure for better context and more accurate retrieval.
Celery for Background Jobs
Production-ready task queue with retries, progress tracking, and scalability.
Configurable Embeddings
Switch between free local models and higher-quality OpenAI embeddings per use case.