Research Document Processing
An enterprise-grade research document processing platform with semantic search, RAG-based Q&A, and hierarchical chunking for 1000+ page documents.
Key Features
Process large PDFs (1000+ pages) with batched extraction using PyMuPDF
Hierarchical chunking that preserves document structure (sections, subsections, paragraphs)
Semantic search using natural language queries with metadata filtering
RAG-based chat — ask questions and get answers with citations to source documents
Single-document and cross-paper summaries powered by GPT-4
Novelty analysis to identify what's new compared to the existing corpus
Real-time ingestion progress tracking with Celery background workers
Configurable embeddings — switch between local SentenceTransformers and OpenAI
ChromaDB vector store with persistence and metadata filtering
Full-stack with Next.js frontend and FastAPI backend
Docker Compose for one-command deployment of all services
Resumable processing — failed jobs can be retried without re-processing already-completed work
Architecture
┌─────────────────────────────────────────────────────────────────┐
│                    Frontend (Next.js + React)                   │
│           Upload | Library | Search | Chat | Summaries          │
└─────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│                        Backend (FastAPI)                        │
│           Routes → Services → Repositories → Database           │
└─────────────────────────────────────────────────────────────────┘
      │                        │                        │
      ▼                        ▼                        ▼
┌─────────┐          ┌─────────────────┐         ┌──────────────┐
│  Redis  │ ──────── │  Celery Workers │         │  PostgreSQL  │
└─────────┘          └─────────────────┘         └──────────────┘
                              │
                              ▼
                     ┌─────────────────┐
                     │    ChromaDB     │
                     │  (Vector Store) │
                     └─────────────────┘

API Endpoints
| Method | Endpoint | Description |
|---|---|---|
| POST | /api/v1/documents/upload | Upload PDF document |
| GET | /api/v1/documents | List all documents |
| GET | /api/v1/documents/:id | Get document details |
| POST | /api/v1/search | Semantic search across documents |
| POST | /api/v1/chat | RAG chat with citations |
| POST | /api/v1/summarize/document | Single document summary |
| POST | /api/v1/summarize/cross | Cross-paper summary |
| POST | /api/v1/summarize/analyze/last-document | Novelty analysis |
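As a quick sketch of calling these endpoints from Python — note that the base URL and the request-body field names are assumptions for illustration, not taken from the actual route models:

```python
BASE = "http://localhost:8000/api/v1"  # assumed host/port

def search_request(query, filters=None, top_k=5):
    """Build the URL and JSON body for POST /api/v1/search (field names assumed)."""
    return f"{BASE}/search", {"query": query, "top_k": top_k, "filters": filters or {}}

def chat_request(question):
    """Build the URL and JSON body for POST /api/v1/chat (field names assumed)."""
    return f"{BASE}/chat", {"question": question}
```

Each pair can then be sent with `requests.post(url, json=body)` and the JSON response inspected.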
Processing Pipeline
PDF Upload & Extraction
Documents are uploaded via the Next.js frontend. PyMuPDF extracts text while preserving page boundaries and structural elements.
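The batched extraction loop can be sketched as follows — a minimal sketch assuming PyMuPDF (`fitz`) is installed; the per-batch checkpoint comment marks where a resumable worker would persist progress:

```python
def iter_page_batches(num_pages, batch_size=50):
    """Yield (start, end) page ranges so 1000+ page PDFs are processed in batches."""
    for start in range(0, num_pages, batch_size):
        yield start, min(start + batch_size, num_pages)

def extract_batched(pdf_path, batch_size=50):
    """Extract text page by page, keeping page boundaries as metadata."""
    import fitz  # PyMuPDF

    doc = fitz.open(pdf_path)
    pages = []
    for start, end in iter_page_batches(doc.page_count, batch_size):
        for i in range(start, end):
            pages.append({"page": i + 1, "text": doc.load_page(i).get_text("text")})
        # a real Celery worker would checkpoint `end` here so a retry can resume
    doc.close()
    return pages
```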
Hierarchical Chunking
Text is split into hierarchical chunks (sections → subsections → paragraphs) with configurable overlap, preserving document structure and context.
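A minimal sketch of the section → paragraph chunking with overlap; the numbered-heading regex is a naive stand-in for the structural cues a real pipeline would take from the extraction step:

```python
import re

def chunk_paragraphs(text, max_chars=1000, overlap=200):
    """Greedily pack paragraphs into chunks, carrying trailing overlap forward."""
    paras = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, buf = [], ""
    for p in paras:
        if buf and len(buf) + len(p) + 2 > max_chars:
            chunks.append(buf)
            buf = buf[-overlap:]  # trailing context bridges chunk boundaries
        buf = f"{buf}\n\n{p}" if buf else p
    if buf:
        chunks.append(buf)
    return chunks

def chunk_sections(text):
    """Split on numbered headings, then paragraph-chunk each section with its heading."""
    parts = re.split(r"(?m)^(?=\d+(?:\.\d+)*\s+\S)", text)
    out = []
    for part in parts:
        if not part.strip():
            continue
        heading = part.splitlines()[0].strip()
        for c in chunk_paragraphs(part):
            out.append({"section": heading, "text": c})
    return out
```

Keeping the section heading on every chunk is what later enables metadata-filtered retrieval.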
Embedding Generation
Each chunk is embedded using SentenceTransformers (local) or OpenAI embeddings and stored in ChromaDB with rich metadata for filtered retrieval.
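A sketch of the configurable embedding step — the model names, metadata field names, and persistence path are assumptions, not taken from the project's config:

```python
def chunk_metadata(doc_id, section, page):
    """Flat metadata stored alongside each chunk (field names are an assumed schema)."""
    return {"doc_id": doc_id, "section": section, "page": page}

def embed_and_store(chunks, use_openai=False, collection_name="papers"):
    """Embed chunk texts and add them to a persistent ChromaDB collection."""
    import chromadb

    texts = [c["text"] for c in chunks]
    if use_openai:
        from openai import OpenAI  # higher quality, paid
        resp = OpenAI().embeddings.create(model="text-embedding-3-small", input=texts)
        vectors = [d.embedding for d in resp.data]
    else:
        from sentence_transformers import SentenceTransformer  # free, local
        model = SentenceTransformer("all-MiniLM-L6-v2")
        vectors = model.encode(texts).tolist()

    store = chromadb.PersistentClient(path="./chroma")
    coll = store.get_or_create_collection(collection_name)
    coll.add(
        ids=[f"{c['meta']['doc_id']}-{i}" for i, c in enumerate(chunks)],
        documents=texts,
        metadatas=[c["meta"] for c in chunks],
        embeddings=vectors,
    )
```

Because ChromaDB stores the metadata next to each vector, a later query can filter with a `where` clause (e.g. restrict to one `doc_id`) before similarity ranking.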
Semantic Search & RAG
Natural language queries retrieve the most relevant chunks via vector similarity. GPT-4 synthesizes answers with precise citations back to source documents.
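The retrieval and citation mechanics can be sketched in plain Python (in production the similarity ranking is done inside ChromaDB; the prompt wording here is illustrative, not the project's actual prompt):

```python
import math

def top_k(query_vec, chunk_vecs, k=5):
    """Rank chunk indices by cosine similarity to the query embedding."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0

    ranked = sorted(range(len(chunk_vecs)),
                    key=lambda i: cos(query_vec, chunk_vecs[i]), reverse=True)
    return ranked[:k]

def build_prompt(question, chunks):
    """Number each retrieved chunk so the model can cite sources as [n]."""
    ctx = "\n\n".join(f"[{i + 1}] ({c['section']}, p.{c['page']}) {c['text']}"
                      for i, c in enumerate(chunks))
    return ("Answer from the context only, citing sources as [n].\n\n"
            f"Context:\n{ctx}\n\nQuestion: {question}")
```

The numbered context blocks are what lets the model's `[n]` citations be mapped back to exact sections and pages of the source documents.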
Summarization & Analysis
Generate single-document summaries, cross-paper comparisons, or novelty analysis to identify new contributions relative to the existing corpus.
Design Decisions
ChromaDB over FAISS
Built-in persistence, metadata filtering, and a simpler API for production use.
Hierarchical Chunking
Preserves document structure for better context and more accurate retrieval.
Celery for Background Jobs
Production-ready task queue with retries, progress tracking, and scalability.
Configurable Embeddings
Switch between free local models and higher-quality OpenAI embeddings per use case.