Research Document Processing

An enterprise-grade research document processing platform with semantic search, RAG-based Q&A, and hierarchical chunking for 1000+ page documents.

Python · FastAPI · Next.js · React · OpenAI GPT-4 · ChromaDB · Celery · Redis · PostgreSQL · SQLAlchemy · Tailwind CSS · Docker

Key Features

Process large PDFs (1000+ pages) with batched extraction using PyMuPDF

Hierarchical chunking that preserves document structure (sections, subsections, paragraphs)

Semantic search using natural language queries with metadata filtering

RAG-based chat — ask questions and get answers with citations to source documents

Single-document and cross-paper summaries powered by GPT-4

Novelty analysis to identify what's new compared to the existing corpus

Real-time ingestion progress tracking with Celery background workers

Configurable embeddings — switch between local SentenceTransformers and OpenAI

ChromaDB vector store with persistence and metadata filtering

Full-stack with Next.js frontend and FastAPI backend

Docker Compose for one-command deployment of all services

Resumable processing — failed jobs can be retried without re-processing already-completed work
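
For illustration, the one-command Docker Compose deployment might wire the services together along these lines (the service names, build paths, and images below are assumptions, not the project's actual file):

```yaml
services:
  api:
    build: ./backend
    depends_on: [postgres, redis, chroma]
  worker:
    build: ./backend
    command: celery -A app.worker worker --loglevel=info
    depends_on: [redis, postgres, chroma]
  frontend:
    build: ./frontend
  postgres:
    image: postgres:16
  redis:
    image: redis:7
  chroma:
    image: chromadb/chroma
```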

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                     Frontend (Next.js + React)                  │
│   Upload | Library | Search | Chat | Summaries                  │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│                      Backend (FastAPI)                          │
│   Routes → Services → Repositories → Database                   │
└─────────────────────────────────────────────────────────────────┘
         │                    │                     │
         ▼                    ▼                     ▼
    ┌─────────┐      ┌─────────────────┐    ┌──────────────┐
    │  Redis  │ ──── │  Celery Workers │    │  PostgreSQL  │
    └─────────┘      └─────────────────┘    └──────────────┘
                              │
                              ▼
                      ┌──────────────────┐
                      │    ChromaDB      │
                      │  (Vector Store)  │
                      └──────────────────┘

API Endpoints

Method  Endpoint                                   Description
------  -----------------------------------------  --------------------------------
POST    /api/v1/documents/upload                   Upload PDF document
GET     /api/v1/documents                          List all documents
GET     /api/v1/documents/:id                      Get document details
POST    /api/v1/search                             Semantic search across documents
POST    /api/v1/chat                               RAG chat with citations
POST    /api/v1/summarize/document                 Single document summary
POST    /api/v1/summarize/cross                    Cross-paper summary
POST    /api/v1/summarize/analyze/last-document    Novelty analysis
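
As a hypothetical example, a request to POST /api/v1/search might carry a body like the following (the field names are assumptions, not the actual schema):

```json
{
  "query": "how does the paper handle class imbalance?",
  "top_k": 5,
  "filters": { "document_id": "doc_42", "section": "Methods" }
}
```

The response would then carry the matching chunks with similarity scores and document metadata for display or citation.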

Processing Pipeline

1. PDF Upload & Extraction

Documents are uploaded via the Next.js frontend. PyMuPDF extracts text while preserving page boundaries and structural elements.
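
A batched extraction loop in this spirit might look like the sketch below; the function names and the 50-page batch size are illustrative, and PyMuPDF is imported lazily so the module loads without it.

```python
def batch_ranges(total_pages, batch_size=50):
    """Yield (start, end) page ranges covering the whole document."""
    for start in range(0, total_pages, batch_size):
        yield start, min(start + batch_size, total_pages)

def extract_batches(pdf_path, batch_size=50):
    """Extract text batch by batch so 1000+ page PDFs are never held in memory at once."""
    import fitz  # PyMuPDF; lazy import keeps the sketch importable without it
    doc = fitz.open(pdf_path)
    try:
        for start, end in batch_ranges(doc.page_count, batch_size):
            # Each record keeps its page number so boundaries survive chunking.
            yield [
                {"page": i + 1, "text": doc.load_page(i).get_text("text")}
                for i in range(start, end)
            ]
    finally:
        doc.close()
```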

2. Hierarchical Chunking

Text is split into hierarchical chunks (sections → subsections → paragraphs) with configurable overlap, preserving document structure and context.
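
A minimal sketch of hierarchy-aware chunking follows; the section/subsection dict shape and the character-based splitting are assumptions, not the project's actual code.

```python
def split_with_overlap(text, max_chars=800, overlap=150):
    """Greedy character-window split with configurable overlap."""
    chunks, start = [], 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap  # step back so adjacent chunks share context
    return chunks

def hierarchical_chunks(sections, max_chars=800, overlap=150):
    """Flatten a section tree into chunks that remember their place in the hierarchy."""
    for sec in sections:
        for sub in sec.get("subsections") or [sec]:
            for i, piece in enumerate(split_with_overlap(sub["text"], max_chars, overlap)):
                yield {
                    "text": piece,
                    "section": sec["title"],
                    "subsection": sub["title"] if sub is not sec else None,
                    "index": i,
                }
```

Carrying the section path on every chunk is what later lets retrieval filter by, say, only "Methods" sections.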

3. Embedding Generation

Each chunk is embedded using SentenceTransformers (local) or OpenAI embeddings and stored in ChromaDB with rich metadata for filtered retrieval.
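
An ingestion step in this spirit could be sketched as below; the collection name, record shape, and helper names are assumptions, and ChromaDB is imported lazily.

```python
def to_records(doc_id, chunks):
    """Turn chunks into the parallel id/text/metadata lists a vector store expects."""
    ids = [f"{doc_id}:{c['section']}:{c['index']}" for c in chunks]
    texts = [c["text"] for c in chunks]
    metas = [{"doc_id": doc_id, "section": c["section"], "index": c["index"]} for c in chunks]
    return ids, texts, metas

def ingest(doc_id, chunks, persist_dir="./chroma"):
    """Embed chunks and upsert them into a persistent ChromaDB collection."""
    import chromadb  # lazy import keeps the sketch importable without it
    client = chromadb.PersistentClient(path=persist_dir)
    collection = client.get_or_create_collection("papers")
    ids, texts, metas = to_records(doc_id, chunks)
    # Chroma embeds with its default embedding function unless one is configured;
    # a SentenceTransformers or OpenAI embedding_function can be swapped in.
    collection.upsert(ids=ids, documents=texts, metadatas=metas)
```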

4. Semantic Search & RAG

Natural language queries retrieve the most relevant chunks via vector similarity. GPT-4 synthesizes answers with precise citations back to source documents.
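
One way to get citations back to sources is to number the retrieved chunks in the prompt, as in this sketch (the prompt wording and hit shape are assumptions; the retriever and LLM are injected as callables):

```python
def build_rag_prompt(question, hits):
    """Number retrieved chunks so the model can cite them as [1], [2], ..."""
    context = "\n\n".join(
        f"[{i + 1}] ({h['doc_id']}, p.{h['page']}) {h['text']}" for i, h in enumerate(hits)
    )
    return (
        "Answer using only the numbered excerpts below and cite them as [n].\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )

def rag_answer(question, retrieve, llm):
    """retrieve(question) -> ranked hits; llm(prompt) -> answer text (e.g. a GPT-4 call)."""
    hits = retrieve(question)
    return llm(build_rag_prompt(question, hits)), hits
```

Returning the hits alongside the answer lets the frontend resolve each [n] marker back to its source document and page.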

5. Summarization & Analysis

Generate single-document summaries, cross-paper comparisons, or novelty analysis to identify new contributions relative to the existing corpus.
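
A cross-paper summary is commonly done map-reduce style: summarize each paper, then synthesize a comparison. A sketch under that assumption (the prompts and input shape are illustrative, and the LLM is an injected callable):

```python
def cross_paper_summary(papers, llm):
    """Map-reduce sketch: papers is {title: full_text}; llm maps a prompt to a completion."""
    # Map: one summary per paper.
    per_paper = {
        title: llm(f"Summarize the key contributions of this paper:\n\n{text}")
        for title, text in papers.items()
    }
    # Reduce: synthesize a comparison across the per-paper summaries.
    combined = "\n\n".join(f"## {title}\n{summary}" for title, summary in per_paper.items())
    return llm(
        "Compare the papers summarized below: shared themes, differences, and what "
        f"each contributes that the others do not.\n\n{combined}"
    )
```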

Design Decisions

ChromaDB over FAISS

Built-in persistence, metadata filtering, and a simpler API for production use.

Hierarchical Chunking

Preserves document structure for better context and more accurate retrieval.

Celery for Background Jobs

Production-ready task queue with retries, progress tracking, and scalability.

Configurable Embeddings

Switch between free local models and higher-quality OpenAI embeddings per use case.
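
A provider switch in this spirit could be sketched as follows; the provider names, model ids, and factory shape are assumptions, with both backends imported lazily so neither is required at load time.

```python
def get_embedder(provider="local"):
    """Return a callable texts -> list of vectors for the configured provider."""
    if provider == "local":
        def embed(texts):
            from sentence_transformers import SentenceTransformer  # lazy import
            model = SentenceTransformer("all-MiniLM-L6-v2")
            return model.encode(texts).tolist()
    elif provider == "openai":
        def embed(texts):
            from openai import OpenAI  # lazy import
            client = OpenAI()
            resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
            return [d.embedding for d in resp.data]
    else:
        raise ValueError(f"unknown embedding provider: {provider}")
    return embed
```

Because both branches share the same texts-in, vectors-out signature, the rest of the pipeline never needs to know which provider is active.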