First Ingestion & Semantic Search
This guide walks through your first end-to-end ingestion and semantic search workflow using the Hedera Guardian AI Toolkit.
At this stage, your infrastructure services (Qdrant and the MCP Server) should already be running.
You will complete the end-to-end workflow in five steps.
Step 1 — Add Your Documents
Place your methodology files into:
data/input/documents/
Supported formats include PDF.
You may add:
Full methodologies (100+ pages)
Each file will be processed independently and indexed into the vector database.
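As a concrete example, you can stage documents from a shell. The PDF filename below is illustrative, not a required name:

```shell
# Create the input folder if it does not exist yet (run from the repo root)
mkdir -p data/input/documents

# Copy your methodology PDFs in; the filename here is only an example
# cp ~/Downloads/VM0042-methodology.pdf data/input/documents/

# Confirm the files are in place
ls -l data/input/documents
```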
Step 2 — Run Document Ingestion
From the root of the repository:
Standard profile:
docker compose run --rm document-ingestion-worker
GPU profile:
docker compose -f docker-compose.yml -f docker-compose.gpu.yml run --rm document-ingestion-worker
Low-memory profile:
docker compose -f docker-compose.yml -f docker-compose.low-memory.yml run --rm document-ingestion-worker
Step 3 — Verify Ingestion
Once ingestion completes:
A Qdrant collection named methodology_documents will exist
The collection will contain one point per chunk
You can verify via:
http://localhost:6333/dashboard
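If you prefer the command line to the dashboard, a small script can query Qdrant's REST API for the collection's point count. This sketch assumes the default local port from the guide and returns None when Qdrant is unreachable:

```python
import json
import urllib.request

def get_point_count(url, timeout=5):
    """Return the collection's points_count, or None if Qdrant is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            info = json.load(resp)
        return info["result"]["points_count"]
    except OSError:
        return None

count = get_point_count("http://localhost:6333/collections/methodology_documents")
print("points stored:", count)
```

A non-None count confirms that ingestion wrote records into the collection.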
Each stored record contains the chunk's vector embedding, its metadata, and any extracted LaTeX formulas (when applicable).
Step 4 — Connect Your MCP Client
Your MCP Server should already be running at:
http://localhost:9000
To verify:
npx @modelcontextprotocol/inspector --server-url http://localhost:9000/mcp
You should see the server's available tools, including the semantic search tool.
Using an MCP-compatible AI client (e.g., Claude Desktop), ask a grounded question such as:
What are the applicability conditions defined in VM42?
What quantification approaches are defined in this methodology?
What data parameters must be monitored during the crediting period?
The AI will:
Call the semantic search tool
Filter by metadata if necessary
Generate a response grounded in document content
Responses may:
Extract structured data from tables
Suggest next steps (such as schema generation)
What Happens During Ingestion
When ingestion starts, the system:
Discovers documents in the input folder
Creates a collection in Qdrant (if it does not exist)
Processes each document in parallel
Splits documents into contextual chunks
Converts chunks into vector embeddings
Stores embeddings and metadata in Qdrant
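The steps above can be sketched in miniature. This is not the toolkit's actual code: the chunk size, overlap, and hash-based "embedding" below stand in for the real model and pipeline, and the point structure is only illustrative:

```python
import hashlib

def embed(text, dim=8):
    # Stand-in for a real embedding model: derive a fixed-size vector from a hash
    digest = hashlib.sha256(text.encode()).digest()
    return [b / 255 for b in digest[:dim]]

def chunk(text, size=500, overlap=50):
    # Split text into overlapping windows so context survives chunk boundaries
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def ingest(docs):
    # One stored point per chunk, carrying its vector and source metadata
    points = []
    for doc_id, text in docs.items():
        for idx, piece in enumerate(chunk(text)):
            points.append({
                "id": f"{doc_id}-{idx}",
                "vector": embed(piece),
                "payload": {"source": doc_id, "chunk_index": idx},
            })
    return points

points = ingest({"VM0042.pdf": "x" * 1200})
print(len(points))  # 3 chunks for a 1200-character document
```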
Advanced Processing Steps
The ingestion pipeline includes several advanced handling steps. Document content is:
Repaired if split across layout blocks
Processed using OCR when necessary
Table Normalization
Multi-page tables are detected
Split tables are merged into single logical units
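The merge step can be illustrated with a small sketch. The repeated-header heuristic here is an assumption about how split fragments are recognized, not the toolkit's exact logic:

```python
def merge_table_fragments(fragments):
    """Merge table fragments split across pages into one logical table.

    If a later fragment repeats the header row, the duplicate header is dropped
    and only its body rows are appended.
    """
    header, rows = fragments[0][0], list(fragments[0][1:])
    for frag in fragments[1:]:
        body = frag[1:] if frag and frag[0] == header else frag
        rows.extend(body)
    return [header] + rows

page1 = [["Parameter", "Unit"], ["Area", "ha"]]
page2 = [["Parameter", "Unit"], ["Biomass", "t"]]
merged = merge_table_fragments([page1, page2])
print(merged)
```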
Each stored chunk includes metadata flags (for example, whether it contains a table or a formula).
These metadata flags allow targeted semantic filtering during search.
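As an illustration, a metadata filter in Qdrant's query format might look like the following. The payload field name `contains_table` is an assumption for this sketch, not necessarily the toolkit's schema:

```python
import json

# Qdrant filters are built from "must" clauses with "key" / "match" conditions.
# The field name "contains_table" is illustrative.
table_filter = {
    "must": [
        {"key": "contains_table", "match": {"value": True}}
    ]
}

print(json.dumps(table_filter, indent=2))
```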
How Semantic Search Works
The toolkit uses hybrid retrieval:
Dense embeddings (semantic meaning)
Sparse retrieval (keyword matching)
Chunks are ranked using reciprocal rank fusion.
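Reciprocal rank fusion scores each chunk by summing 1/(k + rank) over its position in each ranking; k = 60 is a common default, and the chunk names below are illustrative:

```python
def rrf(rankings, k=60):
    """Fuse several ranked lists by summing reciprocal-rank scores."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["chunk_a", "chunk_b", "chunk_c"]   # semantic ranking
sparse = ["chunk_b", "chunk_c", "chunk_a"]  # keyword ranking
print(rrf([dense, sparse]))  # chunk_b wins: it ranks highly in both lists
```

Because a chunk must rank well in both lists to score highly, fusion favors results that match on meaning and keywords at once.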
The result: reduced hallucination risk.
The AI does not use internet knowledge.
It operates only on your ingested documents.
What Success Looks Like
You have successfully completed this stage when:
Qdrant contains chunk records
The MCP server exposes tools
Your AI client can answer methodology-specific questions
Responses reflect actual document content
At this point, you have transformed static PDFs into a searchable knowledge base.
What Comes Next
Now that you can search methodologies semantically, the next step is:
Generating Guardian-compatible schemas
Extracting formulas into structured definitions
Beginning structured digitization workflows
Proceed to:
Schema & Formula Generation