First Ingestion and Semantic Search

This guide walks through your first end-to-end ingestion and semantic search workflow using the Hedera Guardian AI Toolkit.


Overview

At this stage, your infrastructure services should already be running:

  • Qdrant (vector database)

  • MCP Server

You will complete the end-to-end workflow in five steps.


Step 1 — Add Your Documents

Place your methodology files into:

data/input/documents/

Supported formats:

  • PDF

  • DOCX

You may add:

  • Full methodologies (100+ pages)

  • Templates

  • Supporting documentation

Each file will be processed independently and indexed into the vector database.
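Before running ingestion, you can preview which files the worker will pick up. A minimal sketch, assuming the `data/input/documents/` path from above and an extension filter based on the supported formats (the helper name `discover_documents` is illustrative, not the toolkit's API):

```python
from pathlib import Path

# Extensions matching the supported input formats (PDF, DOCX)
SUPPORTED = {".pdf", ".docx"}

def discover_documents(input_dir: str) -> list[Path]:
    """Return every supported document under the input folder, sorted by name."""
    root = Path(input_dir)
    if not root.is_dir():
        return []
    return sorted(p for p in root.rglob("*") if p.suffix.lower() in SUPPORTED)

# Example: list what would be ingested
for doc in discover_documents("data/input/documents"):
    print(doc.name)
```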


Step 2 — Run Document Ingestion

From the root of the repository:

Standard profile:

docker compose run --rm document-ingestion-worker

GPU profile (if configured):

docker compose -f docker-compose.yml -f docker-compose.gpu.yml run --rm document-ingestion-worker

Low-memory profile:

docker compose -f docker-compose.yml -f docker-compose.low-memory.yml run --rm document-ingestion-worker

Step 3 — Verify Ingestion

Once ingestion completes:

  • A Qdrant collection named methodology_documents will exist

  • The collection will contain one point per chunk

You can verify via:

http://localhost:6333/dashboard

Each stored record contains:

  • The chunk text

  • Embedding vectors

  • Metadata

  • LaTeX formulas (when applicable)
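Besides the dashboard, Qdrant exposes collection info over its REST API at `/collections/methodology_documents`. A sketch of reading the point count from such a response — the payload below is an abbreviated, illustrative example, not output captured from a live instance:

```python
import json

# Abbreviated, illustrative shape of GET /collections/methodology_documents
raw = """
{
  "result": {
    "status": "green",
    "points_count": 1342,
    "config": {"params": {"vectors": {"size": 768, "distance": "Cosine"}}}
  },
  "status": "ok"
}
"""

def summarize_collection(payload: str) -> str:
    """Extract the health status and point count from a collection-info response."""
    info = json.loads(payload)["result"]
    return f"status={info['status']} points={info['points_count']}"

print(summarize_collection(raw))  # → status=green points=1342
```

One point per chunk means the `points_count` should roughly track the total number of chunks across your ingested documents.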


Step 4 — Connect Your MCP Client

Your MCP Server should already be running at:

http://localhost:9000

To verify:

npx @modelcontextprotocol/inspector --server-url http://localhost:9000/mcp

You should see available tools including:

  • Semantic search tools

  • Schema builder tools

Step 5 — Ask a Grounded Question

Using an MCP-compatible AI client (e.g., Claude Desktop), ask a grounded question such as:

  • What are the applicability conditions defined in VM42?

  • What quantification approaches are defined in this methodology?

  • What data parameters must be monitored during the crediting period?

The AI will:

  1. Call the semantic search tool

  2. Retrieve relevant chunks

  3. Filter by metadata if necessary

  4. Generate a response grounded in document content

Responses may:

  • Cite sections

  • Extract structured data from tables

  • Present formulas

  • Suggest next steps (such as schema generation)
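Conceptually, the retrieve-then-answer loop looks like the sketch below. The function names, result fields, and hardcoded search results are all illustrative stand-ins, not the toolkit's actual tool interface:

```python
def semantic_search(query: str, top_k: int = 3) -> list[dict]:
    # Stand-in for the MCP semantic search tool; returns fabricated example chunks.
    return [
        {"text": "Applicability: the project must ...", "source": "VM42.pdf", "page": 12},
        {"text": "Table 3 lists monitored parameters ...", "source": "VM42.pdf", "page": 47},
    ][:top_k]

def build_grounded_prompt(question: str) -> str:
    """Assemble retrieved chunks into a context block the AI must stay within."""
    chunks = semantic_search(question)
    context = "\n\n".join(f"[{c['source']} p.{c['page']}] {c['text']}" for c in chunks)
    return f"Answer using only the excerpts below.\n\n{context}\n\nQuestion: {question}"

print(build_grounded_prompt("What are the applicability conditions defined in VM42?"))
```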


What Happens During Ingestion

When ingestion starts, the system:

  1. Discovers documents in the input folder

  2. Creates a collection in Qdrant (if it does not exist)

  3. Processes each document in parallel

  4. Extracts:

    • Text

    • Section structure

    • Tables

    • Formulas

  5. Performs post-processing

  6. Splits documents into contextual chunks

  7. Converts chunks into vector embeddings

  8. Stores embeddings and metadata in Qdrant
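Steps 6–8 can be sketched as an overlapping splitter followed by a build-points loop. The chunk size, overlap, and metadata shape here are illustrative, not the toolkit's actual settings:

```python
def split_into_chunks(text: str, size: int = 400, overlap: int = 50) -> list[str]:
    """Split text into overlapping windows so context survives chunk boundaries."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def ingest(documents: dict[str, str]) -> list[dict]:
    """Chunk each document and attach the metadata stored alongside each vector."""
    points = []
    for filename, text in documents.items():
        for idx, chunk in enumerate(split_into_chunks(text)):
            points.append({
                "chunk_id": f"{filename}:{idx}",
                "source": filename,
                "text": chunk,
                # vector = embed(chunk)  # embedding-model call would go here
            })
    return points
```

The overlap is what makes the chunks "contextual": a sentence cut at one chunk's end reappears at the start of the next.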


Advanced Processing Steps

The ingestion pipeline includes:

Formula Recognition

  • Formulas are detected

  • Converted to LaTeX

  • Repaired if split across layout blocks

  • Processed using OCR when necessary

Table Normalization

  • Multi-page tables are detected

  • Split tables are merged into single logical units
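The merge step can be sketched as concatenating row lists while dropping the header when it repeats at the top of a continuation fragment. This is a simplified heuristic; the pipeline's actual detection logic is not specified here:

```python
def merge_table_fragments(fragments: list[list[list[str]]]) -> list[list[str]]:
    """Merge table fragments split across pages into one logical table.

    Each fragment is a list of rows; a fragment that starts by repeating the
    first fragment's header row has that duplicate header dropped.
    """
    if not fragments:
        return []
    merged = list(fragments[0])
    header = fragments[0][0]
    for frag in fragments[1:]:
        rows = frag[1:] if frag and frag[0] == header else frag
        merged.extend(rows)
    return merged

page1 = [["Parameter", "Unit"], ["CO2 emissions", "t"]]
page2 = [["Parameter", "Unit"], ["Project area", "ha"]]
print(merge_table_fragments([page1, page2]))
```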

Metadata Enrichment

Each stored chunk includes metadata such as:

  • Chunk ID

  • Heading

  • Document structure path

  • Page number

  • Source filename

  • has_formula flag

  • has_table flag

These metadata fields enable targeted filtering during semantic search.
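For example, a question about quantification formulas can be restricted to chunks where has_formula is set. A sketch over the metadata shape listed above (the records themselves are illustrative):

```python
def filter_chunks(chunks: list[dict], **conditions) -> list[dict]:
    """Keep only chunks whose metadata matches every given condition."""
    return [c for c in chunks if all(c.get(k) == v for k, v in conditions.items())]

chunks = [
    {"chunk_id": "c1", "heading": "Applicability", "has_formula": False, "has_table": False},
    {"chunk_id": "c2", "heading": "Quantification", "has_formula": True, "has_table": False},
    {"chunk_id": "c3", "heading": "Monitoring", "has_formula": False, "has_table": True},
]

print([c["chunk_id"] for c in filter_chunks(chunks, has_formula=True)])  # → ['c2']
```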


How Semantic Search Works

The toolkit uses hybrid retrieval:

  • Dense embeddings (semantic meaning)

  • Sparse retrieval (keyword matching)

Chunks are ranked using reciprocal rank fusion.
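Reciprocal rank fusion scores each chunk as the sum of 1/(k + rank) over the ranked lists it appears in, so a chunk ranked highly by either retriever rises to the top. A minimal sketch (k = 60 is the commonly used constant; the toolkit's exact parameters are not specified here):

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked result lists into one combined ranking."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["c7", "c2", "c9"]   # semantic (embedding) ranking
sparse = ["c2", "c5", "c7"]  # keyword (sparse) ranking
print(reciprocal_rank_fusion([dense, sparse]))  # → ['c2', 'c7', 'c5', 'c9']
```

Note that c2 wins despite topping neither list: appearing near the top of both outweighs a single first place.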

The result:

  • High recall

  • High precision

  • Context-aware retrieval

  • Reduced hallucination risk

The AI does not use internet knowledge. It operates only on your ingested documents.


What Success Looks Like

You have successfully completed this stage when:

  • Documents are indexed

  • Qdrant contains chunk records

  • The MCP server exposes tools

  • Your AI client can answer methodology-specific questions

  • Responses reflect actual document content

At this point, you have transformed static PDFs into a searchable knowledge base.


What Comes Next

Now that you can search methodologies semantically, the next step is:

  • Generating Guardian-compatible schemas

  • Extracting formulas into structured definitions

  • Beginning structured digitization workflows

Proceed to:

Schema & Formula Generation
