# First Ingestion and Semantic Search

This guide walks through your first end-to-end ingestion and semantic search workflow using the Hedera Guardian AI Toolkit.

***

### Overview

At this stage, your infrastructure services should already be running:

* Qdrant (vector database)
* MCP Server

You will complete the end-to-end workflow in five steps.

{% stepper %}
{% step %}

### Step 1 — Add Your Documents

Place your methodology files into:

```
data/input/documents/
```

Supported formats:

* PDF
* DOCX

You may add:

* Full methodologies (100+ pages)
* Templates
* Supporting documentation

Each file will be processed independently and indexed into the vector database.
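
Before running ingestion, it can help to confirm that the folder contains only supported files. The following is a minimal sketch, not part of the toolkit: `split_by_support` is a hypothetical helper, and it assumes that only files placed directly in the folder matter (whether the worker also scans subfolders is not covered here).

```python
from pathlib import Path

# Extensions taken from the "Supported formats" list above.
SUPPORTED_EXTENSIONS = {".pdf", ".docx"}


def split_by_support(input_dir: Path) -> tuple[list[Path], list[Path]]:
    """Return (supported, unsupported) files found directly inside input_dir."""
    supported: list[Path] = []
    unsupported: list[Path] = []
    for path in sorted(input_dir.iterdir()):
        if not path.is_file():
            continue  # subfolders are ignored in this sketch
        if path.suffix.lower() in SUPPORTED_EXTENSIONS:
            supported.append(path)
        else:
            unsupported.append(path)
    return supported, unsupported
```

Calling `split_by_support(Path("data/input/documents"))` from the repository root returns the files that are ready for ingestion and the ones that are neither PDF nor DOCX.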
{% endstep %}

{% step %}

### Step 2 — Run Document Ingestion

Run one of the following commands from the root of the repository, depending on the profile you use.

#### Standard profile:

```
docker compose run --rm document-ingestion-worker
```

#### GPU profile (if configured):

```
docker compose -f docker-compose.yml -f docker-compose.gpu.yml run --rm document-ingestion-worker
```

#### Low-memory profile:

```
docker compose -f docker-compose.yml -f docker-compose.low-memory.yml run --rm document-ingestion-worker
```
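
If you start ingestion from a script, the three commands above can be built from one helper. This is a sketch under stated assumptions: the profile keys (`standard`, `gpu`, `low-memory`) and both function names are hypothetical labels chosen for illustration; the compose file names and service name are the ones shown above.

```python
import subprocess

# Override files named in the commands above; "standard" uses no override file.
PROFILE_OVERRIDES = {
    "standard": None,
    "gpu": "docker-compose.gpu.yml",
    "low-memory": "docker-compose.low-memory.yml",
}


def ingestion_command(profile: str = "standard") -> list[str]:
    """Build the docker compose argument list for the chosen profile."""
    if profile not in PROFILE_OVERRIDES:
        raise ValueError(f"unknown profile: {profile!r}")
    command = ["docker", "compose"]
    override = PROFILE_OVERRIDES[profile]
    if override is not None:
        command += ["-f", "docker-compose.yml", "-f", override]
    command += ["run", "--rm", "document-ingestion-worker"]
    return command


def run_ingestion(profile: str = "standard") -> int:
    """Run ingestion from the repository root and return the exit code."""
    return subprocess.run(ingestion_command(profile)).returncode
```

A non-zero return value from `run_ingestion` indicates that the worker container exited with an error.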

{% endstep %}

{% step %}

### Step 3 — Verify Ingestion

Once ingestion completes:

* A Qdrant collection named `methodology_documents` will exist
* The collection will contain one point per chunk

You can verify this in the Qdrant dashboard:

```
http://localhost:6333/dashboard
```
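
The same check can be scripted against Qdrant's REST API, which returns collection details at `/collections/<name>`. The sketch below assumes the default host and port shown above and no API key; the helper names are illustrative only.

```python
import json
from urllib.request import urlopen

QDRANT_URL = "http://localhost:6333"    # host and port from the dashboard URL above
COLLECTION = "methodology_documents"    # collection name used in this guide


def collection_url(base_url: str = QDRANT_URL, name: str = COLLECTION) -> str:
    """URL of Qdrant's collection-info REST endpoint."""
    return f"{base_url.rstrip('/')}/collections/{name}"


def points_count(info: dict) -> int:
    """Read the point count from a collection-info response body."""
    return int(info["result"].get("points_count") or 0)


def fetch_points_count(timeout: float = 5.0) -> int:
    """Query the running Qdrant instance.

    Raises an error if Qdrant is unreachable or the collection
    does not exist (HTTP 404).
    """
    with urlopen(collection_url(), timeout=timeout) as response:
        return points_count(json.load(response))
```

A count greater than zero from `fetch_points_count()` suggests that chunks were stored; Qdrant may report this count as an approximation.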

Each stored record contains:

* The chunk text
* Embedding vectors
* Metadata
* LaTeX formulas (when applicable)
{% endstep %}

{% step %}

### Step 4 — Connect Your MCP Client

Your MCP Server should already be running at:

```
http://localhost:9000
```
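
A quick way to confirm that something is listening at that address is a plain TCP connection attempt. This sketch only checks reachability; it does not speak the MCP protocol, and `is_listening` is a hypothetical helper name.

```python
import socket
from urllib.parse import urlparse


def is_listening(url: str, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the URL's host and port succeeds."""
    parsed = urlparse(url)
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    try:
        with socket.create_connection((parsed.hostname, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

`is_listening("http://localhost:9000")` returning `False` usually means the MCP Server container is not running or the port is mapped differently.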

To verify, open the MCP Inspector against the server:

```
npx @modelcontextprotocol/inspector --server-url http://localhost:9000/mcp
```

You should see available tools including:

* Semantic search tools
* Schema builder tools
{% endstep %}

{% step %}

### Step 5 — Perform Your First Semantic Search

Using an MCP-compatible AI client (e.g., Claude Desktop), ask a grounded question such as:

* What are the applicability conditions defined in VM42?
* What quantification approaches are defined in this methodology?
* What data parameters must be monitored during the crediting period?

The AI will:

1. Call the semantic search tool
2. Retrieve relevant chunks
3. Filter by metadata if necessary
4. Generate a response grounded in document content

Responses may:

* Cite sections
* Extract structured data from tables
* Present formulas
* Suggest next steps (such as schema generation)
{% endstep %}
{% endstepper %}

***

## What Happens During Ingestion

When ingestion starts, the system:

1. Discovers documents in the input folder
2. Creates a collection in Qdrant (if it does not exist)
3. Processes each document in parallel
4. Extracts:
   * Text
   * Section structure
   * Tables
   * Formulas
5. Performs post-processing
6. Splits documents into contextual chunks
7. Converts chunks into vector embeddings
8. Stores embeddings and metadata in Qdrant
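
To illustrate step 6, here is a minimal sketch of contextual chunking: each chunk keeps the heading path of the section it came from, so a retrieved chunk can be traced back to its place in the document. The word-count splitting rule, the `Chunk` type, and the ` > ` path separator are assumptions made for illustration; the toolkit's actual chunking strategy may differ.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    heading_path: str  # e.g. "5 Baseline > 5.1 Scope" (hypothetical format)
    text: str


def chunk_section(heading_path: list[str], body: str, max_words: int = 200) -> list[Chunk]:
    """Split one section into word-bounded chunks that keep their heading path."""
    words = body.split()
    path = " > ".join(heading_path)
    return [
        Chunk(heading_path=path, text=" ".join(words[start:start + max_words]))
        for start in range(0, len(words), max_words)
    ]
```

Each resulting chunk would then be embedded and stored together with its heading path, as described in steps 7 and 8.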

***

### Advanced Processing Steps

The ingestion pipeline includes:

#### Formula Recognition

* Formulas are detected
* Converted to LaTeX
* Repaired if split across layout blocks
* Processed using OCR when necessary

#### Table Normalization

* Multi-page tables are detected
* Split tables are merged into single logical units
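
One simple way to picture table merging: fragments on consecutive pages that repeat the same header row are treated as one logical table. The heuristic, the `TableFragment` type, and the function name below are hypothetical; the pipeline's real detection logic is not documented on this page.

```python
from dataclasses import dataclass, field


@dataclass
class TableFragment:
    page: int
    header: tuple[str, ...]
    rows: list[tuple[str, ...]] = field(default_factory=list)


def merge_split_tables(fragments: list[TableFragment]) -> list[TableFragment]:
    """Merge fragments that share a header and sit on consecutive pages."""
    merged: list[TableFragment] = []
    last_page = None  # page of the most recently seen fragment
    for frag in fragments:
        previous = merged[-1] if merged else None
        continues_previous = (
            previous is not None
            and frag.header == previous.header
            and frag.page == last_page + 1
        )
        if continues_previous:
            previous.rows.extend(frag.rows)
        else:
            # Copy the rows so the input fragments are never mutated.
            merged.append(TableFragment(frag.page, frag.header, list(frag.rows)))
        last_page = frag.page
    return merged
```

The merged table keeps the page number of its first fragment, which matches the idea of a single logical unit anchored where the table starts.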

#### Metadata Enrichment

Each stored chunk includes metadata such as:

* Chunk ID
* Heading
* Document structure path
* Page number
* Source filename
* `has_formula` flag
* `has_table` flag

The `has_formula` and `has_table` flags let a search be restricted to chunks that contain formulas or tables.
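
As an illustration, a filter on these flags can be expressed in Qdrant's filter syntax (`must` conditions with `key` and `match`). The sketch assumes the flags are stored as top-level boolean payload fields named `has_formula` and `has_table`; `flag_filter` is a hypothetical helper, and whether the MCP search tool accepts a filter in this exact shape is not confirmed here.

```python
from typing import Optional


def flag_filter(has_formula: Optional[bool] = None, has_table: Optional[bool] = None) -> dict:
    """Build a Qdrant-style filter; None means 'do not filter on this flag'."""
    conditions = []
    if has_formula is not None:
        conditions.append({"key": "has_formula", "match": {"value": has_formula}})
    if has_table is not None:
        conditions.append({"key": "has_table", "match": {"value": has_table}})
    return {"must": conditions}
```

For example, `flag_filter(has_formula=True)` describes a search limited to chunks that carry at least one formula.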

***

## How Semantic Search Works

The toolkit uses hybrid retrieval:

* Dense embeddings (semantic meaning)
* Sparse retrieval (keyword matching)

Chunks are ranked using reciprocal rank fusion.
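
Reciprocal rank fusion scores each chunk by summing `1 / (k + rank)` over every ranked list in which it appears, so chunks ranked well by both retrievers rise to the top. The sketch below shows the general technique; the constant `k = 60` is a commonly used default and an assumption here, not a documented toolkit setting, and the chunk IDs are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists of chunk IDs into one scored ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest score first; ties are broken by chunk ID for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


dense_results = ["c3", "c1", "c7"]   # hypothetical dense (embedding) ranking
sparse_results = ["c1", "c9", "c3"]  # hypothetical sparse (keyword) ranking
fused = reciprocal_rank_fusion([dense_results, sparse_results])
```

Here `c1` and `c3` appear in both lists and therefore outrank `c9` and `c7`, which each appear in only one.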

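Combining the two aims for:

* High recall: dense embeddings match passages that paraphrase the question
* High precision: sparse retrieval matches exact terms such as parameter names and methodology IDs
* Context-aware retrieval
* Reduced hallucination risk, because answers are grounded in retrieved chunks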

The search tools do not query the internet.\
Retrieval runs only over your ingested documents.

***

## What Success Looks Like

You have successfully completed this stage when:

* Documents are indexed
* Qdrant contains chunk records
* The MCP server exposes tools
* Your AI client can answer methodology-specific questions
* Responses reflect actual document content

At this point, you have transformed static PDF and DOCX files into a searchable knowledge base.

***

## What Comes Next

Now that you can search methodologies semantically, the next steps are to:

* Generate Guardian-compatible schemas
* Extract formulas into structured definitions
* Begin structured digitization workflows

Proceed to:

**Schema & Formula Generation**

