MarChat Technical Documentation
Maritime shipboard AI assistant with RAG, multi-mode chat, and pre-arrival form automation
MarChat is a shipboard AI assistant that combines semantic search over maritime regulations with intelligent form filling for port documentation. It is designed to run locally, with no cloud dependency, making it suitable for use in environments with limited or no internet connectivity.
What is MarChat?
MarChat is a maritime AI assistant purpose-built for shipboard use. It provides Retrieval-Augmented Generation (RAG) over maritime regulations including SOLAS, MARPOL, STCW, and other IMO conventions. The system supports multi-mode chat with context-aware system prompts tailored to different operational scenarios, and includes a pre-arrival form auto-fill engine that populates port documentation from natural language context and vessel profile data.
All processing runs locally through Ollama, ensuring that sensitive vessel and operational data never leaves the shipboard network. The hybrid search pipeline combines semantic understanding with keyword matching, delivering precise results even for highly technical regulatory queries.
Hybrid RAG
- Semantic search via Qdrant vector database
- BM25 keyword matching for exact terms
- Reciprocal Rank Fusion (RRF) merging
- Cross-encoder reranking for precision
Multi-Mode Chat
- Regulatory mode: IMO convention expert
- Medical mode: IMGS maritime medical
- General mode: Operations assistant
- Forms mode: Document specialist
- Research mode: Academic paper assistant
Smart Chunking
- IMO convention hierarchy detection
- Convention > Annex > Chapter > Regulation > Paragraph
- Cascading metadata inheritance
- Precise citation generation
Pre-Arrival Forms
- Excel and DOCX template processing
- {{placeholder}} pattern detection
- LLM-assisted auto-fill from context
- Vessel profile pre-population
Multi-Collection
- Separate knowledge bases per domain
- Maritime, medical, legal, research, technical
- Schema-driven metadata validation
- Independent search configurations
Local LLM
- Ollama integration (gemma3:12b)
- mxbai-embed-large embeddings (1024-dim)
- No cloud dependency or API keys required
- Data stays on the local machine
Architecture
Component Overview
MarChat follows a standard client-server architecture with a Next.js frontend communicating with a FastAPI backend. The backend orchestrates all AI operations including embedding generation, vector search, BM25 retrieval, cross-encoder reranking, and LLM inference. Persistent storage is split between Qdrant (vector embeddings and metadata) and SQLite (relational data such as documents, collections, queries, and form templates).
| Component | Technology | Port | Role |
|---|---|---|---|
| Frontend | Next.js 14.2 | 3005 | User interface, chat, document management, form filling |
| Backend | FastAPI (Python 3.12) | 8005 | API server, RAG pipeline, LLM orchestration, form processing |
| Vector DB | Qdrant | 6333 | Vector storage, semantic search, metadata filtering |
| LLM Runtime | Ollama | 11434 | Local inference (gemma3:12b), embeddings (mxbai-embed-large) |
| Relational DB | SQLite | N/A | Documents, collections, queries, form templates, vessel profiles |
Data Flow: RAG Query
When a user submits a query, the system executes a multi-stage retrieval pipeline that combines semantic and keyword search, fuses the results, and reranks them before generating a response.
Data Flow: Form Auto-Fill
The form auto-fill pipeline processes uploaded templates, extracts fields, and uses the LLM to generate values from user-provided context.
Installation
Prerequisites
- Docker & Docker Compose — for containerized deployment
- Ollama — local LLM runtime with the following models:
  - gemma3:12b — language model for chat and form filling
  - mxbai-embed-large — embedding model (1024 dimensions)
- Node.js 18+ — for frontend development mode
- Python 3.12+ — for backend development mode
Docker Compose (Recommended)
git clone https://github.com/SL-Mar/marchat.git
cd marchat
docker compose up -d
Pre-Install Ollama Models
Before starting the application, ensure both required models are available in your Ollama instance:
ollama pull gemma3:12b
ollama pull mxbai-embed-large
Development Mode
For local development with hot-reload, use the launch script:
chmod +x launch.sh
./launch.sh
This starts the backend (FastAPI on port 8005) and frontend (Next.js on port 3005) in development mode with automatic reloading on file changes.
Environment Variables
Create a .env file in the project root with the following configuration:
# Core Services
QDRANT_URL=http://localhost:6333
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_MODEL=gemma3:12b
EMBEDDING_MODEL=mxbai-embed-large
# Database
DATABASE_URL=sqlite:///./data/marchat.db
# Optional: OpenAI fallback (leave empty for local-only)
OPENAI_API_KEY=
Ollama must be running with both models pulled before starting the application. The backend will fail to initialize embeddings if
mxbai-embed-large is not available, and chat/RAG queries will fail without gemma3:12b.
Verifying Installation
After starting the application, verify that all services are healthy:
# Check backend health endpoint
curl http://localhost:8005/health
# Expected response:
{
"status": "healthy",
"services": {
"qdrant": "connected",
"ollama": "connected",
"database": "connected"
}
}
Tech Stack
Backend
| Technology | Version | Purpose |
|---|---|---|
| FastAPI | 0.110+ | Async API framework with automatic OpenAPI docs |
| Python | 3.12 | Runtime environment |
| SQLAlchemy | 2.0 | ORM for relational database operations |
| Pydantic | 2.x | Request/response validation and serialization |
| openpyxl | 3.1+ | Excel file reading and writing |
| python-docx | 1.1+ | DOCX file reading and writing |
| sentence-transformers | 2.x | Cross-encoder reranking model |
Frontend
| Technology | Version | Purpose |
|---|---|---|
| Next.js | 14.2 | React framework with server-side rendering |
| React | 18.3 | UI component library |
| Tailwind CSS | 3.x | Utility-first CSS framework |
| Axios | 1.x | HTTP client for API requests |
| react-markdown | 9.x | Markdown rendering in chat responses |
| react-pdf | 7.x | PDF preview in document viewer |
Storage
| Technology | Purpose | Details |
|---|---|---|
| Qdrant | Vector database | Stores embeddings and metadata, supports cosine similarity search |
| SQLite | Relational database | Documents, collections, queries, forms, vessel profiles |
| BM25 pickle index | Keyword search index | Persisted at data/bm25_index.pkl, rebuilt on indexing |
AI Models
| Model | Provider | Purpose | Details |
|---|---|---|---|
| gemma3:12b | Ollama | Language model | Chat, RAG answer generation, form auto-fill |
| mxbai-embed-large | Ollama | Embeddings | 1024-dimensional vectors for semantic search |
| cross-encoder/ms-marco-MiniLM-L-6-v2 | sentence-transformers | Reranking | Query-document relevance scoring for result reranking |
Hybrid Search
The hybrid search pipeline is the core retrieval mechanism in MarChat. It combines two fundamentally different search strategies — semantic vector search and BM25 keyword matching — to achieve robust retrieval across the full spectrum of maritime regulation queries. This is critical because maritime queries often mix natural language questions ("What are the fire safety requirements?") with precise technical references ("Regulation 3-1, Chapter II-2").
Step 1: Semantic Search
The user's query is encoded into a 1024-dimensional vector using the mxbai-embed-large model via Ollama. This vector is then used to perform a cosine similarity search against the Qdrant vector database. Semantic search excels at understanding the meaning behind a query, finding relevant content even when the exact words differ. For example, "fire protection in engine rooms" will match chunks about "fire detection systems in machinery spaces" because the semantic meaning is similar.
To ensure a broad candidate pool, the semantic search retrieves top_k x 3 results, which are later narrowed down by fusion and reranking.
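A minimal sketch of this step, assuming the Ollama HTTP embeddings endpoint and the qdrant-client Python library (constant and function names are illustrative, not the exact MarChat implementation):

```python
import requests
from qdrant_client import QdrantClient

OLLAMA_URL = "http://localhost:11434"
COLLECTION = "maritime_v1"  # assumed collection name

def semantic_search(query: str, top_k: int = 5):
    # Encode the query into a 1024-dim vector with mxbai-embed-large
    resp = requests.post(
        f"{OLLAMA_URL}/api/embeddings",
        json={"model": "mxbai-embed-large", "prompt": query},
    )
    vector = resp.json()["embedding"]

    # Cosine-similarity search in Qdrant, over-fetching top_k x 3 candidates
    client = QdrantClient(url="http://localhost:6333")
    hits = client.search(
        collection_name=COLLECTION,
        query_vector=vector,
        limit=top_k * 3,
    )
    return [(hit.id, hit.score, hit.payload) for hit in hits]
```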
Step 2: BM25 Keyword Search
In parallel, the query is tokenized using simple lowercase word splitting and matched against the BM25 inverted index. BM25 is a probabilistic ranking function that scores documents based on term frequency, inverse document frequency, and document length normalization. This approach is critical for capturing exact regulation numbers, specific technical terms, and abbreviations that semantic search might miss. For example, "Reg. 14.1.2.3" or "SOLAS II-2" are best matched by keyword search.
Like semantic search, BM25 retrieves top_k x 3 results to provide ample candidates for fusion.
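A sketch of this step, assuming the index is built with the rank_bm25 package; the tokenizer mirrors the simple lowercase splitting described in the BM25 Engine section below:

```python
import re
from rank_bm25 import BM25Okapi

def tokenize(text: str) -> list[str]:
    # Lowercase and split on whitespace/punctuation boundaries
    return re.findall(r"\w+", text.lower())

# Illustrative corpus; in MarChat this is the set of indexed chunk texts
corpus = [
    "Regulation 14.1.2.3 sulphur content of fuel oil ...",
    "SOLAS II-2 fire detection in machinery spaces ...",
]
bm25 = BM25Okapi([tokenize(doc) for doc in corpus])

def keyword_search(query: str, top_k: int = 5):
    scores = bm25.get_scores(tokenize(query))
    ranked = sorted(enumerate(scores), key=lambda x: x[1], reverse=True)
    return ranked[: top_k * 3]  # over-fetch candidates for fusion
```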
Step 3: Reciprocal Rank Fusion (RRF)
The results from both search methods are combined using Reciprocal Rank Fusion. RRF is a rank-based fusion method that does not require score normalization across different retrieval systems. Each document receives a score based on its rank position in each result list, with the constant k = 60 dampening the influence of high-ranking positions. Documents appearing in both lists receive combined scores, naturally boosting results that are relevant by both criteria.
After fusion, the highest-scoring top_k x 2 candidates are passed forward for reranking.
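RRF itself is only a few lines. A minimal sketch of the fusion formula described above, with k = 60 (not necessarily the exact MarChat implementation):

```python
def rrf_fuse(semantic_ids: list, keyword_ids: list, k: int = 60, top_n: int = 10):
    scores: dict = {}
    for ranked in (semantic_ids, keyword_ids):
        for rank, doc_id in enumerate(ranked):
            # 1 / (k + rank) dampens the advantage of top positions;
            # IDs present in both lists accumulate score from each.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    fused = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    return fused[:top_n]  # the top top_k x 2 go on to reranking
```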
Step 4: Cross-Encoder Reranking
The fused candidates are reranked using a cross-encoder model (cross-encoder/ms-marco-MiniLM-L-6-v2). Unlike bi-encoder similarity (used in semantic search), the cross-encoder processes the query and each candidate document jointly, allowing it to capture fine-grained interactions between the query and document text. This produces much more accurate relevance scores at the cost of higher computational overhead — which is why it is applied only to the already-filtered candidate set.
Documents scoring below MIN_RERANK_SCORE = -2.0 are filtered out, and the remaining results are returned sorted by relevance score, limited to top_k.
The hybrid approach combines the strengths of semantic understanding (captures meaning) with exact keyword matching (captures specific regulation numbers and technical terms), resulting in significantly better retrieval than either method alone. In testing on maritime regulation queries, hybrid search with reranking consistently outperformed standalone semantic or keyword search.
BM25 Engine
The BM25 engine provides keyword-based retrieval as a complement to semantic vector search. It maintains a persistent inverted index that is incrementally updated as new documents are indexed.
Tokenization
The engine uses a simple tokenization strategy: text is converted to lowercase and split on whitespace and punctuation boundaries. This straightforward approach works well for maritime regulatory text, which contains many technical terms and abbreviations that benefit from exact matching.
Persistence
The BM25 index is serialized using Python's pickle module and stored at data/bm25_index.pkl. This allows the index to persist across application restarts without requiring re-indexing of all documents.
Incremental Indexing
When new documents are indexed, the BM25 index is updated incrementally rather than rebuilt from scratch. New chunk texts and their corresponding IDs are appended to the existing index, and the BM25 statistics (term frequencies, document lengths) are recalculated.
Key Methods
| Method | Description |
|---|---|
| index_chunks(chunks, chunk_ids) | Add new chunks to the BM25 index. Tokenizes each chunk and updates the inverted index and IDF statistics. |
| search(query, top_k) | Tokenize the query and rank all indexed chunks by BM25 score. Returns the top_k results with scores and chunk IDs. |
| clear_index() | Remove all entries from the index and delete the pickle file. Used when a collection is deleted or a full re-index is needed. |
# BM25 index location
data/bm25_index.pkl
# Index is loaded at startup if the file exists
# Re-created automatically if missing or corrupted
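Putting the persistence and incremental-indexing behavior together, a minimal sketch of the engine, assuming rank_bm25 under the hood (the on-disk pickle format here is illustrative):

```python
import os
import pickle
import re
from rank_bm25 import BM25Okapi

INDEX_PATH = "data/bm25_index.pkl"

class BM25Engine:
    def __init__(self):
        self.texts: list[list[str]] = []   # tokenized chunk texts
        self.chunk_ids: list = []
        self.bm25 = None
        if os.path.exists(INDEX_PATH):     # load persisted index at startup
            with open(INDEX_PATH, "rb") as f:
                self.texts, self.chunk_ids = pickle.load(f)
            self._rebuild()

    def _rebuild(self):
        # rank_bm25 recomputes IDF and length statistics on construction
        self.bm25 = BM25Okapi(self.texts) if self.texts else None

    def index_chunks(self, chunks: list[str], chunk_ids: list):
        # Incremental: append new chunks, then recalculate statistics
        self.texts += [re.findall(r"\w+", c.lower()) for c in chunks]
        self.chunk_ids += chunk_ids
        self._rebuild()
        with open(INDEX_PATH, "wb") as f:  # persist across restarts
            pickle.dump((self.texts, self.chunk_ids), f)
```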
Smart Chunking
Maritime regulations follow a strict hierarchical structure defined by the International Maritime Organization (IMO). MarChat's smart chunking engine detects this hierarchy during document processing and preserves it as metadata on each chunk, enabling precise citations in RAG responses.
IMO Hierarchy Detection
The chunker uses regular expression patterns to detect hierarchy levels in the document text. As it processes each line, it maintains a cascading state machine that tracks the current position in the hierarchy.
# Hierarchy detection regex patterns
convention: (SOLAS|MARPOL|STCW|COLREG|LOADLINE|TONNAGE|SFV|STP|SAR)
annex: ANNEX\s+[IVX]+|ANNEX\s+[0-9]+
chapter: CHAPTER\s+\w+
regulation: Regulation\s+\d+|Reg\.\s*\d+
paragraph: \d+\.\d+(\.\d+)*
Processing Steps
- Split PDF by pages — Extract text from each page of the uploaded PDF document.
- Process line-by-line — Scan each line for hierarchy markers using the regex patterns above.
- Detect convention from filename — If the filename contains a convention name (e.g., SOLAS_consolidated.pdf), set the convention metadata automatically.
- Maintain hierarchical state — Track the current position as: convention → annex → chapter → regulation → paragraph. Each new detection updates the corresponding level, and all chunks inherit the full hierarchy path.
- Build chunks at threshold — Accumulate text until the chunk reaches the schema-configured chunk size (800 characters for maritime_v1), then split with a 100-character overlap to preserve context across chunk boundaries.
- Cascading reset — When a higher-level marker is detected (e.g., a new annex), all lower levels (chapter, regulation, paragraph) are reset. This prevents stale metadata from carrying over into a new section. A sketch of this state machine follows the list.
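A minimal sketch of the cascading state machine, with patterns abbreviated from the regex table above (names are illustrative):

```python
import re

LEVELS = ["convention", "annex", "chapter", "regulation", "paragraph"]
PATTERNS = {
    "convention": r"(SOLAS|MARPOL|STCW|COLREG)",
    "annex": r"ANNEX\s+[IVX0-9]+",
    "chapter": r"CHAPTER\s+\w+",
    "regulation": r"Regulation\s+\d+|Reg\.\s*\d+",
    "paragraph": r"\d+\.\d+(\.\d+)*",
}

def update_state(state: dict, line: str) -> dict:
    for i, level in enumerate(LEVELS):
        match = re.search(PATTERNS[level], line)
        if match:
            state[level] = match.group(0)
            # Cascading reset: clear every level below the detected one
            for lower in LEVELS[i + 1:]:
                state[lower] = None
    return state

# Every chunk built while this state holds inherits a copy of it as metadata.
```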
Metadata Output
Each chunk is stored with the following metadata fields:
| Field | Example Value | Description |
|---|---|---|
| convention | SOLAS | IMO convention name |
| annex | ANNEX I | Annex number (Roman or Arabic) |
| chapter | CHAPTER II-1 | Chapter identifier |
| regulation | Regulation 3-1 | Regulation number |
| paragraph | 2.1 | Paragraph number (dotted notation) |
| page | 145 | Source page number in PDF |
Hierarchical chunking preserves regulatory structure in metadata, enabling precise citations like "SOLAS > Chapter II-1 > Regulation 3-1 > Para. 2.1, Page 145" in RAG responses. This level of precision is critical for maritime regulatory compliance.
Cross-Encoder Reranking
After the Reciprocal Rank Fusion step produces a merged candidate set, the cross-encoder model provides a final precision pass by jointly scoring each (query, document) pair.
Model Details
| Property | Value |
|---|---|
| Model | cross-encoder/ms-marco-MiniLM-L-6-v2 |
| Input | (query, candidate_text) pairs |
| Output | Relevance score (range: approximately -10 to +10) |
| Threshold | MIN_RERANK_SCORE = -2.0 |
How It Works
Unlike bi-encoder models (which encode query and document independently and compare their vectors), a cross-encoder processes the query and document text together through the same transformer pass. This allows the model to attend across both inputs simultaneously, capturing word-level interactions between the query and the document. The result is significantly more accurate relevance judgments, at the cost of being computationally more expensive (which is why it is only applied to the already-narrowed candidate set, not the entire corpus).
Filtering
After scoring, any candidate with a relevance score below -2.0 is removed from the results. This threshold was determined empirically to filter out genuinely irrelevant content while retaining borderline-relevant chunks that may contain useful context. The remaining candidates are sorted by score in descending order and truncated to top_k.
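A sketch of this rerank-and-filter step using the sentence-transformers CrossEncoder API named above:

```python
from sentence_transformers import CrossEncoder

MIN_RERANK_SCORE = -2.0
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_k: int = 5):
    # Jointly score each (query, document) pair in one transformer pass
    scores = reranker.predict([(query, text) for text in candidates])
    # Drop genuinely irrelevant candidates, then sort by relevance
    scored = [(t, s) for t, s in zip(candidates, scores) if s >= MIN_RERANK_SCORE]
    scored.sort(key=lambda x: x[1], reverse=True)
    return scored[:top_k]
```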
Multi-Mode Chat
MarChat supports five specialized chat modes, each with a tailored system prompt that controls the LLM's behavior, citation format, and response structure. The mode is selected per-query, allowing users to switch between different operational contexts within the same session.
| Mode | Description | Citation Format |
|---|---|---|
| regulatory | IMO convention expert. Provides precise regulatory answers with full hierarchy citations. | Convention > Chapter > Regulation > Paragraph, Page |
| medical | Maritime medical assistant based on the International Medical Guide for Ships (IMGS). | TMAS protocol references, IMGS section numbers |
| general | General operations assistant for practical maritime questions. | Practical advice format with source references |
| forms | Pre-arrival document specialist that outputs structured JSON for form filling. | JSON field mapping: { "field": "value" } |
| research | Academic paper assistant for scholarly analysis and literature review. | Source, page, academic tone with proper attribution |
Example: Regulatory Mode Query
# Request
POST /api/rag/query
{
"query": "What are the requirements for fire detection in machinery spaces?",
"mode": "regulatory",
"collection_name": "maritime_v1",
"top_k": 5
}
# Response includes citations like:
# "According to SOLAS > Chapter II-2 > Regulation 7 > Para. 2.1 (Page 203),
# fixed fire detection and fire alarm systems shall be provided
# in machinery spaces of category A..."
System Prompts
Each chat mode uses a carefully crafted system prompt that defines the LLM's persona, response structure, citation requirements, and content guidelines. The system prompt is prepended to the user's query along with the retrieved context chunks.
Prompt Structure
Every system prompt follows a common structure:
- Role definition — who the assistant is and its area of expertise
- Citation requirements — how to reference source material
- Response format — expected structure, tone, and length
- Content guidelines — what to include, what to avoid
- Fallback behavior — how to handle queries outside the knowledge base
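As an illustration of how the pieces fit together, a hypothetical assembly function combining a mode's system prompt with retrieved context and the user query (SYSTEM_PROMPTS is an assumed mode-to-prompt mapping, not the actual MarChat code):

```python
# Assumed mapping of mode name to the prompt texts shown in this section
SYSTEM_PROMPTS: dict[str, str] = {
    "regulatory": "You are a maritime regulatory expert ...",
    "forms": "You are a pre-arrival documentation specialist ...",
}

def build_messages(mode: str, query: str, chunks: list[dict]) -> list[dict]:
    # Label each retrieved chunk with its source metadata for citation
    context = "\n\n".join(
        f"[{c['metadata'].get('convention', '')} p.{c['metadata'].get('page', '?')}] {c['text']}"
        for c in chunks
    )
    return [
        {"role": "system", "content": SYSTEM_PROMPTS[mode]},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
    ]
```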
Regulatory Mode Prompt (Example)
You are a maritime regulatory expert specializing in IMO conventions.
Answer questions using ONLY the provided context from maritime regulations.
CITATION FORMAT (MANDATORY):
- Always cite the specific regulation using the full hierarchy:
Convention > Chapter > Regulation > Paragraph, Page
- Example: "SOLAS > Chapter II-1 > Regulation 3-1 > Para. 2.1, Page 145"
RESPONSE STRUCTURE:
1. Direct answer to the question
2. Relevant regulatory text with citations
3. Additional context or related regulations if applicable
IMPORTANT:
- If the context does not contain sufficient information, state this clearly
- Do not fabricate or assume regulatory content
- Use exact text from regulations when possible
Forms Mode Prompt (Example)
You are a pre-arrival documentation specialist.
Given the form fields and context provided, generate a JSON object
mapping each field name to its appropriate value.
OUTPUT FORMAT:
Return ONLY a valid JSON object with field names as keys:
{
"vessel_name": "MV Pacific Star",
"imo_number": "9123456",
"eta": "2026-03-15T08:00:00",
"last_port": "Singapore"
}
IMPORTANT:
- Use the vessel profile data when available
- Infer values from the natural language context
- Use ISO 8601 format for dates
- Leave unknown fields as empty strings
Mode-Specific Behaviors
- Regulatory mode enforces hierarchical citations and uses exact regulatory text whenever available in the context.
- Medical mode includes safety disclaimers and references to TMAS (Telemedical Assistance Service) protocols for remote medical guidance.
- General mode provides practical, actionable advice while still referencing source material.
- Forms mode outputs strictly valid JSON, mapping field names to inferred values from the provided context.
- Research mode adopts an academic tone, tracks sources carefully, and provides page-level citations.
LLM Providers
MarChat uses a factory pattern to abstract LLM provider details, allowing the system to switch between different backends without changing application logic.
Provider Factory
# Factory function
def get_llm_provider(provider_type: str, **kwargs) -> LLMProvider:
    """
    Returns an LLM provider instance based on the specified type.

    Args:
        provider_type: "ollama" or "openai"
        **kwargs: Provider-specific configuration

    Returns:
        LLMProvider instance with generate() and embed() methods
    """
    if provider_type == "ollama":
        return OllamaProvider(**kwargs)
    elif provider_type == "openai":
        return OpenAIProvider(**kwargs)
    else:
        raise ValueError(f"Unknown provider: {provider_type}")
Ollama Provider (Default)
The Ollama provider communicates with a locally running Ollama instance via its HTTP API. It is the default and recommended provider for MarChat, as it keeps all data on the local machine.
| Parameter | Default | Description |
|---|---|---|
| model | gemma3:12b | Chat and generation model |
| temperature | 0.3 | Controls randomness (lower = more deterministic) |
| num_predict | 2048 | Maximum tokens in response |
| repeat_penalty | 1.1 | Penalizes token repetition |
| top_k | 40 | Top-K sampling parameter |
| top_p | 0.9 | Nucleus sampling parameter |
| stop | [] | Stop token sequences |
# Ollama API call for chat generation
POST http://localhost:11434/api/chat
{
"model": "gemma3:12b",
"messages": [
{"role": "system", "content": "...system prompt..."},
{"role": "user", "content": "...context + query..."}
],
"options": {
"temperature": 0.3,
"num_predict": 2048,
"repeat_penalty": 1.1,
"top_k": 40,
"top_p": 0.9
},
"stream": false
}
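A minimal provider sketch wrapping this call; the class shape is assumed, and the sampling defaults mirror the parameter table above:

```python
import requests

class OllamaProvider:
    def __init__(self, base_url: str = "http://localhost:11434",
                 model: str = "gemma3:12b"):
        self.base_url = base_url
        self.model = model

    def generate(self, system: str, user: str) -> str:
        # Non-streaming chat call against the Ollama /api/chat endpoint
        resp = requests.post(
            f"{self.base_url}/api/chat",
            json={
                "model": self.model,
                "messages": [
                    {"role": "system", "content": system},
                    {"role": "user", "content": user},
                ],
                "options": {"temperature": 0.3, "num_predict": 2048,
                            "repeat_penalty": 1.1, "top_k": 40, "top_p": 0.9},
                "stream": False,
            },
        )
        return resp.json()["message"]["content"]
```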
OpenAI Provider (Optional)
The OpenAI provider is available as a fallback for environments where Ollama is not available. It requires an OPENAI_API_KEY environment variable and uses the openai.ChatCompletion API. Note that using OpenAI sends data to external servers, which may not be appropriate for sensitive vessel information.
Template Processing
The template processing engine handles Excel (.xlsx) and DOCX files used for pre-arrival port documentation. It detects form fields using placeholder patterns and highlighted cells, then provides a structured representation of the template's fields for filling.
Field Detection
The engine searches for two types of field markers in uploaded templates:
- Placeholder patterns — {{field_name}} markers in cell values or document text
- Highlighted cells — Yellow-highlighted cells in Excel spreadsheets (common in port authority templates)
Field Type Inference
Field types are automatically inferred from the field name using keyword matching:
| Inferred Type | Keywords in Field Name | Example Fields |
|---|---|---|
| date | date, eta, etd, arrival, departure, dob | {{eta}}, {{date_of_birth}}, {{departure_date}} |
| number | tonnage, draft, length, breadth, crew_count | {{gross_tonnage}}, {{forward_draft}} |
| boolean | has_, is_, valid_ | {{has_dangerous_goods}}, {{is_chartered}} |
| text | (default for all others) | {{vessel_name}}, {{last_port}}, {{agent_name}} |
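A sketch of how this keyword matching might look in code; the exact keyword lists in MarChat may differ. This infer_type helper is the one referenced by the placeholder-detection example later in this section:

```python
def infer_type(field_name: str) -> str:
    # Keyword-based inference matching the table above
    name = field_name.lower()
    if any(k in name for k in ("date", "eta", "etd", "arrival", "departure", "dob")):
        return "date"
    if any(k in name for k in ("tonnage", "draft", "length", "breadth", "crew_count")):
        return "number"
    if any(k in name for k in ("has_", "is_", "valid_")):
        return "boolean"
    return "text"  # default for all other field names
```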
FormField Data Model
from dataclasses import dataclass
from typing import Optional

@dataclass
class FormField:
    name: str                      # Field identifier (e.g., "vessel_name")
    field_type: str                # text, date, number, boolean, list
    location: str                  # Cell reference (Excel) or placeholder position (DOCX)
    required: bool                 # Whether the field is mandatory
    default_value: Optional[str]   # Pre-populated value if available
    description: str               # Human-readable field description
Fill Process
- Load template — Open the uploaded Excel or DOCX file using openpyxl or python-docx.
- Find placeholders — Scan all cells (Excel) or paragraphs (DOCX) for {{...}} patterns.
- Replace with values — Substitute each placeholder with the corresponding value from the fill data.
- Save filled document — Write the filled template to disk and make it available for download.
# Example: Excel placeholder detection
import re
from openpyxl import load_workbook

workbook = load_workbook(template_path)  # template_path: location of the uploaded file
fields = []
for sheet in workbook.sheetnames:
    ws = workbook[sheet]
    for row in ws.iter_rows():
        for cell in row:
            if cell.value and "{{" in str(cell.value):
                # Extract each field name from {{field_name}} patterns
                for field_name in re.findall(r'\{\{(\w+)\}\}', str(cell.value)):
                    fields.append(FormField(
                        name=field_name,
                        field_type=infer_type(field_name),
                        location=f"{sheet}!{cell.coordinate}",
                        required=True,
                        default_value=None,
                        description=field_name.replace("_", " ").title(),
                    ))
Auto-Fill Engine
The auto-fill engine uses the LLM to intelligently populate form fields from unstructured natural language context. Instead of manually filling each field, the user provides a free-text description of the voyage, vessel, and port information, and the LLM extracts the relevant values.
How It Works
- Extract field list — The system retrieves all detected fields from the uploaded template.
- Build prompt — A structured prompt is constructed containing the field list, the user's natural language context, and optionally the vessel profile data.
- LLM generation — The forms mode system prompt instructs the LLM to output a JSON object mapping each field name to its inferred value.
- Parse response — The engine handles markdown code block wrapping (strips ```json delimiters if present) and parses the JSON.
- Fill template — The parsed values are applied to the template placeholders.
Example Context Input
We are MV Pacific Star, IMO 9123456, Panama flag, arriving at
Rotterdam on March 15th 2026 at approximately 0800 local time.
Coming from Singapore via Suez Canal. Gross tonnage 45,230.
We have 22 crew members on board, all healthy. No dangerous
goods. Last port state control inspection was in Singapore on
February 28th, no deficiencies found. Agent is MaritimePort BV.
Example LLM Output
{
"vessel_name": "MV Pacific Star",
"imo_number": "9123456",
"flag_state": "Panama",
"port_of_arrival": "Rotterdam",
"eta": "2026-03-15T08:00:00",
"last_port": "Singapore",
"gross_tonnage": "45230",
"crew_count": "22",
"has_dangerous_goods": "No",
"last_psc_inspection_port": "Singapore",
"last_psc_inspection_date": "2026-02-28",
"psc_deficiencies": "None",
"agent_name": "MaritimePort BV"
}
Vessel Profile Pre-Population
If a vessel profile is associated with the fill request, its static fields (vessel name, IMO number, flag state, call sign, MMSI, tonnage, etc.) are automatically included in the prompt context. This reduces the amount of information the user needs to provide and ensures consistency across multiple form fills.
The auto-fill engine uses the forms mode system prompt to ensure structured JSON output. It also includes fallback parsing that handles cases where the LLM wraps the JSON in markdown code blocks (```json ... ```), a common behavior for instruction-following models.
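A sketch of that tolerant parsing step (illustrative, not the exact implementation):

```python
import json
import re

def parse_llm_json(raw: str) -> dict:
    text = raw.strip()
    # Strip a ```json ... ``` (or bare ``` ... ```) wrapper if present
    fenced = re.match(r"^```(?:json)?\s*(.*?)\s*```$", text, re.DOTALL)
    if fenced:
        text = fenced.group(1)
    return json.loads(text)
```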
Vessel Profiles
Vessel profiles store reusable vessel information that can be pre-populated into forms, eliminating the need to re-enter static data for every port call.
Profile Fields
| Field | Type | Description |
|---|---|---|
| vessel_name | string | Full vessel name (e.g., "MV Pacific Star") |
| imo_number | string | 7-digit IMO identification number |
| flag_state | string | Flag state / country of registration |
| call_sign | string | Radio call sign |
| mmsi | string | Maritime Mobile Service Identity (9 digits) |
| gross_tonnage | float | Gross tonnage (GT) |
| net_tonnage | float | Net tonnage (NT) |
| deadweight | float | Deadweight tonnage (DWT) |
| vessel_type | string | Type of vessel (bulk carrier, tanker, container, etc.) |
| year_built | integer | Year of construction |
| owner | string | Registered owner |
| operator | string | Ship operator / manager |
| classification_society | string | Classification society (Lloyd's, DNV, BV, etc.) |
| extra_fields_json | JSON | Extensible key-value store for additional vessel-specific data |
The extra_fields_json column allows storing any additional vessel-specific data without schema changes. This is useful for fields that vary by vessel type, flag state requirements, or specific port authority needs.
RAG API
Endpoints for executing RAG queries against indexed document collections.
Execute a RAG query against a specified collection. Runs the full hybrid search pipeline (semantic + BM25 + RRF + cross-encoder reranking) and generates an LLM response with citations.
Request Body
{
"query": "What are the requirements for fire detection in machinery spaces?",
"mode": "regulatory",
"collection_name": "maritime_v1",
"top_k": 5
}
| Parameter | Type | Required | Description |
|---|---|---|---|
| query | string | Yes | The user's question or search query |
| mode | string | No | Chat mode: regulatory, medical, general, forms, research. Default: regulatory |
| collection_name | string | No | Target collection for search. Default: first available collection |
| top_k | integer | No | Number of context chunks to retrieve. Default: 5 |
Response
{
"answer": "According to SOLAS > Chapter II-2 > Regulation 7 > Para. 2.1 (Page 203), fixed fire detection and fire alarm systems shall be provided in machinery spaces of category A. The system must be capable of rapidly detecting the onset of fire...",
"retrieved_chunks": [
{
"text": "2.1 A fixed fire detection and fire alarm system...",
"metadata": {
"convention": "SOLAS",
"chapter": "CHAPTER II-2",
"regulation": "Regulation 7",
"paragraph": "2.1",
"page": 203
},
"score": 0.847
}
],
"total_chunks_retrieved": 5,
"response_time_ms": 2340,
"mode": "regulatory"
}
Retrieve the query history with answers, modes, and response times. Useful for reviewing past queries and analyzing system performance.
Query Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| limit | integer | No | Maximum number of history entries to return. Default: 50 |
Response
[
{
"id": 42,
"query_text": "Fire detection in machinery spaces",
"answer_text": "According to SOLAS...",
"mode": "regulatory",
"response_time_ms": 2340,
"created_at": "2026-02-07T14:30:00Z"
}
]
Documents API
Endpoints for uploading, indexing, and managing PDF documents within collections.
Upload a PDF document to a collection. Optionally triggers automatic indexing (chunking + embedding + vector storage).
Request (multipart/form-data)
| Field | Type | Required | Description |
|---|---|---|---|
| file | file | Yes | PDF file to upload |
| document_type | string | No | Type of document (convention, circular, guideline) |
| collection_name | string | No | Target collection name. Default: default collection |
| auto_index | boolean | No | Automatically index after upload. Default: true |
Response
{
"id": 7,
"title": "SOLAS_Consolidated_2024",
"filename": "SOLAS_Consolidated_2024.pdf",
"status": "indexing",
"collection_id": 1,
"total_chunks": 0,
"created_at": "2026-02-07T10:15:00Z"
}
Manually trigger indexing for an uploaded document. Processes the PDF through smart chunking, generates embeddings, stores vectors in Qdrant, and updates the BM25 index.
List all uploaded documents with their status, collection assignment, and chunk count.
Retrieve details for a specific document including metadata, chunk count, and indexing status.
Delete a document and all its associated chunks from both Qdrant and the BM25 index.
Retrieve a preview of the document's first few pages. Returns extracted text for display in the frontend document viewer.
Collections API
Endpoints for managing document collections. Each collection maps to a Qdrant collection and has its own schema, search configuration, and document set.
List all collections with document counts and schema information.
Response
[
{
"id": 1,
"name": "Maritime Regulations",
"description": "SOLAS, MARPOL, STCW and other IMO conventions",
"schema_id": "maritime_v1",
"qdrant_collection_name": "maritime_v1",
"document_count": 12,
"created_at": "2026-01-15T09:00:00Z"
}
]
Create a new collection with a specified schema. This creates the corresponding Qdrant collection with the correct vector dimension (1024) and distance metric (cosine).
Request Body
{
"name": "Medical Guidelines",
"description": "IMGS and maritime medical references",
"schema_id": "medical_v1"
}
Retrieve details for a specific collection including its schema, search configuration, and associated documents.
Delete a collection and all its associated documents, chunks, and Qdrant vectors.
List all available collection schemas with their metadata field definitions and chunking configurations.
Available Schemas
| Schema ID | Description | Optimized For |
|---|---|---|
| maritime_v1 | IMO conventions and maritime regulations | Hierarchical regulatory text (SOLAS, MARPOL, STCW) |
| legal_v1 | Legal documents and contracts | Clause-based legal text with section references |
| medical_v1 | Medical guidelines and protocols | IMGS, medical procedures, treatment guidelines |
| technical_v1 | Technical manuals and documentation | Equipment manuals, maintenance procedures, specifications |
Chat API
Endpoints for the multi-mode chat system. Chat sessions maintain conversation history for context continuity.
Send a message in a chat session. The system applies the specified mode's system prompt and optionally retrieves RAG context before generating a response.
Request Body
{
"session_id": "sess_abc123",
"message": "What are the minimum rest hour requirements for watchkeepers?",
"mode": "regulatory",
"use_rag": true,
"collection_name": "maritime_v1"
}
| Parameter | Type | Required | Description |
|---|---|---|---|
| session_id | string | No | Chat session identifier. Auto-generated if not provided. |
| message | string | Yes | User's message text |
| mode | string | No | Chat mode. Default: general |
| use_rag | boolean | No | Whether to retrieve context from the document collection. Default: true |
| collection_name | string | No | Collection to search for RAG context. Required if use_rag is true. |
Response
{
"session_id": "sess_abc123",
"response": "According to STCW > Section A-VIII/1 > Para. 2 (Page 312), the minimum rest period for seafarers assigned to watchkeeping duties is:\n\n1. A minimum of 10 hours of rest in any 24-hour period\n2. A minimum of 77 hours in any 7-day period\n\nThe rest period may be divided into no more than two periods, one of which shall be at least 6 hours in length...",
"mode": "regulatory",
"sources": [
{
"convention": "STCW",
"chapter": "Section A-VIII/1",
"paragraph": "2",
"page": 312
}
]
}
Retrieve the full conversation history for a chat session, including both user messages and assistant responses.
Delete all messages in a chat session.
List all active chat sessions with their most recent message and mode.
Forms API
Endpoints for managing form templates, filling forms (manually or with LLM assistance), and managing vessel profiles.
Template Management
Upload a form template (Excel or DOCX). The system automatically detects {{placeholder}} fields and infers their types.
Request (multipart/form-data)
| Field | Type | Required | Description |
|---|---|---|---|
| file | file | Yes | Excel (.xlsx) or Word (.docx) template file |
| name | string | No | Template display name. Defaults to filename. |
Response
{
"id": 3,
"name": "Rotterdam Pre-Arrival Form",
"file_type": "xlsx",
"field_count": 28,
"fields": [
{"name": "vessel_name", "field_type": "text", "location": "Sheet1!B3"},
{"name": "imo_number", "field_type": "text", "location": "Sheet1!B4"},
{"name": "eta", "field_type": "date", "location": "Sheet1!B7"},
{"name": "gross_tonnage", "field_type": "number", "location": "Sheet1!B10"},
{"name": "has_dangerous_goods", "field_type": "boolean", "location": "Sheet1!B15"}
],
"created_at": "2026-02-07T11:00:00Z"
}
List all uploaded form templates with field counts and file types.
Retrieve details for a specific template including its full field list.
Delete a form template and its associated file.
Form Filling
Fill a template with manually provided field values. Replaces {{placeholder}} patterns with the specified values.
Request Body
{
"template_id": 3,
"field_values": {
"vessel_name": "MV Pacific Star",
"imo_number": "9123456",
"eta": "2026-03-15T08:00:00",
"gross_tonnage": "45230",
"has_dangerous_goods": "No"
},
"vessel_profile_id": 1
}
Fill a template using LLM-assisted auto-fill. The LLM extracts field values from the provided natural language context and optional vessel profile.
Request Body
{
"template_id": 3,
"context": "We are MV Pacific Star, IMO 9123456, Panama flag, arriving at Rotterdam on March 15th 2026 at approximately 0800 local time. Coming from Singapore via Suez Canal. Gross tonnage 45,230. 22 crew members, all healthy. No dangerous goods.",
"vessel_profile_id": 1
}
Response
{
"id": 15,
"template_id": 3,
"vessel_profile_id": 1,
"field_values": {
"vessel_name": "MV Pacific Star",
"imo_number": "9123456",
"flag_state": "Panama",
"port_of_arrival": "Rotterdam",
"eta": "2026-03-15T08:00:00",
"last_port": "Singapore",
"gross_tonnage": "45230",
"crew_count": "22",
"has_dangerous_goods": "No"
},
"fields_filled": 9,
"fields_total": 28,
"created_at": "2026-02-07T11:30:00Z"
}
List all filled forms with their template association and fill completion percentage.
Download the filled form document (Excel or DOCX) with all placeholders replaced by their values.
Vessel Profile Management
Create a new vessel profile with static vessel information for form pre-population.
Request Body
{
"vessel_name": "MV Pacific Star",
"imo_number": "9123456",
"flag_state": "Panama",
"call_sign": "3FXY7",
"mmsi": "354123456",
"gross_tonnage": 45230,
"net_tonnage": 22615,
"deadweight": 52000,
"vessel_type": "Bulk Carrier",
"year_built": 2018,
"owner": "Pacific Shipping Ltd",
"operator": "StarBulk Management",
"classification_society": "Lloyd's Register"
}
List all vessel profiles.
Retrieve a specific vessel profile with all fields including extra_fields_json.
Database Schema
MarChat uses SQLite for relational data storage with 11 tables organized into RAG-related and form-related groups. The database file is located at data/marchat.db.
RAG Tables
DocumentSchema
Defines the metadata structure and chunking strategy for a collection type.
| Column | Type | Description |
|---|---|---|
| id | INTEGER PK | Auto-increment primary key |
| schema_id | VARCHAR UNIQUE | Schema identifier (e.g., "maritime_v1") |
| name | VARCHAR | Human-readable schema name |
| metadata_fields_json | TEXT | JSON array of metadata field definitions |
| hierarchy_config_json | TEXT | JSON object defining hierarchy detection rules |
| chunking_strategy_json | TEXT | JSON object with chunk_size, overlap, and strategy type |
Collection
Represents a document collection backed by a Qdrant vector collection.
| Column | Type | Description |
|---|---|---|
| id | INTEGER PK | Auto-increment primary key |
| name | VARCHAR | Collection display name |
| description | TEXT | Collection description |
| schema_id | VARCHAR FK | References DocumentSchema.schema_id |
| qdrant_collection_name | VARCHAR | Corresponding Qdrant collection name |
| document_count | INTEGER | Number of documents in this collection |
| created_at | DATETIME | Creation timestamp |
CollectionSearchConfig
Per-collection search configuration for the hybrid pipeline.
| Column | Type | Description |
|---|---|---|
| id | INTEGER PK | Auto-increment primary key |
| collection_id | INTEGER FK | References Collection.id |
| use_hybrid | BOOLEAN | Enable hybrid search (semantic + BM25). Default: true |
| semantic_weight | FLOAT | Weight for semantic search results in fusion |
| keyword_weight | FLOAT | Weight for BM25 keyword results in fusion |
| top_k | INTEGER | Number of results to return. Default: 5 |
| use_reranking | BOOLEAN | Enable cross-encoder reranking. Default: true |
Document
Represents an uploaded PDF document.
| Column | Type | Description |
|---|---|---|
| id | INTEGER PK | Auto-increment primary key |
| title | VARCHAR | Document title (derived from filename) |
| filename | VARCHAR | Original uploaded filename |
| file_path | VARCHAR | Path to stored file on disk |
| collection_id | INTEGER FK | References Collection.id |
| status | VARCHAR | Processing status: uploaded, indexing, indexed, error |
| total_chunks | INTEGER | Number of chunks generated from this document |
| created_at | DATETIME | Upload timestamp |
Query
Stores RAG query history for analytics and debugging.
| Column | Type | Description |
|---|---|---|
| id | INTEGER PK | Auto-increment primary key |
| query_text | TEXT | The user's original query |
| answer_text | TEXT | The LLM-generated answer |
| mode | VARCHAR | Chat mode used for this query |
| response_time_ms | INTEGER | Total response time in milliseconds |
| created_at | DATETIME | Query timestamp |
Form Tables
VesselProfile
Stores reusable vessel information for form pre-population.
| Column | Type | Description |
|---|---|---|
| id | INTEGER PK | Auto-increment primary key |
| vessel_name | VARCHAR | Full vessel name |
| imo_number | VARCHAR UNIQUE | 7-digit IMO number |
| flag_state | VARCHAR | Flag state |
| call_sign | VARCHAR | Radio call sign |
| mmsi | VARCHAR | MMSI number |
| gross_tonnage | FLOAT | Gross tonnage (GT) |
| net_tonnage | FLOAT | Net tonnage (NT) |
| deadweight | FLOAT | Deadweight tonnage (DWT) |
| vessel_type | VARCHAR | Type of vessel |
| year_built | INTEGER | Year of construction |
| owner | VARCHAR | Registered owner |
| operator | VARCHAR | Ship operator |
| classification_society | VARCHAR | Classification society |
| extra_fields_json | TEXT | Extensible JSON key-value store |
FormTemplate
Stores uploaded form templates with their detected fields.
| Column | Type | Description |
|---|---|---|
| id | INTEGER PK | Auto-increment primary key |
| name | VARCHAR | Template display name |
| file_path | VARCHAR | Path to stored template file |
| file_type | VARCHAR | File type: xlsx or docx |
| fields_json | TEXT | JSON array of detected FormField objects |
| created_at | DATETIME | Upload timestamp |
FilledForm
Stores completed form instances with their field values.
| Column | Type | Description |
|---|---|---|
| id | INTEGER PK | Auto-increment primary key |
| template_id | INTEGER FK | References FormTemplate.id |
| vessel_profile_id | INTEGER FK | References VesselProfile.id (optional) |
| field_values_json | TEXT | JSON object mapping field names to filled values |
| file_path | VARCHAR | Path to the filled output file |
| created_at | DATETIME | Fill timestamp |
Settings_DB
Key-value store for application settings.
| Column | Type | Description |
|---|---|---|
| key | VARCHAR PK | Setting key (e.g., "default_model", "embedding_model") |
| value | TEXT | Setting value |
Collection Schemas
Collection schemas define how documents are chunked, what metadata fields are extracted, and how the hierarchy detection works for each collection type. MarChat ships with four built-in schemas optimized for different document types.
System Schemas
| Schema | Metadata Fields | Chunking Strategy | Chunk Size |
|---|---|---|---|
| maritime_v1 | convention, annex, chapter, regulation, paragraph | Hierarchical | 800 chars |
| legal_v1 | document_type, section, clause | Paragraph-based | 600 chars |
| medical_v1 | guideline_type, specialty, section | Sentence-based | 600 chars |
| technical_v1 | doc_type, chapter, section | Hierarchical | 700 chars |
Schema Configuration Example
# maritime_v1 schema definition
{
"schema_id": "maritime_v1",
"name": "Maritime Regulations",
"metadata_fields": [
{"name": "convention", "type": "string", "required": true},
{"name": "annex", "type": "string", "required": false},
{"name": "chapter", "type": "string", "required": false},
{"name": "regulation", "type": "string", "required": false},
{"name": "paragraph", "type": "string", "required": false},
{"name": "page", "type": "integer", "required": true}
],
"hierarchy_config": {
"levels": ["convention", "annex", "chapter", "regulation", "paragraph"],
"patterns": {
"convention": "(SOLAS|MARPOL|STCW|COLREG|LOADLINE|TONNAGE|SFV|STP|SAR)",
"annex": "ANNEX\\s+[IVX]+|ANNEX\\s+[0-9]+",
"chapter": "CHAPTER\\s+\\w+",
"regulation": "Regulation\\s+\\d+|Reg\\.\\s*\\d+",
"paragraph": "\\d+\\.\\d+(\\.\\d+)*"
},
"cascade_reset": true
},
"chunking_strategy": {
"type": "hierarchical",
"chunk_size": 800,
"overlap": 100,
"respect_boundaries": true
}
}
Schema Validation Rules
- schema_id pattern — Must match [a-z_]+_v[0-9]+ (lowercase with a version suffix)
- name length — Minimum 3 characters
- regex patterns — Must be valid regular expressions (tested at schema creation time)
- chunk_size — Must be at least 100 characters
- overlap — Must be less than chunk_size
- metadata fields — At least one field must be defined per schema
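A sketch of how these rules might be enforced at schema creation time (function and error messages are illustrative; key names follow the schema configuration example above):

```python
import re

def validate_schema(schema: dict) -> None:
    if not re.fullmatch(r"[a-z_]+_v[0-9]+", schema["schema_id"]):
        raise ValueError("schema_id must be lowercase with a version suffix")
    if len(schema["name"]) < 3:
        raise ValueError("name must be at least 3 characters")
    for pattern in schema["hierarchy_config"]["patterns"].values():
        re.compile(pattern)  # raises re.error for an invalid regex
    chunking = schema["chunking_strategy"]
    if chunking["chunk_size"] < 100:
        raise ValueError("chunk_size must be at least 100 characters")
    if chunking["overlap"] >= chunking["chunk_size"]:
        raise ValueError("overlap must be less than chunk_size")
    if not schema["metadata_fields"]:
        raise ValueError("at least one metadata field is required")
```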
Validation Steps
This section provides a comprehensive validation checklist for verifying that all MarChat components are functioning correctly. Follow these steps after initial installation or after making configuration changes.
1. Service Health
- GET /health returns all services green (qdrant: connected, ollama: connected, database: connected)
- GET /api/settings/health shows service connectivity status with version information
- Ollama responds with the correct model (gemma3:12b) when queried via /api/tags
- Qdrant has the correct vector dimension (1024) for all created collections
# Verify service health
curl http://localhost:8005/health
# Verify Ollama models
curl http://localhost:11434/api/tags | python -m json.tool
# Verify Qdrant collections
curl http://localhost:6333/collections | python -m json.tool
2. Document Processing
- Upload a PDF document → status changes from uploaded to indexing to indexed
- total_chunks is greater than 0 after indexing completes
- BM25 index is updated (data/bm25_index.pkl file modified timestamp changes)
- Qdrant collection has a matching point count (number of vectors equals total chunks)
# Upload and verify
curl -X POST http://localhost:8005/api/documents/upload \
-F "file=@SOLAS_Chapter_II-2.pdf" \
-F "collection_name=maritime_v1"
# Check document status
curl http://localhost:8005/api/documents/7
3. RAG Query Pipeline
- Semantic search returns relevant chunks based on meaning
- BM25 search returns keyword-matching chunks for exact terms (regulation numbers, abbreviations)
- RRF fusion combines both result sets, boosting documents that appear in both
- Cross-encoder reranking improves precision by filtering low-relevance candidates
- Generated answer includes proper citations with hierarchy metadata
- Response time is logged in the query history table
# Test RAG query
curl -X POST http://localhost:8005/api/rag/query \
-H "Content-Type: application/json" \
-d '{
"query": "What are the fire detection requirements for machinery spaces?",
"mode": "regulatory",
"collection_name": "maritime_v1",
"top_k": 5
}'
4. Multi-Mode Chat
- Each mode produces appropriately formatted responses matching its system prompt
- Regulatory mode includes full convention hierarchy citations (Convention > Chapter > Regulation > Para, Page)
- Medical mode references IMGS protocols and includes appropriate safety disclaimers
- General mode provides practical operational advice with source references
- Forms mode outputs valid JSON that can be parsed by the auto-fill engine
- Research mode uses academic tone with page-level citations and source attribution
5. Pre-Arrival Forms
- Template upload extracts the correct field count from the uploaded Excel or DOCX file
{{placeholder}}patterns are correctly detected in both Excel cells and DOCX paragraphs- Manual fill replaces all placeholders with the provided values
- Auto-fill produces valid JSON from natural language context
- Downloaded file has all placeholder values filled correctly
- Vessel profile data is correctly pre-populated into form fields
# Test template upload
curl -X POST http://localhost:8005/api/forms/templates/upload \
-F "file=@Rotterdam_PreArrival.xlsx" \
-F "name=Rotterdam Pre-Arrival"
# Test auto-fill
curl -X POST http://localhost:8005/api/forms/fill/auto \
-H "Content-Type: application/json" \
-d '{
"template_id": 3,
"context": "MV Pacific Star, IMO 9123456, arriving Rotterdam March 15th 2026",
"vessel_profile_id": 1
}'
6. Collection Management
- Create collection with schema → corresponding Qdrant collection is created with correct vector config (1024 dim, cosine distance)
- Upload document to specific collection → chunks are indexed in the correct Qdrant collection
- Query with collection_name parameter → search is scoped to only that collection
- Delete collection → removes both Qdrant vectors and all related database records (documents, chunks, search config)
All validation steps have been verified during Phase 1 (RAG + Multi-Mode Chat) and Phase 2 (Pre-Arrival Forms) development. The hybrid search pipeline consistently outperforms single-method retrieval on maritime regulation queries, with cross-encoder reranking providing measurable precision improvements.