MarChat Technical Documentation

Maritime shipboard AI assistant with RAG, multi-mode chat, and pre-arrival form automation

About MarChat
MarChat is a shipboard AI assistant that combines semantic search over maritime regulations with intelligent form filling for port documentation. It is designed to run locally, with no cloud dependency, making it suitable for use in environments with limited or no internet connectivity.

What is MarChat?

MarChat is a maritime AI assistant purpose-built for shipboard use. It provides Retrieval-Augmented Generation (RAG) over maritime regulations including SOLAS, MARPOL, STCW, and other IMO conventions. The system supports multi-mode chat with context-aware system prompts tailored to different operational scenarios, and includes a pre-arrival form auto-fill engine that populates port documentation from natural language context and vessel profile data.

All processing runs locally through Ollama, ensuring that sensitive vessel and operational data never leaves the shipboard network. The hybrid search pipeline combines semantic understanding with keyword matching, delivering precise results even for highly technical regulatory queries.

Hybrid RAG

  • Semantic search via Qdrant vector database
  • BM25 keyword matching for exact terms
  • Reciprocal Rank Fusion (RRF) merging
  • Cross-encoder reranking for precision

Multi-Mode Chat

  • Regulatory mode: IMO convention expert
  • Medical mode: IMGS maritime medical
  • General mode: Operations assistant
  • Forms mode: Document specialist
  • Research mode: Academic paper assistant

Smart Chunking

  • IMO convention hierarchy detection
  • Convention > Annex > Chapter > Regulation > Paragraph
  • Cascading metadata inheritance
  • Precise citation generation

Pre-Arrival Forms

  • Excel and DOCX template processing
  • {{placeholder}} pattern detection
  • LLM-assisted auto-fill from context
  • Vessel profile pre-population

Multi-Collection

  • Separate knowledge bases per domain
  • Maritime, medical, legal, research, technical
  • Schema-driven metadata validation
  • Independent search configurations

Local LLM

  • Ollama integration (gemma3:12b)
  • mxbai-embed-large embeddings (1024-dim)
  • No cloud dependency or API keys required
  • Data stays on the local machine

Architecture

Component Overview

MarChat follows a standard client-server architecture with a Next.js frontend communicating with a FastAPI backend. The backend orchestrates all AI operations including embedding generation, vector search, BM25 retrieval, cross-encoder reranking, and LLM inference. Persistent storage is split between Qdrant (vector embeddings and metadata) and SQLite (relational data such as documents, collections, queries, and form templates).

Component Technology Port Role
Frontend Next.js 14.2 3005 User interface, chat, document management, form filling
Backend FastAPI (Python 3.12) 8005 API server, RAG pipeline, LLM orchestration, form processing
Vector DB Qdrant 6333 Vector storage, semantic search, metadata filtering
LLM Runtime Ollama 11434 Local inference (gemma3:12b), embeddings (mxbai-embed-large)
Relational DB SQLite N/A Documents, collections, queries, form templates, vessel profiles

Data Flow: RAG Query

When a user submits a query, the system executes a multi-stage retrieval pipeline that combines semantic and keyword search, fuses the results, and reranks them before generating a response.

RAG Query Pipeline
  1. User query
  2. Embed the query with mxbai-embed-large (1024-dim vector)
  3. Parallel search:
     • Semantic search: query vector → Qdrant cosine similarity (top_k x 3)
     • BM25 search: tokenized query → BM25 ranking (top_k x 3)
  4. Reciprocal Rank Fusion (RRF): merge both result sets (sketched below)
  5. Cross-encoder reranking: score (query, candidate) pairs and filter by MIN_RERANK_SCORE = -2.0
  6. Select the top_k results as context
  7. Build prompt: mode-specific system prompt + context chunks + user query
  8. LLM generation: Ollama gemma3:12b → answer with citations
  9. Return the answer, retrieved chunks, and response time
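
Step 4 merges the two ranked lists without needing comparable raw scores. A minimal sketch of Reciprocal Rank Fusion, assuming the conventional constant k = 60 (MarChat's actual constant is not stated in this documentation):

# Minimal RRF sketch. The constant k = 60 is the conventional default,
# not a value confirmed by this documentation.
def reciprocal_rank_fusion(semantic_ids, bm25_ids, k=60):
    scores = {}
    for ranked in (semantic_ids, bm25_ids):
        for rank, chunk_id in enumerate(ranked, start=1):
            # Chunks appearing in both lists accumulate score from each
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# reciprocal_rank_fusion(["c3", "c1", "c7"], ["c1", "c9", "c3"])
# -> ["c1", "c3", "c9", "c7"]  (c1 and c3 rank highest: they appear in both lists)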

Data Flow: Form Auto-Fill

The form auto-fill pipeline processes uploaded templates, extracts fields, and uses the LLM to generate values from user-provided context.

Form Auto-Fill Pipeline
  1. Upload template (Excel or DOCX)
  2. Extract fields: scan for {{placeholder}} patterns, detect yellow-highlighted cells in Excel, and infer field types (date, number, boolean, text)
  3. User provides natural language context (voyage details, port info, vessel data)
  4. Build LLM prompt: field list + context + vessel profile
  5. LLM generates JSON: { "field_name": "value", ... }
  6. Fill template: replace {{placeholders}} with the generated values
  7. Save and return the filled document for download

Installation

Prerequisites

  • Docker & Docker Compose — for containerized deployment
  • Ollama — local LLM runtime with the following models:
    • gemma3:12b — language model for chat and form filling
    • mxbai-embed-large — embedding model (1024 dimensions)
  • Node.js 18+ — for frontend development mode
  • Python 3.12+ — for backend development mode

Docker Compose (Recommended)

git clone https://github.com/SL-Mar/marchat.git
cd marchat
docker compose up -d

Pre-Install Ollama Models

Before starting the application, ensure both required models are available in your Ollama instance:

ollama pull gemma3:12b
ollama pull mxbai-embed-large

Development Mode

For local development with hot-reload, use the launch script:

chmod +x launch.sh
./launch.sh

This starts the backend (FastAPI on port 8005) and frontend (Next.js on port 3005) in development mode with automatic reloading on file changes.

Environment Variables

Create a .env file in the project root with the following configuration:

# Core Services
QDRANT_URL=http://localhost:6333
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_MODEL=gemma3:12b
EMBEDDING_MODEL=mxbai-embed-large

# Database
DATABASE_URL=sqlite:///./data/marchat.db

# Optional: OpenAI fallback (leave empty for local-only)
OPENAI_API_KEY=
Important
Ollama must be running with both models pulled before starting the application. The backend will fail to initialize embeddings if mxbai-embed-large is not available, and chat/RAG queries will fail without gemma3:12b.

Verifying Installation

After starting the application, verify that all services are healthy:

# Check backend health endpoint
curl http://localhost:8005/health

# Expected response:
{
  "status": "healthy",
  "services": {
    "qdrant": "connected",
    "ollama": "connected",
    "database": "connected"
  }
}

Tech Stack

Backend

Technology Version Purpose
FastAPI 0.110+ Async API framework with automatic OpenAPI docs
Python 3.12 Runtime environment
SQLAlchemy 2.0 ORM for relational database operations
Pydantic 2.x Request/response validation and serialization
openpyxl 3.1+ Excel file reading and writing
python-docx 1.1+ DOCX file reading and writing
sentence-transformers 2.x Cross-encoder reranking model

Frontend

Technology Version Purpose
Next.js 14.2 React framework with server-side rendering
React 18.3 UI component library
Tailwind CSS 3.x Utility-first CSS framework
Axios 1.x HTTP client for API requests
react-markdown 9.x Markdown rendering in chat responses
react-pdf 7.x PDF preview in document viewer

Storage

Technology Purpose Details
Qdrant Vector database Stores embeddings and metadata, supports cosine similarity search
SQLite Relational database Documents, collections, queries, forms, vessel profiles
BM25 pickle index Keyword search index Persisted at data/bm25_index.pkl, rebuilt on indexing

AI Models

Model Provider Purpose Details
gemma3:12b Ollama Language model Chat, RAG answer generation, form auto-fill
mxbai-embed-large Ollama Embeddings 1024-dimensional vectors for semantic search
cross-encoder/ms-marco-MiniLM-L-6-v2 sentence-transformers Reranking Query-document relevance scoring for result reranking

BM25 Engine

The BM25 engine provides keyword-based retrieval as a complement to semantic vector search. It maintains a persistent inverted index that is incrementally updated as new documents are indexed.

Tokenization

The engine uses a simple tokenization strategy: text is converted to lowercase and split on whitespace and punctuation boundaries. This straightforward approach works well for maritime regulatory text, which contains many technical terms and abbreviations that benefit from exact matching.
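
As a sketch, the described strategy amounts to something like the following (the exact token pattern MarChat uses is an assumption here):

import re

def tokenize(text: str) -> list[str]:
    # Lowercase, then split on any non-word character, so whitespace
    # and punctuation both act as token boundaries
    return [tok for tok in re.split(r"\W+", text.lower()) if tok]

# tokenize("SOLAS Reg. 3-1, para 2.1")
# -> ['solas', 'reg', '3', '1', 'para', '2', '1']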

Persistence

The BM25 index is serialized using Python's pickle module and stored at data/bm25_index.pkl. This allows the index to persist across application restarts without requiring re-indexing of all documents.

Incremental Indexing

When new documents are indexed, the BM25 index is updated incrementally rather than rebuilt from scratch. New chunk texts and their corresponding IDs are appended to the existing index, and the BM25 statistics (term frequencies, document lengths) are recalculated.

Key Methods

Method Description
index_chunks(chunks, chunk_ids) Add new chunks to the BM25 index. Tokenizes each chunk and updates the inverted index and IDF statistics.
search(query, top_k) Tokenize the query and rank all indexed chunks by BM25 score. Returns the top_k results with scores and chunk IDs.
clear_index() Remove all entries from the index and delete the pickle file. Used when a collection is deleted or a full re-index is needed.
# BM25 index location
data/bm25_index.pkl

# Index is loaded at startup if the file exists
# Re-created automatically if missing or corrupted
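
Putting these pieces together, a minimal sketch of the engine's load/index/search cycle. The rank_bm25 package (BM25Okapi) is assumed as the scoring backend; the documentation does not name the actual implementation:

import os
import pickle
import re

from rank_bm25 import BM25Okapi  # assumed backend, see note above

INDEX_PATH = "data/bm25_index.pkl"

def tokenize(text):
    return [t for t in re.split(r"\W+", text.lower()) if t]

class BM25Engine:
    def __init__(self):
        self.texts, self.chunk_ids = [], []
        if os.path.exists(INDEX_PATH):
            # Load persisted chunks so no re-indexing is needed at startup
            with open(INDEX_PATH, "rb") as f:
                self.texts, self.chunk_ids = pickle.load(f)
        self._rebuild()

    def _rebuild(self):
        # Recalculate BM25 statistics (term frequencies, document lengths)
        corpus = [tokenize(t) for t in self.texts]
        self.bm25 = BM25Okapi(corpus) if corpus else None

    def index_chunks(self, chunks, chunk_ids):
        # Incremental indexing: append new chunks, then recompute stats
        self.texts.extend(chunks)
        self.chunk_ids.extend(chunk_ids)
        self._rebuild()
        with open(INDEX_PATH, "wb") as f:
            pickle.dump((self.texts, self.chunk_ids), f)

    def search(self, query, top_k=5):
        if self.bm25 is None:
            return []
        scores = self.bm25.get_scores(tokenize(query))
        ranked = sorted(zip(self.chunk_ids, scores),
                        key=lambda pair: pair[1], reverse=True)
        return ranked[:top_k]

    def clear_index(self):
        # Reset in-memory state and delete the pickle file
        self.texts, self.chunk_ids, self.bm25 = [], [], None
        if os.path.exists(INDEX_PATH):
            os.remove(INDEX_PATH)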

Smart Chunking

Maritime regulations follow a strict hierarchical structure defined by the International Maritime Organization (IMO). MarChat's smart chunking engine detects this hierarchy during document processing and preserves it as metadata on each chunk, enabling precise citations in RAG responses.

IMO Hierarchy Detection

The chunker uses regular expression patterns to detect hierarchy levels in the document text. As it processes each line, it maintains a cascading state machine that tracks the current position in the hierarchy.

# Hierarchy detection regex patterns

convention:  (SOLAS|MARPOL|STCW|COLREG|LOADLINE|TONNAGE|SFV|STP|SAR)
annex:       ANNEX\s+[IVX]+|ANNEX\s+[0-9]+
chapter:     CHAPTER\s+\w+
regulation:  Regulation\s+\d+|Reg\.\s*\d+
paragraph:   \d+\.\d+(\.\d+)*

Processing Steps

  1. Split PDF by pages — Extract text from each page of the uploaded PDF document.
  2. Process line-by-line — Scan each line for hierarchy markers using the regex patterns above.
  3. Detect convention from filename — If the filename contains a convention name (e.g., SOLAS_consolidated.pdf), set the convention metadata automatically.
  4. Maintain hierarchical state — Track the current position as: convention → annex → chapter → regulation → paragraph. Each new detection updates the corresponding level and all chunks inherit the full hierarchy path.
  5. Build chunks at threshold — Accumulate text until the chunk reaches the configured size (800 characters for maritime_v1), then split with a 100-character overlap to preserve context across chunk boundaries.
  6. Cascading reset — When a higher-level marker is detected (e.g., a new annex), all lower levels (chapter, regulation, paragraph) are reset. This prevents stale metadata from carrying over into a new section; see the sketch after this list.
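
A minimal sketch of the cascading state machine (page tracking and chunk assembly are omitted, and how the real chunker handles a line matching several levels at once is not specified here):

import re

LEVELS = ["convention", "annex", "chapter", "regulation", "paragraph"]
PATTERNS = {
    "convention": r"(SOLAS|MARPOL|STCW|COLREG|LOADLINE|TONNAGE|SFV|STP|SAR)",
    "annex": r"ANNEX\s+[IVX]+|ANNEX\s+[0-9]+",
    "chapter": r"CHAPTER\s+\w+",
    "regulation": r"Regulation\s+\d+|Reg\.\s*\d+",
    "paragraph": r"\d+\.\d+(?:\.\d+)*",
}

def update_state(state: dict, line: str) -> dict:
    for i, level in enumerate(LEVELS):
        match = re.search(PATTERNS[level], line)
        if match:
            state[level] = match.group(0)
            # Cascading reset: clear every level below the detected one
            for lower in LEVELS[i + 1:]:
                state[lower] = None
            break
    return state

# state = {level: None for level in LEVELS}
# update_state(state, "CHAPTER II-1")  # sets chapter, resets regulation/paragraph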

Metadata Output

Each chunk is stored with the following metadata fields:

Field Example Value Description
convention SOLAS IMO convention name
annex ANNEX I Annex number (Roman or Arabic)
chapter CHAPTER II-1 Chapter identifier
regulation Regulation 3-1 Regulation number
paragraph 2.1 Paragraph number (dotted notation)
page 145 Source page number in PDF
Citation Format
Hierarchical chunking preserves regulatory structure in metadata, enabling precise citations like "SOLAS > Chapter II-1 > Regulation 3-1 > Para. 2.1, Page 145" in RAG responses. This level of precision is critical for maritime regulatory compliance.

Cross-Encoder Reranking

After the Reciprocal Rank Fusion step produces a merged candidate set, the cross-encoder model provides a final precision pass by jointly scoring each (query, document) pair.

Model Details

Property Value
Model cross-encoder/ms-marco-MiniLM-L-6-v2
Input (query, candidate_text) pairs
Output Relevance score (range: approximately -10 to +10)
Threshold MIN_RERANK_SCORE = -2.0

How It Works

Unlike bi-encoder models (which encode query and document independently and compare their vectors), a cross-encoder processes the query and document text together through the same transformer pass. This allows the model to attend across both inputs simultaneously, capturing word-level interactions between the query and the document. The result is significantly more accurate relevance judgments, at the cost of being computationally more expensive (which is why it is only applied to the already-narrowed candidate set, not the entire corpus).

Filtering

After scoring, any candidate with a relevance score below -2.0 is removed from the results. This threshold was determined empirically to filter out genuinely irrelevant content while retaining borderline-relevant chunks that may contain useful context. The remaining candidates are sorted by score in descending order and truncated to top_k.
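
A minimal sketch of this scoring-and-filtering pass using the sentence-transformers CrossEncoder class (batching details and chunk metadata handling are omitted):

from sentence_transformers import CrossEncoder

MIN_RERANK_SCORE = -2.0
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_k: int = 5) -> list[tuple[str, float]]:
    # Jointly score every (query, candidate) pair in a single batch
    scores = reranker.predict([(query, text) for text in candidates])
    # Drop genuinely irrelevant candidates below the empirical threshold
    kept = [(t, float(s)) for t, s in zip(candidates, scores) if s >= MIN_RERANK_SCORE]
    # Sort by relevance and truncate to top_k
    return sorted(kept, key=lambda pair: pair[1], reverse=True)[:top_k]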

Multi-Mode Chat

MarChat supports five specialized chat modes, each with a tailored system prompt that controls the LLM's behavior, citation format, and response structure. The mode is selected per-query, allowing users to switch between different operational contexts within the same session.

Mode Description Citation Format
regulatory IMO convention expert. Provides precise regulatory answers with full hierarchy citations. Convention > Chapter > Regulation > Paragraph, Page
medical Maritime medical assistant based on International Medical Guide for Ships (IMGS). TMAS protocol references, IMGS section numbers
general General operations assistant for practical maritime questions. Practical advice format with source references
forms Pre-arrival document specialist that outputs structured JSON for form filling. JSON field mapping: { "field": "value" }
research Academic paper assistant for scholarly analysis and literature review. Source, page, academic tone with proper attribution

Example: Regulatory Mode Query

# Request
POST /api/rag/query
{
  "query": "What are the requirements for fire detection in machinery spaces?",
  "mode": "regulatory",
  "collection_name": "maritime_v1",
  "top_k": 5
}

# Response includes citations like:
# "According to SOLAS > Chapter II-2 > Regulation 7 > Para. 2.1 (Page 203),
#  fixed fire detection and fire alarm systems shall be provided
#  in machinery spaces of category A..."

System Prompts

Each chat mode uses a carefully crafted system prompt that defines the LLM's persona, response structure, citation requirements, and content guidelines. The system prompt is prepended to the user's query along with the retrieved context chunks.
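
A minimal sketch of that assembly step. The prompt template and the SYSTEM_PROMPTS mapping below are illustrative assumptions, not MarChat's actual internals:

SYSTEM_PROMPTS = {
    "regulatory": "You are a maritime regulatory expert...",  # abbreviated
    # ...one entry per mode
}

def build_messages(mode: str, chunks: list[dict], query: str) -> list[dict]:
    # Label each retrieved chunk with its citation metadata so the
    # model can reference it in the answer
    context = "\n\n".join(
        f"[{c['metadata'].get('convention', '')}, p.{c['metadata'].get('page', '?')}] {c['text']}"
        for c in chunks
    )
    return [
        {"role": "system", "content": SYSTEM_PROMPTS[mode]},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
    ]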

Prompt Structure

Every system prompt follows a common structure:

  1. Role definition — who the assistant is and its area of expertise
  2. Citation requirements — how to reference source material
  3. Response format — expected structure, tone, and length
  4. Content guidelines — what to include, what to avoid
  5. Fallback behavior — how to handle queries outside the knowledge base

Regulatory Mode Prompt (Example)

You are a maritime regulatory expert specializing in IMO conventions.
Answer questions using ONLY the provided context from maritime regulations.

CITATION FORMAT (MANDATORY):
- Always cite the specific regulation using the full hierarchy:
  Convention > Chapter > Regulation > Paragraph, Page
- Example: "SOLAS > Chapter II-1 > Regulation 3-1 > Para. 2.1, Page 145"

RESPONSE STRUCTURE:
1. Direct answer to the question
2. Relevant regulatory text with citations
3. Additional context or related regulations if applicable

IMPORTANT:
- If the context does not contain sufficient information, state this clearly
- Do not fabricate or assume regulatory content
- Use exact text from regulations when possible

Forms Mode Prompt (Example)

You are a pre-arrival documentation specialist.
Given the form fields and context provided, generate a JSON object
mapping each field name to its appropriate value.

OUTPUT FORMAT:
Return ONLY a valid JSON object with field names as keys:
{
  "vessel_name": "MV Pacific Star",
  "imo_number": "9123456",
  "eta": "2026-03-15T08:00:00",
  "last_port": "Singapore"
}

IMPORTANT:
- Use the vessel profile data when available
- Infer values from the natural language context
- Use ISO 8601 format for dates
- Leave unknown fields as empty strings

Mode-Specific Behaviors

  • Regulatory mode enforces hierarchical citations and uses exact regulatory text whenever available in the context.
  • Medical mode includes safety disclaimers and references to TMAS (Telemedical Assistance Service) protocols for remote medical guidance.
  • General mode provides practical, actionable advice while still referencing source material.
  • Forms mode outputs strictly valid JSON, mapping field names to inferred values from the provided context.
  • Research mode adopts an academic tone, tracks sources carefully, and provides page-level citations.

LLM Providers

MarChat uses a factory pattern to abstract LLM provider details, allowing the system to switch between different backends without changing application logic.

Provider Factory

# Factory function
def get_llm_provider(provider_type: str, **kwargs) -> LLMProvider:
    """
    Returns an LLM provider instance based on the specified type.

    Args:
        provider_type: "ollama" or "openai"
        **kwargs: Provider-specific configuration

    Returns:
        LLMProvider instance with generate() and embed() methods
    """
    if provider_type == "ollama":
        return OllamaProvider(**kwargs)
    elif provider_type == "openai":
        return OpenAIProvider(**kwargs)
    else:
        raise ValueError(f"Unknown provider: {provider_type}")
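
Illustrative usage (the keyword arguments beyond provider_type and the call shapes are assumptions; the docstring above only guarantees generate() and embed() methods):

# Hypothetical call shapes, for illustration only
llm = get_llm_provider("ollama", model="gemma3:12b", temperature=0.3)
answer = llm.generate([{"role": "user", "content": "Define gross tonnage."}])
vector = llm.embed("machinery spaces")  # 1024-dim embedding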

Ollama Provider (Default)

The Ollama provider communicates with a locally running Ollama instance via its HTTP API. It is the default and recommended provider for MarChat, as it keeps all data on the local machine.

Parameter Default Description
model gemma3:12b Chat and generation model
temperature 0.3 Controls randomness (lower = more deterministic)
num_predict 2048 Maximum tokens in response
repeat_penalty 1.1 Penalizes token repetition
top_k 40 Top-K sampling parameter
top_p 0.9 Nucleus sampling parameter
stop [] Stop token sequences
# Ollama API call for chat generation
POST http://localhost:11434/api/chat
{
  "model": "gemma3:12b",
  "messages": [
    {"role": "system", "content": "...system prompt..."},
    {"role": "user", "content": "...context + query..."}
  ],
  "options": {
    "temperature": 0.3,
    "num_predict": 2048,
    "repeat_penalty": 1.1,
    "top_k": 40,
    "top_p": 0.9
  },
  "stream": false
}

OpenAI Provider (Optional)

The OpenAI provider is available as a fallback for environments where Ollama is not available. It requires an OPENAI_API_KEY environment variable and uses the openai.ChatCompletion API. Note that using OpenAI sends data to external servers, which may not be appropriate for sensitive vessel information.

Template Processing

The template processing engine handles Excel (.xlsx) and DOCX files used for pre-arrival port documentation. It detects form fields using placeholder patterns and highlighted cells, then provides a structured representation of the template's fields for filling.

Field Detection

The engine searches for two types of field markers in uploaded templates:

  • Placeholder patterns — {{field_name}} markers in cell values or document text
  • Highlighted cells — Yellow-highlighted cells in Excel spreadsheets (common in port authority templates)

Field Type Inference

Field types are automatically inferred from the field name using keyword matching:

Inferred Type Keywords in Field Name Example Fields
date date, eta, etd, arrival, departure, dob {{eta}}, {{date_of_birth}}, {{departure_date}}
number tonnage, draft, length, breadth, crew_count {{gross_tonnage}}, {{forward_draft}}
boolean has_, is_, valid_ {{has_dangerous_goods}}, {{is_chartered}}
text (default for all others) {{vessel_name}}, {{last_port}}, {{agent_name}}

FormField Data Model

from dataclasses import dataclass
from typing import Optional

@dataclass
class FormField:
    name: str                     # Field identifier (e.g., "vessel_name")
    field_type: str               # text, date, number, boolean, list
    location: str                 # Cell reference (Excel) or placeholder position (DOCX)
    required: bool                # Whether the field is mandatory
    default_value: Optional[str]  # Pre-populated value if available
    description: str              # Human-readable field description

Fill Process

  1. Load template — Open the uploaded Excel or DOCX file using openpyxl or python-docx.
  2. Find placeholders — Scan all cells (Excel) or paragraphs (DOCX) for {{...}} patterns.
  3. Replace with values — Substitute each placeholder with the corresponding value from the fill data.
  4. Save filled document — Write the filled template to disk and make it available for download.
# Example: Excel placeholder detection
import re
from openpyxl import load_workbook

workbook = load_workbook(template_path)
fields = []
for sheet in workbook.sheetnames:
    ws = workbook[sheet]
    for row in ws.iter_rows():
        for cell in row:
            if cell.value and "{{" in str(cell.value):
                # Extract every field name from {{field_name}} markers
                for field_name in re.findall(r'\{\{(\w+)\}\}', str(cell.value)):
                    fields.append(FormField(
                        name=field_name,
                        field_type=infer_type(field_name),
                        location=f"{sheet}!{cell.coordinate}",
                        required=True,
                        default_value=None,
                        description=field_name.replace("_", " ").title()
                    ))
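
The replacement step (step 3 of the fill process above) can be sketched the same way; this minimal version substitutes values in place and leaves cell formatting untouched:

# Minimal Excel fill sketch: replace {{placeholders}} cell by cell
def fill_workbook(workbook, field_values: dict) -> None:
    for sheet in workbook.sheetnames:
        ws = workbook[sheet]
        for row in ws.iter_rows():
            for cell in row:
                if cell.value and "{{" in str(cell.value):
                    text = str(cell.value)
                    for name, value in field_values.items():
                        text = text.replace("{{" + name + "}}", str(value))
                    cell.value = text

# fill_workbook(workbook, {"vessel_name": "MV Pacific Star"})
# workbook.save("filled.xlsx")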

Auto-Fill Engine

The auto-fill engine uses the LLM to intelligently populate form fields from unstructured natural language context. Instead of manually filling each field, the user provides a free-text description of the voyage, vessel, and port information, and the LLM extracts the relevant values.

How It Works

  1. Extract field list — The system retrieves all detected fields from the uploaded template.
  2. Build prompt — A structured prompt is constructed containing the field list, the user's natural language context, and optionally the vessel profile data.
  3. LLM generation — The forms mode system prompt instructs the LLM to output a JSON object mapping each field name to its inferred value.
  4. Parse response — The engine handles markdown code block wrapping (strips ```json delimiters if present) and parses the JSON; see the sketch after this list.
  5. Fill template — The parsed values are applied to the template placeholders.
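
A minimal sketch of the fallback parsing in step 4 (the engine's actual error handling is not shown in this documentation):

import json

def parse_llm_json(raw: str) -> dict:
    text = raw.strip()
    if text.startswith("```"):
        # Strip a markdown fence such as ```json ... ```
        text = text.split("\n", 1)[1]
        text = text.rsplit("```", 1)[0]
    return json.loads(text)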

Example Context Input

We are MV Pacific Star, IMO 9123456, Panama flag, arriving at
Rotterdam on March 15th 2026 at approximately 0800 local time.
Coming from Singapore via Suez Canal. Gross tonnage 45,230.
We have 22 crew members on board, all healthy. No dangerous
goods. Last port state control inspection was in Singapore on
February 28th, no deficiencies found. Agent is MaritimePort BV.

Example LLM Output

{
  "vessel_name": "MV Pacific Star",
  "imo_number": "9123456",
  "flag_state": "Panama",
  "port_of_arrival": "Rotterdam",
  "eta": "2026-03-15T08:00:00",
  "last_port": "Singapore",
  "gross_tonnage": "45230",
  "crew_count": "22",
  "has_dangerous_goods": "No",
  "last_psc_inspection_port": "Singapore",
  "last_psc_inspection_date": "2026-02-28",
  "psc_deficiencies": "None",
  "agent_name": "MaritimePort BV"
}

Vessel Profile Pre-Population

If a vessel profile is associated with the fill request, its static fields (vessel name, IMO number, flag state, call sign, MMSI, tonnage, etc.) are automatically included in the prompt context. This reduces the amount of information the user needs to provide and ensures consistency across multiple form fills.

JSON Output Handling
The auto-fill engine uses the forms mode system prompt to ensure structured JSON output. It also includes fallback parsing that handles cases where the LLM wraps the JSON in markdown code blocks (```json ... ```), which is a common behavior for instruction-following models.

Vessel Profiles

Vessel profiles store reusable vessel information that can be pre-populated into forms, eliminating the need to re-enter static data for every port call.

Profile Fields

Field Type Description
vessel_name string Full vessel name (e.g., "MV Pacific Star")
imo_number string 7-digit IMO identification number
flag_state string Flag state / country of registration
call_sign string Radio call sign
mmsi string Maritime Mobile Service Identity (9 digits)
gross_tonnage float Gross tonnage (GT)
net_tonnage float Net tonnage (NT)
deadweight float Deadweight tonnage (DWT)
vessel_type string Type of vessel (bulk carrier, tanker, container, etc.)
year_built integer Year of construction
owner string Registered owner
operator string Ship operator / manager
classification_society string Classification society (Lloyd's, DNV, BV, etc.)
extra_fields_json JSON Extensible key-value store for additional vessel-specific data

The extra_fields_json column allows storing any additional vessel-specific data without schema changes. This is useful for fields that vary by vessel type, flag state requirements, or specific port authority needs.
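
For example, a tanker's profile might carry entries like the following (illustrative keys and values only):

{
  "inert_gas_system": "fitted",
  "p_and_i_club": "Gard",
  "cargo_pump_capacity_m3h": "3000"
}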

RAG API

Endpoints for executing RAG queries against indexed document collections.

POST /api/rag/query

Execute a RAG query against a specified collection. Runs the full hybrid search pipeline (semantic + BM25 + RRF + cross-encoder reranking) and generates an LLM response with citations.

Request Body

{
  "query": "What are the requirements for fire detection in machinery spaces?",
  "mode": "regulatory",
  "collection_name": "maritime_v1",
  "top_k": 5
}
Parameter Type Required Description
query string Yes The user's question or search query
mode string No Chat mode: regulatory, medical, general, forms, research. Default: regulatory
collection_name string No Target collection for search. Default: first available collection
top_k integer No Number of context chunks to retrieve. Default: 5

Response

{
  "answer": "According to SOLAS > Chapter II-2 > Regulation 7 > Para. 2.1 (Page 203), fixed fire detection and fire alarm systems shall be provided in machinery spaces of category A. The system must be capable of rapidly detecting the onset of fire...",
  "retrieved_chunks": [
    {
      "text": "2.1 A fixed fire detection and fire alarm system...",
      "metadata": {
        "convention": "SOLAS",
        "chapter": "CHAPTER II-2",
        "regulation": "Regulation 7",
        "paragraph": "2.1",
        "page": 203
      },
      "score": 0.847
    }
  ],
  "total_chunks_retrieved": 5,
  "response_time_ms": 2340,
  "mode": "regulatory"
}
GET /api/rag/history?limit=50

Retrieve the query history with answers, modes, and response times. Useful for reviewing past queries and analyzing system performance.

Query Parameters

Parameter Type Required Description
limit integer No Maximum number of history entries to return. Default: 50

Response

[
  {
    "id": 42,
    "query_text": "Fire detection in machinery spaces",
    "answer_text": "According to SOLAS...",
    "mode": "regulatory",
    "response_time_ms": 2340,
    "created_at": "2026-02-07T14:30:00Z"
  }
]

Documents API

Endpoints for uploading, indexing, and managing PDF documents within collections.

POST /api/documents/upload

Upload a PDF document to a collection. Optionally triggers automatic indexing (chunking + embedding + vector storage).

Request (multipart/form-data)

Field Type Required Description
file file Yes PDF file to upload
document_type string No Type of document (convention, circular, guideline)
collection_name string No Target collection name. Default: default collection
auto_index boolean No Automatically index after upload. Default: true

Response

{
  "id": 7,
  "title": "SOLAS_Consolidated_2024",
  "filename": "SOLAS_Consolidated_2024.pdf",
  "status": "indexing",
  "collection_id": 1,
  "total_chunks": 0,
  "created_at": "2026-02-07T10:15:00Z"
}
POST /api/documents/{id}/index

Manually trigger indexing for an uploaded document. Processes the PDF through smart chunking, generates embeddings, stores vectors in Qdrant, and updates the BM25 index.

GET /api/documents

List all uploaded documents with their status, collection assignment, and chunk count.

GET /api/documents/{id}

Retrieve details for a specific document including metadata, chunk count, and indexing status.

DELETE /api/documents/{id}

Delete a document and all its associated chunks from both Qdrant and the BM25 index.

GET /api/documents/{id}/preview

Retrieve a preview of the document's first few pages. Returns extracted text for display in the frontend document viewer.

Collections API

Endpoints for managing document collections. Each collection maps to a Qdrant collection and has its own schema, search configuration, and document set.

GET /api/collections

List all collections with document counts and schema information.

Response

[
  {
    "id": 1,
    "name": "Maritime Regulations",
    "description": "SOLAS, MARPOL, STCW and other IMO conventions",
    "schema_id": "maritime_v1",
    "qdrant_collection_name": "maritime_v1",
    "document_count": 12,
    "created_at": "2026-01-15T09:00:00Z"
  }
]
POST /api/collections

Create a new collection with a specified schema. This creates the corresponding Qdrant collection with the correct vector dimension (1024) and distance metric (cosine).

Request Body

{
  "name": "Medical Guidelines",
  "description": "IMGS and maritime medical references",
  "schema_id": "medical_v1"
}
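
Behind this endpoint, creating the vector store amounts to something like the following qdrant-client sketch (the backend's actual code is not shown here):

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

client = QdrantClient(url="http://localhost:6333")
client.create_collection(
    collection_name="medical_v1",  # mirrors the collection's schema_id
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
)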
GET /api/collections/{id}

Retrieve details for a specific collection including its schema, search configuration, and associated documents.

DELETE /api/collections/{id}

Delete a collection and all its associated documents, chunks, and Qdrant vectors.

GET /api/collections/schemas/list

List all available collection schemas with their metadata field definitions and chunking configurations.

Available Schemas

Schema ID Description Optimized For
maritime_v1 IMO conventions and maritime regulations Hierarchical regulatory text (SOLAS, MARPOL, STCW)
legal_v1 Legal documents and contracts Clause-based legal text with section references
medical_v1 Medical guidelines and protocols IMGS, medical procedures, treatment guidelines
technical_v1 Technical manuals and documentation Equipment manuals, maintenance procedures, specifications

Chat API

Endpoints for the multi-mode chat system. Chat sessions maintain conversation history for context continuity.

POST /api/chat/message

Send a message in a chat session. The system applies the specified mode's system prompt and optionally retrieves RAG context before generating a response.

Request Body

{
  "session_id": "sess_abc123",
  "message": "What are the minimum rest hour requirements for watchkeepers?",
  "mode": "regulatory",
  "use_rag": true,
  "collection_name": "maritime_v1"
}
Parameter Type Required Description
session_id string No Chat session identifier. Auto-generated if not provided.
message string Yes User's message text
mode string No Chat mode. Default: general
use_rag boolean No Whether to retrieve context from the document collection. Default: true
collection_name string No Collection to search for RAG context. Required if use_rag is true.

Response

{
  "session_id": "sess_abc123",
  "response": "According to STCW > Section A-VIII/1 > Para. 2 (Page 312), the minimum rest period for seafarers assigned to watchkeeping duties is:\n\n1. A minimum of 10 hours of rest in any 24-hour period\n2. A minimum of 77 hours in any 7-day period\n\nThe rest period may be divided into no more than two periods, one of which shall be at least 6 hours in length...",
  "mode": "regulatory",
  "sources": [
    {
      "convention": "STCW",
      "chapter": "Section A-VIII/1",
      "paragraph": "2",
      "page": 312
    }
  ]
}
GET /api/chat/history/{session_id}

Retrieve the full conversation history for a chat session, including both user messages and assistant responses.

DELETE /api/chat/history/{session_id}

Delete all messages in a chat session.

GET /api/chat/sessions

List all active chat sessions with their most recent message and mode.

Forms API

Endpoints for managing form templates, filling forms (manually or with LLM assistance), and managing vessel profiles.

Template Management

POST /api/forms/templates/upload

Upload a form template (Excel or DOCX). The system automatically detects {{placeholder}} fields and infers their types.

Request (multipart/form-data)

Field Type Required Description
file file Yes Excel (.xlsx) or Word (.docx) template file
name string No Template display name. Defaults to filename.

Response

{
  "id": 3,
  "name": "Rotterdam Pre-Arrival Form",
  "file_type": "xlsx",
  "field_count": 28,
  "fields": [
    {"name": "vessel_name", "field_type": "text", "location": "Sheet1!B3"},
    {"name": "imo_number", "field_type": "text", "location": "Sheet1!B4"},
    {"name": "eta", "field_type": "date", "location": "Sheet1!B7"},
    {"name": "gross_tonnage", "field_type": "number", "location": "Sheet1!B10"},
    {"name": "has_dangerous_goods", "field_type": "boolean", "location": "Sheet1!B15"}
  ],
  "created_at": "2026-02-07T11:00:00Z"
}
GET /api/forms/templates

List all uploaded form templates with field counts and file types.

GET /api/forms/templates/{id}

Retrieve details for a specific template including its full field list.

DELETE /api/forms/templates/{id}

Delete a form template and its associated file.

Form Filling

POST /api/forms/fill

Fill a template with manually provided field values. Replaces {{placeholder}} patterns with the specified values.

Request Body

{
  "template_id": 3,
  "field_values": {
    "vessel_name": "MV Pacific Star",
    "imo_number": "9123456",
    "eta": "2026-03-15T08:00:00",
    "gross_tonnage": "45230",
    "has_dangerous_goods": "No"
  },
  "vessel_profile_id": 1
}
POST /api/forms/fill/auto

Fill a template using LLM-assisted auto-fill. The LLM extracts field values from the provided natural language context and optional vessel profile.

Request Body

{
  "template_id": 3,
  "context": "We are MV Pacific Star, IMO 9123456, Panama flag, arriving at Rotterdam on March 15th 2026 at approximately 0800 local time. Coming from Singapore via Suez Canal. Gross tonnage 45,230. 22 crew members, all healthy. No dangerous goods.",
  "vessel_profile_id": 1
}

Response

{
  "id": 15,
  "template_id": 3,
  "vessel_profile_id": 1,
  "field_values": {
    "vessel_name": "MV Pacific Star",
    "imo_number": "9123456",
    "flag_state": "Panama",
    "port_of_arrival": "Rotterdam",
    "eta": "2026-03-15T08:00:00",
    "last_port": "Singapore",
    "gross_tonnage": "45230",
    "crew_count": "22",
    "has_dangerous_goods": "No"
  },
  "fields_filled": 9,
  "fields_total": 28,
  "created_at": "2026-02-07T11:30:00Z"
}
GET /api/forms/filled

List all filled forms with their template association and fill completion percentage.

GET /api/forms/filled/{id}/download

Download the filled form document (Excel or DOCX) with all placeholders replaced by their values.

Vessel Profile Management

POST /api/forms/vessels

Create a new vessel profile with static vessel information for form pre-population.

Request Body

{
  "vessel_name": "MV Pacific Star",
  "imo_number": "9123456",
  "flag_state": "Panama",
  "call_sign": "3FXY7",
  "mmsi": "354123456",
  "gross_tonnage": 45230,
  "net_tonnage": 22615,
  "deadweight": 52000,
  "vessel_type": "Bulk Carrier",
  "year_built": 2018,
  "owner": "Pacific Shipping Ltd",
  "operator": "StarBulk Management",
  "classification_society": "Lloyd's Register"
}
GET /api/forms/vessels

List all vessel profiles.

GET /api/forms/vessels/{id}

Retrieve a specific vessel profile with all fields including extra_fields_json.

Database Schema

MarChat uses SQLite for relational data storage with 11 tables organized into RAG-related and form-related groups. The database file is located at data/marchat.db.

RAG Tables

DocumentSchema

Defines the metadata structure and chunking strategy for a collection type.

Column Type Description
id INTEGER PK Auto-increment primary key
schema_id VARCHAR UNIQUE Schema identifier (e.g., "maritime_v1")
name VARCHAR Human-readable schema name
metadata_fields_json TEXT JSON array of metadata field definitions
hierarchy_config_json TEXT JSON object defining hierarchy detection rules
chunking_strategy_json TEXT JSON object with chunk_size, overlap, and strategy type

Collection

Represents a document collection backed by a Qdrant vector collection.

Column Type Description
id INTEGER PK Auto-increment primary key
name VARCHAR Collection display name
description TEXT Collection description
schema_id VARCHAR FK References DocumentSchema.schema_id
qdrant_collection_name VARCHAR Corresponding Qdrant collection name
document_count INTEGER Number of documents in this collection
created_at DATETIME Creation timestamp

CollectionSearchConfig

Per-collection search configuration for the hybrid pipeline.

Column Type Description
id INTEGER PK Auto-increment primary key
collection_id INTEGER FK References Collection.id
use_hybrid BOOLEAN Enable hybrid search (semantic + BM25). Default: true
semantic_weight FLOAT Weight for semantic search results in fusion
keyword_weight FLOAT Weight for BM25 keyword results in fusion
top_k INTEGER Number of results to return. Default: 5
use_reranking BOOLEAN Enable cross-encoder reranking. Default: true

Document

Represents an uploaded PDF document.

Column Type Description
id INTEGER PK Auto-increment primary key
title VARCHAR Document title (derived from filename)
filename VARCHAR Original uploaded filename
file_path VARCHAR Path to stored file on disk
collection_id INTEGER FK References Collection.id
status VARCHAR Processing status: uploaded, indexing, indexed, error
total_chunks INTEGER Number of chunks generated from this document
created_at DATETIME Upload timestamp

Query

Stores RAG query history for analytics and debugging.

Column Type Description
id INTEGER PK Auto-increment primary key
query_text TEXT The user's original query
answer_text TEXT The LLM-generated answer
mode VARCHAR Chat mode used for this query
response_time_ms INTEGER Total response time in milliseconds
created_at DATETIME Query timestamp

Form Tables

VesselProfile

Stores reusable vessel information for form pre-population.

Column Type Description
id INTEGER PK Auto-increment primary key
vessel_name VARCHAR Full vessel name
imo_number VARCHAR UNIQUE 7-digit IMO number
flag_state VARCHAR Flag state
call_sign VARCHAR Radio call sign
mmsi VARCHAR MMSI number
gross_tonnage FLOAT Gross tonnage (GT)
net_tonnage FLOAT Net tonnage (NT)
deadweight FLOAT Deadweight tonnage (DWT)
vessel_type VARCHAR Type of vessel
year_built INTEGER Year of construction
owner VARCHAR Registered owner
operator VARCHAR Ship operator
classification_society VARCHAR Classification society
extra_fields_json TEXT Extensible JSON key-value store

FormTemplate

Stores uploaded form templates with their detected fields.

Column Type Description
id INTEGER PK Auto-increment primary key
name VARCHAR Template display name
file_path VARCHAR Path to stored template file
file_type VARCHAR File type: xlsx or docx
fields_json TEXT JSON array of detected FormField objects
created_at DATETIME Upload timestamp

FilledForm

Stores completed form instances with their field values.

Column Type Description
id INTEGER PK Auto-increment primary key
template_id INTEGER FK References FormTemplate.id
vessel_profile_id INTEGER FK References VesselProfile.id (optional)
field_values_json TEXT JSON object mapping field names to filled values
file_path VARCHAR Path to the filled output file
created_at DATETIME Fill timestamp

Settings_DB

Key-value store for application settings.

Column Type Description
key VARCHAR PK Setting key (e.g., "default_model", "embedding_model")
value TEXT Setting value

Collection Schemas

Collection schemas define how documents are chunked, what metadata fields are extracted, and how the hierarchy detection works for each collection type. MarChat ships with four built-in schemas optimized for different document types.

System Schemas

Schema Metadata Fields Chunking Strategy Chunk Size
maritime_v1 convention, annex, chapter, regulation, paragraph Hierarchical 800 chars
legal_v1 document_type, section, clause Paragraph-based 600 chars
medical_v1 guideline_type, specialty, section Sentence-based 600 chars
technical_v1 doc_type, chapter, section Hierarchical 700 chars

Schema Configuration Example

# maritime_v1 schema definition
{
  "schema_id": "maritime_v1",
  "name": "Maritime Regulations",
  "metadata_fields": [
    {"name": "convention", "type": "string", "required": true},
    {"name": "annex", "type": "string", "required": false},
    {"name": "chapter", "type": "string", "required": false},
    {"name": "regulation", "type": "string", "required": false},
    {"name": "paragraph", "type": "string", "required": false},
    {"name": "page", "type": "integer", "required": true}
  ],
  "hierarchy_config": {
    "levels": ["convention", "annex", "chapter", "regulation", "paragraph"],
    "patterns": {
      "convention": "(SOLAS|MARPOL|STCW|COLREG|LOADLINE|TONNAGE|SFV|STP|SAR)",
      "annex": "ANNEX\\s+[IVX]+|ANNEX\\s+[0-9]+",
      "chapter": "CHAPTER\\s+\\w+",
      "regulation": "Regulation\\s+\\d+|Reg\\.\\s*\\d+",
      "paragraph": "\\d+\\.\\d+(\\.\\d+)*"
    },
    "cascade_reset": true
  },
  "chunking_strategy": {
    "type": "hierarchical",
    "chunk_size": 800,
    "overlap": 100,
    "respect_boundaries": true
  }
}

Schema Validation Rules

  • schema_id pattern — Must match [a-z_]+_v[0-9]+ (lowercase with version suffix)
  • name length — Minimum 3 characters
  • regex patterns — Must be valid regular expressions (tested at schema creation time)
  • chunk_size — Must be at least 100 characters
  • overlap — Must be less than chunk_size
  • metadata fields — At least one field must be defined per schema (a minimal validator sketch follows)
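
A minimal validator sketch enforcing these rules, assuming the schema dictionary layout shown in the configuration example above (the backend's actual validation code is not shown):

import re

def validate_schema(schema: dict) -> list[str]:
    errors = []
    if not re.fullmatch(r"[a-z_]+_v[0-9]+", schema["schema_id"]):
        errors.append("schema_id must match [a-z_]+_v[0-9]+")
    if len(schema["name"]) < 3:
        errors.append("name must be at least 3 characters")
    for level, pattern in schema["hierarchy_config"]["patterns"].items():
        try:
            re.compile(pattern)  # regexes are tested at schema creation time
        except re.error:
            errors.append(f"invalid regex for hierarchy level '{level}'")
    chunking = schema["chunking_strategy"]
    if chunking["chunk_size"] < 100:
        errors.append("chunk_size must be at least 100 characters")
    if chunking["overlap"] >= chunking["chunk_size"]:
        errors.append("overlap must be less than chunk_size")
    if not schema["metadata_fields"]:
        errors.append("at least one metadata field must be defined")
    return errors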

Validation Steps

This section provides a comprehensive validation checklist for verifying that all MarChat components are functioning correctly. Follow these steps after initial installation or after making configuration changes.

1. Service Health

  • GET /health returns all services green (qdrant: connected, ollama: connected, database: connected)
  • GET /api/settings/health shows service connectivity status with version information
  • Ollama responds with the correct model (gemma3:12b) when queried via /api/tags
  • Qdrant has the correct vector dimension (1024) for all created collections
# Verify service health
curl http://localhost:8005/health

# Verify Ollama models
curl http://localhost:11434/api/tags | python -m json.tool

# Verify Qdrant collections
curl http://localhost:6333/collections | python -m json.tool

2. Document Processing

  • Upload a PDF document → status changes from uploaded to indexing to indexed
  • total_chunks is greater than 0 after indexing completes
  • BM25 index is updated (data/bm25_index.pkl file modified timestamp changes)
  • Qdrant collection has a matching point count (number of vectors equals total chunks)
# Upload and verify
curl -X POST http://localhost:8005/api/documents/upload \
  -F "file=@SOLAS_Chapter_II-2.pdf" \
  -F "collection_name=maritime_v1"

# Check document status
curl http://localhost:8005/api/documents/7

3. RAG Query Pipeline

  • Semantic search returns relevant chunks based on meaning
  • BM25 search returns keyword-matching chunks for exact terms (regulation numbers, abbreviations)
  • RRF fusion combines both result sets, boosting documents that appear in both
  • Cross-encoder reranking improves precision by filtering low-relevance candidates
  • Generated answer includes proper citations with hierarchy metadata
  • Response time is logged in the query history table
# Test RAG query
curl -X POST http://localhost:8005/api/rag/query \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What are the fire detection requirements for machinery spaces?",
    "mode": "regulatory",
    "collection_name": "maritime_v1",
    "top_k": 5
  }'

4. Multi-Mode Chat

  • Each mode produces appropriately formatted responses matching its system prompt
  • Regulatory mode includes full convention hierarchy citations (Convention > Chapter > Regulation > Para, Page)
  • Medical mode references IMGS protocols and includes appropriate safety disclaimers
  • General mode provides practical operational advice with source references
  • Forms mode outputs valid JSON that can be parsed by the auto-fill engine
  • Research mode uses academic tone with page-level citations and source attribution

5. Pre-Arrival Forms

  • Template upload extracts the correct field count from the uploaded Excel or DOCX file
  • {{placeholder}} patterns are correctly detected in both Excel cells and DOCX paragraphs
  • Manual fill replaces all placeholders with the provided values
  • Auto-fill produces valid JSON from natural language context
  • Downloaded file has all placeholder values filled correctly
  • Vessel profile data is correctly pre-populated into form fields
# Test template upload
curl -X POST http://localhost:8005/api/forms/templates/upload \
  -F "file=@Rotterdam_PreArrival.xlsx" \
  -F "name=Rotterdam Pre-Arrival"

# Test auto-fill
curl -X POST http://localhost:8005/api/forms/fill/auto \
  -H "Content-Type: application/json" \
  -d '{
    "template_id": 3,
    "context": "MV Pacific Star, IMO 9123456, arriving Rotterdam March 15th 2026",
    "vessel_profile_id": 1
  }'

6. Collection Management

  • Create collection with schema → corresponding Qdrant collection is created with correct vector config (1024 dim, cosine distance)
  • Upload document to specific collection → chunks are indexed in the correct Qdrant collection
  • Query with collection_name parameter → search is scoped to only that collection
  • Delete collection → removes both Qdrant vectors and all related database records (documents, chunks, search config)
Validation Status
All validation steps have been verified during Phase 1 (RAG + Multi-Mode Chat) and Phase 2 (Pre-Arrival Forms) development. The hybrid search pipeline consistently outperforms single-method retrieval on maritime regulation queries, with cross-encoder reranking providing measurable precision improvements.