EDB Docs - EDB Postgres AI Database v7

Each step in an AI pipeline is defined by its operation type and an optional configuration object. You pass the operation name as a string to the step_N parameter of aidb.create_pipeline(), and the configuration as JSONB to the corresponding step_N_options parameter.

SELECT aidb.create_pipeline(
    name              => 'my_pipeline',
    source            => 'my_source_table',
    source_key_column => 'id',
    source_data_column => 'content',
    step_1            => 'ChunkText',
    step_1_options    => aidb.chunk_text_config(desired_length => 120, max_length => 150),
    step_2            => 'KnowledgeBase',
    step_2_options    => aidb.knowledge_base_config(model => 'bert', data_format => 'Text')
);

Use the helper functions described below to build the step_N_options value for each operation type. All helper functions return a JSONB configuration object.
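
Because each helper returns a JSONB value, you can call one on its own to inspect the configuration before wiring it into a pipeline. The following is a minimal sketch that assumes the aidb extension is installed. The exact keys of the returned object aren't described here, so no output is shown. jsonb_pretty() is a standard PostgreSQL function.

```sql
-- Inspect the options object that would be passed as step_1_options.
SELECT jsonb_pretty(
    aidb.chunk_text_config(desired_length => 120, max_length => 150)
) AS chunk_options;
```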

ChunkText

Divides text into smaller segments to fit within LLM context windows.

Helper function: aidb.chunk_text_config()

Parameter	Type	Default	Description
`desired_length`	`INTEGER`	Required	Target chunk size. The unit depends on `strategy`.
`max_length`	`INTEGER`	`NULL`	Maximum chunk size. If omitted, `desired_length` is a strict upper limit.
`overlap_length`	`INTEGER`	`NULL`	Amount of content to overlap between consecutive chunks. Defaults to 0 (no overlap).
`strategy`	`TEXT`	`NULL`	`'chars'` (default) for character-based or `'words'` for word-based chunking.

Word-based chunking example with overlap:

SELECT aidb.create_pipeline(
    name               => 'chunk_pipeline',
    source             => 'source_table',
    source_key_column  => 'id',
    source_data_column => 'content',
    step_1             => 'ChunkText',
    step_1_options     => aidb.chunk_text_config(
        desired_length => 100,
        max_length     => 150,
        overlap_length => 20,
        strategy       => 'words'
    )
);

Result

ChunkText transforms the shape of the data by introducing a part_id column. Each source row may produce multiple output rows, one per chunk.
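
For contrast with the word-based example above, here is a sketch of character-based chunking that sets only desired_length, so that value acts as a strict upper limit. The pipeline name chunk_chars_pipeline and the length of 500 are illustrative placeholders.

```sql
SELECT aidb.create_pipeline(
    name               => 'chunk_chars_pipeline',  -- placeholder name
    source             => 'source_table',
    source_key_column  => 'id',
    source_data_column => 'content',
    step_1             => 'ChunkText',
    step_1_options     => aidb.chunk_text_config(
        desired_length => 500,     -- max_length omitted, so 500 is a strict upper limit
        strategy       => 'chars'  -- stated explicitly; 'chars' is also the default
    )
);
```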

ParseHtml

Extracts readable text from HTML strings, stripping tags while preserving logical structure.

Helper function: aidb.html_parse_config()

Parameter	Type	Default	Description
`method`	`TEXT`	`NULL`	`'StructuredPlaintext'` (default) for plain text extraction, or `'StructuredMarkdown'` to retain hierarchy.

SELECT aidb.create_pipeline(
    name               => 'html_pipeline',
    source             => 'web_data_table',
    source_key_column  => 'id',
    source_data_column => 'html_content',
    step_1             => 'ParseHtml',
    step_1_options     => aidb.html_parse_config(method => 'StructuredMarkdown'),
    step_2             => 'ChunkText',
    step_2_options     => aidb.chunk_text_config(desired_length => 120)
);
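
A common variation is to extract plain text, chunk it, and embed the chunks. The sketch below chains three steps using only the helpers described on this page. The pipeline name html_kb_pipeline and the chunk size are illustrative, and it assumes a model named 'bert' is already registered, as in the other examples.

```sql
SELECT aidb.create_pipeline(
    name               => 'html_kb_pipeline',  -- placeholder name
    source             => 'web_data_table',
    source_key_column  => 'id',
    source_data_column => 'html_content',
    step_1             => 'ParseHtml',
    step_1_options     => aidb.html_parse_config(method => 'StructuredPlaintext'),
    step_2             => 'ChunkText',
    step_2_options     => aidb.chunk_text_config(desired_length => 120),
    step_3             => 'KnowledgeBase',
    step_3_options     => aidb.knowledge_base_config(model => 'bert', data_format => 'Text')
);
```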

ParsePdf

Extracts text from binary PDF data, with options to handle non-compliant or complex files.

Helper function: aidb.pdf_parse_config()

Parameter	Type	Default	Description
`method`	`TEXT`	`NULL`	`'Structured'` (default) — uses the PDF specification to identify text blocks.
`allow_partial_parsing`	`BOOLEAN`	`NULL`	If `true` (default), continues parsing when errors are encountered on individual pages.

The resulting part_id column maps to the page index from which each text block was extracted.

SELECT aidb.create_pipeline(
    name               => 'pdf_pipeline',
    source             => 'pdf_files_table',
    source_key_column  => 'id',
    source_data_column => 'pdf_data',
    step_1             => 'ParsePdf',
    step_1_options     => aidb.pdf_parse_config(
        method                => 'Structured',
        allow_partial_parsing => true
    ),
    step_2             => 'KnowledgeBase',
    step_2_options     => aidb.knowledge_base_config(model => 'bert', data_format => 'Text')
);

Result

ParsePdf unnests results — a multi-page PDF produces one row per page, each with a part_id corresponding to the page index.
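
Because each page becomes its own row, pages can be chunked further before embedding. The following sketch chains ParsePdf, ChunkText, and KnowledgeBase. The pipeline name and chunk sizes are placeholders. That both the page index and the chunk index are recorded for each embedding is an assumption inferred from the part_ids array in the destination table schema, not something this page states.

```sql
SELECT aidb.create_pipeline(
    name               => 'pdf_chunk_pipeline',  -- placeholder name
    source             => 'pdf_files_table',
    source_key_column  => 'id',
    source_data_column => 'pdf_data',
    step_1             => 'ParsePdf',             -- one row per page
    step_1_options     => aidb.pdf_parse_config(method => 'Structured'),
    step_2             => 'ChunkText',            -- one row per chunk within each page
    step_2_options     => aidb.chunk_text_config(
        desired_length => 100,
        max_length     => 150,
        strategy       => 'words'
    ),
    step_3             => 'KnowledgeBase',
    step_3_options     => aidb.knowledge_base_config(model => 'bert', data_format => 'Text')
);
```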

PdfToImage

Converts each page of a PDF document into an image. Use this step when source PDFs are scanned documents, image-heavy layouts, or any PDF where the text layer is absent or unreliable. The rendered page images can then be passed directly to a PerformOcr step for text extraction.

For digitally produced PDFs that contain a native text layer, use ParsePdf instead — it's faster and doesn't require an OCR model.

Options format: Pass options as a JSON literal cast to JSONB.

Parameter	Type	Default	Description
`dpi`	`INTEGER`	`300`	Resolution in dots per inch used to render each page. Higher values produce sharper images at the cost of larger file size.
`first_page`	`INTEGER`	`NULL`	First page to render (1-based). Renders from the beginning when omitted.
`last_page`	`INTEGER`	`NULL`	Last page to render, inclusive (1-based). Renders to the end when omitted.
`max_pages`	`INTEGER`	`NULL`	Maximum number of pages to render. Acts as a safety cap regardless of `first_page` and `last_page`.
`render_annotations`	`BOOLEAN`	`true`	Whether to include PDF annotations (such as comments and form fields) in the rendered output.
`format`	`JSONB`	`{"type": "png"}`	Output image format. Accepted values: `{"type": "png"}`, `{"type": "jpeg"}`. Use `{"type": "png"}` when passing output to `PerformOcr`.

Output images are PNG or JPEG, depending on the format option. Use {"type": "png"} (the default) when passing output to the PerformOcr step. NVIDIA NIM PaddleOCR requires PNG or JPEG input.
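
The options object can be written as a literal or assembled with PostgreSQL's built-in jsonb_build_object(). The sketch below builds a value that renders at most the first ten pages at 200 DPI without annotations. The specific values are illustrative only.

```sql
-- Build a PdfToImage options object from the parameters in the table above.
SELECT jsonb_build_object(
    'dpi',                200,
    'first_page',         1,      -- 1-based
    'last_page',          10,     -- inclusive
    'max_pages',          10,     -- safety cap
    'render_annotations', false,
    'format',             jsonb_build_object('type', 'png')
) AS pdf_to_image_options;
```

The resulting value can be passed as the step_N_options argument of a PdfToImage step.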

Before using this step in a pipeline that includes PerformOcr, register an OCR-capable model:

SELECT aidb.create_model(
    'my_paddle_ocr_model',
    'nim_paddle_ocr',
    credentials => '{"api_key": "<NVIDIA_NIM_API_KEY>"}'::JSONB
);

Then create the pipeline:

SELECT aidb.create_pipeline(
    name               => 'pdf_ocr_pipeline',
    source             => 'pdf_files_table',
    source_key_column  => 'id',
    source_data_column => 'pdf_data',
    step_1             => 'PdfToImage',
    step_1_options     => '{"dpi": 150}'::JSONB,
    step_2             => 'PerformOcr',
    step_2_options     => aidb.ocr_config(model => 'my_paddle_ocr_model'),
    step_3             => 'KnowledgeBase',
    step_3_options     => aidb.knowledge_base_config(model => 'bert', data_format => 'Text')
);

Result

PdfToImage unnests results — each page of the source PDF produces one output row containing the rendered page as a PNG or JPEG image, depending on the format option. The part_id column maps to the page index (zero-based). This output shape is directly compatible with the PerformOcr step, which expects image bytes as its input.

PerformOcr

Extracts text from images using an OCR-capable AI model, such as NVIDIA NIM PaddleOCR.

Helper function: aidb.ocr_config()

Parameter	Type	Default	Description
`model`	`TEXT`	Required	Name of the registered OCR model to use.

Before using this step, register an OCR-capable model:

SELECT aidb.create_model(
    'my_paddle_ocr_model',
    'nim_paddle_ocr',
    credentials => '{"api_key": "<NVIDIA_NIM_API_KEY>"}'::JSONB
);

Then reference it in your pipeline:

SELECT aidb.create_pipeline(
    name               => 'ocr_pipeline',
    source             => 'images_table',
    source_key_column  => 'id',
    source_data_column => 'image_data',
    step_1             => 'PerformOcr',
    step_1_options     => aidb.ocr_config(model => 'my_paddle_ocr_model'),
    step_2             => 'KnowledgeBase',
    step_2_options     => aidb.knowledge_base_config(model => 'bert', data_format => 'Text')
);

Result

PerformOcr unnests results: a single image may produce multiple rows, one per detected text block. The NVIDIA NIM provider currently supports only PNG and JPEG input.

SummarizeText

Generates concise summaries of long text passages using an AI language model.

Helper function: aidb.summarize_text_config()

Parameter	Type	Default	Description
`model`	`TEXT`	Required	Name of the registered model to use for summarization.
`chunk_config`	`JSONB`	`NULL`	Optional chunking configuration (from `aidb.chunk_text_config()`) applied before summarization.
`prompt`	`TEXT`	`NULL`	Custom prompt to guide the summarization. Uses a standard prompt if omitted.
`strategy`	`TEXT`	`NULL`	`'append'` (default) concatenates per-chunk summaries, `'reduce'` iteratively compresses.
`reduction_factor`	`INTEGER`	`NULL`	Used with `'reduce'` strategy. Controls how aggressively text is reduced per iteration (default is 3).
`inference_config`	`JSONB`	`NULL`	Optional runtime inference settings (from `aidb.inference_config()`).

SELECT aidb.create_pipeline(
    name               => 'summary_pipeline',
    source             => 'articles_table',
    source_key_column  => 'id',
    source_data_column => 'body',
    step_1             => 'SummarizeText',
    step_1_options     => aidb.summarize_text_config(
        model            => 'my_t5_model',
        chunk_config     => aidb.chunk_text_config(
            desired_length => 100,
            max_length     => 120,
            overlap_length => 10,
            strategy       => 'words'
        ),
        prompt           => 'Summarize the key points concisely',
        strategy         => 'reduce',
        reduction_factor => 3
    ),
    step_2             => 'KnowledgeBase',
    step_2_options     => aidb.knowledge_base_config(model => 'bert', data_format => 'Text')
);
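
For comparison, here is a sketch that relies on the 'append' strategy, which concatenates the per-chunk summaries instead of compressing them iteratively. The pipeline name and chunk size are placeholders, and 'my_t5_model' is assumed to be registered already, as in the example above.

```sql
SELECT aidb.create_pipeline(
    name               => 'summary_append_pipeline',  -- placeholder name
    source             => 'articles_table',
    source_key_column  => 'id',
    source_data_column => 'body',
    step_1             => 'SummarizeText',
    step_1_options     => aidb.summarize_text_config(
        model        => 'my_t5_model',
        chunk_config => aidb.chunk_text_config(desired_length => 200, strategy => 'words'),
        strategy     => 'append'  -- stated explicitly; 'append' is also the default
    ),
    step_2             => 'KnowledgeBase',
    step_2_options     => aidb.knowledge_base_config(model => 'bert', data_format => 'Text')
);
```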

KnowledgeBase

Converts processed text or image data into vector embeddings and stores them in a searchable knowledge base. This step must always be the last step in a pipeline, as its output is a VECTOR type that cannot be used as input by any subsequent step. For querying the knowledge base with semantic or hybrid search, see Knowledge bases.

Helper function: aidb.knowledge_base_config()

Parameter	Type	Default	Description
`model`	`TEXT`	Required	Name of the embedding model.
`data_format`	`aidb.PipelineDataFormat`	Required	`'Text'` or `'Image'`.
`distance_operator`	`aidb.DistanceOperator`	`NULL`	Similarity metric: `L2` (default), `Cosine`, or `InnerProduct`.
`vector_index`	`JSONB`	`NULL`	Vector index config, built with a vector index helper such as `aidb.vector_index_hnsw_config()`.

SELECT aidb.create_pipeline(
    name               => 'kb_pipeline',
    source             => 'source_table',
    source_key_column  => 'id',
    source_data_column => 'content',
    step_1             => 'KnowledgeBase',
    step_1_options     => aidb.knowledge_base_config(
        model             => 'bert',
        data_format       => 'Text',
        distance_operator => 'Cosine',
        vector_index      => aidb.vector_index_hnsw_config(m => 16, ef_construction => 64)
    )
);
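
The same helper accepts 'Image' as the data format. The sketch below embeds image data directly, without an OCR step. The model name my_image_embedding_model is hypothetical and stands for any registered model capable of embedding images. The pipeline name is also a placeholder.

```sql
SELECT aidb.create_pipeline(
    name               => 'image_kb_pipeline',  -- placeholder name
    source             => 'images_table',
    source_key_column  => 'id',
    source_data_column => 'image_data',
    step_1             => 'KnowledgeBase',
    step_1_options     => aidb.knowledge_base_config(
        model       => 'my_image_embedding_model',  -- hypothetical model name
        data_format => 'Image'
    )
);
```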

To link multiple pipelines to the same knowledge base, use aidb.knowledge_base_config_from_kb(data_format) instead. This helper inherits the model and distance operator settings from the existing knowledge base.

Destination table

The KnowledgeBase step automatically creates a destination table named pipeline_<pipeline_name> with the following schema:

Column	Type	Description
`id`	`BIGSERIAL`	Primary key.
`pipeline_id`	`INT`	Reference to the originating pipeline.
`source_id`	`TEXT`	ID of the original source record.
`part_ids`	`BIGINT[]`	Tracks segments if the data was chunked or parsed.
`value`	`VECTOR`	The pgvector embedding.
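
Given this schema, the destination table can be queried like any other table. The sketch below counts stored embeddings per source record for the kb_pipeline example above. It assumes the table is reachable by its unqualified name through your search_path. This page doesn't state which schema the table is created in, so you may need to qualify the name.

```sql
-- Count embeddings per source record for the pipeline named 'kb_pipeline'.
SELECT source_id,
       count(*) AS embedding_count
FROM pipeline_kb_pipeline
GROUP BY source_id
ORDER BY embedding_count DESC
LIMIT 10;
```

A source record with more than one embedding indicates that an earlier step split it into several parts.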

Multi-pipeline knowledge bases

A single knowledge base can aggregate embeddings from multiple pipelines. The internal knowledge_base_pipeline junction table manages these mappings. When retrieving results via aidb.retrieve_text(), each row includes a pipeline_name column so you can identify which pipeline produced each embedding.

For knowledge base views and statistics, see Knowledge bases reference.

To see pipeline steps used together in a complete end-to-end workflow, see Example.
