Pipeline steps v7

Each step in an AI pipeline is defined by its operation type and an optional configuration object. You pass the operation name as a string to the step_N parameter of aidb.create_pipeline(), and the configuration as JSONB to the corresponding step_N_options parameter.

SELECT aidb.create_pipeline(
    name               => 'my_pipeline',
    source             => 'my_source_table',
    source_key_column  => 'id',
    source_data_column => 'content',
    step_1             => 'ChunkText',
    step_1_options     => aidb.chunk_text_config(desired_length => 120, max_length => 150),
    step_2             => 'KnowledgeBase',
    step_2_options     => aidb.knowledge_base_config(model => 'bert', data_format => 'Text')
);

Use the helper functions described below to build the step_N_options value for each operation type. All helper functions return a JSONB configuration object.
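Because each helper returns plain JSONB, you can call one on its own to inspect the configuration before attaching it to a pipeline. The following is a minimal sketch that assumes the aidb extension is installed in the current database. The exact keys in the returned object are not specified here and may differ between versions.

```sql
-- Inspect the JSONB that a helper builds before passing it as step_N_options.
-- jsonb_pretty is a standard PostgreSQL function; the helper call is copied from the example above.
SELECT jsonb_pretty(
    aidb.chunk_text_config(desired_length => 120, max_length => 150)
) AS chunk_options;
```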

ChunkText

Divides text into smaller segments to fit within LLM context windows.

Helper function: aidb.chunk_text_config()

| Parameter | Type | Default | Description |
|---|---|---|---|
| desired_length | INTEGER | Required | Target chunk size. The unit depends on strategy. |
| max_length | INTEGER | NULL | Maximum chunk size. If omitted, desired_length is a strict upper limit. |
| overlap_length | INTEGER | NULL | Amount of content to overlap between consecutive chunks. Defaults to 0 (no overlap). |
| strategy | TEXT | NULL | 'chars' (default) for character-based or 'words' for word-based chunking. |

Basic chunking example:

SELECT aidb.create_pipeline(
    name               => 'chunk_pipeline',
    source             => 'source_table',
    source_key_column  => 'id',
    source_data_column => 'content',
    step_1             => 'ChunkText',
    step_1_options     => aidb.chunk_text_config(
        desired_length => 100,
        max_length     => 150,
        overlap_length => 20,
        strategy       => 'words'
    )
);

Result

ChunkText transforms the shape of the data by introducing a part_id column. Each source row may produce multiple output rows, one per chunk.
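To build intuition for how desired_length and overlap_length interact, the sketch below splits a string into fixed character windows using only standard PostgreSQL functions. It is a naive illustration, not aidb's implementation: the real ChunkText step may choose chunk boundaries differently, and the sample text, the parameter values, and the zero-based numbering are all assumptions made for the example.

```sql
-- Naive fixed-window chunking, for intuition only (not how aidb chunks text).
WITH params AS (
    SELECT 'The quick brown fox jumps over the lazy dog'::text AS content,
           20 AS desired_length,   -- window size in characters
           5  AS overlap_length    -- characters shared by neighbouring windows
)
SELECT row_number() OVER (ORDER BY start_pos) - 1 AS part_id,  -- illustrative numbering
       start_pos,
       substr(content, start_pos, desired_length) AS chunk
FROM params,
     -- Each window starts (desired_length - overlap_length) characters after the previous one.
     generate_series(1, length(content), desired_length - overlap_length) AS start_pos
ORDER BY start_pos;
```

One source string yields several output rows, each tagged with its own part identifier, which is the same change of shape the step introduces.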

ParseHtml

Extracts readable text from HTML strings, stripping tags while preserving logical structure.

Helper function: aidb.html_parse_config()

| Parameter | Type | Default | Description |
|---|---|---|---|
| method | TEXT | NULL | 'StructuredPlaintext' (default) for plain text extraction, or 'StructuredMarkdown' to retain hierarchy. |

SELECT aidb.create_pipeline(
    name               => 'html_pipeline',
    source             => 'web_data_table',
    source_key_column  => 'id',
    source_data_column => 'html_content',
    step_1             => 'ParseHtml',
    step_1_options     => aidb.html_parse_config(method => 'StructuredMarkdown'),
    step_2             => 'ChunkText',
    step_2_options     => aidb.chunk_text_config(desired_length => 120)
);

ParsePdf

Extracts text from binary PDF data, with options to handle non-compliant or complex files.

Helper function: aidb.pdf_parse_config()

| Parameter | Type | Default | Description |
|---|---|---|---|
| method | TEXT | NULL | 'Structured' (default): uses the PDF specification to identify text blocks. |
| allow_partial_parsing | BOOLEAN | NULL | If true (default), continues parsing when errors are encountered on individual pages. |

The resulting part_id column maps to the page index from which each text block was extracted.

SELECT aidb.create_pipeline(
    name               => 'pdf_pipeline',
    source             => 'pdf_files_table',
    source_key_column  => 'id',
    source_data_column => 'pdf_data',
    step_1             => 'ParsePdf',
    step_1_options     => aidb.pdf_parse_config(
        method                => 'Structured',
        allow_partial_parsing => true
    ),
    step_2             => 'KnowledgeBase',
    step_2_options     => aidb.knowledge_base_config(model => 'bert', data_format => 'Text')
);

Result

ParsePdf unnests results — a multi-page PDF produces one row per page, each with a part_id corresponding to the page index.
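The ParsePdf example assumes a source table that holds PDFs as binary data. A minimal sketch of such a table follows. The table and column names come from the example. The column types, the file path, and the use of pg_read_binary_file (a standard PostgreSQL function that reads a file on the database server and requires suitable privileges) are assumptions for illustration.

```sql
-- Hypothetical source table for the pdf_pipeline example.
CREATE TABLE pdf_files_table (
    id       INTEGER PRIMARY KEY,  -- assumed key type
    pdf_data BYTEA NOT NULL        -- raw PDF bytes
);

-- Load one PDF from the server's file system (the path is a placeholder).
INSERT INTO pdf_files_table (id, pdf_data)
VALUES (1, pg_read_binary_file('/path/to/document.pdf'));
```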

PdfToImage

Converts each page of a PDF document into an image. Use this step when source PDFs are scanned documents, image-heavy layouts, or any PDF where the text layer is absent or unreliable. The rendered page images can then pass directly to a PerformOcr step for text extraction.

For digitally produced PDFs that contain a native text layer, use ParsePdf instead — it's faster and doesn't require an OCR model.

Options format: Pass options as a JSON literal cast to JSONB.

| Parameter | Type | Default | Description |
|---|---|---|---|
| dpi | INTEGER | 300 | Resolution in dots per inch used to render each page. Higher values produce sharper images at the cost of larger file size. |
| first_page | INTEGER | NULL | First page to render (1-based). Renders from the beginning when omitted. |
| last_page | INTEGER | NULL | Last page to render, inclusive (1-based). Renders to the end when omitted. |
| max_pages | INTEGER | NULL | Maximum number of pages to render. Acts as a safety cap regardless of first_page and last_page. |
| render_annotations | BOOLEAN | true | Whether to include PDF annotations (such as comments and form fields) in the rendered output. |
| format | JSONB | {"type": "png"} | Output image format. Accepted values: {"type": "png"}, {"type": "jpeg"}. Use {"type": "png"} when passing output to PerformOcr. |

Output images are PNG or JPEG, depending on the format option. NVIDIA NIM PaddleOCR accepts only PNG or JPEG input. When passing output to the PerformOcr step, use "png" (the default).
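Several PdfToImage options can be combined in one JSONB value. The sketch below uses only the parameter names from the table, with illustrative values. jsonb_build_object is a standard PostgreSQL function, shown here as one possible way to avoid hand-writing the JSON string.

```sql
-- Build a PdfToImage options object; the result can be passed as step_N_options.
SELECT jsonb_build_object(
    'dpi',                200,
    'first_page',         1,
    'last_page',          10,
    'max_pages',          5,      -- caps rendering even though the range spans ten pages
    'render_annotations', false,
    'format',             jsonb_build_object('type', 'jpeg')
) AS pdf_to_image_options;
```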

Before using this step in a pipeline that includes PerformOcr, register an OCR-capable model:

SELECT aidb.create_model(
    'my_paddle_ocr_model',
    'nim_paddle_ocr',
    credentials => '{"api_key": "<NVIDIA_NIM_API_KEY>"}'::JSONB
);

Then create the pipeline:

SELECT aidb.create_pipeline(
    name               => 'pdf_ocr_pipeline',
    source             => 'pdf_files_table',
    source_key_column  => 'id',
    source_data_column => 'pdf_data',
    step_1             => 'PdfToImage',
    step_1_options     => '{"dpi": 150}'::jsonb,
    step_2             => 'PerformOcr',
    step_2_options     => aidb.ocr_config(model => 'my_paddle_ocr_model'),
    step_3             => 'KnowledgeBase',
    step_3_options     => aidb.knowledge_base_config(model => 'bert', data_format => 'Text')
);

Result

PdfToImage unnests results — each page of the source PDF produces one output row containing the rendered page as a PNG or JPEG image, depending on the format option. The part_id column maps to the page index (zero-based). This output shape is directly compatible with the PerformOcr step, which expects image bytes as its input.

PerformOcr

Extracts text from images using an OCR-capable AI model, such as NVIDIA NIM PaddleOCR.

Helper function: aidb.ocr_config()

| Parameter | Type | Default | Description |
|---|---|---|---|
| model | TEXT | Required | Name of the registered OCR model to use. |

Before using this step, register an OCR-capable model:

SELECT aidb.create_model(
    'my_paddle_ocr_model',
    'nim_paddle_ocr',
    credentials => '{"api_key": "<NVIDIA_NIM_API_KEY>"}'::JSONB
);

Then reference it in your pipeline:

SELECT aidb.create_pipeline(
    name               => 'ocr_pipeline',
    source             => 'images_table',
    source_key_column  => 'id',
    source_data_column => 'image_data',
    step_1             => 'PerformOcr',
    step_1_options     => aidb.ocr_config(model => 'my_paddle_ocr_model'),
    step_2             => 'KnowledgeBase',
    step_2_options     => aidb.knowledge_base_config(model => 'bert', data_format => 'Text')
);

Result

PerformOcr unnests results — a single image may produce multiple rows, one per detected text block. The NVIDIA NIM provider currently supports only png and jpeg formats.
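Because only PNG and JPEG are accepted, it can help to check stored images before running the pipeline. The sketch below classifies binary values by their leading bytes, using the well-known PNG and JPEG file signatures. It runs against inline sample data. To check real data, replace the samples CTE with your own table, such as images_table with a BYTEA image_data column (an assumption about the column type).

```sql
-- Classify images by file signature; the sample rows stand in for real image data.
WITH samples(id, image_data) AS (
    VALUES (1, '\x89504e470d0a1a0a'::bytea),  -- PNG signature
           (2, '\xffd8ffe0'::bytea),          -- JPEG start-of-image marker
           (3, '\x47494638'::bytea)           -- GIF header, not accepted
)
SELECT id,
       CASE
           WHEN substring(image_data FROM 1 FOR 4) = '\x89504e47'::bytea THEN 'png'
           WHEN substring(image_data FROM 1 FOR 3) = '\xffd8ff'::bytea   THEN 'jpeg'
           ELSE 'unsupported'
       END AS detected_format
FROM samples
ORDER BY id;
```

For the three sample rows, the query reports png, jpeg, and unsupported, in that order.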

SummarizeText

Generates concise summaries of long text passages using an AI language model.

Helper function: aidb.summarize_text_config()

| Parameter | Type | Default | Description |
|---|---|---|---|
| model | TEXT | Required | Name of the registered model to use for summarization. |
| chunk_config | JSONB | NULL | Optional chunking configuration (from aidb.chunk_text_config()) applied before summarization. |
| prompt | TEXT | NULL | Custom prompt to guide the summarization. Uses a standard prompt if omitted. |
| strategy | TEXT | NULL | 'append' (default) concatenates per-chunk summaries, 'reduce' iteratively compresses. |
| reduction_factor | INTEGER | NULL | Used with 'reduce' strategy. Controls how aggressively text is reduced per iteration (default is 3). |
| inference_config | JSONB | NULL | Optional runtime inference settings (from aidb.inference_config()). |

SELECT aidb.create_pipeline(
    name               => 'summary_pipeline',
    source             => 'articles_table',
    source_key_column  => 'id',
    source_data_column => 'body',
    step_1             => 'SummarizeText',
    step_1_options     => aidb.summarize_text_config(
        model            => 'my_t5_model',
        chunk_config     => aidb.chunk_text_config(100, 120, 10, 'words'),
        prompt           => 'Summarize the key points concisely',
        strategy         => 'reduce',
        reduction_factor => 3
    ),
    step_2             => 'KnowledgeBase',
    step_2_options     => aidb.knowledge_base_config(model => 'bert', data_format => 'Text')
);
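Only model is required, so a shorter configuration is also possible. According to the parameter table, omitting the other parameters leaves the standard prompt and the 'append' strategy in effect. The sketch below reuses the names from the example above; the pipeline name summary_pipeline_minimal is hypothetical.

```sql
-- Summarization with defaults: standard prompt, 'append' strategy, no explicit chunk_config.
SELECT aidb.create_pipeline(
    name               => 'summary_pipeline_minimal',  -- hypothetical name
    source             => 'articles_table',
    source_key_column  => 'id',
    source_data_column => 'body',
    step_1             => 'SummarizeText',
    step_1_options     => aidb.summarize_text_config(model => 'my_t5_model'),
    step_2             => 'KnowledgeBase',
    step_2_options     => aidb.knowledge_base_config(model => 'bert', data_format => 'Text')
);
```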

KnowledgeBase

Converts processed text or image data into vector embeddings and stores them in a searchable knowledge base. This step must always be the last step in a pipeline, as its output is a VECTOR type that cannot be used as input by any subsequent step. For querying the knowledge base with semantic or hybrid search, see Knowledge bases.

Helper function: aidb.knowledge_base_config()

| Parameter | Type | Default | Description |
|---|---|---|---|
| model | TEXT | Required | Name of the embedding model. |
| data_format | aidb.PipelineDataFormat | Required | 'Text' or 'Image'. |
| distance_operator | aidb.DistanceOperator | NULL | Similarity metric: L2 (default), Cosine, or InnerProduct. |
| vector_index | JSONB | NULL | Vector index config, built with a vector index helper such as aidb.vector_index_hnsw_config(). |

SELECT aidb.create_pipeline(
    name               => 'kb_pipeline',
    source             => 'source_table',
    source_key_column  => 'id',
    source_data_column => 'content',
    step_1             => 'KnowledgeBase',
    step_1_options     => aidb.knowledge_base_config(
        model             => 'bert',
        data_format       => 'Text',
        distance_operator => 'Cosine',
        vector_index      => aidb.vector_index_hnsw_config(m => 16, ef_construction => 64)
    )
);

To link multiple pipelines to the same knowledge base, build the step options with aidb.knowledge_base_config_from_kb(data_format) instead. The new pipeline then inherits the model and distance operator settings from the existing knowledge base.

Destination table

The KnowledgeBase step automatically creates a destination table named pipeline_<pipeline_name> with the following schema:

| Column | Type | Description |
|---|---|---|
| id | BIGSERIAL | Primary key. |
| pipeline_id | INT | Reference to the originating pipeline. |
| source_id | TEXT | ID of the original source record. |
| part_ids | BIGINT[] | Tracks segments if the data was chunked or parsed. |
| value | VECTOR | The pgvector embedding. |
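Since the destination table is an ordinary table, you can query it directly. The sketch below assumes the kb_pipeline example has run, so that the table is named pipeline_kb_pipeline following the naming pattern above. Schema qualification may be needed in your installation. The second query uses pgvector's cosine-distance operator (<=>) and assumes that a row with id = 1 exists.

```sql
-- Count stored embeddings per source record.
SELECT source_id, count(*) AS embedding_count
FROM pipeline_kb_pipeline
GROUP BY source_id
ORDER BY embedding_count DESC;

-- Five rows closest to the embedding stored under id = 1 (cosine distance).
SELECT t.id,
       t.source_id,
       t.part_ids,
       t.value <=> ref.value AS cosine_distance
FROM pipeline_kb_pipeline AS t,
     (SELECT p.value FROM pipeline_kb_pipeline AS p WHERE p.id = 1) AS ref
WHERE t.id <> 1
ORDER BY cosine_distance
LIMIT 5;
```

Direct queries like these are mainly useful for inspection and debugging.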

Multi-pipeline knowledge bases

A single knowledge base can aggregate embeddings from multiple pipelines. The internal knowledge_base_pipeline junction table manages these mappings. When retrieving results via aidb.retrieve_text(), each row includes a pipeline_name column so you can identify which pipeline produced each embedding.

For knowledge base views and statistics, see Knowledge bases reference.

To see pipeline steps used together in a complete end-to-end workflow, see Example.