Extractor

This document describes how to use the DIMARC API to interact with extraction endpoints. The API offers two approaches for extracting structured data from documents.

Prerequisites

An active DIMARC account.
Administrator status in your organization.
Your authentication token, sent in the x-api-key header (see Retrieving Your Authentication Token).

Two Extraction Approaches

The Extractor API offers two methods of use depending on your needs:

1. Global Extractor

The global extractor allows you to define the extraction schema directly in the API request, without requiring prior agent configuration.

Endpoints:

POST /v2/extractor
POST /v2/extractor/async

Use case: Ideal for one-time extractions, testing, or when you want to dynamically define the extraction schema without going through the Dimarc interface.

2. Agent-based Extractor

The agent-based extractor uses extraction models preconfigured in the Dimarc interface. This approach is recommended for repetitive extractions with the same schema.

Endpoints:

POST /v2/extractor/:agentReference
POST /v2/extractor/async/:agentReference

Use case: Recommended for production use cases where you have standardized and reusable extraction models.
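
The four endpoints above follow a regular pattern, so a client can derive the right URL from two choices: global or agent-based, synchronous or asynchronous. The following is a minimal sketch of that mapping. The helper name `extractor_url` is hypothetical, and the base URL is the one used in this document's curl examples.

```python
from typing import Optional
from urllib.parse import quote

BASE_URL = "https://api.dimarc.ai"  # base URL as used in this document's curl examples


def extractor_url(agent_reference: Optional[str] = None, asynchronous: bool = False) -> str:
    """Return the extraction endpoint for the chosen approach and processing mode."""
    path = "/v2/extractor"
    if asynchronous:
        path += "/async"
    if agent_reference:
        # Agent-based extractor: the agent reference is the last path segment.
        path += "/" + quote(agent_reference, safe="")
    return BASE_URL + path
```

Calling `extractor_url()` yields the global synchronous endpoint, while `extractor_url("my-agent", asynchronous=True)` yields the agent-based asynchronous one (`my-agent` being a placeholder reference).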

Retrieving an agent reference

To use the agent-based extractor, you must first retrieve your agent’s reference:

Click your profile icon in the top right corner, then go to the Organization > API section.
In the References of your agents section, copy the reference of the Extractor agent you want to use.

Configuring extraction models

Extraction models are configured in the Dimarc interface. To add a new model, open your dashboard at https://app.dimarc.ai, go to your Extractor agent's configuration, and define the fields to extract.

Processing Modes

Each approach (Global and Agent) offers two processing modes:

Synchronous: The request is processed immediately and the response is returned over the same HTTP connection.
Asynchronous: The request is queued and the results are sent to the webhook you specify once the extraction is complete.

Synchronous Extraction

Synchronous extraction is ideal for quick processing or when your application is directly waiting for the result.

Approach 1: Global Extractor (Synchronous)

This method allows you to define the extraction schema directly in the request.

Request

curl --location 'https://api.dimarc.ai/v2/extractor' \
--header 'Content-Type: application/json' \
--header 'x-api-key: <your_api_key>' \
--data '{
    "filename": "invoice.pdf",
    "file": "<base64_encoded_file>",
    "params": [
        {
            "name": "invoice_number",
            "format": "text",
            "description": "Invoice number"
        },
        {
            "name": "total_amount",
            "format": "number",
            "description": "Total amount including tax"
        },
        {
            "name": "products",
            "format": "list",
            "description": "List of products",
            "items": [
                {
                    "name": "product_name",
                    "format": "text",
                    "description": "Product name"
                },
                {
                    "name": "quantity",
                    "format": "number",
                    "description": "Quantity"
                }
            ]
        }
    ]
}'

Parameters

Parameter	Type	Description
`filename`	string	Original filename (with extension)
`file`	string	File content encoded in base64
`params`	array	Array defining the extraction schema (see Parameters Structure)
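
As a rough Python equivalent of the curl request above, the sketch below base64-encodes a file and builds the POST request with the standard library only. The helper names `build_global_request` and `send` are hypothetical; only the URL, the headers, and the body fields come from this document.

```python
import base64
import json
import urllib.request
from typing import Any, Dict, List

EXTRACTOR_URL = "https://api.dimarc.ai/v2/extractor"


def build_global_request(
    api_key: str, filename: str, content: bytes, params: List[Dict[str, Any]]
) -> urllib.request.Request:
    """Build (but do not send) a synchronous global-extractor request."""
    body = {
        "filename": filename,
        # The API expects the file content as a base64 string.
        "file": base64.b64encode(content).decode("ascii"),
        "params": params,
    }
    return urllib.request.Request(
        EXTRACTOR_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-api-key": api_key},
        method="POST",
    )


def send(request: urllib.request.Request) -> Dict[str, Any]:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Keeping request construction separate from sending makes it possible to inspect or log the payload before any network call is made.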

Approach 2: Agent-based Extractor (Synchronous)

This method uses a preconfigured agent with an extraction schema defined in the Dimarc interface.

Request

curl --location 'https://api.dimarc.ai/v2/extractor/<agent_reference>' \
--header 'Content-Type: application/json' \
--header 'x-api-key: <your_api_key>' \
--data '{
    "filename": "invoice.pdf",
    "file": "<base64_encoded_file>"
}'

Parameters

Parameter	Type	Description
`filename`	string	Original filename (with extension)
`file`	string	File content encoded in base64

Synchronous Response Format (common to both approaches)

{
    "status": "success",
    "data": {
        "invoice_number": "INV-2024-001",
        "total_amount": 1250.50,
        "products": [
            {
                "product_name": "Product A",
                "quantity": 2
            },
            {
                "product_name": "Product B",
                "quantity": 1
            }
        ]
    },
    "metadata": {
        "filename": "invoice.pdf",
        "duration": "3.45",
        "tokens_consumed": 1
    }
}
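
One detail of the response above is easy to miss: `duration` arrives as a string, while `tokens_consumed` is a number. The sketch below parses the response into a small typed structure. The class and function names are hypothetical, and because this document does not show a synchronous error body, treating any status other than `success` as a failure is an assumption.

```python
import json
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class ExtractionResult:
    data: Dict[str, Any]
    filename: str
    duration: float
    tokens_consumed: int


def parse_extraction_response(raw: str) -> ExtractionResult:
    """Parse a synchronous extraction response body into an ExtractionResult."""
    payload = json.loads(raw)
    if payload.get("status") != "success":
        # Assumption: anything other than "success" is treated as a failure.
        raise ValueError(f"unexpected status: {payload.get('status')!r}")
    metadata = payload.get("metadata", {})
    return ExtractionResult(
        data=payload["data"],
        filename=metadata.get("filename", ""),
        # "duration" is a string in the sample response, so convert it explicitly.
        duration=float(metadata.get("duration", "0")),
        tokens_consumed=int(metadata.get("tokens_consumed", 0)),
    )
```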

Asynchronous Extraction

Asynchronous extraction is recommended for:

Large documents
Complex extractions requiring more time
Systems that cannot wait for an immediate response

Approach 1: Global Extractor (Asynchronous)

This method allows you to define the extraction schema directly in the request.

Request

curl --location 'https://api.dimarc.ai/v2/extractor/async' \
--header 'Content-Type: application/json' \
--header 'x-api-key: <your_api_key>' \
--data '{
    "filename": "invoice.pdf",
    "file": "<base64_encoded_file>",
    "callback": "https://your-callback-endpoint.com/webhook",
    "params": [
        {
            "name": "invoice_number",
            "format": "text",
            "description": "Invoice number"
        },
        {
            "name": "total_amount",
            "format": "number",
            "description": "Total amount including tax"
        }
    ]
}'

Parameters

Parameter	Type	Description
`filename`	string	Original filename (with extension)
`file`	string	File content encoded in base64
`callback`	string	URL of the webhook that will receive the results
`params`	array	Array defining the extraction schema (see Parameters Structure)

Approach 2: Agent-based Extractor (Asynchronous)

This method uses a preconfigured agent with an extraction schema defined in the Dimarc interface.

Request

curl --location 'https://api.dimarc.ai/v2/extractor/async/<agent_reference>' \
--header 'Content-Type: application/json' \
--header 'x-api-key: <your_api_key>' \
--data '{
    "filename": "invoice.pdf",
    "file": "<base64_encoded_file>",
    "callback": "https://your-callback-endpoint.com/webhook"
}'

Parameters

Parameter	Type	Description
`filename`	string	Original filename (with extension)
`file`	string	File content encoded in base64
`callback`	string	URL of the webhook that will receive the results

Immediate Response (common to both approaches)

{
    "status": "started",
    "extract_id": "550e8400-e29b-41d4-a716-446655440000"
}

Webhook Response Format

When the extraction is complete, the results are sent to the specified callback URL.

On success

{
    "status": "success",
    "extract_id": "550e8400-e29b-41d4-a716-446655440000",
    "data": {
        "invoice_number": "INV-2024-001",
        "total_amount": 1250.50
    },
    "metadata": {
        "filename": "invoice.pdf",
        "duration": "3.45",
        "tokens_consumed": 1
    }
}

On error

{
    "extract_id": "550e8400-e29b-41d4-a716-446655440000",
    "error": "Extraction failed: Unable to process the document",
    "detail": "HTTP status: 500"
}
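
Since the webhook call arrives separately from the original request, a client needs to match it back to the extraction it started, using `extract_id`. The sketch below shows one way to do that with an in-memory dictionary. The function names and the `pending` registry are hypothetical, and a real service would likely use persistent storage. Note that the error payload shown above has no `status` field, so the presence of `error` is used to tell the two cases apart.

```python
from typing import Any, Dict, Tuple

# extract_id -> filename, for extractions that have started but not yet called back
pending: Dict[str, str] = {}


def register_started(immediate_response: Dict[str, Any], filename: str) -> str:
    """Remember an extraction once the API has answered with status "started"."""
    if immediate_response.get("status") != "started":
        raise ValueError(f"extraction was not started: {immediate_response!r}")
    extract_id = immediate_response["extract_id"]
    pending[extract_id] = filename
    return extract_id


def handle_webhook(payload: Dict[str, Any]) -> Tuple[str, str, Any]:
    """Return (filename, outcome, result) for a webhook payload.

    outcome is "success" or "error"; result is the extracted data or an error message.
    """
    extract_id = payload["extract_id"]
    filename = pending.pop(extract_id, "<unknown>")
    if "error" in payload:
        # Error payloads carry "error" and "detail" instead of "status" and "data".
        message = payload["error"]
        if payload.get("detail"):
            message += f" ({payload['detail']})"
        return filename, "error", message
    return filename, "success", payload.get("data", {})
```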

Parameters Structure

When using the global extractor, you must define the extraction schema via the params parameter. Each element in the params array is an object with the following properties:

Property	Type	Description
`name`	string	Name of the field to extract
`format`	string	Data type: `text`, `number`, `boolean`, or `list`
`description`	string	Clear description of the field to extract (helps the AI understand)
`items`	array	(Required only for `format: "list"`) Array of sub-fields to extract for each list item

Supported Format Types

1. Simple Types

For scalar values, use:

text: For text, character strings
number: For numeric values
boolean: For true/false values

{
    "name": "invoice_number",
    "format": "text",
    "description": "Invoice number"
}

2. Lists (`list` format)

For repeated structured data, such as line items or table rows, use the list format and define the sub-fields of each item in the items array:

{
    "name": "products",
    "format": "list",
    "description": "List of products in the invoice",
    "items": [
        {
            "name": "product_name",
            "format": "text",
            "description": "Product name"
        },
        {
            "name": "quantity",
            "format": "number",
            "description": "Ordered quantity"
        },
        {
            "name": "unit_price",
            "format": "number",
            "description": "Unit price before tax"
        }
    ]
}
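
Because the schema travels with every global-extractor request, it can help to check it client-side before sending. The sketch below applies the rules described in this section. The function name `validate_params` is hypothetical, and two of its checks are assumptions not stated in this document: that sub-fields in `items` follow the same rules as top-level fields, and that `items` should not appear on non-list fields.

```python
from typing import Any, Dict, List

ALLOWED_FORMATS = {"text", "number", "boolean", "list"}


def validate_params(params: List[Dict[str, Any]], path: str = "params") -> List[str]:
    """Return a list of readable problems; an empty list means the schema looks valid."""
    problems: List[str] = []
    for index, field in enumerate(params):
        where = f"{path}[{index}]"
        for key in ("name", "format", "description"):
            if not isinstance(field.get(key), str) or not field[key]:
                problems.append(f"{where}: '{key}' must be a non-empty string")
        fmt = field.get("format")
        if isinstance(fmt, str) and fmt not in ALLOWED_FORMATS:
            problems.append(f"{where}: unsupported format '{fmt}'")
        if fmt == "list":
            items = field.get("items")
            if not isinstance(items, list) or not items:
                problems.append(f"{where}: 'items' is required for format 'list'")
            else:
                # Assumption: sub-fields follow the same rules as top-level fields.
                problems.extend(validate_params(items, f"{where}.items"))
        elif "items" in field:
            # Assumption: 'items' is meaningless outside of list fields.
            problems.append(f"{where}: 'items' is only allowed for format 'list'")
    return problems
```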

Complete Schema Example

{
    "params": [
        {
            "name": "invoice_number",
            "format": "text",
            "description": "Invoice number"
        },
        {
            "name": "invoice_date",
            "format": "text",
            "description": "Invoice issue date"
        },
        {
            "name": "total_before_tax",
            "format": "number",
            "description": "Total amount before tax"
        },
        {
            "name": "total_with_tax",
            "format": "number",
            "description": "Total amount including tax"
        },
        {
            "name": "is_paid",
            "format": "boolean",
            "description": "Is the invoice paid?"
        },
        {
            "name": "products",
            "format": "list",
            "description": "List of invoiced products",
            "items": [
                {
                    "name": "designation",
                    "format": "text",
                    "description": "Product or service designation"
                },
                {
                    "name": "quantity",
                    "format": "number",
                    "description": "Quantity"
                },
                {
                    "name": "unit_price",
                    "format": "number",
                    "description": "Unit price before tax"
                }
            ]
        }
    ]
}
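
When schemas are assembled in code rather than written by hand, two small helpers can keep them compact and consistent. The sketch below builds part of the schema above. The helper names `field` and `list_field` are hypothetical conveniences, not part of the API.

```python
import json
from typing import Any, Dict, List

Field = Dict[str, Any]


def field(name: str, fmt: str, description: str) -> Field:
    """Build a simple (text, number, or boolean) field definition."""
    return {"name": name, "format": fmt, "description": description}


def list_field(name: str, description: str, items: List[Field]) -> Field:
    """Build a list field whose sub-fields are given in items."""
    return {"name": name, "format": "list", "description": description, "items": items}


invoice_schema = {
    "params": [
        field("invoice_number", "text", "Invoice number"),
        field("is_paid", "boolean", "Is the invoice paid?"),
        list_field("products", "List of invoiced products", [
            field("designation", "text", "Product or service designation"),
            field("quantity", "number", "Quantity"),
        ]),
    ]
}

# Print the schema as JSON, ready to merge into a request body.
print(json.dumps(invoice_schema, indent=4))
```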

Approach Comparison

Criteria	Global Extractor	Agent-based Extractor
Configuration	Schema defined in API request	Schema preconfigured in Dimarc interface
Flexibility	⭐⭐⭐ Very flexible, different schema for each request	⭐ Fixed schema, requires modification via interface
Maintenance	⚠️ Schema managed client-side	✅ Centralized
Use case	Testing, one-time extractions, dynamic integrations	Production, repetitive extractions, standardized processes
Reusability	❌ Each client must define the schema	✅ Schema shared across different systems
Request complexity	Larger (includes schema)	Lighter (file only)

Which approach to choose?

Choose the Global Extractor if:

You are testing or prototyping
You need different extraction schemas for each document
You want quick integration without prior configuration
Your use case is one-time or exploratory

Choose the Agent-based Extractor if:

You are in production with repetitive processes
You always extract the same types of data
You want to centralize extraction schema management
Multiple systems need to use the same extraction schema
You want better tracking and traceability

Limitations and Considerations

Maximum file size: 36 MB
Supported formats: PDF, PNG, JPG, JPEG, DOCX, XLSX
Average extraction time: 5 to 10 seconds for standard documents (those that are not image-heavy)
Asynchronous mode is strongly recommended for files larger than 20 MB
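
These limits can be checked before a file is uploaded. The sketch below rejects unsupported or oversized files and picks a processing mode from the file size. The function name `choose_mode` is hypothetical, and two points are assumptions: that 1 MB means 1024 × 1024 bytes, and that the limits apply to the raw file rather than to its base64-encoded form.

```python
import os

MAX_FILE_BYTES = 36 * 1024 * 1024         # assumption: 1 MB = 1024 * 1024 bytes
ASYNC_THRESHOLD_BYTES = 20 * 1024 * 1024  # above this, asynchronous mode is recommended
SUPPORTED_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg", ".docx", ".xlsx"}


def choose_mode(filename: str, size_bytes: int) -> str:
    """Return "sync" or "async" for a file, or raise ValueError if it cannot be sent."""
    extension = os.path.splitext(filename)[1].lower()
    if extension not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"unsupported format: {extension or 'no extension'}")
    if size_bytes > MAX_FILE_BYTES:
        raise ValueError(f"file is {size_bytes} bytes; the limit is {MAX_FILE_BYTES}")
    return "async" if size_bytes > ASYNC_THRESHOLD_BYTES else "sync"
```

In practice, `size_bytes` could come from `os.path.getsize(path)` before the file is read and encoded.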

Support and assistance

For any questions regarding the Extractor API, contact our support team at contact@dimarc.fr
