Extractor

This document describes how to use the extraction endpoints of the Dimarc API. The API offers two approaches for extracting structured data from documents.

Prerequisites

Two Extraction Approaches

The Extractor API offers two methods of use depending on your needs:

1. Global Extractor

The global extractor allows you to define the extraction schema directly in the API request, without requiring prior agent configuration.

Endpoints:

POST /v2/extractor
POST /v2/extractor/async

Use case: Ideal for one-time extractions, testing, or when you want to dynamically define the extraction schema without going through the Dimarc interface.

2. Agent-based Extractor

The agent-based extractor uses extraction models preconfigured in the Dimarc interface. This approach is recommended for repetitive extractions with the same schema.

Endpoints:

POST /v2/extractor/:agentReference
POST /v2/extractor/async/:agentReference

Use case: Recommended for production use cases where you have standardized and reusable extraction models.
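
The four endpoints above differ only in an optional /async segment and an optional agent reference. As a minimal sketch, assuming the base URL https://api.dimarc.ai used in the curl examples of this document, a small helper can build the right URL. The function name extractor_url and its arguments are illustrative and are not part of the Dimarc API.

```python
from typing import Optional
from urllib.parse import quote

# Host taken from the curl examples in this document.
BASE_URL = "https://api.dimarc.ai"


def extractor_url(agent_reference: Optional[str] = None, asynchronous: bool = False) -> str:
    """Return the endpoint URL for the chosen approach and processing mode.

    Hypothetical helper: without agent_reference it targets the global
    extractor, with one it targets the agent-based extractor.
    """
    path = "/v2/extractor"
    if asynchronous:
        path += "/async"
    if agent_reference:
        # Percent-encode the reference so it stays a single path segment.
        path += "/" + quote(agent_reference, safe="")
    return BASE_URL + path
```

For example, extractor_url("my-agent", asynchronous=True) yields the asynchronous agent-based endpoint for a reference named my-agent.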

Retrieving an agent reference

To use the agent-based extractor, you must first retrieve your agent’s reference:

  1. Click your profile icon in the top-right corner, then go to the Organization > API section.
  2. In the "References of your agents" section, copy the reference of the Extractor agent you want to use.

Configuring extraction models

Extraction models are configured in the Dimarc interface. To add a new model, go to your dashboard: https://app.dimarc.ai

Go to your Extractor agent configuration and add a new model by defining the fields to extract.

Processing Modes

Each approach (Global and Agent) offers two processing modes:

  1. Synchronous: The request is processed immediately and the response is returned over the same HTTP connection.
  2. Asynchronous: The processing is queued and the results are sent to a webhook you specify once the extraction is complete.

Synchronous Extraction

Synchronous extraction is ideal for quick processing or when your application is directly waiting for the result.

Approach 1: Global Extractor (Synchronous)

This method allows you to define the extraction schema directly in the request.

Request

curl --location 'https://api.dimarc.ai/v2/extractor' \
  --header 'Content-Type: application/json' \
  --header 'x-api-key: <your_api_key>' \
  --data '{
    "filename": "invoice.pdf",
    "file": "<base64_encoded_file>",
    "params": [
      {
        "name": "invoice_number",
        "format": "text",
        "description": "Invoice number"
      },
      {
        "name": "total_amount",
        "format": "number",
        "description": "Total amount including tax"
      },
      {
        "name": "products",
        "format": "list",
        "description": "List of products",
        "items": [
          {
            "name": "product_name",
            "format": "text",
            "description": "Product name"
          },
          {
            "name": "quantity",
            "format": "number",
            "description": "Quantity"
          }
        ]
      }
    ]
  }'

Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| filename | string | Original filename (with extension) |
| file | string | File content encoded in base64 |
| params | array | Array defining the extraction schema (see Parameters Structure) |
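
The same request can be prepared from application code. The following is a minimal sketch using only the Python standard library; the function name build_global_request is hypothetical, and the endpoint, headers, and body fields are copied from the curl example above. It builds the request without sending it, so the base64 encoding and JSON body can be inspected first.

```python
import base64
import json
import urllib.request


def build_global_request(api_key: str, filename: str, content: bytes, params: list) -> urllib.request.Request:
    """Prepare a synchronous global-extractor request (illustrative helper)."""
    body = {
        "filename": filename,
        # The API expects the file content as a base64 string.
        "file": base64.b64encode(content).decode("ascii"),
        "params": params,
    }
    return urllib.request.Request(
        "https://api.dimarc.ai/v2/extractor",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-api-key": api_key},
        method="POST",
    )
```

To send it, pass the returned object to urllib.request.urlopen and decode the JSON body of the response.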

Approach 2: Agent-based Extractor (Synchronous)

This method uses a preconfigured agent with an extraction schema defined in the Dimarc interface.

Request

curl --location 'https://api.dimarc.ai/v2/extractor/<agent_reference>' \
  --header 'Content-Type: application/json' \
  --header 'x-api-key: <your_api_key>' \
  --data '{
    "filename": "invoice.pdf",
    "file": "<base64_encoded_file>"
  }'

Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| filename | string | Original filename (with extension) |
| file | string | File content encoded in base64 |

Synchronous Response Format (common to both approaches)

{
  "status": "success",
  "data": {
    "invoice_number": "INV-2024-001",
    "total_amount": 1250.50,
    "products": [
      {
        "product_name": "Product A",
        "quantity": 2
      },
      {
        "product_name": "Product B",
        "quantity": 1
      }
    ]
  },
  "metadata": {
    "filename": "invoice.pdf",
    "duration": "3.45",
    "tokens_consumed": 1
  }
}
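
Note that in the sample response "duration" is a string while "tokens_consumed" is a number. The following is a minimal sketch of a parser for this response shape; the names ExtractionResult and parse_sync_response are hypothetical, and reading "duration" as seconds is an assumption, since the unit is not stated here.

```python
import json
from dataclasses import dataclass


@dataclass
class ExtractionResult:
    data: dict
    filename: str
    duration_seconds: float  # assumption: the API reports seconds
    tokens_consumed: int


def parse_sync_response(raw: str) -> ExtractionResult:
    """Turn the JSON text of a synchronous response into a typed result."""
    payload = json.loads(raw)
    if payload.get("status") != "success":
        raise ValueError(f"unexpected status: {payload.get('status')!r}")
    meta = payload["metadata"]
    return ExtractionResult(
        data=payload["data"],
        filename=meta["filename"],
        # "duration" arrives as a string in the sample, so convert it.
        duration_seconds=float(meta["duration"]),
        tokens_consumed=int(meta["tokens_consumed"]),
    )
```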

Asynchronous Extraction

Asynchronous extraction is recommended for:

  • Large documents
  • Complex extractions requiring more time
  • Systems that cannot wait for an immediate response

Approach 1: Global Extractor (Asynchronous)

This method allows you to define the extraction schema directly in the request.

Request

curl --location 'https://api.dimarc.ai/v2/extractor/async' \
  --header 'Content-Type: application/json' \
  --header 'x-api-key: <your_api_key>' \
  --data '{
    "filename": "invoice.pdf",
    "file": "<base64_encoded_file>",
    "callback": "https://your-callback-endpoint.com/webhook",
    "params": [
      {
        "name": "invoice_number",
        "format": "text",
        "description": "Invoice number"
      },
      {
        "name": "total_amount",
        "format": "number",
        "description": "Total amount including tax"
      }
    ]
  }'

Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| filename | string | Original filename (with extension) |
| file | string | File content encoded in base64 |
| callback | string | URL of the webhook that will receive the results |
| params | array | Array defining the extraction schema (see Parameters Structure) |

Approach 2: Agent-based Extractor (Asynchronous)

This method uses a preconfigured agent with an extraction schema defined in the Dimarc interface.

Request

curl --location 'https://api.dimarc.ai/v2/extractor/async/<agent_reference>' \
  --header 'Content-Type: application/json' \
  --header 'x-api-key: <your_api_key>' \
  --data '{
    "filename": "invoice.pdf",
    "file": "<base64_encoded_file>",
    "callback": "https://your-callback-endpoint.com/webhook"
  }'

Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| filename | string | Original filename (with extension) |
| file | string | File content encoded in base64 |
| callback | string | URL of the webhook that will receive the results |

Immediate Response (common to both approaches)

{
  "status": "started",
  "extract_id": "550e8400-e29b-41d4-a716-446655440000"
}
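
In asynchronous mode the caller typically keeps the extract_id so that the later webhook delivery can be matched to the original submission. The following is a minimal sketch for the agent-based variant, using only the Python standard library; build_async_agent_request and read_extract_id are hypothetical names, while the URL, headers, and fields follow the curl example above.

```python
import base64
import json
import urllib.request
from urllib.parse import quote


def build_async_agent_request(api_key: str, agent_reference: str, filename: str,
                              content: bytes, callback_url: str) -> urllib.request.Request:
    """Prepare an asynchronous agent-based request without sending it."""
    body = {
        "filename": filename,
        "file": base64.b64encode(content).decode("ascii"),
        "callback": callback_url,  # webhook that will receive the results
    }
    url = "https://api.dimarc.ai/v2/extractor/async/" + quote(agent_reference, safe="")
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-api-key": api_key},
        method="POST",
    )


def read_extract_id(raw_response: str) -> str:
    """Return the extract_id from the immediate response, or raise if not started."""
    payload = json.loads(raw_response)
    if payload.get("status") != "started":
        raise ValueError(f"extraction was not queued: {payload!r}")
    return payload["extract_id"]
```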

Webhook Response Format

When the extraction is complete, the results are sent to the specified callback URL.

On success

{
  "status": "success",
  "extract_id": "550e8400-e29b-41d4-a716-446655440000",
  "data": {
    "invoice_number": "INV-2024-001",
    "total_amount": 1250.50
  },
  "metadata": {
    "filename": "invoice.pdf",
    "duration": "3.45",
    "tokens_consumed": 1
  }
}

On error

{
  "extract_id": "550e8400-e29b-41d4-a716-446655440000",
  "error": "Extraction failed: Unable to process the document",
  "detail": "HTTP status: 500"
}
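
In the two samples, the error payload carries an "error" key and no "status" key, so a receiver can branch on the presence of "error". The following is a minimal sketch of that branching; handle_webhook is a hypothetical function that would be called from whatever HTTP framework serves the callback URL, and the payload shapes are assumed to match the samples above exactly.

```python
import json
from typing import Any, Tuple


def handle_webhook(raw_body: bytes) -> Tuple[str, str, Any]:
    """Classify a webhook delivery.

    Returns ("success", extract_id, data) or ("error", extract_id, message).
    """
    payload = json.loads(raw_body)
    extract_id = payload.get("extract_id", "")
    if "error" in payload:
        message = payload["error"]
        if payload.get("detail"):
            # Append the extra detail (for example the HTTP status) when present.
            message += f" ({payload['detail']})"
        return ("error", extract_id, message)
    if payload.get("status") == "success":
        return ("success", extract_id, payload.get("data", {}))
    raise ValueError("unrecognized webhook payload")
```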

Parameters Structure

When using the global extractor, you must define the extraction schema via the params parameter. Each element in the params array is an object with the following properties:

| Property | Type | Description |
| --- | --- | --- |
| name | string | Name of the field to extract |
| format | string | Data type: text, number, boolean, or list |
| description | string | Clear description of the field to extract (helps the AI understand) |
| items | array | (Required only for format: "list") Array of sub-fields to extract for each list item |

Supported Format Types

1. Simple Types

For scalar values, use:

  • text: For text, character strings
  • number: For numeric values
  • boolean: For true/false values

{
  "name": "invoice_number",
  "format": "text",
  "description": "Invoice number"
}

2. Lists (list format)

For repeated, structured data such as invoice line items, use the list format and define the sub-fields of each item in the items array:

{
  "name": "products",
  "format": "list",
  "description": "List of products in the invoice",
  "items": [
    {
      "name": "product_name",
      "format": "text",
      "description": "Product name"
    },
    {
      "name": "quantity",
      "format": "number",
      "description": "Ordered quantity"
    },
    {
      "name": "unit_price",
      "format": "number",
      "description": "Unit price before tax"
    }
  ]
}
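
When schemas are assembled in code rather than written by hand, two small helpers keep the entries consistent. This is only an illustrative sketch: field and list_field are hypothetical helper names, not part of any Dimarc client library, and they simply produce dictionaries in the structure described above.

```python
from typing import Dict, List


def field(name: str, fmt: str, description: str) -> Dict[str, object]:
    """Build a simple entry (format text, number, or boolean)."""
    return {"name": name, "format": fmt, "description": description}


def list_field(name: str, description: str, items: List[Dict[str, object]]) -> Dict[str, object]:
    """Build a list entry whose sub-fields are given in items."""
    return {"name": name, "format": "list", "description": description, "items": items}


# Rebuilds the "products" example shown above.
products = list_field(
    "products",
    "List of products in the invoice",
    [
        field("product_name", "text", "Product name"),
        field("quantity", "number", "Ordered quantity"),
        field("unit_price", "number", "Unit price before tax"),
    ],
)
```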

Complete Schema Example

{
  "params": [
    {
      "name": "invoice_number",
      "format": "text",
      "description": "Invoice number"
    },
    {
      "name": "invoice_date",
      "format": "text",
      "description": "Invoice issue date"
    },
    {
      "name": "total_before_tax",
      "format": "number",
      "description": "Total amount before tax"
    },
    {
      "name": "total_with_tax",
      "format": "number",
      "description": "Total amount including tax"
    },
    {
      "name": "is_paid",
      "format": "boolean",
      "description": "Is the invoice paid?"
    },
    {
      "name": "products",
      "format": "list",
      "description": "List of invoiced products",
      "items": [
        {
          "name": "designation",
          "format": "text",
          "description": "Product or service designation"
        },
        {
          "name": "quantity",
          "format": "number",
          "description": "Quantity"
        },
        {
          "name": "unit_price",
          "format": "number",
          "description": "Unit price before tax"
        }
      ]
    }
  ]
}
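
A schema can be checked on the client side before it is sent. The following is a minimal sketch of such a check, based only on the rules stated in this section. The function validate_params is hypothetical. Treating name, format, and description as all required is an assumption, and so is applying the same rules recursively to sub-fields, since this section does not say whether nested lists are accepted.

```python
from typing import List

ALLOWED_FORMATS = {"text", "number", "boolean", "list"}


def validate_params(params: list, path: str = "params") -> List[str]:
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    for index, entry in enumerate(params):
        where = f"{path}[{index}]"
        for key in ("name", "format", "description"):
            if not isinstance(entry.get(key), str) or not entry[key]:
                problems.append(f"{where}: missing or empty '{key}'")
        fmt = entry.get("format")
        if isinstance(fmt, str) and fmt and fmt not in ALLOWED_FORMATS:
            problems.append(f"{where}: unsupported format '{fmt}'")
        if fmt == "list":
            items = entry.get("items")
            if not isinstance(items, list) or not items:
                problems.append(f"{where}: 'items' is required for format 'list'")
            else:
                # Sub-fields follow the same structure as top-level entries.
                problems.extend(validate_params(items, f"{where}.items"))
    return problems
```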

Approach Comparison

| Criteria | Global Extractor | Agent-based Extractor |
| --- | --- | --- |
| Configuration | Schema defined in API request | Schema preconfigured in Dimarc interface |
| Flexibility | ⭐⭐⭐ Very flexible, different schema for each request | ⭐ Fixed schema, requires modification via interface |
| Maintenance | ⚠️ Schema managed client-side | ✅ Centralized |
| Use case | Testing, one-time extractions, dynamic integrations | Production, repetitive extractions, standardized processes |
| Reusability | ❌ Each client must define the schema | ✅ Schema shared across different systems |
| Request complexity | Larger (includes schema) | Lighter (file only) |

Which approach to choose?

Choose the Global Extractor if:

  • You are testing or prototyping
  • You need different extraction schemas for each document
  • You want quick integration without prior configuration
  • Your use case is one-time or exploratory

Choose the Agent-based Extractor if:

  • You are in production with repetitive processes
  • You always extract the same types of data
  • You want to centralize extraction schema management
  • Multiple systems need to use the same extraction schema
  • You want better tracking and traceability

Limitations and Considerations

  • Maximum file size: 36 MB
  • Supported formats: PDF, PNG, JPG, JPEG, DOCX, XLSX
  • Average extraction time: 5 to 10 seconds for standard documents (those without heavy imagery)
  • Asynchronous mode is strongly recommended for files larger than 20 MB
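
These limits can be checked before uploading. The following is a minimal sketch of such a pre-flight check; the function preflight is hypothetical. It rests on two assumptions that this section does not confirm: that 1 MB means 1024 × 1024 bytes, and that the limits apply to the file size before base64 encoding.

```python
import os

# Assumption: 1 MB = 1024 * 1024 bytes, measured before base64 encoding.
MAX_FILE_BYTES = 36 * 1024 * 1024
ASYNC_RECOMMENDED_BYTES = 20 * 1024 * 1024
SUPPORTED_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg", ".docx", ".xlsx"}


def preflight(filename: str, size_bytes: int) -> str:
    """Reject unsupported files and suggest a processing mode ("sync" or "async")."""
    extension = os.path.splitext(filename)[1].lower()
    if extension not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"unsupported format: {extension or '(no extension)'}")
    if size_bytes > MAX_FILE_BYTES:
        raise ValueError("file exceeds the 36 MB limit")
    # Above 20 MB, asynchronous mode is strongly recommended.
    return "async" if size_bytes > ASYNC_RECOMMENDED_BYTES else "sync"
```

The size can be obtained with os.path.getsize(path) before the file is read and encoded.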

Support and assistance

For any questions regarding the Extractor API, contact our support team at contact@dimarc.fr