Extractor
This document describes how to use the DIMARC API to interact with extraction endpoints. The API offers two approaches for extracting structured data from documents.
Prerequisites
- An active DIMARC account.
- Administrator status in your organization.
- Your authentication token x-api-key (see Retrieving Your Authentication Token)
Two Extraction Approaches
The Extractor API offers two methods of use depending on your needs:
1. Global Extractor
The global extractor allows you to define the extraction schema directly in the API request, without requiring prior agent configuration.
Endpoints:
- POST /v2/extractor
- POST /v2/extractor/async

Use case: Ideal for one-time extractions, testing, or when you want to dynamically define the extraction schema without going through the Dimarc interface.
2. Agent-based Extractor
The agent-based extractor uses extraction models preconfigured in the Dimarc interface. This approach is recommended for repetitive extractions with the same schema.
Endpoints:
- POST /v2/extractor/:agentReference
- POST /v2/extractor/async/:agentReference

Use case: Recommended for production use cases where you have standardized and reusable extraction models.
Retrieving an agent reference
To use the agent-based extractor, you must first retrieve your agent’s reference:
- Click your profile icon in the top right corner, then go to the Organization > API section.
- In the References of your agents section, you can retrieve the reference of the Extractor agent you want to use.
Configuring extraction models
Extraction models are configured in the Dimarc interface. To add a new model, go to your dashboard: https://app.dimarc.ai
Go to your Extractor agent configuration and add a new model by defining the fields to extract.
Processing Modes
Each approach (Global and Agent) offers two processing modes:
- Synchronous: The request is processed immediately and the response is returned over the same HTTP connection.
- Asynchronous: The processing is queued and the results are sent to a defined webhook once the extraction is completed.
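The two approaches and two processing modes combine into the four endpoints listed earlier. As a minimal sketch of that mapping, the helper below builds the URL from an optional agent reference and a mode flag. The function name extractor_url is hypothetical, the host is the one used in this document's curl examples, and percent-encoding the reference is a precaution rather than a documented requirement.

```python
from typing import Optional
from urllib.parse import quote

# Host taken from the curl examples in this document.
BASE_URL = "https://api.dimarc.ai"


def extractor_url(agent_reference: Optional[str] = None, asynchronous: bool = False) -> str:
    """Return the endpoint URL for the chosen approach and processing mode."""
    path = "/v2/extractor"
    if asynchronous:
        path += "/async"
    if agent_reference is not None:
        # Percent-encode the reference so it is safe as a single path segment
        # (an assumption; the document does not describe the reference format).
        path += "/" + quote(agent_reference, safe="")
    return BASE_URL + path
```

Calling extractor_url() targets the synchronous Global Extractor, while extractor_url("<agent_reference>", asynchronous=True) targets the asynchronous agent-based endpoint.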
Synchronous Extraction
Synchronous extraction is ideal for quick processing or when your application is directly waiting for the result.
Approach 1: Global Extractor (Synchronous)
This method allows you to define the extraction schema directly in the request.
Request

    curl --location 'https://api.dimarc.ai/v2/extractor' \
    --header 'Content-Type: application/json' \
    --header 'x-api-key: <your_api_key>' \
    --data '{
      "filename": "invoice.pdf",
      "file": "<base64_encoded_file>",
      "params": [
        { "name": "invoice_number", "format": "text", "description": "Invoice number" },
        { "name": "total_amount", "format": "number", "description": "Total amount including tax" },
        {
          "name": "products",
          "format": "list",
          "description": "List of products",
          "items": [
            { "name": "product_name", "format": "text", "description": "Product name" },
            { "name": "quantity", "format": "number", "description": "Quantity" }
          ]
        }
      ]
    }'

Parameters
| Parameter | Type | Description |
|---|---|---|
| filename | string | Original filename (with extension) |
| file | string | File content encoded in base64 |
| params | array | Array defining the extraction schema (see Parameters Structure) |
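The same request can be assembled from a script. The following is a minimal sketch using only the Python standard library: it base64-encodes the file content, attaches the schema, and builds the POST request without sending it. The function name build_global_request is hypothetical; the URL, header names, and body fields come from the curl example above.

```python
import base64
import json
import urllib.request

API_URL = "https://api.dimarc.ai/v2/extractor"


def build_global_request(api_key: str, filename: str, content: bytes, params: list) -> urllib.request.Request:
    """Build (but do not send) a synchronous Global Extractor request."""
    payload = {
        "filename": filename,
        # The API expects the file content as a base64 string.
        "file": base64.b64encode(content).decode("ascii"),
        "params": params,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-api-key": api_key},
        method="POST",
    )


# Sending is left to the caller, for example:
#   with urllib.request.urlopen(build_global_request(...)) as response:
#       result = json.load(response)
```

Keeping the build step separate from the network call makes the payload easy to inspect or log before anything is sent.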
Approach 2: Agent-based Extractor (Synchronous)
This method uses a preconfigured agent with an extraction schema defined in the Dimarc interface.
Request

    curl --location 'https://api.dimarc.ai/v2/extractor/<agent_reference>' \
    --header 'Content-Type: application/json' \
    --header 'x-api-key: <your_api_key>' \
    --data '{
      "filename": "invoice.pdf",
      "file": "<base64_encoded_file>"
    }'

Parameters
| Parameter | Type | Description |
|---|---|---|
| filename | string | Original filename (with extension) |
| file | string | File content encoded in base64 |
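An agent-based call can be wrapped in one function that sends the file and returns the extracted fields. The sketch below assumes the response shape documented under Synchronous Response Format, and it treats any status other than "success" as a failure, which is an assumption because this document does not describe synchronous error responses. The name extract_with_agent and the opener parameter (present only so the network call can be replaced in tests) are hypothetical.

```python
import base64
import json
import urllib.request
from urllib.parse import quote


def extract_with_agent(api_key, agent_reference, filename, content, opener=urllib.request.urlopen):
    """Send a synchronous agent-based extraction and return the "data" object."""
    payload = {
        "filename": filename,
        "file": base64.b64encode(content).decode("ascii"),
    }
    request = urllib.request.Request(
        "https://api.dimarc.ai/v2/extractor/" + quote(agent_reference, safe=""),
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-api-key": api_key},
        method="POST",
    )
    with opener(request) as response:
        result = json.load(response)
    # Assumption: anything other than status "success" is treated as a failure.
    if result.get("status") != "success":
        raise RuntimeError(f"Extraction did not succeed: {result!r}")
    return result["data"]
```

No params field is sent here, since the schema comes from the agent configured in the Dimarc interface.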
Synchronous Response Format (common to both approaches)

    {
      "status": "success",
      "data": {
        "invoice_number": "INV-2024-001",
        "total_amount": 1250.50,
        "products": [
          { "product_name": "Product A", "quantity": 2 },
          { "product_name": "Product B", "quantity": 1 }
        ]
      },
      "metadata": {
        "filename": "invoice.pdf",
        "duration": "3.45",
        "tokens_consumed": 1
      }
    }

Asynchronous Extraction
Asynchronous extraction is recommended for:
- Large documents
- Complex extractions requiring more time
- Systems that cannot wait for an immediate response
Approach 1: Global Extractor (Asynchronous)
This method allows you to define the extraction schema directly in the request.
Request

    curl --location 'https://api.dimarc.ai/v2/extractor/async' \
    --header 'Content-Type: application/json' \
    --header 'x-api-key: <your_api_key>' \
    --data '{
      "filename": "invoice.pdf",
      "file": "<base64_encoded_file>",
      "callback": "https://your-callback-endpoint.com/webhook",
      "params": [
        { "name": "invoice_number", "format": "text", "description": "Invoice number" },
        { "name": "total_amount", "format": "number", "description": "Total amount including tax" }
      ]
    }'

Parameters
| Parameter | Type | Description |
|---|---|---|
| filename | string | Original filename (with extension) |
| file | string | File content encoded in base64 |
| callback | string | URL of the webhook that will receive the results |
| params | array | Array defining the extraction schema (see Parameters Structure) |
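For the asynchronous variant, a client typically builds the body with the callback URL and then keeps the extract_id from the immediate response so the later webhook can be matched to the request. The sketch below illustrates both steps under stated assumptions: the function names are hypothetical, the check that the callback is an absolute http(s) URL is a client-side precaution rather than a documented API rule, and the immediate response is assumed to have the shape shown under Immediate Response.

```python
import base64
import json
from urllib.parse import urlparse


def build_async_payload(filename: str, content: bytes, callback: str, params: list) -> bytes:
    """Return the JSON body for POST /v2/extractor/async."""
    parsed = urlparse(callback)
    # Client-side precaution (assumption): reject callbacks that are not absolute http(s) URLs.
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"callback must be an absolute http(s) URL, got {callback!r}")
    payload = {
        "filename": filename,
        "file": base64.b64encode(content).decode("ascii"),
        "callback": callback,
        "params": params,
    }
    return json.dumps(payload).encode("utf-8")


def parse_started_response(body: bytes) -> str:
    """Return the extract_id from the immediate response, or raise if the job did not start."""
    result = json.loads(body)
    if result.get("status") != "started":
        raise RuntimeError(f"Extraction was not queued: {result!r}")
    return result["extract_id"]
```

Storing the returned extract_id alongside the submitted document gives the webhook receiver a key to look up when results arrive.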
Approach 2: Agent-based Extractor (Asynchronous)
This method uses a preconfigured agent with an extraction schema defined in the Dimarc interface.
Request

    curl --location 'https://api.dimarc.ai/v2/extractor/async/<agent_reference>' \
    --header 'Content-Type: application/json' \
    --header 'x-api-key: <your_api_key>' \
    --data '{
      "filename": "invoice.pdf",
      "file": "<base64_encoded_file>",
      "callback": "https://your-callback-endpoint.com/webhook"
    }'

Parameters
| Parameter | Type | Description |
|---|---|---|
| filename | string | Original filename (with extension) |
| file | string | File content encoded in base64 |
| callback | string | URL of the webhook that will receive the results |
Immediate Response (common to both approaches)

    {
      "status": "started",
      "extract_id": "550e8400-e29b-41d4-a716-446655440000"
    }

Webhook Response Format
When the extraction is complete, the results are sent to the specified callback URL.
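A receiving endpoint might parse the callback body along these lines. This is a minimal sketch that assumes the success and error payload shapes documented in this section; WebhookResult and parse_webhook are hypothetical names, and the HTTP server that receives the POST is left out.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class WebhookResult:
    extract_id: str
    ok: bool
    data: Optional[dict] = None
    error: Optional[str] = None
    detail: Optional[str] = None


def parse_webhook(body: bytes) -> WebhookResult:
    """Classify a webhook body as a success or an error result."""
    payload = json.loads(body)
    extract_id = payload["extract_id"]
    # Error payloads in this document carry an "error" key and no "status" key.
    if "error" in payload:
        return WebhookResult(extract_id, ok=False, error=payload["error"], detail=payload.get("detail"))
    if payload.get("status") == "success":
        return WebhookResult(extract_id, ok=True, data=payload.get("data", {}))
    # Anything else is outside the two documented shapes.
    raise ValueError(f"Unrecognized webhook payload: {payload!r}")
```

The extract_id in the result matches the one returned when the job was queued, so it can be used to locate the original request.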
On success

    {
      "status": "success",
      "extract_id": "550e8400-e29b-41d4-a716-446655440000",
      "data": {
        "invoice_number": "INV-2024-001",
        "total_amount": 1250.50
      },
      "metadata": {
        "filename": "invoice.pdf",
        "duration": "3.45",
        "tokens_consumed": 1
      }
    }

On error

    {
      "extract_id": "550e8400-e29b-41d4-a716-446655440000",
      "error": "Extraction failed: Unable to process the document",
      "detail": "HTTP status: 500"
    }

Parameters Structure
When using the global extractor, you must define the extraction schema via the params parameter. Each element in the params array is an object with the following properties:
| Property | Type | Description |
|---|---|---|
| name | string | Name of the field to extract |
| format | string | Data type: text, number, boolean, or list |
| description | string | Clear description of the field to extract (helps the AI understand) |
| items | array | (Required only for format: "list") Array of sub-fields to extract for each list item |
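Because the schema is managed client-side with the global extractor, it can help to check it before sending. The sketch below validates a params array against the properties in the table above. The function name validate_params is hypothetical, treating name, format, and description as mandatory is inferred from the table rather than stated, and the recursion accepts lists nested inside lists, which this document neither confirms nor rules out.

```python
SIMPLE_FORMATS = {"text", "number", "boolean"}


def validate_params(params, path="params"):
    """Return a list of problems found in the schema; an empty list means none were found."""
    if not isinstance(params, list) or not params:
        return [f"{path}: expected a non-empty array"]
    problems = []
    for index, field in enumerate(params):
        where = f"{path}[{index}]"
        if not isinstance(field, dict):
            problems.append(f"{where}: expected an object")
            continue
        for key in ("name", "format", "description"):
            value = field.get(key)
            if not isinstance(value, str) or not value:
                problems.append(f"{where}: missing or empty '{key}'")
        fmt = field.get("format")
        if fmt == "list":
            items = field.get("items")
            if not isinstance(items, list) or not items:
                problems.append(f"{where}: 'items' is required for format 'list'")
            else:
                # Sub-fields use the same structure, so validate them recursively.
                problems.extend(validate_params(items, f"{where}.items"))
        elif isinstance(fmt, str) and fmt and fmt not in SIMPLE_FORMATS:
            problems.append(f"{where}: unsupported format '{fmt}'")
    return problems
```

Returning all problems at once, rather than raising on the first, makes it easier to fix a schema in a single pass.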
Supported Format Types
1. Simple Types
For scalar values, use:
- text: For text, character strings
- number: For numeric values
- boolean: For true/false values

    {
      "name": "invoice_number",
      "format": "text",
      "description": "Invoice number"
    }

2. Lists (list format)
For complex data like arrays, series of items, etc., use the list format and define sub-elements in the items array:

    {
      "name": "products",
      "format": "list",
      "description": "List of products in the invoice",
      "items": [
        { "name": "product_name", "format": "text", "description": "Product name" },
        { "name": "quantity", "format": "number", "description": "Ordered quantity" },
        { "name": "unit_price", "format": "number", "description": "Unit price before tax" }
      ]
    }

Complete Schema Example

    {
      "params": [
        { "name": "invoice_number", "format": "text", "description": "Invoice number" },
        { "name": "invoice_date", "format": "text", "description": "Invoice issue date" },
        { "name": "total_before_tax", "format": "number", "description": "Total amount before tax" },
        { "name": "total_with_tax", "format": "number", "description": "Total amount including tax" },
        { "name": "is_paid", "format": "boolean", "description": "Is the invoice paid?" },
        {
          "name": "products",
          "format": "list",
          "description": "List of invoiced products",
          "items": [
            { "name": "designation", "format": "text", "description": "Product or service designation" },
            { "name": "quantity", "format": "number", "description": "Quantity" },
            { "name": "unit_price", "format": "number", "description": "Unit price before tax" }
          ]
        }
      ]
    }

Approach Comparison
| Criteria | Global Extractor | Agent-based Extractor |
|---|---|---|
| Configuration | Schema defined in API request | Schema preconfigured in Dimarc interface |
| Flexibility | ⭐⭐⭐ Very flexible, different schema for each request | ⭐ Fixed schema, requires modification via interface |
| Maintenance | ⚠️ Schema managed client-side | ✅ Centralized |
| Use case | Testing, one-time extractions, dynamic integrations | Production, repetitive extractions, standardized processes |
| Reusability | ❌ Each client must define the schema | ✅ Schema shared across different systems |
| Request complexity | Larger (includes schema) | Lighter (file only) |
Which approach to choose?
Choose the Global Extractor if:
- You are testing or prototyping
- You need different extraction schemas for each document
- You want quick integration without prior configuration
- Your use case is one-time or exploratory
Choose the Agent-based Extractor if:
- You are in production with repetitive processes
- You always extract the same types of data
- You want to centralize extraction schema management
- Multiple systems need to use the same extraction schema
- You want better tracking and traceability
Limitations and Considerations
- Maximum file size: 36 MB
- Supported formats: PDF, PNG, JPG, JPEG, DOCX, XLSX
- Average extraction time: 5 to 10 seconds for standard documents (those with few images)
- Asynchronous mode is strongly recommended for files > 20 MB
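These limits can be checked on the client before a file is uploaded. The sketch below is one way to do that, with several assumptions: 1 MB is taken as 1,048,576 bytes, the 36 MB limit is applied to the raw file rather than its base64 form, and the format is judged by file extension only. The name preflight and the constants are hypothetical.

```python
from pathlib import PurePath

# Assumption: 1 MB = 1,048,576 bytes, measured on the raw (not base64-encoded) file.
MAX_FILE_BYTES = 36 * 1024 * 1024
ASYNC_RECOMMENDED_BYTES = 20 * 1024 * 1024
SUPPORTED_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg", ".docx", ".xlsx"}


def preflight(filename: str, size_bytes: int) -> str:
    """Return the suggested mode ("sync" or "async"), or raise ValueError if the file is not acceptable."""
    extension = PurePath(filename).suffix.lower()
    if extension not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"Unsupported format: {extension or '(no extension)'}")
    if size_bytes > MAX_FILE_BYTES:
        raise ValueError(f"File is {size_bytes} bytes; the maximum is {MAX_FILE_BYTES}")
    # Asynchronous mode is strongly recommended above 20 MB.
    return "async" if size_bytes > ASYNC_RECOMMENDED_BYTES else "sync"
```

The returned mode is only a suggestion based on size; a small but complex document may still be better served by asynchronous mode.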
Support and assistance
For any questions regarding the Extractor API, contact our support team at contact@dimarc.fr