Extractor
This document describes how to use the DIMARC API to interact with the /extractor and /extractor/async endpoints. These endpoints allow you to extract structured data from documents of any type according to models defined in the Dimarc interface.
POST /v2/extractor
POST /v2/extractor/async
Prerequisites
- An active DIMARC account.
- Administrator status in your organization.
- A configured “Extractor” type agent.
- Your authentication token, x-api-key (see Retrieving Your Authentication Token)
Retrieving an agent’s ID
Each agent has a unique ID. To retrieve the ID of the Extractor agent, go to your dashboard:
- Click your profile icon in the top-right corner, then go to the Organization > API section.
- In the “References of your agents” section, copy the ID of the Extractor agent you want to use.
Extraction Modes
The Extractor API offers two operating modes:
- Synchronous: The request is processed immediately and the response is returned over the same HTTP connection.
- Asynchronous: The processing is queued and the results are sent to a defined webhook once the extraction is completed.
Where to configure extraction models
Extraction models are configured in the Dimarc interface. To add a new model, go to your dashboard: https://app.dimarc.ai
Go to your Extractor agent configuration and add a new model by defining the fields to extract.
Synchronous Extraction
Synchronous extraction is ideal for quick processing or when your application is directly waiting for the result.
Synchronous Request
curl --location 'https://api.dimarc.ai/v2/extractor/<agent_id>' \
  --header 'Content-Type: application/json' \
  --header 'x-api-key: <your_api_key>' \
  --data '{
    "filename": "L100.pdf",
    "file": "<base64_encoded_file>"
  }'
Synchronous Request Parameters
Parameter | Type | Description |
---|---|---|
filename | string | Original filename (with extension) |
file | string | File content encoded in base64 |
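The two parameters above can be assembled in Python as a minimal sketch. The helper names `build_payload` and `build_sync_request` are hypothetical and not part of any Dimarc SDK; the URL and header names are taken from the curl example, and only the standard library is used.

```python
import base64
import json
import urllib.request

# Base URL copied from the curl example in this document.
API_BASE = "https://api.dimarc.ai/v2/extractor"


def build_payload(filename: str, content: bytes) -> dict:
    """Return the JSON body: original filename plus base64-encoded content."""
    return {
        "filename": filename,
        "file": base64.b64encode(content).decode("ascii"),
    }


def build_sync_request(agent_id: str, api_key: str, payload: dict) -> urllib.request.Request:
    """Prepare (but do not send) the POST request for synchronous extraction."""
    return urllib.request.Request(
        url=f"{API_BASE}/{agent_id}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-api-key": api_key},
        method="POST",
    )
```

To send the request, pass the result to `urllib.request.urlopen` and decode the body with `json.load`. Building the request separately from sending it keeps the payload logic easy to test without network access.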
Synchronous Response Format
{ "status": "success", "data": { .... }}
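A client will usually want to check the status field before using the data. The sketch below assumes that any status other than "success" should be treated as a failure; this document only shows the success case, so the error handling and the `ExtractionError` name are illustrative.

```python
import json


class ExtractionError(Exception):
    """Raised when the response does not report success (hypothetical helper)."""


def extracted_data(raw_body: str) -> dict:
    """Return the "data" object of a synchronous response, or raise on failure."""
    body = json.loads(raw_body)
    status = body.get("status")
    # Assumption: only "success" is documented, so anything else is an error.
    if status != "success":
        raise ExtractionError(f"unexpected status: {status!r}")
    return body.get("data", {})
```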
Asynchronous Extraction
Asynchronous extraction is recommended for:
- Large documents
- Complex extractions requiring more time
- Systems that cannot wait for an immediate response
Asynchronous Request
curl --location 'https://api.dimarc.ai/v2/extractor/async/<agent_id>' \
  --header 'Content-Type: application/json' \
  --header 'x-api-key: <your_api_key>' \
  --data '{
    "filename": "L100.pdf",
    "file": "<base64_encoded_file>",
    "callback": "https://your-callback-endpoint.com/webhook"
  }'
Asynchronous Request Parameters
Parameter | Type | Description |
---|---|---|
filename | string | Original filename (with extension) |
file | string | File content encoded in base64 |
callback | string | URL of the webhook that will receive the results |
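As a rough illustration, the asynchronous body adds the callback field to the same filename and file pair. The function name `build_async_payload` is hypothetical, and the check that the callback is an absolute http(s) URL is a local sanity check, not a rule stated by the API.

```python
import base64
from urllib.parse import urlparse


def build_async_payload(filename: str, content: bytes, callback: str) -> dict:
    """Return the JSON body for an asynchronous extraction request."""
    parsed = urlparse(callback)
    # Local sanity check (assumption): the webhook must be an absolute http(s) URL.
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"callback is not an absolute http(s) URL: {callback!r}")
    return {
        "filename": filename,
        "file": base64.b64encode(content).decode("ascii"),
        "callback": callback,
    }
```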
Immediate Response (asynchronous)
{ "message": "Extraction started", "data": { "extraction_id": "uid_extraction" }}
Webhook Response Format
{ "extract_id": "uid_extraction", "data": { .... }}
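A receiving endpoint could look like the following sketch, built on Python's standard `http.server`. It assumes the webhook is delivered as an HTTP POST with a JSON body, which this document does not state explicitly. The sample responses here name the identifier `extraction_id` in the immediate response and `extract_id` in the webhook body, so the sketch accepts either key rather than guessing. The names `parse_webhook`, `WebhookHandler`, and `serve` are illustrative.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def parse_webhook(raw: bytes) -> tuple:
    """Return (extraction id, data) from a webhook body, or raise ValueError."""
    body = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(body, dict):
        raise ValueError("webhook body is not a JSON object")
    # Accept both spellings that appear in the sample responses.
    extraction_id = body.get("extract_id") or body.get("extraction_id")
    if extraction_id is None:
        raise ValueError("webhook body has no extraction identifier")
    return extraction_id, body.get("data", {})


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            extraction_id, data = parse_webhook(self.rfile.read(length))
        except ValueError:
            self.send_response(400)
            self.end_headers()
            return
        print(f"extraction {extraction_id}: {len(data)} top-level fields")
        self.send_response(204)  # acknowledge receipt with an empty response
        self.end_headers()


def serve(port: int = 8000) -> None:
    """Run the webhook receiver until interrupted."""
    HTTPServer(("0.0.0.0", port), WebhookHandler).serve_forever()
```

Calling `serve()` starts the receiver. Matching the identifier against the one returned by the immediate response lets an application pair each result with the request that triggered it.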
Structured Data Extraction
For complex data like lists (tables, series of items, etc.), use the “list” format and define sub-elements in the “items” array:
{
  "name": "products",
  "format": "list",
  "description": "List of products in the invoice",
  "items": [
    {
      "name": "product_name",
      "format": "text",
      "description": "Product name",
      "items": []
    },
    {
      "name": "quantity",
      "format": "text",
      "description": "Ordered quantity",
      "items": []
    },
    {
      "name": "unit_price",
      "format": "text",
      "description": "Unit price before tax",
      "items": []
    }
  ]
}
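To review a nested model before entering it in the interface, it can help to flatten the definition into a list of field paths. The sketch below walks the "items" arrays recursively; `field_paths` is a hypothetical helper, and the `[]` path notation is only a convention chosen for this example.

```python
def field_paths(field: dict, prefix: str = "") -> list:
    """List the leaf fields of a model definition as dotted paths."""
    path = f"{prefix}{field['name']}"
    if field["format"] == "list":
        # A list field contributes one path per sub-element in "items".
        paths = []
        for child in field["items"]:
            paths.extend(field_paths(child, prefix=f"{path}[]."))
        return paths
    return [path]


products = {
    "name": "products",
    "format": "list",
    "description": "List of products in the invoice",
    "items": [
        {"name": "product_name", "format": "text", "description": "Product name", "items": []},
        {"name": "quantity", "format": "text", "description": "Ordered quantity", "items": []},
        {"name": "unit_price", "format": "text", "description": "Unit price before tax", "items": []},
    ],
}

print(field_paths(products))
```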
Limitations and considerations
- Maximum file size: 36 MB
- Supported formats: PDF, PNG, JPG, JPEG, DOCX, XLSX
- Average extraction time: 5 to 10 minutes for standard documents (those without heavy imagery)
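The size and format limits above can be checked on the client before uploading. This is a sketch under two assumptions that the document does not confirm: that "36 MB" means 36 × 1024 × 1024 bytes, and that the limit applies to the raw file rather than to its base64 encoding (which is roughly one third larger). The name `check_file` is hypothetical.

```python
from pathlib import Path

# Assumption: 36 MB is read as 36 MiB and applies to the raw (unencoded) file.
MAX_FILE_BYTES = 36 * 1024 * 1024
SUPPORTED_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg", ".docx", ".xlsx"}


def check_file(filename: str, size_bytes: int) -> list:
    """Return a list of problems; an empty list means the file passes."""
    problems = []
    extension = Path(filename).suffix.lower()
    if extension not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported format: {extension or '(none)'}")
    if size_bytes > MAX_FILE_BYTES:
        problems.append(f"file too large: {size_bytes} bytes")
    return problems
```

For example, `check_file("L100.pdf", 1024)` returns an empty list, while a 40 MiB `.txt` file yields two problems.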
Support and assistance
For any questions regarding the Extractor API, contact our support team at support@dimarc.fr