
Dataset Intelligence

Experimental · 51.85 credits / 1K pages

Corpus-level knowledge graph construction, ontology induction, and incremental dataset ingestion. Transforms pipeline outputs into entities, relations, graph embeddings, and ontological concepts.

Production Recommendation

This is a direct endpoint for development and testing. For production workloads, use the Data Intelligence Pipeline: it provides structured Data Packages with quality metrics, is async by default, and is covered by Enterprise SLAs.

Overview

The Dataset Intelligence service turns pipeline outputs into structured knowledge at corpus scale.

Key features:

  • 3-tier processing: enrichment (tier 1), knowledge graph (tier 2), ontology (tier 3)
  • Entity resolution and deduplication across documents
  • RotatE link prediction for knowledge graph completion
  • Concept clustering and SHACL ontology induction
  • Delta-aware append mode for incremental ingestion
  • Automatic B2 presigned upload for large payloads

API Reference

POST https://api.latence.ai/api/v1/dataset_intelligence/process

Submit a Dataset Intelligence job. Always asynchronous: returns a job_id for polling.

Request Parameters

tier (string, default: full)
    Processing tier. One of: tier1, tier2, tier3, full.
input_data (object)
    Pipeline output payload (inline). Mutually exclusive with input_url.
input_url (string)
    B2 presigned URL to pipeline output. Use /api/v1/di/presign for large payloads.
dataset_id (string)
    Existing dataset ID for append mode (e.g. ds_abc123).
mode (string, default: create)
    Ingestion mode. One of: create, append.
name (string)
    Human-readable dataset name.
config_overrides (object)
    Override tier-specific configuration.
total_pages (integer)
    Page count for cost estimation and billing.
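As a sketch of the constraints above, the following hypothetical helper (not part of the SDK) assembles a process request body and enforces the documented rules: a valid tier and mode, exactly one input source, and a dataset_id whenever mode is append.

```python
VALID_TIERS = {"tier1", "tier2", "tier3", "full"}
VALID_MODES = {"create", "append"}

def build_process_request(input_data=None, input_url=None, tier="full",
                          mode="create", dataset_id=None, name=None,
                          config_overrides=None, total_pages=None):
    """Assemble a request body for POST /api/v1/dataset_intelligence/process."""
    if tier not in VALID_TIERS:
        raise ValueError(f"INVALID_TIER: tier must be one of: {sorted(VALID_TIERS)}")
    if mode not in VALID_MODES:
        raise ValueError(f"INVALID_MODE: mode must be one of: {sorted(VALID_MODES)}")
    # input_data and input_url are mutually exclusive, and one is required.
    if (input_data is None) == (input_url is None):
        raise ValueError("MISSING_INPUT: provide exactly one of input_data or input_url")
    if mode == "append" and dataset_id is None:
        raise ValueError("MISSING_DATASET_ID: dataset_id is required for append mode")

    body = {"tier": tier, "mode": mode}
    if input_data is not None:
        body["input_data"] = input_data
    else:
        body["input_url"] = input_url
    # Optional fields are omitted entirely when unset.
    for key, value in (("dataset_id", dataset_id), ("name", name),
                       ("config_overrides", config_overrides),
                       ("total_pages", total_pages)):
        if value is not None:
            body[key] = value
    return body
```

Validating client-side this way surfaces the 400-class errors listed under Error Handling before any credits are at stake.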

Response Fields

job_id (string)
    Job identifier (prefix: di_).
dataset_id (string)
    Dataset identifier (prefix: ds_).
status (string)
    Initial status: QUEUED.
poll_url (string)
    URL to poll for job status.
cost_estimated (number, nullable)
    Estimated cost in USD.
pre_billed (boolean)
    Whether the estimated cost was pre-deducted.

Response Example

200 OK (JSON)
{
  "job_id": "di_abc123def456",
  "dataset_id": "ds_xyz789",
  "status": "QUEUED",
  "poll_url": "/api/v1/pipeline/di_abc123def456",
  "cost_estimated": 5.35,
  "pre_billed": true
}
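Because jobs are always asynchronous, clients must poll the returned poll_url until the job leaves its initial QUEUED state. Below is a minimal polling loop; COMPLETED and FAILED as terminal statuses are an assumption (the response above only documents QUEUED), and `fetch` is any callable standing in for an HTTP GET that returns the decoded JSON body.

```python
import time

TERMINAL_STATUSES = {"COMPLETED", "FAILED"}  # assumed terminal statuses

def poll_job(job_id, fetch, interval=5.0, max_attempts=120):
    """Poll GET /api/v1/pipeline/{job_id} until the job reaches a terminal status.

    `fetch` maps a path to a decoded JSON dict, keeping the loop
    independent of whichever HTTP client is in use.
    """
    for attempt in range(max_attempts):
        doc = fetch(f"/api/v1/pipeline/{job_id}")
        if doc.get("status") in TERMINAL_STATUSES:
            return doc
        if attempt < max_attempts - 1:
            time.sleep(interval)
    raise TimeoutError(f"job {job_id} still pending after {max_attempts} polls")
```

Injecting `fetch` also makes the loop trivial to unit-test with canned responses.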
POST https://api.latence.ai/api/v1/di/presign

Get a presigned upload URL for large DI payloads (>8 MB). Upload the pipeline output via PUT, then pass the download URL as input_url.

Request Parameters

content_type (string, default: application/json)
    MIME type of the upload.

Response Fields

upload_url (string)
    PUT this URL with the pipeline output JSON.
download_url (string)
    Pass this as input_url in the process request.

Response Example

200 OK (JSON)
{
  "upload_url": "https://s3.us-west-004.backblazeb2.com/...",
  "download_url": "https://s3.us-west-004.backblazeb2.com/..."
}
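One way to decide between inline input_data and the presign flow is to measure the serialized payload against the documented 8 MB threshold. A sketch follows; the exact size accounting the service applies (and this JSON serialization) is an assumption.

```python
import json

INLINE_LIMIT_BYTES = 8 * 1024 * 1024  # documented 8 MB inline limit

def needs_presign(payload: dict, limit: int = INLINE_LIMIT_BYTES) -> bool:
    """Return True when the JSON-encoded payload exceeds the inline limit
    and should be uploaded via POST /api/v1/di/presign instead."""
    return len(json.dumps(payload).encode("utf-8")) > limit
```

When this returns True, PUT the serialized payload to upload_url and pass download_url as input_url in the process request.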

Error Handling

All errors return a JSON body with error and details fields.

400 INVALID_TIER
    Unknown processing tier. tier must be one of: full, tier1, tier2, tier3.
400 INVALID_MODE
    Unknown ingestion mode. mode must be one of: append, create.
400 MISSING_INPUT
    No input data provided. Provide input_data (inline) or input_url (presigned URL).
400 MISSING_DATASET_ID
    Append mode requires an existing dataset; dataset_id is required for append mode.
402 INSUFFICIENT_BALANCE
    Insufficient balance: not enough credits for the estimated cost.
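A small, hypothetical client-side check that maps the error body shape above (error and details fields) onto Python exceptions; the exception types chosen are this sketch's own, not part of the SDK.

```python
class InsufficientBalanceError(RuntimeError):
    """Raised for 402 INSUFFICIENT_BALANCE so callers can top up and retry."""

def check_di_response(status_code: int, body: dict) -> dict:
    """Return the body on success; raise a typed error otherwise.

    All DI errors carry `error` (machine-readable code) and
    `details` (human-readable text).
    """
    if status_code < 400:
        return body
    message = f"{body.get('error')}: {body.get('details')}"
    if status_code == 402:
        raise InsufficientBalanceError(message)
    raise ValueError(message)
```

Separating the 402 case lets billing failures be handled differently from request-validation (400) errors.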

Billing

Pricing Formula

cost = (pages / 1,000) × tier_rate × mode_discount

where mode_discount is 1.0 for create mode and 0.7 for append mode (−30%).

Add-ons & Multipliers

Tier 1 (enrich): $1.00 / 1K pages
    Semantic enrichment, feature vectors.
Tier 2 (build_graph): $10.00 / 1K pages
    Entity resolution, knowledge graph, RotatE link prediction.
Tier 3 (build_ontology): $50.00 / 1K pages
    Concept clustering, ontology induction, SHACL shapes.
Full (run): $51.85 / 1K pages
    All 3 tiers (15% bundle discount).
Append mode: −30%
    Discount for incremental ingestion into an existing dataset.

Pricing Examples

Full tier, 1,000 pages, create mode: $51.85
Full tier, 629 pages, create mode: $32.61
Tier 2 only, 1,000 pages, create mode: $10.00
Full tier, 500 pages, append mode: $18.15
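The formula and rates above can be checked with a short estimator. Decimal avoids float rounding surprises; rounding half-up to whole cents is an assumption, but it reproduces all four published examples.

```python
from decimal import Decimal, ROUND_HALF_UP

TIER_RATES = {  # USD per 1,000 pages
    "tier1": Decimal("1.00"),
    "tier2": Decimal("10.00"),
    "tier3": Decimal("50.00"),
    "full":  Decimal("51.85"),  # tier1 + tier2 + tier3 with 15% bundle discount
}

def estimate_cost(pages: int, tier: str = "full", mode: str = "create") -> float:
    """cost = (pages / 1,000) x tier_rate x mode_discount"""
    cost = Decimal(pages) / Decimal(1000) * TIER_RATES[tier]
    if mode == "append":
        cost *= Decimal("0.7")  # append mode: -30%
    return float(cost.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP))
```

For instance, 500 pages at the full tier in append mode is 0.5 × 51.85 × 0.7 = 18.1475, which rounds to the $18.15 shown above.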

Code Examples

from latence import Latence

client = Latence(api_key="YOUR_API_KEY")
di = client.experimental.dataset_intelligence_service

# Create a new dataset from pipeline output
job = di.run(input_data=pipeline_output, return_job=True)
print(f"Job: {job.job_id}")
# Poll at GET /api/v1/pipeline/{job.job_id}

# Append new documents to an existing dataset
# (mode defaults to "create", so set it explicitly when appending)
delta = di.run(
    input_data=new_pipeline_output,
    dataset_id="ds_existing_id",
    mode="append",
    return_job=True,
)

# Individual tiers
result = di.enrich(input_data=pipeline_output)        # Tier 1
result = di.build_graph(input_data=pipeline_output)    # Tier 2
result = di.build_ontology(input_data=pipeline_output) # Tier 3

Best Practices

  • Run a pipeline first: Dataset Intelligence requires pipeline output as input.
  • Use the full tier for best results; run individual tiers only when you need specific outputs.
  • For large payloads (>8 MB), the SDK automatically uses B2 presigned upload.
  • Use append mode with dataset_id to update datasets incrementally without reprocessing everything.
  • Always pass total_pages for accurate cost estimation and upfront billing.
  • Use return_job=True for production workloads; synchronous calls may time out on large datasets.
  • The delta_summary field in append-mode responses shows exactly what changed.

Explore Tutorials & Notebooks

Deep-dive examples and interactive notebooks in our GitHub repository


Looking for production-grade processing?

The Data Intelligence Pipeline chains services automatically and returns structured Data Packages.