
Dataset Intelligence

Experimental · 51.85 credits / 1K pages

Corpus-level knowledge graph construction, ontology induction, and incremental dataset ingestion. Transforms pipeline outputs into entities, relations, graph embeddings, and ontological concepts.

Production Recommendation

This is a direct endpoint for development and testing. For production workloads, use the Data Intelligence Pipeline: it provides structured Data Packages with quality metrics, is async by default, and is covered by Enterprise SLAs.

Overview

The Dataset Intelligence service turns pipeline outputs into structured knowledge at corpus scale.

Key features:

  • 3-tier processing: enrichment (tier 1), knowledge graph (tier 2), ontology (tier 3)
  • Entity resolution and deduplication across documents
  • RotatE link prediction for knowledge graph completion
  • Concept clustering and SHACL ontology induction
  • Delta-aware append mode for incremental ingestion
  • Automatic B2 presigned upload for large payloads

API Reference

POST https://api.latence.ai/api/v1/dataset_intelligence/process

Submit a Dataset Intelligence job. Always asynchronous: returns a job_id for polling.

Request Parameters

tier (string, default: full)
    Processing tier. One of: tier1, tier2, tier3, full.
input_data (object)
    Pipeline output payload (inline). Mutually exclusive with input_url.
input_url (string)
    B2 presigned URL to pipeline output. Use /api/v1/di/presign for large payloads.
dataset_id (string)
    Existing dataset ID for append mode (e.g. ds_abc123).
mode (string, default: create)
    Ingestion mode. One of: create, append.
name (string)
    Human-readable dataset name.
config_overrides (object)
    Override tier-specific configuration.
total_pages (integer)
    Page count for cost estimation and billing.
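As a sketch of the constraints above, the following hypothetical helper (not part of the SDK) assembles a process request body and enforces the documented rules: a valid tier and mode, exactly one input source, and a dataset_id whenever mode is append.

```python
VALID_TIERS = {"tier1", "tier2", "tier3", "full"}
VALID_MODES = {"create", "append"}

def build_process_request(input_data=None, input_url=None, tier="full",
                          mode="create", dataset_id=None, name=None,
                          config_overrides=None, total_pages=None):
    """Assemble a request body for POST /api/v1/dataset_intelligence/process."""
    if tier not in VALID_TIERS:
        raise ValueError(f"INVALID_TIER: tier must be one of: {sorted(VALID_TIERS)}")
    if mode not in VALID_MODES:
        raise ValueError(f"INVALID_MODE: mode must be one of: {sorted(VALID_MODES)}")
    # input_data and input_url are mutually exclusive, and one is required.
    if (input_data is None) == (input_url is None):
        raise ValueError("MISSING_INPUT: provide exactly one of input_data or input_url")
    if mode == "append" and dataset_id is None:
        raise ValueError("MISSING_DATASET_ID: dataset_id is required for append mode")

    body = {"tier": tier, "mode": mode}
    if input_data is not None:
        body["input_data"] = input_data
    else:
        body["input_url"] = input_url
    # Optional fields are omitted entirely when unset.
    for key, value in (("dataset_id", dataset_id), ("name", name),
                       ("config_overrides", config_overrides),
                       ("total_pages", total_pages)):
        if value is not None:
            body[key] = value
    return body
```

Validating client-side this way surfaces the 400-class errors listed under Error Handling before any credits are at stake.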

Response Fields

job_id (string)
    Job identifier (prefix: di_).
dataset_id (string)
    Dataset identifier (prefix: ds_).
status (string)
    Initial status: QUEUED.
poll_url (string)
    URL to poll for job status.
cost_estimated (number, nullable)
    Estimated cost in USD.
pre_billed (boolean)
    Whether the estimated cost was pre-deducted.

Response Example

200 OK (JSON)
{
  "job_id": "di_abc123def456",
  "dataset_id": "ds_xyz789",
  "status": "QUEUED",
  "poll_url": "/api/v1/pipeline/di_abc123def456",
  "cost_estimated": 5.35,
  "pre_billed": true
}
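Because jobs are always asynchronous, clients must poll the returned poll_url until the job leaves its initial QUEUED state. Below is a minimal polling loop; COMPLETED and FAILED as terminal statuses are an assumption (the response above only documents QUEUED), and `fetch` is any callable standing in for an HTTP GET that returns the decoded JSON body.

```python
import time

TERMINAL_STATUSES = {"COMPLETED", "FAILED"}  # assumed terminal statuses

def poll_job(job_id, fetch, interval=5.0, max_attempts=120):
    """Poll GET /api/v1/pipeline/{job_id} until the job reaches a terminal status.

    `fetch` maps a path to a decoded JSON dict, keeping the loop
    independent of whichever HTTP client is in use.
    """
    for attempt in range(max_attempts):
        doc = fetch(f"/api/v1/pipeline/{job_id}")
        if doc.get("status") in TERMINAL_STATUSES:
            return doc
        if attempt < max_attempts - 1:
            time.sleep(interval)
    raise TimeoutError(f"job {job_id} still pending after {max_attempts} polls")
```

Injecting `fetch` also makes the loop trivial to unit-test with canned responses.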
POST https://api.latence.ai/api/v1/di/presign

Get a presigned upload URL for large DI payloads (>8 MB). Upload the pipeline output via PUT, then pass the download URL as input_url.

Request Parameters

content_type (string, default: application/json)
    MIME type of the upload.

Response Fields

upload_url (string)
    PUT this URL with the pipeline output JSON.
download_url (string)
    Pass this as input_url in the process request.

Response Example

200 OK (JSON)
{
  "upload_url": "https://s3.us-west-004.backblazeb2.com/...",
  "download_url": "https://s3.us-west-004.backblazeb2.com/..."
}
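One way to decide between inline input_data and the presign flow is to measure the serialized payload against the documented 8 MB threshold. A sketch follows; the exact size accounting the service applies (and this JSON serialization) is an assumption.

```python
import json

INLINE_LIMIT_BYTES = 8 * 1024 * 1024  # documented 8 MB inline limit

def needs_presign(payload: dict, limit: int = INLINE_LIMIT_BYTES) -> bool:
    """Return True when the JSON-encoded payload exceeds the inline limit
    and should be uploaded via POST /api/v1/di/presign instead."""
    return len(json.dumps(payload).encode("utf-8")) > limit
```

When this returns True, PUT the serialized payload to upload_url and pass download_url as input_url in the process request.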

Error Handling

All errors return a JSON body with error and details fields.

400 INVALID_TIER
    Unknown processing tier. tier must be one of: full, tier1, tier2, tier3.
400 INVALID_MODE
    Unknown ingestion mode. mode must be one of: append, create.
400 MISSING_INPUT
    No input data provided. Provide input_data (inline) or input_url (presigned URL).
400 MISSING_DATASET_ID
    Append mode requires an existing dataset; dataset_id is required for append mode.
402 INSUFFICIENT_BALANCE
    Insufficient balance: not enough credits for the estimated cost.
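A small, hypothetical client-side check that maps the error body shape above (error and details fields) onto Python exceptions; the exception types chosen are this sketch's own, not part of the SDK.

```python
class InsufficientBalanceError(RuntimeError):
    """Raised for 402 INSUFFICIENT_BALANCE so callers can top up and retry."""

def check_di_response(status_code: int, body: dict) -> dict:
    """Return the body on success; raise a typed error otherwise.

    All DI errors carry `error` (machine-readable code) and
    `details` (human-readable text).
    """
    if status_code < 400:
        return body
    message = f"{body.get('error')}: {body.get('details')}"
    if status_code == 402:
        raise InsufficientBalanceError(message)
    raise ValueError(message)
```

Separating the 402 case lets billing failures be handled differently from request-validation (400) errors.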

Billing

Pricing Formula

cost = (pages / 1,000) × tier_rate × mode_discount

where mode_discount is 1.0 for create mode and 0.7 for append mode (−30%).

Add-ons & Multipliers

Tier 1 (enrich): $1.00 / 1K pages
    Semantic enrichment, feature vectors.
Tier 2 (build_graph): $10.00 / 1K pages
    Entity resolution, knowledge graph, RotatE link prediction.
Tier 3 (build_ontology): $50.00 / 1K pages
    Concept clustering, ontology induction, SHACL shapes.
Full (run): $51.85 / 1K pages
    All 3 tiers (15% bundle discount).
Append mode: −30%
    Discount for incremental ingestion into an existing dataset.

Pricing Examples

Full tier, 1,000 pages, create mode: $51.85
Full tier, 629 pages, create mode: $32.61
Tier 2 only, 1,000 pages, create mode: $10.00
Full tier, 500 pages, append mode: $18.15
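The formula and rates above can be checked with a short estimator. Decimal avoids float rounding surprises; rounding half-up to whole cents is an assumption, but it reproduces all four published examples.

```python
from decimal import Decimal, ROUND_HALF_UP

TIER_RATES = {  # USD per 1,000 pages
    "tier1": Decimal("1.00"),
    "tier2": Decimal("10.00"),
    "tier3": Decimal("50.00"),
    "full":  Decimal("51.85"),  # tier1 + tier2 + tier3 with 15% bundle discount
}

def estimate_cost(pages: int, tier: str = "full", mode: str = "create") -> float:
    """cost = (pages / 1,000) x tier_rate x mode_discount"""
    cost = Decimal(pages) / Decimal(1000) * TIER_RATES[tier]
    if mode == "append":
        cost *= Decimal("0.7")  # append mode: -30%
    return float(cost.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP))
```

For instance, 500 pages at the full tier in append mode is 0.5 × 51.85 × 0.7 = 18.1475, which rounds to the $18.15 shown above.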

Code Examples

from latence import Latence

client = Latence(api_key="YOUR_API_KEY")
di = client.experimental.dataset_intelligence_service

# Create a new dataset from pipeline output
job = di.run(input_data=pipeline_output, return_job=True)
print(f"Job: {job.job_id}")
# Poll at GET /api/v1/pipeline/{job.job_id}

# Append new documents to an existing dataset
# (mode defaults to "create", so set it explicitly when appending)
delta = di.run(
    input_data=new_pipeline_output,
    dataset_id="ds_existing_id",
    mode="append",
    return_job=True,
)

# Individual tiers
result = di.enrich(input_data=pipeline_output)        # Tier 1
result = di.build_graph(input_data=pipeline_output)    # Tier 2
result = di.build_ontology(input_data=pipeline_output) # Tier 3

Best Practices

  • Run a pipeline first: Dataset Intelligence requires pipeline output as input.
  • Use the full tier for best results; run individual tiers only when you need specific outputs.
  • For large payloads (>8 MB), the SDK automatically uses B2 presigned upload.
  • Use append mode with dataset_id to update datasets incrementally without reprocessing everything.
  • Always pass total_pages for accurate cost estimation and upfront billing.
  • Use return_job=True for production workloads; synchronous calls may time out on large datasets.
  • The delta_summary field in append-mode responses shows exactly what changed.

Explore Tutorials & Notebooks

Deep-dive examples and interactive notebooks in our GitHub repository


Looking for production-grade processing?

The Data Intelligence Pipeline chains services automatically and returns structured Data Packages.