Check before you train.

A machine-readable registry of declared AI training permissions. Check what's available, what's not and who to contact — before ingestion.

Speak with us

The gap

Training datasets draw from mixed sources: licensed material, public content, open datasets and third-party suppliers. Before ingestion, the same question keeps coming up: has the creator declared a position on AI training?

No standard signal

Creators lack a widely adopted way to declare AI training permissions in a form ingestion pipelines can read.

No standard check

Dataset teams and suppliers lack a common registry to query before training.

Disputes surface late

When questions about consent come up after training, they’re slower to resolve, costlier to remediate and harder to document.

Transparency is hard to evidence

Even where teams want stronger processes, there is often no standardised record of what was checked.

What Sourcemark provides

A queryable registry

Query it with exact or perceptual fingerprints to check for declared signals before ingestion.
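To make the two fingerprint types concrete, here is a minimal sketch. It assumes SHA-256 for exact matching and a toy average hash for perceptual matching. Sourcemark's actual fingerprinting algorithms are not described here, so the function names and both techniques are illustrative stand-ins.

```python
import hashlib


def exact_fingerprint(data: bytes) -> str:
    """Exact fingerprint (assumed SHA-256): identical bytes give an identical digest."""
    return hashlib.sha256(data).hexdigest()


def average_hash(pixels: list[list[int]]) -> int:
    """Toy perceptual fingerprint over a small grayscale grid (values 0-255).

    Each pixel contributes one bit: 1 if it is brighter than the grid mean.
    This is an illustrative technique, not Sourcemark's documented method.
    """
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two perceptual fingerprints."""
    return bin(a ^ b).count("1")


# A uniformly brightened copy has different bytes but the same average hash.
original = [[10, 200], [220, 30]]
brightened = [[20, 210], [230, 40]]
distance = hamming_distance(average_hash(original), average_hash(brightened))
```

An exact fingerprint only matches byte-identical copies. A perceptual one is compared by distance, so resized or re-encoded copies can still match within a chosen threshold.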

A structured result

For each matched file: consent status, licensing contact, timestamp, declarer details and version history. For unmatched files: no registered signal.
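One way to model that result in code is sketched below. The field names, the payload shape and the example status values are assumptions for illustration. They are not taken from the Sourcemark API.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RegistryResult:
    """Hypothetical shape for one checked file."""
    fingerprint: str
    matched: bool
    consent_status: Optional[str] = None      # e.g. "available", "not_available" (assumed values)
    licensing_contact: Optional[str] = None
    declared_at: Optional[str] = None         # timestamp of the declaration
    declarer: Optional[str] = None
    version_history: list = field(default_factory=list)


def parse_result(payload: dict) -> RegistryResult:
    """Build a RegistryResult from an assumed payload with an optional 'declaration' key."""
    declaration = payload.get("declaration")
    if declaration is None:
        # Unmatched file: no registered signal, so every declaration field stays empty.
        return RegistryResult(fingerprint=payload["fingerprint"], matched=False)
    return RegistryResult(
        fingerprint=payload["fingerprint"],
        matched=True,
        consent_status=declaration.get("consent_status"),
        licensing_contact=declaration.get("licensing_contact"),
        declared_at=declaration.get("declared_at"),
        declarer=declaration.get("declarer"),
        version_history=list(declaration.get("version_history", [])),
    )
```

Keeping `matched` separate from `consent_status` makes it harder to confuse "no registered signal" with any particular declared position.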

A verifiable record

A record of what was checked, what was matched and what result was returned — supporting internal governance and compliance processes.
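As a rough illustration of what a checkable record could look like on the caller's side, the sketch below stores each check with a SHA-256 digest over its canonical JSON form. The record fields and the digest scheme are assumptions, not Sourcemark's documented record format.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Optional


def _digest(body: dict) -> str:
    # Canonical JSON (sorted keys, fixed separators) so the digest is reproducible.
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def make_check_record(fingerprint: str, matched: bool, result: Optional[dict]) -> dict:
    """Record what was checked, whether it matched and what was returned."""
    body = {
        "checked_at": datetime.now(timezone.utc).isoformat(),
        "fingerprint": fingerprint,
        "matched": matched,
        "result": result,
    }
    return {**body, "digest": _digest(body)}


def verify_record(record: dict) -> bool:
    """Recompute the digest and compare it with the stored one."""
    body = {k: v for k, v in record.items() if k != "digest"}
    return _digest(body) == record["digest"]
```

A record edited after the fact fails `verify_record`, which gives governance reviews a simple integrity check to run.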

Better transparency

A more consistent, auditable basis for consent checking before training. It is not a compliance guarantee, but it gives the people making dataset decisions a clearer process.

How it works

1. Connect to the registry

Get access to the Sourcemark API for batch checks, workflow integration and dataset auditing.

2. Run checks across your dataset

Generate fingerprints across candidate training data and query the registry in batch before ingestion.

3. Interpret the result

For matched files: declared consent status, licensing pathway, timestamp and record details. For unmatched files: no declared signal.

4. Act on the output

Exclude not-available content, route licensing enquiries, flag unresolved files and document the checks carried out.
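The steps above can be sketched as a small pre-ingestion loop. The `query_batch` callable stands in for a real API client and is hypothetical, as are the status values and action names. The sketch only shows the control flow of batching, interpreting and acting.

```python
from typing import Callable, Dict, Iterator, List, Optional


def batched(items: List[str], size: int) -> Iterator[List[str]]:
    """Split fingerprints into fixed-size batches for querying."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def decide(declaration: Optional[dict]) -> str:
    """Map one result to an action. Status values are assumed, not documented."""
    if declaration is None:
        # No declared signal: this is not permission, so keep the file unresolved.
        return "flag_unresolved"
    status = declaration.get("consent_status")
    if status == "not_available":
        return "exclude"
    if status == "licensing_required":
        return "route_licensing_enquiry"
    if status == "available":
        return "proceed"
    return "flag_unresolved"  # unknown status: leave for human review


def run_checks(
    fingerprints: List[str],
    query_batch: Callable[[List[str]], Dict[str, dict]],
    batch_size: int = 100,
) -> Dict[str, str]:
    """Query in batches; query_batch returns declarations for matched fingerprints only."""
    decisions: Dict[str, str] = {}
    for batch in batched(fingerprints, batch_size):
        results = query_batch(batch)
        for fp in batch:
            decisions[fp] = decide(results.get(fp))
    return decisions
```

Passing the client in as a function keeps the loop testable with a stub, and unknown or missing results default to review instead of silent inclusion.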

Use cases

Dataset auditing

Check an existing or acquired dataset against the registry before training begins.

Pipeline integration

Integrate the Sourcemark API into ingestion or review workflows as a standard pre-training step.

Compliance and governance

Use Sourcemark query records to support internal governance, procurement reviews and broader dataset oversight.

Transparency and procurement

Explain how consent signals were checked and how dataset decisions were made with a standardised record.

Platform integration

If you host creator content, surface Sourcemark declarations so AI training signals are clearer at the point of use.

What Sourcemark is not

Not an enforcement tool

Sourcemark records declarations; it does not police use.

Not a licensing broker

It provides the contact pathway, not the negotiation.

Not a rights authority

It records who made a declaration, not whether they had the legal authority to do so.

Not a guarantee of coverage

Absence of a match means no Sourcemark declaration was found, not that permission exists.

Registry coverage

Images
Documents
Video

Matching: exact fingerprinting and perceptual matching. Declarers: individual creators, representatives and organisational rights holders. Access: API-based checks for audits and pipeline workflows.

Ready to check your datasets?

Pricing is structured around use case and volume, including options for one-off audits, ongoing access and workflow integration.

Speak with us