Check before you train.
A machine-readable registry of declared AI training permissions. Check what's available, what's not and who to contact — before ingestion.
Speak with us
The gap
Training datasets draw from mixed sources: licensed material, public content, open datasets and third-party suppliers. Before ingestion, the same question keeps coming up: has the creator declared a position on AI training?
No standard signal
Creators lack a widely adopted way to declare AI training permissions in a form ingestion pipelines can read.
No standard check
Dataset teams and suppliers lack a common registry to query before training.
Disputes surface late
When questions about consent come up after training, they’re slower to resolve, costlier to remediate and harder to document.
Transparency is hard to evidence
Even where teams want stronger processes, there is often no standardised record of what was checked.
What Sourcemark provides
A queryable registry
Query it with exact or perceptual fingerprints to check for declared signals before ingestion.
A structured result
For each matched file: consent status, licensing contact, timestamp, declarer details and version history. For unmatched files: no registered signal.
A verifiable record
A record of what was checked, what was matched and what result was returned — supporting internal governance and compliance processes.
Better transparency
A more consistent, auditable basis for consent checking before training. Not a compliance guarantee — but a clearer process for the people making dataset decisions.
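As an illustration of the structured result described above, the sketch below models a matched and an unmatched file. The page does not publish a schema, so every class name, field name and status label here is an assumption chosen for readability, not the Sourcemark API.

```python
from dataclasses import dataclass, field
from typing import List, Optional


# Hypothetical shapes: field names and status labels are assumptions.
@dataclass
class Declaration:
    consent_status: str            # assumed labels, e.g. "available" / "not_available"
    licensing_contact: str
    declared_at: str               # ISO 8601 timestamp of the declaration
    declarer: str
    version_history: List[str] = field(default_factory=list)


@dataclass
class CheckResult:
    fingerprint: str
    declaration: Optional[Declaration] = None   # None means no registered signal

    @property
    def matched(self) -> bool:
        return self.declaration is not None

    def summary(self) -> str:
        if self.declaration is None:
            # No match only means nothing was found; it is not permission.
            return "no registered signal"
        d = self.declaration
        return f"{d.consent_status} (contact: {d.licensing_contact})"


matched = CheckResult(
    "abc123",
    Declaration("not_available", "licensing@example.com",
                "2024-01-01T00:00:00Z", "Example Creator"),
)
unmatched = CheckResult("def456")
print(matched.summary())
print(unmatched.summary())
```

Keeping the unmatched case as an explicit `None` rather than a default status makes it harder for downstream code to read "no registered signal" as consent.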
How it works
Connect to the registry
Get access to the Sourcemark API for batch checks, workflow integration and dataset auditing.
Run checks across your dataset
Generate fingerprints across candidate training data and query the registry in batch before ingestion.
Interpret the result
For matched files: declared consent status, licensing pathway, timestamp and record details. For unmatched files: no registered signal.
Act on the output
Exclude not-available content, route licensing enquiries, flag unresolved files and document the checks carried out.
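The four steps above can be sketched end to end as follows. This is a minimal sketch under stated assumptions: `lookup` is a hypothetical stand-in for a registry call (here a local function, with no network access), SHA-256 is an assumed choice of exact fingerprint, and the status labels and action names are illustrative only.

```python
import hashlib
from typing import Callable, Dict, Iterable, List, Optional

# Hypothetical registry call: takes a batch of fingerprints and returns a dict
# mapping each matched fingerprint to its declared status.
Lookup = Callable[[List[str]], Dict[str, str]]


def exact_fingerprint(data: bytes) -> str:
    """SHA-256 hex digest, used here as an assumed exact fingerprint."""
    return hashlib.sha256(data).hexdigest()


def batches(items: List[str], size: int) -> Iterable[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def decide(status: Optional[str]) -> str:
    """Map a declared status to a pipeline action (all labels are assumptions)."""
    if status is None:
        return "flag_unresolved"          # no declaration found; not permission
    if status == "not_available":
        return "exclude"
    if status == "licence_required":
        return "route_licensing_enquiry"
    return "proceed"


def check_dataset(files: Dict[str, bytes], lookup: Lookup,
                  batch_size: int = 2) -> List[dict]:
    """Fingerprint every file, query in batches, and return a log of the checks."""
    fps = {name: exact_fingerprint(data) for name, data in files.items()}
    declared: Dict[str, str] = {}
    for batch in batches(sorted(set(fps.values())), batch_size):
        declared.update(lookup(batch))
    return [
        {"file": name, "fingerprint": fp,
         "status": declared.get(fp), "action": decide(declared.get(fp))}
        for name, fp in fps.items()
    ]


# Toy data standing in for a dataset and a registry.
files = {"a.txt": b"alpha", "b.txt": b"beta", "c.txt": b"gamma"}
registry = {exact_fingerprint(b"alpha"): "not_available",
            exact_fingerprint(b"beta"): "available"}


def fake_lookup(batch: List[str]) -> Dict[str, str]:
    return {fp: registry[fp] for fp in batch if fp in registry}


log = check_dataset(files, fake_lookup)
for entry in log:
    print(entry["file"], entry["action"])
```

The returned log doubles as the documentation of what was checked: each entry records the file, the fingerprint queried, the status returned and the action taken.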
Use cases
Dataset auditing
Check an existing or acquired dataset against the registry before training begins.
Pipeline integration
Integrate the Sourcemark API into ingestion or review workflows as a standard pre-training step.
Compliance and governance
Use Sourcemark query records to support internal governance, procurement reviews and broader dataset oversight.
Transparency and procurement
Explain how consent signals were checked and how dataset decisions were made with a standardised record.
Platform integration
If you host creator content, surface Sourcemark declarations so AI training signals are clearer at the point of use.
What Sourcemark is not
Not an enforcement tool
Sourcemark records declarations; it does not police use.
Not a licensing broker
It provides the contact pathway, not the negotiation.
Not a rights authority
It records who made a declaration, not whether they had the legal authority to do so.
Not a guarantee of coverage
Absence of a match means no Sourcemark declaration was found, not that permission exists.
Registry coverage
Exact fingerprinting and perceptual matching.
Individual creators, representatives and organisational rights holders.
API-based checks for audits and pipeline workflows.
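To show why a registry might offer both exact fingerprinting and perceptual matching, the sketch below compares the two on a tiny grayscale grid. The algorithms are assumptions: the page does not say which fingerprints Sourcemark uses, so SHA-256 stands in for the exact fingerprint and a simplified average hash (without the usual downscaling step) stands in for the perceptual one.

```python
import hashlib
from typing import List

Grid = List[List[int]]  # grayscale pixel values in the range 0..255


def exact_fp(pixels: Grid) -> str:
    """Exact fingerprint: changes if any single byte changes."""
    raw = bytes(v for row in pixels for v in row)
    return hashlib.sha256(raw).hexdigest()


def average_hash(pixels: Grid) -> int:
    """Simplified average hash: one bit per pixel, set when above the mean."""
    flat = [v for row in pixels for v in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for v in flat:
        bits = (bits << 1) | (1 if v > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


original = [[10, 200, 10, 200],
            [200, 10, 200, 10],
            [10, 200, 10, 200],
            [200, 10, 200, 10]]
# Simulate a light re-encode by brightening every pixel slightly.
altered = [[v + 3 for v in row] for row in original]

print(exact_fp(original) == exact_fp(altered))
print(hamming(average_hash(original), average_hash(altered)))
```

An exact fingerprint only matches byte-identical copies, while a perceptual one is compared by distance, so a matching threshold has to be chosen and false matches become possible.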
Ready to check your datasets?
Pricing is structured around use case and volume, including options for one-off audits, ongoing access and workflow integration.
Speak with us