Knowledge
Pull structured entities and relationships out of your documents into a queryable store — for the exact, exhaustive, and relational questions vector search can't answer
Knowledge Guide
This guide is for the developer who has documents and needs exact answers from them. You have a folder full of Excel project sheets, PowerPoint case studies, PDFs and Word files, and your users keep asking questions that vector search can't answer well:
- "List every company we've worked with in retail."
- "How many of our people have a competence in Java?"
- "Which consultants have worked with a fintech client?"
These are exhaustive, exact, and relational questions. Retrieval-augmented generation (RAG) is great at fuzzy, open-ended ones ("find a project like this", "what did we say about pricing") but it cannot guarantee a complete, deduplicated list, and it cannot count. The reason is simple: the entities — the company names, the people, the skills — never leave the document. Only a lossy summary embedding does.
The knowledge layer fixes that. It pulls structured entities and the relationships between them out of your documents and into a queryable store, so the questions above become deterministic queries instead of similarity guesses. It runs alongside the existing RAG layer, not instead of it.
This guide explains the mental model, walks you through the common patterns from a trivial "hello world" to a relational multi-hop agent, and ends with a reference cheat sheet you can keep open while you work.
What You Get
The platform gives you one cooperating set of capabilities,
reached through context.getKnowledge() inside any Action, plus
two native AITools the LLM can call on its own.
| Capability | Accessor | What it does |
|---|---|---|
| Config | context.getKnowledge().createDataset(...), addEntityType(...), addRelationshipType(...), addTaxonomyTerm(...) | Declare what to extract and how to deduplicate it — no code, just configuration. |
| Ingestion | preview(...), backfill(...), reprocess(...) | Dry-run extraction, then enqueue documents for extraction in the background. |
| Jobs | jobStatus(...), listJobs(...), jobSummary(...) | Watch extraction progress and inspect failures. |
| Query | query(...), vocabulary(...) | Run exact lookups, aggregations, and 1–2 hop relational queries — always access-filtered. |
| Relationships | addRelationship(...), listRelationshipTypes(...) | Read the relationship vocabulary; write edges by hand when you need to. |
| Agent tools | knowledge_query, knowledge_vocabulary | The LLM discovers values and runs queries itself, mid-turn. |
Everything is multi-tenant and app-scoped by
construction. A dataset is keyed by (tenant, app, datasetId);
you never see another tenant's or another app's entities.
Everything is also access-filtered per fact: the same query
returns different results for different users, and that is
correct behaviour (see Access Control).
The same surface is reachable over the agent-tool layer for LLM-driven use. This guide focuses on the JS Action surface — that's where most integration code lives — and then shows how to hand the same power to an agent.
The Mental Model In One Picture
If you take away one mental model from this guide, take this one:
DOCUMENTS                            KNOWLEDGE STORE
─────────                            ───────────────
(Excel / PPT /                       (entities + facts + edges,
 PDF / Word)                          deduplicated, exact, queryable)
┌─────────────────────────┐          ┌──────────────────────────────┐
│ project-list.xlsx       │          │ entity: "Acme Corp"          │
│  ┌────────┬─────────┐   │          │ type: company                │
│  │ Person │ Client  │   │   ───►   │ aliases: [Acme, Acme Inc.]   │
│  ├────────┼─────────┤   │          │ facts:                       │
│  │ Jane   │ Acme    │   │          │   industry=retail (doc A)    │
│  │ Erik   │ Acme    │   │          │   hq=Boston (doc B)          │
│  └────────┴─────────┘   │          └──────────────────────────────┘
└─────────────────────────┘               ▲               │
             │                            │               │
             │  EXTRACTION + RESOLUTION   │               │ QUERY
             │  (background worker,       │               │ ( where / traverse
             │   per-row, dedup inline)   │               │   / aggregate )
             │                            │               ▼
             └────────────────────────────┘   ┌──────────────────────┐
                                              │ "list all retail     │
┌───────────────────────────────┐             │  companies" → exact  │
│ edge: Jane —worked_on→ Acme   │    ◄───►    │ "count per industry" │
│ edge: Erik —worked_on→ Acme   │             │ "who worked w/ Acme" │
└───────────────────────────────┘             └──────────────────────┘
A document goes in. The platform classifies each block as tabular (rows of entities) or prose, then runs an LLM extraction pass that pulls out entities (a company, a person, a skill), their attribute facts (industry, headcount, each carrying its own source), and the relationships between them (Jane worked_on Acme). Before anything is written, each entity is resolved against what's already there — "Acme", "Acme Inc." and "Acme Corp" collapse into one canonical node — so your counts and lists are clean.
At query time you don't do similarity search. You issue a structured query: select an entity type, filter by attribute predicates, optionally traverse one or two relationship hops, and optionally reduce to a count or a group-by. The answer is exact and exhaustive, and it's filtered to exactly the documents the asking user is allowed to see.
That's the whole story. Everything below is mechanics.
Quick Start: From Documents To An Exact List
Let's do the smallest end-to-end thing that has value. You have an app with a Document model holding a pile of client documents. You want to answer "list every company, by industry."
Four steps: declare a dataset, declare the entity type, backfill the documents, query.
// action: setupAndQueryClients
function setupAndQueryClients() {
var k = context.getKnowledge();
// 1. Declare a dataset. createDataset rejects a duplicate
//    (tenant, app, datasetId), so guard the call.
if (!datasetExists(k, "clients")) {
k.createDataset("clients", {
name: "Client corpus",
extractionModel: "gemini-3.1-flash-lite"
});
// 2. Declare what to extract and how to deduplicate it
k.addEntityType("clients", {
type: "company",
attributes: ["industry", "hq", "headcount"],
extractionHint: "Companies the firm has worked with as a client.",
dedupKeys: ["canonicalName"]
});
}
// 3. Enqueue every Document-model record for extraction.
// This runs in the background; it returns immediately.
var enq = k.backfill("clients");
context.log("Enqueued " + enq.enqueued + " documents");
// 4. Query — once extraction has run (watch jobSummary), this is
// an exact, deduplicated, access-filtered list.
var companies = k.query("clients", {
entityType: "company",
where: [{ attr: "industry", op: "eq", value: "retail" }],
limit: 100
});
companies.forEach(function (c) {
context.log(c.canonicalName + " (" + c.attributes.hq + ")");
});
}
function datasetExists(k, id) {
try { k.listEntityTypes(id); return true; }
catch (e) { return false; }
}
That's it. No collections to create, no schema migration. You
declared two things (a dataset and a type), pointed the pipeline
at your documents, and got back an exact list. Adding industry
counts is one more query:
var counts = k.query("clients", {
entityType: "company",
aggregate: { op: "group_by", attr: "industry" }
});
// → [ { _id: "retail", count: 14 }, { _id: "fintech", count: 9 }, ... ]
The rest of this guide unpacks each step and shows how far it goes.
The Four Building Blocks
Everything in the knowledge layer is built from four nouns.
Dataset
A dataset is the container — a named knowledge graph scoped
to your app. It holds the extraction configuration (entity types,
relationship types, taxonomy) and owns all the entities and edges
extracted under it. You'll usually have one per problem domain
("clients", "products", "case-law"), keyed by a datasetId you
choose. A dataset also pins the models used for extraction and
resolution.
Entity
An entity is a resolved, real-world thing of a configured
entityType — a company, a person, a skill. After deduplication
it is a single canonical node, even though it was mentioned in
twenty documents under five spellings. It carries:
- canonicalName: the resolved, preferred form.
- aliases[]: every observed variant ("IBM", "I.B.M.", "International Business Machines"), kept, not discarded.
- attributes[]: an array of attribute facts, not a flat key/value bag (see below).
- sourceDocs[] / folderIds[]: the derived union of every source that contributed a fact, used for fast access pre-filtering.
When you read an entity back through query(...), attributes are
flattened to a convenient { key: value } map (first-wins when a
merged entity holds several facts for the same key). The raw
per-fact structure stays underneath, where access control needs
it.
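To picture the first-wins flattening, here is a minimal sketch. It assumes the raw facts arrive in the order they were recorded and that each one carries a `key` and a `value`; `flattenFacts` is a hypothetical helper written for illustration, not a platform function.

```javascript
// Hypothetical helper: collapse an array of attribute facts into a flat
// { key: value } map, keeping the first fact seen for each key.
function flattenFacts(facts) {
  var flat = {};
  facts.forEach(function (f) {
    if (!Object.prototype.hasOwnProperty.call(flat, f.key)) {
      flat[f.key] = f.value; // first-wins: later facts for the same key are ignored
    }
  });
  return flat;
}

// Illustrative data only: three facts, two of them for the same key.
var acmeFacts = [
  { key: "industry", value: "retail", sourceDocId: "docA" },
  { key: "hq", value: "Boston", sourceDocId: "docB" },
  { key: "industry", value: "e-commerce", sourceDocId: "docC" }
];
var flatAcme = flattenFacts(acmeFacts);
```

Here `flatAcme.industry` is the value from the first industry fact. The second one is not lost in the store; it is just not surfaced in the flattened view.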
Attribute Fact
This is the subtle, important one. An attribute is not a bare
property on the entity. It is a { key, value, sourceDocId, folderIds, confidence } record. "Acme is in retail" is a fact
asserted by a specific document. The same entity can carry the
same key from several documents, each with its own source and its
own access.
Why it matters: it's what lets one canonical "Acme" node present a different view of itself to different users (you only see the facts whose source you can read), and it's what lets the firm click "where does it say Acme is in retail?" Facts carry provenance because provenance does triple duty — access, traceability, and dedup signal.
Relationship
A relationship is a directed, typed edge between two
resolved entities: Jane —worked_on→ "Project Atlas",
"Project Atlas" —client_of→ Acme. Edges are created between
canonical ids, never between raw name strings, so the graph
doesn't fragment across spelling variants. The relationship
vocabulary is small and typed (you declare it), which keeps
traversal queries clean. Each edge carries confidence and a
source, just like a fact.
Tabular rows are the high-confidence source for relationships:
a row person | project | client | year is three edges, stated
with high reliability because row proximity is the relationship.
Prose-derived edges ("Anna led the team at IBM") are extracted
too, but at lower confidence.
Configuring A Dataset
Configuration is the feature — adding or tuning a type is data,
not code. All config writes are on context.getKnowledge().
Create The Dataset
var ds = context.getKnowledge().createDataset("clients", {
name: "Client corpus",
extractionModel: "gemini-3.1-flash-lite", // bulk per-row extraction
resolutionModel: "gemini-3.1-pro" // ambiguous-merge escalation
});
createDataset is one-per-(tenant, app, datasetId) — a
duplicate id is rejected, so guard it (see the datasetExists
helper above). The two model fields are optional; they let you
run cheap, fast extraction on the bulk path and escalate only
genuinely ambiguous merge decisions to a stronger model.
Add Entity Types
An entity type declares what to pull out and, critically, how to decide two extractions are the same thing:
context.getKnowledge().addEntityType("clients", {
type: "company",
attributes: ["industry", "hq", "headcount", "ticker"],
extractionHint: "A client company the firm has done work for. " +
"Prefer the legal name; capture stock ticker if present.",
dedupKeys: ["canonicalName", "ticker"],
autoMergeThreshold: 0.92, // ≥ this similarity → merge automatically
reviewThreshold: 0.75, // ≥ this but < auto → flag for review
escalateAmbiguous: true // route hard calls to the resolution model
});
context.getKnowledge().addEntityType("clients", {
type: "person",
attributes: ["title", "email"],
extractionHint: "A consultant or client contact named in the document.",
dedupKeys: ["email"], // people without a hard id stay conservative
reviewThreshold: 0.80,
escalateAmbiguous: true
});
The service stamps version = 1 and an addedAt on each type;
duplicate type names are rejected. The thresholds encode the
"prefer a duplicate over a false merge" principle: a false
merge ("two different people collapsed into one") gives silently
wrong answers and is hard to detect; a duplicate is visible and
fixable. Be conservative, especially for people.
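One way to read the two thresholds is as a three-way decision per candidate match. The sketch below assumes a similarity score between 0 and 1 (how the platform computes it is not covered here) and assumes that anything below reviewThreshold becomes a new entity. `resolutionAction` is a hypothetical helper, and the action names are borrowed from what preview reports.

```javascript
// Hypothetical decision rule for one candidate match against an existing
// entity. `similarity` is assumed to be a score in [0, 1].
function resolutionAction(similarity, typeConfig) {
  if (similarity >= typeConfig.autoMergeThreshold) {
    return "MERGE";   // confident enough to fold into the existing entity
  }
  if (similarity >= typeConfig.reviewThreshold) {
    return "REVIEW";  // ambiguous: flag it, do not merge automatically
  }
  return "CREATE";    // too different: keep it as a separate entity
}

// Thresholds copied from the company type above.
var companyThresholds = { autoMergeThreshold: 0.92, reviewThreshold: 0.75 };
var decisions = [0.97, 0.80, 0.40].map(function (score) {
  return resolutionAction(score, companyThresholds);
});
```

Raising autoMergeThreshold widens the REVIEW band, which trades more manual review for fewer false merges.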
Read them back any time:
context.getKnowledge().listEntityTypes("clients").forEach(function (t) {
context.log(t.type + " v" + t.version + " — dedup on " + t.dedupKeys.join(","));
});
Add Relationship Types
Declare the small, typed edge vocabulary. Each type is
constrained to a fromType → toType, which both guides
extraction and validates traversal queries:
var k = context.getKnowledge();
k.addRelationshipType("clients", {
relType: "worked_on", fromType: "person", toType: "project",
extractionHint: "The person staffed on or delivering the project."
});
k.addRelationshipType("clients", {
relType: "client_of", fromType: "project", toType: "company",
extractionHint: "The client company a project was delivered for."
});
k.addRelationshipType("clients", {
relType: "has_skill", fromType: "person", toType: "skill"
});
The combined extraction pass then emits these edges
automatically from documents (tabular-first). List them with
k.listRelationshipTypes("clients").
Keep the vocabulary small. A handful of well-defined types yields cleaner queries and less extraction noise than open-ended "relate freely." Add a type when a concrete query need appears.
Add A Controlled Vocabulary (Taxonomy)
The "IT is too broad" problem: a user asks for "people with IT skills", but the data says "Java", "Kubernetes", "React". Declare a taxonomy so broad terms resolve to their underlying values instead of being guessed:
var k = context.getKnowledge();
k.addTaxonomyTerm("clients", {
term: "Financial Technology",
aliases: ["fintech", "fin-tech", "fin tech"],
entityType: "industry"
});
k.addTaxonomyTerm("clients", {
term: "Java",
categoryPath: "IT/Backend",
aliases: ["java se", "jdk"],
entityType: "skill"
});
At extraction time, variant surface forms normalize to the canonical term; at query time the agent can resolve a category ("IT/Backend") to its members. This is what stops "fintech" and "Financial Technology" from being counted as two industries.
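The two behaviours just described, alias normalization and category lookup, can be modelled in a few lines. This is a sketch under assumptions: matching is case-insensitive, a category also covers its sub-paths, and `buildTaxonomyIndex` is a hypothetical helper rather than part of the platform.

```javascript
// Hypothetical in-memory index over taxonomy terms shaped like the
// addTaxonomyTerm arguments above.
function buildTaxonomyIndex(terms) {
  var byAlias = Object.create(null); // no inherited keys to collide with
  terms.forEach(function (t) {
    byAlias[t.term.toLowerCase()] = t.term;
    (t.aliases || []).forEach(function (a) {
      byAlias[a.toLowerCase()] = t.term;
    });
  });
  return {
    // Map a surface form to its canonical term; unknown forms pass through.
    normalize: function (surface) {
      var hit = byAlias[String(surface).trim().toLowerCase()];
      return hit === undefined ? surface : hit;
    },
    // List canonical terms filed at or under a category path.
    membersOf: function (categoryPath) {
      return terms.filter(function (t) {
        var path = t.categoryPath || "";
        return path === categoryPath || path.indexOf(categoryPath + "/") === 0;
      }).map(function (t) { return t.term; });
    }
  };
}

var taxonomy = buildTaxonomyIndex([
  { term: "Financial Technology", aliases: ["fintech", "fin-tech", "fin tech"], entityType: "industry" },
  { term: "Java", categoryPath: "IT/Backend", aliases: ["java se", "jdk"], entityType: "skill" }
]);
```

With this index, `taxonomy.normalize("FinTech")` and `taxonomy.normalize("fin tech")` both land on the same canonical term, and `taxonomy.membersOf("IT")` includes the terms filed under "IT/Backend".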
Ingestion: Preview, Backfill, Reprocess, Jobs
Extraction runs as a background job per document, driven off the platform message bus and executed by an extraction worker. Your code enqueues; the worker extracts, resolves, and writes. You never block an Action on an LLM extraction call.
Preview Before You Commit (preview)
preview is a dry run: it shows you what would be
extracted and resolved from one document — nothing written, no
job row created. Use it to tune an extractionHint or a
dedupKeys list before a full backfill:
var p = context.getKnowledge().preview("clients", "doc_6631a2");
p.entities.forEach(function (e) {
// action is CREATE (new), MERGE (folds into an existing entity),
// or REVIEW (ambiguous — flagged, not auto-merged)
context.log(e.entityType + " " + e.canonicalName +
" → " + e.action +
(e.targetEntityId ? " into " + e.targetEntityId : "") +
" (conf " + e.confidence + ")");
});
p.relationships.forEach(function (r) {
context.log(r.fromName + " —" + r.relType + "→ " + r.toName +
" (conf " + r.confidence + ")");
});
If you see two obviously-different companies coming back as one
MERGE, tighten dedupKeys or raise autoMergeThreshold and
preview again. This is your tuning loop.
Backfill The Corpus (backfill)
backfill walks every Document-model record in the app and
enqueues each one for extraction into the dataset. It is
idempotent (a re-enqueue replaces the queued job and re-runs
the document) and bounded per call (it processes a slice;
call it again to continue a large corpus):
var r = context.getKnowledge().backfill("clients");
context.log("Enqueued " + r.enqueued + " documents this pass");
Reprocess After A Config Change (reprocess)
When you add or change an entity type after documents are
already extracted, you don't want to re-run the whole corpus —
only the documents that the new/bumped type still needs.
reprocess re-enqueues exactly those: documents whose completed
job lacks the type or predates its current version.
// You just added a "skill" type. Catch up only what's stale:
var r = context.getKnowledge().reprocess("clients", "skill");
context.log("Reprocessing " + r.enqueued + " documents for 'skill'");
Watch The Jobs (jobSummary, listJobs, jobStatus)
Extraction is asynchronous, so you watch it through the job API.
Statuses are QUEUED, RUNNING, DONE, FAILED.
var k = context.getKnowledge();
// Dashboard counts
var s = k.jobSummary("clients");
context.log(s.DONE + " done, " + s.QUEUED + " queued, " +
s.RUNNING + " running, " + s.FAILED + " failed");
// Drill into failures (newest first; limit caps at 200)
k.listJobs("clients", "FAILED", 20).forEach(function (j) {
context.log(j.documentRecordId + " (attempt " + j.attempts + "): " + j.error);
});
// One document's status
var job = k.jobStatus("clients", "doc_6631a2");
if (job) context.log("status=" + job.status + " types=" + job.extractedTypes);
Failed jobs are retried with backoff by the platform; the
attempts and error fields tell you what's happening. A
healthy ingestion trends QUEUED → DONE with FAILED near zero.
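When failures do pile up, it usually helps to see whether they share a cause. The sketch below buckets failed jobs by their error text. It relies only on the `documentRecordId` and `error` fields used above; `groupFailuresByError` is a hypothetical helper, and the sample jobs are made up.

```javascript
// Hypothetical helper: bucket failed jobs by error message so the most
// common failure comes first.
function groupFailuresByError(jobs) {
  var buckets = Object.create(null);
  jobs.forEach(function (j) {
    var key = j.error || "(no error message)";
    if (!buckets[key]) buckets[key] = [];
    buckets[key].push(j.documentRecordId);
  });
  return Object.keys(buckets).map(function (error) {
    return { error: error, count: buckets[error].length, documents: buckets[error] };
  }).sort(function (a, b) { return b.count - a.count; });
}

// Illustrative input only; real rows would come from listJobs.
var sampleFailures = [
  { documentRecordId: "doc_1", attempts: 3, error: "unsupported file type" },
  { documentRecordId: "doc_2", attempts: 1, error: "model timeout" },
  { documentRecordId: "doc_3", attempts: 2, error: "unsupported file type" }
];
var failureReport = groupFailuresByError(sampleFailures);
```

In an Action you would feed it the result of `k.listJobs("clients", "FAILED", 200)` and log the top few buckets.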
Querying The Knowledge Store
This is where the value is realised. You issue a structured query through a small, validated DSL — never raw Mongo. The structure is the point: because the agent or app can only ever produce a well-formed query of this shape, the access-control stages can never be bypassed, and injection is impossible.
var rows = context.getKnowledge().query("clients", {
entityType: "company", // required
where: [ /* attribute predicates */ ],
traverse: [ /* 1–2 relationship hops */ ],
aggregate: { /* optional terminal reduction */ },
limit: 50, // default 50
offset: 0
});
Any top-level field outside {entityType, where, traverse, aggregate, limit, offset} is rejected.
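If you assemble specs dynamically, a pre-flight check can catch a stray field before the call does. This is a minimal sketch that mirrors the rule above on the client side; `checkSpecShape` is a hypothetical helper, and the platform's own validation remains the authority.

```javascript
var ALLOWED_SPEC_FIELDS = ["entityType", "where", "traverse", "aggregate", "limit", "offset"];

// Hypothetical pre-flight check: list the problems the query layer would
// be expected to reject, without calling it.
function checkSpecShape(spec) {
  var problems = [];
  Object.keys(spec).forEach(function (field) {
    if (ALLOWED_SPEC_FIELDS.indexOf(field) < 0) {
      problems.push("unknown field: " + field);
    }
  });
  if (typeof spec.entityType !== "string" || spec.entityType === "") {
    problems.push("entityType is required");
  }
  return problems;
}

var specProblems = checkSpecShape({ entityType: "company", sort: "canonicalName" });
```

An empty array means the top-level shape looks fine; anything else is worth logging before you call `query`.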
where: Attribute Predicates
A where is a list of { attr, op, value | min/max } clauses,
AND-ed together. The operator is one of a whitelisted set —
nothing else is expressible:
| op | Shape | Meaning |
|---|---|---|
| eq | { attr, op:"eq", value } | attribute equals value |
| in | { attr, op:"in", value:[…] } | attribute is one of a set |
| contains | { attr, op:"contains", value } | attribute contains value (substring / member) |
| range | { attr, op:"range", min, max } | numeric/date range |
| exists | { attr, op:"exists" } | the attribute is present at all |
// Retail companies headquartered in Boston with a known headcount
context.getKnowledge().query("clients", {
entityType: "company",
where: [
{ attr: "industry", op: "eq", value: "retail" },
{ attr: "hq", op: "eq", value: "Boston" },
{ attr: "headcount", op: "exists" }
]
});
// Companies in any of three industries, headcount 500–5000
context.getKnowledge().query("clients", {
entityType: "company",
where: [
{ attr: "industry", op: "in", value: ["retail", "fintech", "logistics"] },
{ attr: "headcount", op: "range", min: 500, max: 5000 }
]
});
Every attr and the entityType are validated against the
dataset config — query a typo and you get a clear error, not a
silently empty result.
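Because the clause shapes are fixed, small builder functions can keep them consistent across a codebase. The `Where` object below is a hypothetical convenience, not something the platform provides; it only produces the five clause shapes listed in the operator table.

```javascript
// Hypothetical builders for the five whitelisted operators.
var Where = {
  eq: function (attr, value) { return { attr: attr, op: "eq", value: value }; },
  isIn: function (attr, values) { return { attr: attr, op: "in", value: values }; },
  contains: function (attr, value) { return { attr: attr, op: "contains", value: value }; },
  range: function (attr, min, max) { return { attr: attr, op: "range", min: min, max: max }; },
  exists: function (attr) { return { attr: attr, op: "exists" }; }
};

// The second example above, rebuilt with the helpers.
var midSizeSpec = {
  entityType: "company",
  where: [
    Where.isIn("industry", ["retail", "fintech", "logistics"]),
    Where.range("headcount", 500, 5000)
  ]
};
```

The resulting `midSizeSpec` is a plain object, so it can be passed to `query` unchanged.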
aggregate: Count, Group, Distinct
Add a terminal aggregate to reduce instead of listing. Without
one, you get the matched entities; with one, you get aggregate
rows.
| op | Needs attr? | Returns |
|---|---|---|
| count | no | a single row { total: N } |
| group_by | yes | one row per distinct value: { _id: value, count: N }, highest count first |
| distinct | yes | one row per distinct value: { _id: value }, sorted |
// How many fintech companies?
context.getKnowledge().query("clients", {
entityType: "company",
where: [{ attr: "industry", op: "eq", value: "fintech" }],
aggregate: { op: "count" }
});
// Company count per industry (the classic "give me the breakdown")
context.getKnowledge().query("clients", {
entityType: "company",
aggregate: { op: "group_by", attr: "industry" }
});
// What industries do we even have?
context.getKnowledge().query("clients", {
entityType: "company",
aggregate: { op: "distinct", attr: "industry" }
});
This is the class of question RAG simply cannot answer: a similarity search returns the most similar chunks, never a count. Here, "14 retail clients" is a fact.
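Aggregate rows are plain data, so post-processing them is ordinary code. As an illustration, this sketch turns group_by rows into percentage shares for a chart; `withShares` is a hypothetical helper and the sample rows are made up.

```javascript
// Hypothetical helper: add a percentage share (one decimal place) to each
// group_by row of the form { _id, count }.
function withShares(rows) {
  var total = rows.reduce(function (sum, r) { return sum + r.count; }, 0);
  return rows.map(function (r) {
    return {
      value: r._id,
      count: r.count,
      share: total === 0 ? 0 : Math.round((r.count / total) * 1000) / 10
    };
  });
}

// Illustrative rows only; real ones would come from a group_by query.
var industryShares = withShares([
  { _id: "retail", count: 14 },
  { _id: "fintech", count: 9 },
  { _id: "logistics", count: 2 }
]);
```

Keep in mind that the shares are relative to what the calling user can see, just like the counts they are derived from.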
Vocabulary: Query Against Reality, Not Guesses
Before you (or an agent) filter on industry = "fintech", you
need to know the data actually says "fintech" and not "Financial
Technology" or "FinTech". vocabulary returns the distinct
attribute values and taxonomy facets that actually exist for a
type — computed over the access-filtered set, so it never leaks a
value the user can't see.
var vocab = context.getKnowledge().vocabulary("clients", "company");
// → { industry: ["retail","fintech","logistics", …],
// hq: ["Boston","London", …],
// … plus taxonomy facets … }
The pattern is vocabulary first, then query: read the real values, pick the right literal, then predicate on it. This is the single most effective habit for getting complete answers — it's why "list retail clients" returns 14 and not 9 (you queried the value that exists, not the one you assumed).
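When the input comes from a person, the literal they type rarely matches the stored casing. A small matching step between vocabulary and query closes that gap. The sketch assumes case and surrounding whitespace are the only differences you want to forgive; `resolveLiteral` is a hypothetical helper.

```javascript
// Hypothetical helper: match user input against the values that exist,
// ignoring case and surrounding whitespace. Returns the stored literal,
// or null when nothing matches.
function resolveLiteral(existingValues, userInput) {
  var wanted = String(userInput).trim().toLowerCase();
  for (var i = 0; i < existingValues.length; i++) {
    if (String(existingValues[i]).trim().toLowerCase() === wanted) {
      return existingValues[i];
    }
  }
  return null;
}

// Illustrative values only; real ones would come from vocabulary(...).
var knownIndustries = ["retail", "fintech", "logistics"];
var literal = resolveLiteral(knownIndustries, "  FinTech ");
```

Use the returned literal in the `where` clause, and treat `null` as "this value is not in the data" rather than querying on a guess.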
Relationships And Traversal
Relationships unlock the query class RAG never could. A traverse
list adds up to two relationship hops to a query; each hop is
{ relType, direction, as }, where direction is out
(follow the edge forwards, the default) or in (follow it
backwards), and as names the joined set in the output.
// "Which people worked on a project for a retail client?"
// person --worked_on--> project --client_of--> company(retail)
context.getKnowledge().query("clients", {
entityType: "person",
traverse: [
{ relType: "worked_on", direction: "out", as: "projects" },
{ relType: "client_of", direction: "out", as: "clients" }
]
// (filter the far end via where on the joined attributes, or
// narrow the start set, depending on your model)
});
// Reverse direction: "who is staffed on Project Atlas?"
// start from the project, walk worked_on backwards to people
context.getKnowledge().query("clients", {
entityType: "project",
where: [{ attr: "name", op: "eq", value: "Project Atlas" }],
traverse: [{ relType: "worked_on", direction: "in", as: "people" }]
});
A traversal is only as good as the resolution underneath it: edges are built on canonical ids precisely so that "IBM" and "I.B.M." don't fragment your graph and make relational queries incomplete again. A query with more than two hops is rejected; the design deliberately stays within MongoDB's comfortable reach rather than adopting a graph database.
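The hop rules are simple enough to check before you send a query. The sketch below encodes them as stated: at most two hops, and a direction of out or in with out as the default. Treating relType as mandatory is an assumption, and `checkTraverse` is a hypothetical helper.

```javascript
// Hypothetical pre-flight check for a traverse list.
function checkTraverse(hops) {
  var problems = [];
  if (hops.length > 2) {
    problems.push("at most two hops are allowed, got " + hops.length);
  }
  hops.forEach(function (hop, i) {
    var direction = hop.direction === undefined ? "out" : hop.direction; // out is the default
    if (!hop.relType) {
      problems.push("hop " + i + ": relType is required"); // assumption, see lead-in
    }
    if (direction !== "out" && direction !== "in") {
      problems.push("hop " + i + ": direction must be out or in");
    }
  });
  return problems;
}

var traverseProblems = checkTraverse([
  { relType: "worked_on", as: "projects" },
  { relType: "client_of", direction: "sideways", as: "clients" }
]);
```

An empty result means the list respects the documented limits; the dataset config still decides whether each relType exists.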
Writing Edges By Hand
Extraction creates edges for you, but sometimes you have a
relationship your app knows (e.g. from its own relational data)
that you want to assert into the graph. addRelationship writes
one directed edge between two existing entities:
var rel = context.getKnowledge().addRelationship("clients", {
fromEntity: "entity_jane",
toEntity: "entity_atlas",
relType: "worked_on",
sourceDocId: "doc_hr_roster", // for lineage / "where does it say?"
sourceDocName: "HR Roster 2026",
folderIds: ["folder_hr"], // for access filtering on this edge
confidence: 1.0
});
context.log("Created edge " + rel.id);
Edge writes are idempotent on (dataset, from, to, relType, sourceDoc), so re-asserting the same edge from the same source
won't duplicate it.
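If you assert edges in bulk from your own data, you can apply the same identity rule locally and skip redundant calls. This is a sketch of that idea, not the platform's implementation; `edgeKey` and `uniqueEdges` are hypothetical helpers built on the five fields named above.

```javascript
// Hypothetical identity key built from the five idempotency fields.
function edgeKey(datasetId, edge) {
  return JSON.stringify([
    datasetId, edge.fromEntity, edge.toEntity, edge.relType, edge.sourceDocId
  ]);
}

// Drop repeated assertions of the same edge before writing them.
function uniqueEdges(datasetId, edges) {
  var seen = Object.create(null);
  return edges.filter(function (e) {
    var key = edgeKey(datasetId, e);
    if (seen[key]) return false;
    seen[key] = true;
    return true;
  });
}

// Illustrative input: the first two rows describe the same edge.
var rosterEdges = uniqueEdges("clients", [
  { fromEntity: "entity_jane", toEntity: "entity_atlas", relType: "worked_on", sourceDocId: "doc_hr_roster", confidence: 1.0 },
  { fromEntity: "entity_jane", toEntity: "entity_atlas", relType: "worked_on", sourceDocId: "doc_hr_roster", confidence: 0.9 },
  { fromEntity: "entity_jane", toEntity: "entity_atlas", relType: "worked_on", sourceDocId: "doc_timesheet", confidence: 1.0 }
]);
```

Note that the same edge asserted from a different source is a different key. That is intentional: each source carries its own lineage and access.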
Letting The Agent Query On Its Own
Everything above assumes you write the query. The real power move is handing the query surface to an LLM agent and letting it decide. Two native AITools do this:
| Tool | What the LLM does with it |
|---|---|
| knowledge_vocabulary | "What values can I even filter on?" — discover concrete literals before querying. |
| knowledge_query | "Run this structured query" — exact lookups, aggregates, traversals. |
Add them to your agent's chain JSON:
{
"tools": [
{ "type": "knowledgeVocabulary", "name": "discoverValues" },
{ "type": "knowledgeQuery", "name": "queryKnowledge" }
]
}
(The canonical type names KNOWLEDGE_VOCABULARY and
KNOWLEDGE_QUERY work too; knowledgeVocabulary /
knowledgeQuery are the friendly aliases.) Unlike the recall
tools, the knowledge tools take their datasetId as a tool
argument, not as chain config — so one agent definition can
query any dataset the prompt names.
The agent then runs a discover-then-query loop on its own:
User: "How many of our consultants know Kubernetes, and who are they?"
Agent: [calls discoverValues: datasetId="clients", entityType="person",
attributeKey="skill"]
Tool: skill: ["Java", "Kubernetes", "React", "Terraform", …]
Agent: [sees "Kubernetes" is the real value; calls queryKnowledge with
spec={ entityType:"person",
where:[{attr:"skill", op:"contains", value:"Kubernetes"}],
aggregate:{op:"count"} }]
Tool: aggregate: count; entityType: person; 1 row. { total: 7 }
Agent: [calls queryKnowledge again, same where, no aggregate, to list them]
Tool: 7 entities. Jane Okafor; Erik Lind; …
Agent: "7 consultants list Kubernetes: Jane Okafor, Erik Lind, …"
The agent grounded itself in real vocabulary, ran an exact count, then listed — three tool calls, zero hallucinated skill names. Both tools' results are always access-filtered to what the current user may read, so you can expose them to an end-user agent without leaking protected facts.
You can also pass an optional folderIds array to either tool to
scope visibility further; the record-level ACL still applies on
top.
Access Control
An entity is not in a folder. The same "Acme" appears in documents across many folders with different permissions, so "can this user see Acme?" has no single answer — it depends on which facts, from which sources, the user may read. The knowledge layer answers this at the per-fact level.
How It Works
Three things are true, and they're worth internalising:
- Facts inherit the access of their own source. If the only document asserting "Acme is in retail" is one you can't see, that fact does not count for you — even if you can see some other, unrelated Acme document. Otherwise "list retail clients" would leak a fact that lived only in a protected file.
- An entity with no readable fact disappears. Filtering drops the whole entity if nothing survives, which closes the inference leak ("Acme showed up in my retail list, therefore something says Acme is retail").
- The same query yields different answers for different users, and that is correct. Counts and lists are relative to permissions. Frame this for stakeholders: "we have 14 retail clients but you see 9" is access filtering, not a bug.
Enforcement is defense-in-depth. There's a fast folder pre-filter (a performance optimization) and an authoritative per-source record ACL that is always applied. Even if a denormalized folder list drifts, access cannot leak, because the record ACL is the real gate.
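A toy model makes the per-fact rule concrete. The sketch below is not how the platform enforces access; it only reproduces the observable behaviour. `canRead` stands in for the authoritative record ACL, and `visibleEntity` and `hasFact` are hypothetical helpers.

```javascript
// Hypothetical model of per-fact filtering. canRead(sourceDocId) stands in
// for the authoritative record ACL; it is not a platform function.
function visibleEntity(entity, canRead) {
  var facts = entity.attributes.filter(function (f) {
    return canRead(f.sourceDocId);
  });
  if (facts.length === 0) return null; // no readable fact: the entity disappears
  return { canonicalName: entity.canonicalName, attributes: facts };
}

// True when a readable fact asserts key=value.
function hasFact(view, key, value) {
  return view !== null && view.attributes.some(function (f) {
    return f.key === key && f.value === value;
  });
}

var acme = {
  canonicalName: "Acme Corp",
  attributes: [
    { key: "industry", value: "retail", sourceDocId: "docA" },
    { key: "hq", value: "Boston", sourceDocId: "docB" }
  ]
};
// A user who can read docB only: Acme is visible, but not as a retail company.
var viewForDocBReader = visibleEntity(acme, function (docId) { return docId === "docB"; });
```

For this user, `hasFact(viewForDocBReader, "industry", "retail")` is false even though Acme itself is visible, and a user who can read neither document gets `null`.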
Registering The Resolver (One-Time, App Startup)
The fast pre-filter needs to know "which folders may this user see?" — app-specific logic. Register it once, process-wide, in Java startup:
KnowledgeAccess.setPermittedFolderResolver((scope, user) -> {
// Return the set of folder/document ids this user may see,
// or null to mean "no pre-filter; rely on the record ACL".
// MUST be a superset of the user's true folders; MUST NOT be
// empty to mean "see nothing" — return null for that.
return myAcl.foldersVisibleTo(user);
});
If no resolver is registered, you don't lose access control — you just lose the pre-filter optimization. The authoritative per-source check still runs on every query. "No resolver" does not mean "no access control."
Scoping Per Call (folderIds)
From JS or an agent tool you can narrow visibility further by
passing folderIds — a performance/scoping hint, with the record
ACL still authoritative:
// Only consider facts sourced from these folders (plus the ACL):
context.getKnowledge().query("clients",
{ entityType: "company", where: [{ attr: "industry", op: "eq", value: "retail" }] },
["folder_active_engagements"]);
Knowledge vs RAG: When To Use Which
The two layers coexist by design. Route the question to the right tool:
| The question is… | Use | Why |
|---|---|---|
| "List all X", "how many Y", "count per Z" | Knowledge query + aggregate | Exhaustive & exact; RAG can't count or guarantee completeness. |
| "Which A relate to B" (1–2 hops) | Knowledge traverse | Relational; RAG has no notion of edges. |
| "Find a project like this", "what did we say about pricing" | RAG (existing vector search) | Fuzzy / semantic / discovery; knowledge has no similarity notion. |
| "Give me an example of…" | RAG | Open-ended retrieval. |
| "Narrow to retail clients, then find the most relevant case study" | Hybrid | Knowledge to get the exact candidate set, RAG to rank within it. |
A good agent has both a knowledge_query tool and the existing
RAG/search tool, and a system prompt that tells it which class of
question each serves. Exhaustive and exact → knowledge. Fuzzy and
illustrative → RAG.
Worked Examples, Trivial To Advanced
Level 0 — Hello World: Count One Thing
// "How many companies have we extracted?"
context.getKnowledge().query("clients", {
entityType: "company",
aggregate: { op: "count" }
});
Level 1 — An Exact, Filtered List
// "List our fintech clients, headquartered anywhere, with a ticker."
context.getKnowledge().query("clients", {
entityType: "company",
where: [
{ attr: "industry", op: "eq", value: "fintech" },
{ attr: "ticker", op: "exists" }
],
limit: 200
}).map(function (c) { return c.canonicalName + " (" + c.attributes.ticker + ")"; });
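A fixed `limit` only gives you an exhaustive list while the result fits inside it. For larger sets you can page with `limit` and `offset`. The sketch below assumes that `query` returns an array and that row order is stable between calls, neither of which this guide states outright; `queryAll` is a hypothetical helper and is meant for listing queries, not aggregates.

```javascript
// Hypothetical helper: page through a listing query with limit/offset.
// `k` is the object returned by context.getKnowledge().
function queryAll(k, datasetId, spec, pageSize) {
  var size = pageSize || 100;
  var all = [];
  for (var offset = 0; ; offset += size) {
    // Copy the caller's spec, then override the paging fields.
    var pageSpec = { limit: size, offset: offset };
    Object.keys(spec).forEach(function (field) {
      if (field !== "limit" && field !== "offset") pageSpec[field] = spec[field];
    });
    var page = k.query(datasetId, pageSpec);
    for (var i = 0; i < page.length; i++) all.push(page[i]);
    if (page.length < size) return all; // a short page means nothing is left
  }
}
```

Inside an Action you would call `queryAll(context.getKnowledge(), "clients", { entityType: "company", where: [...] })` and map over the combined result.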
Level 2 — A Breakdown For A Dashboard Widget
// Feed a chart: clients per industry.
function clientsByIndustry() {
return context.getKnowledge().query("clients", {
entityType: "company",
aggregate: { op: "group_by", attr: "industry" }
}); // → [{ _id:"retail", count:14 }, { _id:"fintech", count:9 }, …]
}
Level 3 — Vocabulary-Grounded Query (No Guessing)
// Build a filter dropdown from the values that actually exist,
// then query the chosen one — always the real literal.
function companiesForIndustryPicker(chosen) {
var k = context.getKnowledge();
var industries = k.vocabulary("clients", "company").industry; // real values
if (industries.indexOf(chosen) < 0) return []; // not in data
return k.query("clients", {
entityType: "company",
where: [{ attr: "industry", op: "eq", value: chosen }]
});
}
Level 4 — A Relational, Two-Hop Question
// "Who has worked on a project for a retail client?"
// person --worked_on--> project --client_of--> retail company
function peopleOnRetailEngagements() {
return context.getKnowledge().query("clients", {
entityType: "person",
traverse: [
{ relType: "worked_on", direction: "out", as: "projects" },
{ relType: "client_of", direction: "out", as: "clients" }
],
limit: 100
});
}
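The query above follows both hops but does not itself restrict the far end to retail. One option is a post-filter in your own code, sketched here. It rests on an assumption this guide does not confirm: that each returned person carries the joined entities as arrays under the `as` names ("projects", "clients") and that joined entities expose flattened `attributes`. `onlyWithRetailClient` is a hypothetical helper.

```javascript
// Hypothetical post-filter. ASSUMPTION: rows carry joined entities as
// arrays under the `as` names, each with a flattened `attributes` map.
function onlyWithRetailClient(people) {
  return people.filter(function (p) {
    return (p.clients || []).some(function (c) {
      return Boolean(c.attributes) && c.attributes.industry === "retail";
    });
  });
}

// Illustrative rows in the assumed shape.
var retailPeople = onlyWithRetailClient([
  { canonicalName: "Jane Okafor", clients: [{ canonicalName: "Acme Corp", attributes: { industry: "retail" } }] },
  { canonicalName: "Erik Lind", clients: [{ canonicalName: "Paywise", attributes: { industry: "fintech" } }] },
  { canonicalName: "Anna Berg" }
]);
```

A post-filter runs after `limit` has been applied, so it can under-report on large result sets. Where the query layer lets you constrain the far end directly, prefer that.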
Level 5 — A Self-Driving Knowledge Agent
Wire both tools into an agent and let it handle arbitrary exact/relational questions over the corpus. The agent definition:
{
"name": "corpusAnalyst",
"system": "You answer questions about the firm's client corpus. For EXACT or RELATIONAL questions (list all, count, who worked with whom) use the knowledge tools: first call discoverValues to learn the real attribute values, then queryKnowledge. For OPEN-ENDED or EXAMPLE questions use the search tool. Never invent attribute values; discover them. Always pass datasetId 'clients'.",
"tools": [
{ "type": "knowledgeVocabulary", "name": "discoverValues" },
{ "type": "knowledgeQuery", "name": "queryKnowledge" },
{ "type": "search", "name": "searchDocuments" }
]
}
Invoke it from an Action and the agent does discover → query → answer on its own, access-filtered to the calling user:
function askAnalyst(arguments) {
return context.getAIFunctions().invokeAgent("corpusAnalyst", {
arguments: { userMessage: arguments.question }
});
}
Level 6 — A Governed Ingestion Pipeline
Preview-driven tuning, controlled rollout, and monitoring, combined into an admin Action:
function ingestClients(arguments) {
  var k = context.getKnowledge();
  // 1. Tune against a sample before committing the corpus.
  if (arguments.sampleDocId) {
    var p = k.preview("clients", arguments.sampleDocId);
    var merges = p.entities.filter(function (e) { return e.action === "MERGE"; });
    var reviews = p.entities.filter(function (e) { return e.action === "REVIEW"; });
    context.log("preview: " + p.entities.length + " entities, " +
      merges.length + " merges, " + reviews.length + " to review");
    if (arguments.previewOnly) return p; // human checks before go-live
  }
  // 2. Enqueue one bounded backfill pass; re-run this Action until
  //    enq.enqueued settles, because a single call covers only a slice.
  var enq = k.backfill("clients");
  // 3. Report health (counts reflect jobs still in flight).
  var s = k.jobSummary("clients");
  var failures = k.listJobs("clients", "FAILED", 10);
  return { enqueued: enq.enqueued, summary: s, recentFailures: failures };
}
Common Patterns
Declare-once, guard the create. createDataset rejects a
duplicate. Wrap setup so it's safe to run on every deploy: probe
with listEntityTypes (or your own marker) and only create when
absent.
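A minimal sketch of such a guard follows. It treats "the dataset already has at least one entity type" as the marker that setup has run. That marker, the helper name ensureSetup, and the handling of a failed probe are assumptions made for illustration, not documented platform behavior.

```javascript
// Sketch: idempotent setup guard. `k` is the facade from context.getKnowledge().
// ASSUMPTIONS: a dataset with at least one entity type counts as "already set
// up", and listEntityTypes may either throw or return an empty list for a
// dataset that does not exist yet. Verify both against your platform version.
function ensureSetup(k, datasetId, datasetConfig, entityTypeDefs) {
  var existing;
  try {
    existing = k.listEntityTypes(datasetId) || [];
  } catch (e) {
    existing = []; // treat a failed probe as "not set up yet"
  }
  if (existing.length > 0) {
    return { created: false, entityTypes: existing.length };
  }
  // If the probe failed for some other reason, createDataset still rejects
  // a duplicate, so the worst case is a visible error, not a second dataset.
  k.createDataset(datasetId, datasetConfig);
  entityTypeDefs.forEach(function (def) {
    k.addEntityType(datasetId, def);
  });
  return { created: true, entityTypes: entityTypeDefs.length };
}
```

Called from a deploy-time Action as ensureSetup(context.getKnowledge(), "clients", { name: "Clients" }, [...]), the second and later runs return without writing anything.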
Preview before backfill, always. One preview on a
representative document catches a bad extractionHint or an
over-eager dedupKeys before you've extracted ten thousand
documents wrong. It's free — nothing is written.
Vocabulary before query. Whether it's your code or an agent,
read vocabulary to get the real literals, then predicate on
them. This is the difference between a complete answer and a
plausible-looking partial one.
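The pattern can be sketched as a small helper that maps a user's wording onto the literals that actually exist and only then builds the predicate. The helper name and the attribute passed as `attr` (for example "industry") are hypothetical; the vocabulary and query shapes follow the cheat sheet at the end of this guide.

```javascript
// Sketch: discover real literals first, then predicate on them.
// `k` is the facade from context.getKnowledge().
// ASSUMPTION: `attr` names an attribute declared on the entity type; the
// case-insensitive substring match is an illustrative matching rule.
function queryByDiscoveredValues(k, datasetId, entityType, attr, userTerm) {
  var vocab = k.vocabulary(datasetId, entityType);
  var known = vocab[attr] || [];
  var needle = String(userTerm).toLowerCase();
  var literals = known.filter(function (v) {
    return String(v).toLowerCase().indexOf(needle) !== -1;
  });
  if (literals.length === 0) {
    // Nothing in the store matches the term, so skip the query entirely.
    return { literals: [], rows: [] };
  }
  var rows = k.query(datasetId, {
    entityType: entityType,
    where: [{ attr: attr, op: "in", value: literals }],
    limit: 50
  });
  return { literals: literals, rows: rows };
}
```

Returning the matched literals alongside the rows lets the caller show which stored values were used, which makes a surprising result easier to explain.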
Reprocess, don't re-backfill, after a config change. Added a
type? reprocess(dataset, type) touches only stale documents.
backfill re-runs everything — use it for the initial load, not
for incremental config changes.
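As a rough illustration of that flow, the helper below adds a type and then reprocesses only for that type. The helper name is made up for this sketch, and any type definition you pass in (a "skill" type, say) is your own configuration.

```javascript
// Sketch: add a new entity type, then refresh only the documents that are
// stale with respect to it. `k` is the facade from context.getKnowledge().
function addTypeAndReprocess(k, datasetId, typeDef) {
  k.addEntityType(datasetId, typeDef);
  var result = k.reprocess(datasetId, typeDef.type);
  return result.enqueued; // number of documents queued for re-extraction
}
```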
Let the agent count; let RAG illustrate. Give a customer agent both surfaces and a prompt that routes by question class. "How many / list all / who worked with" → knowledge. "Show me an example / what did we say about" → RAG.
Source every hand-written edge. When you addRelationship,
pass sourceDocId/folderIds. It costs nothing and buys you
traceability ("where does it say?") and correct access filtering
on that edge.
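One way to enforce this as a team convention is a thin wrapper that refuses to write an unsourced edge. The wrapper and its error message are illustrative only. The guide does not specify what fromEntity and toEntity must contain, so the sketch passes them through unchanged.

```javascript
// Sketch: require provenance on every hand-written edge.
// `k` is the facade from context.getKnowledge().
// ASSUMPTION: fromEntity / toEntity are in whatever form addRelationship
// expects on your platform version; this helper does not inspect them.
function addSourcedRelationship(k, datasetId, edge, source) {
  if (!source || !source.sourceDocId) {
    throw new Error("addSourcedRelationship: sourceDocId is required");
  }
  var payload = {
    fromEntity: edge.fromEntity,
    toEntity: edge.toEntity,
    relType: edge.relType,
    sourceDocId: source.sourceDocId
  };
  if (source.folderIds) {
    payload.folderIds = source.folderIds; // carried along for access filtering
  }
  return k.addRelationship(datasetId, payload);
}
```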
What To Watch Out For
Extraction is asynchronous — don't query immediately after
backfill. backfill returns an enqueued count, not a
done count. Poll jobSummary until QUEUED and RUNNING
drain before treating the store as complete. A query against a
half-extracted corpus is correct but partial.
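A polling loop along these lines is one way to wait for the queue to drain. This guide does not show how an Action should sleep between polls, so the sketch takes a caller-supplied `pause` function. That parameter, the `maxPolls` cap, and the helper name are assumptions.

```javascript
// Sketch: wait until no jobs are QUEUED or RUNNING, with a poll cap.
// `k` is the facade from context.getKnowledge().
// ASSUMPTIONS: `pause` blocks or yields between polls (supply whatever your
// runtime offers), and counters missing from the summary are treated as 0.
function waitForDrain(k, datasetId, maxPolls, pause) {
  for (var i = 0; i < maxPolls; i++) {
    var s = k.jobSummary(datasetId);
    var pending = (s.QUEUED || 0) + (s.RUNNING || 0);
    if (pending === 0) {
      return { drained: true, polls: i + 1, summary: s };
    }
    if (pause) pause();
  }
  // Cap reached: report the latest summary so the caller can decide.
  return { drained: false, polls: maxPolls, summary: k.jobSummary(datasetId) };
}
```

A caller that gets drained: false can still query, as long as the answer is presented as partial. A non-zero FAILED count in the returned summary is worth checking with listJobs before treating the store as complete.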
Counts are access-relative, and that's by design. Two users running the identical query can get different numbers. Surface this to stakeholders up front so "the count looks low" is understood as filtering, not a defect.
A false merge is worse than a duplicate. If two distinct
companies collapse into one, your "list all clients" is silently
wrong and you may never notice. Stay conservative on
autoMergeThreshold, lean on reviewThreshold/
escalateAmbiguous for the hard cases, and remember that a
visible duplicate is fixable while a hidden false-merge is not.
Validate against config, and trust the errors. An unknown
attr, a bad operator, a missing entityType, or a third
traversal hop all throw a clear IllegalArgumentException — they
do not silently return []. If a query comes back empty, check
that the value exists (via vocabulary) before assuming the
query is wrong.
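The distinction between "the spec was rejected" and "the spec ran and matched nothing" can be made explicit in a wrapper like the one below. It assumes that a rejected spec reaches the Action as a thrown error that try/catch can intercept, and that a non-aggregate query returns an array. The helper name and the status labels are invented for this sketch.

```javascript
// Sketch: classify a query outcome as rejected, ok, unknown-value, or no-match.
// `k` is the facade from context.getKnowledge().
// ASSUMPTION: how the Java exception is wrapped on the script side is not
// specified in this guide; only the message is read here.
function explainQuery(k, datasetId, spec) {
  var rows;
  try {
    rows = k.query(datasetId, spec);
  } catch (e) {
    return { status: "rejected", reason: String(e && e.message ? e.message : e) };
  }
  if (rows.length > 0) {
    return { status: "ok", rows: rows };
  }
  // Empty result: check whether each "eq" literal exists in the vocabulary.
  var vocab = k.vocabulary(datasetId, spec.entityType);
  var unknown = (spec.where || []).filter(function (p) {
    return p.op === "eq" && (vocab[p.attr] || []).indexOf(p.value) === -1;
  });
  return {
    status: unknown.length > 0 ? "unknown-value" : "no-match",
    unknown: unknown,
    rows: []
  };
}
```

An "unknown-value" status points at a literal that needs discovering, while "no-match" means the spec and its values are valid and nothing visible to this user matches.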
backfill is bounded per call. For a large corpus it
enqueues a slice and returns; call it again (or loop) until the
enqueued count settles. It's idempotent, so repeated calls are
safe.
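A loop of that kind might look like the sketch below. It reads "settles" as a pass that enqueues nothing new, which is an interpretation rather than documented behavior, and it adds a `maxPasses` safety cap that is not part of the platform API.

```javascript
// Sketch: call backfill repeatedly until a pass enqueues nothing new.
// `k` is the facade from context.getKnowledge().
// ASSUMPTION: "settled" means `enqueued === 0`; adjust if your platform
// reports a steady non-zero value instead.
function backfillAll(k, datasetId, maxPasses) {
  var total = 0;
  for (var pass = 1; pass <= maxPasses; pass++) {
    var r = k.backfill(datasetId);
    total += r.enqueued;
    if (r.enqueued === 0) {
      return { settled: true, passes: pass, enqueued: total };
    }
  }
  return { settled: false, passes: maxPasses, enqueued: total };
}
```

Settling only means everything has been enqueued. Extraction still runs in the background, so poll jobSummary before treating the store as complete.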
No resolver ≠ no access control. Skipping
setPermittedFolderResolver only drops the pre-filter
optimization; the authoritative per-source record ACL still
runs on every query. But a resolver that wrongly returns an
empty set means "see nothing" — return null, never empty, for
"no derivation available."
Reference Cheat Sheet
Facade — context.getKnowledge()
Config
createDataset(datasetId, { name, extractionModel?, resolutionModel? }) // → dataset config
addEntityType(datasetId, { type, attributes?, extractionHint?, dedupKeys?,
    autoMergeThreshold?, reviewThreshold?, escalateAmbiguous? })
addRelationshipType(datasetId, { relType, fromType, toType, extractionHint? })
addTaxonomyTerm(datasetId, { term, categoryPath?, aliases?, entityType? })
listEntityTypes(datasetId) // → [{ type, version, attributes, dedupKeys, … }]
listRelationshipTypes(datasetId) // → [{ relType, fromType, toType, version, … }]
Ingestion & jobs
preview(datasetId, documentId) // dry run → { entities[], relationships[] }, nothing written
backfill(datasetId) // → { datasetId, enqueued } (idempotent, bounded)
reprocess(datasetId, entityType) // → { datasetId, entityType, enqueued } (only stale docs)
jobStatus(datasetId, documentId) // → job map | null
listJobs(datasetId, status, limit) // status ∈ QUEUED|RUNNING|DONE|FAILED ; limit ≤ 200
jobSummary(datasetId) // → { QUEUED, RUNNING, DONE, FAILED }
Query
query(datasetId, spec) // entities, or aggregate rows
query(datasetId, spec, folderIds) // …scoped to folderIds (ACL still applies)
vocabulary(datasetId, entityType) // → { attrKey: [distinct values…], …taxonomy facets }
vocabulary(datasetId, entityType, folderIds)
addRelationship(datasetId, { fromEntity, toEntity, relType,
    sourceDocId?, sourceDocName?, folderIds?, confidence? })
Query Spec DSL
{
  entityType: "company", // REQUIRED
  where: [ // AND-ed predicates
    { attr, op: "eq", value },
    { attr, op: "in", value: [ … ] },
    { attr, op: "contains", value },
    { attr, op: "range", min, max },
    { attr, op: "exists" }
  ],
  traverse: [ // up to 2 hops
    { relType, direction: "out"|"in", as }
  ],
  aggregate: { op: "count" }, // or
  aggregate: { op: "group_by", attr }, // or
  aggregate: { op: "distinct", attr },
  limit: 50, // default 50
  offset: 0
}
// Any other top-level field → rejected.
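Because limit defaults to 50, an exhaustive "list every …" answer generally means paging with limit and offset. The sketch below shows one way to do that for a non-aggregate spec. It assumes query returns an array in that case and that a page shorter than the page size marks the end. The helper name and the `maxRows` cap are illustrative.

```javascript
// Sketch: page through a non-aggregate query using limit/offset.
// `k` is the facade from context.getKnowledge().
// ASSUMPTIONS: non-aggregate queries return an array, and a short page
// means there are no more rows. `maxRows` is a safety cap for this example.
function queryAll(k, datasetId, spec, pageSize, maxRows) {
  if (pageSize <= 0) {
    throw new Error("queryAll: pageSize must be positive");
  }
  var all = [];
  var offset = 0;
  while (all.length < maxRows) {
    // Copy the caller's spec so only limit and offset are overridden.
    var pageSpec = {};
    Object.keys(spec).forEach(function (key) { pageSpec[key] = spec[key]; });
    pageSpec.limit = pageSize;
    pageSpec.offset = offset;
    var page = k.query(datasetId, pageSpec);
    all = all.concat(page);
    if (page.length < pageSize) break;
    offset += pageSize;
  }
  return all.slice(0, maxRows);
}
```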
Entity Result Shape (No Aggregate)
{
  id, entityType, canonicalName, normalizedName,
  attributes: { key: value, … }, // flattened, first-wins
  sourceDocs: [ … ], folderIds: [ … ]
}
Agent Tools (Chain JSON)
{ "type": "knowledgeVocabulary", "name": "discoverValues" }
{ "type": "knowledgeQuery", "name": "queryKnowledge" }
Both take datasetId as a tool argument, plus entityType (for
knowledgeVocabulary) or spec (for knowledgeQuery); both accept an
optional folderIds; both are always access-filtered to the
current user.
Access Control (Java Startup)
KnowledgeAccess.setPermittedFolderResolver(
(scope, user) -> aclFoldersFor(user) /* superset, or null — never empty */ );