Knowledge

Pull structured entities and relationships out of your documents into a queryable store — for the exact, exhaustive, and relational questions vector search can't answer

Knowledge Guide

This guide is for the developer who has documents and needs exact answers from them. You have a folder full of Excel project sheets, PowerPoint case studies, PDFs and Word files, and your users keep asking questions that vector search can't answer well:

  • "List every company we've worked with in retail."
  • "How many of our people have a competence in Java?"
  • "Which consultants have worked with a fintech client?"

These are exhaustive, exact, and relational questions. Retrieval-augmented generation (RAG) is great at fuzzy, open-ended ones ("find a project like this", "what did we say about pricing") but it cannot guarantee a complete, deduplicated list, and it cannot count. The reason is simple: the entities — the company names, the people, the skills — never leave the document. Only a lossy summary embedding does.

The knowledge layer fixes that. It pulls structured entities and the relationships between them out of your documents and into a queryable store, so the questions above become deterministic queries instead of similarity guesses. It runs alongside the existing RAG layer, not instead of it.

This guide explains the mental model, walks you through the common patterns from a trivial "hello world" to a relational multi-hop agent, and ends with a reference cheat sheet you can keep open while you work.

What You Get

The platform gives you one cooperating set of capabilities, reached through context.getKnowledge() inside any Action, plus two native AITools the LLM can call on its own.

Capability | Accessor | What it does
Config | context.getKnowledge().createDataset(...), addEntityType(...), addRelationshipType(...), addTaxonomyTerm(...) | Declare what to extract and how to deduplicate it: no code, just configuration.
Ingestion | preview(...), backfill(...), reprocess(...) | Dry-run extraction, then enqueue documents for extraction in the background.
Jobs | jobStatus(...), listJobs(...), jobSummary(...) | Watch extraction progress and inspect failures.
Query | query(...), vocabulary(...) | Run exact lookups, aggregations, and 1–2 hop relational queries, always access-filtered.
Relationships | addRelationship(...), listRelationshipTypes(...) | Read the relationship vocabulary; write edges by hand when you need to.
Agent tools | knowledge_query, knowledge_vocabulary | The LLM discovers values and runs queries itself, mid-turn.

Everything is multi-tenant and app-scoped by construction. A dataset is keyed by (tenant, app, datasetId); you never see another tenant's or another app's entities. Everything is also access-filtered per fact: the same query returns different results for different users, and that is correct behaviour (see Access Control).

The same surface is reachable over the agent-tool layer for LLM-driven use. This guide focuses on the JS Action surface — that's where most integration code lives — and then shows how to hand the same power to an agent.

The Mental Model In One Picture

If you take away one mental model from this guide, take this one:

        DOCUMENTS                         KNOWLEDGE STORE
        ─────────                         ───────────────
        (Excel / PPT /                    (entities + facts + edges,
         PDF / Word)                       deduplicated, exact, queryable)

   ┌─────────────────────────┐      ┌──────────────────────────────┐
   │  project-list.xlsx      │      │  entity: "Acme Corp"          │
   │  ┌────────┬─────────┐   │      │    type: company              │
   │  │ Person │ Client  │   │ ───► │    aliases: [Acme, Acme Inc.] │
   │  ├────────┼─────────┤   │      │    facts:                     │
   │  │ Jane   │ Acme    │   │      │      industry=retail (doc A)  │
   │  │ Erik   │ Acme    │   │      │      hq=Boston       (doc B)  │
   │  └────────┴─────────┘   │      └──────────────────────────────┘
   └─────────────────────────┘                 ▲           │
        │                                       │           │
        │  EXTRACTION + RESOLUTION              │           │  QUERY
        │  (background worker,                  │           │  ( where / traverse
        │   per-row, dedup inline)              │           │    / aggregate )
        │                                       │           ▼
        └───────────────────────────────────────   ┌──────────────────────┐
                                                    │  "list all retail    │
            ┌───────────────────────────────┐       │   companies" → exact │
            │  edge: Jane —worked_on→ Acme   │ ◄───► │  "count per industry"│
            │  edge: Erik —worked_on→ Acme   │       │  "who worked w/ Acme"│
            └───────────────────────────────┘       └──────────────────────┘

A document goes in. The platform classifies each block as tabular (rows of entities) or prose, then runs an LLM extraction pass that pulls out entities (a company, a person, a skill), their attribute facts (industry, headcount, each carrying its own source), and the relationships between them (Jane worked_on Acme). Before anything is written, each entity is resolved against what's already there — "Acme", "Acme Inc." and "Acme Corp" collapse into one canonical node — so your counts and lists are clean.
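To make the resolution step concrete, here is a deliberately naive sketch of collapsing spelling variants into one canonical node. It is not the platform's resolver (which, as described later, works from dedup keys, similarity thresholds and an optional resolution model). The `normalizeName` rule, the suffix list and the "first mention becomes canonical" choice are all assumptions made for illustration.

```javascript
// Illustrative only: a toy resolver, NOT the platform's algorithm.
// Assumption: stripping a few legal suffixes is enough to match variants.
var LEGAL_SUFFIXES = ["inc", "corp", "corporation", "ltd", "llc"];

function normalizeName(name) {
    var words = name.toLowerCase().replace(/[.,]/g, "").split(/\s+/)
        .filter(function (w) { return w.length > 0; });
    // Drop trailing legal suffixes ("Acme Inc." -> "acme").
    while (words.length > 1 &&
           LEGAL_SUFFIXES.indexOf(words[words.length - 1]) >= 0) {
        words.pop();
    }
    return words.join(" ");
}

function resolveMentions(mentions) {
    var byKey = Object.create(null);
    var nodes = [];
    mentions.forEach(function (m) {
        var key = normalizeName(m);
        var node = byKey[key];
        if (!node) {
            // First mention of this key becomes the canonical node.
            node = { canonicalName: m, aliases: [] };
            byKey[key] = node;
            nodes.push(node);
        } else if (m !== node.canonicalName && node.aliases.indexOf(m) < 0) {
            // Later variants are kept as aliases, not discarded.
            node.aliases.push(m);
        }
    });
    return nodes;
}

var nodes = resolveMentions(["Acme Corp", "Acme", "Acme Inc.", "Globex"]);
```

Four mentions go in and two nodes come out: one for Acme carrying its two other spellings as aliases, and one for Globex. That collapse is what makes a later count or list trustworthy.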

At query time you don't do similarity search. You issue a structured query: select an entity type, filter by attribute predicates, optionally traverse one or two relationship hops, and optionally reduce to a count or a group-by. The answer is exact and exhaustive, and it's filtered to exactly the documents the asking user is allowed to see.

That's the whole story. Everything below is mechanics.

Quick Start: From Documents To An Exact List

Let's do the smallest end-to-end thing that has value. You have an app with a Document model holding a pile of client documents. You want to answer "list every company, by industry."

Four steps: declare a dataset, declare the entity type, backfill the documents, query.

// action: setupAndQueryClients
function setupAndQueryClients() {
    var k = context.getKnowledge();

    // 1. Declare a dataset. One per (tenant, app, datasetId): a duplicate
    //    id is rejected, hence the existence guard.
    if (!datasetExists(k, "clients")) {
        k.createDataset("clients", {
            name: "Client corpus",
            extractionModel: "gemini-3.1-flash-lite"
        });

        // 2. Declare what to extract and how to deduplicate it
        k.addEntityType("clients", {
            type:           "company",
            attributes:     ["industry", "hq", "headcount"],
            extractionHint: "Companies the firm has worked with as a client.",
            dedupKeys:      ["canonicalName"]
        });
    }

    // 3. Enqueue Document-model records for extraction. This runs in the
    //    background and returns immediately; each call is bounded, so call
    //    it again to continue a large corpus.
    var enq = k.backfill("clients");
    context.log("Enqueued " + enq.enqueued + " documents");

    // 4. Query — once extraction has run (watch jobSummary), this is
    //    an exact, deduplicated, access-filtered list.
    var companies = k.query("clients", {
        entityType: "company",
        where: [{ attr: "industry", op: "eq", value: "retail" }],
        limit: 100
    });
    companies.forEach(function (c) {
        context.log(c.canonicalName + "  (" + c.attributes.hq + ")");
    });
}

function datasetExists(k, id) {
    try { k.listEntityTypes(id); return true; }
    catch (e) { return false; }
}

That's it. No collections to create, no schema migration. You declared two things (a dataset and a type), pointed the pipeline at your documents, and got back an exact list. Adding industry counts is one more query:

var counts = k.query("clients", {
    entityType: "company",
    aggregate: { op: "group_by", attr: "industry" }
});
// → [ { _id: "retail", count: 14 }, { _id: "fintech", count: 9 }, ... ]

The rest of this guide unpacks each step and shows how far it goes.

The Four Building Blocks

Everything in the knowledge layer is built from four nouns.

Dataset

A dataset is the container — a named knowledge graph scoped to your app. It holds the extraction configuration (entity types, relationship types, taxonomy) and owns all the entities and edges extracted under it. You'll usually have one per problem domain ("clients", "products", "case-law"), keyed by a datasetId you choose. A dataset also pins the models used for extraction and resolution.

Entity

An entity is a resolved, real-world thing of a configured entityType — a company, a person, a skill. After deduplication it is a single canonical node, even though it was mentioned in twenty documents under five spellings. It carries:

  • canonicalName — the resolved, preferred form.
  • aliases[] — every observed variant ("IBM", "I.B.M.", "International Business Machines"), kept, not discarded.
  • attributes[] — an array of attribute facts, not a flat key/value bag (see below).
  • sourceDocs[] / folderIds[] — the derived union of every source that contributed a fact, used for fast access pre-filtering.

When you read an entity back through query(...), attributes are flattened to a convenient { key: value } map (first-wins when a merged entity holds several facts for the same key). The raw per-fact structure stays underneath, where access control needs it.
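The flattening rule is easy to picture in code. The sketch below assumes "first" means array order in the raw fact list, which is an assumption of this example rather than documented behaviour; `flattenFacts` is a hypothetical helper, not a platform call.

```javascript
// Sketch: flatten an attribute-fact array into a { key: value } map,
// keeping the first fact seen for each key ("first-wins").
// Assumption: "first" means array order in the raw fact list.
function flattenFacts(facts) {
    var flat = {};
    facts.forEach(function (f) {
        if (!Object.prototype.hasOwnProperty.call(flat, f.key)) {
            flat[f.key] = f.value;
        }
    });
    return flat;
}

var facts = [
    { key: "industry", value: "retail",  sourceDocId: "docA", confidence: 0.9 },
    { key: "hq",       value: "Boston",  sourceDocId: "docB", confidence: 0.8 },
    { key: "industry", value: "apparel", sourceDocId: "docC", confidence: 0.7 }
];
var flat = flattenFacts(facts);
// flat.industry is "retail": the later "apparel" fact stays in the raw
// array but does not override the first one in the flattened view.
```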

Attribute Fact

This is the subtle, important one. An attribute is not a bare property on the entity. It is a { key, value, sourceDocId, folderIds, confidence } record. "Acme is in retail" is a fact asserted by a specific document. The same entity can carry the same key from several documents, each with its own source and its own access.

Why it matters: it's what lets one canonical "Acme" node present a different view of itself to different users (you only see the facts whose source you can read), and it's what lets a user ask "where does it say Acme is in retail?" and be pointed at the source document. Facts carry provenance because provenance does triple duty: access, traceability, and dedup signal.

Relationship

A relationship is a directed, typed edge between two resolved entities: Jane —worked_on→ "Project Atlas", "Project Atlas" —client_of→ Acme. Edges are created between canonical ids, never between raw name strings, so the graph doesn't fragment across spelling variants. The relationship vocabulary is small and typed (you declare it), which keeps traversal queries clean. Each edge carries confidence and a source, just like a fact.

Tabular rows are the high-confidence source for relationships: a row person | project | client | year is three edges, stated with high reliability because row proximity is the relationship. Prose-derived edges ("Anna led the team at IBM") are extracted too, but at lower confidence.

Configuring A Dataset

Configuration is the feature — adding or tuning a type is data, not code. All config writes are on context.getKnowledge().

Create The Dataset

var ds = context.getKnowledge().createDataset("clients", {
    name:            "Client corpus",
    extractionModel: "gemini-3.1-flash-lite", // bulk per-row extraction
    resolutionModel: "gemini-3.1-pro"         // ambiguous-merge escalation
});

createDataset is one-per-(tenant, app, datasetId) — a duplicate id is rejected, so guard it (see the datasetExists helper above). The two model fields are optional; they let you run cheap, fast extraction on the bulk path and escalate only genuinely ambiguous merge decisions to a stronger model.

Add Entity Types

An entity type declares what to pull out and, critically, how to decide two extractions are the same thing:

context.getKnowledge().addEntityType("clients", {
    type:           "company",
    attributes:     ["industry", "hq", "headcount", "ticker"],
    extractionHint: "A client company the firm has done work for. " +
                    "Prefer the legal name; capture stock ticker if present.",
    dedupKeys:      ["canonicalName", "ticker"],
    autoMergeThreshold: 0.92,  // ≥ this similarity → merge automatically
    reviewThreshold:    0.75,  // ≥ this but < auto → flag for review
    escalateAmbiguous:  true   // route hard calls to the resolution model
});

context.getKnowledge().addEntityType("clients", {
    type:           "person",
    attributes:     ["title", "email"],
    extractionHint: "A consultant or client contact named in the document.",
    dedupKeys:      ["email"],     // people without a hard id stay conservative
    reviewThreshold: 0.80,
    escalateAmbiguous: true
});

The service stamps version = 1 and an addedAt on each type; duplicate type names are rejected. The thresholds encode the "prefer a duplicate over a false merge" principle: a false merge ("two different people collapsed into one") gives silently wrong answers and is hard to detect; a duplicate is visible and fixable. Be conservative, especially for people.
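One way to read the two thresholds is as a three-way decision. The sketch below assumes a similarity score between 0 and 1, assumes that anything below reviewThreshold becomes a new entity, and treats a missing autoMergeThreshold as "never auto-merge". Those last two points are choices of this sketch, not documented platform behaviour, and `resolutionAction` is a hypothetical name.

```javascript
// Sketch of the decision the two thresholds describe. Assumptions:
// similarity is a 0..1 score; below reviewThreshold means CREATE;
// a missing autoMergeThreshold means "never merge automatically".
function resolutionAction(similarity, typeConfig) {
    var auto = typeConfig.autoMergeThreshold;
    var review = typeConfig.reviewThreshold;
    if (typeof auto === "number" && similarity >= auto) return "MERGE";
    if (typeof review === "number" && similarity >= review) return "REVIEW";
    return "CREATE";
}

var companyCfg = { autoMergeThreshold: 0.92, reviewThreshold: 0.75 };
var personCfg  = { reviewThreshold: 0.80 }; // no auto-merge configured

var decisions = [
    resolutionAction(0.95, companyCfg), // at or above auto-merge
    resolutionAction(0.80, companyCfg), // between review and auto-merge
    resolutionAction(0.50, companyCfg), // below both thresholds
    resolutionAction(0.99, personCfg)   // people never auto-merge here
];
```

Raising autoMergeThreshold widens the REVIEW band and shrinks the MERGE band, which is the conservative direction the paragraph above recommends.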

Read them back any time:

context.getKnowledge().listEntityTypes("clients").forEach(function (t) {
    context.log(t.type + " v" + t.version + " — dedup on " + t.dedupKeys.join(","));
});

Add Relationship Types

Declare the small, typed edge vocabulary. Each type is constrained to a fromType → toType, which both guides extraction and validates traversal queries:

var k = context.getKnowledge();
k.addRelationshipType("clients", {
    relType: "worked_on", fromType: "person",  toType: "project",
    extractionHint: "The person staffed on or delivering the project."
});
k.addRelationshipType("clients", {
    relType: "client_of", fromType: "project", toType: "company",
    extractionHint: "The client company a project was delivered for."
});
k.addRelationshipType("clients", {
    relType: "has_skill", fromType: "person",  toType: "skill"
});

The combined extraction pass then emits these edges automatically from documents (tabular-first). List them with k.listRelationshipTypes("clients").

Keep the vocabulary small. A handful of well-defined types yields cleaner queries and less extraction noise than open-ended "relate freely." Add a type when a concrete query need appears.

Add A Controlled Vocabulary (Taxonomy)

The "IT is too broad" problem: a user asks for "people with IT skills", but the data says "Java", "Kubernetes", "React". Declare a taxonomy so broad terms resolve to their underlying values instead of being guessed:

var k = context.getKnowledge();
k.addTaxonomyTerm("clients", {
    term: "Financial Technology",
    aliases: ["fintech", "fin-tech", "fin tech"],
    entityType: "industry"
});
k.addTaxonomyTerm("clients", {
    term: "Java",
    categoryPath: "IT/Backend",
    aliases: ["java se", "jdk"],
    entityType: "skill"
});

At extraction time, variant surface forms normalize to the canonical term; at query time the agent can resolve a category ("IT/Backend") to its members. This is what stops "fintech" and "Financial Technology" from being counted as two industries.
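As a rough picture of what that normalization amounts to, the sketch below builds a case-insensitive lookup from the two taxonomy terms declared above. The helper names are hypothetical, and resolving a category by path prefix is an assumption of this example; the platform's matching may be richer.

```javascript
// Sketch: map variant surface forms to a canonical taxonomy term, and
// resolve a category path to its member terms. Illustrative only.
var terms = [
    { term: "Financial Technology",
      aliases: ["fintech", "fin-tech", "fin tech"], entityType: "industry" },
    { term: "Java", categoryPath: "IT/Backend",
      aliases: ["java se", "jdk"], entityType: "skill" }
];

function buildTaxonomyIndex(terms) {
    var index = Object.create(null);
    terms.forEach(function (t) {
        index[t.term.toLowerCase()] = t.term;
        (t.aliases || []).forEach(function (a) {
            index[a.toLowerCase()] = t.term;
        });
    });
    return index;
}

function canonicalTerm(index, surfaceForm) {
    var key = surfaceForm.trim().toLowerCase();
    // Unknown forms pass through unchanged rather than being guessed.
    return key in index ? index[key] : surfaceForm;
}

// Assumption: a category matches its own path and any deeper path.
function membersOf(terms, categoryPath) {
    return terms.filter(function (t) {
        return typeof t.categoryPath === "string" &&
            (t.categoryPath === categoryPath ||
             t.categoryPath.indexOf(categoryPath + "/") === 0);
    }).map(function (t) { return t.term; });
}

var index = buildTaxonomyIndex(terms);
```

With this index, "FinTech" and "fin tech" both resolve to "Financial Technology", and asking for the members of "IT" yields the Java term declared under "IT/Backend".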

Ingestion: Preview, Backfill, Reprocess, Jobs

Extraction runs as a background job per document, driven off the platform message bus and executed by an extraction worker. Your code enqueues; the worker extracts, resolves, and writes. You never block an Action on an LLM extraction call.

Preview Before You Commit (preview)

preview is a dry run: it shows you what would be extracted and resolved from one document — nothing written, no job row created. Use it to tune an extractionHint or a dedupKeys list before a full backfill:

var p = context.getKnowledge().preview("clients", "doc_6631a2");

p.entities.forEach(function (e) {
    // action is CREATE (new), MERGE (folds into an existing entity),
    // or REVIEW (ambiguous — flagged, not auto-merged)
    context.log(e.entityType + "  " + e.canonicalName +
                "  → " + e.action +
                (e.targetEntityId ? " into " + e.targetEntityId : "") +
                "  (conf " + e.confidence + ")");
});

p.relationships.forEach(function (r) {
    context.log(r.fromName + " —" + r.relType + "→ " + r.toName +
                "  (conf " + r.confidence + ")");
});

If you see two obviously-different companies coming back as one MERGE, tighten dedupKeys or raise autoMergeThreshold and preview again. This is your tuning loop.

Backfill The Corpus (backfill)

backfill walks every Document-model record in the app and enqueues each one for extraction into the dataset. It is idempotent (a re-enqueue replaces the queued job and re-runs the document) and bounded per call (it processes a slice; call it again to continue a large corpus):

var r = context.getKnowledge().backfill("clients");
context.log("Enqueued " + r.enqueued + " documents this pass");

Reprocess After A Config Change (reprocess)

When you add or change an entity type after documents are already extracted, you don't want to re-run the whole corpus — only the documents that the new/bumped type still needs. reprocess re-enqueues exactly those: documents whose completed job lacks the type or predates its current version.

// You just added a "skill" type. Catch up only what's stale:
var r = context.getKnowledge().reprocess("clients", "skill");
context.log("Reprocessing " + r.enqueued + " documents for 'skill'");

Watch The Jobs (jobSummary, listJobs, jobStatus)

Extraction is asynchronous, so you watch it through the job API. Statuses are QUEUED, RUNNING, DONE, FAILED.

var k = context.getKnowledge();

// Dashboard counts
var s = k.jobSummary("clients");
context.log(s.DONE + " done, " + s.QUEUED + " queued, " +
            s.RUNNING + " running, " + s.FAILED + " failed");

// Drill into failures (newest first; limit caps at 200)
k.listJobs("clients", "FAILED", 20).forEach(function (j) {
    context.log(j.documentRecordId + " (attempt " + j.attempts + "): " + j.error);
});

// One document's status
var job = k.jobStatus("clients", "doc_6631a2");
if (job) context.log("status=" + job.status + " types=" + job.extractedTypes);

Failed jobs are retried with backoff by the platform; the attempts and error fields tell you what's happening. A healthy ingestion trends QUEUED → DONE with FAILED near zero.
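If you want a single "are we done, and how healthy was it?" signal for a dashboard, a small pure function over the summary is enough. This sketch assumes the summary exposes numeric QUEUED, RUNNING, DONE and FAILED counts, as the snippet above uses them; `ingestionProgress` is a hypothetical helper and missing counts are treated as 0.

```javascript
// Sketch: summarise a jobSummary-shaped object.
// Assumption: numeric QUEUED / RUNNING / DONE / FAILED counts;
// any missing count is treated as 0.
function ingestionProgress(summary) {
    function n(v) { return typeof v === "number" ? v : 0; }
    var queued = n(summary.QUEUED);
    var running = n(summary.RUNNING);
    var done = n(summary.DONE);
    var failed = n(summary.FAILED);
    var total = queued + running + done + failed;
    return {
        total: total,
        settled: queued + running === 0,   // nothing left in flight
        failureRate: total === 0 ? 0 : failed / total
    };
}

var progress = ingestionProgress({ QUEUED: 0, RUNNING: 0, DONE: 98, FAILED: 2 });
```

Inside an Action you would pass it the result of k.jobSummary("clients") and only run user-facing queries once settled is true.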

Querying The Knowledge Store

This is where the value is realised. You issue a structured query through a small, validated DSL — never raw Mongo. The structure is the point: because the agent or app can only ever produce a well-formed query of this shape, the access-control stages can never be bypassed, and injection is impossible.

var rows = context.getKnowledge().query("clients", {
    entityType: "company",                    // required
    where:      [ /* attribute predicates */ ],
    traverse:   [ /* 1–2 relationship hops */ ],
    aggregate:  { /* optional terminal reduction */ },
    limit:      50,                           // default 50
    offset:     0
});

Any top-level field outside {entityType, where, traverse, aggregate, limit, offset} is rejected.

where: Attribute Predicates

A where is a list of { attr, op, value | min/max } clauses, AND-ed together. The operator is one of a whitelisted set — nothing else is expressible:

op | Shape | Meaning
eq | { attr, op:"eq", value } | attribute equals value
in | { attr, op:"in", value:[…] } | attribute is one of a set
contains | { attr, op:"contains", value } | attribute contains value (substring / member)
range | { attr, op:"range", min, max } | numeric/date range
exists | { attr, op:"exists" } | the attribute is present at all

// Retail companies headquartered in Boston with a known headcount
context.getKnowledge().query("clients", {
    entityType: "company",
    where: [
        { attr: "industry", op: "eq",     value: "retail" },
        { attr: "hq",       op: "eq",     value: "Boston" },
        { attr: "headcount",op: "exists" }
    ]
});

// Companies in any of three industries, headcount 500–5000
context.getKnowledge().query("clients", {
    entityType: "company",
    where: [
        { attr: "industry",  op: "in",    value: ["retail", "fintech", "logistics"] },
        { attr: "headcount", op: "range", min: 500, max: 5000 }
    ]
});

Every attr and the entityType are validated against the dataset config — query a typo and you get a clear error, not a silently empty result.
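If you build specs dynamically (from a form, say), a client-side pre-check can fail fast before the call. The sketch below mirrors only the rules stated in this guide: the allowed top-level fields, the required entityType, the whitelisted operators and the two-hop limit. `checkSpec` is a hypothetical helper; the service's own validation remains authoritative and also checks names against the dataset config, which this sketch does not.

```javascript
// Sketch of a client-side pre-check. The service performs the
// authoritative validation; this only catches obvious shape errors.
var ALLOWED_FIELDS = ["entityType", "where", "traverse", "aggregate", "limit", "offset"];
var ALLOWED_OPS = ["eq", "in", "contains", "range", "exists"];

function checkSpec(spec) {
    var errors = [];
    Object.keys(spec).forEach(function (field) {
        if (ALLOWED_FIELDS.indexOf(field) < 0) {
            errors.push("unknown field: " + field);
        }
    });
    if (typeof spec.entityType !== "string" || spec.entityType === "") {
        errors.push("entityType is required");
    }
    (spec.where || []).forEach(function (clause, i) {
        if (ALLOWED_OPS.indexOf(clause.op) < 0) {
            errors.push("where[" + i + "]: unsupported op " + clause.op);
        }
        if (clause.op === "in" && !Array.isArray(clause.value)) {
            errors.push("where[" + i + "]: 'in' needs an array value");
        }
    });
    if ((spec.traverse || []).length > 2) {
        errors.push("at most two traverse hops");
    }
    return errors;
}

var okErrors = checkSpec({
    entityType: "company",
    where: [{ attr: "industry", op: "eq", value: "retail" }]
});
var badErrors = checkSpec({ entityType: "company", sort: "name" });
```

An empty array means the spec has the right shape; anything else is a list of human-readable problems you can log or show.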

aggregate: Count, Group, Distinct

Add a terminal aggregate to reduce instead of listing. Without one, you get the matched entities; with one, you get aggregate rows.

op | Needs attr? | Returns
count | no | a single row { total: N }
group_by | yes | one row per distinct value: { _id: value, count: N }, highest count first
distinct | yes | one row per distinct value: { _id: value }, sorted

// How many fintech companies?
context.getKnowledge().query("clients", {
    entityType: "company",
    where: [{ attr: "industry", op: "eq", value: "fintech" }],
    aggregate: { op: "count" }
});

// Company count per industry (the classic "give me the breakdown")
context.getKnowledge().query("clients", {
    entityType: "company",
    aggregate: { op: "group_by", attr: "industry" }
});

// What industries do we even have?
context.getKnowledge().query("clients", {
    entityType: "company",
    aggregate: { op: "distinct", attr: "industry" }
});

This is the class of question RAG simply cannot answer: a similarity search returns the most similar chunks, never a count. Here, "14 retail clients" is a fact.
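For intuition about the three result shapes, here is an in-memory model of the aggregate ops over already-flattened entities. It is not how the platform computes them. It assumes entities look like { attributes: { key: value } } and that entities lacking the attribute are skipped by group_by and distinct, which is a choice of this sketch.

```javascript
// In-memory model of the three aggregate ops, for intuition only.
// Assumption: entities lacking the attribute are skipped.
function aggregateLocal(entities, agg) {
    if (agg.op === "count") return [{ total: entities.length }];

    var counts = Object.create(null);
    entities.forEach(function (e) {
        var v = e.attributes[agg.attr];
        if (v === undefined) return;
        counts[v] = (counts[v] || 0) + 1;
    });
    var values = Object.keys(counts);

    if (agg.op === "distinct") {
        return values.sort().map(function (v) { return { _id: v }; });
    }
    if (agg.op === "group_by") {
        return values
            .map(function (v) { return { _id: v, count: counts[v] }; })
            .sort(function (a, b) { return b.count - a.count; });
    }
    throw new Error("unsupported aggregate op: " + agg.op);
}

var sample = [
    { attributes: { industry: "retail" } },
    { attributes: { industry: "fintech" } },
    { attributes: { industry: "retail" } },
    { attributes: { hq: "Boston" } }
];
var total    = aggregateLocal(sample, { op: "count" });
var grouped  = aggregateLocal(sample, { op: "group_by", attr: "industry" });
var distinct = aggregateLocal(sample, { op: "distinct", attr: "industry" });
```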

Vocabulary: Query Against Reality, Not Guesses

Before you (or an agent) filter on industry = "fintech", you need to know the data actually says "fintech" and not "Financial Technology" or "FinTech". vocabulary returns the distinct attribute values and taxonomy facets that actually exist for a type — computed over the access-filtered set, so it never leaks a value the user can't see.

var vocab = context.getKnowledge().vocabulary("clients", "company");
// → { industry: ["retail","fintech","logistics", …],
//     hq:       ["Boston","London", …],
//     … plus taxonomy facets … }

The pattern is vocabulary first, then query: read the real values, pick the right literal, then predicate on it. This is the single most effective habit for getting complete answers — it's why "list retail clients" returns 14 and not 9 (you queried the value that exists, not the one you assumed).
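In code, "pick the right literal" can be as small as the helper below. It assumes the vocabulary result exposes an array of values per attribute key, as the example above shows for industry, and it matches only on case and surrounding whitespace. The function names are hypothetical.

```javascript
// Sketch: map free-form user input onto a value that really exists in
// the vocabulary. Matching is case- and whitespace-insensitive only.
function pickLiteral(values, userInput) {
    var wanted = userInput.trim().toLowerCase();
    for (var i = 0; i < values.length; i++) {
        if (String(values[i]).trim().toLowerCase() === wanted) {
            return values[i];
        }
    }
    return null; // not in the data: do not query a guessed value
}

function whereForIndustry(vocab, userInput) {
    var literal = pickLiteral(vocab.industry || [], userInput);
    if (literal === null) return null;
    return [{ attr: "industry", op: "eq", value: literal }];
}

var vocabSample = { industry: ["retail", "FinTech", "logistics"] };
var clause = whereForIndustry(vocabSample, " fintech ");
```

The returned clause carries the stored spelling ("FinTech"), not the user's, so the predicate matches what the data actually says. A null result is the cue to tell the user the value does not exist rather than return an empty list.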

Relationships And Traversal

Relationships unlock the query class RAG never could. A traverse list adds up to two relationship hops to a query; each hop is { relType, direction, as }, where direction is out (follow the edge forwards, the default) or in (follow it backwards), and as names the joined set in the output.

// "Which people worked on a project for a retail client?"
// person --worked_on--> project --client_of--> company(retail)
context.getKnowledge().query("clients", {
    entityType: "person",
    traverse: [
        { relType: "worked_on", direction: "out", as: "projects" },
        { relType: "client_of", direction: "out", as: "clients"  }
    ]
    // (filter the far end via where on the joined attributes, or
    //  narrow the start set, depending on your model)
});

// Reverse direction: "who is staffed on Project Atlas?"
// start from the project, walk worked_on backwards to people
context.getKnowledge().query("clients", {
    entityType: "project",
    where:    [{ attr: "name", op: "eq", value: "Project Atlas" }],
    traverse: [{ relType: "worked_on", direction: "in", as: "people" }]
});

A traversal is only as good as the resolution underneath it — the whole reason edges are built on canonical ids is so "IBM" and "I.B.M." don't fragment your graph and make relational queries incomplete again. More than two hops is rejected: the design deliberately stays within MongoDB's comfortable reach rather than adopting a graph database.
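The hop semantics can be modelled in a few lines over a plain edge list. This is an in-memory illustration only, not the platform's implementation: it assumes edges are { from, to, relType } records between canonical ids and returns just the ids reached, whereas the real query returns joined sets named by as.

```javascript
// In-memory model of traverse hops, for intuition only.
// Edges connect canonical entity ids: { from, to, relType }.
function hop(ids, edges, relType, direction) {
    var next = [];
    edges.forEach(function (e) {
        if (e.relType !== relType) return;
        // "out" follows the edge forwards, "in" follows it backwards.
        var src = direction === "in" ? e.to : e.from;
        var dst = direction === "in" ? e.from : e.to;
        if (ids.indexOf(src) >= 0 && next.indexOf(dst) < 0) next.push(dst);
    });
    return next;
}

function traverseLocal(startIds, edges, hops) {
    if (hops.length > 2) throw new Error("more than two hops is rejected");
    return hops.reduce(function (ids, h) {
        return hop(ids, edges, h.relType, h.direction || "out");
    }, startIds);
}

var edges = [
    { from: "jane",  to: "atlas", relType: "worked_on" },
    { from: "erik",  to: "atlas", relType: "worked_on" },
    { from: "atlas", to: "acme",  relType: "client_of" }
];

// person --worked_on--> project --client_of--> company
var janeClients = traverseLocal(["jane"], edges, [
    { relType: "worked_on", direction: "out" },
    { relType: "client_of", direction: "out" }
]);

// Reverse: who is staffed on the project?
var atlasPeople = traverseLocal(["atlas"], edges, [
    { relType: "worked_on", direction: "in" }
]);
```

Had "jane" been stored under two uncollapsed ids, the first hop from one of them would miss edges recorded against the other, which is the incompleteness that canonical ids prevent.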

Writing Edges By Hand

Extraction creates edges for you, but sometimes you have a relationship your app knows (e.g. from its own relational data) that you want to assert into the graph. addRelationship writes one directed edge between two existing entities:

var rel = context.getKnowledge().addRelationship("clients", {
    fromEntity:  "entity_jane",
    toEntity:    "entity_atlas",
    relType:     "worked_on",
    sourceDocId: "doc_hr_roster",      // for lineage / "where does it say?"
    sourceDocName: "HR Roster 2026",
    folderIds:   ["folder_hr"],        // for access filtering on this edge
    confidence:  1.0
});
context.log("Created edge " + rel.id);

Edge writes are idempotent on (dataset, from, to, relType, sourceDoc), so re-asserting the same edge from the same source won't duplicate it.
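The idempotency rule boils down to "same five-part key, same edge". The sketch below shows the idea with an in-memory store standing in for the platform's storage; `edgeKey` and `makeEdgeStore` are hypothetical names, and overwriting on re-assert is this sketch's choice.

```javascript
// Sketch: idempotent edge writes keyed on
// (dataset, from, to, relType, sourceDoc). The in-memory store is a
// stand-in for illustration, not the platform's storage.
function edgeKey(datasetId, edge) {
    // Serialising an array avoids ambiguity if ids contain separators.
    return JSON.stringify([datasetId, edge.fromEntity, edge.toEntity,
                           edge.relType, edge.sourceDocId]);
}

function makeEdgeStore() {
    var byKey = Object.create(null);
    var count = 0;
    return {
        // Returns true when a new edge was created, false on a re-assert.
        upsert: function (datasetId, edge) {
            var key = edgeKey(datasetId, edge);
            var created = !(key in byKey);
            if (created) count++;
            byKey[key] = edge;
            return created;
        },
        size: function () { return count; }
    };
}

var store = makeEdgeStore();
var rosterEdge = { fromEntity: "entity_jane", toEntity: "entity_atlas",
                   relType: "worked_on", sourceDocId: "doc_hr_roster" };
var first  = store.upsert("clients", rosterEdge);
var second = store.upsert("clients", rosterEdge);
```

Note that the source document is part of the key: the same relationship asserted by a different document is a separate edge, which is what lets each edge carry its own access and provenance.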

Letting The Agent Query On Its Own

Everything above assumes you write the query. The real power move is handing the query surface to an LLM agent and letting it decide. Two native AITools do this:

Tool | What the LLM does with it
knowledge_vocabulary | "What values can I even filter on?" Discover concrete literals before querying.
knowledge_query | "Run this structured query": exact lookups, aggregates, traversals.

Add them to your agent's chain JSON:

{
  "tools": [
    { "type": "knowledgeVocabulary", "name": "discoverValues" },
    { "type": "knowledgeQuery",      "name": "queryKnowledge" }
  ]
}

(The canonical type names KNOWLEDGE_VOCABULARY and KNOWLEDGE_QUERY work too; knowledgeVocabulary / knowledgeQuery are the friendly aliases.) Unlike the recall tools, the knowledge tools take their datasetId as a tool argument, not as chain config — so one agent definition can query any dataset the prompt names.

The agent then runs a discover-then-query loop on its own:

User:   "How many of our consultants know Kubernetes, and who are they?"

Agent:  [calls discoverValues: datasetId="clients", entityType="person",
         attributeKey="skill"]
Tool:   skill: ["Java", "Kubernetes", "React", "Terraform", …]

Agent:  [sees "Kubernetes" is the real value; calls queryKnowledge with
         spec={ entityType:"person",
                where:[{attr:"skill", op:"contains", value:"Kubernetes"}],
                aggregate:{op:"count"} }]
Tool:   aggregate: count; entityType: person; 1 row.  { total: 7 }

Agent:  [calls queryKnowledge again, same where, no aggregate, to list them]
Tool:   7 entities. Jane Okafor; Erik Lind; …

Agent:  "7 consultants list Kubernetes: Jane Okafor, Erik Lind, …"

The agent grounded itself in real vocabulary, ran an exact count, then listed — three tool calls, zero hallucinated skill names. Both tools' results are always access-filtered to what the current user may read, so you can expose them to an end-user agent without leaking protected facts.

You can also pass an optional folderIds array to either tool to scope visibility further; the record-level ACL still applies on top.

Access Control

An entity is not in a folder. The same "Acme" appears in documents across many folders with different permissions, so "can this user see Acme?" has no single answer — it depends on which facts, from which sources, the user may read. The knowledge layer answers this at the per-fact level.

How It Works

Three things are true, and they're worth internalising:

  1. Facts inherit the access of their own source. If the only document asserting "Acme is in retail" is one you can't see, that fact does not count for you — even if you can see some other, unrelated Acme document. Otherwise "list retail clients" would leak a fact that lived only in a protected file.
  2. An entity with no readable fact disappears. Filtering drops the whole entity if nothing survives, which closes the inference leak ("Acme showed up in my retail list, therefore something says Acme is retail").
  3. The same query yields different answers for different users, and that is correct. Counts and lists are relative to permissions. Frame this for stakeholders: "we have 14 retail clients but you see 9" is access filtering, not a bug.
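Rules 1 and 2 can be pictured as a filter over attribute facts. The sketch below is illustrative only: it assumes the caller already knows which source document ids the user can read, and it ignores the folder pre-filter and edges entirely. `visibleView` and `filterForUser` are hypothetical names.

```javascript
// Sketch of rules 1 and 2: keep only facts whose source the user can
// read, and drop the entity when nothing survives.
// Assumption: readableDocIds is already known for the asking user.
function visibleView(entity, readableDocIds) {
    var facts = entity.attributes.filter(function (f) {
        return readableDocIds.indexOf(f.sourceDocId) >= 0;
    });
    if (facts.length === 0) return null; // entity disappears for this user
    return { canonicalName: entity.canonicalName, attributes: facts };
}

function filterForUser(entities, readableDocIds) {
    return entities
        .map(function (e) { return visibleView(e, readableDocIds); })
        .filter(function (e) { return e !== null; });
}

var acme = {
    canonicalName: "Acme Corp",
    attributes: [
        { key: "industry", value: "retail", sourceDocId: "docA" },
        { key: "hq",       value: "Boston", sourceDocId: "docB" }
    ]
};

var fullView    = filterForUser([acme], ["docA", "docB"]);
var partialView = filterForUser([acme], ["docB"]);
var noView      = filterForUser([acme], ["docC"]);
```

The three calls show rule 3 in miniature: the same entity yields two facts, one fact, or no entity at all, depending only on who is asking.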

Enforcement is defense-in-depth. There's a fast folder pre-filter (a performance optimization) and an authoritative per-source record ACL that is always applied. Even if a denormalized folder list drifts, access cannot leak, because the record ACL is the real gate.

Registering The Resolver (One-Time, App Startup)

The fast pre-filter needs to know "which folders may this user see?" — app-specific logic. Register it once, process-wide, in Java startup:

KnowledgeAccess.setPermittedFolderResolver((scope, user) -> {
    // Return the set of folder/document ids this user may see,
    // or null to mean "no pre-filter; rely on the record ACL".
    // The set MUST be a superset of the user's true folders. Do NOT
    // return an empty set to mean "see nothing": return null instead
    // and let the authoritative record ACL deny access.
    return myAcl.foldersVisibleTo(user);
});

If no resolver is registered, you don't lose access control — you just lose the pre-filter optimization. The authoritative per-source check still runs on every query. "No resolver" does not mean "no access control."

Scoping Per Call (folderIds)

From JS or an agent tool you can narrow visibility further by passing folderIds — a performance/scoping hint, with the record ACL still authoritative:

// Only consider facts sourced from these folders (plus the ACL):
context.getKnowledge().query("clients",
    { entityType: "company", where: [{ attr: "industry", op: "eq", value: "retail" }] },
    ["folder_active_engagements"]);

Knowledge vs RAG: When To Use Which

The two layers coexist by design. Route the question to the right tool:

The question is… | Use | Why
"List all X", "how many Y", "count per Z" | Knowledge query + aggregate | Exhaustive & exact; RAG can't count or guarantee completeness.
"Which A relate to B" (1–2 hops) | Knowledge traverse | Relational; RAG has no notion of edges.
"Find a project like this", "what did we say about pricing" | RAG (existing vector search) | Fuzzy / semantic / discovery; knowledge has no similarity notion.
"Give me an example of…" | RAG | Open-ended retrieval.
"Narrow to retail clients, then find the most relevant case study" | Hybrid | Knowledge to get the exact candidate set, RAG to rank within it.

A good agent has both a knowledge_query tool and the existing RAG/search tool, and a system prompt that tells it which class of question each serves. Exhaustive and exact → knowledge. Fuzzy and illustrative → RAG.

Worked Examples, Trivial To Advanced

Level 0 — Hello World: Count One Thing

// "How many companies have we extracted?"
context.getKnowledge().query("clients", {
    entityType: "company",
    aggregate: { op: "count" }
});

Level 1 — An Exact, Filtered List

// "List our fintech clients, headquartered anywhere, with a ticker."
context.getKnowledge().query("clients", {
    entityType: "company",
    where: [
        { attr: "industry", op: "eq",     value: "fintech" },
        { attr: "ticker",   op: "exists" }
    ],
    limit: 200
}).map(function (c) { return c.canonicalName + " (" + c.attributes.ticker + ")"; });

Level 2 — A Breakdown For A Dashboard Widget

// Feed a chart: clients per industry.
function clientsByIndustry() {
    return context.getKnowledge().query("clients", {
        entityType: "company",
        aggregate: { op: "group_by", attr: "industry" }
    }); // → [{ _id:"retail", count:14 }, { _id:"fintech", count:9 }, …]
}

Level 3 — Vocabulary-Grounded Query (No Guessing)

// Build a filter dropdown from the values that actually exist,
// then query the chosen one — always the real literal.
function companiesForIndustryPicker(chosen) {
    var k = context.getKnowledge();
    var industries = k.vocabulary("clients", "company").industry; // real values
    if (industries.indexOf(chosen) < 0) return [];                // not in data
    return k.query("clients", {
        entityType: "company",
        where: [{ attr: "industry", op: "eq", value: chosen }]
    });
}

Level 4 — A Relational, Two-Hop Question

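// "Who has worked on a project for a retail client?"
// person --worked_on--> project --client_of--> retail company
// NOTE: as written, this joins every person to their projects and those
// projects' clients; it does not yet restrict the far end to retail.
// Apply that filter on the joined "clients" set (or narrow the start
// set), as discussed under Relationships And Traversal.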
function peopleOnRetailEngagements() {
    return context.getKnowledge().query("clients", {
        entityType: "person",
        traverse: [
            { relType: "worked_on", direction: "out", as: "projects" },
            { relType: "client_of", direction: "out", as: "clients"  }
        ],
        limit: 100
    });
}

Level 5 — A Self-Driving Knowledge Agent

Wire both tools into an agent and let it handle arbitrary exact/relational questions over the corpus. The agent definition:

{
  "name": "corpusAnalyst",
  "system": "You answer questions about the firm's client corpus. For EXACT or RELATIONAL questions (list all, count, who worked with whom) use the knowledge tools: first call discoverValues to learn the real attribute values, then queryKnowledge. For OPEN-ENDED or EXAMPLE questions use the search tool. Never invent attribute values; discover them. Always pass datasetId 'clients'.",
  "tools": [
    { "type": "knowledgeVocabulary", "name": "discoverValues"  },
    { "type": "knowledgeQuery",      "name": "queryKnowledge"  },
    { "type": "search",              "name": "searchDocuments" }
  ]
}

Invoke it from an Action and the agent does discover → query → answer on its own, access-filtered to the calling user:

function askAnalyst(arguments) {
    return context.getAIFunctions().invokeAgent("corpusAnalyst", {
        arguments: { userMessage: arguments.question }
    });
}

Level 6 — A Governed Ingestion Pipeline

Preview-driven tuning, controlled rollout, and monitoring, combined into an admin Action:

function ingestClients(arguments) {
    var k = context.getKnowledge();

    // 1. Tune against a sample before committing the corpus.
    if (arguments.sampleDocId) {
        var p = k.preview("clients", arguments.sampleDocId);
        var merges = p.entities.filter(function (e) { return e.action === "MERGE"; });
        var reviews = p.entities.filter(function (e) { return e.action === "REVIEW"; });
        context.log("preview: " + p.entities.length + " entities, " +
                    merges.length + " merges, " + reviews.length + " to review");
        if (arguments.previewOnly) return p; // human checks before go-live
    }

    // 2. Backfill one bounded pass; re-run this Action until enqueued settles.
    var enq = k.backfill("clients");

    // 3. Report health.
    var s = k.jobSummary("clients");
    var failures = k.listJobs("clients", "FAILED", 10);
    return { enqueued: enq.enqueued, summary: s, recentFailures: failures };
}

Common Patterns

Declare-once, guard the create. createDataset rejects a duplicate. Wrap setup so it's safe to run on every deploy: probe with listEntityTypes (or your own marker) and only create when absent.
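
A guarded setup might look like the sketch below. The helper names are hypothetical and k stands for context.getKnowledge(). How an absent dataset shows up is not specified here, so the probe is passed in; probeByEntityTypes is one possible probe that ASSUMES listEntityTypes throws for an unknown dataset. Swap in your own marker if the platform behaves differently.

```javascript
// Create the dataset only when the probe says it is absent.
// Returns true when a dataset was created on this run.
function ensureDataset(k, datasetId, config, isPresent) {
    if (isPresent(k, datasetId)) return false;
    k.createDataset(datasetId, config);
    return true;
}

// One possible probe. ASSUMPTION: listEntityTypes throws for an unknown dataset.
function probeByEntityTypes(k, datasetId) {
    try {
        k.listEntityTypes(datasetId);
        return true;
    } catch (e) {
        return false;
    }
}

// Add only the entity types the dataset does not declare yet.
// defs is an array of addEntityType payloads; returns the type names added.
function ensureEntityTypes(k, datasetId, defs) {
    var existing = {};
    k.listEntityTypes(datasetId).forEach(function (t) { existing[t.type] = true; });
    var added = [];
    defs.forEach(function (def) {
        if (existing[def.type]) return;   // already declared: skip
        k.addEntityType(datasetId, def);
        existing[def.type] = true;        // also guards duplicates inside defs
        added.push(def.type);
    });
    return added;
}
```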

Preview before backfill, always. One preview on a representative document catches a bad extractionHint or an over-eager dedupKeys before you've extracted ten thousand documents wrong. It's free — nothing is written.

Vocabulary before query. Whether it's your code or an agent, read vocabulary to get the real literals, then predicate on them. This is the difference between a complete answer and a plausible-looking partial one.

Reprocess, don't re-backfill, after a config change. Added a type? reprocess(datasetId, entityType) touches only the documents that are stale for that type. backfill considers the whole corpus; use it for the initial load, not for incremental config changes.

Let the agent count; let RAG illustrate. Give a customer agent both surfaces and a prompt that routes by question class. "How many / list all / who worked with" → knowledge. "Show me an example / what did we say about" → RAG.

Source every hand-written edge. When you addRelationship, pass sourceDocId/folderIds. It costs nothing and buys you traceability ("where does it say?") and correct access filtering on that edge.
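
One way to make sourcing the default is a thin wrapper that refuses to write an unsourced edge. This is a sketch under assumptions: addSourcedRelationship is a hypothetical helper, k stands for context.getKnowledge(), and the payload field names are taken from the addRelationship entry in the cheat sheet. The values of fromEntity and toEntity are passed through untouched, because their expected format is not restated here.

```javascript
// Write an edge only when it can be traced back to a document.
function addSourcedRelationship(k, datasetId, edge, source) {
    if (!source || !source.sourceDocId) {
        throw new Error("addSourcedRelationship: sourceDocId is required");
    }
    var payload = {
        fromEntity: edge.fromEntity,
        toEntity: edge.toEntity,
        relType: edge.relType,
        sourceDocId: source.sourceDocId
    };
    // Optional provenance fields are copied only when the caller supplied them.
    if (source.sourceDocName) payload.sourceDocName = source.sourceDocName;
    if (source.folderIds) payload.folderIds = source.folderIds;
    return k.addRelationship(datasetId, payload);
}
```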

What To Watch Out For

Extraction is asynchronous — don't query immediately after backfill. backfill returns an enqueued count, not a done count. Poll jobSummary until QUEUED and RUNNING drain before treating the store as complete. A query against a half-extracted corpus is correct but partial.
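
A polling loop for that check could look like the sketch below. waitForDrain and isDrained are hypothetical helpers, k stands for context.getKnowledge(), and the sleep function is injected because the right way to pause depends on where the code runs. It assumes the jobSummary values are numeric counts.

```javascript
// True when nothing is waiting or in flight.
// ASSUMPTION: jobSummary returns numeric counts keyed by status.
function isDrained(summary) {
    return (summary.QUEUED || 0) === 0 && (summary.RUNNING || 0) === 0;
}

// Poll jobSummary up to maxPolls times, pausing between polls.
// Returns whether the queue drained, how many polls ran, and the last summary.
function waitForDrain(k, datasetId, maxPolls, sleep) {
    var last = null;
    for (var i = 0; i < maxPolls; i++) {
        last = k.jobSummary(datasetId);
        if (isDrained(last)) return { drained: true, polls: i + 1, summary: last };
        if (i < maxPolls - 1) sleep();   // no pause after the final poll
    }
    return { drained: false, polls: maxPolls, summary: last };
}
```

A drained queue can still contain FAILED jobs, so check that count in the returned summary before calling the store complete.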

Counts are access-relative, and that's by design. Two users running the identical query can get different numbers. Surface this to stakeholders up front so "the count looks low" is understood as filtering, not a defect.

A false merge is worse than a duplicate. If two distinct companies collapse into one, your "list all clients" is silently wrong and you may never notice. Stay conservative on autoMergeThreshold, lean on reviewThreshold/escalateAmbiguous for the hard cases, and remember that a visible duplicate is fixable while a hidden false merge is not.

Validate against config, and trust the errors. An unknown attr, a bad operator, a missing entityType, or a third traversal hop all throw a clear IllegalArgumentException — they do not silently return []. If a query comes back empty, check that the value exists (via vocabulary) before assuming the query is wrong.
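
When specs are assembled from user input, a local pre-check can surface the same classes of mistake before the call is made. The sketch below is not the platform's validator: checkSpec is a hypothetical helper that mirrors only the structural rules listed in the Query Spec DSL cheat sheet (allowed top-level fields, the five operators, at most two hops). It cannot detect an unknown attr, because that depends on the dataset's configuration.

```javascript
var ALLOWED_FIELDS = ["entityType", "where", "traverse", "aggregate", "limit", "offset"];
var ALLOWED_OPS = ["eq", "in", "contains", "range", "exists"];

// Return a list of structural problems; an empty list means "looks well-formed".
function checkSpec(spec) {
    var problems = [];
    Object.keys(spec).forEach(function (key) {
        if (ALLOWED_FIELDS.indexOf(key) < 0) problems.push("unknown field: " + key);
    });
    if (!spec.entityType) problems.push("entityType is required");
    (spec.where || []).forEach(function (p) {
        if (ALLOWED_OPS.indexOf(p.op) < 0) problems.push("bad operator: " + p.op);
    });
    if ((spec.traverse || []).length > 2) problems.push("at most 2 traversal hops");
    return problems;
}
```

The platform's own exception remains the authority; this only shortens the feedback loop.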

backfill is bounded per call. For a large corpus it processes a slice and returns; call it again (or loop) until the enqueued count settles. It's idempotent, so repeated calls are safe.
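
The loop can be sketched as follows. backfillUntilSettled is a hypothetical helper and k stands for context.getKnowledge(). The sketch ASSUMES that "settled" means a pass enqueues zero documents; adjust the stop condition if the platform reports it differently. maxPasses bounds the work done in a single Action run.

```javascript
// Call backfill repeatedly until a pass enqueues nothing or maxPasses is hit.
// ASSUMPTION: a pass that returns enqueued === 0 means the corpus is covered.
function backfillUntilSettled(k, datasetId, maxPasses) {
    var total = 0;
    for (var pass = 1; pass <= maxPasses; pass++) {
        var result = k.backfill(datasetId);
        total += result.enqueued;
        if (result.enqueued === 0) {
            return { passes: pass, enqueued: total, settled: true };
        }
    }
    return { passes: maxPasses, enqueued: total, settled: false };
}
```

Settled here only means everything has been enqueued; extraction itself still finishes in the background.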

No resolver ≠ no access control. Skipping setPermittedFolderResolver only drops the pre-filter optimization; the authoritative per-source record ACL still runs on every query. But a resolver that wrongly returns an empty set means "see nothing" — return null, never empty, for "no derivation available."

Reference Cheat Sheet

Facade — context.getKnowledge()

Config

createDataset(datasetId, { name, extractionModel?, resolutionModel? })  // → dataset config
addEntityType(datasetId, { type, attributes?, extractionHint?, dedupKeys?,
                           autoMergeThreshold?, reviewThreshold?, escalateAmbiguous? })
addRelationshipType(datasetId, { relType, fromType, toType, extractionHint? })
addTaxonomyTerm(datasetId, { term, categoryPath?, aliases?, entityType? })
listEntityTypes(datasetId)            // → [{ type, version, attributes, dedupKeys, … }]
listRelationshipTypes(datasetId)      // → [{ relType, fromType, toType, version, … }]

Ingestion & jobs

preview(datasetId, documentId)        // dry run → { entities[], relationships[] }, nothing written
backfill(datasetId)                   // → { datasetId, enqueued }  (idempotent, bounded)
reprocess(datasetId, entityType)      // → { datasetId, entityType, enqueued }  (only stale docs)
jobStatus(datasetId, documentId)      // → job map | null
listJobs(datasetId, status, limit)    // status ∈ QUEUED|RUNNING|DONE|FAILED ; limit ≤ 200
jobSummary(datasetId)                  // → { QUEUED, RUNNING, DONE, FAILED }

Query

query(datasetId, spec)                 // entities, or aggregate rows
query(datasetId, spec, folderIds)      // …scoped to folderIds (ACL still applies)
vocabulary(datasetId, entityType)      // → { attrKey: [distinct values…], …taxonomy facets }
vocabulary(datasetId, entityType, folderIds)
addRelationship(datasetId, { fromEntity, toEntity, relType,
                             sourceDocId?, sourceDocName?, folderIds?, confidence? })

Query Spec DSL

{
  entityType: "company",                       // REQUIRED
  where: [                                     // AND-ed predicates
    { attr, op: "eq",       value },
    { attr, op: "in",       value: [ … ] },
    { attr, op: "contains", value },
    { attr, op: "range",    min, max },
    { attr, op: "exists" }
  ],
  traverse: [                                  // up to 2 hops
    { relType, direction: "out"|"in", as }
  ],
  aggregate: { op: "count" },                  // or
  aggregate: { op: "group_by", attr },         // or
  aggregate: { op: "distinct", attr },
  limit: 50,                                   // default 50
  offset: 0
}
// Any other top-level field → rejected.

Entity Result Shape (No Aggregate)

{
  id, entityType, canonicalName, normalizedName,
  attributes: { key: value, … },   // flattened, first-wins
  sourceDocs: [ … ], folderIds: [ … ]
}

Agent Tools (Chain JSON)

{ "type": "knowledgeVocabulary", "name": "discoverValues" }
{ "type": "knowledgeQuery",      "name": "queryKnowledge" }

Both take datasetId + (spec | entityType) as tool arguments; both accept an optional folderIds; both are always access-filtered to the current user.

Access Control (Java Startup)

KnowledgeAccess.setPermittedFolderResolver(
    (scope, user) -> aclFoldersFor(user) /* superset, or null — never empty */ );

Next Steps

  • AI Agents — build the agents that query this knowledge, and the RAG layer it complements
  • AI Memory — give those agents per-user memory to pair with corpus-wide knowledge
  • Security — the tenant, app, and per-fact access model behind every query