Sources are how an agent learns about your business. This page covers every kind of source, the ingestion pipeline, and what to expect after you click "Add".
| Type | Use it for | What we ingest |
|---|---|---|
| url | One specific page | Crawl + extract main content + chunk + embed |
| sitemap | A whole site at once | Read sitemap, fan out to one CrawlPageJob per URL |
| feed | RSS / Atom blogs | Same as sitemap but reads `<item>` entries |
| text | FAQs, snippets, anything you can paste | Skip the crawl, chunk + embed directly |
| notion | Notion pages or databases | OAuth into Notion, fetch via API, treat each page as a document |
| google_doc | Google Docs (Workspace) | OAuth, fetch via Drive API, ingest as a document |
| google_sheet | Google Sheets tabs | OAuth, pull every non-empty row, one Document body with Row N: … markers |
| sql | Remote MySQL / PostgreSQL | Direct PDO read-only SELECT; one Document body with Row N: … markers. Credentials AES-256-GCM at rest. |
| file | PDF / DOCX / XLSX uploads | Parsed via Cloudflare toMarkdown (free tier), chunked + embedded |
| woocommerce_products | WP / WooCommerce stores | Synced by the companion WordPress plugin |
| auto | Pages visitors land on | Auto-queued by AutoIndexPageVisit from /v1/widget/init |
Connect a read-only MySQL or PostgreSQL database directly as a knowledge
source. Every non-empty row your SELECT returns becomes part of the agent's
knowledge base, and the LLM can cite individual rows as Row N references.
Setup steps:

1. Open /app/agents/{id}/sources, scroll to Add SQL database.
2. Enter the connection details. The default port is 3306 (MySQL) or 5432 (PostgreSQL).
3. Enter a SELECT query. Every row it returns becomes part of the agent knowledge base.
4. Optionally map a title column and a body column. Without a body column, rows are rendered as colname: value pairs.
5. SyncSqlSourceJob runs on the crawl queue, indexes the rows, and flips the source to indexed.

Hard rules enforced on every query:

- The host must pass UrlSafetyGuard::assertSafe(), the same allowlist the crawler uses. 127.0.0.1, RFC1918 ranges, link-local 169.254.0.0/16 (which contains the AWS metadata IP), and any name that resolves to a private IP are all refused. The source flips to failed with "Unsafe SQL host: Refusing to crawl an internal / loopback / link-local host."
- The query must start with SELECT (case-insensitive). Multi-statement queries (anything containing ;) are rejected. The keywords INSERT, UPDATE, DELETE, DROP, ALTER, TRUNCATE, GRANT, REVOKE, CREATE, REPLACE, CALL, EXEC, EXECUTE, MERGE, LOAD are blocked.
- The query runs inside START TRANSACTION READ ONLY (MySQL) / BEGIN TRANSACTION READ ONLY (PostgreSQL) so even a permissive query string can't mutate state.
- Write your SELECT with a LIMIT if you have a big table you don't want fully indexed.

Credentials at rest:
The host, port, database name, username, and password are stored in
sources.credentials_encrypted (text column) and encrypted
via Laravel's encrypted:array cast — AES-256-GCM with the
install's APP_KEY. The non-sensitive bits (driver, query,
title/body column mappings, optional label) live in the regular
config JSON column.
Reading the raw column produces ciphertext only; the plaintext is
visible only inside worker memory during a sync run and never logged.
Rotating APP_KEY will require buyers to re-enter the
password (standard Laravel behaviour — see
Security).
What gets indexed:
Each row becomes a labeled line in a single Document body. When you
set title_column = "title" and body_column = "body",
a row with { title: "Welcome", body: "Hello world" }
renders as:
Welcome: Hello world
Without a body column it falls back to a key/value join of every non-null column:
Row 1 — id: 42, name: ACME Corp, plan: Pro, last_seen: 2026-05-13
The Document body is then chunked and embedded through the same
IndexDocumentJob pipeline every other source uses, so
SQL rows surface in retrieval just like crawled pages or pasted text.
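
The row rendering described above can be sketched in a few lines. This is an illustrative Python sketch, not the product's PHP code; the function name `render_row` and its parameters are hypothetical, and the separator mirrors the example line shown earlier.

```python
def render_row(row, index, title_column=None, body_column=None):
    """Render one SQL row as a labeled line for the Document body.

    Hypothetical helper: the name and signature are assumptions made
    for illustration only.
    """
    if body_column and row.get(body_column) is not None:
        body = str(row[body_column])
        title = row.get(title_column) if title_column else None
        # Title/body mapping: "Welcome: Hello world"
        return f"{title}: {body}" if title is not None else body
    # Fallback: key/value join of every non-null column.
    pairs = ", ".join(f"{key}: {value}" for key, value in row.items() if value is not None)
    return f"Row {index} \u2014 {pairs}"


print(render_row({"title": "Welcome", "body": "Hello world"}, 1, "title", "body"))
print(render_row({"id": 42, "name": "ACME Corp", "plan": None}, 1))
```

Null columns are skipped in the fallback form, so sparse rows stay compact.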
Re-sync today: manual via the source's Refresh action. Periodic auto-sync isn't scheduled yet — same as Notion / Google Doc / Google Sheet. File a card if you want a cron.
Drivers not shipped yet: MSSQL and Oracle. Both
require PHP extensions (pdo_sqlsrv / oci8) that
aren't bundled by default and aren't universal across CodeCanyon hosts.
Open a feature request if you need them.
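
As a rough illustration of the query rules listed under "Hard rules enforced on every query" (SELECT-only, no `;`, blocked keywords), here is a minimal Python sketch. It is not the shipped validator: the function names are hypothetical and the word-level tokenization is an assumption, so the real check may be stricter or looser.

```python
import re

# Keyword list copied from the rules in this section.
BLOCKED_KEYWORDS = {
    "INSERT", "UPDATE", "DELETE", "DROP", "ALTER", "TRUNCATE", "GRANT",
    "REVOKE", "CREATE", "REPLACE", "CALL", "EXEC", "EXECUTE", "MERGE", "LOAD",
}


def check_select_query(query: str) -> None:
    """Raise ValueError when the query breaks one of the three rules."""
    text = query.strip()
    if not re.match(r"select\b", text, re.IGNORECASE):
        raise ValueError("query must start with SELECT")
    if ";" in text:
        raise ValueError("multi-statement queries are rejected")
    # Compare whole words only, so a column such as updated_at is not
    # mistaken for the UPDATE keyword.
    words = set(re.findall(r"[A-Za-z_]+", text.upper()))
    blocked = sorted(words & BLOCKED_KEYWORDS)
    if blocked:
        raise ValueError(f"blocked keyword: {blocked[0]}")


def is_allowed(query: str) -> bool:
    """Convenience wrapper that returns a boolean instead of raising."""
    try:
        check_select_query(query)
    except ValueError:
        return False
    return True
```

A word-level check like this also rejects a column or string literal that happens to equal a blocked keyword, which is the conservative side to err on. The read-only transaction remains the backstop for anything a string check misses.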
Open /app/agents/{id}/sources. The Add source
modal handles all types in one form. Behind the scenes:
1. The URL is validated. Private and loopback addresses (10.x, 192.168.x, 127.x, ::1) are blocked to prevent SSRF.
2. A Source row is created with status = pending.
3. A job is dispatched by type: CrawlSourceJob for url/sitemap/feed; IngestNotionPageJob/IngestGoogleDocJob for connected sources; IndexTextSourceJob for pasted text.
4. The job runs on the crawl queue, fetches content, creates Document rows, then dispatches IndexDocumentJob on the index queue.
5. The status moves pending → crawling → done (or failed with an error message you can read in the UI).

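
The private and loopback address check that guards against SSRF can be approximated with Python's standard `ipaddress` module. This is a sketch under stated assumptions: the function name is hypothetical, and it only classifies an already-resolved IP address. A real guard would also resolve hostnames and check every address they map to.

```python
import ipaddress


def is_internal_address(ip_text: str) -> bool:
    """Return True for private, loopback, or link-local addresses."""
    ip = ipaddress.ip_address(ip_text)
    return ip.is_private or ip.is_loopback or ip.is_link_local


for candidate in ["10.1.2.3", "192.168.0.5", "127.0.0.1", "::1", "169.254.169.254", "8.8.8.8"]:
    print(candidate, "blocked" if is_internal_address(candidate) else "allowed")
```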
A page can land in the Knowledge view with 0 chunks
and an amber "Indexing didn't finish" badge. The expanded row now
shows the actual error from sources.error when present,
plus the crawler used and last-fetched timestamp. Most failures map
to one of these:
- The crawl hit a blocker page, as flagged by the detectBlocker() heuristics.
- The page looks like a 404; looksLike404() rejects these.
- IndexDocumentJob hasn't run yet. Wait a minute and refresh; if it persists, check Queue Health.

Click Reindex on a failed row to retry the same pipeline (URL re-crawl + re-index, or file re-parse for uploads). Persistent failures usually mean the URL itself is unscrapable; try a different page on the same site, or upload the content as a file.

The Cloudflare Vectorize index is provisioned at the exact
dimension of the embedding model that was active when it was
first created. Changing CLOUDFLARE_EMBED_MODEL from
a 768-dim model (bge-base-en-v1.5) to a 1024-dim model (bge-m3,
bge-large-en-v1.5) or back will cause every IndexDocumentJob
to crash with:
Cloudflare 40012: invalid vector for id="...", expected 768 dimensions, and got 1024 dimensions
Pitchbar now detects this BEFORE sending the upsert, surfaces an
actionable error, and ships a recovery command. Known model→dim
map (auto-applied when VECTOR_DIM env is unset):
| Model | Dim |
|---|---|
| @cf/baai/bge-small-en-v1.5 | 384 |
| @cf/baai/bge-base-en-v1.5 (default) | 768 |
| @cf/baai/bge-large-en-v1.5 | 1024 |
| @cf/baai/bge-m3 | 1024 |
| text-embedding-3-small | 1536 |
| text-embedding-3-large | 3072 |
| text-embedding-ada-002 | 1536 |
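
A pre-upsert dimension check along these lines could look like the following Python sketch. The function names are hypothetical and the error text is invented for illustration; only the model-to-dimension map and the rule that an explicit VECTOR_DIM wins over the map come from this page.

```python
MODEL_DIMS = {
    "@cf/baai/bge-small-en-v1.5": 384,
    "@cf/baai/bge-base-en-v1.5": 768,
    "@cf/baai/bge-large-en-v1.5": 1024,
    "@cf/baai/bge-m3": 1024,
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
    "text-embedding-ada-002": 1536,
}


def resolve_dim(model, vector_dim_env=None):
    """Use VECTOR_DIM when set, otherwise fall back to the known map."""
    if vector_dim_env:
        return int(vector_dim_env)
    return MODEL_DIMS[model]


def assert_index_matches(index_dim, model, vector_dim_env=None):
    """Fail before the upsert when the index and the model disagree."""
    expected = resolve_dim(model, vector_dim_env)
    if expected != index_dim:
        raise ValueError(
            f"index has {index_dim} dimensions but {model} produces {expected}; "
            "rebuild the index before indexing"
        )
```

Checking locally like this avoids a round trip that is guaranteed to fail with the 40012 error.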
Recovery: drop the existing index, recreate at the new dim, and
re-dispatch IndexDocumentJob for every document:
php artisan vector:rebuild-index # interactive — asks before proceeding
php artisan vector:rebuild-index --force # for automation / CI
php artisan vector:rebuild-index --dim=1024 # override the resolved dim
The command resets every Source to pending, deletes
every Chunk row, drops the Vectorize index, recreates it at the
target dim, and queues a re-index job per document onto the
index queue. File-backed Documents re-index from the
persisted text on disk; URL-only Documents need a manual
Reindex click (which triggers CrawlPageJob
to re-fetch).
On the sources page, the Discover button takes a domain and probes it for crawlable pages without you having to list them. We:
- Read robots.txt for sitemap declarations.
- Probe common paths: /about, /pricing, /features, /products, /faq, /docs, /help, /support, /contact.

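
The two discovery inputs above can be sketched as follows. This is a Python illustration with hypothetical function names, not the product's discovery code; it only shows how `Sitemap:` declarations are read from a robots.txt body and how the common paths expand into candidate URLs.

```python
COMMON_PATHS = [
    "/about", "/pricing", "/features", "/products", "/faq",
    "/docs", "/help", "/support", "/contact",
]


def sitemaps_from_robots(robots_txt: str) -> list:
    """Collect the URLs declared on 'Sitemap:' lines (case-insensitive)."""
    urls = []
    for line in robots_txt.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls


def probe_urls(domain: str) -> list:
    """Expand the common paths into absolute URLs for one domain."""
    base = domain.rstrip("/")
    return [base + path for path in COMMON_PATHS]
```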
Adding a Source of type sitemap dispatches one
CrawlPageJob per URL in the sitemap, staggered by a
small per-page delay so Cloudflare Browser Rendering doesn't
rate-limit on burst. The discoverer (SitemapDiscoverer)
handles three input shapes:
- A bare domain (https://example.com): probes /sitemap.xml + /sitemap_index.xml.
- A direct sitemap URL (https://example.com/sitemap.xml or https://example.com/products/sitemap.xml): fetched verbatim. Before the fix, the discoverer appended a second /sitemap.xml here and the request returned a 404.
- A sitemap index (the `<sitemapindex>` XML many CMSes, including WordPress, Shopify, and Webflow, emit by default): recurses one level into each child sitemap and aggregates page URLs.

Output is deduped (so a URL listed in two child sitemaps gets
indexed once) and capped at
services.crawl.max_pages_per_source (default 500,
override via CRAWL_MAX_PAGES_PER_SOURCE). The cap
used to be 25, so a buyer adding a 100-URL sitemap silently lost 75
pages. The new default is generous enough for most marketing /
docs sites. Very large catalogues should split the sitemap by
section anyway.
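
The input handling, dedupe, and cap can be sketched in Python as below. The function names are hypothetical, and treating any path ending in `.xml` as a direct sitemap URL is an assumption made for the sketch; SitemapDiscoverer may decide differently.

```python
from urllib.parse import urlparse


def sitemap_candidates(url: str) -> list:
    """Return the sitemap URLs to try for a given input."""
    if urlparse(url).path.endswith(".xml"):
        # A direct sitemap URL is fetched verbatim, with nothing appended.
        return [url]
    base = url.rstrip("/")
    return [base + "/sitemap.xml", base + "/sitemap_index.xml"]


def dedupe_and_cap(urls, max_pages=500):
    """Drop duplicates while keeping first-seen order, then apply the cap."""
    unique = list(dict.fromkeys(urls))
    return unique[:max_pages]
```

Deduping before capping matters: a URL repeated across child sitemaps should not consume more than one slot of the page budget.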
The crawler is provider-driven. In order of preference:
1. Cloudflare Browser Rendering, used when CLOUDFLARE_ACCOUNT_ID + CLOUDFLARE_API_TOKEN are set.
2. Browserless, used when BROWSERLESS_TOKEN is set. Same headless-Chrome behavior on a different vendor.

Once HTML is in hand, ReadabilityExtractor strips nav,
footer, ads, etc., leaving the article body. Pages under 200 chars or
detected as 404s are dropped.
Direct file uploads (Sources → Upload files) are parsed first, then handed to the same chunk + embed pipeline crawled pages use. The parser is picked by file extension:
| Extension | Parser | Network call? |
|---|---|---|
| .pdf, .docx, .doc, .xlsx, .xls, .odt, .ods | Cloudflare Workers AI toMarkdown when CF creds are configured; Smalot\PdfParser / PhpOffice\PhpWord otherwise | Yes, one multipart POST per file to /ai/tomarkdown (free of cost, 0 Neurons) |
| .csv | League\Csv, which emits one segment per row formatted as col: value \| col: value | No |
| .md, .markdown, .txt | Plain text, split on H1/H2 headings | No |
The Cloudflare path is preferred for binary office formats because
Smalot and PhpWord are unreliable on real-world documents: Word-
exported PDFs that put body text in one big content stream, scanned
PDFs with a thin text layer, and DOCX files with nested tables or
text frames all tend to extract poorly. Workers AI's
toMarkdown returns structured markdown (headings, lists,
tables preserved) which feeds the chunker much better.
Pricing: toMarkdown is free for every format above.
Only image-to-markdown conversion consumes Workers AI Neurons (we
do not send images). When Cloudflare credentials are absent (BYOK
OpenAI customers, fresh installs), or when the Cloudflare call
fails, the in-process PHP parsers take over so PDF / DOCX / CSV /
TXT / MD uploads never silently break.
Spreadsheet uploads (.xlsx, .xls, .ods, .odt) require
Cloudflare Workers AI. There is no local fallback. When
those formats are uploaded to a workspace without
CLOUDFLARE_ACCOUNT_ID + CLOUDFLARE_API_TOKEN
configured, the source row is created with status=failed
and the error stamp includes an actionable hint:
"Spreadsheet / OpenDocument formats need Cloudflare Workers AI."
Admins who need Excel ingestion on a BYOK-OpenAI install should
export to CSV (the local League\Csv parser handles
that format with no external dependency).
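
Putting the extension table and the spreadsheet rule together, parser selection might look like this Python sketch. The function name and the returned labels are hypothetical; only the extension groups, the Cloudflare preference, and the quoted error hint come from this page.

```python
import os

CLOUDFLARE_FORMATS = {".pdf", ".docx", ".doc", ".xlsx", ".xls", ".odt", ".ods"}
NO_LOCAL_FALLBACK = {".xlsx", ".xls", ".ods", ".odt"}
PLAIN_TEXT_FORMATS = {".md", ".markdown", ".txt"}


def pick_parser(filename: str, cloudflare_configured: bool) -> str:
    """Return a label for the parser that would handle this upload."""
    ext = os.path.splitext(filename)[1].lower()
    if ext == ".csv":
        return "csv"
    if ext in PLAIN_TEXT_FORMATS:
        return "plain_text"
    if ext in CLOUDFLARE_FORMATS:
        if cloudflare_configured:
            return "cloudflare_tomarkdown"
        if ext in NO_LOCAL_FALLBACK:
            # Mirrors the hint stamped on the failed source row.
            raise ValueError("Spreadsheet / OpenDocument formats need Cloudflare Workers AI.")
        return "local_php_parser"
    raise ValueError(f"unsupported extension: {ext}")
```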
Whichever parser ran, the resulting text is persisted under
storage/app/private/uploads/{source_id}/segment-N.txt.
That's the file the Reindex button reads — you don't need to
re-upload the original to re-index.
The extractor's text goes into Chunker, a recursive
splitter that prefers semantic boundaries.
Each chunk is embedded in a batch (default 100 chunks per call) and
upserted into the vector store with metadata: agent_id,
document_id, chunk_id, url,
workspace_id, source_id, lang.
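
The batching step can be sketched as a small generator. This is an illustrative Python sketch with a hypothetical function name; the batch size of 100 is the default mentioned above, and the metadata keys are listed only to show what travels with each vector.

```python
# Metadata keys stored with every vector, as listed above.
METADATA_KEYS = (
    "agent_id", "document_id", "chunk_id", "url",
    "workspace_id", "source_id", "lang",
)


def batched(chunks, size=100):
    """Yield consecutive slices of at most `size` chunks."""
    for start in range(0, len(chunks), size):
        yield chunks[start:start + size]


# 250 chunks would be embedded in three calls: 100, 100, and 50.
print([len(batch) for batch in batched(list(range(250)))])
```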
Each CrawlPageJob attempts up to 3 times with
backoff [30s, 90s, 180s]. The retry path is split
by failure class:
- Deterministic failures call fail() so the Source row gets the real reason immediately instead of being stranded behind two more retries that will deterministically fail.
- Timeouts: the job sets failOnTimeout=true so a worker SIGTERM still flips the source to failed with a customer-readable error.

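
The split between retryable and deterministic failures can be sketched in Python as follows. This is not the queue worker's implementation: `PermanentFailure` and `run_with_retries` are hypothetical names, and the sketch simply sleeps between attempts using the backoff values quoted above.

```python
import time

BACKOFF_SECONDS = [30, 90, 180]
MAX_ATTEMPTS = 3


class PermanentFailure(Exception):
    """A failure that would repeat identically on every retry."""


def run_with_retries(task, sleep=time.sleep):
    """Run task, retrying transient errors and failing fast on permanent ones."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            return task()
        except PermanentFailure:
            # No point retrying: surface the real reason immediately.
            raise
        except Exception:
            if attempt == MAX_ATTEMPTS:
                raise
            sleep(BACKOFF_SECONDS[attempt - 1])
```

Passing `sleep` as a parameter keeps the sketch testable without waiting for real delays.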
Buyer-facing error messages on the Sources list are sanitized
via SourceErrorPresenter — raw upstream JSON
envelopes (Cloudflare 401 bodies, Browserless stack traces) get
rewritten to friendly lines like "We couldn't reach this page"
or "The crawl service is busy right now — we will retry
automatically." Operators still see the full raw message
under Show details.
From the sources list, each row has:
- Reindex: file uploads are re-read from storage/app/private/uploads/{source_id}/segment-N.txt, so there is no need to re-upload the original. If the persisted file is missing (pre-fix uploads or a disk wipe) the UI surfaces a "Re-upload" prompt.

Notion and Google sources both use OAuth. Connect once from /app/integrations; the
token is encrypted at rest. After connecting, the source modal lets you
pick pages or documents directly.
Re-syncs are manual (per-source Reindex button) — we don't poll your Notion / Drive on a schedule. If you change a Notion page, click Reindex on that source.
Cloudflare Vectorize has eventual consistency on metadata-filtered
queries — even after an upsert returns 200 OK, an
agent_id-filtered query against that vector typically
returns 0 hits for the first 30 to 60 seconds while
the metadata index propagates across edge regions.
Practical consequence: a freshly uploaded file shows up as
status=indexed in the Sources page immediately, but the
agent won't be able to answer questions about it until the propagation
window closes. The upload-success banner reminds the admin of this.
If the agent still doesn't return relevant chunks after a minute,
open the source's Preview to confirm the extracted
text isn't empty — that's a parser-side issue, not a vector-side one.
Same gotcha applies to the very first upload after creating a Cloudflare Vectorize index for the first time — the index itself has a ~2 minute provisioning lag before any queries return results, even unfiltered ones.
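
If you script against the agent right after an upload (for example in a smoke test), polling until a filtered query returns hits is one way to ride out the propagation window. The Python sketch below is only an illustration: `wait_until_searchable` is a hypothetical helper, and `query_fn` stands in for whatever call you use to run the filtered query.

```python
import time


def wait_until_searchable(query_fn, timeout=90.0, interval=5.0,
                          sleep=time.sleep, clock=time.monotonic):
    """Poll query_fn until it returns hits or the timeout elapses.

    Returns the hits, or an empty list if nothing appeared in time.
    """
    deadline = clock() + timeout
    while True:
        hits = query_fn()
        if hits:
            return hits
        if clock() >= deadline:
            return []
        sleep(interval)
```

An empty result after the timeout is the cue to open the source's Preview and check whether the extracted text is empty.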
Deleting a source cascades: documents, chunks, and vector points all go in one transaction. There's no soft-delete on sources.