This example indexes many GitHub repositories at once — one tenant per entry in a JSON config file — into a single Postgres + pgvector table. The config files themselves live in a GitHub repo: commit a change to *.json there and the affected tenants are added or removed on the next config poll, without restarting the app.
The whole pipeline is one CocoIndex App. A LiveMap decouples the producer (reads tenant configs) from the consumer (indexes each tenant), so the components stack as two halves joined by the map:
app_main
├── LiveMap[tenant_key, _RepoConfig]                      # in-memory shared map
│
├── coco.auto_refresh(produce_tenant_configs, 5min)       # producer: poll config repo
│   └── _ConfigCollector (RepoVisitor)                    # walk config repo, read every *.json,
│       │                                                 # merge tenant entries up the return chain
│       └── config_map.declare_entry(tenant_key, config)  # one entry per tenant
│
└── coco.mount_each(process_tenant, config_map)           # consumer: one comp per entry
    └── per tenant_key:
        └── coco.auto_refresh(sync_tenant_repo, 5min)
            └── github.mount_each_file                    # walk the configured repo
                └── per file: process_file → process_chunk
Producer. produce_tenant_configs reads the config files with a github.RepoVisitor (_ConfigCollector). commit.get_object(config_dir) resolves the config directory, then config_root.accept(...) walks just that subtree: visit_file reads one file's raw JSON, visit_directory merges its children, and the merged {tenant_key: raw_config} map flows up the return-value chain. The visitor returns raw JSON rather than typed _RepoConfig because accept is memoized and its cache only round-trips generic JSON types; parse_tenant_configs does the typed parse afterward. The producer then calls declare_entry once per tenant, creating one LiveMap entry each. SHA-keyed memoization on accept means unchanged config files are not re-read between cycles. Config files are merged by tenant key, so the file a tenant is defined in is deliberately not part of the component path: moving a tenant between config files keeps the same entry (and rows).
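The merge-up-the-return-chain idea can be illustrated without any CocoIndex or GitHub calls. The sketch below is a plain-Python stand-in, not the example's actual visitor: FileNode, DirNode and collect are hypothetical names, and letting a later file win on a duplicate tenant key is an assumption made here for simplicity.

```python
from __future__ import annotations

import json
from dataclasses import dataclass, field


@dataclass
class FileNode:  # hypothetical stand-in for a file in the config repo
    name: str
    text: str


@dataclass
class DirNode:  # hypothetical stand-in for a directory in the config repo
    name: str
    children: list = field(default_factory=list)


def collect(node) -> dict[str, dict]:
    """Return the merged {tenant_key: raw_config} map for the subtree at node."""
    if isinstance(node, FileNode):
        # "visit_file": only *.json files contribute tenant entries.
        if not node.name.endswith(".json"):
            return {}
        return json.loads(node.text)
    # "visit_directory": merge whatever each child returned.
    merged: dict[str, dict] = {}
    for child in node.children:
        merged.update(collect(child))  # assumption: later entries win on duplicates
    return merged


tree = DirNode("example_configs", [
    FileNode("a.json", '{"cocoindex_md": {"repo_name": "cocoindex"}}'),
    DirNode("team_b", [
        FileNode("b.json", '{"cocoindex_py": {"repo_name": "cocoindex"}}'),
    ]),
    FileNode("README.md", "not a config"),
])
merged = collect(tree)
```

Because the result is keyed by tenant only, moving cocoindex_py from team_b/b.json into a.json would produce the same merged map.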
Consumer. coco.mount_each(process_tenant, config_map) mounts one component per map entry, keyed by tenant_key. CocoIndex tracks ownership of target rows by component path: a new entry creates a component; an entry that disappears (tenant removed from the config, or marked to_delete: true) drops the component and CocoIndex deletes its rows automatically. Entries are owned by the producer, so the map — and through it the consumer — follows the config repo with no manual flow management, thread locks, or safeguard timers.
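As a mental model of the membership rule above, the effect of one config poll can be pictured as a set difference between the previous and current tenant keys. This is only a sketch of the observable behaviour, not how CocoIndex implements it; live_tenants and reconcile are hypothetical helpers.

```python
from __future__ import annotations


def live_tenants(raw: dict[str, dict]) -> set[str]:
    """Tenant keys that should have a component: present and not marked to_delete."""
    return {key for key, cfg in raw.items() if not cfg.get("to_delete", False)}


def reconcile(previous: set[str], current: set[str]) -> tuple[set[str], set[str]]:
    """Return (added, removed) tenant keys between two config polls."""
    return current - previous, previous - current


before = live_tenants({"cocoindex_md": {}, "cocoindex_rs": {}})
after = live_tenants({
    "cocoindex_md": {},
    "cocoindex_py": {},
    "cocoindex_rs": {"to_delete": True},
})
added, removed = reconcile(before, after)
```

In this model, each key in added corresponds to a newly mounted component, and each key in removed to a dropped component whose rows are deleted.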
Two GitHub poll cadences, refreshing independently:
- Config repo: polled every 5 minutes by produce_tenant_configs. Governs which tenants exist (the LiveMap's membership).
- Tenant repos: polled every 5 minutes per tenant by sync_tenant_repo (mount_each can't take a live component directly, so process_tenant mounts the auto_refresh repo walk). Governs each tenant's code. SHA-keyed memoization in github.File.accept means unchanged blobs are not re-read or re-embedded between cycles.
The config repo holds one or more *.json files (under CONFIG_DIR). By default this points at this example's own example_configs/ on GitHub, so the app runs out of the box; point it at your own repo to manage real tenants. Each file is a map from tenant key to repo config:
{
  "cocoindex_md": {
    "repo_owner": "cocoindex-io",
    "repo_name": "cocoindex",
    "git_ref": "main",
    "included_patterns": ["**/*.md", "**/*.mdx"],
    "excluded_patterns": ["**/.*", "**/target", "**/node_modules"]
  },
  "cocoindex_py": {
    "repo_owner": "cocoindex-io",
    "repo_name": "cocoindex",
    "git_ref": "main",
    "included_patterns": ["python/**/*.py"],
    "excluded_patterns": ["**/.*"]
  },
  "cocoindex_rs": {
    "repo_owner": "cocoindex-io",
    "repo_name": "cocoindex",
    "git_ref": "main",
    "to_delete": true
  }
}
included_patterns / excluded_patterns are passed straight to PatternFilePathMatcher. To restrict to a subdirectory, prefix the include pattern (e.g. python/**/*.py instead of v0's separate path field).
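For readers who want to validate a config file before committing it, here is a minimal sketch of turning the raw JSON map into typed records. RepoConfig and parse_configs are hypothetical names (the example's own _RepoConfig and parse_tenant_configs may differ), and the defaults for git_ref and the pattern lists are assumptions.

```python
from __future__ import annotations

import json
from dataclasses import dataclass


@dataclass(frozen=True)
class RepoConfig:  # hypothetical mirror of one tenant entry
    repo_owner: str
    repo_name: str
    git_ref: str = "main"                      # assumed default
    included_patterns: tuple[str, ...] = ()    # assumed optional
    excluded_patterns: tuple[str, ...] = ()    # assumed optional
    to_delete: bool = False


def parse_configs(raw: dict[str, dict]) -> dict[str, RepoConfig]:
    """Convert {tenant_key: raw_config} into typed records; KeyError on missing repo fields."""
    return {
        tenant_key: RepoConfig(
            repo_owner=cfg["repo_owner"],
            repo_name=cfg["repo_name"],
            git_ref=cfg.get("git_ref", "main"),
            included_patterns=tuple(cfg.get("included_patterns", ())),
            excluded_patterns=tuple(cfg.get("excluded_patterns", ())),
            to_delete=bool(cfg.get("to_delete", False)),
        )
        for tenant_key, cfg in raw.items()
    }


raw = json.loads("""
{
  "cocoindex_py": {
    "repo_owner": "cocoindex-io", "repo_name": "cocoindex", "git_ref": "main",
    "included_patterns": ["python/**/*.py"], "excluded_patterns": ["**/.*"]
  },
  "cocoindex_rs": {
    "repo_owner": "cocoindex-io", "repo_name": "cocoindex", "git_ref": "main",
    "to_delete": true
  }
}
""")
configs = parse_configs(raw)
```

Frozen dataclasses with tuple fields are hashable and comparable, which makes it easy to tell whether a tenant's config actually changed between polls.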
- Install Postgres with the vector extension.
- A GitHub App with read access to both the config repo and the tenant repos you want to index. Save its App ID and the PEM private key path.
- A GitHub repo holding your *.json tenant config files. The defaults point at this example's own multi_github_code_indexing/example_configs; set CONFIG_REPO_OWNER / CONFIG_REPO_NAME / CONFIG_DIR to use your own.
Copy .env.example to .env and fill in:
POSTGRES_URL=postgres://...
GITHUB_APP_ID=...
GITHUB_PRIVATE_KEY_PATH=/path/to/key.pem
CONFIG_REPO_OWNER=cocoindex-io # defaults to this example's repo
CONFIG_REPO_NAME=cocoindex-plus-examples
CONFIG_GIT_REF=main # optional, defaults to main
CONFIG_DIR=multi_github_code_indexing/example_configs # optional, dir holding *.json
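To make the defaults concrete, the settings above could be read along these lines. This is a sketch rather than the code in main.py: Settings and load_settings are hypothetical names, treating the first three variables as required is an assumption, and the values are assumed to already be in the process environment (loading .env itself is not shown).

```python
from __future__ import annotations

import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:  # hypothetical container for the variables listed above
    postgres_url: str
    github_app_id: str
    github_private_key_path: str
    config_repo_owner: str
    config_repo_name: str
    config_git_ref: str
    config_dir: str


# Assumption: these three have no sensible default, so they are required.
REQUIRED = ("POSTGRES_URL", "GITHUB_APP_ID", "GITHUB_PRIVATE_KEY_PATH")


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from env, falling back to the example's documented defaults."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("missing required settings: " + ", ".join(missing))
    return Settings(
        postgres_url=env["POSTGRES_URL"],
        github_app_id=env["GITHUB_APP_ID"],
        github_private_key_path=env["GITHUB_PRIVATE_KEY_PATH"],
        config_repo_owner=env.get("CONFIG_REPO_OWNER", "cocoindex-io"),
        config_repo_name=env.get("CONFIG_REPO_NAME", "cocoindex-plus-examples"),
        config_git_ref=env.get("CONFIG_GIT_REF", "main"),
        config_dir=env.get("CONFIG_DIR", "multi_github_code_indexing/example_configs"),
    )


demo = load_settings({
    "POSTGRES_URL": "postgres://localhost/demo",
    "GITHUB_APP_ID": "12345",
    "GITHUB_PRIVATE_KEY_PATH": "/path/to/key.pem",
})
```

Passing a plain dict instead of os.environ keeps the function easy to check in isolation.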
Install:
pip install -e .
Catch-up mode (one pass, then exit):
cocoindex update main
Live mode (keeps the app running, polls the config repo for tenant membership changes and each tenant's GitHub ref for code changes, both every 5 minutes):
cocoindex update -L main
Query (per-tenant filter optional via the SQL WHERE tenant_key = … clause shown in main.py):
python main.py "your search query"
- Rate limiting: v1's GitHub connector doesn't yet have an in-process throttle. Several tenants walking the same App in parallel can briefly burst against the GitHub API; the 429-retry loop will recover. If you have many tenants, consider staggering auto_refresh intervals or running fewer concurrent tenants.
- Schema changes: the code_embeddings table now has a tenant_key column. If you ran a previous version of this example, drop the table before re-indexing.
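One way to stagger tenants, sketched here under stated assumptions, is to derive a stable per-tenant offset within the 5-minute window from a hash of the tenant key. stagger_offset is a hypothetical helper; how the offset is applied (for example as an initial delay before a tenant's first walk) is left open, since nothing here relies on a specific auto_refresh option.

```python
import hashlib


def stagger_offset(tenant_key: str, interval_s: int = 300) -> int:
    """Return a deterministic offset in [0, interval_s) seconds for this tenant."""
    digest = hashlib.sha256(tenant_key.encode("utf-8")).digest()
    # The first 8 bytes are plenty to spread tenants across the window.
    return int.from_bytes(digest[:8], "big") % interval_s


offsets = {
    key: stagger_offset(key)
    for key in ("cocoindex_md", "cocoindex_py", "cocoindex_rs")
}
```

Using a cryptographic hash rather than Python's built-in hash() keeps the offsets identical across restarts, because hash() is salted per process for strings.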
Optional UI for inspecting the pipeline:
cocoindex server -ci main.py
Then open https://cocoindex.io/cocoinsight.