diff --git a/CHANGELOG.md b/CHANGELOG.md index a676740..ee114e4 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -1,5 +1,29 @@ # Changelog +## [2025-10-17T06:42:38-04:00 (America/New_York)] +### Changed +- Added `entity_label` to triplet CSV rows generated by `scripts/generate_synthetic_dataset.py` and refreshed ingestion + documentation (`docs/retrieval.md`, `README.md`, `docs/operations.md`, `docs/testing.md`, `SETUP.md`) plus planning collateral + (`PROJECT.md`, `PLAN.md`, `ROADMAP.md`, `SOT.md`, `ENVIRONMENT_NEEDS.md`, `NEEDED_FOR_TESTING.md`, `PLANNING_THOUGHTS.md`, + `ISSUES.md`, `TODO.md`, `RESUME_NOTES.md`) so synthetic dataset guidance stays accurate. + +## [2025-10-16T22:44:21-04:00 (America/New_York)] +### Changed +- Simplified roadmap section headings in `ROADMAP.md` by removing week estimates from the horizon labels to + emphasise qualitative prioritisation. + +## [2025-10-16T21:44:46-04:00 (America/New_York)] +### Added +- Documented a synthetic dataset ingestion workflow in `docs/retrieval.md` (including sample loader code) so benchmarking + runs can hydrate graph drivers without recomputing embeddings. + +### Changed +- Expanded operations, setup, and environment guides (`docs/operations.md`, `SETUP.md`, `ENVIRONMENT_NEEDS.md`, + `NEEDED_FOR_TESTING.md`) with batching/verification tips for loading generated JSONL/CSV corpora. +- Updated core documentation and planning artifacts (`README.md`, `PROJECT.md`, `PLAN.md`, `ROADMAP.md`, `SOT.md`, + `RECOMMENDATIONS.md`, `PLANNING_THOUGHTS.md`, `ISSUES.md`, `RESUME_NOTES.md`, `TODO.md`) to reference the ingestion workflow + and capture the follow-up automation task. 
+ ## [2025-10-16T20:39:06-04:00 (America/New_York)] ### Added - Added live integration coverage for Memgraph, Neo4j, and Redis via `meshmind/tests/test_integration_live.py` and configured diff --git a/ENVIRONMENT_NEEDS.md b/ENVIRONMENT_NEEDS.md index 95abca7..0b8ed3e 100644 --- a/ENVIRONMENT_NEEDS.md +++ b/ENVIRONMENT_NEEDS.md @@ -24,7 +24,10 @@ consolidation heuristics and pagination under load. The new `scripts/generate_synthetic_dataset.py` utility produces JSONL/CSV corpora (defaults: 10k memories, 20k triplets, 384-dim embeddings) that can be copied to - shared storage for on-demand benchmarking. + shared storage for on-demand benchmarking. Triplet rows now embed `entity_label`, + so pairing the shared datasets with the ingestion workflow documented in + `docs/retrieval.md` lets operators seed environments quickly without recomputing + embeddings or rewriting CSV headers. - Maintain outbound package download access to PyPI and vendor repositories; this session confirmed package installation works when the network is open, and future sessions need the same capability to refresh locks or install new optional diff --git a/ISSUES.md b/ISSUES.md index db7bf9b..adba87d 100644 --- a/ISSUES.md +++ b/ISSUES.md @@ -35,7 +35,9 @@ - [ ] Validate the new Docker Compose stacks (root and `meshmind/tests/docker/`) on an environment with container support and document host requirements (ports, resources). ## Low Priority / Nice to Have +- [x] Align synthetic dataset triplet CSV headers with `Triplet` schema (added `entity_label`) and refresh ingestion docs. +- [x] Remove week-based horizon estimates from roadmap headings to avoid implying precise delivery dates in planning docs. - [x] Offer alternative storage backends (in-memory driver, SQLite, etc.) for easier local development. - [x] Provide an administrative dashboard or CLI commands for listing namespaces, counts, and maintenance statistics (CLI admin subcommands now expose predicates, telemetry, and graph checks). 
-- [ ] Publish onboarding guides and troubleshooting FAQs for contributors. +- [ ] Publish onboarding guides and troubleshooting FAQs for contributors (synthetic dataset ingestion docs landed in `docs/retrieval.md`, but a broader newcomer guide is still pending). - [ ] Explore plugin registration for embeddings and retrieval strategies to reduce manual wiring. diff --git a/NEEDED_FOR_TESTING.md b/NEEDED_FOR_TESTING.md index c58f70f..9257f22 100644 --- a/NEEDED_FOR_TESTING.md +++ b/NEEDED_FOR_TESTING.md @@ -69,7 +69,7 @@ external services are unavailable. - Use `meshmind/testing` fakes (`FakeMemgraphDriver`, `FakeRedisBroker`, `FakeEmbeddingEncoder`, `FakeLLMClient`) in tests or demos to eliminate external infrastructure requirements. Integration suites marked with `@pytest.mark.integration` exercise live Memgraph/Neo4j/Redis instances and expect the docker stack to be running. - Invoke `meshmind admin predicates` and `meshmind admin maintenance --max-attempts --base-delay --run ` during local runs to inspect predicate registries, telemetry, and tune maintenance retries without external services. -- Use the benchmarking utilities in `scripts/` (`evaluate_importance.py`, `consolidation_benchmark.py`, `benchmark_pagination.py`) to validate heuristics and driver performance offline before connecting to live infrastructure. Generate large corpora with `scripts/generate_synthetic_dataset.py` when you need ≥10k memories for stress tests. +- Use the benchmarking utilities in `scripts/` (`evaluate_importance.py`, `consolidation_benchmark.py`, `benchmark_pagination.py`) to validate heuristics and driver performance offline before connecting to live infrastructure. Generate large corpora with `scripts/generate_synthetic_dataset.py` when you need ≥10k memories for stress tests; triplet CSV rows now ship with `entity_label`, so the ingestion workflow in `docs/retrieval.md` can hydrate graph drivers without extra mutation. 
- Seed demo data as needed using the `examples/extract_preprocess_store_example.py` script after configuring environment variables. - Create a `.env` file storing the environment variables above for consistent local configuration. diff --git a/PLAN.md b/PLAN.md index ac41fde..c2f62b8 100644 --- a/PLAN.md +++ b/PLAN.md @@ -1,5 +1,7 @@ # Plan of Action +Roadmap milestones now reference qualitative horizons (Near/Mid/Long-Term) instead of week estimates to focus this plan on sequencing rather than timeboxing. + ## Phase 1 – Stabilize Runtime Basics ✅ 1. **Dependency Guards** – Implemented lazy driver factories, optional imports, and clear ImportErrors for missing packages. 2. **Default Encoder Registration** – Bootstraps register encoders/entities automatically and the CLI invokes them on startup. @@ -20,7 +22,8 @@ 2. **Maintenance Tasks** – Tasks emit telemetry, persist consolidation/compression results, and now retry conflicting writes with configurable exponential backoff (`MAINTENANCE_MAX_ATTEMPTS`, `MAINTENANCE_BASE_DELAY_SECONDS`). Synthetic benchmark scripts, the new `scripts/generate_synthetic_dataset.py`, and integration tests against live Memgraph/Neo4j validate behaviour on larger - workloads; next, replay production-like datasets to tune thresholds. + workloads. Fresh documentation in `docs/retrieval.md` and `docs/operations.md` now describes how to ingest those synthetic datasets + (with triplet CSVs that include `entity_label`) into the target backend; next, replay production-like datasets to tune thresholds. 3. **Importance Scoring Improvements** – Heuristic scoring is live, records distribution metrics via telemetry, and ships with `scripts/evaluate_importance.py` for synthetic/offline evaluation. Next: incorporate real feedback loops or LLM-assisted ranking to tune weights over time. 
diff --git a/PLANNING_THOUGHTS.md b/PLANNING_THOUGHTS.md index 8942954..1d165db 100644 --- a/PLANNING_THOUGHTS.md +++ b/PLANNING_THOUGHTS.md @@ -14,7 +14,7 @@ - **Pydantic Model Policy** – Follow the documented plan (target Pydantic 2.12+, refresh locks when 3.13 wheels land, record migration guidance) to avoid resurrecting compatibility shims. ## Upcoming Research -- Benchmark consolidation heuristics on synthetic datasets representing customer scale and capture telemetry snapshots (seed data via `scripts/generate_synthetic_dataset.py`). +- Benchmark consolidation heuristics on synthetic datasets representing customer scale and capture telemetry snapshots (seed data via `scripts/generate_synthetic_dataset.py`—whose triplet CSV now includes `entity_label`—and load it using the ingestion workflow documented in `docs/retrieval.md`). - Compare graph query latency across in-memory, SQLite, Memgraph, and Neo4j drivers when using pagination and filtering. - Evaluate rerank quality across LLM providers using a labelled evaluation set to determine optimal default models. - Investigate options for secure secret storage (e.g., Vault, AWS Secrets Manager) to standardise API key management. diff --git a/PROJECT.md b/PROJECT.md index ee60a39..cc319c3 100644 --- a/PROJECT.md +++ b/PROJECT.md @@ -78,7 +78,7 @@ - Docker Compose now provisions Memgraph, Neo4j, and Redis; integration-specific stacks (including the Celery worker) live under `meshmind/tests/docker/`. `pytest -m integration` exercises live services once the stack is running. See `ENVIRONMENT_NEEDS.md` and `SETUP.md` for enabling optional services locally. -- `scripts/generate_synthetic_dataset.py` produces large JSONL/CSV corpora (defaults: 10k memories, 20k triplets, 384-dim embeddings) to stress retrieval and consolidation flows prior to ingesting real datasets. 
+- `scripts/generate_synthetic_dataset.py` produces large JSONL/CSV corpora (defaults: 10k memories, 20k triplets, 384-dim embeddings) to stress retrieval and consolidation flows prior to ingesting real datasets. Triplet rows ship with `entity_label` so the ingestion workflow documented in `docs/retrieval.md` hydrates graph drivers without additional preprocessing. ## Roadmap Highlights - Push graph-backed retrieval deeper into the drivers (vector similarity, structured filters) so the new server-side filtering/pagination evolves into full backend-native ranking. diff --git a/README.md b/README.md index af6f219..6861178 100644 --- a/README.md +++ b/README.md @@ -202,7 +202,11 @@ Tasks instantiate the driver lazily, emit structured logs/metrics, and persist c ## Benchmarking & Evaluation - **Synthetic dataset generation** – `scripts/generate_synthetic_dataset.py` creates large JSONL/CSV corpora of memories/triplets (defaults: 10k memories, 20k triplets, 384-dim embeddings) so you can stress retrieval, consolidation, - and integration flows before ingesting real data. + and integration flows before ingesting real data. Triplet rows now ship with `entity_label` to match + `meshmind.core.types.Triplet`. +- **Synthetic dataset ingestion** – Follow the workflow documented in `docs/retrieval.md` to load the generated JSONL/CSV + payloads into MeshMind via the Python client. The operations guide walks through batching tips and post-ingestion + verification so benchmark runs start from a consistent baseline. - **Importance scoring** – `scripts/evaluate_importance.py` runs the heuristic against JSON or synthetic datasets and reports descriptive statistics for quick regression checks. 
- **Consolidation throughput** – `scripts/consolidation_benchmark.py` generates synthetic workloads to measure batch merging diff --git a/RECOMMENDATIONS.md b/RECOMMENDATIONS.md index ffb65ba..3600b01 100644 --- a/RECOMMENDATIONS.md +++ b/RECOMMENDATIONS.md @@ -30,7 +30,9 @@ ## Documentation & Onboarding - Keep `README.md`, `SOT.md`, `docs/`, and onboarding guides synchronized with each release; document rerank, retrieval, and - registry flows with diagrams when possible. + registry flows with diagrams when possible. The new synthetic dataset ingestion workflow in `docs/retrieval.md` should be + incorporated into future onboarding materials. +- Keep roadmap horizons qualitative (Near/Mid/Long-Term) instead of week-based estimates so planning docs emphasise sequencing and flexibility. - Maintain the troubleshooting section for optional tooling (ruff, pyright, typeguard, toml-sort, yamllint) now referenced in the Makefile and expand it as new developer utilities are introduced. Keep `SETUP.md` synchronized when dependencies change. - Provide walkthroughs for configuring LLM reranking, including sample prompts and response expectations. diff --git a/RESUME_NOTES.md b/RESUME_NOTES.md index 6d3726f..607fafd 100644 --- a/RESUME_NOTES.md +++ b/RESUME_NOTES.md @@ -10,9 +10,11 @@ ## Latest Changes +- Removed week-based estimates from roadmap section headings and refreshed planning docs (`PLAN.md`, `SOT.md`, `RECOMMENDATIONS.md`, `ISSUES.md`, `TODO.md`) to emphasise qualitative sequencing. - Added live integration coverage (`meshmind/tests/test_integration_live.py`) for Memgraph, Neo4j, and Redis, introduced a pytest marker configuration, and documented the workflow across README/SETUP/docs. - Generated a fresh `uv.lock`, pinned `.python-version` to 3.12, and updated install docs to standardise on `uv sync --all-extras`. -- Created `scripts/generate_synthetic_dataset.py` for large JSONL/CSV corpora and referenced it across benchmarking docs. 
+- Created `scripts/generate_synthetic_dataset.py` for large JSONL/CSV corpora, added `entity_label` to triplet CSV rows, and referenced it across benchmarking docs. +- Documented the synthetic dataset ingestion workflow across `docs/retrieval.md`, `docs/operations.md`, README, and supporting planning guides so benchmarks can load corpora without recomputing embeddings. - Updated documentation and planning collateral (README.md, SETUP.md, docs/development.md, docs/testing.md, docs/operations.md, PROJECT.md, PLAN.md, RECOMMENDATIONS.md, ROADMAP.md, ENVIRONMENT_NEEDS.md, NEEDED_FOR_TESTING.md, SOT.md, PLANNING_THOUGHTS.md, DUMMIES.md, TODO.md, RESUME_NOTES.md) to reflect the integration workflow, dataset generation, and the new Pydantic policy. ## Environment State @@ -26,5 +28,5 @@ 1. Address remaining `TODO.md` priority items (backend-native vector similarity, Celery worker integration, grpcurl end-to-end tests) now that graph services are accessible locally. 2. Automate the integration suite in CI and capture resource requirements for shared infrastructure. 3. Prepare grpcurl-based smoke tests for `meshmind serve-grpc` and plan protobuf client packaging once integration coverage extends beyond the Python stub. -4. Feed findings from large synthetic datasets into retry/backoff defaults and document recommended values in `ENVIRONMENT_NEEDS.md`. +4. Feed findings from large synthetic datasets into retry/backoff defaults and document recommended values in `ENVIRONMENT_NEEDS.md`, validating the new ingestion workflow as part of those runs. 5. Continue tracking shim retirements in `DUMMIES.md` and follow the cleanup plan in `CLEANUP.md` so remaining fakes can be removed when infrastructure allows. diff --git a/ROADMAP.md b/ROADMAP.md index be874ae..1c8d38d 100644 --- a/ROADMAP.md +++ b/ROADMAP.md @@ -5,21 +5,21 @@ - Support multiple graph backends (in-memory, SQLite, Memgraph, Neo4j) with consistent telemetry, maintenance, and LLM orchestration knobs. 
- Provide developers with reproducible tooling, comprehensive documentation, and automation scripts that keep local and CI environments aligned. -## Near-Term (0–2 Weeks) +## Near-Term - Automate the new integration suite (`pytest -m integration`) in CI so Memgraph/Neo4j/Redis regressions fail fast. -- Finalize maintenance write policies by implementing retry/backoff semantics and measuring consolidation accuracy against representative datasets (now aided by `scripts/generate_synthetic_dataset.py`). +- Finalize maintenance write policies by implementing retry/backoff semantics and measuring consolidation accuracy against representative datasets (now aided by `scripts/generate_synthetic_dataset.py`, whose triplet CSV exposes `entity_label`, and the documented ingestion workflow in `docs/retrieval.md`). - Publish ROADMAP and PLANNING_THOUGHTS artifacts, and seed the `research/` folder with competitive analysis to ground prioritization discussions. - Expand automated smoke tests for REST `/memories/counts`, CLI `meshmind admin counts`, and provisioning scripts to ensure guardrails stay trustworthy. - Capture outstanding shim retirement work (FastAPI tests now live; continue tracking FakeLLM/Fake drivers) in CLEANUP.md with precise acceptance criteria for each removal. -## Mid-Term (2–6 Weeks) +## Mid-Term - Run load tests against SQLite and hosted graph backends to tune pagination defaults, consolidation heuristics, and token compression strategies. - Implement backend-native vector similarity queries and schema indexes so embeddings never leave the database during scoring. - Finalise the gRPC surface by building on the new asyncio server helpers—exercise the `meshmind serve-grpc` CLI entry point within Docker Compose, publish generated clients (Python + additional languages), and add integration smoke tests so external agents can integrate without the in-process stub. 
- Instrument observability exports (Prometheus/OpenTelemetry) and wire dashboards/alerts for ingestion latency, queue depth, and error rates. - Replace compatibility shims with official Pydantic/FastAPI packages once dependency constraints are lifted, and backfill validation coverage. -## Long-Term (6+ Weeks) +## Long-Term - Build evaluation loops—analytics dashboards and LLM-assisted reviews—that continuously score memory importance heuristics and rerank quality. - Introduce human-in-the-loop tooling for conflict resolution, allowing operators to approve merges or override automated maintenance plans. - Explore federated deployments that synchronise multiple MeshMind instances, including replication strategies and eventual-consistency guarantees. diff --git a/SETUP.md b/SETUP.md index 942497a..b574dae 100644 --- a/SETUP.md +++ b/SETUP.md @@ -80,7 +80,10 @@ docker compose -f meshmind/tests/docker/memgraph.yml up -d ``` > Need synthetic load? Run `python scripts/generate_synthetic_dataset.py build/datasets/benchmark` -> to seed JSONL/CSV fixtures before loading them into Memgraph/Neo4j for stress tests. +> to seed JSONL/CSV fixtures before loading them into Memgraph/Neo4j for stress tests. Triplet rows +> now include `entity_label`, so the ingestion workflow in `docs/retrieval.md` can materialize +> `Triplet` models without mutating CSV fields. Follow the ingestion steps when copying fixtures so +> benchmarks reuse the same namespace/layout. ### 3.2 Cleaning up diff --git a/SOT.md b/SOT.md index 4f0d8d6..f20dbdc 100644 --- a/SOT.md +++ b/SOT.md @@ -28,11 +28,12 @@ Supporting assets: - `SETUP.md`: End-to-end provisioning instructions covering Python deps, environment variables, and Compose workflows. - `run/install_setup.sh`, `run/maintenance_setup.sh`: Automation scripts for provisioning fresh environments and refreshing cached workspaces. 
- `scripts/evaluate_importance.py`, `scripts/consolidation_benchmark.py`, `scripts/benchmark_pagination.py`: Evaluation and benchmarking tools for importance heuristics, consolidation throughput, and driver pagination performance. -- `scripts/generate_synthetic_dataset.py`: Produces large JSONL/CSV corpora (defaults: 10k memories, 20k triplets, 384-dim embeddings) for integration and benchmark scenarios. +- `scripts/generate_synthetic_dataset.py`: Produces large JSONL/CSV corpora (defaults: 10k memories, 20k triplets, 384-dim embeddings) for integration and benchmark scenarios. Triplet rows include `entity_label`, so the ingestion workflow in `docs/retrieval.md` stores the generated payloads without recomputing embeddings or mutating CSV fields. - `.github/workflows/ci.yml`: GitHub Actions workflow running linting/formatting checks and pytest. - `pyproject.toml`: Project metadata and dependency list (pins Python `>=3.11,<3.13`; see compatibility notes in `ISSUES.md`). - Documentation (`PROJECT.md`, `PLAN.md`, `SOT.md`, `README.md`, etc.) describing the system and roadmap. - Strategic context (`ROADMAP.md`, `PLANNING_THOUGHTS.md`, `research/overview.md`) summarising milestones, planning questions, and competitor analysis. + Roadmap horizons now use qualitative labels (Near/Mid/Long-Term) without week estimates to emphasise sequencing over exact timing. - `DUMMIES.md`: Catalog of temporary shims (REST/gRPC stubs, Celery dummies, fake drivers) with removal guidance and a retired section for historical compatibility layers. diff --git a/TODO.md b/TODO.md index 6bd8279..ad6167b 100644 --- a/TODO.md +++ b/TODO.md @@ -2,6 +2,8 @@ ## Completed +- [x] Ensure `scripts/generate_synthetic_dataset.py` emits `entity_label` for triplet CSV rows and refresh ingestion docs. +- [x] Remove week estimate qualifiers from roadmap horizon headings to keep milestone labels qualitative. 
- [x] Implement dependency guards and lazy imports for optional packages (`pymgclient`, `tiktoken`, `celery`, `sentence-transformers`). - [x] Add bootstrap helper for default encoder registration and call it from the CLI. - [x] Update OpenAI encoder implementation to align with latest SDK responses and retry semantics. @@ -73,6 +75,7 @@ - [x] Add packaging tests to guarantee `meshmind/protos/memory_service.proto` ships with the distribution and exposes the expected service definition. - [x] Document runtime and operational guidance for the gRPC server across README, SETUP, `docs/api.md`, and `docs/operations.md`. - [x] Add Makefile and CI targets (`make protos`, `make protos-check`) plus scripts to regenerate/verify protobuf bindings, failing CI when drift occurs. +- [x] Document ingestion workflows for the synthetic dataset generator across `docs/retrieval.md` and operations guides so benchmarking instructions stay cohesive. - [x] Replace the REST stub with the concrete FastAPI application and migrate smoke tests to `fastapi.testclient.TestClient`. - [x] Remove Celery dummy fallbacks by requiring the real app/beat imports and keeping docker-compose stacks in sync. - [x] Add a `serve-grpc` CLI subcommand and verify it delegates to the runtime helpers. @@ -95,9 +98,9 @@ - [ ] Add integration tests that spin up `meshmind serve-grpc` and exercise ingestion/search via grpcurl to complement the unit-level coverage (blocked until network-accessible infrastructure is ready). - [ ] Publish protobuf-generated client artifacts (Python wheel or language-neutral bundles) so external services can consume the API once infrastructure is available. - [ ] Automate the live integration suite (`pytest -m integration`) in CI so Memgraph/Neo4j/Redis regressions fail fast. -- [ ] Document ingestion workflows for the synthetic dataset generator across `docs/retrieval.md` and operations guides so benchmarking instructions stay cohesive. 
- [ ] Document the retired REST/Celery shims in release notes and communicate migration steps to downstream integrators. - [ ] Capture gRPC CLI usage examples (including docker-compose orchestration) in `docs/api.md` and `docs/operations.md` once integration smoke tests complete. +- [ ] Automate ingestion of synthetic dataset payloads (JSONL/CSV) via a CLI or script wrapper so benchmarking runs do not require custom snippets. ## Recommended Waiting for Approval Tasks diff --git a/docs/operations.md b/docs/operations.md index b0e652d..e379240 100644 --- a/docs/operations.md +++ b/docs/operations.md @@ -72,7 +72,10 @@ This guide covers operational tasks for MeshMind deployments. - `make benchmarks` runs the synthetic benchmarking scripts (`scripts/evaluate_importance.py`, `scripts/consolidation_benchmark.py`, `scripts/benchmark_pagination.py`) with fast defaults and stores JSON summaries in `build/benchmarks/`. - Override script flags to stress specific backends (for example `--backend neo4j` or higher iteration counts) once live services are provisioned, and capture findings in `FINDINGS.md` / `ENVIRONMENT_NEEDS.md`. -- Use `scripts/generate_synthetic_dataset.py` to produce large JSONL/CSV corpora (defaults: 10k memories, 20k triplets, 384-dim embeddings) before loading data into Memgraph/Neo4j for stress testing. +- Use `scripts/generate_synthetic_dataset.py` to produce large JSONL/CSV corpora (defaults: 10k memories, 20k triplets, 384-dim embeddings) before loading data into Memgraph/Neo4j for stress testing. Pair the generator with the ingestion snippet from `docs/retrieval.md` to hydrate graph backends quickly without recomputing embeddings. Triplet payloads now include `entity_label` so they align with `Triplet` validation without extra preprocessing. When loading via the MeshMind client: + - Batch writes (for example in chunks of 500 memories/triplets) to keep request payload sizes manageable. 
+  - Align namespaces across the JSONL/CSV payloads and retrieval queries so pagination filters remain effective.
+  - Call `meshmind.cli.admin counts --namespace ` after ingestion to confirm memory distribution before executing benchmarks.
 
 ## Deployment Considerations
diff --git a/docs/retrieval.md b/docs/retrieval.md
index ecfed61..1d444ee 100644
--- a/docs/retrieval.md
+++ b/docs/retrieval.md
@@ -50,6 +50,62 @@ batch processing patterns.
 - `rerank_model` / `rerank_endpoint`: explicit overrides that take precedence over environment defaults when reranking.
 - `fields`: optional mapping for textual searches (regex, exact, fuzzy) to target metadata keys.
+
+## Synthetic Dataset Ingestion Workflow
+
+Large-scale retrieval experiments rely on synthetic corpora so benchmarks stay reproducible. Use the following workflow to
+seed data generated by `scripts/generate_synthetic_dataset.py` into your target backend:
+
+1. Generate the corpus:
+
+   ```bash
+   python scripts/generate_synthetic_dataset.py build/datasets/benchmark \
+       --memories 10000 \
+       --triplets 20000 \
+       --namespace benchmark
+   ```
+
+   This produces `memories.jsonl` (memory payloads) and `triplets.csv` (relationships) under `build/datasets/benchmark/`.
+
+2. Load memories with a short Python helper. The snippet below deserialises the JSONL payload and stores the objects directly
+   through the MeshMind client:
+
+   ```python
+   from __future__ import annotations
+
+   from pathlib import Path
+
+   from meshmind.client import MeshMind
+   from meshmind.core.types import Memory
+
+
+   def load_memories(path: Path, namespace: str, batch_size: int = 500) -> None:
+       mm = MeshMind()
+       batch: list[Memory] = []
+       with path.open("r", encoding="utf-8") as handle:
+           for line in handle:
+               payload = Memory.model_validate_json(line)
+               payload.namespace = namespace
+               batch.append(payload)
+               if len(batch) >= batch_size:
+                   mm.store_memories(list(batch))
+                   batch.clear()
+       if batch:
+           mm.store_memories(list(batch))
+
+
+   load_memories(Path("build/datasets/benchmark/memories.jsonl"), namespace="benchmark")
+   ```
+
+3. Persist relationships in a similar fashion using `MeshMind.store_triplets` and the generated CSV payload (for example, with
+   `csv.DictReader`). Each row now includes `subject`, `predicate`, `object`, `namespace`, `entity_label`, and `metadata`, so the
+   ingestion helper can instantiate `Triplet(**row)` without additional mutation.
+
+4. Run retrieval queries (`meshmind search`, REST/gRPC calls, or the MeshMind Python client) targeting the `benchmark`
+   namespace and optional `entity_labels` to exercise vector, hybrid, and metadata filters against the seeded dataset.
+
+The same JSONL/CSV payloads can be adapted for bulk ingestion APIs exposed by the REST/gRPC services if you prefer remote
+loading. Make sure to keep namespaces aligned so pagination and label filters remain effective across benchmarking runs.
+
 ## Extending Retrieval
 
 1. Add a new module under `meshmind/retrieval` with a function that accepts `(query, memories, **kwargs)`.
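Step 3 above can be sketched with a stdlib-only reader. The `read_triplet_rows` helper name is illustrative (not part of MeshMind), and it assumes the `metadata` column holds serialised JSON that should be decoded before model validation, since `csv.DictReader` returns every column as a string:

```python
from __future__ import annotations

import csv
import json
from pathlib import Path
from typing import Iterator


def read_triplet_rows(path: Path) -> Iterator[dict]:
    """Yield triplet rows from the generated CSV with `metadata` decoded from JSON."""
    with path.open("r", encoding="utf-8", newline="") as handle:
        for row in csv.DictReader(handle):
            # DictReader yields every column as a string; metadata is serialised
            # JSON, so decode it before handing the row to validation.
            row["metadata"] = json.loads(row["metadata"]) if row.get("metadata") else {}
            yield row
```

Each yielded dict carries the `subject`, `predicate`, `object`, `namespace`, `entity_label`, and `metadata` keys, so batches of `Triplet(**row)` models can be handed to `MeshMind.store_triplets` in the same chunked fashion as the memory loader.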
diff --git a/docs/testing.md b/docs/testing.md
index a28bb5d..c432f29 100644
--- a/docs/testing.md
+++ b/docs/testing.md
@@ -69,7 +69,8 @@ The command stores JSON summaries under `build/benchmarks/`:
 Adjust the script flags (for example `--backend`, `--iterations`, or `--count`) to stress alternative drivers or larger
 datasets; see `scripts/*.py` for supported options. Document notable findings in `FINDINGS.md` or `ENVIRONMENT_NEEDS.md`
 when tuning defaults for new environments. Use `scripts/generate_synthetic_dataset.py` to generate large JSONL/CSV corpora
-before loading them into Memgraph/Neo4j for scale testing.
+before loading them into Memgraph/Neo4j for scale testing; the triplet CSV now includes `entity_label` so importing with
+`Triplet(**row)` succeeds without extra preprocessing.
 
 ## Adding Tests
diff --git a/scripts/generate_synthetic_dataset.py b/scripts/generate_synthetic_dataset.py
index 397e221..de5cc91 100644
--- a/scripts/generate_synthetic_dataset.py
+++ b/scripts/generate_synthetic_dataset.py
@@ -56,11 +56,14 @@ def _write_jsonl(path: Path, rows: Iterable[dict[str, object]]) -> None:
 
 
 def _write_triplets(path: Path, rows: Iterable[dict[str, object]]) -> None:
     path.parent.mkdir(parents=True, exist_ok=True)
     with path.open("w", encoding="utf-8") as handle:
-        handle.write("subject,predicate,object,namespace,metadata\n")
+        handle.write("subject,predicate,object,namespace,entity_label,metadata\n")
         for row in rows:
             metadata = json.dumps(row.get("metadata", {}), ensure_ascii=False)
             handle.write(
-                f"{row['subject']},{row['predicate']},{row['object']},{row['namespace']},{metadata}\n"
+                (
+                    f"{row['subject']},{row['predicate']},{row['object']},{row['namespace']},"
+                    f"{row['entity_label']},{metadata}\n"
+                )
             )
@@ -79,16 +82,19 @@ def generate_dataset(
     memory_rows = []
     triplet_rows = []
     entity_ids: list[str] = []
+    entity_labels: dict[str, str] = {}
     for _ in range(memories):
         uid = str(uuid4())
         entity_ids.append(uid)
+        label = random.choice(["Note", "Task", "Observation"])
+        entity_labels[uid] = label
         memory_rows.append(
             {
                 "uuid": uid,
                 "namespace": namespace,
                 "name": _random_text(3).title(),
-                "entity_label": random.choice(["Note", "Task", "Observation"]),
+                "entity_label": label,
                 "content": _random_text(random.randint(20, 60)),
                 "embedding": _random_embedding(embedding_dim),
                 "metadata": _random_metadata(),
@@ -103,6 +109,7 @@ def generate_dataset(
                 "predicate": random.choice(["references", "follows", "relates_to", "duplicates"]),
                 "object": obj,
                 "namespace": namespace,
+                "entity_label": entity_labels[subj],
                 "metadata": {
                     "confidence": round(random.uniform(0.5, 0.99), 2),
                     "notes": _random_text(6),