Verbal flashcards API implementation #544
Conversation
This commit includes all implementation details of the verbal flashcards feature. If issues arise, it could be because some local-config was left out. Co-Authored-By: Spon <[email protected]>
mircealungu
left a comment
Review from a walkthrough of the branch. Most findings are inline below; the main blocker and a few minor items are summarized here.
🚨 Blocker — endpoint routes are never registered
zeeguu/api/endpoints/__init__.py is the module that imports every endpoint file so the @api.route(...) decorators execute. verbal_flashcards isn't imported there, so in production none of /verbal_flashcards, /verbal_flashcards/transcribe, /verbal_flashcards/submit, /verbal_flashcards/check_pronunciation exist on the running Flask app.
The tests pass only because test_verbal_flashcards.py's autouse fixture does monkeypatch.setattr("zeeguu.api.endpoints.verbal_flashcards...", ...), which force-imports the module as a side effect and runs the decorators. In production that side-effect import never happens.
Fix: add from . import verbal_flashcards to zeeguu/api/endpoints/__init__.py.
Minor items
- `transcribe_audio_endpoint`'s generic `except Exception` echoes `str(e)` back to the client, which can leak internal details (paths, stack-ish info). Log internally and return a generic `"Transcription failed"`.
- `default.env` — the trailing blank-line deletions are unrelated churn; worth isolating or dropping.
- `VERBAL_FLASHCARD_EXERCISE_SOURCE = "Verbal Flashcards"` — check the existing rows in `exercise_source`; other sources may be snake_case and this would create a slightly inconsistent entry.
```
de_core_news_md @ https://github.com/explosion/spacy-models/releases/download/de_core_news_md-3.7.0/de_core_news_md-3.7.0-py3-none-any.whl
da_core_news_md @ https://github.com/explosion/spacy-models/releases/download/da_core_news_md-3.7.0/da_core_news_md-3.7.0-py3-none-any.whl
nltk
nemo_toolkit[asr]
```
This pulls torch and several GB of model-tooling dependencies into the main API image. The dedicated asr_service/ already lists nemo_toolkit[asr] in its own requirements.txt — that's where it belongs. The main API only proxies audio to the worker; it just needs requests. Please remove this line.
```diff
 # Database
-mysqlclient
+mysqlclient==2.2.7
```
This pin is unrelated to verbal flashcards. Either split it into its own PR or add a line in the PR description explaining why it's needed.
```python
logger = logging.getLogger(__name__)

DEFAULT_ASR_SERVICE_TIMEOUT = float(os.environ.get("ASR_SERVICE_TIMEOUT", "30"))
LOCAL_DEV_ASR_SERVICE_URLS = "da=http://127.0.0.1:5002"
```
Falling back silently to http://127.0.0.1:5002 when ASR_SERVICE_URLS is unset means a misconfigured production deploy quietly tries localhost and then fails with a 502 on connection refused. Prefer to only use this fallback in a dev context (e.g. when FLASK_ENV == 'development') and raise ASRServiceNotConfigured otherwise.
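A minimal sketch of that guard, reusing this module's `LOCAL_DEV_ASR_SERVICE_URLS` and `parse_asr_service_urls`, and assuming an `ASRServiceNotConfigured` exception is available to raise:

```python
import os

def configured_asr_service_urls():
    raw_value = os.environ.get("ASR_SERVICE_URLS", "")
    if not raw_value:
        if os.environ.get("FLASK_ENV") == "development":
            # dev-only convenience: point at the local Danish worker
            raw_value = LOCAL_DEV_ASR_SERVICE_URLS
        else:
            # fail loudly instead of silently dialing localhost in production
            raise ASRServiceNotConfigured("ASR_SERVICE_URLS is not set")
    return parse_asr_service_urls(raw_value)
```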
```diff
@@ -0,0 +1,756 @@
+import traceback
```
This file is 756 lines and mixes four concerns: routes, flashcard-from-scheduled-words mapping, Danish text normalization, and fuzzy matching. Consider splitting into:
- `zeeguu/core/verbal_flashcards/text_normalization.py` (canonical + ASR-tolerant)
- `zeeguu/core/verbal_flashcards/fuzzy_match.py` (DL + JW + `score_word_match` + `calculate_accuracy`)
- `zeeguu/core/verbal_flashcards/flashcard_selection.py` (collection + `_ensure_schedule_for_verbal_flashcard`)
and keep this file as a thin route layer.
```python
FUZZY_ACCEPTANCE_BUFFER = 0.08


def canonical_danish_form(word):
```
These normalizers are Danish-specific but live behind generic-sounding names — sanitize_spoken_text, score_word_match, fuzzy_match_threshold all implicitly assume Danish. When this is extended to de/fr you'll need a per-language normalizer registry. Worth setting up the abstraction now (e.g. normalizer_for(language_code)) even while only Danish is implemented, so future languages don't require refactoring every call site.
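One possible shape for that registry — `normalizer_for` is only the suggested name here, not an existing helper:

```python
# Registry sketch: canonical_danish_form is this PR's Danish normalizer;
# future languages add an entry instead of touching every call site.
_NORMALIZERS = {
    "da": canonical_danish_form,
}

def normalizer_for(language_code):
    normalizer = _NORMALIZERS.get(language_code)
    if normalizer is None:
        raise NotImplementedError(
            f"No verbal-flashcards normalizer for language '{language_code}'"
        )
    return normalizer
```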
```python
schedule = FourLevelsPerWord(user_word=user_word)
schedule.next_practice_time = datetime.now()
schedule.consecutive_correct_answers = 0
```
This commits mid-request, and then report_exercise_outcome commits again a few lines later in the submit flow. Two commits per submit leaves a window where partial state lands on exception between them. Prefer db_session.flush() here and let the downstream report_exercise_outcome commit the aggregate.
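Sketch of that change, keeping the PR's names and assuming `db_session` is the session already in scope:

```python
schedule = FourLevelsPerWord(user_word=user_word)
schedule.next_practice_time = datetime.now()
schedule.consecutive_correct_answers = 0
db_session.add(schedule)
# flush() assigns IDs but keeps the transaction open, so the later
# report_exercise_outcome performs the single commit for the whole submit.
db_session.flush()
```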
```python
    Transcribe audio by routing the request to the dedicated ASR worker that
    owns the model for the user's learned language.
    """
    audio_bytes = audio_file.read()
```
claude says: no file size limit on the upload — audio_file.read() reads the full body into memory unconditionally. A large or malicious POST can exhaust worker memory. Either set MAX_CONTENT_LENGTH on the Flask app or validate Content-Length before reading. please check.
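Both options, sketched; the 10 MB cap is illustrative, not a value from this PR. Option B mirrors the `_ensure_request_audio_size_is_allowed` helper that appears later in the diff:

```python
# Option A: app-wide cap — Flask rejects larger bodies with 413 on its own.
app.config["MAX_CONTENT_LENGTH"] = 10 * 1024 * 1024

# Option B: per-endpoint check before reading the body into memory.
MAX_AUDIO_BYTES = 10 * 1024 * 1024  # illustrative limit

def _ensure_request_audio_size_is_allowed():
    if request.content_length and request.content_length > MAX_AUDIO_BYTES:
        raise ValueError("Audio upload too large")
```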
```python
user, feature_gate = _current_verbal_flashcards_user()
if feature_gate:
    return feature_gate
flashcard = _find_flashcard_for_user(user, flashcard_id)
```
claude says: the flashcard id is str(bookmark.id) and _find_flashcard_for_user re-runs get_flashcard_collection(user) — which calls BasicSRSchedule.user_words_to_study(user) — on every submit. If the schedule state changes between the initial GET and submit (e.g. a sibling exercise moves the word out of the 'to study' set), submit returns 404 for a card the user just practiced. Resolve by user_word_id + ownership check directly instead of rescanning the session collection. That also fixes the perf concern of re-running the scheduler query once per submit. please check.
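A sketch of the direct resolution — the `user_id` ownership column and the Flask-SQLAlchemy-style `query.get` are assumptions about the model layer:

```python
# Resolve the submitted card directly; no rescan of the session collection.
user_word = UserWord.query.get(int(flashcard_id))
if user_word is None or user_word.user_id != user.id:
    return json_result({"error": "Flashcard not found"}), 404
```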
```python
)
except ValueError as exc:
    return jsonify({"error": str(exc)}), 400
except Exception as exc:
```
Status codes between /health and /transcribe disagree: when the model fails to load, /health returns 200 with 'status': 'degraded', but /transcribe raises RuntimeError which hits this generic except Exception → 500. Return 503 when ASR_AVAILABLE is false (or asr_model is None) so orchestrators and load balancers can route around a bad pod.
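Sketch, assuming the worker module's `ASR_AVAILABLE` flag and `asr_model` global:

```python
# At the top of the worker's /transcribe handler:
if not ASR_AVAILABLE or asr_model is None:
    # 503 mirrors the "degraded" signal /health already exposes, so
    # orchestrators and load balancers can route around this pod.
    return jsonify({"error": "ASR model not loaded"}), 503
```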
mircealungu
left a comment
Two follow-up inline comments: worker model-load semantics, and the Flask-config fallback in the ASR client (context: comparing against readability vs stanza conventions).
```python
bind = os.environ.get("GUNICORN_BIND", "0.0.0.0:5002")
workers = int(os.environ.get("GUNICORN_WORKERS", "1"))
```
preload_app = False + workers = 1 is fine today, but the model is loaded at module import time in app.py (asr_model = ASRModel.from_pretrained(...)). With preload_app = True, gunicorn imports the app once in the master process and then forks workers — on Linux the forked children share the loaded model weights via copy-on-write, so one load serves N workers. With preload_app = False, each worker loads its own copy. If you ever bump workers above 1 for throughput, memory doubles needlessly. Flipping to preload_app = True now makes that a config change, not a code change.
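The resulting config is a one-line change to the file shown above:

```python
import os

bind = os.environ.get("GUNICORN_BIND", "0.0.0.0:5002")
workers = int(os.environ.get("GUNICORN_WORKERS", "1"))
# Load the model once in the master; forked workers then share the
# weights via copy-on-write instead of each loading a private copy.
preload_app = True
```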
| """Return configured worker URLs, falling back to the local Danish worker.""" | ||
| raw_value = os.environ.get("ASR_SERVICE_URLS", "") | ||
|
|
||
| if not raw_value and has_app_context(): |
This has_app_context() / current_app.config.get(...) branch is a third config source that matches neither existing convention in Zeeguu:
- `readability_server` is hardcoded in `zeeguu/core/content_retriever/parse_with_readability_server.py:12`
- `stanza` is env-only (`STANZA_SERVICE_URL` in docker-compose.yml)
Hardcoding doesn't fit here because ASR is language-sharded and the map will grow. The env-var map (ASR_SERVICE_URLS) you already use matches the stanza pattern and is the right shape. Suggest dropping this Flask-config branch so env is the single source of truth.
mircealungu
left a comment
Follow-up on #14 — concrete suggestion you can apply directly.
```python
def configured_asr_service_urls():
    """Return configured worker URLs, falling back to the local Danish worker."""
    raw_value = os.environ.get("ASR_SERVICE_URLS", "")

    if not raw_value and has_app_context():
        raw_value = current_app.config.get("ASR_SERVICE_URLS", "")

    if not raw_value:
        raw_value = LOCAL_DEV_ASR_SERVICE_URLS

    return parse_asr_service_urls(raw_value)
```
Matches the existing convention used by the stanza service in this codebase: it reads its URL from a single env var (STANZA_SERVICE_URL, set in docker-compose.yml) with no Flask-config fallback. The other external service — readability_server — hardcodes its URL in zeeguu/core/content_retriever/parse_with_readability_server.py and reads no env var at all. Neither of them uses current_app.config, so the branch you added here is a third pattern that nothing else in the codebase follows.
Hardcoding (readability's approach) won't work for ASR because the map has to grow per language (da → asr_da, de → asr_de, …), so env-var-as-map is the right shape. Just drop the Flask-config branch so env is the single source of truth, which mirrors stanza.
Suggested replacement for this function:
```diff
 def configured_asr_service_urls():
     """Return configured worker URLs, falling back to the local Danish worker."""
-    raw_value = os.environ.get("ASR_SERVICE_URLS", "")
-    if not raw_value and has_app_context():
-        raw_value = current_app.config.get("ASR_SERVICE_URLS", "")
-    if not raw_value:
-        raw_value = LOCAL_DEV_ASR_SERVICE_URLS
+    raw_value = os.environ.get("ASR_SERVICE_URLS", "") or LOCAL_DEV_ASR_SERVICE_URLS
     return parse_asr_service_urls(raw_value)
```
Once applied, the `from flask import has_app_context, current_app` line at the top of the file becomes unused — remove it in the same commit.
Co-Authored-By: Spon <[email protected]>
…u-api-verbal-flashcards into verbal-flashcards
Co-Authored-By: Spon <[email protected]>
Co-Authored-By: Spon <[email protected]>
Co-Authored-By: Spon <[email protected]>
Functionality has been split into four modules, each with a distinct responsibility. The endpoint module now functions as a slimmer routing layer.
…u-api-verbal-flashcards into verbal-flashcards
…u-api-verbal-flashcards into verbal-flashcards
```yaml
ZEEGUU_DATA_FOLDER: /zeeguu-data/
ZEEGUU_RESOURCES_FOLDER: /zeeguu-data/
STANZA_SERVICE_URL: http://stanza:5001
ASR_SERVICE_URLS: "${ASR_SERVICE_URLS:-da=http://asr_da:5002}"
```
The current shape `da=http://asr_da:5002` is the worst of both worlds — language is in the host name (`asr_da`) AND in the env-var key (`da=`) AND there's an explicit non-default port. Pick a direction:

- Option 1: one ASR container, multiple languages. Container is named `asr`. Language is just a request parameter. URL becomes `da=http://asr` (worker listens on 80 inside the container). Adding German later is just `de=http://asr`.
- Option 2: one container per language. Container `asr_da`, with `asr_de` joining later — fine. But drop the explicit port: let each worker listen on 80, so the URL is just `da=http://asr_da`.

Either way, `:5002` is leaking an internal implementation detail into config. Same shape also appears in `default.env:31` — fix both. Not a hard blocker, but worth deciding now while there's only one entry.
```python
def _current_verbal_flashcards_user():
    user = User.find_by_id(flask.g.user_id)
    return user, _ensure_verbal_flashcards_enabled(user)
```
`_current_verbal_flashcards_user()` returns `(user, None)` when the user passes the gate and `(user, response_tuple)` when they don't, leading every endpoint to do `if feature_gate: return feature_gate`. The function name promises "current user" but it's actually computing a 404 response on the side. Cleaner to split:

```python
def _current_verbal_flashcards_user():
    return User.find_by_id(flask.g.user_id)
```

…and call `_ensure_verbal_flashcards_enabled(user)` explicitly at the top of each endpoint. Less clever, easier to read.
```python
try:
    _ensure_request_audio_size_is_allowed()

    if "file" not in request.files:
        return json_result({"error": "No audio file provided"}), 400

    audio_file = request.files["file"]
    if audio_file.filename == "":
        return json_result({"error": "Empty filename"}), 400

    user, feature_gate = _current_verbal_flashcards_user()
    if feature_gate:
        return feature_gate
```
The feature gate runs after audio-size enforcement, file-presence check, and filename check. Cheap to flip — gate first, then validate. Two reasons:

- Saves reading a possibly-large audio body for a non-allowlisted user.
- Leaks "this feature exists" less to outsiders.

Same pattern in `submit_answer` (gate after JSON parse, line 247) and `check_pronunciation` (gate after JSON parse, line 306). Please move all three gates to the top of the endpoint, as in the sketch below.
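Sketch of the reordered prologue for transcribe, using the handler's existing helpers:

```python
# Gate first: no validation work for non-allowlisted users,
# and less "this feature exists" leakage.
user, feature_gate = _current_verbal_flashcards_user()
if feature_gate:
    return feature_gate

_ensure_request_audio_size_is_allowed()

if "file" not in request.files:
    return json_result({"error": "No audio file provided"}), 400
```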
```python
except Exception as e:
    log(f"Get flashcards error: {e}")
    traceback.print_exc()
    return json_result({"error": str(e)}), 500
```
Echoing `str(e)` to the client can leak ORM internals, file paths, and other implementation details. The transcribe endpoint already does this right (returns a static "Transcription endpoint error" on line 162). Please align the other three the same way: log with detail, respond with a generic message.

Same issue in:

- `submit_answer` (lines 282-284)
- `check_pronunciation` (lines 329-331)
```python
flashcard_id = str(data.get("flashcard_id")) if data.get("flashcard_id") is not None else None
user_answer = data.get("user_answer", "")
is_correct = data.get("is_correct")
answer_source = data.get("answer_source", "unknown")
response_time = data.get("response_time_ms", 0)
session_id = data.get("session_id")

if not flashcard_id or is_correct is None:
    return json_result({"error": "flashcard_id and is_correct are required"}), 400
```
Two small things on the `flashcard_id` handling here:

- `data.get("flashcard_id")` is called twice.
- The truthiness check `not flashcard_id` would reject `flashcard_id == "0"` (zero is falsy) — never an issue today since IDs are positive, but it's a footgun.

Cleaner:

```python
flashcard_id = data.get("flashcard_id")
if flashcard_id is None or is_correct is None:
    return json_result({"error": "flashcard_id and is_correct are required"}), 400
flashcard_id = str(flashcard_id)
```
Forward-compatibility for a likely Whisper switch

The per-language container architecture works for today's Danish-only experiment, but it's worth noting that scaling beyond Danish probably won't mean adding more Parakeet workers — there's no …

The good news: most of this PR's design layers are forward-compatible. The contract …

What does leak Parakeet/per-language assumptions and is worth tightening:

1. Default to single URL; per-language overrides are the exception.
2. Worker's …
3. README in …

None of these are PR-blockers. The current design ships fine for the Danish experiment. They're the kind of thing where doing them now costs ~30 minutes and doing them later (after another language is added with the wrong shape) costs hours of cleanup.
JW is still maintained as part of the diagnostics, since it is an interesting metric to keep an eye on in the future, but it no longer measures correctness.
Verbal Flashcards Change Summary

- Flashcard Response Shape
- Endpoint Safety And Error Handling
- Flashcard Selection And Submission Flow
- Text Normalization
- Fuzzy Matching And Pronunciation Scoring
- ASR Client Configuration
- ASR Worker Architecture
- ASR Worker Dependencies And NeMo Output
- Docker And Environment
- Tests And Verification
- Notes
Verbal flashcards now also use Meaning pairs as possible answers. This can in some cases fix translation errors such as "Ball" being translated to "bolden" instead of "bold", where the bookmark variant contains "bold".
Co-Authored-By: Spon <[email protected]>
```python
    texts.append(cleaned_text)


def answer_variants_for_bookmark(bookmark):
```
Nice idea for the inflection case — bold / bolden sharing the cue "ball" is exactly where strict matching frustrates learners.
But the current rule (any non-INVALID Meaning with same origin-language, translation-language, and lowercased translation text) can't distinguish inflectional variants from genuine homonyms. Concrete Danish example, for an English speaker:
- `spring → forår` (the season)
- `spring → fjeder` (a mechanical spring)
- `spring → kilde` (a water source)
All three would now be accepted as correct answers for the cue "spring", regardless of which sense the learner actually scheduled. If they specifically picked spring → fjeder to drill engineering vocabulary, getting credit for kilde undermines the study.
To do this reliably, I think we need a meaning-family / inflection-group relationship in the data model — something that marks bold and bolden as forms of the same lemma, while keeping fjeder and kilde as unrelated meanings that just happen to share an English cue. Matching on cue text alone over-accepts on homonyms.
Until that data exists, one safer scoping option: limit variants to the same user_word.meaning.origin.lemma (if available), or only accept variants when the edit distance between origin contents is small (which would catch bold ↔ bolden but reject fjeder ↔ kilde). Worth thinking about before this lands as the default scoring policy.
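A sketch of that last option, assuming the fuzzy-match module's Damerau-Levenshtein helper (name illustrative):

```python
MAX_VARIANT_EDIT_DISTANCE = 2  # bold <-> bolden passes; fjeder <-> kilde does not

def is_acceptable_variant(bookmark_origin, candidate_origin):
    # Accept a same-cue meaning only when its origin-language form is a
    # small edit away from the scheduled bookmark's form.
    distance = damerau_levenshtein_distance(
        bookmark_origin.lower(), candidate_origin.lower()
    )
    return distance <= MAX_VARIANT_EDIT_DISTANCE
```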
Good catch! This should be fixed now. The interim decision is to accept variants only if they are 2 or fewer edits from the original bookmark :)
Add backend support for Verbal Flashcards, pronunciation checking, and language-specific ASR workers
Why this Pull Request exists
This PR adds the backend required for a new spoken vocabulary exercise called Verbal Flashcards.
The new backend work does four jobs: serve flashcards, transcribe learner speech, check pronunciation, and record exercise outcomes.
The result is not just “speech-to-text attached to flashcards.” It is a full backend workflow for spoken practice.
High-level architecture
The implementation is split into three parts:
1. Main Zeeguu API endpoint layer
This is the part that serves flashcards, receives answers, and integrates results into the existing scheduling model.
2. ASR service client layer
This decides which ASR worker should handle a transcription request based on language.
3. Dedicated ASR worker microservice
Each worker loads a single speech-recognition model and exposes transcription over HTTP.
This separation is a strong design choice because it keeps the core application free from model-specific dependencies and makes multilingual expansion much easier.
What is included in this PR
1. A new verbal flashcards endpoint module
The main new backend module introduces the feature's business logic, API routes, matching logic, and persistence behavior.
2. Feature-flag support
The feature is gated behind a dedicated feature toggle so it can be rolled out safely.
3. An ASR client for routing by language
The API can forward audio to different workers depending on the learner's studied language.
4. A dedicated ASR worker service
A small Flask microservice loads one ASR model and handles transcription requests.
5. Gunicorn config for the worker
The worker is packaged with deployment settings for production-style serving.
6. Tests
The feature includes focused tests for feature availability, flashcard generation, text normalization and matching, endpoint behavior, and answer submission.
File-by-file explanation
`verbal_flashcards.py`

This is the main backend implementation.
Core responsibilities
Feature access control
The module checks whether the current user is allowed to use verbal flashcards. If not, it returns a feature-disabled response.
Flashcard generation
The function `get_flashcard_collection(user)` converts the learner's scheduled study words into spoken flashcards.

This is a smart choice because verbal practice is being layered onto Zeeguu's existing learning model instead of duplicating vocabulary state.
Schedule recovery for higher-level words
The helper `_ensure_schedule_for_verbal_flashcard(user_word)` makes sure a word has a schedule row even if it is not currently in the normal exercise pipeline.

This matters because verbal flashcards can target mature words. The feature therefore needs a way to write outcomes back into the existing scheduling system without resetting the learner's progress.
New API endpoints in `verbal_flashcards.py`

`GET /verbal_flashcards`

Returns the learner's currently available verbal flashcards.

What it supports:

- `limit`
- `offset`

Why it matters
This is the entry point for the frontend session. It turns scheduled Zeeguu study words into a frontend-ready spoken exercise list.
`POST /verbal_flashcards/transcribe`

Accepts an uploaded audio file and returns a transcription.
Main behavior: it forwards the audio to the ASR worker that owns the model for the learner's language.

Error handling

It returns specific status codes for different failure modes:

- `400` for missing or invalid file input,
- `503` when no ASR worker is configured for the language,
- `502` when the worker request fails,
- `500` for unexpected internal errors.

`POST /verbal_flashcards/check_pronunciation`

Accepts:

- `user_speech`
- `expected_text`

and returns structured pronunciation analysis without storing progress.
Why this endpoint is useful
It separates evaluation from persistence.
That gives the frontend a safe way to score an attempt (for retries or live feedback, for example) before anything is persisted.
This separation is one of the strongest architectural decisions in the PR.
`POST /verbal_flashcards/submit`

Accepts the final attempt result and records it as an exercise outcome.
Expected payload includes:

- `flashcard_id`
- `user_answer`
- `is_correct`
- `answer_source`
- `response_time_ms`
- `session_id`

Important behavior: the outcome is recorded from `is_correct`, under the exercise source `Verbal Flashcards`.

This endpoint is what makes the feature part of Zeeguu's real learning pipeline rather than just a standalone pronunciation demo.
Matching and evaluation logic
ASR output is noisy, especially for foreign-language learners. If the system demanded exact string equality, the feature would feel unfair and would reject many answers that are close enough to be pedagogically useful.
This PR solves that problem by combining several layers of normalization and fuzzy matching.
`sanitize_spoken_text`

Normalizes casing, punctuation, and spacing while preserving Danish characters.

`canonical_danish_form`

Converts common alternative spellings into stable Danish written forms, such as:

- `aa` -> `å`
- `ae` -> `æ`
- `oe` -> `ø`

`asr_tolerant_danish_form`

Applies even more permissive transformations for ASR comparison, such as:

- `æ` -> `e`
- `ø` -> `o`
- `å` -> `a`

This is the feature's "grace for language learners" layer.
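A condensed sketch of the two layers (the PR's real implementations handle more cases than these bare replacements):

```python
def canonical_danish_form(word):
    # Stable written form: fold alternative spellings into Danish letters.
    return word.replace("aa", "å").replace("ae", "æ").replace("oe", "ø")

def asr_tolerant_danish_form(word):
    # Permissive comparison form: collapse Danish letters toward ASCII
    # so ASR near-misses on æ/ø/å still line up.
    return word.replace("æ", "e").replace("ø", "o").replace("å", "a")
```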
Fuzzy algorithms included
Damerau-Levenshtein distance
Used to measure edit distance with support for insertion, deletion, substitution, and transposition.
This is useful for small ASR spelling mistakes and near-miss transcriptions.
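For reference, an illustrative restricted Damerau-Levenshtein (optimal string alignment) implementation — a sketch of the metric, not necessarily the PR's exact code:

```python
def damerau_levenshtein(a, b):
    # d[i][j] = edits needed to turn a[:i] into b[:j]
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]
```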
Jaro and Jaro-Winkler similarity
Used to capture similarity in a more flexible way, especially for short strings.
Boundary-aware Jaro-Winkler
The implementation also compares reversed strings so that dropped initial sounds are not punished too harshly.
Length-aware acceptance threshold
Short words use stricter thresholds, while longer words can tolerate more variation.
That is a very sensible design choice because one-letter differences matter much more in very short words.
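An illustrative shape for that policy (the cutoff values here are made up, not the PR's constants):

```python
def fuzzy_match_threshold(word):
    # A single wrong letter changes a short word entirely, so demand near-exact.
    if len(word) <= 3:
        return 0.95
    if len(word) <= 6:
        return 0.88
    # Longer words tolerate more ASR noise before rejection.
    return 0.82
```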
Word-level accuracy analysis
The evaluation does not just compare full phrases.
Instead, it scores each expected word against the learner's transcribed speech individually. The response includes a per-word breakdown of those scores.
This is what makes the frontend's word breakdown possible.
`user_feature_toggles.py`

This file adds the feature toggle for `verbal_flashcards`.

How the toggle works

The feature is enabled only for users whose stored invitation code appears in the environment variable `VERBAL_FLASHCARDS_INVITE_CODES`.

Why this matters

This is a good rollout strategy for an experimental feature because it allows a controlled, invitation-gated rollout.
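A sketch of the gate; the `invitation_code` attribute name on `User` is an assumption:

```python
import os

def verbal_flashcards_enabled(user):
    # Comma-separated allowlist; an unset variable disables the feature for all.
    raw_codes = os.environ.get("VERBAL_FLASHCARDS_INVITE_CODES", "")
    allowed = {code.strip() for code in raw_codes.split(",") if code.strip()}
    return user.invitation_code in allowed  # attribute name assumed
```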
`asr_service_client.py`

This file is the bridge between the main API and the ASR workers.
Main responsibilities
Parse worker mappings
It accepts configuration strings such as:
- `da=http://asr-da:5002`
- `de=http://asr-de:5002`
- `fr=http://asr-fr:5002`

and converts them into a language-to-URL map.
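The parsing itself is small; a sketch of the shape, assuming a comma-separated list (the PR's `parse_asr_service_urls` may differ in detail):

```python
def parse_asr_service_urls(raw_value):
    # "da=http://asr-da:5002,de=http://asr-de:5002" -> {"da": ..., "de": ...}
    mapping = {}
    for entry in raw_value.split(","):
        entry = entry.strip()
        if not entry:
            continue
        language_code, _, url = entry.partition("=")
        if language_code and url:
            mapping[language_code.strip()] = url.strip()
    return mapping
```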
Resolve the correct worker
Given a learner language code, it selects the correct worker URL.
Fallback local development setup
If nothing is configured, it falls back to a local Danish worker mapping.
Perform the transcription request
It sends the audio file plus `language_code` to the selected worker and normalizes request failures into specific exceptions.

Why this matters
This file is what turns the system from “one speech service” into a language-aware worker architecture.
`app.py` (ASR worker microservice)

This file implements the dedicated ASR worker.
Purpose
Each worker instance owns exactly one language model.
That means the main API does not run speech recognition directly. Instead, it forwards the request to the worker that owns the model for the learner's language.
Main environment variables

- `ASR_LANGUAGE_CODE`
- `ASR_MODEL_NAME`
- `ASR_WORKER_NAME`
- `ASR_SERVICE_PORT`

Model loading
At startup, the worker tries to import:

- `nemo.collections.asr`
- `pydub`

and then loads the configured model using `ASRModel.from_pretrained(...)`.

Audio preprocessing
Before transcription, the worker re-encodes the uploaded audio (via pydub) into the format the model expects.
This is important because speech models are much more reliable when audio is normalized to the format they expect.
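A sketch of that step using pydub's documented API; the mono 16 kHz WAV target is a typical ASR input format, assumed here:

```python
from pydub import AudioSegment

def normalize_for_asr(input_path, output_path):
    # Decode whatever the browser uploaded, then resample to the
    # shape speech models commonly expect.
    audio = AudioSegment.from_file(input_path)
    audio = audio.set_frame_rate(16000).set_channels(1)
    audio.export(output_path, format="wav")
    return output_path
```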
Worker endpoints

`GET /health`

Returns worker status, language, name, and whether the model is loaded.

`POST /transcribe`

Accepts a file upload plus optional `language_code`, validates that the request matches the worker's configured language, and returns a transcription.

Why this matters
This worker is small, focused, and operationally clear. It isolates model dependencies and makes future multilingual scaling straightforward.
`gunicorn.conf.py`

This file provides deployment config for the ASR worker.

What it sets

The bind address and worker count, both read from environment variables (`GUNICORN_BIND`, `GUNICORN_WORKERS`).

Why it matters
This turns the worker from a dev-only Flask app into something that is ready to run as a standalone service.
`test_verbal_flashcards.py`

This file provides targeted coverage for the feature.

What the tests verify

- Feature availability — `404` when the feature is disabled.
- Flashcard generation.
- Text normalization and matching.
- Endpoint behavior — the `503` and `400` paths.
- Answer submission — including the `404` case.