
Verbal flashcards API implementation #544

Open
jakso535 wants to merge 49 commits into zeeguu:master from jakso535:verbal-flashcards

Conversation

jakso535 commented Apr 18, 2026

Add backend support for Verbal Flashcards, pronunciation checking, and language-specific ASR workers

Why this Pull Request exists

This PR adds the backend required for a new spoken vocabulary exercise called Verbal Flashcards.

The new backend work does four jobs:

  1. expose verbal flashcards as a new exercise source,
  2. transcribe uploaded learner audio,
  3. evaluate the transcription against the expected answer with learner-friendly tolerance,
  4. store the final outcome inside Zeeguu's normal exercise and scheduling system.

The result is not just “speech-to-text attached to flashcards.” It is a full backend workflow for spoken practice.


High-level architecture

The implementation is split into three parts:

1. Main Zeeguu API endpoint layer

This is the part that serves flashcards, receives answers, and integrates results into the existing scheduling model.

2. ASR service client layer

This decides which ASR worker should handle a transcription request based on language.

3. Dedicated ASR worker microservice

Each worker loads a single speech-recognition model and exposes transcription over HTTP.

This separation is a strong design choice because it keeps the core application free from model-specific dependencies and makes multilingual expansion much easier.


What is included in this PR

1. A new verbal flashcards endpoint module

The main new backend module introduces the feature's business logic, API routes, matching logic, and persistence behavior.

2. Feature-flag support

The feature is gated behind a dedicated feature toggle so it can be rolled out safely.

3. An ASR client for routing by language

The API can forward audio to different workers depending on the learner's studied language.

4. A dedicated ASR worker service

A small Flask microservice loads one ASR model and handles transcription requests.

5. Gunicorn config for the worker

The worker is packaged with deployment settings for production-style serving.

6. Tests

The feature includes focused tests for:

  • endpoint behavior,
  • feature gating,
  • flashcard selection,
  • fuzzy matching,
  • worker routing,
  • error handling,
  • answer submission.

File-by-file explanation

verbal_flashcards.py

This is the main backend implementation.

Core responsibilities

Feature access control

The module checks whether the current user is allowed to use verbal flashcards. If not, it returns a feature-disabled response.

Flashcard generation

The function get_flashcard_collection(user) converts the learner's scheduled study words into spoken flashcards.

Important behavior:

  • only words at level 3 or above are included,
  • duplicate source words are removed,
  • each returned card includes both prompt and expected answer fields,
  • cards are derived from existing user words rather than from a separate flashcard database.

This is a smart choice because verbal practice is being layered onto Zeeguu's existing learning model instead of duplicating vocabulary state.
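
A minimal sketch of that selection logic, assuming the scheduler helper BasicSRSchedule.user_words_to_study(user) (named in a review comment below) and hypothetical field names on user_word; the PR's real implementation differs in detail:

def get_flashcard_collection(user):
    # Hypothetical sketch; the id scheme and meaning/translation field
    # names are assumptions, not the PR's actual code.
    cards = []
    seen_origins = set()
    for user_word in BasicSRSchedule.user_words_to_study(user):
        if user_word.level < 3:
            continue  # only mature (level 3+) words become spoken flashcards
        origin = user_word.meaning.origin.content.lower()
        if origin in seen_origins:
            continue  # drop duplicate source words
        seen_origins.add(origin)
        cards.append({
            "id": str(user_word.id),
            "prompt": user_word.meaning.translation.content,  # shown to the learner
            "answer": origin,                                 # what they should say
        })
    return cards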

Schedule recovery for higher-level words

The helper _ensure_schedule_for_verbal_flashcard(user_word) makes sure a word has a schedule row even if it is not currently in the normal exercise pipeline.

This matters because verbal flashcards can target mature words. The feature therefore needs a way to write outcomes back into the existing scheduling system without resetting the learner's progress.
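
A hedged sketch of what that recovery amounts to after the review round below; the final revision switches to the canonical scheduler factory (FourLevelsPerWord.find_or_create, quoted in the change summary) instead of hand-initializing schedule fields:

def _ensure_schedule_for_verbal_flashcard(user_word):
    # find_or_create returns the existing schedule row when one exists,
    # so recording a verbal outcome never resets the learner's progress
    return FourLevelsPerWord.find_or_create(db.session, user_word)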


New API endpoints in verbal_flashcards.py

GET /verbal_flashcards

Returns the learner's currently available verbal flashcards.

What it supports

  • optional limit
  • optional offset
  • feature-gate enforcement
  • pagination metadata in the response

Why it matters

This is the entry point for the frontend session. It turns scheduled Zeeguu study words into a frontend-ready spoken exercise list.
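
An illustrative response shape, based on the card fields (id, prompt, answer) settled on in the change summary further down; the pagination key names here are assumptions:

{
    "flashcards": [
        {"id": "42", "prompt": "ball", "answer": "bold"},
    ],
    "total": 17,   # pagination metadata (key names assumed)
    "limit": 10,
    "offset": 0,
}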


POST /verbal_flashcards/transcribe

Accepts an uploaded audio file and returns a transcription.

Main behavior

  • validates that a file exists,
  • validates filename presence,
  • checks whether the feature is enabled,
  • reads the learner's learned language,
  • sends the audio to the correct ASR worker,
  • returns the transcribed text.

Error handling

It returns specific status codes for different failure modes:

  • 400 for missing or invalid file input,
  • 503 when no ASR worker is configured for the language,
  • 502 when the worker request fails,
  • 500 for unexpected internal errors.
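
A sketch of how that mapping can look in the route body; ASRServiceNotConfigured is the exception name quoted in a review comment below, while ASRServiceRequestError, transcribe_with_worker, and the message strings are assumptions:

try:
    # the 400 cases (missing file, empty filename) are handled before this point
    transcription = transcribe_with_worker(worker_url, audio_bytes, language_code)
except ASRServiceNotConfigured:
    return json_result({"error": "No ASR worker configured for this language"}), 503
except ASRServiceRequestError:
    return json_result({"error": "Transcription service unavailable"}), 502
except Exception:
    log(traceback.format_exc())  # keep the detail server-side only
    return json_result({"error": "Transcription failed"}), 500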

POST /verbal_flashcards/check_pronunciation

Accepts:

  • user_speech
  • expected_text

and returns structured pronunciation analysis without storing progress.

Why this endpoint is useful

It separates evaluation from persistence.

That gives the frontend a safe way to:

  • show immediate feedback,
  • decide whether the learner should retry,
  • avoid writing partial attempts into the database too early.

This separation is one of the strongest architectural decisions in the PR.


POST /verbal_flashcards/submit

Accepts the final attempt result and records it as an exercise outcome.

Expected payload includes

  • flashcard_id
  • user_answer
  • is_correct
  • answer_source
  • response_time_ms
  • optional session_id

Important behavior

  • validates required fields,
  • resolves the flashcard for the current user,
  • coerces and validates timing/session data,
  • recalculates accuracy when a user answer is present,
  • allows fuzzy acceptance to override an initially false is_correct,
  • ensures a schedule row exists,
  • writes the result as a normal Zeeguu exercise outcome,
  • tags the exercise source as Verbal Flashcards.

This endpoint is what makes the feature part of Zeeguu's real learning pipeline rather than just a standalone pronunciation demo.


Matching and evaluation logic

ASR output is noisy, especially for foreign-language learners. If the system demanded exact string equality, the feature would feel unfair and would reject many answers that are close enough to be pedagogically useful.

This PR solves that problem by combining several layers of normalization and fuzzy matching.

sanitize_spoken_text

Normalizes casing, punctuation, and spacing while preserving Danish characters.

canonical_danish_form

Converts common alternative spellings into stable Danish written forms, such as:

  • aa -> å
  • ae -> æ
  • oe -> ø

asr_tolerant_danish_form

Applies even more permissive transformations for ASR comparison, such as:

  • æ -> e
  • ø -> o
  • å -> a
  • simplifying some initial and final consonant patterns.

This is the feature's “grace for language learners” layer.
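
A minimal sketch of the two layers, covering only the substitutions listed above; the real functions handle more patterns, and the consonant-pattern simplifications are omitted here:

def canonical_danish_form(word):
    # fold common ASCII digraph spellings into Danish letters
    for ascii_form, danish in (("aa", "å"), ("ae", "æ"), ("oe", "ø")):
        word = word.replace(ascii_form, danish)
    return word

def asr_tolerant_danish_form(word):
    # collapse Danish vowels toward their most common ASR confusions
    for danish, tolerant in (("æ", "e"), ("ø", "o"), ("å", "a")):
        word = word.replace(danish, tolerant)
    return word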


Fuzzy algorithms included

Damerau-Levenshtein distance

Used to measure edit distance with support for insertion, deletion, substitution, and transposition.

This is useful for small ASR spelling mistakes and near-miss transcriptions.

Jaro and Jaro-Winkler similarity

Used to capture similarity in a more flexible way, especially for short strings.

Boundary-aware Jaro-Winkler

The implementation also compares reversed strings so that dropped initial sounds are not punished too harshly.

Length-aware acceptance threshold

Short words use stricter thresholds, while longer words can tolerate more variation.

That is a very sensible design choice because one-letter differences matter much more in very short words.
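
As a concrete sketch, here is the restricted (OSA) edit distance together with the simplified length-aware policy adopted later in this thread (words of length 1-2 must match exactly; length 3+ tolerates one edit); threshold details in the PR itself may differ:

def osa_distance(a, b):
    # restricted Damerau-Levenshtein: insertion, deletion, substitution,
    # and adjacent transposition
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
                d[i - 1][j - 1] + cost, # substitution
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

def is_accepted(expected, spoken):
    # length-aware policy: very short words must match exactly
    if len(expected) <= 2:
        return expected == spoken
    return osa_distance(expected, spoken) <= 1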


Word-level accuracy analysis

The evaluation does not just compare full phrases.

Instead, for each expected word it:

  • looks for the best unmatched spoken candidate,
  • computes similarity scores,
  • records whether the word matched,
  • records whether it was exact, normalized, fuzzy, or close,
  • builds a word-by-word result structure.

The response includes:

  • overall accuracy,
  • accepted accuracy,
  • accepted word count,
  • acceptance boolean,
  • feedback label,
  • detailed word matches.

This is what makes the frontend's word breakdown possible.
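
A hedged sketch of the per-word pass; similarity and threshold_for below are trivial stand-ins for the Jaro-Winkler scoring and length-aware thresholds described earlier, and the result field names are assumptions:

def similarity(a, b):
    # stand-in for the real Jaro-Winkler-based score
    return 1.0 if a == b else 0.0

def threshold_for(word):
    # stand-in for the real length-aware acceptance threshold
    return 1.0 if len(word) <= 2 else 0.8

def match_words(expected_words, spoken_words):
    results, used = [], set()
    for expected in expected_words:
        best_index, best_score = None, 0.0
        for i, spoken in enumerate(spoken_words):
            if i in used:
                continue  # each spoken candidate can match at most one word
            score = similarity(expected, spoken)
            if score > best_score:
                best_index, best_score = i, score
        matched = best_index is not None and best_score >= threshold_for(expected)
        if matched:
            used.add(best_index)
        results.append({"word": expected, "matched": matched, "score": best_score})
    return results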


user_feature_toggles.py

This file adds the feature toggle for verbal_flashcards.

How the toggle works

The feature is enabled only for users whose stored invitation code appears in the environment variable:

  • VERBAL_FLASHCARDS_INVITE_CODES

Why this matters

This is a good rollout strategy for an experimental feature because it allows:

  • controlled access,
  • language-specific pilots,
  • staged testing with selected cohorts or invite codes.
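
A hedged sketch of the gate; the env var name is the one the PR uses, while the helper name, the comma-separated format, and the invitation-code attribute are assumptions:

import os

def verbal_flashcards_enabled_for(user):
    raw = os.environ.get("VERBAL_FLASHCARDS_INVITE_CODES", "")
    allowed = {code.strip() for code in raw.split(",") if code.strip()}
    return user.invitation_code in allowed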

asr_service_client.py

This file is the bridge between the main API and the ASR workers.

Main responsibilities

Parse worker mappings

It accepts configuration strings such as:

  • da=http://asr-da:5002
  • de=http://asr-de:5002
  • fr=http://asr-fr:5002

and converts them into a language-to-URL map.
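
A sketch of that parsing step, assuming comma-separated lang=url pairs as shown in later comments (da=URL,de=URL,...):

def parse_asr_service_urls(raw_value):
    urls = {}
    for entry in raw_value.split(","):
        if "=" not in entry:
            continue  # skip empty or malformed entries
        language, url = entry.split("=", 1)
        urls[language.strip().lower()] = url.strip()
    return urls

# parse_asr_service_urls("da=http://asr-da:5002,de=http://asr-de:5002")
# -> {"da": "http://asr-da:5002", "de": "http://asr-de:5002"}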

Resolve the correct worker

Given a learner language code, it selects the correct worker URL.

Fallback local development setup

If nothing is configured, it falls back to a local Danish worker mapping.

Perform the transcription request

It sends the audio file plus language_code to the selected worker and normalizes request failures into specific exceptions.
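
A sketch of the forwarding call, assuming the requests library (the only dependency the main API needs for this, per a review comment below) and the worker response shape documented in the change summary; the function signature itself is an assumption:

import requests

def transcribe_with_worker(worker_url, audio_bytes, language_code, timeout=30.0):
    response = requests.post(
        f"{worker_url}/transcribe",
        files={"file": ("audio.wav", audio_bytes)},
        data={"language_code": language_code},
        timeout=timeout,
    )
    response.raise_for_status()  # the real client maps failures to specific exceptions
    return response.json()["transcription"]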

Why this matters

This file is what turns the system from “one speech service” into a language-aware worker architecture.


app.py (ASR worker microservice)

This file implements the dedicated ASR worker.

Purpose

Each worker instance owns exactly one language model.

That means the main API does not run speech recognition directly. Instead, it forwards the request to the worker that owns the model for the learner's language.

Main environment variables

  • ASR_LANGUAGE_CODE
  • ASR_MODEL_NAME
  • ASR_WORKER_NAME
  • ASR_SERVICE_PORT

Model loading

At startup, the worker tries to import:

  • nemo.collections.asr
  • pydub

and then loads the configured model using ASRModel.from_pretrained(...).

Audio preprocessing

Before transcription, the worker:

  • reads the uploaded audio,
  • converts it to mono,
  • resamples it to 16 kHz,
  • exports it to a temporary WAV file,
  • runs transcription on that normalized file.

This is important because speech models are much more reliable when audio is normalized to the format they expect.
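
A minimal preprocessing sketch with pydub, matching the steps listed above; the worker's actual temp-file handling may differ:

import io
import tempfile

from pydub import AudioSegment

def normalize_audio(raw_bytes):
    segment = AudioSegment.from_file(io.BytesIO(raw_bytes))
    segment = segment.set_channels(1).set_frame_rate(16000)  # mono, 16 kHz
    tmp = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
    segment.export(tmp.name, format="wav")
    return tmp.name  # path handed to the loaded ASR model for transcription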

Worker endpoints

GET /health

Returns worker status, language, name, and whether the model is loaded.

POST /transcribe

Accepts a file upload plus optional language_code, validates that the request matches the worker's configured language, and returns a transcription.

Why this matters

This worker is small, focused, and operationally clear. It isolates model dependencies and makes future multilingual scaling straightforward.


gunicorn.conf.py

This file provides deployment config for the ASR worker.

What it sets

  • bind address
  • worker count
  • thread count
  • timeouts
  • log output
  • startup and ready hooks

Why it matters

This turns the worker from a dev-only Flask app into something that is ready to run as a standalone service.
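
An illustrative gunicorn.conf.py along these lines; the bind and workers lines are quoted in a review comment below, the remaining values and hook bodies are assumed defaults:

import os

bind = os.environ.get("GUNICORN_BIND", "0.0.0.0:5002")
workers = int(os.environ.get("GUNICORN_WORKERS", "1"))
threads = int(os.environ.get("GUNICORN_THREADS", "4"))
timeout = int(os.environ.get("GUNICORN_TIMEOUT", "120"))
accesslog = "-"  # access log to stdout
errorlog = "-"   # error log to stderr

def on_starting(server):
    server.log.info("ASR worker starting")

def when_ready(server):
    server.log.info("ASR worker ready to serve")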


test_verbal_flashcards.py

This file provides targeted coverage for the feature.

What the tests verify

Feature availability

  • users get 404 when the feature is disabled.

Flashcard generation

  • only level-3+ words are returned,
  • duplicate origin words are removed,
  • pagination works.

Text normalization and matching

  • Danish sanitization behaves correctly,
  • canonical spelling normalization works,
  • ASR-tolerant forms work,
  • fuzzy matching accepts common learner/ASR variants,
  • multi-word answers can still be accepted when word order differs.

Endpoint behavior

  • pronunciation checking validates required fields,
  • transcription endpoint returns expected data,
  • missing worker configuration returns 503,
  • invalid session ids return 400.

Answer submission

  • exercise outcomes are recorded,
  • fuzzy acceptance can override a false incoming correctness flag,
  • invalid response times are coerced safely,
  • non-eligible words return 404.

jakso535 and others added 2 commits April 18, 2026 16:09
This commit includes all implementation details of the verbal flashcards feature. If issues arise, it could be because some local config was left out.

Co-Authored-By: Spon <[email protected]>
mircealungu (Member) left a comment

Review from a walkthrough of the branch. Most findings are inline below; the main blocker and a few minor items are summarized here.

🚨 Blocker — endpoint routes are never registered

zeeguu/api/endpoints/__init__.py is the module that imports every endpoint file so the @api.route(...) decorators execute. verbal_flashcards isn't imported there, so in production none of /verbal_flashcards, /verbal_flashcards/transcribe, /verbal_flashcards/submit, /verbal_flashcards/check_pronunciation exist on the running Flask app.

The tests pass only because test_verbal_flashcards.py's autouse fixture does monkeypatch.setattr("zeeguu.api.endpoints.verbal_flashcards...", ...), which force-imports the module as a side effect and runs the decorators. In production that side-effect import never happens.

Fix: add from . import verbal_flashcards to zeeguu/api/endpoints/__init__.py.

Minor items

  • transcribe_audio_endpoint's generic except Exception echoes str(e) back to the client, which can leak internal details (paths, stack-ish info). Log internally and return a generic "Transcription failed".
  • default.env — the trailing blank-line deletions are unrelated churn; worth isolating or dropping.
  • VERBAL_FLASHCARD_EXERCISE_SOURCE = "Verbal Flashcards" — check the existing rows in exercise_source; other sources may be snake_case and this would create a slightly inconsistent entry.

Comment thread: requirements.txt (Outdated)
de_core_news_md @ https://github.com/explosion/spacy-models/releases/download/de_core_news_md-3.7.0/de_core_news_md-3.7.0-py3-none-any.whl
da_core_news_md @ https://github.com/explosion/spacy-models/releases/download/da_core_news_md-3.7.0/da_core_news_md-3.7.0-py3-none-any.whl
nltk
nemo_toolkit[asr]
mircealungu (Member):

This pulls torch and several GB of model-tooling dependencies into the main API image. The dedicated asr_service/ already lists nemo_toolkit[asr] in its own requirements.txt — that's where it belongs. The main API only proxies audio to the worker; it just needs requests. Please remove this line.

Comment thread: requirements.txt (Outdated)

# Database
mysqlclient
mysqlclient==2.2.7
mircealungu (Member):

This pin is unrelated to verbal flashcards. Either split it into its own PR or add a line in the PR description explaining why it's needed.

logger = logging.getLogger(__name__)

DEFAULT_ASR_SERVICE_TIMEOUT = float(os.environ.get("ASR_SERVICE_TIMEOUT", "30"))
LOCAL_DEV_ASR_SERVICE_URLS = "da=http://127.0.0.1:5002"
mircealungu (Member):

Falling back silently to http://127.0.0.1:5002 when ASR_SERVICE_URLS is unset means a misconfigured production deploy quietly tries localhost and then fails with a 502 on connection refused. Prefer to only use this fallback in a dev context (e.g. when FLASK_ENV == 'development') and raise ASRServiceNotConfigured otherwise.

@@ -0,0 +1,756 @@
import traceback
mircealungu (Member):

This file is 756 lines and mixes four concerns: routes, flashcard-from-scheduled-words mapping, Danish text normalization, and fuzzy matching. Consider splitting into:

  • zeeguu/core/verbal_flashcards/text_normalization.py (canonical + ASR-tolerant)
  • zeeguu/core/verbal_flashcards/fuzzy_match.py (DL + JW + score_word_match + calculate_accuracy)
  • zeeguu/core/verbal_flashcards/flashcard_selection.py (collection + _ensure_schedule_for_verbal_flashcard)

and keep this file as a thin route layer.

FUZZY_ACCEPTANCE_BUFFER = 0.08


def canonical_danish_form(word):
mircealungu (Member):

These normalizers are Danish-specific but live behind generic-sounding names — sanitize_spoken_text, score_word_match, fuzzy_match_threshold all implicitly assume Danish. When this is extended to de/fr you'll need a per-language normalizer registry. Worth setting up the abstraction now (e.g. normalizer_for(language_code)) even while only Danish is implemented, so future languages don't require refactoring every call site.


schedule = FourLevelsPerWord(user_word=user_word)
schedule.next_practice_time = datetime.now()
schedule.consecutive_correct_answers = 0
mircealungu (Member):

This commits mid-request, and then report_exercise_outcome commits again a few lines later in the submit flow. Two commits per submit leaves a window where partial state lands on exception between them. Prefer db_session.flush() here and let the downstream report_exercise_outcome commit the aggregate.

Transcribe audio by routing the request to the dedicated ASR worker that
owns the model for the user's learned language.
"""
audio_bytes = audio_file.read()
mircealungu (Member):

claude says: no file size limit on the upload — audio_file.read() reads the full body into memory unconditionally. A large or malicious POST can exhaust worker memory. Either set MAX_CONTENT_LENGTH on the Flask app or validate Content-Length before reading. please check.

user, feature_gate = _current_verbal_flashcards_user()
if feature_gate:
    return feature_gate
flashcard = _find_flashcard_for_user(user, flashcard_id)
mircealungu (Member):

claude says: the flashcard id is str(bookmark.id) and _find_flashcard_for_user re-runs get_flashcard_collection(user) — which calls BasicSRSchedule.user_words_to_study(user) — on every submit. If the schedule state changes between the initial GET and submit (e.g. a sibling exercise moves the word out of the 'to study' set), submit returns 404 for a card the user just practiced. Resolve by user_word_id + ownership check directly instead of rescanning the session collection. That also fixes the perf concern of re-running the scheduler query once per submit. please check.

Comment thread: asr_service/app.py (Outdated)
    )
except ValueError as exc:
    return jsonify({"error": str(exc)}), 400
except Exception as exc:
mircealungu (Member):

Status codes between /health and /transcribe disagree: when the model fails to load, /health returns 200 with 'status': 'degraded', but /transcribe raises RuntimeError which hits this generic except Exception → 500. Return 503 when ASR_AVAILABLE is false (or asr_model is None) so orchestrators and load balancers can route around a bad pod.

mircealungu (Member) left a comment

Two follow-up inline comments: worker model-load semantics, and the Flask-config fallback in the ASR client (context: comparing against readability vs stanza conventions).



bind = os.environ.get("GUNICORN_BIND", "0.0.0.0:5002")
workers = int(os.environ.get("GUNICORN_WORKERS", "1"))
mircealungu (Member):

preload_app = False + workers = 1 is fine today, but the model is loaded at module import time in app.py (asr_model = ASRModel.from_pretrained(...)). With preload_app = True, gunicorn imports the app once in the master process and then forks workers — on Linux the forked children share the loaded model weights via copy-on-write, so one load serves N workers. With preload_app = False, each worker loads its own copy. If you ever bump workers above 1 for throughput, memory doubles needlessly. Flipping to preload_app = True now makes that a config change, not a code change.

"""Return configured worker URLs, falling back to the local Danish worker."""
raw_value = os.environ.get("ASR_SERVICE_URLS", "")

if not raw_value and has_app_context():
mircealungu (Member):

This has_app_context() / current_app.config.get(...) branch is a third config source that matches neither existing convention in Zeeguu:

  • readability_server is hardcoded in zeeguu/core/content_retriever/parse_with_readability_server.py:12
  • stanza is env-only (STANZA_SERVICE_URL in docker-compose.yml)

Hardcoding doesn't fit here because ASR is language-sharded and the map will grow. The env-var map (ASR_SERVICE_URLS) you already use matches the stanza pattern and is the right shape. Suggest dropping this Flask-config branch so env is the single source of truth.

mircealungu (Member) left a comment

Follow-up on #14 — concrete suggestion you can apply directly.

Comment on lines +61 to +71
def configured_asr_service_urls():
    """Return configured worker URLs, falling back to the local Danish worker."""
    raw_value = os.environ.get("ASR_SERVICE_URLS", "")

    if not raw_value and has_app_context():
        raw_value = current_app.config.get("ASR_SERVICE_URLS", "")

    if not raw_value:
        raw_value = LOCAL_DEV_ASR_SERVICE_URLS

    return parse_asr_service_urls(raw_value)
mircealungu (Member) commented Apr 22, 2026

Matches the existing convention used by the stanza service in this codebase: it reads its URL from a single env var (STANZA_SERVICE_URL, set in docker-compose.yml) with no Flask-config fallback. The other external service — readability_server — hardcodes its URL in zeeguu/core/content_retriever/parse_with_readability_server.py and reads no env var at all. Neither of them uses current_app.config, so the branch you added here is a third pattern that nothing else in the codebase follows.

Hardcoding (readability's approach) won't work for ASR because the map has to grow per language (da → asr_da, de → asr_de, …), so env-var-as-map is the right shape. Just drop the Flask-config branch so env is the single source of truth, which mirrors stanza.

Suggested replacement for this function:

Suggested change (before):

def configured_asr_service_urls():
    """Return configured worker URLs, falling back to the local Danish worker."""
    raw_value = os.environ.get("ASR_SERVICE_URLS", "")
    if not raw_value and has_app_context():
        raw_value = current_app.config.get("ASR_SERVICE_URLS", "")
    if not raw_value:
        raw_value = LOCAL_DEV_ASR_SERVICE_URLS
    return parse_asr_service_urls(raw_value)

Suggested change (after):

def configured_asr_service_urls():
    """Return configured worker URLs, falling back to the local Danish worker."""
    raw_value = os.environ.get("ASR_SERVICE_URLS", "") or LOCAL_DEV_ASR_SERVICE_URLS
    return parse_asr_service_urls(raw_value)

Once applied, the from flask import has_app_context, current_app line at the top of the file becomes unused — remove it in the same commit.

jakso535 and others added 20 commits April 23, 2026 08:42
Functionality has been split into four modules, each with a distinct responsibility. The endpoint module now functions as a slimmer routing layer with more general logic.
Comment thread: docker-compose.yml (Outdated)
ZEEGUU_DATA_FOLDER: /zeeguu-data/
ZEEGUU_RESOURCES_FOLDER: /zeeguu-data/
STANZA_SERVICE_URL: http://stanza:5001
ASR_SERVICE_URLS: "${ASR_SERVICE_URLS:-da=http://asr_da:5002}"
mircealungu (Member):

The current shape da=http://asr_da:5002 is the worst of both worlds — language is in the host name (asr_da) AND in the env-var key (da=) AND there's an explicit non-default port. Pick a direction:

  • Option 1: one ASR container, multiple languages. Container is named asr. Language is just a request parameter. URL becomes da=http://asr (worker listens on 80 inside the container). Adding German later is just de=http://asr.
  • Option 2: one container per language. Container asr_da, with asr_de joining later — fine. But drop the explicit port: let each worker listen on 80, so the URL is just da=http://asr_da.

Either way, :5002 is leaking an internal implementation detail into config. Same shape also appears in default.env:31 — fix both. Not a hard blocker, but worth deciding now while there's only one entry.

Comment on lines +43 to +46
def _current_verbal_flashcards_user():
    user = User.find_by_id(flask.g.user_id)
    return user, _ensure_verbal_flashcards_enabled(user)

mircealungu (Member):

_current_verbal_flashcards_user() returns (user, None) when the user passes the gate and (user, response_tuple) when they don't, leading every endpoint to do if feature_gate: return feature_gate. The function name promises "current user" but it's actually computing a 404 response on the side. Cleaner to split:

def _current_verbal_flashcards_user():
    return User.find_by_id(flask.g.user_id)

…and call _ensure_verbal_flashcards_enabled(user) explicitly at the top of each endpoint. Less clever, easier to read.

Comment on lines +122 to +134
try:
    _ensure_request_audio_size_is_allowed()

    if "file" not in request.files:
        return json_result({"error": "No audio file provided"}), 400

    audio_file = request.files["file"]
    if audio_file.filename == "":
        return json_result({"error": "Empty filename"}), 400

    user, feature_gate = _current_verbal_flashcards_user()
    if feature_gate:
        return feature_gate
mircealungu (Member):

The feature gate runs after audio-size enforcement, file-presence check, and filename check. Cheap to flip — gate first, then validate. Two reasons:

  1. Saves reading a possibly-large audio body for a non-allowlisted user.
  2. Leaks "this feature exists" less to outsiders.

Same pattern in submit_answer (gate after JSON parse, line 247) and check_pronunciation (gate after JSON parse, line 306). Please move all three gates to the top of the endpoint.

Comment on lines +207 to +210
except Exception as e:
    log(f"Get flashcards error: {e}")
    traceback.print_exc()
    return json_result({"error": str(e)}), 500
mircealungu (Member):

Echoing str(e) to the client can leak ORM internals, file paths, and other implementation details. The transcribe endpoint already does this right (returns a static "Transcription endpoint error" on line 162). Please align the other three the same way: log with detail, respond with a generic message.

Same issue in:

  • submit_answer (lines 282-284)
  • check_pronunciation (lines 329-331)

Comment on lines +237 to +245
flashcard_id = str(data.get("flashcard_id")) if data.get("flashcard_id") is not None else None
user_answer = data.get("user_answer", "")
is_correct = data.get("is_correct")
answer_source = data.get("answer_source", "unknown")
response_time = data.get("response_time_ms", 0)
session_id = data.get("session_id")

if not flashcard_id or is_correct is None:
    return json_result({"error": "flashcard_id and is_correct are required"}), 400
mircealungu (Member):

Two small things on the flashcard_id handling here:

  1. data.get("flashcard_id") is called twice.
  2. The truthiness check not flashcard_id would reject flashcard_id == "0" (zero is falsy) — never an issue today since IDs are positive, but it's a footgun.

Cleaner:

flashcard_id = data.get("flashcard_id")
if flashcard_id is None or is_correct is None:
    return json_result({"error": "flashcard_id and is_correct are required"}), 400
flashcard_id = str(flashcard_id)

mircealungu (Member) commented

Forward-compatibility for a likely Whisper switch

The per-language container architecture works for today's Danish-only experiment, but it's worth noting that scaling beyond Danish probably won't mean adding more Parakeet workers — there's no parakeet-rnnt-110m for the other 14 Zeeguu languages. The realistic next step is a single multilingual model (Whisper-large or Faster-Whisper-large) covering all languages in one container.

The good news: most of this PR's design layers are forward-compatible.

  • ✅ Main API treats ASR as a black box — only knows "language_code in, transcription out".
  • ✅ Text normalization is a per-language registry, not hardcoded into callers.
  • ✅ Fuzzy matching is ASR-agnostic.
  • ✅ Domain layer (flashcard_selection, submission) doesn't know ASR exists.

The contract POST /transcribe { audio, language_code } → { transcription } is exactly right — Whisper, Parakeet, AWS Transcribe all fit it. Don't change the contract.

What does leak Parakeet/per-language assumptions and is worth tightening:

1. ASR_SERVICE_URLS shape (da=URL,de=URL,fr=URL,...) is the strongest leak. The format mandates per-language URL configuration even when there's only one backend. With Whisper you'd have to configure 15 entries pointing to the same URL, which is awkward and signals "we built around the wrong abstraction." Forward-compatible shape:

ASR_SERVICE_URL=http://asr                  # common case — one backend, all languages
ASR_LANGUAGE_OVERRIDES=da=http://asr-da     # only when you really need a per-language override

Default to single URL; per-language overrides are the exception.

2. Worker's ASR_LANGUAGE_CODE + rejection at app.py:73-77 encodes "this worker handles exactly one language." Trivial loosening: ASR_SUPPORTED_LANGUAGES (comma-separated, defaults to one). The rejection logic becomes requested_language_code not in SUPPORTED. No operational difference today, no lock-in tomorrow.

3. README in asr_service/ documenting the contract: "this is one possible ASR backend; the contract is POST /transcribe with audio + language_code, returning {transcription}". Future-you (or whoever swaps in Whisper) reads that and knows what surface to preserve.

None of these are PR-blockers. The current design ships fine for the Danish experiment. They're the kind of thing where doing them now costs ~30 minutes and doing them later (after another language is added with the wrong shape) costs hours of cleanup.


jakso535 commented May 2, 2026

Verbal Flashcards Change Summary

Flashcard Response Shape

  • Reduced each verbal flashcard payload to the fields used by the frontend:
    • id
    • prompt
    • answer
  • Removed duplicated/unused fields from the response shape, including:
    • expectedText
    • bookmark_id
    • user_word_id
    • level
    • from
    • to
    • origin
    • translation
  • Updated the frontend verbal flashcards page to read card.answer instead of
    card.expectedText.
  • Removed the frontend fallback that compared spoken text against the prompt
    when the answer was missing.
  • Changed submission lookup so the backend resolves the submitted flashcard by
    bookmark directly instead of depending on the public flashcard dictionary.
  • Removed an unused flashcard lookup helper after confirming it had no callers.

Endpoint Safety And Error Handling

  • Moved verbal flashcard feature-gate checks to the top of the relevant
    endpoints before parsing bodies or validating uploaded audio.
  • Stopped returning raw exception text to clients in verbal flashcard endpoint
    errors.
  • Replaced public str(exc) responses with generic, stable error messages.
  • Kept detailed error information in server-side logs where useful.
  • Updated submit_answer validation so flashcard_id is read once, checked
    explicitly against None, and then coerced to string.
  • Kept session_id validation explicit while avoiding exception echoing.

Flashcard Selection And Submission Flow

  • Made the level-3 word requirement configurable with
    VERBAL_FLASHCARDS_REQUIRE_LEVEL_3.
  • Restored production default behavior to require level-3 words.
  • Allowed experiments to include lower-level words by setting the env var to
    false.
  • Made the level gate consistent between flashcard selection and submission.
  • Replaced commented-out experiment code with the config flag.
  • Switched schedule creation to the canonical scheduler helper:
    FourLevelsPerWord.find_or_create(db.session, user_word).
  • Removed hand-rolled schedule field initialization.
  • Changed predictable bad flashcard data handling to return None, while
    avoiding blanket exception swallowing in the selection loop.

Text Normalization

  • Updated the per-language normalizer registry.
  • Made unknown normalizer languages fail loudly instead of silently using Danish
    rules.
  • Updated tests to assert that unknown languages raise an unsupported-language
    error.
  • Kept Danish-specific normalization isolated behind the registry.

Fuzzy Matching And Pronunciation Scoring

  • Reworked scoring to use Optimal String Alignment edit distance as the
    acceptance criterion.
  • Renamed/documented the previous Damerau-Levenshtein implementation as OSA /
    restricted edit distance.
  • Kept Jaro-Winkler as a diagnostic signal rather than part of the acceptance
    decision.
  • Simplified the acceptance policy:
    • words of length 1-2 require exact match
    • words of length 3+ allow one OSA edit
  • Added/updated regression tests for accepted and rejected edit-distance cases.
  • Removed the duplicate wordAccuracy response field and kept accuracy.
  • Added a three-tier feedback message policy:
    • all words accepted: success-style feedback
    • no words accepted: "Didn't catch that, try again"
    • partial match: "Very close, try again"

ASR Client Configuration

  • Removed Flask app-config fallback from the ASR service client.

  • Kept ASR configuration as environment-driven, matching the existing service
    convention used elsewhere in the codebase.

  • Replaced deprecated FLASK_ENV local-dev detection with FLASK_DEBUG.

  • Parsed ASR configuration at module import instead of on every transcription
    call.

  • Changed ASR routing config from language-map-first:

    ASR_SERVICE_URLS=da=http://asr

    to the forward-compatible shape:

    ASR_SERVICE_URL=http://asr
    ASR_LANGUAGE_OVERRIDES=
  • Kept per-language overrides available only for exceptions, for example:

    ASR_LANGUAGE_OVERRIDES=da=http://asr-da
  • Added a local debug fallback to http://127.0.0.1:5002 only when
    FLASK_DEBUG=1.

  • Updated start.py to set FLASK_DEBUG=1 for local python start.py runs so
    the local ASR worker fallback works.

ASR Worker Architecture

  • Kept the ASR worker contract generic:
    • request: audio file plus language_code
    • response: { "success": true, "transcription": "..." }
  • Added ASR_SUPPORTED_LANGUAGES, allowing:
    • one language such as da
    • a comma-separated list
    • * for a future multilingual worker
  • Changed the worker health endpoint to return HTTP 503 when the model is not
    loaded.
  • Updated health payload from singular worker_language to
    worker_languages.
  • Kept Docker production routing on the service name asr without leaking
    :5002 into production config.
  • Kept direct local worker execution available on port 5002.
  • Unified Docker worker port handling around ASR_SERVICE_PORT.
  • Kept gunicorn bound to ASR_SERVICE_PORT.
  • Added ASR service documentation describing the worker contract and config
    model.

ASR Worker Dependencies And NeMo Output

  • Pinned ASR worker dependencies to the versions currently verified locally:
    • nemo_toolkit[asr]==2.7.3
    • huggingface_hub==0.36.2
    • pydub==0.25.1
  • Verified the pinned NeMo Danish Parakeet model returns list[Hypothesis].
  • Replaced shotgun transcription output parsing with an explicit parser for the
    pinned output shape.
  • Made unexpected NeMo output shapes fail loudly instead of being guessed at.
  • Added ASR worker tests for the accepted Hypothesis shape and rejected legacy
    shapes.

Docker And Environment

  • Updated docker-compose.yml so the main API uses:

    ASR_SERVICE_URL=http://asr
    ASR_LANGUAGE_OVERRIDES=
  • Updated default.env to the same config shape.

  • Kept the ASR service container named asr.

  • Kept internal container port usage at 80 for production-style Docker
    routing.

  • Added ASR_SUPPORTED_LANGUAGES=da to the current ASR worker environment.

  • Confirmed docker compose config is valid after the config changes.

Tests And Verification

  • Restored previously commented-out tests by updating their assertions instead
    of leaving coverage disabled.
  • Added tests for:
    • lower-level experiment flashcards
    • ASR config parsing and routing
    • local debug ASR fallback
    • ASR worker health status codes
    • ASR worker supported-language checks
    • ASR worker generic error responses
    • NeMo output shape extraction
    • OSA scoring behavior
  • Repeatedly verified the focused suites during the work:
    • asr_service/test_app.py
    • zeeguu/api/test/test_verbal_flashcards.py

Notes

  • The local direct ASR worker port 5002 remains only for development.
  • Production ASR routing should use ASR_SERVICE_URL and optional
    ASR_LANGUAGE_OVERRIDES.


jakso535 commented May 3, 2026

Verbal flashcards now also use Meaning pairs as possible answers. This can in some cases fix translation errors such as:

"Ball" being translated to "bolden" instead of "bold", where the bookmark variant contains "bold".

    texts.append(cleaned_text)


def answer_variants_for_bookmark(bookmark):
mircealungu (Member):

Nice idea for the inflection case — bold / bolden sharing the cue "ball" is exactly where strict matching frustrates learners.

But the current rule (any non-INVALID Meaning with same origin-language, translation-language, and lowercased translation text) can't distinguish inflectional variants from genuine homonyms. Concrete Danish example, for an English speaker:

  • spring → forår (the season)
  • spring → fjeder (a mechanical spring)
  • spring → kilde (a water source)

All three would now be accepted as correct answers for the cue "spring", regardless of which sense the learner actually scheduled. If they specifically picked spring → fjeder to drill engineering vocabulary, getting credit for kilde undermines the study.

To do this reliably, I think we need a meaning-family / inflection-group relationship in the data model — something that marks bold and bolden as forms of the same lemma, while keeping fjeder and kilde as unrelated meanings that just happen to share an English cue. Matching on cue text alone over-accepts on homonyms.

Until that data exists, one safer scoping option: limit variants to the same user_word.meaning.origin.lemma (if available), or only accept variants when the edit distance between origin contents is small (which would catch bold → bolden but reject fjeder → kilde). Worth thinking about before this lands as the default scoring policy.

jakso535 (Author):

Good catch! This should be fixed now. The interim decision is to accept variants only if they are 2 or fewer edits from the original bookmark :)
