- Modular pipeline stages implemented
- Multiple full session experiments conducted
- Merge logic under active redesign (prioritizing precision to eliminate hallucinations)
- No unified end-to-end runner yet
Build a pipeline that ingests raw parliamentary sessions (video/audio) and transforms them into structured, searchable data:
- transcription
- speaker diarization
- segment merging
- speaker attribution
- storage and retrieval
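The stages above can be sketched as a chain of functions that each enrich a shared state and pass it on. This is a minimal illustration of the orchestration idea, not the actual module API; the stage and field names are hypothetical.

```python
from typing import Callable

Stage = Callable[[dict], dict]

def transcribe(state: dict) -> dict:
    # Placeholder: a real stage would call the ASR model on state["audio"].
    state["segments"] = []
    return state

def run_pipeline(state: dict, stages: list[Stage]) -> dict:
    # Each stage reads the shared state, adds its outputs, and hands it on.
    for stage in stages:
        state = stage(state)
    return state
```

A unified end-to-end runner would essentially be this loop with the real transcription, diarization, merging, attribution, and storage stages plugged in.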
The goal is to produce a reliable approximation of the truth.
Parliamentary audio is inherently noisy:
- multiple speakers overlap
- interruptions are frequent
- speakers cut each other mid-sentence
- the session president often interjects
- background noise and applause introduce ambiguity
This results in conflicting and incomplete signals that must be reconciled carefully.
The pipeline prioritizes precision over recall, using multiple validation layers to reduce uncertainty.
The pipeline was redesigned after earlier iterations produced unreliable results during:
- segment merging
- speaker identification
- NER extraction
The current design follows a multi-stage, validation-driven approach:
Initial pass to extract candidate speech regions.
Future work includes additional signal-based filtering (e.g. RMS, spectral features via librosa) to better identify “clean” audio segments.
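As a sketch of that signal-based filtering: RMS energy is the simplest of the planned features (`librosa.feature.rms` computes the same quantity frame-wise over a signal). The stdlib version below shows the idea; the threshold is illustrative, not tuned.

```python
import math

def rms(samples: list[float]) -> float:
    # Root-mean-square energy of one audio frame.
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def is_quiet(samples: list[float], threshold: float = 0.01) -> bool:
    # Frames below the threshold are candidates for silence / low-noise regions.
    return rms(samples) < threshold
```

In practice the frame-wise RMS curve would be combined with spectral features before deciding which regions count as "clean".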
- segment speakers
- compute per-segment embeddings
- compute initial speaker centroids
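The initial speaker centroid is just the mean of the embedding vectors assigned to one speaker; it later serves as the reference point for re-attribution. A minimal sketch:

```python
def centroid(embeddings: list[list[float]]) -> list[float]:
    # Mean of all embedding vectors assigned to one speaker.
    dim = len(embeddings[0])
    return [sum(vec[i] for vec in embeddings) / len(embeddings)
            for i in range(dim)]
```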
- VAD-enabled transcription (1000 ms)
- extract:
- segments
- tokens
- confidence proxies (avg_log_prob, no_speech_prob, compression_ratio)
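These proxies can gate segments directly. The cutoffs below mirror Whisper's own defaults (avg log-prob below -1.0, no-speech probability above 0.6, or compression ratio above 2.4 flag suspect output), though the dict field names here are illustrative rather than the pipeline's actual schema.

```python
def is_reliable(segment: dict,
                min_avg_log_prob: float = -1.0,
                max_no_speech_prob: float = 0.6,
                max_compression_ratio: float = 2.4) -> bool:
    # Reject segments whose proxies suggest hallucinated or non-speech output.
    return (segment["avg_log_prob"] > min_avg_log_prob
            and segment["no_speech_prob"] < max_no_speech_prob
            and segment["compression_ratio"] < max_compression_ratio)
```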
Identify “clean” segments based on:
- no speaker overlap
- stable confidence metrics
- low noise indicators
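The "no speaker overlap" criterion reduces to counting diarization turns that intersect a segment. A sketch, assuming turns arrive as `(speaker, start, end)` tuples (the real pyannote output format differs):

```python
def speakers_active(turns: list[tuple[str, float, float]],
                    start: float, end: float) -> set[str]:
    # A turn overlaps the segment iff it starts before the segment ends
    # and ends after the segment starts.
    return {spk for spk, t0, t1 in turns if t0 < end and t1 > start}

def is_overlap_free(turns: list[tuple[str, float, float]],
                    start: float, end: float) -> bool:
    # "Clean" in the overlap sense: at most one active speaker.
    return len(speakers_active(turns, start, end)) <= 1
```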
Align transcription with diarization (Silero + pyannote):
- detect silence gaps
- identify mismatches
- filter likely hallucinated tokens
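Silence-gap detection can be done directly on token timestamps: a hallucinated token often sits inside a span where the VAD reports no speech. A minimal sketch, assuming per-token `(start, end)` times and an illustrative 1-second gap threshold:

```python
def silence_gaps(word_times: list[tuple[float, float]],
                 min_gap: float = 1.0) -> list[tuple[float, float]]:
    # Return the gaps between consecutive tokens that exceed min_gap;
    # tokens emitted inside such gaps are candidates for hallucination.
    gaps = []
    for (_, end_prev), (start_next, _) in zip(word_times, word_times[1:]):
        if start_next - end_prev >= min_gap:
            gaps.append((end_prev, start_next))
    return gaps
```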
- extract named entities (speakers)
- infer speaker attribution from context
- validate against diarization turns
- extract keywords / topics
- recompute speaker centroids using only high-confidence segments
- compute semantic embeddings for clean segments
Store structured segments in SQLite:
- timestamps
- speaker attribution
- text
- embeddings
- keywords
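A minimal sketch of the storage layer, assuming one row per segment; the table and column names are illustrative, and embeddings/keywords are JSON-encoded for simplicity (a BLOB column would also work).

```python
import json
import sqlite3

def init_db(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS segments (
            id        INTEGER PRIMARY KEY,
            t_start   REAL NOT NULL,
            t_end     REAL NOT NULL,
            speaker   TEXT,
            text      TEXT,
            embedding TEXT,  -- JSON-encoded vector
            keywords  TEXT   -- JSON-encoded list
        )
    """)
    return conn

def store_segment(conn, t_start, t_end, speaker, text, embedding, keywords):
    conn.execute(
        "INSERT INTO segments (t_start, t_end, speaker, text, embedding, keywords)"
        " VALUES (?, ?, ?, ?, ?, ?)",
        (t_start, t_end, speaker, text,
         json.dumps(embedding), json.dumps(keywords)),
    )
```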
The main challenge is that truth is not directly observable.
Each component provides an imperfect signal:
- Whisper: probabilistic text (can hallucinate)
- Diarization: approximate speaker boundaries
- VAD: imperfect segmentation
- NLP: contextual but fallible inference
The problem becomes: reconciling multiple weak signals into a consistent, high-confidence representation.
This is less about “running models” and more about:
- signal validation
- conflict resolution
- uncertainty management
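One simple shape this reconciliation can take: treat each component as a vote with a confidence weight and keep the best-supported answer along with its normalized support. This is an illustrative sketch, not the pipeline's actual merge logic; the source names and confidence values are assumptions.

```python
def agreement(signals: dict[str, tuple[str, float]]) -> tuple[str, float]:
    # signals: source -> (candidate answer, confidence in [0, 1]).
    # Accumulate confidence per candidate and return the winner plus
    # its share of total confidence (a crude uncertainty measure).
    scores: dict[str, float] = {}
    for candidate, conf in signals.values():
        scores[candidate] = scores.get(candidate, 0.0) + conf
    best = max(scores, key=scores.get)
    return best, scores[best] / sum(scores.values())
```

A low winning share signals genuine conflict between components, which is exactly where precision-first merging should refuse to commit rather than guess.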