horchctl process FILE.wav runs every configured wakeword against an
audio file. The daemon spins up a separate isolated inference state for
the call — your live mic stream isn't disturbed, and detections from
the file aren't broadcast to D-Bus subscribers (they're returned to
horchctl and printed).
Use cases
- Regression-test wakeword models — keep a curated set of "should fire" and "should not fire" recordings in CI; a model update that shifts the score envelope is caught immediately.
- Audit past detections — record a couple of hours of mic input alongside the daemon's journalctl, then replay the file later to confirm whether a flagged detection actually contained the wakeword.
- Debug false positives — capture a moment when the daemon fired unexpectedly, then replay it with --threshold-override (planned) to characterize the score curve.
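The CI idea in the first bullet can be sketched as a small checker over the daemon's JSONL output (the field names match the --json examples in this document; `parse_jsonl` and `fixture_passes` are illustrative helper names, not part of horchctl):

```python
import json

def parse_jsonl(text):
    """Parse horchctl --json output: one JSON object per line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

def fixture_passes(detections, wakeword, should_fire, threshold=0.5):
    """True iff the fixture behaved as expected: a 'should fire' clip
    produced at least one detection above threshold, a 'should not
    fire' clip produced none."""
    fired = any(d["name"] == wakeword and d["score"] > threshold
                for d in detections)
    return fired == should_fire

# Checking a captured --json run for the "alexa" model:
out = '{"timestamp_s": 0.32, "name": "alexa", "score": 0.974}\n' \
      '{"timestamp_s": 1.84, "name": "alexa", "score": 0.812}'
assert fixture_passes(parse_jsonl(out), "alexa", should_fire=True)
```

In CI you would feed `parse_jsonl` the captured stdout of one `horchctl process fixtures/... --json` call per fixture, with `should_fire` recorded next to each fixture file.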
Audio format
The WAV must be 16 kHz, mono, 16-bit signed PCM (int16). If yours isn't, convert:
ffmpeg -i in.flac -ar 16000 -ac 1 -sample_fmt s16 fixed.wav
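If you'd rather verify a file's format than convert blindly, Python's standard wave module can check all three properties (a sketch; `is_horch_compatible` is an illustrative name, not part of horchctl):

```python
import wave

def is_horch_compatible(path):
    """Check that a WAV file is 16 kHz, mono, 16-bit PCM (int16)."""
    with wave.open(path, "rb") as w:
        return (w.getframerate() == 16000
                and w.getnchannels() == 1
                and w.getsampwidth() == 2)  # 2 bytes per sample = int16

# Demonstrate with a generated 100 ms silent clip in the expected format:
with wave.open("demo.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    w.writeframes(b"\x00\x00" * 1600)  # 1600 frames = 100 ms at 16 kHz

assert is_horch_compatible("demo.wav")
```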
Human-readable output (default)
$ horchctl process tests/alexa-utterance.wav
0.320s alexa score=0.974
1.840s alexa score=0.812
JSONL output (for jq, CI)
$ horchctl process tests/alexa-utterance.wav --json | jq .
{
  "timestamp_s": 0.32,
  "name": "alexa",
  "score": 0.974
}
{
  "timestamp_s": 1.84,
  "name": "alexa",
  "score": 0.812
}
Assert in CI that a known-good utterance fires:
horchctl process fixtures/alexa.wav --json \
| jq -e 'select(.name == "alexa" and .score > 0.5)' >/dev/null \
|| { echo "alexa regression"; exit 1; }
Isolation guarantee
The file pipeline owns its own Preprocessor + Classifier set
(loaded fresh from the same config paths as the live mic pipeline) and
its own detector state. Two consequences:
- The live mic pipeline keeps streaming at full FPS during the call — confirmed by watching horchctl status while a long file processes.
- Each horchctl process invocation pays a ~200 ms one-time setup cost for loading the ONNX sessions. For sub-second files that can dominate end-to-end time; for multi-second files it's lost in the inference cost.
Cooldown is virtual-time correct
Each input frame advances the detector's clock by exactly 80 ms of nominal audio time, regardless of how fast the file processes. Two utterances 1.5 s apart in the recording register 1.5 s apart in detector time even if the whole file finishes in 100 ms of CPU — so both detections fire (assuming both clear the threshold).
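A toy model of that virtual clock, assuming the 80 ms per-frame advance stated above plus an illustrative 1 s cooldown and 0.5 threshold (the class and field names here are hypothetical, not the daemon's actual implementation):

```python
FRAME_S = 0.080  # each input frame advances the clock by 80 ms of audio

class Detector:
    """Toy detector whose cooldown is measured in virtual audio time,
    never in wall-clock time."""
    def __init__(self, threshold=0.5, cooldown_s=1.0):
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self.clock_s = 0.0        # virtual audio time consumed so far
        self.last_fire_s = None

    def feed(self, score):
        """Process one frame's score; return True if a detection fires."""
        self.clock_s += FRAME_S
        in_cooldown = (self.last_fire_s is not None
                       and self.clock_s - self.last_fire_s < self.cooldown_s)
        if score > self.threshold and not in_cooldown:
            self.last_fire_s = self.clock_s
            return True
        return False

# Two above-threshold frames ~1.5 s apart in audio time: both fire,
# even if this loop runs in microseconds of CPU time.
d = Detector()
fires = []
for i in range(25):                       # 25 frames = 2.0 s of audio
    score = 0.9 if i in (0, 19) else 0.1  # peaks at 0.08 s and 1.60 s
    if d.feed(score):
        fires.append(round(d.clock_s, 2))
```

With a cooldown longer than the 1.52 s gap (say 2 s), the second peak would be suppressed — which is exactly the behavior the virtual clock preserves across fast offline replay.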