How do I make STT work for my AI VTuber on Discord VC calls?

I’ve been struggling to make Discord STT work, but I’ve figured everything else out. (Is there a free STT solution that can be programmed into the VC call on Discord and won’t have issues on call?)

logs:
:memo: Transcribing…
:magnifying_glass_tilted_left: Audio analysis - size: 18000, max_amplitude: 1.236272
:microphone: Incoming audio | amp=1.236272 | samples=18000
:musical_note: Processing audio: 18000 samples, 1.236272 max amplitude
:magnifying_glass_tilted_left: Transcribing with simple Whisper STT…
Transcribing audio: 18000 samples, max amplitude: 1.2363
Transcription too short
:test_tube: RAW STT OUTPUT: ‘’
:memo: Empty transcription
:memo: Transcribed text: ‘’
:bust_in_silhouette: You:
:outbox_tray: Sending to Ollama: ‘…’

Nothing shows up, and when I do get it to work, it’s off as heck from what I actually said.

Hmm… Even among files with the .wav extension, there can be significant variation in the internal format and quality. This is because the .wav format is relatively flexible and can accommodate various forms depending on the RIFF header.

Therefore, transcription will likely be more successful if you first use a standard tool to clean up or normalize the .wav file (in other words, perform preprocessing) before passing it to STT…


How to make STT work for an AI VTuber in Discord VC

Short answer

There is no free STT that can be dropped into Discord VC and “just won’t have issues.”

For a reliable free/local setup, the solution is not simply “use Whisper.” The solution is a proper live voice pipeline:

Discord VC or local mic audio
→ decode to PCM
→ convert to mono 16 kHz float32
→ normalize/clip safely
→ VAD / speech detection
→ buffer a complete utterance
→ transcribe with faster-whisper
→ reject blank / garbage / too-short transcripts
→ send only valid text to Ollama

The most important point:

Do not transcribe raw Discord audio chunks directly.

Discord gives you a live audio stream. Whisper-style STT wants a clean speech segment. Your job is to turn many tiny audio frames into one clean utterance before STT sees it.


Your log

📝 Transcribing...
🔍 Audio analysis - size: 18000, max_amplitude: 1.236272
🎤 Incoming audio | amp=1.236272 | samples=18000
🎵 Processing audio: 18000 samples, 1.236272 max amplitude
🔍 Transcribing with simple Whisper STT...
Transcribing audio: 18000 samples, max amplitude: 1.2363
Transcription too short
🧪 RAW STT OUTPUT: ''
📝 Empty transcription
📝 Transcribed text: ''
👤 You:
📤 Sending to Ollama: '...'

This log strongly suggests four problems:

  1. The audio chunk is probably too short.
  2. The audio amplitude/normalization looks suspicious.
  3. You are probably transcribing chunks instead of completed speech turns.
  4. Empty STT output is still being sent downstream to Ollama.

That combination can explain both symptoms:

nothing shows up

and:

when it works, the transcription is way off

1. The biggest clue: 18000 samples

18000 samples is probably not enough audio.

The actual duration depends on sample rate and channel layout:

| Interpretation | Approximate duration |
| --- | --- |
| 18,000 mono samples at 48 kHz | 0.375 seconds |
| 18,000 stereo-interleaved samples at 48 kHz | 0.1875 seconds per channel |
| 18,000 mono samples at 16 kHz | 1.125 seconds |

Even the best case is short. The likely Discord case may be less than half a second.

That explains this line:

Transcription too short

Whisper-style models can sometimes decode short clips, but they are not reliable when fed tiny fragments like:

"he"
"hel"
"lo ca"
"n you"

The Whisper-Streaming paper exists because Whisper is not natively designed for real-time transcription. It needs a streaming/buffering policy around it.

Fix

Do not do this:

Discord chunk → Whisper → text

Do this:

Discord chunks
→ VAD
→ utterance buffer
→ complete speech segment
→ Whisper/faster-whisper
→ text

Good starter thresholds:

| Setting | Recommended starting value |
| --- | --- |
| Minimum speech before STT | 0.8–1.2 seconds |
| End-of-turn silence | 700–1200 ms |
| Pre-roll before speech start | 200–400 ms |
| Post-roll after speech end | 150–300 ms |
| Max utterance length | 8–15 seconds |
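
If you want these in one place in code, here is a minimal config sketch; the field names are illustrative, not from any library:

from dataclasses import dataclass

@dataclass
class EndpointConfig:
    # Starter values from the table above; tune them for your setup.
    min_speech_s: float = 1.0       # minimum speech before calling STT
    end_silence_ms: float = 900.0   # silence that ends a speech turn
    pre_roll_ms: float = 300.0      # audio kept from before speech starts
    post_roll_ms: float = 200.0     # audio kept after speech ends
    max_utterance_s: float = 15.0   # hard cap on one utterance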

2. max_amplitude=1.236272 is suspicious

For normalized float audio, you usually want samples roughly inside:

-1.0 to +1.0

Your log says:

max_amplitude: 1.236272

That does not automatically prove the audio is broken, but it is suspicious enough to inspect before changing STT models.

Possible causes:

| Cause | Result |
| --- | --- |
| int16 PCM interpreted as float32 | static/noise/garbage waveform |
| stereo interleaved audio treated as mono | distorted waveform |
| audio normalized twice | clipping / harsh audio |
| gain too high | values exceed normal float range |
| wrong resampling path | pitch/speed/time distortion |
| wrong dtype/endianness | nonsense transcription |

If the waveform going into STT is malformed, Whisper will produce blanks or nonsense no matter which model you use.
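
If the only problem is mild over-amplification, a conservative pre-STT step is to peak-normalize and then clip. This is a sketch; it fixes excess gain only, not dtype or interleaving mistakes:

import numpy as np

def safe_normalize(audio, target_peak=0.95):
    # Scale the signal down if its peak exceeds target_peak, then clip.
    audio = np.asarray(audio, dtype=np.float32)
    peak = float(np.max(np.abs(audio))) if audio.size else 0.0
    if peak > target_peak:
        audio = audio * (target_peak / peak)
    return np.clip(audio, -1.0, 1.0)

If the peak is wildly above 1.0 (thousands, not 1.2), the data is almost certainly int16 read as float, and rescaling cannot fix that.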


3. Discord audio is not Whisper-ready audio

Discord voice and Whisper have different assumptions.

Discord’s voice docs describe voice audio as Opus-encoded, stereo, 48 kHz.

Whisper/faster-whisper generally wants prepared speech audio:

mono
16 kHz
float32
roughly [-1.0, +1.0]
speech-only segment


Correct conversion path:

Discord Opus / decoded PCM
→ 48 kHz stereo PCM
→ downmix to mono
→ resample to 16 kHz
→ convert to float32
→ normalize/clip safely
→ VAD
→ utterance buffer
→ STT

Bad conversion can make speech sound sped up, slowed down, crunchy, clipped, or like static. Whisper then guesses.


4. Discord receive is separately fragile

Discord voice sending is relatively common.

Discord voice receiving is harder.

If your AI VTuber only needs to hear you, the most stable first version is:

local mic
→ VAD
→ faster-whisper
→ Ollama
→ TTS
→ Discord voice output

The bot still talks in Discord, but it listens to your local microphone instead of trying to receive Discord VC audio.

If the AI must hear everyone in VC, then you need the harder path:

Discord VC receive
→ per-user audio stream
→ decode
→ mono 16 kHz conversion
→ per-user VAD
→ per-user utterance buffer
→ per-user STT
→ speaker-labeled transcript
→ Ollama
→ TTS
→ Discord voice send


Also note: Pycord’s voice docs currently warn that recording/listening may not work as expected because of DAVE end-to-end encryption for voice calls.

So if the bot joins VC but hears nothing, do not immediately blame Whisper. First prove that your receive library is producing valid decoded audio.


5. Empty transcript should never reach Ollama

This part of your log is a controller bug:

RAW STT OUTPUT: ''
Empty transcription
Transcribed text: ''
Sending to Ollama: '...'

An empty STT result should mean:

return to listening

Not:

send blank prompt to Ollama

Add a hard gate:

def should_send_to_ollama(text: str) -> bool:
    text = text.strip()

    if not text:
        return False

    if len(text) < 2:
        return False

    common_bad_outputs = {
        ".",
        "...",
        "you",
        "thank you",
        "thanks for watching",
        "subscribe",
    }

    if text.lower() in common_bad_outputs:
        return False

    return True

Then:

text = transcribe_utterance(audio_16k)

if not should_send_to_ollama(text):
    print("No valid transcript. Returning to listening.")
    return

send_to_ollama(text)

This one change prevents the AI from reacting to silence, failed STT, or garbage.


6. Recommended free STT stack

Use this first:

Silero VAD
+ faster-whisper
+ proper audio conversion
+ utterance buffering
+ transcript filtering

Why faster-whisper?

faster-whisper is a practical local Whisper implementation built on CTranslate2. It is commonly used because it is faster and more memory-efficient than the original Whisper implementation, and it supports CPU/GPU quantized inference.

Why Silero VAD?

Silero VAD is lightweight and fast. It is designed for voice activity detection and can process small chunks efficiently. Use it to decide when speech starts and ends.
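
A minimal chunk-level sketch, assuming the silero-vad pip package; Silero expects 512-sample chunks at 16 kHz, and reset_states() should be called when switching streams:

# deps:
# pip install silero-vad torch numpy

import numpy as np
import torch
from silero_vad import load_silero_vad

vad_model = load_silero_vad()

def vad_is_speech(frame_16k, threshold=0.5) -> bool:
    # frame_16k: 512 samples of mono float32 audio at 16 kHz.
    tensor = torch.from_numpy(np.asarray(frame_16k, dtype=np.float32))
    speech_prob = vad_model(tensor, 16000).item()
    return speech_prob >= threshold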

Why not “simple Whisper STT”?

Simple Whisper STT usually assumes:

clean audio file in
→ transcript out

Discord VC is different:

tiny frames
+ live speech
+ silence/noise
+ possible stereo 48 kHz audio
+ possible Discord receive problems
+ possible bot audio feedback

So you need a live voice pipeline, not a file-transcription pipeline.


7. The architecture I would use

Best first version: AI hears only you

Use your local mic for STT:

Local mic
→ audio conversion
→ Silero VAD
→ utterance buffer
→ faster-whisper
→ transcript filter
→ Ollama
→ TTS
→ Discord VC output

This avoids:

  • Discord receive instability
  • DAVE/E2EE receive issues
  • per-user audio mapping
  • mixed-speaker audio
  • Discord packet/decode bugs
  • bot hearing its own Discord output

This is the most reliable first working version for an AI VTuber.

Harder version: AI hears everyone in Discord VC

Use per-user receive buffers:

Discord VC receive
→ per-user PCM
→ per-user mono 16 kHz conversion
→ per-user VAD
→ per-user utterance buffer
→ per-user STT
→ speaker-labeled text
→ Ollama
→ TTS
→ Discord voice output

Speaker-labeled transcript example:

Alice: Can you hear me?
Bob: Yeah, I can hear you.
You: Ask the AI what it thinks.

Do not mix all users into one stream unless you are okay with bad diarization and confused transcripts.
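
To keep streams separate, hold one buffer per user. A sketch, assuming the UtteranceBuffer and transcribe_utterance pieces shown elsewhere in this answer:

from collections import defaultdict

# One VAD/utterance buffer per Discord user ID.
user_buffers = defaultdict(lambda: UtteranceBuffer(sample_rate=16000))

def on_user_frame(user_id, frame_16k, is_speech):
    utterance = user_buffers[user_id].push(frame_16k, is_speech=is_speech)
    if utterance is not None:
        text = transcribe_utterance(utterance)
        if text:
            print(f"{user_id}: {text}")  # speaker-labeled transcript line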


8. Correct audio validation before STT

Before calling STT, log these values:

import numpy as np

def audio_stats(audio, sample_rate: int) -> dict:
    audio = np.asarray(audio, dtype=np.float32)

    if len(audio) == 0:
        return {
            "samples": 0,
            "sample_rate": sample_rate,
            "duration": 0.0,
            "min": 0.0,
            "max": 0.0,
            "peak": 0.0,
            "rms": 0.0,
        }

    return {
        "samples": int(len(audio)),
        "sample_rate": sample_rate,
        "duration": float(len(audio) / sample_rate),
        "min": float(audio.min()),
        "max": float(audio.max()),
        "peak": float(np.max(np.abs(audio))),
        "rms": float(np.sqrt(np.mean(audio ** 2))),
    }

Target before STT:

sample_rate: 16000
channels: mono
dtype: float32
duration: usually 0.8s or longer
peak: usually <= 1.0
rms: not near zero

Reject bad audio before STT:

import numpy as np

def valid_audio_for_stt(audio_16k, sr=16000):
    audio_16k = np.asarray(audio_16k, dtype=np.float32)

    duration = len(audio_16k) / sr
    peak = float(np.max(np.abs(audio_16k))) if len(audio_16k) else 0.0
    rms = float(np.sqrt(np.mean(audio_16k ** 2))) if len(audio_16k) else 0.0

    if duration < 0.8:
        return False, f"too short: {duration:.2f}s"

    if peak < 0.015:
        return False, f"too quiet: peak={peak:.4f}"

    if rms < 0.003:
        return False, f"too quiet: rms={rms:.4f}"

    if peak > 1.05:
        return False, f"bad normalization: peak={peak:.4f}"

    return True, "ok"

Given your log:

max_amplitude=1.236272

this would probably reject the chunk and force you to inspect your conversion/normalization.

That is good. You want to catch bad audio before Whisper sees it.


9. Save the exact STT input as WAV

This is the most important debug step.

# deps:
# pip install soundfile numpy

import numpy as np
import soundfile as sf
from pathlib import Path

debug_dir = Path("debug_stt")
debug_dir.mkdir(exist_ok=True)

def save_debug_wav(audio, sample_rate: int, filename: str):
    audio = np.asarray(audio, dtype=np.float32)
    audio = np.clip(audio, -1.0, 1.0)
    sf.write(debug_dir / filename, audio, sample_rate)

When STT returns blank:

save_debug_wav(audio_16k, 16000, "empty_transcript.wav")

Then listen to the file.

| What the WAV sounds like | Likely cause |
| --- | --- |
| Silence | wrong source / VAD issue / receive issue |
| Static | dtype/decode issue |
| Fast voice | sample-rate mismatch |
| Slow voice | sample-rate mismatch |
| Distorted/clipped | gain/normalization issue |
| Half-word only | chunking/buffering issue |
| Bot’s own voice | audio feedback loop |
| Multiple speakers mixed | need per-user buffers |
| Clean full sentence | STT settings/model issue |

Do this before switching models.


10. Safer PCM conversion example

If you have decoded 48 kHz stereo signed 16-bit PCM bytes, convert like this:

# deps:
# pip install numpy scipy

import numpy as np
from scipy.signal import resample_poly

def pcm_s16le_48k_stereo_to_16k_mono_float32(pcm_bytes: bytes) -> np.ndarray:
    audio_i16 = np.frombuffer(pcm_bytes, dtype=np.int16)

    if audio_i16.size == 0:
        return np.zeros(0, dtype=np.float32)

    # Drop incomplete stereo frame if needed.
    usable = (audio_i16.size // 2) * 2
    audio_i16 = audio_i16[:usable]

    # frames x channels
    stereo = audio_i16.reshape(-1, 2)

    # stereo → mono
    mono = stereo.astype(np.float32).mean(axis=1)

    # int16 → float32
    mono = mono / 32768.0

    # safety clamp
    mono = np.clip(mono, -1.0, 1.0)

    # 48 kHz → 16 kHz
    mono_16k = resample_poly(mono, up=1, down=3)

    return mono_16k.astype(np.float32)

Common bad conversions:

# Bad if pcm_bytes are int16 PCM.
audio = np.frombuffer(pcm_bytes, dtype=np.float32)

# Bad if stereo is interleaved and never reshaped/downmixed.
audio = np.frombuffer(pcm_bytes, dtype=np.int16).astype(np.float32) / 32768.0

# Bad if the data is still 48 kHz but you tell Whisper it is 16 kHz.
processor(audio, sampling_rate=16000)

11. faster-whisper starter config

# deps:
# pip install faster-whisper numpy

from faster_whisper import WhisperModel
import numpy as np

model = WhisperModel(
    "small.en",
    device="cpu",        # use "cuda" if available
    compute_type="int8", # use "float16" on CUDA
)

def transcribe_utterance(audio_16k: np.ndarray) -> str:
    ok, reason = valid_audio_for_stt(audio_16k)
    if not ok:
        print("Skipping STT:", reason)
        return ""

    audio_16k = np.asarray(audio_16k, dtype=np.float32)
    audio_16k = np.clip(audio_16k, -1.0, 1.0)

    segments, info = model.transcribe(
        audio_16k,
        language="en",
        task="transcribe",
        beam_size=1,
        temperature=0.0,
        condition_on_previous_text=False,
        vad_filter=True,
        vad_parameters={
            "min_silence_duration_ms": 700,
            "speech_pad_ms": 300,
        },
        no_speech_threshold=0.6,
        compression_ratio_threshold=1.35,
        log_prob_threshold=-1.0,
    )

    text = " ".join(segment.text.strip() for segment in segments).strip()
    return text

Why these settings:

| Setting | Why it helps |
| --- | --- |
| language="en" | Avoids unstable language detection on short clips |
| task="transcribe" | Prevents accidental translation behavior |
| beam_size=1 | Lower latency |
| temperature=0.0 | More deterministic |
| condition_on_previous_text=False | Reduces carry-over hallucination across short turns |
| vad_filter=True | Extra silence cleanup |
| min_silence_duration_ms=700 | Reasonable conversational endpoint start |
| speech_pad_ms=300 | Avoids cutting word edges |
| no_speech_threshold=0.6 | Helps skip no-speech segments |
| compression_ratio_threshold=1.35 | Helps catch repetitive output |
| log_prob_threshold=-1.0 | Helps catch low-confidence output |



12. Add an utterance buffer

Conceptual version:

import numpy as np

class UtteranceBuffer:
    def __init__(self, sample_rate=16000, end_silence_ms=900):
        self.sample_rate = sample_rate
        self.end_silence_ms = end_silence_ms
        self.frames = []
        self.speaking = False
        self.silence_ms = 0.0

    def _frame_ms(self, audio_frame):
        return 1000.0 * len(audio_frame) / self.sample_rate

    def push(self, audio_frame_16k, is_speech: bool):
        audio_frame_16k = np.asarray(audio_frame_16k, dtype=np.float32)

        if is_speech:
            self.speaking = True
            self.silence_ms = 0.0
            self.frames.append(audio_frame_16k)
            return None

        if self.speaking:
            self.silence_ms += self._frame_ms(audio_frame_16k)

            # Keep some silence tail so the utterance does not sound chopped.
            self.frames.append(audio_frame_16k)

            if self.silence_ms >= self.end_silence_ms:
                utterance = np.concatenate(self.frames) if self.frames else np.zeros(0, dtype=np.float32)
                self.frames = []
                self.speaking = False
                self.silence_ms = 0.0
                return utterance

        return None

This is the missing bridge between live audio and STT.
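
Conceptual usage, assuming the helpers defined earlier (should_send_to_ollama, transcribe_utterance) and a vad_is_speech() like the Silero sketch above:

buffer = UtteranceBuffer(sample_rate=16000, end_silence_ms=900)

def on_frame(frame_16k):
    utterance = buffer.push(frame_16k, is_speech=vad_is_speech(frame_16k))
    if utterance is not None:
        text = transcribe_utterance(utterance)
        if should_send_to_ollama(text):
            send_to_ollama(text)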


13. Do not run STT inside the Discord receive callback

Bad:

def on_audio_frame(user_id, pcm_bytes):
    text = transcribe_with_whisper(pcm_bytes)
    send_to_ollama(text)

Good:

def on_audio_frame(user_id, pcm_bytes):
    audio_queue.put_nowait((user_id, pcm_bytes))

Then a worker handles:

decode/convert
→ VAD
→ buffering
→ STT
→ filtering
→ Ollama

This prevents the Discord audio receive path from being blocked by Whisper inference.
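
A minimal worker sketch with a thread and a queue; process_user_frame is a hypothetical stand-in for the VAD → buffer → STT → filter path, and the PCM converter is the one from section 10:

import queue
import threading

audio_queue = queue.Queue()

def stt_worker():
    while True:
        user_id, pcm_bytes = audio_queue.get()
        audio_16k = pcm_s16le_48k_stereo_to_16k_mono_float32(pcm_bytes)
        process_user_frame(user_id, audio_16k)  # VAD → buffer → STT → filter

threading.Thread(target=stt_worker, daemon=True).start()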


14. Recommended controller state machine

A Discord AI VTuber should not be:

transcribe → send to Ollama → speak

It should be a state machine:

LISTENING
  - waiting for speech

USER_SPEAKING
  - VAD says speech is active
  - buffer audio
  - do not call STT yet

ENDPOINT_CANDIDATE
  - silence started
  - wait 700–1200 ms
  - if speech resumes, go back to USER_SPEAKING

TRANSCRIBING
  - run STT on completed utterance

VALIDATING
  - reject empty / too-short / suspicious transcript

THINKING
  - send valid transcript to Ollama

SPEAKING
  - TTS plays into Discord

INTERRUPTED
  - user speaks while bot speaks
  - cancel TTS / LLM output
  - return to listening or user-speaking
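
A skeleton of the transitions, as a sketch using the state names above:

from enum import Enum, auto

class BotState(Enum):
    LISTENING = auto()
    USER_SPEAKING = auto()
    ENDPOINT_CANDIDATE = auto()
    TRANSCRIBING = auto()
    VALIDATING = auto()
    THINKING = auto()
    SPEAKING = auto()
    INTERRUPTED = auto()

state = BotState.LISTENING

def on_vad_frame(is_speech: bool):
    global state
    if state == BotState.LISTENING and is_speech:
        state = BotState.USER_SPEAKING
    elif state == BotState.USER_SPEAKING and not is_speech:
        state = BotState.ENDPOINT_CANDIDATE  # start the 700-1200 ms timer
    elif state == BotState.ENDPOINT_CANDIDATE and is_speech:
        state = BotState.USER_SPEAKING       # speech resumed; cancel endpoint
    elif state == BotState.SPEAKING and is_speech:
        state = BotState.INTERRUPTED         # barge-in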



15. Prevent the bot from hearing itself

This is a common AI VTuber / Discord bot failure.

Bad routing:

Discord output / desktop mix
→ STT input
→ bot hears itself
→ bot replies to itself
→ loop

Better routing:

Human mic or per-user Discord receive
→ STT

Bot TTS
→ Discord output only

Practical safeguards:

  • Do not use desktop audio mix as STT input.
  • Use headphones while testing.
  • Keep STT input and bot TTS output on separate devices/routes.
  • If using Discord receive, ignore the bot’s own user ID.
  • Pause STT while bot is speaking unless implementing proper barge-in.
  • If allowing barge-in, only human speech should interrupt the bot.

16. What to build first

Phase 1 — local mic STT only

local mic
→ VAD
→ utterance buffer
→ faster-whisper
→ print transcript

Pass criteria:

silence produces no transcript
one sentence produces one transcript
partial speech is not sent
empty text is ignored
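
A minimal Phase 1 capture sketch using the sounddevice library, feeding 512-sample frames (to match Silero VAD) into a handler like the on_frame sketch from section 12. In practice, hand completed utterances to a worker thread (section 13) so STT never blocks the audio callback:

# deps:
# pip install sounddevice numpy

import sounddevice as sd

FRAME_SAMPLES = 512  # 32 ms at 16 kHz, the Silero VAD chunk size

def mic_callback(indata, frames, time_info, status):
    if status:
        print("[MIC]", status)
    on_frame(indata[:, 0].copy())  # mono float32 frame into VAD/buffer path

with sd.InputStream(
    samplerate=16000,
    channels=1,
    dtype="float32",
    blocksize=FRAME_SAMPLES,
    callback=mic_callback,
):
    print("Listening... Ctrl+C to stop.")
    sd.sleep(24 * 60 * 60 * 1000)  # keep the stream open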

Phase 2 — add Ollama

local mic
→ STT
→ Ollama
→ print reply

Pass criteria:

Ollama is called only for real speech
blank transcripts are ignored

Phase 3 — add TTS locally

local mic
→ STT
→ Ollama
→ TTS
→ local playback

Pass criteria:

the bot does not hear itself
the bot does not respond to its own voice

Phase 4 — send TTS to Discord

local mic
→ STT
→ Ollama
→ TTS
→ Discord VC output

This gives you a working AI VTuber that speaks in VC without needing Discord receive yet.

Phase 5 — add Discord receive only after the local loop works

First test:

Discord receive
→ save clean WAV per speaker

Only after those WAV files sound correct:

Discord receive
→ per-user VAD
→ per-user STT
→ speaker-labeled transcript

17. How to interpret your exact log

size: 18000

Likely means you are feeding a chunk, not a complete utterance.

Fix:

buffer until VAD says the user finished speaking

max_amplitude: 1.236272

Likely means conversion or normalization needs inspection.

Fix:

save WAV
print dtype/min/max/rms/sample_rate/channels
verify float32 mono 16 kHz

Transcription too short

Probably correct behavior from the STT layer.

Fix:

do not call STT for audio under ~0.8 seconds

RAW STT OUTPUT: ''

Normal possible STT result.

Fix:

treat blank STT as “no event”

Sending to Ollama: '...'

Controller bug.

Fix:

never send empty transcript to Ollama

18. Most likely root cause summary

Your issue is probably not a bad STT model.

It is probably:

tiny chunks
+ wrong or suspicious audio normalization
+ missing VAD/utterance buffering
+ maybe Discord receive instability
+ empty transcript sent to Ollama

So the fix is not:

try random STT models until one works

The fix is:

make the audio valid
make the speech segment complete
make the transcript pass validation
only then call Ollama

19. Practical final recommendation

Use this first:

local mic
→ Silero VAD
→ utterance buffer
→ faster-whisper small.en
→ transcript filter
→ Ollama
→ TTS
→ Discord VC output

Then later, if you really need VC-wide listening:

Discord receive
→ per-user clean WAV verification
→ per-user VAD/STT
→ speaker-labeled transcript
→ Ollama

Do not start with full Discord receive unless you need it. It adds several failure points.



Final summary

  • There is no magic free STT that avoids Discord VC problems by itself.
  • The best free/local stack is Silero VAD + faster-whisper + utterance buffering.
  • Your 18000 samples chunk is probably too short.
  • Your max_amplitude=1.236272 suggests a possible audio conversion/normalization issue.
  • Discord audio must be converted from the Discord voice format into clean mono 16 kHz float32 before STT.
  • Do not transcribe tiny chunks.
  • Do not send empty STT output to Ollama.
  • Use local mic STT first if the bot only needs to hear you.
  • Use Discord receive only after you prove you can save clean per-user WAV files.
  • Treat this as a voice pipeline problem, not just a model-selection problem.

Omg my hero thank u

# Limit audio length
max_samples = 16000 * 15  # 15 seconds max

if len(audio_np) > max_samples:
    audio_np = audio_np[-max_samples:]
    print(f"[TRIM] Trimmed audio to last {max_samples} samples")

Could I turn max_samples up to fix this?

Also, I am using Chatterbox, so I am unsure if that is affecting anything.

:memo: Transcribed text: ‘’
:bust_in_silhouette: You:
:outbox_tray: Sending to Ollama: ‘…’
:inbox_tray: Ollama response status: 200
:page_facing_up: Ollama response data: {‘model’: ‘drivedenpadev/deepseek-v3.2’, ‘created_at’: ‘2026-05-02T00:29:13.5354636Z’, ‘response’: “What’s good, chat? Ready to get this conversation started!”, ‘done’: True, ‘done_reason’: ‘stop’, ‘context’: [128006, 9125, 128007, 271, 38766, 1303, 33025, 2696, 25, 6790, 220, 2366, 18, 271, 2675, 527, 264, 11919, 18328, 13, 128009, 128006, 882, 128007, 1432, 2675, 527, 264, 15526, 34051, 30970, 13, 13969, 31737, 1234, 220, 975, 4339, 13, 2360, 100166, 13, 3298, 3823, 1432, 1502, 25, 720, 15836, 25, 128009, 128006, 78191, 128007, 271, 3923, 596, 1695, 11, 6369, 30, 32082, 311, 636, 420, 10652, 3940, 0], ‘total_duration’: 763554400, ‘load_duration’: 115772300, ‘prompt_eval_count’: 56, ‘prompt_eval_duration’: 45008000, ‘eval_count’: 14, ‘eval_duration’: 592494200}
:speech_balloon: Extracted response: ‘What’s good, chat? Ready to get this conversation started!’
:robot: AI: What’s good, chat? Ready to get this conversation started!
:input_latin_letters: Sanitized text: ‘What’s good, chat? Ready to get this conversation started!’
:musical_note: Generating audio…
2026-05-01 17:29:14,307 - WARNING - CFG, min_p and exaggeration are not supported by Turbo version and will be ignored.
9%|██████▉ | 86/1000 [00:04<00:46, 19.59it/s]
S3 Token → Mel Inference…
100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00, 1.69it/s]
:musical_note: TTS result: sr=24000, audio_shape=(86400,)
:white_check_mark: TTS generated successfully: C:\Users\…\AppData\Local\Temp\tmpqxnxny4k.wav
[CONNECT] (‘127.0.0.1’, 60151)
:memo: Transcribing…
[SEARCH] Audio analysis - size: 5760, max_amplitude: 1.304917
[MIC] Incoming audio | amp=1.304917 | samples=5760
[PROCESS] Processing audio: 5760 samples, 1.304917 max amplitude
[STT] Transcribing with improved local STT…
:microphone: Incoming audio | amp=1.304917 | samples=5760
:musical_note: Processing audio: 5760 samples, 1.304917 max amplitude
:magnifying_glass_tilted_left: Transcribing with simple Whisper STT…
:cross_mark: Transcription error: name ‘transcribe_audio’ is not defined
Traceback (most recent call last):
File “C:\Users\…\Downloads\AliTurbo\vtuber_core_fixed.py”, line 130, in safe_transcribe
NameError: name ‘transcribe_audio’ is not defined
:memo: Transcribed text: ‘’
:bust_in_silhouette: You:
:outbox_tray: Sending to Ollama: ‘…’
:inbox_tray: Ollama response status: 200
:page_facing_up: Ollama response data: {‘model’: ‘drivedenpadev/deepseek-v3.2’, ‘created_at’: ‘2026-05-02T00:29:28.6229976Z’, ‘response’: “What’s up, newbie? Ready to get this chat started?”, ‘done’: True, ‘done_reason’: ‘stop’, ‘context’: [128006, 9125, 128007, 271, 38766, 1303, 33025, 2696, 25, 6790, 220, 2366, 18, 271, 2675, 527, 264, 11919, 18328, 13, 128009, 128006, 882, 128007, 1432, 2675, 527, 264, 15526, 34051, 30970, 13, 13969, 31737, 1234, 220, 975, 4339, 13, 2360, 100166, 13, 3298, 3823, 1432, 1502, 25, 720, 15836, 25, 128009, 128006, 78191, 128007, 271, 3923, 596, 709, 11, 95678, 30, 32082, 311, 636, 420, 6369, 3940, 30], ‘total_duration’: 712363500, ‘load_duration’: 85605400, ‘prompt_eval_count’: 56, ‘prompt_eval_duration’: 45423700, ‘eval_count’: 14, ‘eval_duration’: 569795200}
:speech_balloon: Extracted response: ‘What’s up, newbie? Ready to get this chat started?’
:robot: AI: What’s up, newbie? Ready to get this chat started?
:input_latin_letters: Sanitized text: ‘What’s up, newbie? Ready to get this chat started?’
:musical_note: Generating audio…

still looking at it erm

In this case, the most critical issue is the “short duration of the audio,” so I don’t think adjusting the options passed to Whisper alone will solve the problem.
You’ll likely need to make the WAV file itself longer.

Specifically, given the number of samples in that WAV file, even if the sampling rate is 16 kHz, the audio duration is only about one second; if the sampling rate were higher, it would be less than one second.

“Generating text from audio that’s less than a second long” is probably a bit outside the scope of the model’s design…

Whisper can handle “chunks,” but there’s probably a slight discrepancy between how you and the Whisper model interpret “chunks.”

If you can use a different model that prioritizes real-time performance, that’s a viable option; however, if Whisper is already integrated and you can’t use another model, I think you’ll need to create a buffer.
It comes down to whether you prioritize the final transcription quality or real-time streaming.


Can Whisper transcribe chunks? Could another Hugging Face model handle streaming better?

Short answer

Yes, Whisper can transcribe chunks in the basic sense: you can pass it a short audio array or short WAV segment and it may return text.

But Whisper is not a true streaming ASR model in the sense of:

20 ms Discord audio frame in
→ stable partial transcript out
→ updated transcript as more audio arrives

That difference is the important part.

Your current chunk is roughly:

5760 samples / 16000 Hz = 0.36 seconds

That is extremely short for independent Whisper transcription. It may contain only part of a syllable, a clipped word edge, a breath, keyboard noise, silence, Discord compression residue, or TTS feedback.

So the practical answer is:

Whisper can transcribe chunks, but not tiny independent Discord callback chunks reliably.

For Whisper/faster-whisper, “chunking” should usually mean:

small audio frames
→ buffered into a larger speech window
→ optional VAD trimming
→ optional overlap/stride
→ transcribe meaningful segment

Not:

tiny Discord frame
→ independent STT call
→ send result to Ollama



1. The important distinction: chunks vs streaming

These are not the same thing.

| Term | Meaning | Good fit for tiny Discord chunks? |
| --- | --- | --- |
| Independent chunk transcription | Each audio chunk is treated as a complete standalone clip | Usually bad |
| Chunked transcription with overlap/stride | Larger chunks are decoded with left/right context so boundary errors are reduced | Better |
| Utterance-based STT | VAD detects speech start/end, then STT transcribes the completed utterance | Best first version |
| True streaming ASR | Model keeps state/cache and emits partial/final text incrementally | Best for low-latency live captions |
| Raw Discord frame transcription | Every small callback/frame goes straight to STT | Usually the failure mode |

Your current system seems closest to this:

small audio callback
→ immediate STT
→ empty or bad transcript
→ empty transcript still sent to Ollama

That is the wrong shape for Whisper.


2. Why Whisper struggles with your current chunks

Whisper is a sequence-to-sequence model. It is strong, but it expects enough audio context to infer words.

It works best with something like:

1–15 seconds of speech-like audio
mostly intact word boundaries
reasonable volume
correct sample rate
silence/noise trimmed

It works badly with:

0.12–0.36 seconds of audio
half a word
wrong sample rate
wrong dtype
clipping / over-amplification
silence or no-speech
bot TTS leaking into input
Discord receive artifacts

The Whisper-Streaming paper states the key issue directly: Whisper is not designed for real-time transcription, so the authors built a streaming wrapper around it using local agreement and adaptive latency.

That means:

Whisper can be used in streaming systems,
but Whisper itself is not a native streaming recognizer.

3. What “chunking” should mean for Whisper

Bad Whisper chunking:

chunk 1 alone → text?
chunk 2 alone → text?
chunk 3 alone → text?

Better Whisper chunking:

audio stream
→ collect 1–5 seconds
→ add overlap/padding
→ transcribe
→ commit only stable/final text

Best first version for your AI VTuber:

audio stream
→ VAD detects speech start
→ buffer while user speaks
→ wait for 700–1200 ms silence
→ transcribe the completed utterance
→ reject empty/garbage
→ send valid text to Ollama

This is not “true streaming,” but it is usually the best first working design for a Discord AI VTuber.


4. Why overlap/stride matters

If you cut audio at arbitrary boundaries, words get chopped.

Example:

chunk 1: "can you hea"
chunk 2: "r me now"

A model may misread both chunks because neither one has the full word boundary context.

Hugging Face’s ASR chunking guide explains this for CTC models such as Wav2Vec2: chunks are decoded with stride/overlap so the model has context around the cut points, and the unreliable edges can be dropped/merged.

The same general idea matters for Whisper too, even though Whisper is not CTC-based:

do not decode arbitrary tiny independent slices

Use:

VAD padding
overlap
larger windows
or utterance-level transcription

5. Can another Hugging Face model handle chunks better?

Yes, but the details matter.

There are three realistic paths:

  1. Stay with Whisper/faster-whisper, but add VAD + utterance buffering.
  2. Try CTC models like Wav2Vec2 with chunking/stride.
  3. Use true streaming ASR models like NVIDIA Parakeet/Nemotron-style RNN-T/FastConformer models.

6. Option A — Stay with faster-whisper + VAD buffering

This is still my recommended first fix.

Use:

Silero VAD
+ utterance buffer
+ faster-whisper
+ transcript validation


Why this is best first:

| Reason | Explanation |
| --- | --- |
| Easier setup | Much easier than integrating a true streaming ASR runtime |
| Good accuracy | Whisper-family models are strong when audio is clean |
| Good enough latency | Utterance-based latency is acceptable for conversational bots |
| Fewer moving parts | You can debug audio conversion, VAD, STT, and Ollama separately |

Recommended flow:

Discord/local mic chunks
→ convert to mono 16 kHz float32
→ VAD
→ buffer complete utterance
→ faster-whisper
→ reject blank/garbage
→ Ollama

This will likely solve more of your current issue than switching models immediately.


7. Option B — Wav2Vec2 / CTC models with chunking + stride

CTC models can be more natural for chunking than Whisper.

Examples:

Wav2Vec2
HuBERT
WavLM-style ASR checkpoints

Why CTC models can work better for chunked audio:

  • they produce frame-level logits,
  • overlapping chunks can be merged more naturally,
  • boundary handling is simpler than seq2seq decoding,
  • Hugging Face pipelines support chunking/stride for many CTC ASR models.


Example shape:

from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="facebook/wav2vec2-base-960h",
)

result = pipe(
    audio_16k,
    chunk_length_s=5,
    stride_length_s=1,
)

But this still does not mean:

0.36-second Discord chunk → independent transcript

It means:

larger windows
+ overlap/stride
+ merge outputs

CTC chunking may be worth testing, but it does not remove the need for:

audio conversion
buffering
VAD/endpointing
empty transcript filtering
feedback prevention

8. Option C — True streaming ASR models

This is the route closest to what you are asking for.

Streaming ASR models are designed around:

small incoming chunks
+ preserved model state/cache
+ partial/final transcript updates

Common architectures include:

RNN-T / Transducer
Conformer / FastConformer
cache-aware streaming encoders

These are better suited to live voice agents than naive Whisper-per-chunk.


NVIDIA Parakeet Unified

nvidia/parakeet-unified-en-0.6b is a strong example.

The model card describes it as an English ASR model based on transducer architecture / RNN-T / FastConformer, supporting both offline and streaming inference in one model. It also mentions a minimum latency of 160 ms and configurable streaming chunk sizes from 2080 ms down to 160 ms in 80 ms steps.

Why it matters:

This is much closer to “streaming ASR” than plain Whisper.

Caveat:

You still need to integrate its streaming/buffered-streaming API correctly.
Do not call it independently on every Discord chunk as if each chunk is a complete utterance.

NVIDIA Nemotron Speech Streaming

NVIDIA’s Nemotron Speech streaming ASR is another relevant route.

The Hugging Face blog describes cache-aware streaming inference for voice agents, with latency modes such as 80 ms, 160 ms, 560 ms, and 1.12 s. It also explains why cache-aware streaming is more efficient than repeatedly re-encoding overlapping windows.

Why it matters:

This is the kind of architecture built for live voice agents.

Caveat:

It is more engineering-heavy than faster-whisper.
Expect NeMo/runtime-specific setup and more integration work.

9. Practical comparison

| Approach | Can handle tiny chunks directly? | Setup difficulty | Good for Discord AI VTuber? | Recommendation |
| --- | --- | --- | --- | --- |
| Naive Whisper per chunk | No | Low | Bad | Avoid |
| faster-whisper + VAD utterance buffering | Not directly; buffers into utterances | Medium-low | Good | Best first working route |
| Whisper-Streaming | More streaming-like, with local agreement | Medium-high | Good if you need partials | Try after basic STT works |
| Wav2Vec2/CTC + chunk/stride | Better chunk merging than Whisper | Medium | Maybe | Worth testing |
| NVIDIA Parakeet/Nemotron streaming | Yes, designed for streaming modes | Higher | Strong candidate | Best true-streaming HF route |
| Cloud STT | Yes | Low-medium | Technically good | Not free/local long-term |

10. Important: streaming ASR still needs a controller

Even with a true streaming model, you still need:

correct audio conversion
RMS/peak validation
VAD or endpointing
partial/final transcript handling
empty transcript filtering
bot_is_speaking guard
TTS feedback prevention
Discord receive debugging

Streaming ASR can help with this:

tiny chunks are too small for independent Whisper transcription

It does not automatically fix this:

empty transcript sent to Ollama
bot hears itself
wrong sample rate
bad amplitude
Discord receive broken
missing transcribe_audio function

Your current logs show controller/audio-path issues clearly, so switching models first may hide the real bug.


11. What I would do in your exact case

Step 1 — Fix the current pipeline first

Before changing models, fix these:

define transcribe_audio()
block empty Ollama calls
save actual STT input WAVs
validate sample rate / duration / RMS / peak
prevent Chatterbox feedback

Your current error:

NameError: name 'transcribe_audio' is not defined

means model choice is not the first blocker.


Step 2 — Make faster-whisper work on full utterances

Use:

audio chunks
→ mono 16 kHz conversion
→ VAD
→ utterance buffer
→ faster-whisper
→ reject blank
→ Ollama

This is the best first stable version.


Step 3 — If you need live partial captions, try Whisper-Streaming

This keeps Whisper-like behavior but adds a streaming policy layer.


Step 4 — If you need true low-latency streaming, test Parakeet/Nemotron

These are better for true streaming, but they are more complex than faster-whisper.


12. Best mental model

Whisper/faster-whisper route

small chunks
→ buffer into complete utterance
→ transcribe utterance

Streaming ASR route

small chunks
→ preserve model state/cache
→ emit partial/final transcript
→ controller decides when text is final enough

Bad route

small chunk
→ independent STT call
→ send result to Ollama

That last route is what you should avoid.


Bottom line

Whisper can transcribe chunks, but it is bad at transcribing tiny independent live chunks like 5760 samples.

If you want to stay simple:

Use faster-whisper + Silero VAD + utterance buffering.

If you want streaming behavior while keeping Whisper-like transcription:

Try Whisper-Streaming.

If you want a real Hugging Face streaming ASR model:

Try NVIDIA Parakeet Unified or Nemotron Speech Streaming.

But do not skip the basics:

define transcribe_audio()
block empty Ollama calls
buffer audio
validate audio
save debug WAVs
prevent TTS feedback

A streaming model can improve latency. It will not fix a broken controller or bad audio path.

There’s some other error in that log…


Why raising max_samples will not fix short Discord STT chunks

Short answer

No, simply turning max_samples up will not fix your current issue.

Your code:

# Limit audio length
max_samples = 16000 * 15  # 15 seconds max

if len(audio_np) > max_samples:
    audio_np = audio_np[-max_samples:]
    print(f"[TRIM] Trimmed audio to last {max_samples} samples")

only handles audio that is too long.

It says:

If audio is longer than 15 seconds, trim it down to the last 15 seconds.

It does not say:

Wait until I have enough audio before transcribing.

Your new log shows the opposite problem:

[SEARCH] Audio analysis - size: 5760, max_amplitude: 1.304917
[MIC] Incoming audio | amp=1.304917 | samples=5760
[PROCESS] Processing audio: 5760 samples, 1.304917 max amplitude
[STT] Transcribing with improved local STT...
...
Transcription error: name 'transcribe_audio' is not defined
...
Transcribed text: ''
Sending to Ollama: '...'

5760 samples is very short.

At different sample rates, that means:

| Sample rate assumption | Duration |
| --- | --- |
| 16 kHz | 5760 / 16000 = 0.36 s |
| 24 kHz | 5760 / 24000 = 0.24 s |
| 48 kHz | 5760 / 48000 = 0.12 s |

So raising max_samples from 15 seconds to 30 seconds would not help. Your audio is not being cut because it is too long. It is being sent to STT before enough speech has accumulated.

What you need is not a bigger maximum. You need:

minimum duration gate
+ audio chunk buffering
+ VAD / speech-end detection
+ transcript validation
+ empty transcript blocking


What your new log says

You now have several separate problems at the same time.

1. The audio chunk is too short

This line matters:

samples=5760

At 16 kHz, that is only 0.36 seconds.

That is not a complete utterance. It might be a breath, half a syllable, background noise, a clipped word, or a small piece of the bot’s own audio.

Whisper-style models are not good at:

tiny fragment in
→ reliable transcript out

Whisper-style models are better at:

complete speech segment in
→ transcript out

The Whisper-Streaming paper is relevant because it explicitly says Whisper is not designed for native real-time transcription. It wraps Whisper with a streaming policy so it can work on live/unsegmented speech.

For your bot, the practical translation is:

Do not transcribe tiny chunks.
Buffer chunks into completed speech turns.

2. The audio amplitude is still suspicious

Your log says:

max_amplitude: 1.304917

For normalized float audio going into STT, you usually want roughly:

-1.0 to +1.0

A peak above 1.0 can happen if there is gain/normalization, but it is suspicious enough to inspect. It may mean:

| Possible issue | Result |
| --- | --- |
| int16 PCM converted incorrectly | static / garbage waveform |
| stereo interleaved audio treated as mono | distorted audio |
| gain too high | clipping |
| double normalization | harsh waveform |
| wrong dtype | nonsense values |
| wrong sample-rate path | sped-up or slowed-down speech |

This can explain “when it works, it is off as heck.”

Before changing models, save the exact STT input as a WAV and listen to it.


3. Your STT function path is broken

This is a hard code bug:

NameError: name 'transcribe_audio' is not defined

That means your code tried to call:

transcribe_audio(...)

but no such function exists in that scope.

So that run did not prove anything about Whisper quality. The STT path crashed before a real transcription could happen.

You need either:

def transcribe_audio(audio_16k):
    ...

or change your code to call the function that actually exists.

Example:

def safe_transcribe(audio_16k):
    try:
        return transcribe_audio(audio_16k)
    except Exception as e:
        print(f"[STT] Transcription error: {e}")
        return ""

If transcribe_audio is not defined, every STT attempt becomes an empty transcript.


4. Empty transcripts are still being sent to Ollama

This is the most important controller bug.

Your log shows:

Transcribed text: ''
You:
Sending to Ollama: '...'
Ollama response status: 200
...
AI: What's good, chat? Ready to get this conversation started!

That means:

STT failed
→ empty text
→ sent to Ollama anyway
→ Ollama generated a generic opener
→ TTS generated audio

That creates a loop where the AI responds even though no valid user speech was heard.

This must be blocked.


What max_samples actually does

Your current code:

max_samples = 16000 * 15

if len(audio_np) > max_samples:
    audio_np = audio_np[-max_samples:]

means:

Keep at most the last 15 seconds.

It is an upper cap.

It only triggers when:

len(audio_np) > 240000

But your log has:

len(audio_np) = 5760

So:

5760 > 240000  # False

Nothing happens.

What you actually need

You need a minimum:

min_samples = int(16000 * 0.8)

if len(audio_np) < min_samples:
    print("Too short; keep buffering instead of transcribing.")
    return ""

But even that is only a guard. The real fix is buffering.


The correct idea: concatenate chunks before STT

Your incoming chunks are tiny. That is normal for real-time audio.

The wrong pipeline is:

chunk 1 → STT
chunk 2 → STT
chunk 3 → STT
chunk 4 → STT

The better pipeline is:

chunk 1
+ chunk 2
+ chunk 3
+ chunk 4
+ ...
→ enough speech collected
→ STT once

The best pipeline is:

chunks
→ VAD detects speech start
→ buffer while user speaks
→ VAD detects enough silence
→ finalize utterance
→ STT once

That is the difference between chunk transcription and utterance transcription.


Minimal fix order

Do these in this order.

1. Define or correctly call transcribe_audio

Your log has:

NameError: name 'transcribe_audio' is not defined

Fix that first.

Example wrapper:

def transcribe_audio(audio_16k):
    return transcribe_with_faster_whisper(audio_16k)

Or rename the call:

# Wrong if transcribe_audio does not exist:
text = transcribe_audio(audio_np)

# Right if this is the function that actually exists:
text = transcribe_with_faster_whisper(audio_np)

Until this is fixed, STT cannot work.


2. Stop sending empty transcripts to Ollama

Add this immediately:

def should_send_to_ollama(text: str) -> bool:
    text = (text or "").strip()

    if not text:
        return False

    if len(text) < 2:
        return False

    bad_outputs = {
        ".",
        "...",
        "you",
        "thank you",
        "thanks for watching",
        "subscribe",
    }

    if text.lower() in bad_outputs:
        return False

    return True

Use it before every Ollama call:

text = safe_transcribe(audio_np)

if not should_send_to_ollama(text):
    print("[CTRL] Empty/invalid transcript; not sending to Ollama.")
    return

send_to_ollama(text)

This prevents:

blank STT
→ generic AI greeting
→ TTS
→ possible feedback loop

3. Add audio validation before STT

Use this before calling Whisper/faster-whisper:

import numpy as np

def valid_audio_for_stt(audio_16k, sr=16000):
    audio_16k = np.asarray(audio_16k, dtype=np.float32)

    duration = len(audio_16k) / sr
    peak = float(np.max(np.abs(audio_16k))) if len(audio_16k) else 0.0
    rms = float(np.sqrt(np.mean(audio_16k ** 2))) if len(audio_16k) else 0.0

    if duration < 0.8:
        return False, f"too short: {duration:.2f}s"

    if peak < 0.015:
        return False, f"too quiet: peak={peak:.4f}"

    if rms < 0.003:
        return False, f"too quiet: rms={rms:.4f}"

    if peak > 1.05:
        return False, f"bad normalization: peak={peak:.4f}"

    return True, "ok"

Your current chunk would probably fail:

samples=5760
peak=1.304917

That is good. Bad audio should be rejected before STT.


Simple concatenation buffer

This is not the final ideal version, but it is a useful first patch.

import numpy as np

class RollingSTTBuffer:
    def __init__(self, sample_rate=16000, min_seconds=1.0, max_seconds=15.0):
        self.sample_rate = sample_rate
        self.min_samples = int(sample_rate * min_seconds)
        self.max_samples = int(sample_rate * max_seconds)
        self.buffer = np.zeros(0, dtype=np.float32)

    def add(self, chunk):
        chunk = np.asarray(chunk, dtype=np.float32)
        self.buffer = np.concatenate([self.buffer, chunk])

        if len(self.buffer) > self.max_samples:
            self.buffer = self.buffer[-self.max_samples:]

    def ready(self):
        return len(self.buffer) >= self.min_samples

    def pop(self):
        audio = self.buffer
        self.buffer = np.zeros(0, dtype=np.float32)
        return audio

Usage:

stt_buffer = RollingSTTBuffer(
    sample_rate=16000,
    min_seconds=1.0,
    max_seconds=15.0,
)

def handle_audio_chunk(chunk_16k):
    stt_buffer.add(chunk_16k)

    if not stt_buffer.ready():
        print("[BUFFER] Not enough audio yet.")
        return

    audio_for_stt = stt_buffer.pop()

    text = safe_transcribe(audio_for_stt)

    if not should_send_to_ollama(text):
        return

    send_to_ollama(text)

This proves whether concatenating chunks helps.

But it has a weakness: it transcribes after a fixed amount of audio, not after the user actually finishes speaking.

The better solution is VAD-based buffering.


Better solution: VAD-based utterance buffering

Use VAD to decide:

speech started
speech continued
speech ended

Then transcribe the completed utterance.

Recommended tools:

Silero VAD is useful because it supports 8 kHz and 16 kHz audio and is designed for fast chunk-level speech detection.

VAD-based utterance buffer

import numpy as np

class UtteranceBuffer:
    def __init__(
        self,
        sample_rate=16000,
        min_speech_seconds=0.8,
        end_silence_ms=900,
        max_seconds=15.0,
    ):
        self.sample_rate = sample_rate
        self.min_speech_samples = int(sample_rate * min_speech_seconds)
        self.end_silence_ms = end_silence_ms
        self.max_samples = int(sample_rate * max_seconds)

        self.frames = []
        self.speaking = False
        self.silence_ms = 0.0
        self.speech_samples = 0

    def _frame_ms(self, frame):
        return 1000.0 * len(frame) / self.sample_rate

    def push(self, frame_16k, is_speech: bool):
        frame_16k = np.asarray(frame_16k, dtype=np.float32)

        if is_speech:
            self.speaking = True
            self.silence_ms = 0.0
            self.speech_samples += len(frame_16k)
            self.frames.append(frame_16k)

        elif self.speaking:
            self.silence_ms += self._frame_ms(frame_16k)
            self.frames.append(frame_16k)

        else:
            return None

        audio = (
            np.concatenate(self.frames)
            if self.frames
            else np.zeros(0, dtype=np.float32)
        )

        if len(audio) > self.max_samples:
            audio = audio[-self.max_samples:]
            self.frames = [audio]

        if self.speaking and self.silence_ms >= self.end_silence_ms:
            utterance = (
                np.concatenate(self.frames)
                if self.frames
                else np.zeros(0, dtype=np.float32)
            )

            enough_speech = self.speech_samples >= self.min_speech_samples

            self.frames = []
            self.speaking = False
            self.silence_ms = 0.0
            self.speech_samples = 0

            if not enough_speech:
                print("[VAD] Dropped utterance: too little speech")
                return None

            return utterance

        return None

Conceptual usage:

utt_buffer = UtteranceBuffer(sample_rate=16000)

def process_audio_frame(frame_16k):
    is_speech = vad_is_speech(frame_16k)  # implement with Silero/WebRTC/etc.

    utterance = utt_buffer.push(frame_16k, is_speech=is_speech)

    if utterance is None:
        return ""

    text = safe_transcribe(utterance)

    if not should_send_to_ollama(text):
        return ""

    send_to_ollama(text)
    return text

This is the direction you want.


faster-whisper starter config

Use faster-whisper instead of a hand-rolled “simple Whisper STT” path if possible.

Example:

from faster_whisper import WhisperModel
import numpy as np

model = WhisperModel(
    "small.en",
    device="cpu",        # use "cuda" if available
    compute_type="int8", # use "float16" on CUDA
)

def transcribe_with_faster_whisper(audio_16k: np.ndarray) -> str:
    audio_16k = np.asarray(audio_16k, dtype=np.float32)
    audio_16k = np.clip(audio_16k, -1.0, 1.0)

    ok, reason = valid_audio_for_stt(audio_16k, sr=16000)
    if not ok:
        print("[STT] Skipping:", reason)
        return ""

    segments, info = model.transcribe(
        audio_16k,
        language="en",
        task="transcribe",
        beam_size=1,
        temperature=0.0,
        condition_on_previous_text=False,
        vad_filter=True,
        vad_parameters={
            "min_silence_duration_ms": 700,
            "speech_pad_ms": 300,
        },
        no_speech_threshold=0.6,
        compression_ratio_threshold=1.35,
        log_prob_threshold=-1.0,
    )

    return " ".join(seg.text.strip() for seg in segments).strip()

Why these settings help:

| Setting | Reason |
| --- | --- |
| language="en" | Avoids unstable language detection on short clips |
| task="transcribe" | Prevents accidental translation |
| beam_size=1 | Lower latency |
| temperature=0.0 | More deterministic |
| condition_on_previous_text=False | Reduces carry-over hallucination between short turns |
| vad_filter=True | Extra silence cleanup |
| min_silence_duration_ms=700 | Reasonable conversational silence threshold |
| speech_pad_ms=300 | Avoids cutting word edges |
| no_speech_threshold=0.6 | Helps ignore no-speech chunks |
| compression_ratio_threshold=1.35 | Helps catch repetitive hallucinations |
| log_prob_threshold=-1.0 | Helps catch low-confidence output |

Save the exact STT input as WAV

This is still the most important debug step.

# deps:
# pip install soundfile numpy

import numpy as np
import soundfile as sf
from pathlib import Path

debug_dir = Path("debug_stt")
debug_dir.mkdir(exist_ok=True)

def save_debug_wav(audio, sr, filename):
    audio = np.asarray(audio, dtype=np.float32)
    audio = np.clip(audio, -1.0, 1.0)
    sf.write(debug_dir / filename, audio, sr)

Use it right before STT:

save_debug_wav(audio_16k, 16000, "actual_stt_input.wav")

Then listen.

| What the WAV sounds like | Diagnosis |
| --- | --- |
| Silence | wrong source / VAD issue / Discord receive issue |
| Static | dtype/decode issue |
| Fast voice | sample-rate mismatch |
| Slow voice | sample-rate mismatch |
| Distorted/clipped | normalization/gain issue |
| Half-word only | chunking problem |
| Bot voice | TTS feedback loop |
| Multiple speakers | need per-user buffers |
| Clean sentence | STT settings/model issue |

Do not skip this. It usually reveals the actual problem faster than changing models.


Sample rates in your system

You now likely have several sample rates:

STT target: 16000 Hz
Chatterbox TTS output: 24000 Hz
Discord voice audio: commonly 48000 Hz stereo/Opus/PCM path

Your log says:

TTS result: sr=24000, audio_shape=(86400,)

That is:

86400 / 24000 = 3.6 seconds

Chatterbox generated 3.6 seconds of TTS audio.

That audio should go to the TTS/playback path, not the STT input path.

Keep these separate:

Input audio → 16 kHz mono → STT
TTS audio → Discord playback format → Discord output

Do not let Chatterbox/TTS output leak into your mic/STT input.
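
For the playback side, here is a sketch converting 24 kHz mono Chatterbox output to the 48 kHz stereo 16-bit PCM that Discord playback paths commonly expect:

# deps:
# pip install numpy scipy

import numpy as np
from scipy.signal import resample_poly

def tts_24k_to_discord_pcm(audio_24k) -> bytes:
    audio_24k = np.asarray(audio_24k, dtype=np.float32)
    audio_48k = resample_poly(audio_24k, up=2, down=1)    # 24 kHz → 48 kHz
    audio_48k = np.clip(audio_48k, -1.0, 1.0)
    stereo = np.repeat(audio_48k[:, None], 2, axis=1)     # mono → stereo
    return (stereo * 32767.0).astype(np.int16).tobytes()  # float32 → s16le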


Is Chatterbox affecting STT?

Probably not directly.

Chatterbox is TTS. It generates speech. It does not transcribe speech.

But it can affect your STT system indirectly in three ways.

1. Feedback loop

If the bot’s generated voice is captured by your mic or virtual audio cable, the STT system may hear the bot instead of you.

Bad routing:

TTS output
→ speakers / desktop mix / virtual cable
→ STT input
→ bot hears itself
→ bot replies to itself

Better routing:

Human mic or per-user Discord receive
→ STT

Bot TTS
→ Discord output only

2. Sample-rate confusion

Chatterbox output is 24 kHz in your log.

STT should usually get 16 kHz mono.

Discord playback often involves 48 kHz audio.

So do not reuse one conversion path for everything.

3. The Turbo warning is not your STT bug

Your log says:

WARNING - CFG, min_p and exaggeration are not supported by Turbo version and will be ignored.

That warning is about Chatterbox Turbo TTS settings. It means those generation settings are ignored by the Turbo model.

That warning can affect TTS behavior/customization, but it does not explain blank STT.


Add a bot_is_speaking guard while debugging

For the first stable version, disable listening while the bot speaks.

bot_is_speaking = False

Around TTS playback:

bot_is_speaking = True
play_tts_audio(...)
bot_is_speaking = False

In audio handling:

def handle_audio_chunk(chunk_16k):
    if bot_is_speaking:
        print("[AUDIO] Ignoring input while bot is speaking.")
        return

    # continue STT path

This disables barge-in, but it prevents feedback while debugging.

Later, implement real barge-in:

if human starts speaking while bot speaks:
    stop TTS
    clear playback queue
    cancel current LLM/TTS response
    return to listening
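
A minimal interruption sketch using a threading.Event as the stop signal; play_chunk and cancel_llm_request are hypothetical placeholders for your playback and Ollama-cancel calls:

import threading

tts_stop = threading.Event()

def play_tts_audio(chunks):
    tts_stop.clear()
    for chunk in chunks:
        if tts_stop.is_set():
            break            # human barged in; stop playback now
        play_chunk(chunk)    # hypothetical playback call

def on_barge_in():
    tts_stop.set()           # stop TTS
    cancel_llm_request()     # hypothetical: abort the in-flight response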

Live voice-agent systems treat turn detection and interruption handling as separate concerns.


Better logs to add

Your logs should include:

sample_rate
samples
duration_seconds
min
max
peak
rms
bot_is_speaking
buffer_size
vad_state
utterance_ready
stt_called
ollama_called

Example logging helper:

import numpy as np

def log_audio_debug(label, audio, sr):
    audio = np.asarray(audio, dtype=np.float32)
    duration = len(audio) / sr if sr else 0.0
    peak = float(np.max(np.abs(audio))) if len(audio) else 0.0
    rms = float(np.sqrt(np.mean(audio ** 2))) if len(audio) else 0.0

    print(
        f"[{label}] sr={sr} samples={len(audio)} "
        f"duration={duration:.3f}s peak={peak:.4f} rms={rms:.4f}"
    )

Healthy logs should look like:

[MIC] sr=16000 samples=320 duration=0.020s peak=0.12 rms=0.02
[VAD] speech_start
[BUFFER] speech_ms=1240 silence_ms=0
[VAD] endpoint after silence_ms=900
[UTTERANCE] sr=16000 samples=35680 duration=2.23s peak=0.44 rms=0.06
[STT] text="can you hear me now"
[OLLAMA] sending valid transcript

Unhealthy logs look like:

samples=5760
transcribe immediately
NameError
empty text
send to Ollama anyway

Discord-specific note

If you are receiving audio from Discord VC, remember that Discord receive is its own fragile layer.

Check the Discord voice docs and your library’s voice-receive extension; Pycord, for example, warns that recording/listening may be affected by DAVE end-to-end encryption.

Before debugging STT, prove Discord receive works by saving clean WAV files:

Discord receive
→ decode/convert
→ save WAV
→ listen manually

Only after the WAV sounds correct should you send it into STT.


Recommended build order

Phase 1: local mic STT only

local mic
→ VAD
→ utterance buffer
→ faster-whisper
→ print transcript

Pass criteria:

silence produces no transcript
one sentence produces one transcript
partial speech is not sent
empty text is ignored

Phase 2: add Ollama

local mic
→ STT
→ Ollama
→ print reply

Pass criteria:

Ollama is called only for real speech
blank transcripts are ignored

Phase 3: add Chatterbox locally

local mic
→ STT
→ Ollama
→ Chatterbox TTS
→ local playback

Pass criteria:

the bot does not hear itself
the bot does not respond to its own voice

Phase 4: send TTS to Discord

local mic
→ STT
→ Ollama
→ Chatterbox TTS
→ Discord VC output

Phase 5: add Discord receive later

First:

Discord receive
→ save clean WAV per speaker

Then:

Discord receive
→ per-user VAD
→ per-user STT
→ speaker-labeled transcript

Do not start with full Discord receive unless you need it. It adds several failure points.


Final summary

Turning max_samples up will not fix this because max_samples is an upper cap, not a minimum buffer target.

Your immediate problems are:

1. You are calling STT on tiny chunks like 5760 samples.
2. Your waveform amplitude is suspiciously above 1.0.
3. Your code is calling a missing function: transcribe_audio.
4. Empty transcripts are still being sent to Ollama.
5. Chatterbox may be feeding back into STT if audio routing is not separated.

Fix order:

1. Define or correctly call transcribe_audio().
2. Stop sending empty transcripts to Ollama.
3. Add minimum-duration/RMS/peak validation before STT.
4. Concatenate chunks into a buffer.
5. Replace fixed buffering with VAD-based utterance buffering.
6. Keep Chatterbox/TTS output out of the STT input path.
7. Use faster-whisper with vad_filter=True and condition_on_previous_text=False.

The core rule:

Do not make Whisper transcribe chunks.
Make Whisper transcribe completed utterances.