How do I make STT work for my AI VTuber on Discord VC calls?

I’ve been struggling to make Discord STT work, but I’ve figured everything else out. (Is there a free STT solution that can be programmed into the VC call on Discord and won’t have issues on call?)

logs:
:memo: Transcribing…
:magnifying_glass_tilted_left: Audio analysis - size: 18000, max_amplitude: 1.236272
:microphone: Incoming audio | amp=1.236272 | samples=18000
:musical_note: Processing audio: 18000 samples, 1.236272 max amplitude
:magnifying_glass_tilted_left: Transcribing with simple Whisper STT…
Transcribing audio: 18000 samples, max amplitude: 1.2363
Transcription too short
:test_tube: RAW STT OUTPUT: ‘’
:memo: Empty transcription
:memo: Transcribed text: ‘’
:bust_in_silhouette: You:
:outbox_tray: Sending to Ollama: ‘…’

Nothing shows up, and when I do get it to work, it’s off as heck from what I actually said.

Hmm… Even among files with the .wav extension, there can be significant variation in the internal format and quality. This is because the .wav format is relatively flexible and can accommodate various forms depending on the RIFF header.

Therefore, transcription will likely be more successful if you first use a standard tool to clean up or normalize the .wav file (in other words, perform preprocessing) before passing it to STT…


How to make STT work for an AI VTuber in Discord VC

Short answer

There is no free STT that can be dropped into Discord VC and “just won’t have issues.”

For a reliable free/local setup, the solution is not simply “use Whisper.” The solution is a proper live voice pipeline:

Discord VC or local mic audio
→ decode to PCM
→ convert to mono 16 kHz float32
→ normalize/clip safely
→ VAD / speech detection
→ buffer a complete utterance
→ transcribe with faster-whisper
→ reject blank / garbage / too-short transcripts
→ send only valid text to Ollama

The most important point:

Do not transcribe raw Discord audio chunks directly.

Discord gives you a live audio stream. Whisper-style STT wants a clean speech segment. Your job is to turn many tiny audio frames into one clean utterance before STT sees it.


Your log

📝 Transcribing...
🔍 Audio analysis - size: 18000, max_amplitude: 1.236272
🎤 Incoming audio | amp=1.236272 | samples=18000
🎵 Processing audio: 18000 samples, 1.236272 max amplitude
🔍 Transcribing with simple Whisper STT...
Transcribing audio: 18000 samples, max amplitude: 1.2363
Transcription too short
🧪 RAW STT OUTPUT: ''
📝 Empty transcription
📝 Transcribed text: ''
👤 You:
📤 Sending to Ollama: '...'

This log strongly suggests four problems:

  1. The audio chunk is probably too short.
  2. The audio amplitude/normalization looks suspicious.
  3. You are probably transcribing chunks instead of completed speech turns.
  4. Empty STT output is still being sent downstream to Ollama.

That combination can explain both symptoms:

nothing shows up

and:

when it works, the transcription is way off

1. The biggest clue: 18000 samples

18000 samples is probably not enough audio.

The actual duration depends on sample rate and channel layout:

| Interpretation | Approximate duration |
| --- | --- |
| 18,000 mono samples at 48 kHz | 0.375 seconds |
| 18,000 stereo-interleaved samples at 48 kHz | 0.1875 seconds per channel |
| 18,000 mono samples at 16 kHz | 1.125 seconds |

Even the best case is short. The likely Discord case may be less than half a second.

That explains this line:

Transcription too short

Whisper-style models can sometimes decode short clips, but they are not reliable when fed tiny fragments like:

"he"
"hel"
"lo ca"
"n you"

The Whisper-Streaming paper exists because Whisper is not natively designed for real-time transcription. It needs a streaming/buffering policy around it.

Fix

Do not do this:

Discord chunk → Whisper → text

Do this:

Discord chunks
→ VAD
→ utterance buffer
→ complete speech segment
→ Whisper/faster-whisper
→ text

Good starter thresholds:

| Setting | Recommended starting value |
| --- | --- |
| Minimum speech before STT | 0.8–1.2 seconds |
| End-of-turn silence | 700–1200 ms |
| Pre-roll before speech start | 200–400 ms |
| Post-roll after speech end | 150–300 ms |
| Max utterance length | 8–15 seconds |
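
If you want these in one place in code, here is a minimal config sketch; the field names are illustrative, not from any library:

from dataclasses import dataclass

@dataclass
class EndpointConfig:
    # Starter values from the table above; tune them for your setup.
    min_speech_s: float = 1.0       # minimum speech before calling STT
    end_silence_ms: float = 900.0   # silence that ends a speech turn
    pre_roll_ms: float = 300.0      # audio kept from before speech starts
    post_roll_ms: float = 200.0     # audio kept after speech ends
    max_utterance_s: float = 15.0   # hard cap on one utterance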

2. max_amplitude=1.236272 is suspicious

For normalized float audio, you usually want samples roughly inside:

-1.0 to +1.0

Your log says:

max_amplitude: 1.236272

That does not automatically prove the audio is broken, but it is suspicious enough to inspect before changing STT models.

Possible causes:

| Cause | Result |
| --- | --- |
| int16 PCM interpreted as float32 | static/noise/garbage waveform |
| stereo interleaved audio treated as mono | distorted waveform |
| audio normalized twice | clipping / harsh audio |
| gain too high | values exceed normal float range |
| wrong resampling path | pitch/speed/time distortion |
| wrong dtype/endianness | nonsense transcription |

If the waveform going into STT is malformed, Whisper will produce blanks or nonsense no matter which model you use.
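
If the only problem is mild over-amplification, a conservative pre-STT step is to peak-normalize and then clip. This is a sketch; it fixes excess gain only, not dtype or interleaving mistakes:

import numpy as np

def safe_normalize(audio, target_peak=0.95):
    # Scale the signal down if its peak exceeds target_peak, then clip.
    audio = np.asarray(audio, dtype=np.float32)
    peak = float(np.max(np.abs(audio))) if audio.size else 0.0
    if peak > target_peak:
        audio = audio * (target_peak / peak)
    return np.clip(audio, -1.0, 1.0)

If the peak is wildly above 1.0 (thousands, not 1.2), the data is almost certainly int16 read as float, and rescaling cannot fix that.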


3. Discord audio is not Whisper-ready audio

Discord voice and Whisper have different assumptions.

Discord’s voice docs describe voice audio as Opus-encoded, stereo, 48 kHz.

Whisper/faster-whisper generally wants prepared speech audio:

mono
16 kHz
float32
roughly [-1.0, +1.0]
speech-only segment


Correct conversion path:

Discord Opus / decoded PCM
→ 48 kHz stereo PCM
→ downmix to mono
→ resample to 16 kHz
→ convert to float32
→ normalize/clip safely
→ VAD
→ utterance buffer
→ STT

Bad conversion can make speech sound sped up, slowed down, crunchy, clipped, or like static. Whisper then guesses.


4. Discord receive is separately fragile

Discord voice sending is relatively common.

Discord voice receiving is harder.

If your AI VTuber only needs to hear you, the most stable first version is:

local mic
→ VAD
→ faster-whisper
→ Ollama
→ TTS
→ Discord voice output

The bot still talks in Discord, but it listens to your local microphone instead of trying to receive Discord VC audio.

If the AI must hear everyone in VC, then you need the harder path:

Discord VC receive
→ per-user audio stream
→ decode
→ mono 16 kHz conversion
→ per-user VAD
→ per-user utterance buffer
→ per-user STT
→ speaker-labeled transcript
→ Ollama
→ TTS
→ Discord voice send


Also note: Pycord’s voice docs currently warn that recording/listening may not work as expected because of DAVE end-to-end encryption for voice calls.

So if the bot joins VC but hears nothing, do not immediately blame Whisper. First prove that your receive library is producing valid decoded audio.


5. Empty transcript should never reach Ollama

This part of your log is a controller bug:

RAW STT OUTPUT: ''
Empty transcription
Transcribed text: ''
Sending to Ollama: '...'

An empty STT result should mean:

return to listening

Not:

send blank prompt to Ollama

Add a hard gate:

def should_send_to_ollama(text: str) -> bool:
    text = text.strip()

    if not text:
        return False

    if len(text) < 2:
        return False

    common_bad_outputs = {
        ".",
        "...",
        "you",
        "thank you",
        "thanks for watching",
        "subscribe",
    }

    if text.lower() in common_bad_outputs:
        return False

    return True

Then:

text = transcribe_utterance(audio_16k)

if not should_send_to_ollama(text):
    print("No valid transcript. Returning to listening.")
    return

send_to_ollama(text)

This one change prevents the AI from reacting to silence, failed STT, or garbage.


6. Recommended free STT stack

Use this first:

Silero VAD
+ faster-whisper
+ proper audio conversion
+ utterance buffering
+ transcript filtering

Why faster-whisper?

faster-whisper is a practical local Whisper implementation built on CTranslate2. It is commonly used because it is faster and more memory-efficient than the original Whisper implementation, and it supports CPU/GPU quantized inference.

Why Silero VAD?

Silero VAD is lightweight and fast. It is designed for voice activity detection and can process small chunks efficiently. Use it to decide when speech starts and ends.
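
A minimal chunk-level sketch, assuming the silero-vad pip package; Silero expects 512-sample chunks at 16 kHz, and reset_states() should be called when switching streams:

# deps:
# pip install silero-vad torch numpy

import numpy as np
import torch
from silero_vad import load_silero_vad

vad_model = load_silero_vad()

def vad_is_speech(frame_16k, threshold=0.5) -> bool:
    # frame_16k: 512 samples of mono float32 audio at 16 kHz.
    tensor = torch.from_numpy(np.asarray(frame_16k, dtype=np.float32))
    speech_prob = vad_model(tensor, 16000).item()
    return speech_prob >= threshold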

Why not “simple Whisper STT”?

Simple Whisper STT usually assumes:

clean audio file in
→ transcript out

Discord VC is different:

tiny frames
+ live speech
+ silence/noise
+ possible stereo 48 kHz audio
+ possible Discord receive problems
+ possible bot audio feedback

So you need a live voice pipeline, not a file-transcription pipeline.


7. The architecture I would use

Best first version: AI hears only you

Use your local mic for STT:

Local mic
→ audio conversion
→ Silero VAD
→ utterance buffer
→ faster-whisper
→ transcript filter
→ Ollama
→ TTS
→ Discord VC output

This avoids:

  • Discord receive instability
  • DAVE/E2EE receive issues
  • per-user audio mapping
  • mixed-speaker audio
  • Discord packet/decode bugs
  • bot hearing its own Discord output

This is the most reliable first working version for an AI VTuber.

Harder version: AI hears everyone in Discord VC

Use per-user receive buffers:

Discord VC receive
→ per-user PCM
→ per-user mono 16 kHz conversion
→ per-user VAD
→ per-user utterance buffer
→ per-user STT
→ speaker-labeled text
→ Ollama
→ TTS
→ Discord voice output

Speaker-labeled transcript example:

Alice: Can you hear me?
Bob: Yeah, I can hear you.
You: Ask the AI what it thinks.

Do not mix all users into one stream unless you are okay with bad diarization and confused transcripts.
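
To keep streams separate, hold one buffer per user. A sketch, assuming the UtteranceBuffer and transcribe_utterance pieces shown elsewhere in this answer:

from collections import defaultdict

# One VAD/utterance buffer per Discord user ID.
user_buffers = defaultdict(lambda: UtteranceBuffer(sample_rate=16000))

def on_user_frame(user_id, frame_16k, is_speech):
    utterance = user_buffers[user_id].push(frame_16k, is_speech=is_speech)
    if utterance is not None:
        text = transcribe_utterance(utterance)
        if text:
            print(f"{user_id}: {text}")  # speaker-labeled transcript line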


8. Correct audio validation before STT

Before calling STT, log these values:

import numpy as np

def audio_stats(audio, sample_rate: int) -> dict:
    audio = np.asarray(audio, dtype=np.float32)

    if len(audio) == 0:
        return {
            "samples": 0,
            "sample_rate": sample_rate,
            "duration": 0.0,
            "min": 0.0,
            "max": 0.0,
            "peak": 0.0,
            "rms": 0.0,
        }

    return {
        "samples": int(len(audio)),
        "sample_rate": sample_rate,
        "duration": float(len(audio) / sample_rate),
        "min": float(audio.min()),
        "max": float(audio.max()),
        "peak": float(np.max(np.abs(audio))),
        "rms": float(np.sqrt(np.mean(audio ** 2))),
    }

Target before STT:

sample_rate: 16000
channels: mono
dtype: float32
duration: usually 0.8s or longer
peak: usually <= 1.0
rms: not near zero

Reject bad audio before STT:

import numpy as np

def valid_audio_for_stt(audio_16k, sr=16000):
    audio_16k = np.asarray(audio_16k, dtype=np.float32)

    duration = len(audio_16k) / sr
    peak = float(np.max(np.abs(audio_16k))) if len(audio_16k) else 0.0
    rms = float(np.sqrt(np.mean(audio_16k ** 2))) if len(audio_16k) else 0.0

    if duration < 0.8:
        return False, f"too short: {duration:.2f}s"

    if peak < 0.015:
        return False, f"too quiet: peak={peak:.4f}"

    if rms < 0.003:
        return False, f"too quiet: rms={rms:.4f}"

    if peak > 1.05:
        return False, f"bad normalization: peak={peak:.4f}"

    return True, "ok"

Given your log:

max_amplitude=1.236272

this would probably reject the chunk and force you to inspect your conversion/normalization.

That is good. You want to catch bad audio before Whisper sees it.


9. Save the exact STT input as WAV

This is the most important debug step.

# deps:
# pip install soundfile numpy

import numpy as np
import soundfile as sf
from pathlib import Path

debug_dir = Path("debug_stt")
debug_dir.mkdir(exist_ok=True)

def save_debug_wav(audio, sample_rate: int, filename: str):
    audio = np.asarray(audio, dtype=np.float32)
    audio = np.clip(audio, -1.0, 1.0)
    sf.write(debug_dir / filename, audio, sample_rate)

When STT returns blank:

save_debug_wav(audio_16k, 16000, "empty_transcript.wav")

Then listen to the file.

| What the WAV sounds like | Likely cause |
| --- | --- |
| Silence | wrong source / VAD issue / receive issue |
| Static | dtype/decode issue |
| Fast voice | sample-rate mismatch |
| Slow voice | sample-rate mismatch |
| Distorted/clipped | gain/normalization issue |
| Half-word only | chunking/buffering issue |
| Bot’s own voice | audio feedback loop |
| Multiple speakers mixed | need per-user buffers |
| Clean full sentence | STT settings/model issue |

Do this before switching models.


10. Safer PCM conversion example

If you have decoded 48 kHz stereo signed 16-bit PCM bytes, convert like this:

# deps:
# pip install numpy scipy

import numpy as np
from scipy.signal import resample_poly

def pcm_s16le_48k_stereo_to_16k_mono_float32(pcm_bytes: bytes) -> np.ndarray:
    audio_i16 = np.frombuffer(pcm_bytes, dtype=np.int16)

    if audio_i16.size == 0:
        return np.zeros(0, dtype=np.float32)

    # Drop incomplete stereo frame if needed.
    usable = (audio_i16.size // 2) * 2
    audio_i16 = audio_i16[:usable]

    # frames x channels
    stereo = audio_i16.reshape(-1, 2)

    # stereo → mono
    mono = stereo.astype(np.float32).mean(axis=1)

    # int16 → float32
    mono = mono / 32768.0

    # safety clamp
    mono = np.clip(mono, -1.0, 1.0)

    # 48 kHz → 16 kHz
    mono_16k = resample_poly(mono, up=1, down=3)

    return mono_16k.astype(np.float32)

Common bad conversions:

# Bad if pcm_bytes are int16 PCM.
audio = np.frombuffer(pcm_bytes, dtype=np.float32)

# Bad if stereo is interleaved and never reshaped/downmixed.
audio = np.frombuffer(pcm_bytes, dtype=np.int16).astype(np.float32) / 32768.0

# Bad if the data is still 48 kHz but you tell Whisper it is 16 kHz.
processor(audio, sampling_rate=16000)

11. faster-whisper starter config

# deps:
# pip install faster-whisper numpy

from faster_whisper import WhisperModel
import numpy as np

model = WhisperModel(
    "small.en",
    device="cpu",        # use "cuda" if available
    compute_type="int8", # use "float16" on CUDA
)

def transcribe_utterance(audio_16k: np.ndarray) -> str:
    ok, reason = valid_audio_for_stt(audio_16k)
    if not ok:
        print("Skipping STT:", reason)
        return ""

    audio_16k = np.asarray(audio_16k, dtype=np.float32)
    audio_16k = np.clip(audio_16k, -1.0, 1.0)

    segments, info = model.transcribe(
        audio_16k,
        language="en",
        task="transcribe",
        beam_size=1,
        temperature=0.0,
        condition_on_previous_text=False,
        vad_filter=True,
        vad_parameters={
            "min_silence_duration_ms": 700,
            "speech_pad_ms": 300,
        },
        no_speech_threshold=0.6,
        compression_ratio_threshold=1.35,
        log_prob_threshold=-1.0,
    )

    text = " ".join(segment.text.strip() for segment in segments).strip()
    return text

Why these settings:

| Setting | Why it helps |
| --- | --- |
| language="en" | Avoids unstable language detection on short clips |
| task="transcribe" | Prevents accidental translation behavior |
| beam_size=1 | Lower latency |
| temperature=0.0 | More deterministic |
| condition_on_previous_text=False | Reduces carry-over hallucination across short turns |
| vad_filter=True | Extra silence cleanup |
| min_silence_duration_ms=700 | Reasonable conversational endpoint start |
| speech_pad_ms=300 | Avoids cutting word edges |
| no_speech_threshold=0.6 | Helps skip no-speech segments |
| compression_ratio_threshold=1.35 | Helps catch repetitive output |
| log_prob_threshold=-1.0 | Helps catch low-confidence output |



12. Add an utterance buffer

Conceptual version:

import numpy as np

class UtteranceBuffer:
    def __init__(self, sample_rate=16000, end_silence_ms=900):
        self.sample_rate = sample_rate
        self.end_silence_ms = end_silence_ms
        self.frames = []
        self.speaking = False
        self.silence_ms = 0.0

    def _frame_ms(self, audio_frame):
        return 1000.0 * len(audio_frame) / self.sample_rate

    def push(self, audio_frame_16k, is_speech: bool):
        audio_frame_16k = np.asarray(audio_frame_16k, dtype=np.float32)

        if is_speech:
            self.speaking = True
            self.silence_ms = 0.0
            self.frames.append(audio_frame_16k)
            return None

        if self.speaking:
            self.silence_ms += self._frame_ms(audio_frame_16k)

            # Keep some silence tail so the utterance does not sound chopped.
            self.frames.append(audio_frame_16k)

            if self.silence_ms >= self.end_silence_ms:
                utterance = np.concatenate(self.frames) if self.frames else np.zeros(0, dtype=np.float32)
                self.frames = []
                self.speaking = False
                self.silence_ms = 0.0
                return utterance

        return None

This is the missing bridge between live audio and STT.
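
Conceptual usage, assuming the helpers defined earlier (should_send_to_ollama, transcribe_utterance) and a vad_is_speech() like the Silero sketch above:

buffer = UtteranceBuffer(sample_rate=16000, end_silence_ms=900)

def on_frame(frame_16k):
    utterance = buffer.push(frame_16k, is_speech=vad_is_speech(frame_16k))
    if utterance is not None:
        text = transcribe_utterance(utterance)
        if should_send_to_ollama(text):
            send_to_ollama(text)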


13. Do not run STT inside the Discord receive callback

Bad:

def on_audio_frame(user_id, pcm_bytes):
    text = transcribe_with_whisper(pcm_bytes)
    send_to_ollama(text)

Good:

def on_audio_frame(user_id, pcm_bytes):
    audio_queue.put_nowait((user_id, pcm_bytes))

Then a worker handles:

decode/convert
→ VAD
→ buffering
→ STT
→ filtering
→ Ollama

This prevents the Discord audio receive path from being blocked by Whisper inference.
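
A minimal worker sketch with a thread and a queue; process_user_frame is a hypothetical stand-in for the VAD → buffer → STT → filter path, and the PCM converter is the one from section 10:

import queue
import threading

audio_queue = queue.Queue()

def stt_worker():
    while True:
        user_id, pcm_bytes = audio_queue.get()
        audio_16k = pcm_s16le_48k_stereo_to_16k_mono_float32(pcm_bytes)
        process_user_frame(user_id, audio_16k)  # VAD → buffer → STT → filter

threading.Thread(target=stt_worker, daemon=True).start()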


14. Recommended controller state machine

A Discord AI VTuber should not be:

transcribe → send to Ollama → speak

It should be a state machine:

LISTENING
  - waiting for speech

USER_SPEAKING
  - VAD says speech is active
  - buffer audio
  - do not call STT yet

ENDPOINT_CANDIDATE
  - silence started
  - wait 700–1200 ms
  - if speech resumes, go back to USER_SPEAKING

TRANSCRIBING
  - run STT on completed utterance

VALIDATING
  - reject empty / too-short / suspicious transcript

THINKING
  - send valid transcript to Ollama

SPEAKING
  - TTS plays into Discord

INTERRUPTED
  - user speaks while bot speaks
  - cancel TTS / LLM output
  - return to listening or user-speaking
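
A skeleton of the transitions, as a sketch using the state names above:

from enum import Enum, auto

class BotState(Enum):
    LISTENING = auto()
    USER_SPEAKING = auto()
    ENDPOINT_CANDIDATE = auto()
    TRANSCRIBING = auto()
    VALIDATING = auto()
    THINKING = auto()
    SPEAKING = auto()
    INTERRUPTED = auto()

state = BotState.LISTENING

def on_vad_frame(is_speech: bool):
    global state
    if state == BotState.LISTENING and is_speech:
        state = BotState.USER_SPEAKING
    elif state == BotState.USER_SPEAKING and not is_speech:
        state = BotState.ENDPOINT_CANDIDATE  # start the 700-1200 ms timer
    elif state == BotState.ENDPOINT_CANDIDATE and is_speech:
        state = BotState.USER_SPEAKING       # speech resumed; cancel endpoint
    elif state == BotState.SPEAKING and is_speech:
        state = BotState.INTERRUPTED         # barge-in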



15. Prevent the bot from hearing itself

This is a common AI VTuber / Discord bot failure.

Bad routing:

Discord output / desktop mix
→ STT input
→ bot hears itself
→ bot replies to itself
→ loop

Better routing:

Human mic or per-user Discord receive
→ STT

Bot TTS
→ Discord output only

Practical safeguards:

  • Do not use desktop audio mix as STT input.
  • Use headphones while testing.
  • Keep STT input and bot TTS output on separate devices/routes.
  • If using Discord receive, ignore the bot’s own user ID.
  • Pause STT while bot is speaking unless implementing proper barge-in.
  • If allowing barge-in, only human speech should interrupt the bot.

16. What to build first

Phase 1 — local mic STT only

local mic
→ VAD
→ utterance buffer
→ faster-whisper
→ print transcript

Pass criteria:

silence produces no transcript
one sentence produces one transcript
partial speech is not sent
empty text is ignored
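
A minimal Phase 1 capture sketch using the sounddevice library, feeding 512-sample frames (to match Silero VAD) into a handler like the on_frame sketch from section 12. In practice, hand completed utterances to a worker thread (section 13) so STT never blocks the audio callback:

# deps:
# pip install sounddevice numpy

import sounddevice as sd

FRAME_SAMPLES = 512  # 32 ms at 16 kHz, the Silero VAD chunk size

def mic_callback(indata, frames, time_info, status):
    if status:
        print("[MIC]", status)
    on_frame(indata[:, 0].copy())  # mono float32 frame into VAD/buffer path

with sd.InputStream(
    samplerate=16000,
    channels=1,
    dtype="float32",
    blocksize=FRAME_SAMPLES,
    callback=mic_callback,
):
    print("Listening... Ctrl+C to stop.")
    sd.sleep(24 * 60 * 60 * 1000)  # keep the stream open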

Phase 2 — add Ollama

local mic
→ STT
→ Ollama
→ print reply

Pass criteria:

Ollama is called only for real speech
blank transcripts are ignored

Phase 3 — add TTS locally

local mic
→ STT
→ Ollama
→ TTS
→ local playback

Pass criteria:

the bot does not hear itself
the bot does not respond to its own voice

Phase 4 — send TTS to Discord

local mic
→ STT
→ Ollama
→ TTS
→ Discord VC output

This gives you a working AI VTuber that speaks in VC without needing Discord receive yet.

Phase 5 — add Discord receive only after the local loop works

First test:

Discord receive
→ save clean WAV per speaker

Only after those WAV files sound correct:

Discord receive
→ per-user VAD
→ per-user STT
→ speaker-labeled transcript

17. How to interpret your exact log

size: 18000

Likely means you are feeding a chunk, not a complete utterance.

Fix:

buffer until VAD says the user finished speaking

max_amplitude: 1.236272

Likely means conversion or normalization needs inspection.

Fix:

save WAV
print dtype/min/max/rms/sample_rate/channels
verify float32 mono 16 kHz

Transcription too short

Probably correct behavior from the STT layer.

Fix:

do not call STT for audio under ~0.8 seconds

RAW STT OUTPUT: ''

Normal possible STT result.

Fix:

treat blank STT as “no event”

Sending to Ollama: '...'

Controller bug.

Fix:

never send empty transcript to Ollama

18. Most likely root cause summary

Your issue is probably not a bad STT model.

It is probably:

tiny chunks
+ wrong or suspicious audio normalization
+ missing VAD/utterance buffering
+ maybe Discord receive instability
+ empty transcript sent to Ollama

So the fix is not:

try random STT models until one works

The fix is:

make the audio valid
make the speech segment complete
make the transcript pass validation
only then call Ollama

19. Practical final recommendation

Use this first:

local mic
→ Silero VAD
→ utterance buffer
→ faster-whisper small.en
→ transcript filter
→ Ollama
→ TTS
→ Discord VC output

Then later, if you really need VC-wide listening:

Discord receive
→ per-user clean WAV verification
→ per-user VAD/STT
→ speaker-labeled transcript
→ Ollama

Do not start with full Discord receive unless you need it. It adds several failure points.



Final summary

  • There is no magic free STT that avoids Discord VC problems by itself.
  • The best free/local stack is Silero VAD + faster-whisper + utterance buffering.
  • Your 18000 samples chunk is probably too short.
  • Your max_amplitude=1.236272 suggests a possible audio conversion/normalization issue.
  • Discord audio must be converted from the Discord voice format into clean mono 16 kHz float32 before STT.
  • Do not transcribe tiny chunks.
  • Do not send empty STT output to Ollama.
  • Use local mic STT first if the bot only needs to hear you.
  • Use Discord receive only after you prove you can save clean per-user WAV files.
  • Treat this as a voice pipeline problem, not just a model-selection problem.

Omg my hero thank u

# Limit audio length
max_samples = 16000 * 15  # 15 seconds max

if len(audio_np) > max_samples:
    audio_np = audio_np[-max_samples:]
    print(f"[TRIM] Trimmed audio to last {max_samples} samples")

Could I turn max_samples up to fix this?

Also, I am using Chatterbox, so I am unsure if that is affecting anything.

:memo: Transcribed text: ‘’
:bust_in_silhouette: You:
:outbox_tray: Sending to Ollama: ‘…’
:inbox_tray: Ollama response status: 200
:page_facing_up: Ollama response data: {‘model’: ‘drivedenpadev/deepseek-v3.2’, ‘created_at’: ‘2026-05-02T00:29:13.5354636Z’, ‘response’: “What’s good, chat? Ready to get this conversation started!”, ‘done’: True, ‘done_reason’: ‘stop’, ‘context’: [128006, 9125, 128007, 271, 38766, 1303, 33025, 2696, 25, 6790, 220, 2366, 18, 271, 2675, 527, 264, 11919, 18328, 13, 128009, 128006, 882, 128007, 1432, 2675, 527, 264, 15526, 34051, 30970, 13, 13969, 31737, 1234, 220, 975, 4339, 13, 2360, 100166, 13, 3298, 3823, 1432, 1502, 25, 720, 15836, 25, 128009, 128006, 78191, 128007, 271, 3923, 596, 1695, 11, 6369, 30, 32082, 311, 636, 420, 10652, 3940, 0], ‘total_duration’: 763554400, ‘load_duration’: 115772300, ‘prompt_eval_count’: 56, ‘prompt_eval_duration’: 45008000, ‘eval_count’: 14, ‘eval_duration’: 592494200}
:speech_balloon: Extracted response: ‘What’s good, chat? Ready to get this conversation started!’
:robot: AI: What’s good, chat? Ready to get this conversation started!
:input_latin_letters: Sanitized text: ‘What’s good, chat? Ready to get this conversation started!’
:musical_note: Generating audio…
2026-05-01 17:29:14,307 - WARNING - CFG, min_p and exaggeration are not supported by Turbo version and will be ignored.
9%|██████▉ | 86/1000 [00:04<00:46, 19.59it/s]
S3 Token → Mel Inference…
100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00, 1.69it/s]
:musical_note: TTS result: sr=24000, audio_shape=(86400,)
:white_check_mark: TTS generated successfully: C:\Users\…\AppData\Local\Temp\tmpqxnxny4k.wav
[CONNECT] (‘127.0.0.1’, 60151)
:memo: Transcribing…
[SEARCH] Audio analysis - size: 5760, max_amplitude: 1.304917
[MIC] Incoming audio | amp=1.304917 | samples=5760
[PROCESS] Processing audio: 5760 samples, 1.304917 max amplitude
[STT] Transcribing with improved local STT…
:microphone: Incoming audio | amp=1.304917 | samples=5760
:musical_note: Processing audio: 5760 samples, 1.304917 max amplitude
:magnifying_glass_tilted_left: Transcribing with simple Whisper STT…
:cross_mark: Transcription error: name ‘transcribe_audio’ is not defined
Traceback (most recent call last):
File “C:\Users\…\Downloads\AliTurbo\vtuber_core_fixed.py”, line 130, in safe_transcribe
NameError: name ‘transcribe_audio’ is not defined
:memo: Transcribed text: ‘’
:bust_in_silhouette: You:
:outbox_tray: Sending to Ollama: ‘…’
:inbox_tray: Ollama response status: 200
:page_facing_up: Ollama response data: {‘model’: ‘drivedenpadev/deepseek-v3.2’, ‘created_at’: ‘2026-05-02T00:29:28.6229976Z’, ‘response’: “What’s up, newbie? Ready to get this chat started?”, ‘done’: True, ‘done_reason’: ‘stop’, ‘context’: [128006, 9125, 128007, 271, 38766, 1303, 33025, 2696, 25, 6790, 220, 2366, 18, 271, 2675, 527, 264, 11919, 18328, 13, 128009, 128006, 882, 128007, 1432, 2675, 527, 264, 15526, 34051, 30970, 13, 13969, 31737, 1234, 220, 975, 4339, 13, 2360, 100166, 13, 3298, 3823, 1432, 1502, 25, 720, 15836, 25, 128009, 128006, 78191, 128007, 271, 3923, 596, 709, 11, 95678, 30, 32082, 311, 636, 420, 6369, 3940, 30], ‘total_duration’: 712363500, ‘load_duration’: 85605400, ‘prompt_eval_count’: 56, ‘prompt_eval_duration’: 45423700, ‘eval_count’: 14, ‘eval_duration’: 569795200}
:speech_balloon: Extracted response: ‘What’s up, newbie? Ready to get this chat started?’
:robot: AI: What’s up, newbie? Ready to get this chat started?
:input_latin_letters: Sanitized text: ‘What’s up, newbie? Ready to get this chat started?’
:musical_note: Generating audio…

still looking at it erm

In this case, the most critical issue is the “short duration of the audio,” so I don’t think adjusting the options passed to Whisper alone will solve the problem.
You’ll likely need to make the WAV file itself longer.

Specifically, given the number of samples in that WAV file, even if the sampling rate is 16 kHz, the audio duration is only about one second; if the sampling rate were higher, it would be less than one second.

“Generating text from audio that’s less than a second long” is probably a bit outside the scope of the model’s design…

Whisper can handle “chunks,” but there’s probably a slight discrepancy between how you and the Whisper model interpret “chunks.”

If you can use a different model that prioritizes real-time performance, that’s a viable option; however, if Whisper is already integrated and you can’t use another model, I think you’ll need to create a buffer.
It comes down to whether you prioritize the final transcription quality or real-time streaming.


Can Whisper transcribe chunks? Could another Hugging Face model handle streaming better?

Short answer

Yes, Whisper can transcribe chunks in the basic sense: you can pass it a short audio array or short WAV segment and it may return text.

But Whisper is not a true streaming ASR model in the sense of:

20 ms Discord audio frame in
→ stable partial transcript out
→ updated transcript as more audio arrives

That difference is the important part.

Your current chunk is roughly:

5760 samples / 16000 Hz = 0.36 seconds

That is extremely short for independent Whisper transcription. It may contain only part of a syllable, a clipped word edge, a breath, keyboard noise, silence, Discord compression residue, or TTS feedback.

So the practical answer is:

Whisper can transcribe chunks, but not tiny independent Discord callback chunks reliably.

For Whisper/faster-whisper, “chunking” should usually mean:

small audio frames
→ buffered into a larger speech window
→ optional VAD trimming
→ optional overlap/stride
→ transcribe meaningful segment

Not:

tiny Discord frame
→ independent STT call
→ send result to Ollama



1. The important distinction: chunks vs streaming

These are not the same thing.

| Term | Meaning | Good fit for tiny Discord chunks? |
| --- | --- | --- |
| Independent chunk transcription | Each audio chunk is treated as a complete standalone clip | Usually bad |
| Chunked transcription with overlap/stride | Larger chunks are decoded with left/right context so boundary errors are reduced | Better |
| Utterance-based STT | VAD detects speech start/end, then STT transcribes the completed utterance | Best first version |
| True streaming ASR | Model keeps state/cache and emits partial/final text incrementally | Best for low-latency live captions |
| Raw Discord frame transcription | Every small callback/frame goes straight to STT | Usually the failure mode |

Your current system seems closest to this:

small audio callback
→ immediate STT
→ empty or bad transcript
→ empty transcript still sent to Ollama

That is the wrong shape for Whisper.


2. Why Whisper struggles with your current chunks

Whisper is a sequence-to-sequence model. It is strong, but it expects enough audio context to infer words.

It works best with something like:

1–15 seconds of speech-like audio
mostly intact word boundaries
reasonable volume
correct sample rate
silence/noise trimmed

It works badly with:

0.12–0.36 seconds of audio
half a word
wrong sample rate
wrong dtype
clipping / over-amplification
silence or no-speech
bot TTS leaking into input
Discord receive artifacts

The Whisper-Streaming paper states the key issue directly: Whisper is not designed for real-time transcription, so the authors built a streaming wrapper around it using local agreement and adaptive latency.

That means:

Whisper can be used in streaming systems,
but Whisper itself is not a native streaming recognizer.

3. What “chunking” should mean for Whisper

Bad Whisper chunking:

chunk 1 alone → text?
chunk 2 alone → text?
chunk 3 alone → text?

Better Whisper chunking:

audio stream
→ collect 1–5 seconds
→ add overlap/padding
→ transcribe
→ commit only stable/final text

Best first version for your AI VTuber:

audio stream
→ VAD detects speech start
→ buffer while user speaks
→ wait for 700–1200 ms silence
→ transcribe the completed utterance
→ reject empty/garbage
→ send valid text to Ollama

This is not “true streaming,” but it is usually the best first working design for a Discord AI VTuber.


4. Why overlap/stride matters

If you cut audio at arbitrary boundaries, words get chopped.

Example:

chunk 1: "can you hea"
chunk 2: "r me now"

A model may misread both chunks because neither one has the full word boundary context.

Hugging Face’s ASR chunking guide explains this for CTC models such as Wav2Vec2: chunks are decoded with stride/overlap so the model has context around the cut points, and the unreliable edges can be dropped/merged.

The same general idea matters for Whisper too, even though Whisper is not CTC-based:

do not decode arbitrary tiny independent slices

Use:

VAD padding
overlap
larger windows
or utterance-level transcription

5. Can another Hugging Face model handle chunks better?

Yes, but the details matter.

There are three realistic paths:

  1. Stay with Whisper/faster-whisper, but add VAD + utterance buffering.
  2. Try CTC models like Wav2Vec2 with chunking/stride.
  3. Use true streaming ASR models like NVIDIA Parakeet/Nemotron-style RNN-T/FastConformer models.

6. Option A — Stay with faster-whisper + VAD buffering

This is still my recommended first fix.

Use:

Silero VAD
+ utterance buffer
+ faster-whisper
+ transcript validation


Why this is best first:

| Reason | Explanation |
| --- | --- |
| Easier setup | Much easier than integrating a true streaming ASR runtime |
| Good accuracy | Whisper-family models are strong when audio is clean |
| Good enough latency | Utterance-based latency is acceptable for conversational bots |
| Fewer moving parts | You can debug audio conversion, VAD, STT, and Ollama separately |

Recommended flow:

Discord/local mic chunks
→ convert to mono 16 kHz float32
→ VAD
→ buffer complete utterance
→ faster-whisper
→ reject blank/garbage
→ Ollama

This will likely solve more of your current issue than switching models immediately.


7. Option B — Wav2Vec2 / CTC models with chunking + stride

CTC models can be more natural for chunking than Whisper.

Examples:

Wav2Vec2
HuBERT
WavLM-style ASR checkpoints

Why CTC models can work better for chunked audio:

  • they produce frame-level logits,
  • overlapping chunks can be merged more naturally,
  • boundary handling is simpler than seq2seq decoding,
  • Hugging Face pipelines support chunking/stride for many CTC ASR models.


Example shape:

from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="facebook/wav2vec2-base-960h",
)

result = pipe(
    audio_16k,
    chunk_length_s=5,
    stride_length_s=1,
)

But this still does not mean:

0.36-second Discord chunk → independent transcript

It means:

larger windows
+ overlap/stride
+ merge outputs

CTC chunking may be worth testing, but it does not remove the need for:

audio conversion
buffering
VAD/endpointing
empty transcript filtering
feedback prevention

8. Option C — True streaming ASR models

This is the route closest to what you are asking for.

Streaming ASR models are designed around:

small incoming chunks
+ preserved model state/cache
+ partial/final transcript updates

Common architectures include:

RNN-T / Transducer
Conformer / FastConformer
cache-aware streaming encoders

These are better suited to live voice agents than naive Whisper-per-chunk.


NVIDIA Parakeet Unified

nvidia/parakeet-unified-en-0.6b is a strong example.

The model card describes it as an English ASR model based on transducer architecture / RNN-T / FastConformer, supporting both offline and streaming inference in one model. It also mentions a minimum latency of 160 ms and configurable streaming chunk sizes from 2080 ms down to 160 ms in 80 ms steps.

Why it matters:

This is much closer to “streaming ASR” than plain Whisper.

Caveat:

You still need to integrate its streaming/buffered-streaming API correctly.
Do not call it independently on every Discord chunk as if each chunk is a complete utterance.

NVIDIA Nemotron Speech Streaming

NVIDIA’s Nemotron Speech streaming ASR is another relevant route.

The Hugging Face blog describes cache-aware streaming inference for voice agents, with latency modes such as 80 ms, 160 ms, 560 ms, and 1.12 s. It also explains why cache-aware streaming is more efficient than repeatedly re-encoding overlapping windows.

Why it matters:

This is the kind of architecture built for live voice agents.

Caveat:

It is more engineering-heavy than faster-whisper.
Expect NeMo/runtime-specific setup and more integration work.

9. Practical comparison

| Approach | Can handle tiny chunks directly? | Setup difficulty | Good for Discord AI VTuber? | Recommendation |
| --- | --- | --- | --- | --- |
| Naive Whisper per chunk | No | Low | Bad | Avoid |
| faster-whisper + VAD utterance buffering | Not directly; buffers into utterances | Medium-low | Good | Best first working route |
| Whisper-Streaming | More streaming-like, with local agreement | Medium-high | Good if you need partials | Try after basic STT works |
| Wav2Vec2/CTC + chunk/stride | Better chunk merging than Whisper | Medium | Maybe | Worth testing |
| NVIDIA Parakeet/Nemotron streaming | Yes, designed for streaming modes | Higher | Strong candidate | Best true-streaming HF route |
| Cloud STT | Yes | Low-medium | Technically good | Not free/local long-term |

10. Important: streaming ASR still needs a controller

Even with a true streaming model, you still need:

correct audio conversion
RMS/peak validation
VAD or endpointing
partial/final transcript handling
empty transcript filtering
bot_is_speaking guard
TTS feedback prevention
Discord receive debugging

Streaming ASR can help with this:

tiny chunks are too small for independent Whisper transcription

It does not automatically fix this:

empty transcript sent to Ollama
bot hears itself
wrong sample rate
bad amplitude
Discord receive broken
missing transcribe_audio function

Your current logs show controller/audio-path issues clearly, so switching models first may hide the real bug.


11. What I would do in your exact case

Step 1 — Fix the current pipeline first

Before changing models, fix these:

define transcribe_audio()
block empty Ollama calls
save actual STT input WAVs
validate sample rate / duration / RMS / peak
prevent Chatterbox feedback

Your current error:

NameError: name 'transcribe_audio' is not defined

means model choice is not the first blocker.


Step 2 — Make faster-whisper work on full utterances

Use:

audio chunks
→ mono 16 kHz conversion
→ VAD
→ utterance buffer
→ faster-whisper
→ reject blank
→ Ollama

This is the best first stable version.


Step 3 — If you need live partial captions, try Whisper-Streaming

This keeps Whisper-like behavior but adds a streaming policy layer.


Step 4 — If you need true low-latency streaming, test Parakeet/Nemotron

These are better for true streaming, but they are more complex than faster-whisper.


12. Best mental model

Whisper/faster-whisper route

small chunks
→ buffer into complete utterance
→ transcribe utterance

Streaming ASR route

small chunks
→ preserve model state/cache
→ emit partial/final transcript
→ controller decides when text is final enough

Bad route

small chunk
→ independent STT call
→ send result to Ollama

That last route is what you should avoid.


Bottom line

Whisper can transcribe chunks, but it is bad at transcribing tiny independent live chunks like 5760 samples.

If you want to stay simple:

Use faster-whisper + Silero VAD + utterance buffering.

If you want streaming behavior while keeping Whisper-like transcription:

Try Whisper-Streaming.

If you want a real Hugging Face streaming ASR model:

Try NVIDIA Parakeet Unified or Nemotron Speech Streaming.

But do not skip the basics:

define transcribe_audio()
block empty Ollama calls
buffer audio
validate audio
save debug WAVs
prevent TTS feedback

A streaming model can improve latency. It will not fix a broken controller or bad audio path.

There’s some other error in that log…


Why raising max_samples will not fix short Discord STT chunks

Short answer

No, simply turning max_samples up will not fix your current issue.

Your code:

# Limit audio length
max_samples = 16000 * 15  # 15 seconds max

if len(audio_np) > max_samples:
    audio_np = audio_np[-max_samples:]
    print(f"[TRIM] Trimmed audio to last {max_samples} samples")

only handles audio that is too long.

It says:

If audio is longer than 15 seconds, trim it down to the last 15 seconds.

It does not say:

Wait until I have enough audio before transcribing.

Your new log shows the opposite problem:

[SEARCH] Audio analysis - size: 5760, max_amplitude: 1.304917
[MIC] Incoming audio | amp=1.304917 | samples=5760
[PROCESS] Processing audio: 5760 samples, 1.304917 max amplitude
[STT] Transcribing with improved local STT...
...
Transcription error: name 'transcribe_audio' is not defined
...
Transcribed text: ''
Sending to Ollama: '...'

5760 samples is very short.

At different sample rates, that means:

| Sample rate assumption | Duration |
| --- | --- |
| 16 kHz | 5760 / 16000 = 0.36 s |
| 24 kHz | 5760 / 24000 = 0.24 s |
| 48 kHz | 5760 / 48000 = 0.12 s |

So raising max_samples from 15 seconds to 30 seconds would not help. Your audio is not being cut because it is too long. It is being sent to STT before enough speech has accumulated.

What you need is not a bigger maximum. You need:

minimum duration gate
+ audio chunk buffering
+ VAD / speech-end detection
+ transcript validation
+ empty transcript blocking


What your new log says

You now have several separate problems at the same time.

1. The audio chunk is too short

This line matters:

samples=5760

At 16 kHz, that is only 0.36 seconds.

That is not a complete utterance. It might be a breath, half a syllable, background noise, a clipped word, or a small piece of the bot’s own audio.

Whisper-style models are not good at:

tiny fragment in
→ reliable transcript out

Whisper-style models are better at:

complete speech segment in
→ transcript out

The Whisper-Streaming paper is relevant because it explicitly says Whisper is not designed for native real-time transcription. It wraps Whisper with a streaming policy so it can work on live/unsegmented speech.

For your bot, the practical translation is:

Do not transcribe tiny chunks.
Buffer chunks into completed speech turns.

2. The audio amplitude is still suspicious

Your log says:

max_amplitude: 1.304917

For normalized float audio going into STT, you usually want roughly:

-1.0 to +1.0

A peak above 1.0 can happen if there is gain/normalization, but it is suspicious enough to inspect. It may mean:

| Possible issue | Result |
| --- | --- |
| int16 PCM converted incorrectly | static / garbage waveform |
| stereo interleaved audio treated as mono | distorted audio |
| gain too high | clipping |
| double normalization | harsh waveform |
| wrong dtype | nonsense values |
| wrong sample-rate path | sped-up or slowed-down speech |

This can explain “when it works, it is off as heck.”

Before changing models, save the exact STT input as a WAV and listen to it.


3. Your STT function path is broken

This is a hard code bug:

NameError: name 'transcribe_audio' is not defined

That means your code tried to call:

transcribe_audio(...)

but no such function exists in that scope.

So that run did not prove anything about Whisper quality. The STT path crashed before a real transcription could happen.

You need either:

def transcribe_audio(audio_16k):
    ...

or change your code to call the function that actually exists.

Example:

def safe_transcribe(audio_16k):
    try:
        return transcribe_audio(audio_16k)
    except Exception as e:
        print(f"[STT] Transcription error: {e}")
        return ""

If transcribe_audio is not defined, every STT attempt becomes an empty transcript.


4. Empty transcripts are still being sent to Ollama

This is the most important controller bug.

Your log shows:

Transcribed text: ''
You:
Sending to Ollama: '...'
Ollama response status: 200
...
AI: What's good, chat? Ready to get this conversation started!

That means:

STT failed
→ empty text
→ sent to Ollama anyway
→ Ollama generated a generic opener
→ TTS generated audio

That creates a loop where the AI responds even though no valid user speech was heard.

This must be blocked.


What max_samples actually does

Your current code:

max_samples = 16000 * 15

if len(audio_np) > max_samples:
    audio_np = audio_np[-max_samples:]

means:

Keep at most the last 15 seconds.

It is an upper cap.

It only triggers when:

len(audio_np) > 240000

But your log has:

len(audio_np) = 5760

So:

5760 > 240000  # False

Nothing happens.

What you actually need

You need a minimum:

min_samples = int(16000 * 0.8)

if len(audio_np) < min_samples:
    print("Too short; keep buffering instead of transcribing.")
    return ""

But even that is only a guard. The real fix is buffering.


The correct idea: concatenate chunks before STT

Your incoming chunks are tiny. That is normal for real-time audio.

The wrong pipeline is:

chunk 1 → STT
chunk 2 → STT
chunk 3 → STT
chunk 4 → STT

The better pipeline is:

chunk 1
+ chunk 2
+ chunk 3
+ chunk 4
+ ...
→ enough speech collected
→ STT once

The best pipeline is:

chunks
→ VAD detects speech start
→ buffer while user speaks
→ VAD detects enough silence
→ finalize utterance
→ STT once

That is the difference between chunk transcription and utterance transcription.


Minimal fix order

Do these in this order.

1. Define or correctly call transcribe_audio

Your log has:

NameError: name 'transcribe_audio' is not defined

Fix that first.

Example wrapper:

def transcribe_audio(audio_16k):
    return transcribe_with_faster_whisper(audio_16k)

Or rename the call:

# Wrong if transcribe_audio does not exist:
text = transcribe_audio(audio_np)

# Right if this is the function that actually exists:
text = transcribe_with_faster_whisper(audio_np)

Until this is fixed, STT cannot work.


2. Stop sending empty transcripts to Ollama

Add this immediately:

def should_send_to_ollama(text: str) -> bool:
    text = (text or "").strip()

    if not text:
        return False

    if len(text) < 2:
        return False

    bad_outputs = {
        ".",
        "...",
        "you",
        "thank you",
        "thanks for watching",
        "subscribe",
    }

    if text.lower() in bad_outputs:
        return False

    return True

Use it before every Ollama call:

text = safe_transcribe(audio_np)

if not should_send_to_ollama(text):
    print("[CTRL] Empty/invalid transcript; not sending to Ollama.")
    return

send_to_ollama(text)

This prevents:

blank STT
→ generic AI greeting
→ TTS
→ possible feedback loop

3. Add audio validation before STT

Use this before calling Whisper/faster-whisper:

import numpy as np

def valid_audio_for_stt(audio_16k, sr=16000):
    audio_16k = np.asarray(audio_16k, dtype=np.float32)

    duration = len(audio_16k) / sr
    peak = float(np.max(np.abs(audio_16k))) if len(audio_16k) else 0.0
    rms = float(np.sqrt(np.mean(audio_16k ** 2))) if len(audio_16k) else 0.0

    if duration < 0.8:
        return False, f"too short: {duration:.2f}s"

    if peak < 0.015:
        return False, f"too quiet: peak={peak:.4f}"

    if rms < 0.003:
        return False, f"too quiet: rms={rms:.4f}"

    if peak > 1.05:
        return False, f"bad normalization: peak={peak:.4f}"

    return True, "ok"

Your current chunk would probably fail:

samples=5760
peak=1.304917

That is good. Bad audio should be rejected before STT.


Simple concatenation buffer

This is not the final ideal version, but it is a useful first patch.

import numpy as np

class RollingSTTBuffer:
    def __init__(self, sample_rate=16000, min_seconds=1.0, max_seconds=15.0):
        self.sample_rate = sample_rate
        self.min_samples = int(sample_rate * min_seconds)
        self.max_samples = int(sample_rate * max_seconds)
        self.buffer = np.zeros(0, dtype=np.float32)

    def add(self, chunk):
        chunk = np.asarray(chunk, dtype=np.float32)
        self.buffer = np.concatenate([self.buffer, chunk])

        if len(self.buffer) > self.max_samples:
            self.buffer = self.buffer[-self.max_samples:]

    def ready(self):
        return len(self.buffer) >= self.min_samples

    def pop(self):
        audio = self.buffer
        self.buffer = np.zeros(0, dtype=np.float32)
        return audio

Usage:

stt_buffer = RollingSTTBuffer(
    sample_rate=16000,
    min_seconds=1.0,
    max_seconds=15.0,
)

def handle_audio_chunk(chunk_16k):
    stt_buffer.add(chunk_16k)

    if not stt_buffer.ready():
        print("[BUFFER] Not enough audio yet.")
        return

    audio_for_stt = stt_buffer.pop()

    text = safe_transcribe(audio_for_stt)

    if not should_send_to_ollama(text):
        return

    send_to_ollama(text)

This proves whether concatenating chunks helps.

But it has a weakness: it transcribes after a fixed amount of audio, not after the user actually finishes speaking.

The better solution is VAD-based buffering.


Better solution: VAD-based utterance buffering

Use VAD to decide:

speech started
speech continued
speech ended

Then transcribe the completed utterance.

Recommended tools:

Silero VAD is useful because it supports 8 kHz and 16 kHz audio and is designed for fast chunk-level speech detection.

VAD-based utterance buffer

import numpy as np

class UtteranceBuffer:
    def __init__(
        self,
        sample_rate=16000,
        min_speech_seconds=0.8,
        end_silence_ms=900,
        max_seconds=15.0,
    ):
        self.sample_rate = sample_rate
        self.min_speech_samples = int(sample_rate * min_speech_seconds)
        self.end_silence_ms = end_silence_ms
        self.max_samples = int(sample_rate * max_seconds)

        self.frames = []
        self.speaking = False
        self.silence_ms = 0.0
        self.speech_samples = 0

    def _frame_ms(self, frame):
        return 1000.0 * len(frame) / self.sample_rate

    def push(self, frame_16k, is_speech: bool):
        frame_16k = np.asarray(frame_16k, dtype=np.float32)

        if is_speech:
            self.speaking = True
            self.silence_ms = 0.0
            self.speech_samples += len(frame_16k)
            self.frames.append(frame_16k)

        elif self.speaking:
            self.silence_ms += self._frame_ms(frame_16k)
            self.frames.append(frame_16k)

        else:
            return None

        audio = (
            np.concatenate(self.frames)
            if self.frames
            else np.zeros(0, dtype=np.float32)
        )

        if len(audio) > self.max_samples:
            audio = audio[-self.max_samples:]
            self.frames = [audio]

        if self.speaking and self.silence_ms >= self.end_silence_ms:
            utterance = (
                np.concatenate(self.frames)
                if self.frames
                else np.zeros(0, dtype=np.float32)
            )

            enough_speech = self.speech_samples >= self.min_speech_samples

            self.frames = []
            self.speaking = False
            self.silence_ms = 0.0
            self.speech_samples = 0

            if not enough_speech:
                print("[VAD] Dropped utterance: too little speech")
                return None

            return utterance

        return None

Conceptual usage:

utt_buffer = UtteranceBuffer(sample_rate=16000)

def process_audio_frame(frame_16k):
    is_speech = vad_is_speech(frame_16k)  # implement with Silero/WebRTC/etc.

    utterance = utt_buffer.push(frame_16k, is_speech=is_speech)

    if utterance is None:
        return ""

    text = safe_transcribe(utterance)

    if not should_send_to_ollama(text):
        return ""

    send_to_ollama(text)
    return text

This is the direction you want.


faster-whisper starter config

Use faster-whisper instead of a hand-rolled “simple Whisper STT” path if possible.

Example:

from faster_whisper import WhisperModel
import numpy as np

model = WhisperModel(
    "small.en",
    device="cpu",        # use "cuda" if available
    compute_type="int8", # use "float16" on CUDA
)

def transcribe_with_faster_whisper(audio_16k: np.ndarray) -> str:
    audio_16k = np.asarray(audio_16k, dtype=np.float32)
    audio_16k = np.clip(audio_16k, -1.0, 1.0)

    ok, reason = valid_audio_for_stt(audio_16k, sr=16000)
    if not ok:
        print("[STT] Skipping:", reason)
        return ""

    segments, info = model.transcribe(
        audio_16k,
        language="en",
        task="transcribe",
        beam_size=1,
        temperature=0.0,
        condition_on_previous_text=False,
        vad_filter=True,
        vad_parameters={
            "min_silence_duration_ms": 700,
            "speech_pad_ms": 300,
        },
        no_speech_threshold=0.6,
        compression_ratio_threshold=1.35,
        log_prob_threshold=-1.0,
    )

    return " ".join(seg.text.strip() for seg in segments).strip()

Why these settings help:

| Setting | Reason |
| --- | --- |
| language="en" | Avoids unstable language detection on short clips |
| task="transcribe" | Prevents accidental translation |
| beam_size=1 | Lower latency |
| temperature=0.0 | More deterministic |
| condition_on_previous_text=False | Reduces carry-over hallucination between short turns |
| vad_filter=True | Extra silence cleanup |
| min_silence_duration_ms=700 | Reasonable conversational silence threshold |
| speech_pad_ms=300 | Avoids cutting word edges |
| no_speech_threshold=0.6 | Helps ignore no-speech chunks |
| compression_ratio_threshold=1.35 | Helps catch repetitive hallucinations |
| log_prob_threshold=-1.0 | Helps catch low-confidence output |

Save the exact STT input as WAV

This is still the most important debug step.

# deps:
# pip install soundfile numpy

import numpy as np
import soundfile as sf
from pathlib import Path

debug_dir = Path("debug_stt")
debug_dir.mkdir(exist_ok=True)

def save_debug_wav(audio, sr, filename):
    audio = np.asarray(audio, dtype=np.float32)
    audio = np.clip(audio, -1.0, 1.0)
    sf.write(debug_dir / filename, audio, sr)

Use it right before STT:

save_debug_wav(audio_16k, 16000, "actual_stt_input.wav")

Then listen.

| What the WAV sounds like | Diagnosis |
| --- | --- |
| Silence | wrong source / VAD issue / Discord receive issue |
| Static | dtype/decode issue |
| Fast voice | sample-rate mismatch |
| Slow voice | sample-rate mismatch |
| Distorted/clipped | normalization/gain issue |
| Half-word only | chunking problem |
| Bot voice | TTS feedback loop |
| Multiple speakers | need per-user buffers |
| Clean sentence | STT settings/model issue |

Do not skip this. It usually reveals the actual problem faster than changing models.


Sample rates in your system

You now likely have several sample rates:

STT target: 16000 Hz
Chatterbox TTS output: 24000 Hz
Discord voice audio: commonly 48000 Hz stereo/Opus/PCM path

Your log says:

TTS result: sr=24000, audio_shape=(86400,)

That is:

86400 / 24000 = 3.6 seconds

Chatterbox generated 3.6 seconds of TTS audio.

That audio should go to the TTS/playback path, not the STT input path.

Keep these separate:

Input audio → 16 kHz mono → STT
TTS audio → Discord playback format → Discord output

Do not let Chatterbox/TTS output leak into your mic/STT input.
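
For the playback side, here is a sketch converting 24 kHz mono Chatterbox output to the 48 kHz stereo 16-bit PCM that Discord playback paths commonly expect:

# deps:
# pip install numpy scipy

import numpy as np
from scipy.signal import resample_poly

def tts_24k_to_discord_pcm(audio_24k) -> bytes:
    audio_24k = np.asarray(audio_24k, dtype=np.float32)
    audio_48k = resample_poly(audio_24k, up=2, down=1)    # 24 kHz → 48 kHz
    audio_48k = np.clip(audio_48k, -1.0, 1.0)
    stereo = np.repeat(audio_48k[:, None], 2, axis=1)     # mono → stereo
    return (stereo * 32767.0).astype(np.int16).tobytes()  # float32 → s16le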


Is Chatterbox affecting STT?

Probably not directly.

Chatterbox is TTS. It generates speech. It does not transcribe speech.

But it can affect your STT system indirectly in three ways.

1. Feedback loop

If the bot’s generated voice is captured by your mic or virtual audio cable, the STT system may hear the bot instead of you.

Bad routing:

TTS output
→ speakers / desktop mix / virtual cable
→ STT input
→ bot hears itself
→ bot replies to itself

Better routing:

Human mic or per-user Discord receive
→ STT

Bot TTS
→ Discord output only

2. Sample-rate confusion

Chatterbox output is 24 kHz in your log.

STT should usually get 16 kHz mono.

Discord playback often involves 48 kHz audio.

So do not reuse one conversion path for everything.

3. The Turbo warning is not your STT bug

Your log says:

WARNING - CFG, min_p and exaggeration are not supported by Turbo version and will be ignored.

That warning is about Chatterbox Turbo TTS settings. It means those generation settings are ignored by the Turbo model.

That warning can affect TTS behavior/customization, but it does not explain blank STT.


Add a bot_is_speaking guard while debugging

For the first stable version, disable listening while the bot speaks.

bot_is_speaking = False

Around TTS playback:

bot_is_speaking = True
play_tts_audio(...)
bot_is_speaking = False

In audio handling:

def handle_audio_chunk(chunk_16k):
    if bot_is_speaking:
        print("[AUDIO] Ignoring input while bot is speaking.")
        return

    # continue STT path

This disables barge-in, but it prevents feedback while debugging.

Later, implement real barge-in:

if human starts speaking while bot speaks:
    stop TTS
    clear playback queue
    cancel current LLM/TTS response
    return to listening
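
A minimal interruption sketch using a threading.Event as the stop signal; play_chunk and cancel_llm_request are hypothetical placeholders for your playback and Ollama-cancel calls:

import threading

tts_stop = threading.Event()

def play_tts_audio(chunks):
    tts_stop.clear()
    for chunk in chunks:
        if tts_stop.is_set():
            break            # human barged in; stop playback now
        play_chunk(chunk)    # hypothetical playback call

def on_barge_in():
    tts_stop.set()           # stop TTS
    cancel_llm_request()     # hypothetical: abort the in-flight response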

Live voice-agent systems treat turn detection and interruption handling as separate concerns.


Better logs to add

Your logs should include:

sample_rate
samples
duration_seconds
min
max
peak
rms
bot_is_speaking
buffer_size
vad_state
utterance_ready
stt_called
ollama_called

Example logging helper:

import numpy as np

def log_audio_debug(label, audio, sr):
    audio = np.asarray(audio, dtype=np.float32)
    duration = len(audio) / sr if sr else 0.0
    peak = float(np.max(np.abs(audio))) if len(audio) else 0.0
    rms = float(np.sqrt(np.mean(audio ** 2))) if len(audio) else 0.0

    print(
        f"[{label}] sr={sr} samples={len(audio)} "
        f"duration={duration:.3f}s peak={peak:.4f} rms={rms:.4f}"
    )

Healthy logs should look like:

[MIC] sr=16000 samples=320 duration=0.020s peak=0.12 rms=0.02
[VAD] speech_start
[BUFFER] speech_ms=1240 silence_ms=0
[VAD] endpoint after silence_ms=900
[UTTERANCE] sr=16000 samples=35680 duration=2.23s peak=0.44 rms=0.06
[STT] text="can you hear me now"
[OLLAMA] sending valid transcript

Unhealthy logs look like:

samples=5760
transcribe immediately
NameError
empty text
send to Ollama anyway

Discord-specific note

If you are receiving audio from Discord VC, remember that Discord receive is its own fragile layer.

Check the Discord voice docs and your library’s voice-receive extension; Pycord, for example, warns that recording/listening may be affected by DAVE end-to-end encryption.

Before debugging STT, prove Discord receive works by saving clean WAV files:

Discord receive
→ decode/convert
→ save WAV
→ listen manually

Only after the WAV sounds correct should you send it into STT.


Recommended build order

Phase 1: local mic STT only

local mic
→ VAD
→ utterance buffer
→ faster-whisper
→ print transcript

Pass criteria:

silence produces no transcript
one sentence produces one transcript
partial speech is not sent
empty text is ignored

Phase 2: add Ollama

local mic
→ STT
→ Ollama
→ print reply

Pass criteria:

Ollama is called only for real speech
blank transcripts are ignored

Phase 3: add Chatterbox locally

local mic
→ STT
→ Ollama
→ Chatterbox TTS
→ local playback

Pass criteria:

the bot does not hear itself
the bot does not respond to its own voice

Phase 4: send TTS to Discord

local mic
→ STT
→ Ollama
→ Chatterbox TTS
→ Discord VC output

Phase 5: add Discord receive later

First:

Discord receive
→ save clean WAV per speaker

Then:

Discord receive
→ per-user VAD
→ per-user STT
→ speaker-labeled transcript

Do not start with full Discord receive unless you need it. It adds several failure points.


Final summary

Turning max_samples up will not fix this because max_samples is an upper cap, not a minimum buffer target.

Your immediate problems are:

1. You are calling STT on tiny chunks like 5760 samples.
2. Your waveform amplitude is suspiciously above 1.0.
3. Your code is calling a missing function: transcribe_audio.
4. Empty transcripts are still being sent to Ollama.
5. Chatterbox may be feeding back into STT if audio routing is not separated.

Fix order:

1. Define or correctly call transcribe_audio().
2. Stop sending empty transcripts to Ollama.
3. Add minimum-duration/RMS/peak validation before STT.
4. Concatenate chunks into a buffer.
5. Replace fixed buffering with VAD-based utterance buffering.
6. Keep Chatterbox/TTS output out of the STT input path.
7. Use faster-whisper with vad_filter=True and condition_on_previous_text=False.

The core rule:

Do not make Whisper transcribe chunks.
Make Whisper transcribe completed utterances.