Why raising max_samples will not fix short Discord STT chunks
Short answer
No, simply turning max_samples up will not fix your current issue.
Your code:
# Limit audio length
max_samples = 16000 * 15  # 15 seconds max
if len(audio_np) > max_samples:
    audio_np = audio_np[-max_samples:]
    print(f"[TRIM] Trimmed audio to last {max_samples} samples")
only handles audio that is too long.
It says:
If audio is longer than 15 seconds, trim it down to the last 15 seconds.
It does not say:
Wait until I have enough audio before transcribing.
Your new log shows the opposite problem:
[SEARCH] Audio analysis - size: 5760, max_amplitude: 1.304917
[MIC] Incoming audio | amp=1.304917 | samples=5760
[PROCESS] Processing audio: 5760 samples, 1.304917 max amplitude
[STT] Transcribing with improved local STT...
...
Transcription error: name 'transcribe_audio' is not defined
...
Transcribed text: ''
Sending to Ollama: '...'
5760 samples is very short.
At different sample rates, that means:
| Sample rate assumption | Duration |
| --- | --- |
| 16 kHz | 5760 / 16000 = 0.36 s |
| 24 kHz | 5760 / 24000 = 0.24 s |
| 48 kHz | 5760 / 48000 = 0.12 s |
So raising max_samples from 15 seconds to 30 seconds would not help. Your audio is not being cut because it is too long. It is being sent to STT before enough speech has accumulated.
What you need is not a bigger maximum. You need:
minimum duration gate
+ audio chunk buffering
+ VAD / speech-end detection
+ transcript validation
+ empty transcript blocking
What your new log says
You now have several separate problems at the same time.
1. The audio chunk is too short
This line matters:
samples=5760
At 16 kHz, that is only 0.36 seconds.
That is not a complete utterance. It might be a breath, half a syllable, background noise, a clipped word, or a small piece of the bot’s own audio.
Whisper-style models are not good at:
tiny fragment in
→ reliable transcript out
Whisper-style models are better at:
complete speech segment in
→ transcript out
The Whisper-Streaming paper is relevant because it explicitly says Whisper is not designed for native real-time transcription. It wraps Whisper with a streaming policy so it can work on live/unsegmented speech.
For your bot, the practical translation is:
Do not transcribe tiny chunks.
Buffer chunks into completed speech turns.
2. The audio amplitude is still suspicious
Your log says:
max_amplitude: 1.304917
For normalized float audio going into STT, you usually want roughly:
-1.0 to +1.0
A peak above 1.0 can happen if there is gain/normalization, but it is suspicious enough to inspect. It may mean:
| Possible issue | Result |
| --- | --- |
| int16 PCM converted incorrectly | static / garbage waveform |
| stereo interleaved audio treated as mono | distorted audio |
| gain too high | clipping |
| double normalization | harsh waveform |
| wrong dtype | nonsense values |
| wrong sample-rate path | sped-up or slowed-down speech |
This can explain “when it works, it is off as heck.”
Before changing models, save the exact STT input as a WAV and listen to it.
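If the conversion itself is the culprit, a minimal decode sketch looks like this (assuming Discord-style 48 kHz stereo int16 PCM and scipy for resampling; adjust to whatever your receive path actually delivers):

import numpy as np
from scipy.signal import resample_poly  # pip install scipy

def discord_pcm_to_stt_input(pcm_bytes: bytes) -> np.ndarray:
    # int16 PCM -> float32 in [-1.0, 1.0]
    samples = np.frombuffer(pcm_bytes, dtype=np.int16).astype(np.float32) / 32768.0
    # interleaved stereo -> mono by averaging the two channels
    mono = samples.reshape(-1, 2).mean(axis=1)
    # 48 kHz -> 16 kHz with a proper polyphase filter, not by dropping samples
    return resample_poly(mono, up=1, down=3).astype(np.float32)

Each of those steps, done wrongly, maps to one of the symptoms in the table above.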
3. Your STT function path is broken
This is a hard code bug:
NameError: name 'transcribe_audio' is not defined
That means your code tried to call:
transcribe_audio(...)
but no such function exists in that scope.
So that run did not prove anything about Whisper quality. The STT path crashed before a real transcription could happen.
You need either:
def transcribe_audio(audio_16k):
    ...
or change your code to call the function that actually exists.
Example:
def safe_transcribe(audio_16k):
    try:
        return transcribe_audio(audio_16k)
    except Exception as e:
        print(f"[STT] Transcription error: {e}")
        return ""
If transcribe_audio is not defined, every STT attempt becomes an empty transcript.
4. Empty transcripts are still being sent to Ollama
This is the most important controller bug.
Your log shows:
Transcribed text: ''
You:
Sending to Ollama: '...'
Ollama response status: 200
...
AI: What's good, chat? Ready to get this conversation started!
That means:
STT failed
→ empty text
→ sent to Ollama anyway
→ Ollama generated a generic opener
→ TTS generated audio
That creates a loop where the AI responds even though no valid user speech was heard.
This must be blocked.
What max_samples actually does
Your current code:
max_samples = 16000 * 15
if len(audio_np) > max_samples:
    audio_np = audio_np[-max_samples:]
means:
Keep at most the last 15 seconds.
It is an upper cap.
It only triggers when:
len(audio_np) > 240000
But your log has:
len(audio_np) = 5760
So:
5760 > 240000 # False
Nothing happens.
What you actually need
You need a minimum:
min_samples = int(16000 * 0.8)
if len(audio_np) < min_samples:
    print("Too short; keep buffering instead of transcribing.")
    return ""
But even that is only a guard. The real fix is buffering.
The correct idea: concatenate chunks before STT
Your incoming chunks are tiny. That is normal for real-time audio.
The wrong pipeline is:
chunk 1 → STT
chunk 2 → STT
chunk 3 → STT
chunk 4 → STT
The better pipeline is:
chunk 1
+ chunk 2
+ chunk 3
+ chunk 4
+ ...
→ enough speech collected
→ STT once
The best pipeline is:
chunks
→ VAD detects speech start
→ buffer while user speaks
→ VAD detects enough silence
→ finalize utterance
→ STT once
That is the difference between chunk transcription and utterance transcription.
Minimal fix order
Do these in this order.
1. Define or correctly call transcribe_audio
Your log has:
NameError: name 'transcribe_audio' is not defined
Fix that first.
Example wrapper:
def transcribe_audio(audio_16k):
    return transcribe_with_faster_whisper(audio_16k)
Or rename the call:
# Wrong if transcribe_audio does not exist:
text = transcribe_audio(audio_np)
# Right if this is the function that actually exists:
text = transcribe_with_faster_whisper(audio_np)
Until this is fixed, STT cannot work.
2. Stop sending empty transcripts to Ollama
Add this immediately:
def should_send_to_ollama(text: str) -> bool:
    text = (text or "").strip()
    if not text:
        return False
    if len(text) < 2:
        return False
    bad_outputs = {
        ".",
        "...",
        "you",
        "thank you",
        "thanks for watching",
        "subscribe",
    }
    if text.lower() in bad_outputs:
        return False
    return True
Use it before every Ollama call:
text = safe_transcribe(audio_np)
if not should_send_to_ollama(text):
    print("[CTRL] Empty/invalid transcript; not sending to Ollama.")
    return
send_to_ollama(text)
This prevents:
blank STT
→ generic AI greeting
→ TTS
→ possible feedback loop
3. Add audio validation before STT
Use this before calling Whisper/faster-whisper:
import numpy as np

def valid_audio_for_stt(audio_16k, sr=16000):
    audio_16k = np.asarray(audio_16k, dtype=np.float32)
    duration = len(audio_16k) / sr
    peak = float(np.max(np.abs(audio_16k))) if len(audio_16k) else 0.0
    rms = float(np.sqrt(np.mean(audio_16k ** 2))) if len(audio_16k) else 0.0
    if duration < 0.8:
        return False, f"too short: {duration:.2f}s"
    if peak < 0.015:
        return False, f"too quiet: peak={peak:.4f}"
    if rms < 0.003:
        return False, f"too quiet: rms={rms:.4f}"
    if peak > 1.05:
        return False, f"bad normalization: peak={peak:.4f}"
    return True, "ok"
Your current chunk would probably fail:
samples=5760
peak=1.304917
That is good. Bad audio should be rejected before STT.
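Wired into the handler (reusing the names from this answer), the guard is just:

ok, reason = valid_audio_for_stt(audio_np, sr=16000)
if not ok:
    print(f"[AUDIO] Rejected before STT: {reason}")
    return ""
text = safe_transcribe(audio_np)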
Simple concatenation buffer
This is not the final ideal version, but it is a useful first patch.
import numpy as np

class RollingSTTBuffer:
    def __init__(self, sample_rate=16000, min_seconds=1.0, max_seconds=15.0):
        self.sample_rate = sample_rate
        self.min_samples = int(sample_rate * min_seconds)
        self.max_samples = int(sample_rate * max_seconds)
        self.buffer = np.zeros(0, dtype=np.float32)

    def add(self, chunk):
        chunk = np.asarray(chunk, dtype=np.float32)
        self.buffer = np.concatenate([self.buffer, chunk])
        if len(self.buffer) > self.max_samples:
            self.buffer = self.buffer[-self.max_samples:]

    def ready(self):
        return len(self.buffer) >= self.min_samples

    def pop(self):
        audio = self.buffer
        self.buffer = np.zeros(0, dtype=np.float32)
        return audio
Usage:
stt_buffer = RollingSTTBuffer(
    sample_rate=16000,
    min_seconds=1.0,
    max_seconds=15.0,
)

def handle_audio_chunk(chunk_16k):
    stt_buffer.add(chunk_16k)
    if not stt_buffer.ready():
        print("[BUFFER] Not enough audio yet.")
        return
    audio_for_stt = stt_buffer.pop()
    text = safe_transcribe(audio_for_stt)
    if not should_send_to_ollama(text):
        return
    send_to_ollama(text)
This proves whether concatenating chunks helps.
But it has a weakness: it transcribes after a fixed amount of audio, not after the user actually finishes speaking.
The better solution is VAD-based buffering.
Better solution: VAD-based utterance buffering
Use VAD to decide:
speech started
speech continued
speech ended
Then transcribe the completed utterance.
A recommended tool is Silero VAD: it supports 8 kHz and 16 kHz audio and is designed for fast chunk-level speech detection.
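As a sketch of the vad_is_speech helper used in the code below (assuming the torch.hub distribution of Silero VAD and fixed-size 16 kHz float32 frames of the length the model expects, e.g. 512 samples for recent versions):

import numpy as np
import torch

# Downloads the Silero VAD model from torch.hub on first use.
vad_model, _ = torch.hub.load("snakers4/silero-vad", "silero_vad")

def vad_is_speech(frame_16k, threshold=0.5) -> bool:
    # The model returns a speech probability for one fixed-size frame.
    frame = torch.from_numpy(np.asarray(frame_16k, dtype=np.float32))
    speech_prob = vad_model(frame, 16000).item()
    return speech_prob >= threshold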
VAD-based utterance buffer
import numpy as np

class UtteranceBuffer:
    def __init__(
        self,
        sample_rate=16000,
        min_speech_seconds=0.8,
        end_silence_ms=900,
        max_seconds=15.0,
    ):
        self.sample_rate = sample_rate
        self.min_speech_samples = int(sample_rate * min_speech_seconds)
        self.end_silence_ms = end_silence_ms
        self.max_samples = int(sample_rate * max_seconds)
        self.frames = []
        self.speaking = False
        self.silence_ms = 0.0
        self.speech_samples = 0

    def _frame_ms(self, frame):
        return 1000.0 * len(frame) / self.sample_rate

    def push(self, frame_16k, is_speech: bool):
        frame_16k = np.asarray(frame_16k, dtype=np.float32)
        if is_speech:
            self.speaking = True
            self.silence_ms = 0.0
            self.speech_samples += len(frame_16k)
            self.frames.append(frame_16k)
        elif self.speaking:
            self.silence_ms += self._frame_ms(frame_16k)
            self.frames.append(frame_16k)
        else:
            return None
        audio = (
            np.concatenate(self.frames)
            if self.frames
            else np.zeros(0, dtype=np.float32)
        )
        if len(audio) > self.max_samples:
            audio = audio[-self.max_samples:]
            self.frames = [audio]
        if self.speaking and self.silence_ms >= self.end_silence_ms:
            utterance = (
                np.concatenate(self.frames)
                if self.frames
                else np.zeros(0, dtype=np.float32)
            )
            enough_speech = self.speech_samples >= self.min_speech_samples
            self.frames = []
            self.speaking = False
            self.silence_ms = 0.0
            self.speech_samples = 0
            if not enough_speech:
                print("[VAD] Dropped utterance: too little speech")
                return None
            return utterance
        return None
Conceptual usage:
utt_buffer = UtteranceBuffer(sample_rate=16000)

def process_audio_frame(frame_16k):
    is_speech = vad_is_speech(frame_16k)  # implement with Silero/WebRTC/etc.
    utterance = utt_buffer.push(frame_16k, is_speech=is_speech)
    if utterance is None:
        return ""
    text = safe_transcribe(utterance)
    if not should_send_to_ollama(text):
        return ""
    send_to_ollama(text)
    return text
This is the direction you want.
faster-whisper starter config
Use faster-whisper instead of a hand-rolled “simple Whisper STT” path if possible.
Example:
from faster_whisper import WhisperModel
import numpy as np

model = WhisperModel(
    "small.en",
    device="cpu",         # use "cuda" if available
    compute_type="int8",  # use "float16" on CUDA
)

def transcribe_with_faster_whisper(audio_16k: np.ndarray) -> str:
    audio_16k = np.asarray(audio_16k, dtype=np.float32)
    audio_16k = np.clip(audio_16k, -1.0, 1.0)
    ok, reason = valid_audio_for_stt(audio_16k, sr=16000)
    if not ok:
        print("[STT] Skipping:", reason)
        return ""
    segments, info = model.transcribe(
        audio_16k,
        language="en",
        task="transcribe",
        beam_size=1,
        temperature=0.0,
        condition_on_previous_text=False,
        vad_filter=True,
        vad_parameters={
            "min_silence_duration_ms": 700,
            "speech_pad_ms": 300,
        },
        no_speech_threshold=0.6,
        compression_ratio_threshold=1.35,
        log_prob_threshold=-1.0,
    )
    return " ".join(seg.text.strip() for seg in segments).strip()
Why these settings help:
| Setting | Reason |
| --- | --- |
| language="en" | Avoids unstable language detection on short clips |
| task="transcribe" | Prevents accidental translation |
| beam_size=1 | Lower latency |
| temperature=0.0 | More deterministic |
| condition_on_previous_text=False | Reduces carry-over hallucination between short turns |
| vad_filter=True | Extra silence cleanup |
| min_silence_duration_ms=700 | Reasonable conversational silence threshold |
| speech_pad_ms=300 | Avoids cutting word edges |
| no_speech_threshold=0.6 | Helps ignore no-speech chunks |
| compression_ratio_threshold=1.35 | Helps catch repetitive hallucinations |
| log_prob_threshold=-1.0 | Helps catch low-confidence output |
Save the exact STT input as WAV
This is still the most important debug step.
# deps:
# pip install soundfile numpy
import numpy as np
import soundfile as sf
from pathlib import Path

debug_dir = Path("debug_stt")
debug_dir.mkdir(exist_ok=True)

def save_debug_wav(audio, sr, filename):
    audio = np.asarray(audio, dtype=np.float32)
    audio = np.clip(audio, -1.0, 1.0)
    sf.write(debug_dir / filename, audio, sr)
Use it right before STT:
save_debug_wav(audio_16k, 16000, "actual_stt_input.wav")
Then listen.
| What the WAV sounds like | Diagnosis |
| --- | --- |
| Silence | wrong source / VAD issue / Discord receive issue |
| Static | dtype/decode issue |
| Fast voice | sample-rate mismatch |
| Slow voice | sample-rate mismatch |
| Distorted/clipped | normalization/gain issue |
| Half-word only | chunking problem |
| Bot voice | TTS feedback loop |
| Multiple speakers | need per-user buffers |
| Clean sentence | STT settings/model issue |
Do not skip this. It usually reveals the actual problem faster than changing models.
Sample rates in your system
You now likely have several sample rates:
STT target: 16000 Hz
Chatterbox TTS output: 24000 Hz
Discord voice audio: commonly 48000 Hz stereo/Opus/PCM path
Your log says:
TTS result: sr=24000, audio_shape=(86400,)
That is:
86400 / 24000 = 3.6 seconds
Chatterbox generated 3.6 seconds of TTS audio.
That audio should go to the TTS/playback path, not the STT input path.
Keep these separate:
Input audio → 16 kHz mono → STT
TTS audio → Discord playback format → Discord output
Do not let Chatterbox/TTS output leak into your mic/STT input.
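A minimal sketch of keeping the two conversion paths separate (assuming scipy for resampling; your playback layer may already do its own conversion):

import numpy as np
from scipy.signal import resample_poly

def to_stt_rate(audio, src_rate, stt_rate=16000):
    # Incoming voice -> 16 kHz mono float32 for STT.
    audio = np.asarray(audio, dtype=np.float32)
    if src_rate == stt_rate:
        return audio
    return resample_poly(audio, up=stt_rate, down=src_rate).astype(np.float32)

def to_playback_rate(tts_audio, tts_rate=24000, out_rate=48000):
    # TTS output (e.g. 24 kHz Chatterbox audio) -> 48 kHz for the Discord playback path only.
    tts_audio = np.asarray(tts_audio, dtype=np.float32)
    return resample_poly(tts_audio, up=out_rate, down=tts_rate).astype(np.float32)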
Is Chatterbox affecting STT?
Probably not directly.
Chatterbox is TTS. It generates speech. It does not transcribe speech.
But it can affect your STT system indirectly in three ways.
1. Feedback loop
If the bot’s generated voice is captured by your mic or virtual audio cable, the STT system may hear the bot instead of you.
Bad routing:
TTS output
→ speakers / desktop mix / virtual cable
→ STT input
→ bot hears itself
→ bot replies to itself
Better routing:
Human mic or per-user Discord receive
→ STT
Bot TTS
→ Discord output only
2. Sample-rate confusion
Chatterbox output is 24 kHz in your log.
STT should usually get 16 kHz mono.
Discord playback often involves 48 kHz audio.
So do not reuse one conversion path for everything.
3. The Turbo warning is not your STT bug
Your log says:
WARNING - CFG, min_p and exaggeration are not supported by Turbo version and will be ignored.
That warning is about Chatterbox Turbo TTS settings. It means those generation settings are ignored by the Turbo model.
That warning can affect TTS behavior/customization, but it does not explain blank STT.
Add a bot_is_speaking guard while debugging
For the first stable version, disable listening while the bot speaks.
bot_is_speaking = False
Around TTS playback:
bot_is_speaking = True
try:
    play_tts_audio(...)
finally:
    # Reset even if playback raises, so the bot does not stay deaf.
    bot_is_speaking = False
In audio handling:
def handle_audio_chunk(chunk_16k):
    if bot_is_speaking:
        print("[AUDIO] Ignoring input while bot is speaking.")
        return
    # continue STT path
This disables barge-in, but it prevents feedback while debugging.
Later, implement real barge-in:
if human starts speaking while bot speaks:
    stop TTS
    clear playback queue
    cancel current LLM/TTS response
    return to listening
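A hedged sketch of that interruption path (current_response_task is a hypothetical handle to your in-flight LLM+TTS pipeline; is_playing()/stop() are the standard discord.py/Pycord voice-client calls):

import asyncio
from typing import Optional

def handle_barge_in(voice_client, current_response_task: Optional[asyncio.Task]):
    # Stop whatever the bot is currently saying.
    if voice_client is not None and voice_client.is_playing():
        voice_client.stop()
    # Cancel the in-flight LLM/TTS response so it cannot resume mid-sentence.
    if current_response_task is not None and not current_response_task.done():
        current_response_task.cancel()
    # The normal listening path then handles the new user utterance.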
Live voice-agent systems treat turn detection and interruption handling as separate concerns.
Better logs to add
Your logs should include:
sample_rate
samples
duration_seconds
min
max
peak
rms
bot_is_speaking
buffer_size
vad_state
utterance_ready
stt_called
ollama_called
Example logging helper:
import numpy as np

def log_audio_debug(label, audio, sr):
    audio = np.asarray(audio, dtype=np.float32)
    duration = len(audio) / sr if sr else 0.0
    peak = float(np.max(np.abs(audio))) if len(audio) else 0.0
    rms = float(np.sqrt(np.mean(audio ** 2))) if len(audio) else 0.0
    print(
        f"[{label}] sr={sr} samples={len(audio)} "
        f"duration={duration:.3f}s peak={peak:.4f} rms={rms:.4f}"
    )
Healthy logs should look like:
[MIC] sr=16000 samples=320 duration=0.020s peak=0.12 rms=0.02
[VAD] speech_start
[BUFFER] speech_ms=1240 silence_ms=0
[VAD] endpoint after silence_ms=900
[UTTERANCE] sr=16000 samples=35680 duration=2.23s peak=0.44 rms=0.06
[STT] text="can you hear me now"
[OLLAMA] sending valid transcript
Unhealthy logs look like:
samples=5760
transcribe immediately
NameError
empty text
send to Ollama anyway
Discord-specific note
If you are receiving audio from Discord VC, remember that Discord receive is its own fragile layer.
Pycord's documentation warns that recording/listening may be affected by DAVE, Discord's end-to-end encryption protocol for voice.
Before debugging STT, prove Discord receive works by saving clean WAV files:
Discord receive
→ decode/convert
→ save WAV
→ listen manually
Only after the WAV sounds correct should you send it into STT.
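If you are on Pycord, a rough sketch of the save-per-speaker step could look like this (names from Pycord's sinks/voice-receive API; other libraries name things differently):

import discord

async def on_recording_done(sink: discord.sinks.WaveSink, channel: discord.TextChannel):
    # One WAV per user who spoke, so each stream can be checked by ear.
    for user_id, audio in sink.audio_data.items():
        with open(f"debug_discord_{user_id}.wav", "wb") as f:
            f.write(audio.file.read())
    await channel.send("Saved per-user debug WAVs.")

def start_debug_recording(vc: discord.VoiceClient, channel: discord.TextChannel):
    # Records until stop_recording() is called, then runs the callback.
    vc.start_recording(discord.sinks.WaveSink(), on_recording_done, channel)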
Recommended build order
Phase 1: local mic STT only
local mic
→ VAD
→ utterance buffer
→ faster-whisper
→ print transcript
Pass criteria:
silence produces no transcript
one sentence produces one transcript
partial speech is not sent
empty text is ignored
Phase 2: add Ollama
local mic
→ STT
→ Ollama
→ print reply
Pass criteria:
Ollama is called only for real speech
blank transcripts are ignored
Phase 3: add Chatterbox locally
local mic
→ STT
→ Ollama
→ Chatterbox TTS
→ local playback
Pass criteria:
the bot does not hear itself
the bot does not respond to its own voice
Phase 4: send TTS to Discord
local mic
→ STT
→ Ollama
→ Chatterbox TTS
→ Discord VC output
Phase 5: add Discord receive later
First:
Discord receive
→ save clean WAV per speaker
Then:
Discord receive
→ per-user VAD
→ per-user STT
→ speaker-labeled transcript
Do not start with full Discord receive unless you need it. It adds several failure points.
Final summary
Turning max_samples up will not fix this because max_samples is an upper cap, not a minimum buffer target.
Your immediate problems are:
1. You are calling STT on tiny chunks like 5760 samples.
2. Your waveform amplitude is suspiciously above 1.0.
3. Your code is calling a missing function: transcribe_audio.
4. Empty transcripts are still being sent to Ollama.
5. Chatterbox may be feeding back into STT if audio routing is not separated.
Fix order:
1. Define or correctly call transcribe_audio().
2. Stop sending empty transcripts to Ollama.
3. Add minimum-duration/RMS/peak validation before STT.
4. Concatenate chunks into a buffer.
5. Replace fixed buffering with VAD-based utterance buffering.
6. Keep Chatterbox/TTS output out of the STT input path.
7. Use faster-whisper with vad_filter=True and condition_on_previous_text=False.
The core rule:
Do not make Whisper transcribe chunks.
Make Whisper transcribe completed utterances.