(Taking above advice into account) this looks really hard…
Advice for building InterviewAI: real-time AI interview practice feedback from webcam and speech, with carefully scoped claims
I would build InterviewAI as an evidence-based interview practice coach, not as a facial-emotion or “confidence” detector.
The safest and strongest version is:
webcam + microphone
→ visible/speech measurements
→ reliability checks
→ transcript/rubric analysis
→ evidence-grounded coaching feedback
The risky version is:
face/body/voice
→ emotion/confidence/honesty/personality/hireability score
That second version can look impressive in a demo, but it is hard to validate and easy to overclaim. A better product reports observable behavior:
- speaking pace
- long pauses
- filler words
- answer length
- whether the answer addressed the question
- STAR structure: situation, task, action, result
- face visible percentage
- screen-facing / looking-away estimate
- head movement stability
- shoulder/posture visibility
- hand-to-face episodes
- camera/audio quality
It should avoid unsupported psychological judgments such as:
- “you lacked confidence”
- “you looked nervous”
- “you seemed dishonest”
- “your personality is weak”
- “you are not hireable”
A good feedback sentence is:
“During answer 2, you had four pauses longer than 2.5 seconds and did not include a clear result. Try answering again with: situation, action, result, and one measurable outcome.”
Not:
“You lacked confidence.”
This matters technically and ethically. HireVue, a major video-interview vendor, discontinued facial analysis in screening assessments after concerns about AI use in employment decisions: SHRM report, EPIC note, Wired coverage. If InterviewAI stays a user-owned practice tool, risk is much lower than if it becomes an employer-facing candidate-ranking system.
1. Recommended architecture
Use a modular architecture. Do not build one giant “interview quality model” first.
Client app
├── webcam capture
│ ├── face detection / face landmarks
│ ├── head pose or screen-facing estimate
│ ├── pose landmarks
│ ├── hand landmarks
│ └── frame-quality checks
│
├── microphone capture
│ ├── voice activity detection
│ ├── speech-to-text
│ ├── pause detection
│ ├── speaking pace
│ └── filler-word detection
│
├── interview engine
│ ├── question generation
│ ├── role-specific question bank
│ ├── follow-up questions
│ └── answer segmentation
│
├── feature aggregator
│ ├── per-frame visual features
│ ├── per-answer speech features
│ ├── transcript features
│ ├── rolling averages
│ ├── confidence/reliability flags
│ └── event timeline
│
├── feedback engine
│ ├── deterministic metric summaries
│ ├── rubric-based answer analysis
│ ├── LLM-generated coaching
│ ├── safety/claim checker
│ └── practice recommendations
│
└── report UI
├── per-answer feedback
├── timeline
├── evidence for each suggestion
├── reliability caveats
└── progress over sessions
The most important separation is:
measurement layer ≠ interpretation layer
The measurement layer should output facts:
{
"answer_id": "answer_002",
"duration_seconds": 74,
"wpm": 142,
"long_pauses": 4,
"filler_words": 11,
"face_visible_ratio": 0.93,
"screen_facing_ratio": 0.68,
"hand_to_face_events": 5,
"posture_feedback_valid": false,
"posture_invalid_reason": "shoulders not visible enough"
}
The interpretation layer turns those facts into coaching:
“Your action was clear, but the result was missing. You also had four long pauses. Try answering again using situation, action, result, and one measurable outcome.”
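A minimal sketch of the deterministic part of the interpretation layer, assuming the measurement JSON above is parsed into a Python dict (field names follow that example; all thresholds are illustrative assumptions):

# Deterministic metric summary: turn measured facts into plain observations,
# with no psychological inference. Thresholds are illustrative assumptions.
def summarize_metrics(m: dict) -> list[str]:
    notes = []
    if m["wpm"] > 170:
        notes.append(f"Fast speaking pace ({m['wpm']} WPM).")
    elif m["wpm"] < 110:
        notes.append(f"Slow speaking pace ({m['wpm']} WPM).")
    if m["long_pauses"] >= 3:
        notes.append(f"{m['long_pauses']} pauses longer than 2.5 seconds.")
    if m["filler_words"] >= 8:
        notes.append(f"{m['filler_words']} filler words in this answer.")
    # Reliability gate: only mention posture when it could actually be judged.
    if not m.get("posture_feedback_valid", False):
        notes.append("Posture not evaluated: " + m.get("posture_invalid_reason", "insufficient evidence"))
    return notes

The LLM then rewrites these observations as coaching, but it never adds observations of its own.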
2. Build in stages
Stage 1 — transcript-only MVP
Start here before webcam analysis.
question
→ user speaks answer
→ ASR transcript
→ WPM / pauses / fillers
→ rubric feedback
→ report
Features:
- generate interview questions
- record audio
- transcribe answer
- compute speaking pace
- detect long pauses
- count filler words
- check answer structure
- generate feedback
Example output:
Answer length: 74 seconds
Speaking pace: 142 WPM
Long pauses: 4
Filler words: 11
STAR structure:
situation: present
task: unclear
action: present
result: missing
Feedback:
Your action was clear, but the result was missing. Add one sentence explaining what changed because of your action.
This is already useful and much easier to evaluate than facial emotion recognition.
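A rough sketch of the Stage 1 speech metrics, assuming the ASR returns word-level timestamps as (word, start_seconds, end_seconds) tuples; the filler list and the 2.5-second pause threshold are assumptions to tune on your own recordings:

# Compute duration, WPM, long pauses, and filler words from word timestamps.
FILLERS = {"um", "uh", "like", "basically", "actually"}  # assumed starter list
PAUSE_THRESHOLD_S = 2.5  # assumed threshold

def speech_metrics(words: list[tuple[str, float, float]]) -> dict:
    if not words:
        return {"duration_seconds": 0, "wpm": 0, "long_pauses": 0, "filler_words": 0}
    duration = words[-1][2] - words[0][1]
    wpm = len(words) / (duration / 60) if duration > 0 else 0
    long_pauses = sum(
        1 for (_, _, end), (_, next_start, _) in zip(words, words[1:])
        if next_start - end > PAUSE_THRESHOLD_S
    )
    fillers = sum(1 for w, _, _ in words if w.lower().strip(".,?!") in FILLERS)
    return {
        "duration_seconds": round(duration),
        "wpm": round(wpm),
        "long_pauses": long_pauses,
        "filler_words": fillers,
    }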
Stage 2 — basic webcam observables
Add:
- face visible percentage
- face centeredness
- screen-facing estimate
- looking-away episodes
- head movement variance
Use these as observations, not psychological claims.
Stage 3 — pose and hands
Add:
- shoulder visibility
- upper-body stability
- posture validity flag
- hand-to-face episodes
- large movement spikes
Good feedback:
“Your hand moved near your face five times during this answer.”
Bad feedback:
“You were anxious.”
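A simplified sketch of hand-to-face event detection, assuming you already have per-frame face and hand points in normalized image coordinates (the field names are hypothetical; the distance threshold and minimum duration are assumptions):

import math
NEAR_THRESHOLD = 0.15   # assumed normalized distance that counts as "near the face"
MIN_EVENT_FRAMES = 5    # assumed minimum consecutive frames to count as an event

def hand_to_face_events(frames: list[dict]) -> int:
    events, run = 0, 0
    for f in frames:
        near = False
        if f.get("face_center") and f.get("hand_point"):
            (fx, fy), (hx, hy) = f["face_center"], f["hand_point"]
            near = math.hypot(fx - hx, fy - hy) < NEAR_THRESHOLD
        run = run + 1 if near else 0
        if run == MIN_EVENT_FRAMES:  # count the event once it reaches minimum length
            events += 1
    return events

Report the count and timestamps, not an interpretation of why the hand moved.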
Stage 4 — job-aware coaching
Add:
- resume parsing
- job-description matching
- role-specific rubrics
- retrieval of coaching examples
- answer-to-rubric comparison
Use embeddings/rerankers for this.
Stage 5 — evaluation and safety
Add:
- manually labeled benchmark sessions
- detector precision/recall/F1
- feedback faithfulness checks
- unsupported-claim detector
- accessibility modes
- privacy controls
3. Computer vision direction
For the first version, use landmark tracking, not a custom visual model.
Recommended first tools:
- MediaPipe Face Landmarker for Web
- MediaPipe Pose Landmarker for Web
- MediaPipe Hand Landmarker for Web
- OpenCV for image/video utilities
- OpenFace for offline/research facial behavior analysis
- MMPose / RTMPose later if MediaPipe is not enough
MediaPipe is a good MVP choice because it is real-time, browser/mobile friendly, and gives the kind of coordinates you need: face, body, and hand landmarks.
Use OpenFace mainly for research/offline validation. It supports facial landmarks, head pose, facial action units, and eye gaze. Use it to compare signals, not to claim internal emotion.
4. Hugging Face models/libraries that can help
Use Hugging Face mainly for ASR, LLMs, datasets, evaluation, embeddings, demos, and optional experiments. For real-time camera landmarks, MediaPipe is usually the better first tool.
Speech-to-text
Good candidates to test:
- openai/whisper-large-v3-turbo
- CohereLabs/cohere-transcribe-03-2026
- nvidia/canary-qwen-2.5b
- nvidia/parakeet-tdt-0.6b-v3
ASR matters because transcript quality affects everything else: filler words, answer structure, relevance, and feedback quality.
Evaluate ASR on your own mock-interview audio. Generic WER is not enough. Measure:
- word error rate
- filler-word recall
- timestamp quality
- pause boundary accuracy
- speed/latency
- accent robustness
- microphone robustness
Useful benchmark: hf-audio/asr-leaderboard-longform (listed again in section 8).
Face/body/hand models on Hugging Face
Useful HF-hosted MediaPipe-style models:
- qualcomm/MediaPipe-Face-Detection
- qualcomm/Facial-Landmark-Detection
- qualcomm/MediaPipe-Pose-Estimation
- qualcomm/MediaPipe-Hand-Detection
- qualcomm/RTMPose-Body2d
Use these for:
- face visibility
- face framing
- head movement
- shoulder visibility
- upper-body stability
- hand-to-face events
Do not use them to infer confidence or nervousness.
Facial-expression models
Treat these as optional experiments only:
- trpakov/vit-face-expression
- mo-thecreator/vit-Facial-Expression-Recognition
- Tanneru/Facial-Emotion-Detection-FER-RAFDB-AffectNet-BEIT-Large
These classify expression labels like angry, happy, sad, surprise, neutral, etc. That is not the same as detecting interview confidence, nervousness, honesty, or hireability.
Safe label:
“experimental facial-expression classifier output”
Unsafe label:
“candidate confidence score”
Feedback LLMs
The LLM should receive structured evidence and write coaching feedback. It should not invent visual claims.
Candidates to test: Qwen- or Llama-style instruction-tuned models, or an API model (see the stack table in section 12).
Example LLM input:
{
"question": "Tell me about a time you handled conflict.",
"transcript": "I had a disagreement with a teammate...",
"speech_metrics": {
"wpm": 138,
"long_pauses": 3,
"filler_words": 9
},
"vision_metrics": {
"face_visible_ratio": 0.94,
"screen_facing_ratio": 0.71,
"hand_to_face_events": 4,
"posture_feedback_valid": false
},
"rubric": {
"situation": true,
"task": true,
"action": true,
"result": false
},
"instruction": "Use only the evidence above. Do not infer confidence, nervousness, honesty, personality, or hireability."
}
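A minimal sketch of passing that evidence to an instruction-tuned model; the system prompt wording is an assumption, and call_llm is a placeholder for whatever chat API or local runtime you choose:

import json

SYSTEM_PROMPT = (
    "You are an interview practice coach. Use only the evidence in the JSON. "
    "Quote concrete numbers when giving advice. Do not infer confidence, "
    "nervousness, honesty, personality, or hireability. If a metric is marked "
    "invalid, say it could not be evaluated."
)

def build_messages(evidence: dict) -> list[dict]:
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": json.dumps(evidence, indent=2)},
    ]

# feedback = call_llm(build_messages(evidence))  # call_llm is a placeholder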
Embeddings and rerankers
Useful if you want job/resume-aware coaching: embedding and reranker models such as Qwen3 Embedding and Qwen3 Reranker (see section 12).
Use embeddings for:
resume → retrieve relevant past projects
job description → retrieve required competencies
question → retrieve rubric
answer → retrieve coaching examples
Do not use embedding similarity alone as an interview-quality score.
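A sketch of rubric retrieval with an embedding model, assuming a sentence-transformers-compatible checkpoint; the model id and rubric texts are placeholders:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")  # placeholder model id
rubrics = [
    "Conflict: situation, the other party's view, your action, the outcome.",
    "Leadership: scope, decision, how you brought others along, measurable result.",
]
question = "Tell me about a time you handled conflict."
rubric_vecs = model.encode(rubrics, normalize_embeddings=True)
question_vec = model.encode(question, normalize_embeddings=True)
best_rubric = rubrics[int(util.cos_sim(question_vec, rubric_vecs)[0].argmax())]

The similarity score only selects which rubric to apply; it is not itself an answer-quality score.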
Safety / prompt-injection guard
If users upload resumes, job descriptions, or company pages, protect the feedback LLM from prompt injection.
Options:
- meta-llama/Llama-Prompt-Guard-2-86M
- protectai/deberta-v3-base-prompt-injection-v2
- meta-llama/Llama-Guard-4-12B
Also add a custom validator that blocks unsupported feedback such as the phrases below (a sketch follows the list):
you looked nervous
you lacked confidence
you seemed dishonest
you are not hireable
your personality is weak
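A minimal sketch of that validator as a regex blocklist; the patterns are an illustrative starting point and should grow from real failure cases:

import re
BANNED_PATTERNS = [
    r"\b(lacked|lacking|low)\s+confidence\b",
    r"\b(looked|seemed|appeared|were)\s+(nervous|anxious|dishonest)\b",
    r"\bpersonality\b",
    r"\bhireab(le|ility)\b",
]

def violates_claim_policy(feedback: str) -> bool:
    text = feedback.lower()
    return any(re.search(p, text) for p in BANNED_PATTERNS)

# On violation, regenerate with a stricter prompt or fall back to the
# deterministic metric summary instead of shipping the unsupported claim.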
5. Should you train your own model?
Do not train your own model first.
Start with:
- pretrained ASR
- pretrained landmarks
- rule-based metrics
- LLM feedback from structured evidence
- manual evaluation
Fine-tune only after you have:
- a narrow observable target
- labeled data
- a baseline
- a clear failure case
- evaluation metrics
Good fine-tuning targets:
| Target | Good label |
|---|---|
| ASR adaptation | transcript text |
| filler detection | token-level filler labels |
| long pauses | timestamped pause segments |
| looking away | timestamped looking-away segments |
| hand-to-face | timestamped hand-near-face events |
| answer quality | rubric scores by human reviewers |
| feedback style | human coach feedback examples |
Bad fine-tuning targets:
| Target | Problem |
|---|---|
| confidence | vague internal state |
| nervousness | not directly observable |
| honesty | not valid from video/audio |
| personality | ethically and scientifically risky |
| hireability | high-stakes and bias-prone |
6. Real-time webcam inference best practices
Run models slower than camera FPS
Suggested rates:
| Component | Rate |
|---|---|
| webcam preview | 30 FPS |
| face landmarks | 10–15 FPS |
| pose landmarks | 5–10 FPS |
| hand landmarks | 5–10 FPS |
| expression classifier, if any | 1–3 FPS |
| full LLM feedback | after each answer |
Do not generate feedback every frame.
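One simple way to do this: keep the preview loop at camera rate and only invoke each model when enough time has passed since its last run. A sketch, where run_face_landmarker is a placeholder for your detector:

import time

class ThrottledDetector:
    # Runs an expensive per-frame model at a fixed lower rate than the camera.
    def __init__(self, detect_fn, target_fps: float):
        self.detect_fn = detect_fn
        self.min_interval = 1.0 / target_fps
        self.last_run = 0.0
        self.last_result = None

    def maybe_detect(self, frame):
        now = time.monotonic()
        if now - self.last_run >= self.min_interval:
            self.last_result = self.detect_fn(frame)
            self.last_run = now
        return self.last_result  # reuse the previous result between runs

# face = ThrottledDetector(run_face_landmarker, target_fps=12).maybe_detect(frame)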
Smooth signals
Landmarks jitter. Use:
- moving average
- exponential moving average
- median filter
- hysteresis thresholds
- minimum event duration
Bad:
frame 1834 = looking away
Better:
looking-away event:
start: 00:41.2
end: 00:46.8
duration: 5.6 seconds
confidence: medium
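A sketch of turning noisy per-frame head yaw into looking-away events using an exponential moving average, hysteresis thresholds, and a minimum duration (all constants are assumptions to calibrate against labeled sessions):

EMA_ALPHA = 0.3          # smoothing factor (assumed)
ENTER_THRESHOLD = 0.45   # smoothed |yaw| needed to start an event (assumed)
EXIT_THRESHOLD = 0.30    # lower |yaw| needed to end it: hysteresis (assumed)
MIN_DURATION_S = 2.0     # discard shorter events (assumed)

def looking_away_events(yaw_samples: list[tuple[float, float]]) -> list[dict]:
    # yaw_samples: (timestamp_seconds, yaw_radians) pairs at the landmark rate
    events, smoothed, start = [], None, None
    for t, yaw in yaw_samples:
        smoothed = yaw if smoothed is None else EMA_ALPHA * yaw + (1 - EMA_ALPHA) * smoothed
        away = abs(smoothed) > (EXIT_THRESHOLD if start is not None else ENTER_THRESHOLD)
        if away and start is None:
            start = t
        elif not away and start is not None:
            if t - start >= MIN_DURATION_S:
                events.append({"start": start, "end": t, "duration": round(t - start, 1)})
            start = None
    return events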
Use calibration
At session start:
Please sit naturally and look at the screen for 5 seconds.
Capture baseline:
- head yaw
- head pitch
- face center
- shoulder position
- distance from camera
- lighting quality
Then compare future movement to that user’s baseline.
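A short sketch of the calibration step: average a few seconds of measurements while the user looks at the screen, then report later movement relative to that baseline (field names are hypothetical):

from statistics import mean

def capture_baseline(samples: list[dict]) -> dict:
    # samples: per-frame measurements recorded during the 5-second calibration
    keys = ["head_yaw", "head_pitch", "face_center_x", "face_center_y"]
    return {k: mean(s[k] for s in samples) for k in keys}

def relative_yaw(sample: dict, baseline: dict) -> float:
    # Compare to this user's own neutral position, not to a global zero.
    return sample["head_yaw"] - baseline["head_yaw"]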
Use reliability gates
Do not report metrics when evidence is weak.
| Condition | Action |
|---|---|
| face visible < 60% | do not report screen-facing estimate |
| shoulders not visible | do not report posture |
| hand landmarks unstable | do not report hand-to-face events |
| audio quality poor | warn transcript metrics may be unreliable |
| multiple faces visible | pause analysis or warn |
| low lighting | ask user to improve lighting |
A good caveat:
“I could not reliably evaluate posture because your shoulders were not visible enough.”
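A sketch of a reliability gate over the aggregated per-answer metrics, mirroring the table above (field names follow the earlier measurement JSON; thresholds are assumptions):

def apply_reliability_gates(m: dict) -> dict:
    # Drop or caveat metrics whose supporting evidence is too weak to report.
    gated = dict(m)
    if m.get("face_visible_ratio", 0) < 0.6:
        gated.pop("screen_facing_ratio", None)
        gated["screen_facing_caveat"] = "face visible less than 60% of the time"
    if not m.get("shoulders_visible", False):
        gated["posture_feedback_valid"] = False
        gated["posture_invalid_reason"] = "shoulders not visible enough"
    if m.get("audio_snr_db", 99.0) < 10.0:  # assumed audio-quality proxy
        gated["transcript_caveat"] = "audio quality was poor; speech metrics may be unreliable"
    return gated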
Store features, not raw video
Prefer:
{
"timestamp_ms": 12800,
"face_visible": true,
"head_yaw": -0.12,
"head_pitch": 0.04,
"screen_facing_estimate": true,
"left_hand_near_face": false,
"right_hand_near_face": true
}
Do not store raw video unless the user explicitly opts in.
Browser webcam access uses getUserMedia(); see MDN getUserMedia.
7. Evaluation plan
Do not evaluate by asking:
“Does the generated feedback sound good?”
A polished LLM can produce convincing but false feedback.
Evaluate by asking:
“Did the system correctly detect the observable things it claims to detect, and did the feedback stay faithful to those measurements?”
Build a labeled benchmark
Create:
30–100 mock interview sessions
5–10 minutes each
10+ users
different webcams
different microphones
different lighting
different accents
camera-on and camera-off cases
Manually label:
| Label | Type |
|---|---|
| long pauses | timestamped start/end |
| filler words | transcript tokens |
| looking away | timestamped segments |
| face visible | per frame or per second |
| hand near face | timestamped segments |
| shoulders visible | segment-level |
| posture feedback valid | yes/no |
| STAR structure | present/missing |
| answer relevance | rubric score |
| unsupported feedback claim | yes/no |
Metrics
| Component | Metric |
|---|---|
| ASR | WER, filler recall, timestamp error |
| pause detector | precision, recall, F1, boundary error |
| looking-away detector | event F1, false positives/min |
| hand-to-face detector | event precision/recall |
| posture validity | accuracy of “can judge / cannot judge” |
| answer rubric | agreement with human reviewer |
| feedback | faithfulness, helpfulness, unsupported-claim rate |
| overall | user usefulness rating + objective detector scores |
Example report:
InterviewAI v0.2 evaluation
Dataset:
48 mock interview sessions
14 participants
7 webcam/microphone setups
326 answer segments
ASR:
WER: 8.9%
filler-word recall: 0.81
average timestamp error: 420 ms
Long pauses:
precision: 0.91
recall: 0.85
F1: 0.88
Looking-away estimate:
precision: 0.76
recall: 0.69
F1: 0.72
false positives: 0.7/min
Hand-to-face:
precision: 0.82
recall: 0.64
F1: 0.72
Feedback:
evidence faithfulness: 96%
unsupported psychological claims: 0%
human coach agreement: 0.74
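For event detectors like long pauses, looking away, or hand-to-face, precision/recall/F1 can be computed by matching predicted segments to labeled segments by time overlap. A sketch, where the 0.5 IoU threshold is a common but arbitrary choice:

def iou(a: tuple[float, float], b: tuple[float, float]) -> float:
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def event_prf(predicted, labeled, iou_threshold: float = 0.5) -> dict:
    # Greedy one-to-one matching of predicted (start, end) segments to labels.
    unmatched, tp = list(labeled), 0
    for p in predicted:
        best = max(unmatched, key=lambda g: iou(p, g), default=None)
        if best is not None and iou(p, best) >= iou_threshold:
            tp += 1
            unmatched.remove(best)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(labeled) if labeled else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}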
8. Datasets and benchmarks
Interview-question datasets
Useful for question generation and technical QA examples: AI-Mock-Interviewer/Train_data and K-areem/AI-Interview-Questions (see section 14).
Use them for:
- question generation
- role-specific question bank
- technical interview examples
- simple instruction-tuning experiments
Do not treat them as reliable answer-quality labels.
ASR datasets
Useful for transcription evaluation:
- hf-audio/asr-leaderboard-longform
- edinburghcstr/ami
- facebook/omnilingual-asr-corpus
- nilc-nlp/CORAA-MUPE-ASR
Face/head/gaze/pose datasets
Useful for component experiments:
FER/SER datasets
Use only for optional research:
- deanngkl/ferplus-7cls
- abhilash88/fer2013-enhanced
- laion/emonet-face-hq
- ak0255/Synthesis_SER
- GDGiangi/SEIRDB
Do not use FER/SER datasets as proof that you can detect interview confidence or nervousness.
Prefer no-script datasets
Prefer Hugging Face datasets stored as Parquet, JSON, CSV, image folders, or audio files. Avoid datasets that require custom Python dataset builder scripts or trust_remote_code=True when possible.
Relevant docs: the Hugging Face Datasets documentation (see section 12).
9. Similar projects to study
Use these for architecture and UX ideas, not as proof of validity.
- egekaraca/ai-interview-coach
- KentTDang/AI-Interview-Coach
- yaotingchun/VoxLab
- SergioSediq/interview-coach
- Aditi-T27/InterviewAnalyser
- Mohamed-samy2/Video-Interview-Analysis
- SatyamPote/Ai-Video-Interviewer
- AI mock interview Space
Projects that say “confidence analyzer” or “candidate scoring” are useful cautionary examples. Their architecture may be interesting, but the framing is risky.
Better names:
Interview practice coach
Interview delivery analyzer
Observable behavior feedback system
Mock interview feedback assistant
Riskier names:
confidence detector
nervousness analyzer
honesty detector
hireability evaluator
personality detector
10. Legal/safety/product-positioning warnings
If InterviewAI is a candidate-owned practice tool, the risk is much lower.
If it becomes an employer-facing automated screening tool, the risk increases sharply.
Relevant references:
- NYC Local Law 144 AEDT page
- Illinois Artificial Intelligence Video Interview Act
- EEOC AI and algorithmic fairness initiative
- EEOC AI and ADA resources
- NIST AI Risk Management Framework
Recommended disclaimer:
InterviewAI analyzes observable practice signals such as transcript quality, speaking pace, pauses, filler words, face visibility, screen-facing estimate, and hand/pose landmarks.
InterviewAI does not infer honesty, personality, mental health, true confidence, emotional state, or hireability.
11. Common mistakes to avoid
Mistake 1 — leading with emotion recognition
Facial expression recognition sounds impressive, but it is not the strongest core feature. It is hard to validate and easy to overclaim.
Use it as:
optional experimental expression classifier
Not as:
confidence detector
Mistake 2 — using one overall score too early
Avoid:
Interview score: 74/100
Confidence: 62/100
Professionalism: 80/100
Prefer:
Speech:
146 WPM
3 long pauses
9 filler words
Answer structure:
situation: present
action: present
result: missing
Camera:
face visible: 94%
screen-facing estimate: 71%
hand-to-face events: 5
Mistake 3 — no reliability gates
If the evidence is weak, say so. Do not fake posture or gaze feedback.
Mistake 4 — letting the LLM invent observations
Do not prompt:
Analyze this candidate's confidence.
Prompt:
Use only the transcript and measured metrics. Do not infer mental state, honesty, personality, or hireability.
Mistake 5 — no labeled evaluation set
Most demos fail here. Build a small labeled mock-interview benchmark and report objective metrics.
Mistake 6 — ignoring accessibility
Eye contact, speech rhythm, posture, facial movement, and gesture patterns vary across people. Include:
- camera-off mode
- transcript-only mode
- no penalty for gaze differences
- manual self-review
- user-controlled goals
- clear caveats
12. Recommended stack
Web MVP
| Layer | Recommendation |
|---|---|
| frontend | Next.js / React |
| webcam/mic | getUserMedia, MediaRecorder |
| real-time CV | MediaPipe Tasks Web |
| backend | FastAPI or Node |
| ASR | Whisper / Cohere Transcribe / Canary / Parakeet |
| feedback LLM | Qwen / Llama / API model |
| embeddings | Qwen3 Embedding |
| reranking | Qwen3 Reranker |
| storage | PostgreSQL + object storage |
| evaluation | Python, scikit-learn, Hugging Face Evaluate |
| demo | Hugging Face Spaces or custom web deployment |
Useful docs:
- Hugging Face Datasets
- Hugging Face Evaluate
- Hugging Face Spaces
- Hugging Face model cards
- ONNX Runtime Web
Local/privacy-oriented version
| Layer | Recommendation |
|---|---|
| CV | MediaPipe/OpenCV locally |
| ASR | local Whisper/faster-whisper-style runtime |
| LLM | local Qwen/Llama if hardware supports |
| storage | local SQLite |
| reports | local HTML/PDF export |
13. Strong README positioning
Use this:
InterviewAI is an AI interview-practice coach that combines transcript analysis, speech timing, and webcam landmark tracking to give users evidence-based feedback.
It measures observable practice signals such as speaking pace, long pauses, filler words, answer structure, face visibility, screen-facing estimate, head movement, upper-body stability, and hand-to-face movement.
It does not infer honesty, personality, mental health, true confidence, nervousness, or hireability.
Avoid this:
InterviewAI uses emotion recognition to detect whether candidates are confident, nervous, honest, and hireable.
14. Direct answers to the seven questions
1. Recommended architecture?
Use:
webcam + microphone
→ landmarks + transcript
→ observable metrics
→ per-answer aggregation
→ rubric scoring
→ LLM feedback from evidence
→ report with caveats
Keep measurement separate from interpretation.
2. Which HF models/libraries help?
Use Hugging Face for:
- ASR: Whisper, Cohere Transcribe, Canary, Parakeet
- LLM feedback: Qwen/Llama-style instruction models
- embeddings/rerankers: Qwen3 Embedding/Reranker
- datasets: ASR, interview questions, pose/gaze experiments
- evaluation: Hugging Face Evaluate
- demos: Spaces
- safety: Prompt Guard / Llama Guard-style models
Use MediaPipe/OpenFace/MMPose for camera landmarks.
3. Train, fine-tune, or pretrained?
Use pretrained models first. Fine-tune only after you have labeled data and a clear failing baseline. Do not train a model to predict “confidence” or “nervousness” first.
4. Best practices for real-time webcam inference?
- lower FPS inference
- smoothing
- calibration
- async processing
- feature storage instead of raw video
- reliability gates
- uncertainty reporting
- feedback after each answer, not every frame
5. How to evaluate reliability?
Create a labeled mock-interview benchmark and measure:
- long-pause F1
- filler recall
- looking-away event F1
- hand-to-face precision/recall
- ASR WER
- rubric agreement
- feedback faithfulness
- unsupported-claim rate
6. Datasets/models/pipelines?
Use:
- AI-Mock-Interviewer/Train_data
- K-areem/AI-Interview-Questions
- hf-audio/asr-leaderboard-longform
- MediaPipe / OpenFace / MMPose
- Whisper / Cohere / Canary / Parakeet
- Qwen / Llama for feedback
- Qwen3 Embedding/Reranker for retrieval
- your own interviewai-eval-v1 for final validation
7. Common mistakes?
Avoid:
- emotion overclaiming
- confidence/honesty/hireability scoring
- one overall score too early
- no reliability gates
- no labeled evaluation
- raw-video storage by default
- LLM-invented observations
- ignoring accessibility
- building an employer screening tool before addressing legal/fairness requirements
Final recommendation
Build the boring measurable system first:
observable signals
+ transcript analysis
+ reliability gates
+ evidence-grounded LLM feedback
+ manually labeled evaluation
Do not start with:
facial emotion recognition
+ confidence score
+ personality score
+ hireability score
If the basic signals are noisy, a bigger model will mostly give you a more expensive noisy system. If the basic signals are reliable, the feedback layer can become genuinely useful.