Sufiyan Khan — Full-Stack AI Engineer

01 / Projects

Production Systems.

01 — 2026

Live · Production

FastAPI (Python)

Deepgram Nova-2 STT

Groq Llama 3.1 LLM

Deepgram Aura TTS

Vanilla JS · WebSockets

Vercel · Render CI/CD

P95 Latency

<300ms

Mic → STT → LLM → TTS → Speaker

Real-Time AI Voice Agent

End-to-end production voice pipeline — sub-300ms latency, streaming audio, fault-tolerant backend

FastAPI over Flask — async I/O for concurrent voice streams Groq over OpenAI — 10× faster inference at lower cost Deepgram over Whisper — real-time streaming STT vs batch

🎤 Mic → Audio Capture → Deepgram STT → FastAPI /voice → Groq LLM → Deepgram TTS → 🔊 <300ms

Architecture Decisions · Trade-off Analysis

Decision

Chosen

Why

Backend framework

FastAPI

Async I/O handles concurrent voice streams without blocking; Flask's WSGI model creates latency under load

LLM inference

Groq (Llama 3.1)

~10× faster inference vs OpenAI at 1/3 the cost; critical for sub-300ms voice SLA

STT/TTS provider

Deepgram Nova-2

WebSocket streaming (not batch HTTP); enables real-time transcription while audio still being spoken

Frontend stack

Vanilla JS

Zero framework overhead for audio APIs; MediaRecorder + Web Audio API work best without abstraction layers

Deployment split

Vercel + Render

Frontend on edge CDN (Vercel) + Python async backend on Render; independent scaling per layer

<300ms

P95 E2E Latency

4

API Integrations

100%

Real-Time Uptime

01Streaming audio pipeline — Deepgram Nova-2 STT with real-time WebSocket transcription, not batch processing. Audio chunks streamed as spoken, reducing first-token latency by ~200ms vs HTTP upload approach.
02Fault-tolerant error handling — explicit handling for short audio (<0.5s), no-speech detection, API timeouts, and CORS edge cases. Health check endpoint at /status for uptime monitoring.
03Multi-persona architecture — /configure endpoint switches AI persona by hot-swapping system prompts without restarting the server. Conversation history persisted per session with /reset for clean handoffs.
04Interrupt capability — /interrupt endpoint lets users stop AI mid-response, mirroring natural human conversation. Built on server-side abort flags — the API call is actually cancelled to save tokens.
05Production deployment — CORS configured for cross-origin Vercel frontend + Render backend split. Environment secrets managed via Render's secrets manager. Zero-downtime redeploy on push to main.

↗ Live Demo </> Code

02 — 2026

Live · Production

LangGraph Multi-Agent

Groq LLM (Llama 3.1)

FastAPI · Docker

React Frontend

Vercel · Render CI/CD

Response Time

<200ms

Triage → Classify → Reply → Closed

AI Support Pro

LangGraph-orchestrated multi-agent support system with RAG pipeline and real-time observability dashboard

LangGraph over AutoGen — explicit graph routing, testable paths Groq over OpenAI — lower latency for real-time ticket replies Docker — environment parity dev → prod, zero surprise deploys

Ingest → Classification Agent → Triage Agent → Priority Router → Reply Agent → ✓ Closed

Architecture Decisions · Trade-off Analysis

Decision

Chosen

Why

Agent framework

LangGraph

Routing logic is explicit graph code, not prompt instructions — makes escalation paths testable and debuggable

LLM inference

Groq (Llama 3.1)

Sub-200ms inference per agent turn; OpenAI at equivalent quality adds ~400ms per hop in a multi-agent chain

Containerisation

Docker

Guarantees environment parity between local dev and Render production; eliminates "works on my machine" deploys

Frontend

React

Component model maps cleanly to the live ticket stream UI; real-time state updates via polling without a WebSocket layer

<200ms

Avg Response

0

Human Touches

4

AI Agents

01LangGraph-orchestrated routing with two explicit paths: normal tickets flow through triage → escalation check → knowledge → response agents; urgent/fraud tickets route directly to escalation response and human handoff.
02RAG pipeline + Observability — ChromaDB vector store with sentence-transformers embeddings answers knowledge base queries with cited sources. Real-time observability dashboard tracks agent execution frequency, escalation rate, and ticket volume live.
03Sub-200ms concurrent responses under real load. FastAPI async endpoints + Groq's low-latency inference keep response times consistent even at 10+ concurrent tickets.

↗ Live Demo </> Code

02 / Experience

Where I've Shipped.

May 2026
Present

Founding · Current

Full-Stack AI Engineer — Independent

Production Systems · Remote · Pune

Architected and shipped Real-Time AI Voice Agent — Deepgram STT/TTS + Groq LLM, sub-300ms E2E latency, live on Vercel + Render
Built AI Support Pro: autonomous 4-agent customer support system (AutoGen + Groq), zero human involvement, <200ms response
Owned full inference stack: system architecture, API orchestration, FastAPI async backend, CI/CD, production deployment

Mar 2026
May 2026

RLHF · Post-Training

LLM Post-Training Engineer

Confidential · Remote

RLHF data collection: evaluated 50+ model responses daily for quality, safety, and alignment signals
Identified failure modes in multilingual and technical reasoning — feedback fed into production fine-tuning pipeline
Developed deep intuition for LLM behavior, prompt engineering, and model evaluation at production scale

Aug 2024
Oct 2024

Internship

Software Engineer Intern

Confidential · Hybrid

Built automated testing pipelines in Python and PowerShell, reducing manual QA time by ~60%
Designed dashboards for test result visualization and reporting across teams

03 / Stack

Tech
Stack.

Tools I use in production — not a list of things I've Googled. Every item here has been shipped in a real system with real users.

Voice AI5 tools

Deepgram Nova-2Deepgram Aura TTSWebSocketsReal-Time AudioSTT / TTS

LLM & Agents8 tools

Groq (Llama 3.1)ClaudeOpenAIGeminiAutoGenLangChainCrewAIRLHF

Backend6 tools

PythonFastAPINode.jsExpressRESTWebhooks

Frontend5 tools

ReactNext.jsTypeScriptTailwindVanilla JS

Data & Infra7 tools

PostgreSQLMongoDBDockerAWSVercelRenderCI/CD

04 / Contact

Let's build
something
real.

Open to roles across AI Engineering, Full-Stack, Backend (Python/FastAPI), LLM/Prompt Engineering, and Founding Engineer positions at high-growth startups. Remote-first from Pune, India. Available immediately.

Open to Competitive Offers · Immediate Joiner · Remote Worldwide

Email → GitHub → LinkedIn → Live Demo — Voice Agent → Live Demo — AI Support →

Email

suzkhan135@gmail.com

Location

Pune, India · Remote Worldwide

Availability

Immediate Joiner · Remote-First · Open to discuss comp