Sufiyan Khan
0%
AI
AI Engineer · Full-Stack · LLM · Voice · Backend Available Now · Remote-First

I architect production AI systems from infrastructure to UX — real-time voice agents, streaming pipelines, and fault-tolerant backends. Zero-to-deploy in days, not months.

FastAPI Deepgram Groq LLM WebSockets React AutoGen Docker Vercel · Render
Neural Network · Live 3D
0ms
P95 end-to-end voice latency
STT → LLM → TTS streaming pipeline
0
Production API integrations
Deepgram · Groq · FastAPI
0
Live production deploys
Vercel + Render CI/CD
Pune · Remote
India · Available worldwide
Open to competitive offers · Immediate joiner
FastAPI· Deepgram Nova-2· Groq Llama 3.1· AutoGen Agents· WebSockets· Multi-Agent AI· LangChain· CrewAI· Docker · Vercel· RLHF · Post-Training· FastAPI· Deepgram Nova-2· Groq Llama 3.1· AutoGen Agents· WebSockets· Multi-Agent AI· LangChain· CrewAI· Docker · Vercel· RLHF · Post-Training·

Production Systems.

01 — 2026
Live · Production
FastAPI (Python)
Deepgram Nova-2 STT
Groq Llama 3.1 LLM
Deepgram Aura TTS
Vanilla JS · WebSockets
Vercel · Render CI/CD
P95 Latency
<300ms
Mic → STT → LLM → TTS → Speaker

Real-Time AI Voice Agent

End-to-end production voice pipeline — sub-300ms latency, streaming audio, fault-tolerant backend

FastAPI over Flask — async I/O for concurrent voice streams Groq over OpenAI — 10× faster inference at lower cost Deepgram over Whisper — real-time streaming STT vs batch
🎤 Mic Audio Capture Deepgram STT FastAPI /voice Groq LLM Deepgram TTS 🔊 <300ms
Architecture Decisions · Trade-off Analysis
Decision
Chosen
Why
Backend framework
FastAPI
Async I/O handles concurrent voice streams without blocking; Flask's WSGI model creates latency under load
LLM inference
Groq (Llama 3.1)
~10× faster inference vs OpenAI at 1/3 the cost; critical for sub-300ms voice SLA
STT/TTS provider
Deepgram Nova-2
WebSocket streaming (not batch HTTP); enables real-time transcription while audio still being spoken
Frontend stack
Vanilla JS
Zero framework overhead for audio APIs; MediaRecorder + Web Audio API work best without abstraction layers
Deployment split
Vercel + Render
Frontend on edge CDN (Vercel) + Python async backend on Render; independent scaling per layer
<300ms
P95 E2E Latency
4
API Integrations
100%
Real-Time Uptime
  • 01Streaming audio pipeline — Deepgram Nova-2 STT with real-time WebSocket transcription, not batch processing. Audio chunks streamed as spoken, reducing first-token latency by ~200ms vs HTTP upload approach.
  • 02Fault-tolerant error handling — explicit handling for short audio (<0.5s), no-speech detection, API timeouts, and CORS edge cases. Health check endpoint at /status for uptime monitoring.
  • 03Multi-persona architecture — /configure endpoint switches AI persona by hot-swapping system prompts without restarting the server. Conversation history persisted per session with /reset for clean handoffs.
  • 04Interrupt capability — /interrupt endpoint lets users stop AI mid-response, mirroring natural human conversation. Built on server-side abort flags — the API call is actually cancelled to save tokens.
  • 05Production deployment — CORS configured for cross-origin Vercel frontend + Render backend split. Environment secrets managed via Render's secrets manager. Zero-downtime redeploy on push to main.
02 — 2026
Live · Production
LangGraph Multi-Agent
Groq LLM (Llama 3.1)
FastAPI · Docker
React Frontend
Vercel · Render CI/CD
Response Time
<200ms
Triage → Classify → Reply → Closed

AI Support Pro

LangGraph-orchestrated multi-agent support system with RAG pipeline and real-time observability dashboard

LangGraph over AutoGen — explicit graph routing, testable paths Groq over OpenAI — lower latency for real-time ticket replies Docker — environment parity dev → prod, zero surprise deploys
Ingest Classification Agent Triage Agent Priority Router Reply Agent ✓ Closed
Architecture Decisions · Trade-off Analysis
Decision
Chosen
Why
Agent framework
LangGraph
Routing logic is explicit graph code, not prompt instructions — makes escalation paths testable and debuggable
LLM inference
Groq (Llama 3.1)
Sub-200ms inference per agent turn; OpenAI at equivalent quality adds ~400ms per hop in a multi-agent chain
Containerisation
Docker
Guarantees environment parity between local dev and Render production; eliminates "works on my machine" deploys
Frontend
React
Component model maps cleanly to the live ticket stream UI; real-time state updates via polling without a WebSocket layer
<200ms
Avg Response
0
Human Touches
4
AI Agents
  • 01LangGraph-orchestrated routing with two explicit paths: normal tickets flow through triage → escalation check → knowledge → response agents; urgent/fraud tickets route directly to escalation response and human handoff.
  • 02RAG pipeline + Observability — ChromaDB vector store with sentence-transformers embeddings answers knowledge base queries with cited sources. Real-time observability dashboard tracks agent execution frequency, escalation rate, and ticket volume live.
  • 03Sub-200ms concurrent responses under real load. FastAPI async endpoints + Groq's low-latency inference keep response times consistent even at 10+ concurrent tickets.
02 / Experience

Where I've Shipped.

May 2026
Present
Founding · Current

Full-Stack AI Engineer — Independent

Production Systems · Remote · Pune
  • Architected and shipped Real-Time AI Voice Agent — Deepgram STT/TTS + Groq LLM, sub-300ms E2E latency, live on Vercel + Render
  • Built AI Support Pro: autonomous 4-agent customer support system (AutoGen + Groq), zero human involvement, <200ms response
  • Owned full inference stack: system architecture, API orchestration, FastAPI async backend, CI/CD, production deployment
Mar 2026
May 2026
RLHF · Post-Training

LLM Post-Training Engineer

Confidential · Remote
  • RLHF data collection: evaluated 50+ model responses daily for quality, safety, and alignment signals
  • Identified failure modes in multilingual and technical reasoning — feedback fed into production fine-tuning pipeline
  • Developed deep intuition for LLM behavior, prompt engineering, and model evaluation at production scale
Aug 2024
Oct 2024
Internship

Software Engineer Intern

Confidential · Hybrid
  • Built automated testing pipelines in Python and PowerShell, reducing manual QA time by ~60%
  • Designed dashboards for test result visualization and reporting across teams
03 / Stack

Tech
Stack.

Tools I use in production — not a list of things I've Googled. Every item here has been shipped in a real system with real users.

Voice AI5 tools
Deepgram Nova-2Deepgram Aura TTSWebSocketsReal-Time AudioSTT / TTS
LLM & Agents8 tools
Groq (Llama 3.1)ClaudeOpenAIGeminiAutoGenLangChainCrewAIRLHF
Backend6 tools
PythonFastAPINode.jsExpressRESTWebhooks
Frontend5 tools
ReactNext.jsTypeScriptTailwindVanilla JS
Data & Infra7 tools
PostgreSQLMongoDBDockerAWSVercelRenderCI/CD
04 / Contact

Let's build
something
real.

Open to roles across AI Engineering, Full-Stack, Backend (Python/FastAPI), LLM/Prompt Engineering, and Founding Engineer positions at high-growth startups. Remote-first from Pune, India. Available immediately.

Open to Competitive Offers · Immediate Joiner · Remote Worldwide
Email
suzkhan135@gmail.com
Location
Pune, India · Remote Worldwide
Availability
Immediate Joiner · Remote-First · Open to discuss comp