AI & LLM Engineering Internship
What You'll Own:
You are responsible for making every model call in the platform as fast and cost-efficient as possible. You benchmark inference providers (Groq, Cerebras, Fireworks), configure speculative decoding, and ensure the Model Router always picks the fastest available free-tier endpoint.
Responsibilities:
Benchmark Groq, Cerebras, Together AI, Fireworks, and OpenRouter across latency, throughput, and accuracy on agentic tasks
Configure Ollama 0.3.x with speculative decoding for local fallback models
Build and maintain the ModelRouter class with automatic rate-limit detection and provider rotation
Profile memory and token usage per worker agent and reduce average cost per task by 40%
Write inference optimization documentation for the team
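To give a feel for the ModelRouter work above, here is a minimal sketch of provider rotation with rate-limit cooldowns. The provider names, the cooldown value, and the method names (report_rate_limit, pick) are all illustrative assumptions, not the platform's actual API:

```python
# Hypothetical sketch of a provider-rotating router. Providers are listed in
# preference order (fastest first); a provider that returns HTTP 429 is benched
# for cooldown_s seconds before it becomes eligible again.
import time


class ModelRouter:
    """Pick the first provider that is not currently rate-limited."""

    def __init__(self, providers, cooldown_s=60.0):
        self.providers = list(providers)   # preference order, fastest first
        self.cooldown_s = cooldown_s       # how long a 429'd provider sits out
        self._blocked_until = {}           # provider name -> unblock timestamp

    def report_rate_limit(self, name, now=None):
        # Call this when a request comes back 429: bench the provider.
        now = time.monotonic() if now is None else now
        self._blocked_until[name] = now + self.cooldown_s

    def pick(self, now=None):
        # Return the first provider whose cooldown has expired, or None
        # if every provider is rate-limited (caller should back off).
        now = time.monotonic() if now is None else now
        for name in self.providers:
            if self._blocked_until.get(name, 0.0) <= now:
                return name
        return None
```

In practice the real class would also detect 429s from each SDK's exception types and track per-provider request budgets, but the rotation logic stays this simple.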
Requirements:
Python (intermediate) — can read and write async code
Familiarity with REST APIs and JSON
Basic understanding of LLMs (you know what tokens, temperature, and context windows are)
Bonus: Prior exposure to Ollama, LM Studio, or any local model runner
You'll Learn: vLLM, Cerebras WSE-3 API, OpenRouter free routing, speculative decoding, provider failover architecture
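The provider benchmarking above can start as small as a latency harness like the one below. The call_provider argument is a hypothetical stand-in for a real client call (a Groq, Fireworks, or OpenRouter request); only the timing and percentile logic is shown:

```python
# Hypothetical latency harness: time n calls to a provider and report
# p50/p95 latency in milliseconds.
import statistics
import time


def benchmark(call_provider, n=20):
    """Run call_provider() n times and return p50/p95 latency in ms."""
    samples = []
    for _ in range(n):
        t0 = time.perf_counter()
        call_provider()  # in real use: one chat-completion request
        samples.append((time.perf_counter() - t0) * 1000.0)
    # statistics.quantiles with n=100 yields 99 cut points;
    # index 49 is the 50th percentile, index 94 the 95th.
    cuts = statistics.quantiles(samples, n=100)
    return {"p50_ms": cuts[49], "p95_ms": cuts[94]}
```

Throughput and accuracy need separate harnesses (tokens/second from streamed responses, task-level scoring for agentic benchmarks), but percentile latency is the baseline every provider comparison starts from.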
A. Coding + Problem Solving
Python (mandatory)
Basic DSA (arrays, strings, hashing)
Possible additional task: a short data preprocessing exercise