Scaling to 99M Viewers: Architecture Lessons from JioHotstar's Record Traffic

Unknown
2026-03-06
10 min read

Practical architecture lessons from JioHotstar’s 99M-viewer peak: autoscaling patterns, CDN/edge caching, load-testing strategies, and incident runbooks.

When a live event becomes an existential test — and how to be ready

If you run streaming infrastructure, your worst nightmare is a one-off event that catapults traffic into the tens of millions. You have to deliver continuous video, not error pages; keep latency predictable; and avoid origin overload that breaks downstream services. Public reports in early 2026 that JioHotstar attracted 99 million digital viewers for a Women’s World Cup final are more than headlines — they’re a field manual for anyone building resilient live-streaming platforms.

Why JioHotstar’s public metrics matter to engineers

The headline number — 99M viewers — is useful only when converted into operational requirements. Every viewer translates into a stream, a manifest, repeated HTTP requests for segments, CDN allocations, and orchestration pressure on transcoding and origin services. For platform teams, the immediate questions are:

  • How much traffic reaches origin vs. edge?
  • How do you autoscale the ingest/transcode pipeline without minutes-long spin-up times?
  • What must your cache hit ratio be to keep origin load manageable?
  • How do you verify the whole chain before the event?

Quick math that shapes architecture decisions

Use this as a rule-of-thumb planning model to convert viewer counts into request and bandwidth expectations. Pick conservative numbers for segment length and bitrate, then apply CDN hit ratios to estimate origin load.

  1. Assume segment length = 6s (HLS/DASH common). Each viewer requests roughly 1/6 ≈ 0.167 segments/sec.
  2. For 99M viewers: 99,000,000 × 0.167 ≈ 16.5 million segment requests/sec at the edge.
  3. If average segment size = 1 MB (≈1.33 Mbps per viewer across adaptive-bitrate profiles), aggregate edge bandwidth ≈ 16.5 TB/s (≈ 132 Tbps).
  4. With a strong CDN and an edge cache hit ratio of 99.9% for segments, origin sees only 0.1% of requests: ≈16.5k req/sec and ≈132 Gbps — a load a shielded origin tier can absorb.

These numbers show why CDN and edge caching are non-negotiable. Small improvements in cache-hit % produce huge reductions in origin pressure.
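The planning model above can be sketched as a few lines of Python. The inputs are the illustrative assumptions from the list (6 s segments, 1 MB average segment, 99.9% hit ratio), not JioHotstar's actual numbers:

```python
# Rule-of-thumb capacity model: viewer count -> edge and origin load.
# All inputs are illustrative planning assumptions, not measured values.
viewers = 99_000_000
segment_seconds = 6          # HLS/DASH segment length
segment_bytes = 1_000_000    # ~1 MB average segment across ABR profiles
edge_hit_ratio = 0.999       # target CDN cache hit ratio for segments

edge_rps = viewers / segment_seconds          # segment requests/sec at the edge
edge_bps = edge_rps * segment_bytes * 8       # aggregate edge bandwidth, bits/sec

origin_rps = edge_rps * (1 - edge_hit_ratio)  # only cache misses reach origin
origin_bps = edge_bps * (1 - edge_hit_ratio)

print(f"edge:   {edge_rps/1e6:.1f}M req/s, {edge_bps/1e12:.0f} Tbps")
print(f"origin: {origin_rps/1e3:.1f}k req/s, {origin_bps/1e9:.0f} Gbps")
```

Re-running this with a 99.0% hit ratio instead of 99.9% multiplies origin load tenfold, which is the quantitative case for treating cache-hit percentage as a first-class SLO.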

Architecture patterns distilled from the event

1) Autoscaling must be fast, predictive, and multi-layered

Reactive autoscaling alone fails at this scale because cold-start latency becomes visible across millions of viewers. Production-grade systems combine:

  • Predictive scaling: Use event schedules, historical patterns, and simple calendar-based policies to pre-scale encoder farms, ingress, and critical control-plane services.
  • Fast reactive scaling: Kubernetes HPA combined with VPA and KEDA for event-driven bursts (Kafka consumer lag, message-queue depth).
  • Warm pools and standby instances: Keep pre-warmed VMs/containers for transcoding and origin edge shields to remove spin-up delays.

Example: a Kubernetes HPA for an ingress fleet based on custom metrics (requests/sec per pod):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ingress-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ingress
  minReplicas: 20
  maxReplicas: 2000
  metrics:
  - type: Pods
    pods:
      metric:
        name: requests_per_second
      target:
        type: AverageValue
        averageValue: "1000"

Tip: use custom metrics exported to Prometheus + external metrics adapter instead of CPU-based autoscaling for network-bound services.

2) Edge caching, origin shielding, and manifest strategy

At scale, the CDN is your primary server. Design caches so that the origin rarely sees segment requests:

  • Long-lived media segments: Increase segment length slightly where latency tolerances allow (e.g., 6–8s), which lowers request rate per viewer.
  • Manifest (m3u8/mpd) caching: Cache manifests aggressively, but allow short TTLs for ABR updates. Accidental manifest purges cause immediate cache miss storms.
  • Cache key hygiene: Use stable cache keys for identical segments across bitrates and CDNs where possible. Strip transient query params at edge.
  • Origin shielding: Add a multi-tier cache (edge → regional shield → origin) to prevent independent PoPs from simultaneously hitting origin.
  • Cache warming: Pre-populate edge caches with initial segments and top-bitrate variants before kickoff using a "cache fill" job.

Example CDN policy: Cache segments (Cache-Control: public, max-age=86400) and set short TTL for manifests (Cache-Control: max-age=4).
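That TTL split can live in a small edge function. The sketch below assumes conventional live-streaming path suffixes (.m3u8/.mpd for manifests, .ts/.m4s for segments); the names and values are illustrative, not a specific CDN's API:

```python
# Sketch of the manifest-vs-segment TTL split described above.
# Path conventions and TTL values are illustrative assumptions.
SEGMENT_TTL = 86400   # published segments are immutable: cache for a day
MANIFEST_TTL = 4      # live manifests change every segment: keep TTL short

def cache_control(path: str) -> str:
    """Return the Cache-Control header an edge function might set."""
    if path.endswith((".m3u8", ".mpd")):
        return f"public, max-age={MANIFEST_TTL}"
    if path.endswith((".ts", ".m4s", ".mp4")):
        return f"public, max-age={SEGMENT_TTL}, immutable"
    return "no-store"  # don't cache anything unexpected

print(cache_control("/live/stream.m3u8"))   # public, max-age=4
print(cache_control("/live/segment-1.ts"))  # public, max-age=86400, immutable
```

The `immutable` directive on segments matters: it tells clients and intermediaries never to revalidate, which removes a whole class of conditional-request traffic during the peak.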

3) Load testing: test edges, not just origin

Traditional load tests that hammer origins are useful but insufficient. At this level you must validate CDN behavior, peering, and regional limits.

  • Distributed load generators: Run traffic from cloud regions and on-prem POPs to mimic geographic concurrency. Tools: k6 (distributed via Kubernetes), Locust, Vegeta, or custom Go/Node runners.
  • Emulate playback patterns: Simulate manifest fetch, segment request cadence, rate switching (ABR), rebuffer events, and range requests.
  • Measure end-to-end: Monitor client-side metrics (startup time, buffer health), CDN edge metrics (hit ratio, 5xx) and origin metrics.
  • Test plan scale strategy: You don’t need 99M simulated clients; instead, run many concurrent clients per generator from distributed locations to validate cache behavior and egress limitations per POP.

Sample k6 snippet (conceptual) to exercise manifests and segments:

import http from 'k6/http';
import { sleep } from 'k6';

export let options = { vus: 1000, duration: '10m' };

export default function () {
  const res = http.get('https://edge.example.com/live/stream.m3u8');
  // Non-comment lines in an HLS manifest are segment URIs.
  const segments = res.body.split('\n').filter((l) => l && !l.startsWith('#'));
  // Fetch the two newest segments, as a live player would.
  for (const seg of segments.slice(-2)) {
    http.get(`https://edge.example.com/live/${seg}`);
  }
  sleep(6); // wait one segment duration before the next manifest poll
}

4) Incident response and graceful degradation

Prepare runbooks that focus on fast decisions and minimizing user-facing impact. For big events you must be able to triage, throttle, and degrade in seconds.

  • Runbooks: Maintain runbooks for common failure modes: cache purge, origin overload, encoder failure, CDN peering outage.
  • Feature gating: Use feature flags to disable optional features (chat overlays, low-latency modes, statistics widgets) to reduce load quickly.
  • Traffic shaping: Implement per-region rate limits and low-bitrate fallback pools that can be enabled centrally.
  • Surge ops: Pre-assign escalation channels (Slack plus a dedicated incident bridge) and stage human-in-the-loop automation to flip routing/config safely.
  • Monitoring & alerts: Alert on symptoms, not single metrics: sudden cache-hit drops + rising origin 5xx = immediate guardrails.

“At extreme scale, your ability to degrade cleanly matters more than your ability to add features.”
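
The symptom-based trigger and feature-flag degradation above can be wired together in a few lines. The flag names and thresholds here are hypothetical placeholders; the point is that guardrails fire only when two independent symptoms agree:

```python
# Sketch: trip degradation only when two signals co-occur
# (edge cache-hit drop AND rising origin 5xx). Thresholds are illustrative.
def should_degrade(edge_hit_ratio: float, origin_5xx_rate: float) -> bool:
    return edge_hit_ratio < 0.99 and origin_5xx_rate > 0.01

def apply_guardrails(flags: dict, edge_hit_ratio: float, origin_5xx_rate: float) -> dict:
    """Flip non-essential feature flags off when symptoms co-occur."""
    if should_degrade(edge_hit_ratio, origin_5xx_rate):
        for feature in ("chat_overlay", "stats_widget", "low_latency_mode"):
            flags[feature] = False
        flags["low_bitrate_fallback"] = True
    return flags

flags = {"chat_overlay": True, "stats_widget": True,
         "low_latency_mode": True, "low_bitrate_fallback": False}
apply_guardrails(flags, 0.999, 0.001)  # healthy: nothing changes
apply_guardrails(flags, 0.97, 0.03)    # both symptoms present: degrade
print(flags["chat_overlay"], flags["low_bitrate_fallback"])  # False True
```

Requiring both symptoms avoids flapping on a single noisy metric, while keeping the reaction fast enough to matter during a surge.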

5) Scalable transcoding and ingest pipelines

Transcoding pipelines must be horizontally scalable, containerized, and leverage hardware where possible. Key patterns:

  • Profile set rationalization: Reduce bitrates/variants to the minimum you can while meeting SLOs — each variant increases CDN storage and origin API costs.
  • Hardware acceleration: Use GPUs or dedicated ASIC encoders for high-density live transcoding to reduce per-stream cost and latency.
  • Stateless workers: Keep encoders stateless and drive them via message queues. Use durable queues (Kafka, Pulsar) for ingest metadata and job orchestration.
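
A stateless worker, sketched minimally: every piece of state travels in the job message, so any worker can pick up any job. A durable queue (Kafka, Pulsar) stands in for the local queue here, and the profile names, URLs, and FFmpeg flags are illustrative:

```python
# Sketch of a stateless transcode worker: jobs arrive on a queue and each
# worker builds an FFmpeg command from the job message alone.
# Profile names, URLs, and encoder flags are illustrative assumptions.
import queue

PROFILES = {"720p": "scale=-2:720", "1080p": "scale=-2:1080"}

def build_ffmpeg_cmd(job: dict) -> list:
    """All state lives in the job message, so any worker can handle it."""
    return ["ffmpeg", "-i", job["input_url"],
            "-vf", PROFILES[job["profile"]],
            "-c:v", "h264", job["output_url"]]

jobs = queue.Queue()  # stand-in for a durable Kafka/Pulsar topic
jobs.put({"input_url": "rtmp://ingest/stream", "profile": "720p",
          "output_url": "s3://bucket/stream_720p.m3u8"})

while not jobs.empty():
    cmd = build_ffmpeg_cmd(jobs.get())
    print(" ".join(cmd))  # a real worker would exec this instead
```

Because workers hold no local state, scaling the pool is purely a matter of adding consumers to the queue — exactly what the warm pools in the autoscaling section pre-provision.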

6) Network & transport optimizations

In late 2025 and into 2026, HTTP/3 and QUIC have become standard features on major CDNs, reducing connection churn and improving throughput for mobile-heavy audiences. Adopt the stack:

  • HTTP/3 at edge: Leverage HTTP/3 between client and edge to reduce head-of-line blocking and TCP handshake costs for high-concurrency bursts.
  • Anycast and POP density: Ensure your CDN partner has dense PoP coverage in target regions. Multi-CDN with intelligent steering reduces risk of single-CDN saturation.
  • TLS session reuse and key caching: Tune TLS session resumption to avoid handshake pressure during peaks.

Practical playbook — what to do before, during, and after an event

Before: prepare and validate

  • Run a rehearsal test at 50–70% of expected peak across geographies, including CDN edges.
  • Warm caches by seeding the CDN with initial segments and manifests 30–60 minutes before start.
  • Pre-scale encoder/transcode pools and verify readiness (health probes green) with zero-downtime deployment patterns.
  • Activate enhanced telemetry: client-side RUM, edge hit-rate dashboards, origin 5xx heatmap.
  • Ensure runbooks are current and that the incident bridge is provisioned with key personnel.

During: operate and contain

  • Keep a small, empowered ops cell for live changes.
  • Watch for early indicators: manifest cache miss spike, regional 5xx rise, transcode queue backlog.
  • If origin overload occurs, enable origin shield and throttle new manifest generations. Gracefully disable non-essential features.
  • Use multi-CDN steering to redirect traffic away from a suffering CDN PoP automatically.
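
The steering decision in that last bullet reduces to a small health-gated selection function. CDN names and the error threshold below are illustrative; production steering would also weight by capacity and geography:

```python
# Sketch of health-based multi-CDN steering: route new sessions away from
# any CDN whose regional 5xx rate crosses a threshold. Names are illustrative.
import random

def pick_cdn(health: dict, threshold: float = 0.02) -> str:
    """Choose among CDNs whose 5xx rate is below the threshold."""
    healthy = [cdn for cdn, err in health.items() if err < threshold]
    candidates = healthy or list(health)  # never strand traffic: fall back to all
    return random.choice(candidates)

health = {"cdn-a": 0.001, "cdn-b": 0.15}  # cdn-b's PoP is suffering
print(pick_cdn(health))  # cdn-a
```

Note the fallback: if every CDN looks unhealthy, steering away from all of them would be worse than degraded delivery, so the function returns to the full pool rather than nothing.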

After: learn and harden

  • Conduct a blameless postmortem with metrics and timelines attached.
  • Update SLOs and runbooks based on observed behaviors and false assumptions.
  • Automate improvements: cache warming automation, predictive scaling rules updated from real traffic patterns.

Tooling & DevOps packages that accelerate readiness

For teams building portable, repeatable setups, these tools and packages matter in 2026:

  • Observability: Prometheus + Grafana (kube-prometheus-stack), OpenTelemetry for tracing across CDN/edge/origin.
  • Load testing: k6 (distributed), Locust, Vegeta; run via Kubernetes jobs and scale generators per region.
  • Autoscaling: KEDA for event-driven scaling (Kafka, SQS), HPA/VPA for general workloads, and cloud provider predictive autoscaler features.
  • Transcoding: Portable Docker images for FFmpeg-based pipelines, GPU-enabled container runtimes (NVIDIA device plugin), and managed encoder services where latency and control allow.
  • Edge devkits: Cloudflare Workers, Lambda@Edge or equivalent for manifest manipulation, signed URL generation, and lightweight ABR logic at the edge.
  • Storage: S3-compatible object storage with lifecycle policies and region-replication; MinIO for portable local testing clusters.

Applying the lessons: a short blueprint inspired by JioHotstar

Imagine you expect a 50M–100M-viewer event in your market. Use this blueprint:

  1. Engage a multi-CDN strategy and force early contract SLAs for POP capacity. Validate PoP connectivity with small-scale pre-tests.
  2. Design manifests and segment naming so multiple CDNs can cache the same objects (identical URLs or signed, normalized keys).
  3. Pre-warm edge caches with the first N segments and top-bitrate variants a few minutes before kickoff using an automated job; check edge hit ratios continuously.
  4. Implement a predictive scaler to bring encoder/transcode capacity to 60–80% ahead of start, with fast reactive autoscaling to handle surprises.
  5. Prepare runbooks that allow a single operator to disable non-critical features (chat, overlays) and switch to low-bitrate fallback pools in under 60 seconds.

Late 2025 and early 2026 cemented a few trends that directly affect live-stream design:

  • HTTP/3 and QUIC mainstreaming: Better congestion handling and faster resumption reduce startup stalls, especially over mobile networks.
  • Edge compute convergence: CDNs now do more than cache — they host logic to stitch manifests, sign URLs, and even run ABR heuristics at the edge.
  • Observability at the edge: New telemetry channels make edge metrics (cache misses, origin-fallback rates) first-class, enabling faster automated guardrails.
  • AI-assisted incident ops: Early 2026 tools assist with anomaly detection and suggested mitigation steps, but human-run playbooks still win in high-stakes events.

Actionable takeaways (your 10-point checklist)

  1. Calculate request + bandwidth expectations from viewer counts and segment cadence.
  2. Target ≥99.9% edge cache hit for segments to keep origin load manageable.
  3. Pre-warm caches and seed popular segments before kickoff.
  4. Use predictive autoscaling plus fast-reactive layers (HPA + KEDA + warm pools).
  5. Run distributed load tests that validate CDN behavior and peering, not just your origin.
  6. Keep lightweight fallbacks (low-bitrate pools, disable extras) that can be flipped via feature flags.
  7. Implement origin shielding and regional edge-to-shield routing to avoid cache stampedes.
  8. Instrument everything — client RUM + edge invocations + origin metrics in a single pane.
  9. Plan and rehearse incident response with a blameless postmortem culture.
  10. Adopt new transport technologies (HTTP/3) and edge compute patterns as they mature in your CDN partner stack.

Final note — scale is a system property, not a single team’s job

JioHotstar’s public 99M viewers figure is a reminder: at extreme scale, architecture choices compound. A small cache-miss policy, an unprepared transcode pool, or a slow autoscaler becomes a system-level outage. The reliable path is to treat events as integrated systems — orchestrated caches, prepared encoders, predictable autoscaling, and a rehearsed on-call team.

Call to action

If you’re responsible for event-scale streaming, start by running a CDN-edge rehearsal inside your staging environment this quarter. Use the checklist above; seed your caches; run distributed k6 tests from multiple regions; and practice the runbooks under load. Want a starter kit — Helm charts, k6 scenarios, example runbooks, and Prometheus dashboards tuned for streaming events? Download our open playbook and deployable packages to get a repeatable, portable baseline for event readiness.
