Data centres, end to end
From a single drive to a sovereign-cloud build. Every concept you need to read a vendor pitch, evaluate a colo contract, or sit on a procurement call without faking it. Written for someone who has never set up infrastructure before.
Final pod configuration locked: 5 storage + 1 control node, all-NVMe Supermicro 1U, EPYC 7313P, 8 × 30.72 TB Samsung PM9A3, refurbished SN2700 100 GbE, EC k=4,m=1. 1.23 PB raw / ~983 TB usable. Supersedes the earlier three-variant analysis.
01 The simplest mental model
A storage cloud is just three things stitched together. The hardware is dumb. The software is the product.
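The three: servers, a network fabric connecting them, and software that presents them to the customer as one logical pool.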
02 Inside one server (a “node”)
Every box in the rack has the same internal anatomy. Trace bytes from the network port to the spinning disk.
What each component does, in plain terms
CPU — the brain
Runs the OS, RustFS, Jam. EPYC is AMD's data-centre CPU line; Xeon is Intel's equivalent. EPYC chips bring lots of cores (parallel workers) and lots of PCIe lanes (~128 on a single socket) — both are what you need for storage.
We picked the EPYC 7313P: 16 cores, 3.0 GHz, 155 W, Milan generation, single-socket. Why this chip:
- 16 cores is enough. Jam doesn't need many — compression is fast and mostly memory-bandwidth-bound. Paying for 32+ cores would be wasted money.
- Milan over Genoa. Genoa is the newer, faster generation but PCIe Gen 5 NVMe drives are still expensive and our PM9A3 drives are Gen 4. Milan + Gen 4 runs cooler (~20% less power) and is significantly cheaper.
- Single socket. Two-socket boards add cost, complexity, and a second NUMA node that software has to dance around. Not needed at this scale.
- “P” suffix = single-socket SKU only. Locked at one socket but cheaper than the dual-socket version.
A core is a complete CPU on its own. Threads are virtual cores — usually 2× physical with hyperthreading. NUMA = Non-Uniform Memory Access, the latency penalty when a core reaches across to RAM attached to another socket.
RAM — short-term memory
Where data sits temporarily before it lands on disk. Critical for our workload because RustFS uses it as a write buffer — incoming bytes pile up here while Jam compresses them, then flush to drives in larger chunks. Larger buffer = smoother throughput under bursty load.
ECC = Error-Correcting Code. Cosmic rays do flip RAM bits in production (real, measurable, ~1 flip per GB per year). Without ECC, those flips silently corrupt data. ECC catches and fixes them. Mandatory for production storage.
PCIe bus — the internal highway
The high-speed pipe inside the server connecting CPU to everything (NICs, NVMe drives, HBAs, GPUs). Speed is measured in lanes:
- PCIe Gen 4: ~2 GB/s per lane. A 16-lane (x16) GPU slot = 32 GB/s.
- PCIe Gen 5: ~4 GB/s per lane (double Gen 4).
- Each NIC, NVMe, HBA needs lanes. Servers compete for them — that's why EPYC's 128 lanes matter.
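To make the lane arithmetic concrete, here's a minimal sketch tallying one storage node's lane budget against the 7313P's 128 lanes. The slot widths are our assumed configuration (each U.2 drive ×4, the dual-port NIC ×16; the boot M.2 is an assumption, it isn't in the BOM):

```python
# PCIe lane budget for one storage node. Slot widths are assumptions,
# not confirmed BOM details.
LANES_AVAILABLE = 128  # EPYC 7313P, single socket

devices = {
    "8 x PM9A3 U.2 NVMe (x4 each)": 8 * 4,      # 32 lanes
    "ConnectX-5 dual-port 100 GbE (x16)": 16,
    "boot M.2 (x4, assumed)": 4,
}

used = sum(devices.values())
for name, lanes in devices.items():
    print(f"{name:38s} {lanes:3d} lanes")
print(f"{'total':38s} {used:3d} / {LANES_AVAILABLE}")
print(f"headroom: {LANES_AVAILABLE - used} lanes left over")
```

Even fully loaded, the node uses well under half the socket's lanes — that headroom is why single-socket EPYC works for storage.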
NIC — Network Interface Card
The card that connects the server to the outside world. Speed of NIC = speed at which the server can talk to anything outside it.
- 1 GbE = 1 gigabit per second = ~125 MB/s. Home internet speeds. Useless for storage.
- 10 GbE = 1.25 GB/s. Old data centre standard. A single NVMe drive can saturate it on its own.
- 25 GbE = 3.1 GB/s. Common but borderline for our workload.
- 100 GbE = 12.5 GB/s. What we want. Mellanox is the brand that dominates this tier.
HBA — Host Bus Adapter
The card that connects CPU to physical drives. Two modes:
- JBOD / passthrough: each drive shows up as an independent device. The OS (and RustFS) sees them all directly. This is what we want.
- Hardware RAID: the HBA does redundancy math itself, presents one logical “drive” to the OS. Old-school. Bad for our stack because RustFS does redundancy in software and hardware RAID would hide failures from us.
Drives — the actual storage
Three families exist in the market. We use NVMe-only across all five storage nodes.
HDD — archival
Hard Disk Drive. Spinning platters, mechanical head. ~250 MB/s sequential, 150–250 IOPS random. ~$15/TB at 20 TB. Not in our pod. May reappear at the 1 MW build for a true bulk archive tier.
SATA SSD — middle tier
Flash drives on the legacy SATA bus. ~600 MB/s ceiling. Cheaper than NVMe per TB but slower. Bottlenecks badly on writes. Not in our pod.
NVMe — our pick
Flash chips talking PCIe directly. ~6–7 GB/s sequential, 100,000+ IOPS. The PM9A3 30.72 TB drive: ~$0.06–0.10/GB at scale.
Why the PM9A3 specifically
- 30.72 TB per drive. 8 drives × 30.72 TB = 245.76 TB raw per node. Five nodes = 1.23 PB raw in 5U of rack space. Insane density.
- U.2 form factor. Hot-swappable from the front of the chassis. Replace a failed drive without taking the server down.
- TLC NAND. Triple-Level Cell — three bits per memory cell. Density vs endurance trade-off. Modern TLC is reliable enough for production. (QLC = 4 bits, denser but lower endurance, riskier.)
- 1 DWPD endurance. “Drive Writes Per Day” — you can rewrite the entire drive once per day for the warranty period (5 years) before flash wears out. For our workload (write-once, read-many), this is plenty; the worked numbers after this list show why. Higher-write workloads need 3 DWPD or “Mixed Use” drives.
- PCIe Gen 4 ×4. Each drive gets 4 lanes of Gen 4 = ~8 GB/s ceiling. Real-world ~6.5 GB/s. We don't go to Gen 5 because the cost premium isn't justified for this generation of drives.
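A quick worked check on what the 1 DWPD figure above buys at this capacity — a sketch, using the 5-year warranty term from the bullet:

```python
# Endurance budget: 1 DWPD on a 30.72 TB drive over a 5-year warranty.
capacity_tb = 30.72
dwpd = 1.0
warranty_years = 5

total_writes_pb = capacity_tb * dwpd * 365 * warranty_years / 1000
print(f"rated writes over warranty: {total_writes_pb:.1f} PB per drive")
# ~56 PB of writes per drive -- a write-once, read-many workload
# never gets anywhere near this.
```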
03 Multiple servers — the cluster
A cloud needs multiple servers, a network fabric connecting them, and software that coordinates them as one logical pool.
A note on jargon you'll hear:
- North-south traffic — bytes moving in and out of the cluster (customer ↔ servers). Goes through the border router.
- East-west traffic — bytes moving between nodes inside the cluster (replication, EC chunk distribution, rebalancing). Goes through the ToR switch. Far higher volume than north-south, which is why the ToR fabric matters so much. Our pod sustains ~1 Tbps east-west.
- Colo — short for colocation. A facility that rents you rack space, power, and network uplinks. You bring your own servers; they bring the building.
- Rack / cabinet — a metal frame that holds servers stacked vertically, typically 42–48U tall. Standard width; height is measured in U (1U = 1.75 inches). A 2U server takes up 2 slots. Our entire pod fits in 7–9U of a 42U cabinet (6 × 1U servers + 1U ToR + 1U OOB).
- DAC vs AOC cables — DAC (Direct Attach Copper) is rigid, cheap (~$50–80/cable), good for runs under 3 m. AOC (Active Optical Cable) is flexible, expensive (~$200–400), good for longer runs. Inside one cabinet, DAC wins.
- OOB — Out-of-Band management. A second, slow network (1 GbE) used to manage servers when the primary network is down. Every server has a dedicated management port (Dell calls it iDRAC, Supermicro calls it IPMI/BMC). Lets you reboot or reinstall a node remotely.
04 The software stack — bottom to top
Hardware is the bottom. The product is the top. Each layer rests on the one below it.
What each layer does, in plain terms
- Layer 0 — Hardware. The physical box. Boring, expensive, breaks occasionally.
- Layer 1 — OS. The operating system everything else runs on. AlmaLinux is a free Red Hat clone; with the FIPS-140 module it's a US-government-approved cryptography baseline. Ubuntu is the mainstream commercial Linux.
- Layer 2 — JBOD. A configuration choice on the HBA, not separate software. Tells the OS to treat each drive as its own device.
- Layer 3 — Jam. Our codec. Sits between RustFS and the disks. Every byte going to disk gets compressed first; every byte read gets decompressed first. Customer never sees Jam — they see their original data.
- Layer 4 — RustFS. The S3-compatible object server. Customers talk to this layer using standard S3 commands — the same ones they use for AWS S3 (see the example after this list). RustFS is what makes the pod look like AWS S3 from outside. Buffers writes in RAM for snappy demos.
- Layer 5 — Platform. The Strata-specific services that wrap raw storage: dashboards, billing, restore-proof, AI inference. This is where we differentiate from “cheap S3.”
- Layer 6 — Pricing tiers. The three commercial wrappers we sell.
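Because Layer 4 speaks real S3, the stock AWS SDK works against the pod unchanged — a minimal sketch (the endpoint URL, credentials, and bucket name are placeholders, not real values):

```python
import boto3

# Point the standard AWS SDK at the RustFS endpoint instead of AWS.
# Endpoint and credentials are illustrative placeholders.
s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.example-pod.internal",
    aws_access_key_id="TENANT_KEY",
    aws_secret_access_key="TENANT_SECRET",
)

s3.put_object(Bucket="tenant-bucket", Key="my-file.dat", Body=b"hello")
obj = s3.get_object(Bucket="tenant-bucket", Key="my-file.dat")
print(obj["Body"].read())  # b'hello' -- same bytes back, Jam is invisible
```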
05 Where Jam sits — the data path
Trace what happens when a customer uploads a 100 GB file.
S3 PUT my-file.dat (100 GB) → NIC → RustFS RAM buffer → Jam compress → EC chunks → NVMe.

The bottleneck wins. If the network is 25 GbE, the chain caps at 25 Gbps. If the disk is HDD, the chain caps at HDD speed. If RAM is too small, RustFS thrashes. Compression ratio is set by the codec, not the disk — HDD vs NVMe doesn't change 3.86×.
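"The bottleneck wins" is just a min() over the chain. A sketch with illustrative stage speeds — only the 100 GbE NIC and ~6.5 GB/s per drive come from the spec; the RAM and Jam figures are placeholders:

```python
# Time to ingest 100 GB: the slowest stage sets the pace for the chain.
FILE_GB = 100
RATIO = 3.86  # Jam compression ratio -- set by the codec, not the drive

stages_gbps = {                           # gigabits per second
    "NIC (100 GbE)": 100,
    "RAM buffer (placeholder)": 400,
    "Jam compress (placeholder)": 160,
    "NVMe array (8 x ~6.5 GB/s)": 8 * 52,
}

bottleneck = min(stages_gbps, key=stages_gbps.get)
seconds = FILE_GB * 8 / stages_gbps[bottleneck]
print(f"bottleneck: {bottleneck} -> {seconds:.0f} s for {FILE_GB} GB")
print(f"bytes landing on disk: {FILE_GB / RATIO:.1f} GB after compression")
```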
06 The final BOM — what we're actually building
After the three-variant analysis (HDD bulk vs NVMe demo paths), we landed on a simpler, denser, cost-optimised configuration: five identical all-NVMe storage nodes plus one control / GPU node. Same chassis everywhere.
Why this shape — five identical all-NVMe nodes
- Five nodes map perfectly onto EC k=4,m=1. One chunk per node, 25% storage overhead, survives loss of any one node. Three nodes would have forced k=2,m=1 (50% overhead) — far less efficient. (Capacity math after this list.)
- All-NVMe collapses the tiering question. No HDD/NVMe split, no “demo path vs bulk path,” no decision tree. Every node is a hot node.
- 1U Supermicro chassis is denser and cheaper. Six servers + ToR + OOB = 7–9U total. Leaves 30+ U in the cabinet for future expansion.
- Refurbished networking. SN2700 ToR and ConnectX-5 NICs both bought used with vendor warranty — significant capex saving over new SN3700C.
- DAC cables. All node-to-ToR runs are under 3 m inside the cabinet, so passive copper DAC cables work and are 4–5× cheaper than active optical.
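The headline capacity numbers fall straight out of the EC parameters — a quick check:

```python
# Pod capacity: raw vs usable under EC k=4, m=1.
drives_per_node, nodes, drive_tb = 8, 5, 30.72
k, m = 4, 1

raw_tb = drives_per_node * nodes * drive_tb   # 1228.8 TB = ~1.23 PB
usable_tb = raw_tb * k / (k + m)              # parity eats m/(k+m) of raw
print(f"raw:    {raw_tb:.1f} TB")
print(f"usable: {usable_tb:.1f} TB (overhead {m/k:.0%})")
# -> 983.0 TB usable, matching the spec sheet.
```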
Storage nodes (× 5)
| Chassis | Supermicro AS-1115HS-TNR — 1U, 8-bay U.2 NVMe front access |
| CPU | AMD EPYC 7313P — 16C / 3.0 GHz / 155 W · Milan-gen · single socket · “P” SKU |
| RAM | 256 GB DDR4-3200 ECC (4 × 64 GB RDIMMs) — upgrade to 512 GB once paying users justify it |
| Drives | 8 × Samsung PM9A3 30.72 TB U.2 NVMe — PCIe Gen 4 · TLC NAND · 1 DWPD endurance |
| NIC | Mellanox ConnectX-5 dual-port 100 GbE — refurbished with vendor warranty |
| Power | Dual hot-plug PSU (A + B feed) |
| Per-node raw | ~245.76 TB · 200 Gbps network · ~700 W typical draw |
Control / GPU node (× 1)
| Chassis + CPU + NIC | Identical to storage nodes (parts commonality, simpler ops) |
| RAM | 256 GB initially → 512 GB when GPU workload justifies |
| GPU | NVIDIA L4 — 24 GB · 72 W · ~$2–3K · handles inference workloads · upgrade to L40S only when training revenue lands |
| Drives | 2 × 1.92 TB NVMe — boot + local cache (no bulk storage on this node) |
| Hosts | RustFS coordinator · customer dashboards · billing · audit logs · AI inference · security-ops VM (Lucas's Kali pattern, runs here until a regulated tenant pays for a dedicated box) |
Networking
| ToR switch | Mellanox SN2700 — 32-port 100 GbE · refurbished (~$5–8K used vs ~$15–20K new SN3700C) |
| OOB switch | Any 1 GbE managed switch (Netgear / TP-Link enterprise, ~$200) for IPMI / BMC access |
| Cabling | 14 × 100 GbE DAC (passive copper, <3 m) + 7 × Cat6A for OOB |
Rack & power
| Cabinet | 1 × 42U — colo-provided, included in lease |
| PDUs | 2 × vertical 32A · A + B feed · colo-provided |
| Space used | 7–9U occupied (6 × 1U servers + 1U ToR + 1U OOB) |
Pod totals
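- 1.23 PB raw · ~983 TB usable (EC k=4,m=1)
- 7–9U occupied (6 × 1U servers + 1U ToR + 1U OOB)
- 4–5 kW typical / ~6 kW peak draw
- ~$60–80K capex · ~$5–8K/month opex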
07 Storage performance metrics
Three numbers describe every storage system. They mean different things and you can't substitute one for another.
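The three: throughput (MB/s or GB/s streamed), IOPS (separate operations completed per second), and latency (how long a single operation takes).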
Why these are different and why it matters
Imagine moving 1 GB of data:
- One 1 GB file at 250 MB/s sequential takes 4 seconds (HDDs are fine).
- One million 1 KB files at 200 IOPS takes 5,000 seconds = 83 minutes (HDDs are useless; you need NVMe).
Same total data. Different access pattern. Pick the drive based on the workload, not just the capacity.
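The arithmetic behind those two bullets, as a sketch you can re-run for any drive profile:

```python
# Same 1 GB of data, two access patterns, two drive profiles.
def seconds_sequential(total_mb, mb_per_s):
    return total_mb / mb_per_s

def seconds_random(n_ops, iops):
    return n_ops / iops

print(f"HDD,  one 1 GB file:   {seconds_sequential(1024, 250):8.0f} s")
print(f"HDD,  1M x 1 KB files: {seconds_random(1_000_000, 200):8.0f} s")
print(f"NVMe, 1M x 1 KB files: {seconds_random(1_000_000, 100_000):8.0f} s")
```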
Sequential vs random
Sequential
Reading or writing bytes in order, one after the other. HDDs are great at this — the head doesn't have to move.
Examples: streaming video, full-disk backups, large file uploads.
Random
Reading or writing bytes in unpredictable locations. HDDs are terrible because the mechanical head has to seek every time. NVMe doesn't care — no moving parts.
Examples: database queries, virtual machine boot disks, web app traffic.
Read vs write
SSDs and NVMe drives are usually faster at reads than writes. Write-heavy workloads need different drives (or more drives) than read-heavy ones. Vendors often quote the more flattering of the two — read the spec sheet carefully.
08 Durability vs availability — and the “nines”
Two different promises that sound similar. Customers conflate them. Don't.
Durability — “will I lose data?”
Probability that a stored byte is still there next year. Measured in “nines.”
- 9 nines (99.9999999%): ~1 object lost per billion per year.
- 11 nines (99.999999999%): AWS S3's stated durability design target. ~1 per 100 billion.
Driven by: how many copies you keep, how independent the failure modes are, how fast you detect and repair.
Availability — “can I read my data right now?”
Percentage of time the system answers requests. Measured in “nines” of uptime.
- 3 nines (99.9%): 8.76 hours downtime/year
- 4 nines (99.99%): 52 minutes/year
- 5 nines (99.999%): 5.26 minutes/year — telecoms standard
Driven by: redundant power, redundant network, software failover, geographic distribution.
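Both kinds of nines reduce to the same arithmetic — a sketch converting a nines count into downtime per year (availability) or expected annual losses (durability):

```python
# "Nines" arithmetic for availability and durability.
def downtime_minutes_per_year(nines):
    return 10 ** -nines * 365.25 * 24 * 60

def expected_losses_per_year(nines, objects):
    return objects * 10 ** -nines

for n in (3, 4, 5):
    print(f"{n} nines availability -> {downtime_minutes_per_year(n):7.2f} min/yr down")
print(f"11 nines durability, 1e9 objects -> "
      f"{expected_losses_per_year(11, 1e9):.2f} objects lost/yr")
```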
Other vocabulary you'll see in SLAs
09 Replication — multiple live copies
Keeping the same byte on N machines simultaneously, kept in sync. Simple, fast to read, expensive to store.
The basic shape
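N machines each hold a complete copy of every object. A write must land on all of them; a read can be served by any one of them. How and when the write lands is the distinction that matters: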
Synchronous vs asynchronous — the most important distinction
Synchronous replication
Customer's write isn't acknowledged until every replica has confirmed it. Strongest consistency. Slowest writes. If any replica is slow, all writes are slow.
Asynchronous replication
Primary acknowledges the customer immediately. Replicas catch up later, in the background. Fast. Can lose data if the primary dies before the bytes propagate.
Quorum / semi-synchronous
Middle ground. Acknowledge the customer once a majority of replicas confirm (e.g. 2 of 3, 3 of 5). Tolerates one slow node without sacrificing consistency. This is what modern distributed systems (Kafka, etcd, Cassandra, Spanner) actually do.
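The quorum rule in a few lines — an illustrative sketch, not RustFS internals:

```python
# Quorum write: acknowledge the client once a majority of replicas confirm.
def quorum_ack(replica_acks, n_replicas):
    """replica_acks: True/False per replica for this write."""
    needed = n_replicas // 2 + 1          # majority: 2 of 3, 3 of 5, ...
    return sum(replica_acks) >= needed

print(quorum_ack([True, True, False], 3))   # True  -- one slow node tolerated
print(quorum_ack([True, False, False], 3))  # False -- keep waiting
```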
Topology choices — primary-replica vs multi-master
Primary-replica (master-slave)
Only one node accepts writes. Replicas serve reads. If the primary dies, a failover promotes a replica.
Used by: PostgreSQL, MySQL replication, MongoDB.
Multi-master (active-active)
Any node can accept writes. Conflicts must be resolved (last-write-wins, CRDTs, vector clocks). Hard to get right.
Used by: Cassandra, DynamoDB, CouchDB.
10 Erasure coding — the math that beats replication
Math-based redundancy. Split data into chunks, add parity, recover from any subset. This is what we use.
The notation: k + m
- k = number of data chunks the original is split into
- m = number of parity chunks added
- Total stored = k + m chunks. Storage overhead = m/k.
- Tolerance = you can lose any m chunks and still recover.
How recovery works (the magic)
Reed-Solomon coding (the math RustFS uses) treats your file as numbers in a special algebra called a Galois field. The parity chunks are computed so that any k of the (k+m) chunks are enough to reconstruct the original.
If a drive holding D2 dies (take k=4, m=2 as the worked example — data chunks D1–D4, parity chunks P1–P2):
- Detect failure (RustFS health check)
- Read the remaining 5 chunks (D1, D3, D4, P1, P2) — we only need 4 of them
- Solve the linear system → recover D2
- Write D2 to a fresh drive on a different node
No human intervention. No replica conflicts. Just math.
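For our m=1 scheme, the Reed-Solomon parity degenerates to plain bytewise XOR (the same trick RAID 5 uses), so the whole recovery fits in a few lines — a sketch, not the RustFS implementation:

```python
# k=4, m=1: parity P = D1 ^ D2 ^ D3 ^ D4 (bytewise XOR).
# Any single lost chunk equals the XOR of the four survivors.
def xor_chunks(chunks):
    out = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, b in enumerate(chunk):
            out[i] ^= b
    return bytes(out)

data = [b"ABCD", b"EFGH", b"IJKL", b"MNOP"]   # 4 data chunks, 1 per node
parity = xor_chunks(data)                     # parity chunk on the 5th node

lost = data[1]                                # node holding D2 dies
survivors = [data[0], data[2], data[3], parity]
assert xor_chunks(survivors) == lost          # D2 rebuilt from the other four
print("recovered:", xor_chunks(survivors))
```

m ≥ 2 needs the full Galois-field math, but the shape of recovery is identical: read any k chunks, solve, rewrite.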
Replication vs erasure coding — the trade-off
| Property | Replication (3×) | EC k=4,m=2 | EC k=8,m=2 |
|---|---|---|---|
| Storage overhead | 200% (3× original) | 50% (1.5×) | 25% (1.25×) |
| Tolerates | Loss of 2 nodes | Loss of 2 nodes | Loss of 2 nodes |
| Read performance | Fast (one chunk = one read) | Slower (gather k chunks) | Slowest (gather 8 chunks) |
| Write performance | Fast (no math) | Slower (parity compute) | Slower (parity compute) |
| Repair traffic | 1× drive's worth | k× drive's worth | k× drive's worth |
| Best for | Hot data, small files | Bulk data, large files | Cold archives at scale |
k=4, m=1 — split into 4 data + 1 parity, distributed exactly one chunk per node. 25% storage overhead (1.25× the original size), survives loss of any single node. The five-node count was chosen specifically so this scheme fits cleanly: each node holds exactly one chunk of every object, so recovery from a node failure pulls one chunk from each of the four survivors.

At 1 MW we'll move to k=8, m=2 — same 25% overhead but tolerates loss of two nodes simultaneously, which becomes important once you're operating dozens of storage servers.

11 Snapshots — point-in-time views
A frozen view of a dataset at a moment in time, on the same hardware. Cheap thanks to “copy-on-write.”
How copy-on-write (CoW) works
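The snapshot shares every block with the live dataset at creation time; only when a block is later overwritten does the system write the new version to fresh space, leaving the snapshot's copy untouched. A toy sketch of the bookkeeping:

```python
# Toy copy-on-write: a snapshot is a frozen block MAP, not a data copy.
blocks = {0: b"alpha", 1: b"beta"}      # physical block store
live_map = {0: 0, 1: 1}                 # logical block -> physical block

snapshot_map = dict(live_map)           # snapshot = copy the map only (cheap)

# Overwrite logical block 1: allocate a new physical block, never touch the old.
blocks[2] = b"BETA-v2"
live_map[1] = 2

print(blocks[live_map[1]])       # b'BETA-v2' -- live view
print(blocks[snapshot_map[1]])   # b'beta'    -- snapshot still sees the past
```

Creating the snapshot costs almost nothing; you only pay (in space) for blocks that change afterwards.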
What snapshots are good for
- Ransomware recovery. Take hourly snapshots. If malware encrypts your files, roll back.
- “Oops I deleted that.” Customer self-service restore from a recent snapshot.
- Compliance. Audit-friendly point-in-time views (“what did this data look like on March 31?”).
- Test/dev clones. Spin up a snapshot as a fresh dataset for testing without copying anything.
12 Backups — copies on different hardware
A point-in-time copy on different hardware (often a different site). Survives total loss of the primary.
The 3-2-1 rule
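Three copies of your data, on two different types of media, with one copy offsite. A snapshot on the same pod doesn't count; the copy in a different building does.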
Backup types — full, incremental, differential
Full backup
A complete copy of everything. Largest. Slowest to make. Fastest to restore (single file → done).
Week of full backups: FULL · FULL · FULL · FULL · FULL · FULL · FULL — a complete copy every day.
Storage cost: very high. Restore time: fastest. Used for: small datasets, weekly anchors.
Incremental backup
Only the changes since the last backup of any kind. Smallest daily size. Slowest to restore — you need the full + every increment.
Week of incrementals: FULL · INC · INC · INC · INC · INC · INC — one Sunday anchor, then daily deltas.
Restore Friday's data: replay Sun + Mon + Tue + Wed + Thu + Fri. Six restore steps.
Differential backup
Only the changes since the last full. Sizes grow through the week. Faster restore than incremental — just full + latest differential.
Week of differentials: FULL · DIFF · DIFF · DIFF · DIFF · DIFF · DIFF — each differential captures everything since Sunday's full.
Restore Friday's data: replay Sun + Fri. Two restore steps. Storage middle-ground.
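The restore-chain difference in a sketch — given a week of backups, which sets must be replayed to reach Friday:

```python
# Restore chain length per strategy (FULL on Sunday, one backup per day after).
week = ["Sun", "Mon", "Tue", "Wed", "Thu", "Fri"]

def restore_chain(strategy, target_day):
    i = week.index(target_day)
    if strategy == "full":
        return [target_day]              # latest full is self-sufficient
    if strategy == "incremental":
        return week[: i + 1]             # full + every increment since
    if strategy == "differential":
        return [week[0], target_day]     # full + latest differential

for s in ("full", "incremental", "differential"):
    print(f"{s:12s} -> {' + '.join(restore_chain(s, 'Fri'))}")
```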
RPO and RTO — the two numbers customers ask for
RPO — Recovery Point Objective — the data-loss budget
How much data are you willing to lose? Determined by backup frequency.
- Daily backup → RPO = 24 hours
- Hourly snapshot → RPO = 1 hour
- Sync replication → RPO ≈ 0
RTO — Recovery Time Objective — the downtime budget
How long until service is restored? Determined by recovery process.
- Restore from tape → RTO = hours-days
- Restore from disk backup → RTO = minutes-hours
- Hot DR site failover → RTO = seconds
Restore-proof drills — our differentiator
Backups that are never tested are not backups — they're guesses. Every month we pick a random sample of customer data, restore it from backup, hash-verify against the original, and produce a report with timestamps and ratios. SOC 2 requires this. Most providers don't actually do it.
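The verify step of the drill is just hashing both sides and comparing — a minimal sketch (the sample data and the restore callable stand in for the real pipeline):

```python
import hashlib
import random

def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def restore_drill(originals: dict, restore) -> list:
    """originals: key -> source bytes. restore: restores one key from backup."""
    sample = random.sample(sorted(originals), k=min(3, len(originals)))
    return [
        (key, "PASS" if sha256(originals[key]) == sha256(restore(key)) else "FAIL")
        for key in sample
    ]

# Illustrative run with a stand-in backup that returns the original bytes.
data = {"a.dat": b"x" * 10, "b.dat": b"y" * 20, "c.dat": b"z" * 30}
print(restore_drill(data, restore=lambda key: data[key]))
```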
13 Disaster recovery — surviving site loss
Backups + a plan. The plan matters as much as the data.
Three flavours of standby
| Type | What's running | Failover time | Cost |
|---|---|---|---|
| Cold standby | Backup tapes/files in storage. No live infrastructure. | Hours to days (rebuild + restore) | Lowest |
| Warm standby | Hardware powered on, software installed, data replicated periodically. | Minutes | Medium |
| Hot standby | Fully running mirror, sync replication, can take traffic immediately. | Seconds (automatic failover) | Highest (~2× primary) |
Failover and failback
Failover = switching from primary to DR site when something goes wrong. Failback = switching back to primary once it's repaired. Both should be tested quarterly. Many companies have working failover and broken failback because they never practise it.
14 Power & UPS — the chain that keeps everything running
Servers care about clean, continuous power. The grid does not provide this on its own.
Each piece in detail
UPS — Uninterruptible Power Supply
A wall of batteries that bridges the gap when grid power dies. The runtime is short on purpose — UPS is meant to keep the lights on for the 30–60 seconds it takes the diesel generator to start, plus a buffer. Common types:
- VRLA (lead-acid): cheap, heavy, 3–5 year life. What our 1 MW build uses.
- Lithium-ion: 3× more expensive, half the weight, 10+ year life, less floor space. Increasingly the default.
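Bridge runtime is just stored energy over load — the 1 MW build's numbers from section 19 check out:

```python
# UPS bridge time: usable battery energy divided by the IT load it carries.
usable_kwh = 180     # VRLA bank, usable capacity (section 19)
it_load_kw = 720     # 1 MW build's IT load (section 19)

runtime_min = usable_kwh / it_load_kw * 60
print(f"{runtime_min:.0f} minutes of bridge time")
# 15 min -- the generator needs 30-60 s, so the margin is wide on purpose.
```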
Generator — diesel or gas
Kicks in within seconds of grid loss. Sized to run the entire facility at full load indefinitely (as long as fuel keeps arriving). The fuel logistics are a real operational concern — long grid outages have starved data centres of diesel.
PDU — Power Distribution Unit
The strip that distributes power to individual servers in a rack. Modern PDUs are “intelligent” — they meter per-outlet power draw, which is how you bill customers in colocation.
Dual feeds (A & B)
Servers have two power supplies (PSUs). Each plugs into a separate PDU, fed by a separate UPS, fed by a separate utility feed. Either side can fail without dropping the server. This is what “2N power” means.
Redundancy notation: N, N+1, 2N, 2N+1
| Notation | What it means | Example |
|---|---|---|
| N | Just enough capacity for the load. No redundancy. | 1 UPS for 100 kW load |
| N+1 | One spare unit. Tolerates 1 failure. | 2 UPS units, either can run the full load |
| 2N | Two completely independent paths. Tolerates failure of an entire path. | 2 separate UPS systems, each carrying the full load on its own |
| 2N+1 | Two paths, plus one spare on each. Highest practical redundancy. | Tier IV facilities |
PUE — Power Usage Effectiveness
The efficiency metric the industry uses. PUE = total facility power ÷ IT power. A PUE of 1.0 is theoretical perfection. Modern hyperscale data centres run 1.1–1.2. Older enterprise sites run 1.8–2.5. Lower is better (at PUE 1.2, a 1 MW IT load draws 1.2 MW from the grid — the extra 200 kW is mostly cooling). India's hot climate makes 1.1 hard.
15 Cooling — what nobody warns you about
Servers turn nearly 100% of their electricity into heat. A 1 MW server load produces 1 MW of heat. That heat has to go somewhere.
Hot aisle / cold aisle layout
Servers all face the cold aisle and exhaust into the hot aisle. The hot aisle air goes back to the air conditioner and starts over. Mixing hot and cold air is the most common efficiency killer. Modern designs put physical containment doors between aisles.
Cooling technologies
- CRAC — Computer Room Air Conditioner. Big AC units around the perimeter. Standard for the last 30 years.
- CRAH — Computer Room Air Handler. Uses chilled water from a building plant. More efficient at scale.
- In-row cooling — AC units placed between racks, much shorter air path.
- Liquid cooling / direct-to-chip — coolant flows over the CPU directly. Required for the densest GPU racks (40 kW+). Becoming standard for AI training.
- Immersion cooling — entire servers submerged in non-conductive fluid. Niche but growing.
16 Network topology — leaf-spine and why it replaced everything
Old data-centre networks couldn't keep up with east-west traffic. Modern ones use a different shape.
The old way — three-tier
The modern way — leaf-spine
Why leaf-spine wins
- Predictable latency. Every inter-server path through the fabric is the same length: leaf → spine → leaf. The 3-tier design varies from 2 to 6 hops.
- ECMP — Equal-Cost Multi-Path. Traffic between two leaves spreads across all spines simultaneously (see the sketch after this list). Bandwidth = sum of all spine links, not just one path.
- Easy to scale. Need more bandwidth? Add a spine switch. Need more rack ports? Add a leaf. No re-engineering.
- No spanning tree. Modern fabric protocols (BGP-EVPN, VXLAN) route around failures in milliseconds.
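How ECMP spreads load without reordering packets: hash each flow's 5-tuple onto a spine, so one flow sticks to one path while many flows use all spines in aggregate — a sketch:

```python
# ECMP: hash a flow's 5-tuple to pick one of the equal-cost spine paths.
import hashlib

SPINES = ["spine-1", "spine-2", "spine-3", "spine-4"]

def pick_spine(src_ip, dst_ip, src_port, dst_port, proto="tcp"):
    five_tuple = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = int.from_bytes(hashlib.sha256(five_tuple).digest()[:4], "big")
    return SPINES[digest % len(SPINES)]

# Same flow -> same spine (in-order delivery); a new flow may take another.
print(pick_spine("10.0.1.5", "10.0.2.9", 40001, 9000))
print(pick_spine("10.0.1.5", "10.0.2.9", 40002, 9000))
```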
17 Tier ratings — Uptime Institute classification
A standard the whole industry uses to describe how resilient a facility is. Memorise these — they're in every RFP.
| Tier | Redundancy | Annual downtime | What's needed |
|---|---|---|---|
| Tier I | N (basic) | ~28.8 hours | Single power and cooling path. No redundancy. Office-grade. |
| Tier II | N+1 | ~22 hours | Redundant components (UPS, generators) but single path. Maintenance requires shutdown. |
| Tier III | N+1, concurrently maintainable | ~1.6 hours | Multiple paths but only one active at a time. Can do maintenance without downtime. Industry sweet spot. |
| Tier IV | 2N or 2N+1, fault tolerant | ~0.4 hours | Fully fault tolerant. Tolerates failure of any single piece without service impact. Very expensive. |
What customers actually want: Tier III is good enough for 99% of workloads and ~half the cost of Tier IV. Tier IV is for stock exchanges, defence, hospitals.
Watch the language: “Tier III-class” or “Tier III-equivalent” means a vendor built to the spec but didn't pay the Uptime Institute for certification. Real Tier III certification is independent and audit-driven. Our 1 MW build will go through formal review.
18 What we sell
Same platform stack underneath. Three commercial wrappers on top.
Pilot
- Customer pushes a sample workload to an isolated bucket
- We run Jam, produce compression + restore report
- Discounted, with benchmark-rights agreement
- 1 storage node + control · ~50–200 TB slice
- Outcome: convert to Production
Production
- S3 endpoint · TLS 1.3 · AES-256 at rest
- Per-tenant bucket isolation + IAM
- Customer dashboard (TB stored, TB saved, ratio)
- Monthly billing export
- Erasure-coded redundancy (k=4,m=1 on the 1 PB pod, 25% overhead)
- Snapshots + restore-proof drills
- Named technical contact · SLA per workload tier
- Anchor: M2M tier — ~50% below commercial cloud (e.g. AWS S3)
Enterprise / Gov / Tender
- Sovereign-data clauses (DPDP Act 2023 residency)
- Dedicated network domain (VPN/VPC peering)
- Customer-controlled KMS / BYOK
- Air-gapped option (Jam on customer's own iron)
- Project-financed work-order model (India tender)
- Bespoke compliance reporting
- Hosted AI inference, customer-isolated
19 1 PB → 1 MW scale-up
Nothing in the proof pod gets thrown away. Every architecture choice carries forward — the 1 MW build adds layers around it.
| Dimension | 1 PB proof pod (final spec) | 1 MW reference build |
|---|---|---|
| Physical | Single cabinet in shared colo · 7–9U occupied | ~14 cabinets in own facility |
| Nodes | 5 storage + 1 control (all 1U Supermicro) | ~68 storage nodes + multiple control |
| CPU | EPYC 7313P · 16C · Milan · single-socket | EPYC Genoa or successor, mix of single & dual socket |
| Storage | All-NVMe · 8 × 30.72 TB Samsung PM9A3 per node | Hot NVMe ~5% · Warm SSD ~20% · Bulk HDD ~75% (tiered) |
| Raw capacity | 1.23 PB raw · ~983 TB usable | ~50–100 PB raw at full build-out |
| Network | 1 × Mellanox SN2700 100 GbE ToR (refurbished) · DAC cables | Leaf-spine: many 100 GbE leaves + 400 GbE spines |
| East-west bandwidth | ~1 Tbps | 10s of Tbps |
| Redundancy | EC k=4,m=1 (25% overhead, tolerates 1 node loss) | EC k=8,m=2 (25% overhead, tolerates 2 node loss) |
| Power | 4–5 kW typical / ~6 kW peak from colo PDU | ~720 kW IT (or ~70 kW backup-tier) from captive solar @ ₹2.35/kWh |
| UPS / battery | Provided by colo | ~180 kWh VRLA usable (15-min runtime) |
| Generator | Provided by colo | Diesel backup, our own fuel logistics |
| Compliance | Tier III-class facility via colo | Tier III formal review (independent MEP audit) |
| Capex | ~$60–80K (refurbished networking saves significantly) | $8–10M rack-level |
| Opex | ~$5–8K/month (lease + power + bandwidth) | Power (offset by solar) + maintenance + bandwidth |
Carries over unchanged
- ✓ Jam codec
- ✓ RustFS S3 layer
- ✓ Customer dashboards
- ✓ Telemetry stack
- ✓ Per-tenant isolation
- ✓ Compliance posture
- ✓ Operational playbook
- ✓ Mellanox / EPYC choices
New at 1 MW
- Tiered storage management (hot / warm / bulk)
- Leaf-spine network design
- UPS / battery engineering
- Captive solar PPA
- Generator + fuel logistics
- Formal Tier III review
Where to go deeper
- Compliance posture — DPDP, CERT-In, SOC 2, FIPS-140, telemetry-only
- Economics — ₹/TB at the 1 PB pod, breakeven on the 1 MW build
- Competitive picture — why hyperscalers can't replicate this in India
- Tender mechanics — what artefacts procurement actually needs at each stage
- Failure modes — what happens when a drive dies mid-write, when a switch dies, when the building loses power
Pick the one that's least clear and we can walk through it in equivalent depth.