Complete Technical Walkthrough · Final Spec · May 2026

Data centres, end to end

From a single drive to a sovereign-cloud build. Every concept you need to read a vendor pitch, evaluate a colo contract, or sit on a procurement call without faking it. Written for someone who has never set up infrastructure before.

Final pod configuration locked: 5 storage + 1 control node, all-NVMe Supermicro 1U, EPYC 7313P, 8 × 30.72 TB Samsung PM9A3, refurbished SN2700 100 GbE, EC k=4,m=1. 1.23 PB raw / ~983 TB usable. Supersedes the earlier three-variant analysis.

Part 1
Foundations — what's physically in the rack

01 The simplest mental model

A storage cloud is just three things stitched together. The hardware is dumb. The software is the product.

[Diagram] CUSTOMER (apps · dashboards · S3 SDK) ⇄ network (cables) ⇄ SERVERS (software decides what to do with the bytes) ⇄ DRIVES (spinning platters · flash chips)
When we say “1 PB pod” we mean physical drive capacity summed across every server in the rack — before any compression or redundancy math. 1 PB = 1 petabyte = 1,000 TB = 1,000,000 GB. A typical iPhone holds ~128 GB; a petabyte is roughly 7,800 of those.

02 Inside one server (a “node”)

Every box in the rack has the same internal anatomy. Trace bytes from the network port to the spinning disk.

[Diagram — STORAGE NODE] CPU: EPYC 7313P (16 cores · Milan · 155 W; runs OS, RustFS, Jam) · RAM: 256 GB ECC (RustFS write buffer; ECC = error-correcting) · PCIe Gen 4 — the internal data highway · NIC: Mellanox 100 GbE → to switch · HBA (storage controller): JBOD passthrough, no hardware RAID · DRIVES: 8 × 30.72 TB Samsung PM9A3 (U.2 · PCIe Gen 4 · TLC · 1 DWPD) — ~246 TB raw per node

What each component does, in plain terms

CPU — the brain

Runs the OS, RustFS, Jam. EPYC is AMD's data-centre CPU line; Xeon is Intel's equivalent. EPYC chips bring lots of cores (parallel workers) and lots of PCIe lanes (~128 on a single socket) — both are what you need for storage.

We picked the EPYC 7313P: 16 cores, 3.0 GHz, 155 W, Milan generation, single-socket. Why this chip:

  • 16 cores is enough. Jam doesn't need many — compression is fast and mostly memory-bandwidth-bound. Paying for 32+ cores would be waste.
  • Milan over Genoa. Genoa is the newer, faster generation but PCIe Gen 5 NVMe drives are still expensive and our PM9A3 drives are Gen 4. Milan + Gen 4 runs cooler (~20% less power) and is significantly cheaper.
  • Single socket. Two-socket boards add cost, complexity, and a second NUMA node that software has to dance around. Not needed at this scale.
  • “P” suffix = single-socket SKU only. Locked at one socket but cheaper than the dual-socket version.

A core is a complete CPU on its own. Threads are virtual cores — usually 2× physical with hyperthreading. NUMA = Non-Uniform Memory Access, the latency penalty when a core reaches across to RAM attached to another socket.

RAM — short-term memory

Where data sits temporarily before it lands on disk. Critical for our workload because RustFS uses it as a write buffer — incoming bytes pile up here while Jam compresses them, then flush to drives in larger chunks. Larger buffer = smoother throughput under bursty load.

ECC = Error-Correcting Code. Cosmic rays do flip RAM bits in production (real, measurable, ~1 flip per GB per year). Without ECC, those flips silently corrupt data. ECC catches and fixes them. Mandatory for production storage.

PCIe bus — the internal highway

The high-speed pipe inside the server connecting CPU to everything (NICs, NVMe drives, HBAs, GPUs). Speed is measured in lanes:

  • PCIe Gen 4: ~2 GB/s per lane. A 16-lane (x16) GPU slot = 32 GB/s.
  • PCIe Gen 5: ~4 GB/s per lane (double Gen 4).
  • Each NIC, NVMe, HBA needs lanes. Servers compete for them — that's why EPYC's 128 lanes matter.
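To make the lane arithmetic concrete, here is a back-of-envelope lane budget for one storage node, using the ~2 GB/s-per-Gen-4-lane figure above. The x16 NIC slot is an assumption for illustration; the x4-per-drive figure appears later in the drive section.

```python
# Rough PCIe Gen 4 lane budget for one storage node.
# Assumes the NIC sits in a x16 slot; each NVMe drive uses x4 (per the spec).
GEN4_GBS_PER_LANE = 2.0  # ~2 GB/s per Gen 4 lane

devices = {
    "dual-port 100 GbE NIC (x16)": 16,
    "8 x NVMe drives (x4 each)": 8 * 4,
}

total_lanes = sum(devices.values())
for name, lanes in devices.items():
    print(f"{name}: {lanes} lanes = {lanes * GEN4_GBS_PER_LANE:.0f} GB/s")
print(f"total: {total_lanes} lanes of ~128 on a single EPYC socket")
```

Even with the NIC and all eight drives wired up, under half the socket's lanes are used, leaving headroom for boot devices and future cards.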

NIC — Network Interface Card

The card that connects the server to the outside world. Speed of NIC = speed at which the server can talk to anything outside it.

  • 1 GbE = 1 gigabit per second = ~125 MB/s. Home internet speeds. Useless for storage.
  • 10 GbE = 1.25 GB/s. Old data centre standard. A single NVMe drive can saturate it.
  • 25 GbE = 3.1 GB/s. Common but borderline for our workload.
  • 100 GbE = 12.5 GB/s. What we want. Mellanox is the brand that dominates this tier.
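The GbE-to-GB/s conversions above are just a divide-by-8 (8 bits per byte); a one-liner for sanity-checking vendor numbers:

```python
def gbe_to_gb_per_s(gbe: float) -> float:
    """Link speed in gigabits/s -> gigabytes/s (8 bits per byte)."""
    return gbe / 8

for speed in (1, 10, 25, 100):
    print(f"{speed:>3} GbE = {gbe_to_gb_per_s(speed):.3f} GB/s")
```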

HBA — Host Bus Adapter

The card that connects CPU to physical drives. Two modes:

  • JBOD / passthrough: each drive shows up as an independent device. The OS (and RustFS) sees them all directly. This is what we want.
  • Hardware RAID: the HBA does redundancy math itself, presents one logical “drive” to the OS. Old-school. Bad for our stack because RustFS does redundancy in software and hardware RAID would hide failures from us.

Drives — the actual storage

Three families exist in the market. We use NVMe-only across all five storage nodes.

HDD archival

Hard Disk Drive. Spinning platters, mechanical head. ~250 MB/s sequential, 150–250 IOPS random. ~$15/TB at 20 TB. Not in our pod. May reappear at the 1 MW build for a true bulk archive tier.

SATA SSD middle

Flash drives on the legacy SATA bus. ~600 MB/s ceiling. Cheaper than NVMe per TB but slower. Bottlenecks badly on writes. Not in our pod.

NVMe our pick

Flash chips talking PCIe directly. ~6–7 GB/s sequential, 100,000+ IOPS. The PM9A3 30.72 TB drive: ~$0.06–0.10/GB at scale.

Why the PM9A3 specifically

  • 30.72 TB per drive. 8 drives × 30.72 TB = 245.76 TB raw per node. Five nodes = 1.23 PB raw in 5U of rack space. Insane density.
  • U.2 form factor. Hot-swappable from the front of the chassis. Replace a failed drive without taking the server down.
  • TLC NAND. Triple-Level Cell — three bits per memory cell. Density vs endurance trade-off. Modern TLC is reliable enough for production. (QLC = 4 bits, denser but lower endurance, riskier.)
  • 1 DWPD endurance. “Drive Writes Per Day” — you can rewrite the entire drive once per day for the warranty period (5 years) before flash wears out. For our workload (write-once, read-many), this is plenty. Higher-write workloads need 3 DWPD or “Mixed Use” drives.
  • PCIe Gen 4 ×4. Each drive gets 4 lanes of Gen 4 = ~8 GB/s ceiling. Real-world ~6.5 GB/s. We don't go to Gen 5 because the cost premium isn't justified for this generation of drives.
The new bottleneck math. 8 NVMe × ~6.5 GB/s = ~52 GB/s of raw drive throughput per node. That's wildly faster than any reasonable network. Two 100 GbE ports = 200 Gbps = 25 GB/s — so the network is now the bottleneck, not the disks. Exactly what we want: Jam and RustFS will be CPU-bound long before the drives are saturated, and that means compression ratio (not drive count) is what determines economics.
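The same arithmetic as a sketch, using the per-node figures quoted above:

```python
# Per-node throughput: drives vs network (figures from the text).
drive_gbs = 8 * 6.5      # 8 NVMe x ~6.5 GB/s real-world sequential
network_gbs = 200 / 8    # 2 x 100 GbE ports = 200 Gbps -> GB/s

bottleneck = min(drive_gbs, network_gbs)
print(f"drives: {drive_gbs:.0f} GB/s, network: {network_gbs:.0f} GB/s "
      f"-> chain caps at {bottleneck:.0f} GB/s (the network)")
```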

03 Multiple servers — the cluster

A cloud needs multiple servers, a network fabric connecting them, and software that coordinates them as one logical pool.

[Diagram] INTERNET / CUSTOMER (TLS 1.3 over public internet or VPN) → COLO BORDER ROUTER (100 GbE uplinks · provided by colo operator) → ToR SWITCH: Mellanox SN2700, 32-port 100 GbE, refurbished (Top-of-Rack · 1U · 14 × 100 GbE DAC cables to nodes) → NODES 1–5 (storage · 1U · EPYC 7313P · 256 GB · 8 × 30.72 TB NVMe · 2 × 100 GbE · 246 TB each) + CONTROL NODE (same chassis · L4 GPU · RustFS coordinator · dashboards · billing · AI inference · security-ops VM)

A note on jargon you'll hear:

  • North-south traffic — bytes moving in and out of the cluster (customer ↔ servers). Goes through the border router.
  • East-west traffic — bytes moving between nodes inside the cluster (replication, EC chunk distribution, rebalancing). Goes through the ToR switch. Far higher volume than north-south, which is why the ToR fabric matters so much. Our pod sustains ~1 Tbps east-west.
  • Colo — short for colocation. A facility that rents you rack space, power, and network uplinks. You bring your own servers; they bring the building.
  • Rack / cabinet — a metal frame that holds servers stacked vertically. Standard width; height is measured in U (1U = 1.75 inches tall), typically 42–48U per cabinet. A 2U server takes up 2 slots. Our entire pod fits in 7–9U of a 42U cabinet (6 × 1U servers + 1U ToR + 1U OOB).
  • DAC vs AOC cables — DAC (Direct Attach Copper) is rigid, cheap (~$50–80/cable), good for runs under 3 m. AOC (Active Optical Cable) is flexible, expensive (~$200–400), good for longer runs. Inside one cabinet, DAC wins.
  • OOB — Out-of-Band management. A second, slow network (1 GbE) used to manage servers when the primary network is down. Every server has a dedicated management port (Dell calls it iDRAC, Supermicro calls it IPMI/BMC). Lets you reboot or reinstall a node remotely.

04 The software stack — bottom to top

Hardware is the bottom. The product is the top. Each layer rests on the one below it.

LAYER 6
What we sell
Pilot · Production (per-TB/mo) · Enterprise / Gov / Tender
LAYER 5
Platform services
Dashboards · monthly reports · restore-proof · audit log · billing · hosted AI inference
LAYER 4
S3 API — RustFS
Speaks S3, translates PUT/GET, buffers writes in RAM, handles bucket isolation + IAM + encryption
LAYER 3
Codec — JAM our IP
Compresses on write (3–8× typical, up to 100×) · decompresses on read · hash-verifies on restore
LAYER 2
Storage presentation — JBOD
Each drive shows up as an independent device. RustFS does redundancy in software.
LAYER 1
OS
AlmaLinux (FIPS) for gov / Ubuntu for commercial. Kernel, drivers, network stack.
LAYER 0
Hardware
CPU + RAM + NIC + HBA + Drives — the physical box from section 2.

What each layer does, in plain terms

  • Layer 0 — Hardware. The physical box. Boring, expensive, breaks occasionally.
  • Layer 1 — OS. The operating system everything else runs on. AlmaLinux is a free Red Hat clone; with the FIPS-140 module it's a US-government-approved cryptography baseline. Ubuntu is the mainstream commercial Linux.
  • Layer 2 — JBOD. A configuration choice on the HBA, not separate software. Tells the OS to treat each drive as its own device.
  • Layer 3 — Jam. Our codec. Sits between RustFS and the disks. Every byte going to disk gets compressed first; every byte read gets decompressed first. Customer never sees Jam — they see their original data.
  • Layer 4 — RustFS. The S3-compatible object server. Customers talk to this layer using standard S3 commands (the same ones they use for AWS S3). RustFS is what makes the pod look like AWS S3 from outside. Buffers writes in RAM for snappy demos.
  • Layer 5 — Platform. The Strata-specific services that wrap raw storage: dashboards, billing, restore-proof, AI inference. This is where we differentiate from “cheap S3.”
  • Layer 6 — Pricing tiers. The three commercial wrappers we sell.
Key insight: The customer never sees Jam directly. They see an S3 endpoint. Compression is invisible — they just notice they're paying for less storage than they expected.

05 Where Jam sits — the data path

Trace what happens when a customer uploads a 100 GB file.

1
Customer issues S3 PUT my-file.dat (100 GB)
TLS 1.3 encrypted in transit
2
NIC receives at up to 100 Gbps
Mellanox 100 GbE
3
RustFS accepts the request, authenticates, picks bucket + drives
Layer 4 — the S3 server
4
RAM buffer holds incoming bytes
256 GB available — keeps demos snappy under load
5
JAM compresses our IP
100 GB → ~26 GB at 3.86× · or ~12.7 GB at 7.85× with ZSTD layered
6
AES-256 encryption at rest
Per-tenant key
7
JBOD layer writes to the physical drive
NVMe in our pod — bytes hit the media
On read, the same flow runs in reverse: bytes off the drive → decrypt → Jam decompresses → RAM buffer → stream to customer.

The bottleneck wins. If the network is 25 GbE, the chain caps at 25 Gbps. If the disk is HDD, the chain caps at HDD speed. If RAM is too small, RustFS thrashes. Compression ratio is set by the codec, not the disk — HDD vs NVMe doesn't change 3.86×.
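The rule generalizes: model the path as stages and take the minimum. The codec throughput below is an assumed illustrative figure, not a measured Jam number; the other figures are from the text.

```python
# The slowest stage caps the whole data path.
stages_gbs = {
    "network (25 GbE)": 3.125,
    "codec (assumed)": 5.0,    # hypothetical CPU-bound compression rate
    "drive (one HDD)": 0.25,   # ~250 MB/s sequential
}

slowest = min(stages_gbs, key=stages_gbs.get)
print(f"chain caps at {stages_gbs[slowest]} GB/s, set by: {slowest}")
```

Swap the HDD for NVMe (~6.5 GB/s) and the cap moves to the network, exactly as in the pod's bottleneck math.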

06 The final BOM — what we're actually building

After the three-variant analysis (HDD bulk vs NVMe demo paths), we landed on a simpler, denser, cost-optimised configuration: five identical all-NVMe storage nodes plus one control / GPU node. Same chassis everywhere.

Why this shape — five identical all-NVMe nodes

  • Five nodes maps perfectly to EC k=4,m=1. One chunk per node, 25% storage overhead, survives loss of any one node. Three nodes forced k=2,m=1 (50% overhead) — far less efficient.
  • All-NVMe collapses the tiering question. No HDD/NVMe split, no “demo path vs bulk path,” no decision tree. Every node is a hot node.
  • 1U Supermicro chassis is denser and cheaper. Six servers + ToR + OOB = 7–9U total. Leaves 30+ U in the cabinet for future expansion.
  • Refurbished networking. SN2700 ToR and ConnectX-5 NICs both bought used with vendor warranty — significant capex saving over new SN3700C.
  • DAC cables. All node-to-ToR runs are under 3 m inside the cabinet, so passive copper DAC cables work and are 4–5× cheaper than active optical.

Storage nodes (× 5)

Chassis — Supermicro AS-1115HS-TNR: 1U, 8-bay U.2 NVMe front access
CPU — AMD EPYC 7313P: 16C / 3.0 GHz / 155 W · Milan-gen · single socket · “P” SKU
RAM — 256 GB DDR4-3200 ECC (4 × 64 GB RDIMMs); upgrade to 512 GB once paying users justify it
Drives — 8 × Samsung PM9A3 30.72 TB U.2 NVMe: PCIe Gen 4 · TLC NAND · 1 DWPD endurance
NIC — Mellanox ConnectX-5 dual-port 100 GbE, refurbished with vendor warranty
Power — dual hot-plug PSU (A + B feed)
Per node — ~245.76 TB raw · 200 Gbps network · ~700 W typical draw

Control / GPU node (× 1)

Chassis + CPU + NIC — identical to storage nodes (parts commonality, simpler ops)
RAM — 256 GB initially → 512 GB when the GPU workload justifies it
GPU — NVIDIA L4: 24 GB · 72 W · ~$2–3K · handles inference workloads; upgrade to L40S only when training revenue lands
Drives — 2 × 1.92 TB NVMe: boot + local cache (no bulk storage on this node)
Hosts — RustFS coordinator · customer dashboards · billing · audit logs · AI inference · security-ops VM (Lucas's Kali pattern, runs here until a regulated tenant pays for a dedicated box)

Networking

ToR switch — Mellanox SN2700: 32-port 100 GbE · refurbished (~$5–8K used vs ~$15–20K new SN3700C)
OOB switch — any 1 GbE managed switch (Netgear / TP-Link enterprise, ~$200) for IPMI / BMC access
Cabling — 14 × 100 GbE DAC (passive copper, <3 m) + 7 × Cat6A for OOB

Rack & power

Cabinet — 1 × 42U, colo-provided, included in lease
PDUs — 2 × vertical 32A · A + B feed · colo-provided
Space used — 7–9U occupied (6 × 1U servers + 1U ToR + 1U OOB)

Pod totals

1.23 PB — raw capacity (5 × 245.76 TB)
~983 TB — usable after EC k=4,m=1
~1 Tbps — east-west fabric (5 × 200 Gbps)
4–5 kW — typical draw · 6 kW peak
Why Milan over Genoa, Gen 4 over Gen 5: Milan EPYC + Gen 4 NVMe runs noticeably cooler than Genoa + Gen 5 (~20% lower thermal draw), which is meaningful in Indian colo summers. Gen 5 NVMe drives at this density are also still expensive enough that the cost premium isn't worth it for our compression-bound workload — we hit the network ceiling long before drive bandwidth matters.
Part 2
Vocabulary — the words you need to read any vendor pitch

07 Storage performance metrics

Three numbers describe every storage system. They mean different things and you can't substitute one for another.

Throughput (MB/s, GB/s)
How many bytes the system moves per second. Like the width of a hose.
A 12-disk HDD node: ~3 GB/s sequential. A 100 GbE link: ~12.5 GB/s. Matters for: backups, big-file uploads, video.
IOPS (operations/sec)
How many distinct read/write requests the system handles per second, regardless of size. Like the number of separate trips the hose makes.
HDD: ~150–250 random IOPS. NVMe: 100,000+ IOPS. Matters for: databases, web apps, anything with lots of small reads/writes.
Latency (ms, μs)
How long a single operation takes. Like the time from turning on the tap to water arriving.
HDD seek: ~5–10 ms. NVMe read: ~50–100 μs (100× faster). Matters for: user-facing latency, real-time systems.

Why these are different and why it matters

Imagine moving 1 GB of data:

  • One 1 GB file at 250 MB/s sequential takes 4 seconds (HDDs are fine).
  • One million 1 KB files at 200 IOPS takes 5,000 seconds = 83 minutes (HDDs are useless; you need NVMe).

Same total data. Different access pattern. Pick the drive based on the workload, not just the capacity.
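The two scenarios, worked through with the numbers from the text:

```python
# Moving 1 GB two ways.
seq_seconds = 1_000 / 250        # one 1 GB file at 250 MB/s sequential
rand_seconds = 1_000_000 / 200   # a million 1 KB files at 200 IOPS

print(f"sequential: {seq_seconds:.0f} s")
print(f"random:     {rand_seconds:.0f} s (~{rand_seconds / 60:.0f} minutes)")
```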

Sequential vs random

Sequential

Reading or writing bytes in order, one after the other. HDDs are great at this — the head doesn't have to move.

Examples: streaming video, full-disk backups, large file uploads.

Random

Reading or writing bytes in unpredictable locations. HDDs are terrible because the mechanical head has to seek every time. NVMe doesn't care — no moving parts.

Examples: database queries, virtual machine boot disks, web app traffic.

Read vs write

SSDs and NVMe drives are usually faster at reads than writes. Write-heavy workloads need different drives (or more drives) than read-heavy ones. Vendors often quote the more flattering of the two — read the spec sheet carefully.

08 Durability vs availability — and the “nines”

Two different promises that sound similar. Customers conflate them. Don't.

Durability — “will I lose data?”

Probability that a stored byte is still there next year. Measured in “nines.”

  • 9 nines (99.9999999%): AWS S3 standard. ~1 file lost per billion per year.
  • 11 nines (99.999999999%): AWS S3 marketing claim. ~1 per 100 billion.

Driven by: how many copies you keep, how independent the failure modes are, how fast you detect and repair.

Availability — “can I read my data right now?”

Percentage of time the system answers requests. Measured in “nines” of uptime.

  • 3 nines (99.9%): 8.76 hours downtime/year
  • 4 nines (99.99%): 52 minutes/year
  • 5 nines (99.999%): 5.26 minutes/year — telecoms standard

Driven by: redundant power, redundant network, software failover, geographic distribution.
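The downtime budgets fall straight out of the percentage; a quick sketch:

```python
# Downtime allowed per year at each availability target.
HOURS_PER_YEAR = 24 * 365  # 8760

for nines, pct in [(3, 99.9), (4, 99.99), (5, 99.999)]:
    hours = HOURS_PER_YEAR * (1 - pct / 100)
    print(f"{nines} nines ({pct}%): {hours:.2f} h/year ({hours * 60:.1f} min)")
```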

Common trap: a system can have 11 nines of durability (your data is safe) but 3 nines of availability (you can't reach it for 8 hours/year). Tape libraries are extremely durable and very unavailable. Cache-only stores are very available but not durable. You need both, and they're priced separately.

Other vocabulary you'll see in SLAs

MTBF Mean Time Between Failures
Average time a piece of hardware runs before breaking. HDDs: ~1.5M hours (~170 years on paper, but the real number is much lower).
MTTR Mean Time To Repair
Average time to detect a failure and finish recovery. For us: rebuild a failed drive's worth of EC chunks. Lower MTTR → higher durability.
AFR Annualised Failure Rate
% of drives that fail per year. ~1–3% in practice. Backblaze publishes the best public data.
SLA Service Level Agreement
The contractual promise. Defines availability, durability, response times, and what credits the customer gets when you miss them.
Part 3
Keeping data safe — the five concepts the industry blurs

09 Replication — multiple live copies

Keeping the same byte on N machines simultaneously, kept in sync. Simple, fast to read, expensive to store.

The basic shape

[Diagram] PRIMARY (accepts writes · “source of truth”) —replicate→ REPLICA 1 and REPLICA 2 (copies of primary · serve reads). Storage cost: 3× original. Survives loss of any 2 of 3 nodes.

Synchronous vs asynchronous — the most important distinction

Synchronous replication

Customer's write isn't acknowledged until every replica has confirmed it. Strongest consistency. Slowest writes. If any replica is slow, all writes are slow.

[Diagram] Customer → Primary → Replica 1 + Replica 2: 1. write · 2. forward · 3. both ack ✓ · 4. ack to customer. Pro: zero data loss. Con: latency = slowest replica. Use: financial txns, critical state.

Asynchronous replication

Primary acknowledges the customer immediately. Replicas catch up later, in the background. Fast. Can lose data if the primary dies before the bytes propagate.

[Diagram] Customer → Primary → Replica 1 + Replica 2: 1. write · 2. immediate ack · 3. replicas catch up later (async). Pro: fast writes. Con: small data-loss window. Use: cross-region backups, logs.

Quorum / semi-synchronous

Middle ground. Acknowledge the customer once a majority of replicas confirm (e.g. 2 of 3, 3 of 5). Tolerates one slow node without sacrificing consistency. This is what modern distributed systems (Kafka, etcd, Cassandra, Spanner) actually do.

[Diagram] Customer → Primary → Replica 1 ✓ · Replica 2 ✓ · Replica 3 (slow): 2 of 3 ack → done. Tolerates 1 slow node. Used by: Raft, Paxos, Spanner.

Topology choices — primary-replica vs multi-master

Primary-replica (master-slave)

Only one node accepts writes. Replicas serve reads. If the primary dies, a failover promotes a replica.

Used by: PostgreSQL, MySQL replication, MongoDB.

Multi-master (active-active)

Any node can accept writes. Conflicts must be resolved (last-write-wins, CRDTs, vector clocks). Hard to get right.

Used by: Cassandra, DynamoDB, CouchDB.

What we use in the pod: RustFS does erasure coding (next section), not replication. Replication is more common in databases than in object stores. We don't pay for 3× storage when we can pay for 1.25× and get the same durability.

10 Erasure coding — the math that beats replication

Math-based redundancy. Split data into chunks, add parity, recover from any subset. This is what we use.

The notation: k + m

  • k = number of data chunks the original is split into
  • m = number of parity chunks added
  • Total stored = k + m chunks. Storage overhead = m/k.
  • Tolerance = you can lose any m chunks and still recover.
[Diagram] Original 100 GB file → split + compute parity (Reed-Solomon) → D1–D4 (4 data chunks · 25 GB each) + P1–P2 (2 parity chunks · 25 GB each, redundant). k = 4 data chunks · m = 2 parity chunks · total stored = 150 GB (50% overhead). Survives loss of any 2 chunks. Distribute 1 chunk per node → tolerate 2 node failures.

How recovery works (the magic)

Reed-Solomon coding (the math RustFS uses) treats your file as numbers in a special algebra called a Galois field. The parity chunks are computed so that any k of the (k+m) chunks are enough to reconstruct the original.

If a drive holding D2 dies:

  1. Detect failure (RustFS health check)
  2. Read the remaining 5 chunks (D1, D3, D4, P1, P2) — we only need 4 of them
  3. Solve the linear system → recover D2
  4. Write D2 to a fresh drive on a different node

No human intervention. No replica conflicts. Just math.
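General Reed-Solomon needs Galois-field math, but the m=1 scheme our pod runs today reduces to plain XOR parity (RAID-5 style): the parity chunk is the XOR of the data chunks, and any one lost chunk is the XOR of the survivors. A minimal sketch with toy byte strings, not RustFS internals:

```python
# k=4, m=1: parity = D1 ^ D2 ^ D3 ^ D4. Losing any one chunk
# (data or parity), XOR the four survivors to rebuild it.
from functools import reduce

def xor_chunks(chunks):
    """Byte-wise XOR of equal-length chunks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*chunks))

data = [b"\x10\x20", b"\x01\x02", b"\xa0\x0b", b"\x04\x40"]  # 4 data chunks
parity = xor_chunks(data)                                     # 1 parity chunk

# Node holding D2 dies: rebuild D2 from the three data survivors + parity.
recovered = xor_chunks([data[0], data[2], data[3], parity])
assert recovered == data[1]
print("recovered D2:", recovered.hex())
```

Tolerating two simultaneous losses (m=2, the plan at 1 MW) is where full Reed-Solomon comes in: XOR alone can only solve for one unknown.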

Replication vs erasure coding — the trade-off

Property            Replication (3×)             EC k=4,m=2                 EC k=8,m=2
Storage overhead    200% (3× original)           50% (1.5×)                 25% (1.25×)
Tolerates           Loss of 2 nodes              Loss of 2 nodes            Loss of 2 nodes
Read performance    Fast (one chunk = one read)  Slower (gather k chunks)   Slowest (gather 8 chunks)
Write performance   Fast (no math)               Slower (parity compute)    Slower (parity compute)
Repair traffic      1× drive's worth             k× drive's worth           k× drive's worth
Best for            Hot data, small files        Bulk data, large files     Cold archives at scale
What we run today: the 5-node pod uses k=4, m=1 — split into 4 data + 1 parity, distributed exactly one chunk per node. 25% storage overhead (1.25× the original size), survives loss of any single node. The five-node count was chosen specifically so this scheme fits cleanly: each node holds exactly one chunk of every object, so recovery from a node failure pulls one chunk from each of the four survivors.

At 1 MW we'll move to k=8, m=2 — same 25% overhead but tolerates loss of two nodes simultaneously, which becomes important once you're operating dozens of storage servers.

11 Snapshots — point-in-time views

A frozen view of a dataset at a moment in time, on the same hardware. Cheap thanks to “copy-on-write.”

How copy-on-write (CoW) works

[Diagram] Time T0 — take snapshot S0: blocks A, B, C on disk; S0 references all 3. Time T1 — customer modifies block B: new block B' is written; S0 still references the old B, live state references B'. Only B' is new on disk. Rollback: point live state back at the S0 references → instant restore.

What snapshots are good for

  • Ransomware recovery. Take hourly snapshots. If malware encrypts your files, roll back.
  • “Oops I deleted that.” Customer self-service restore from a recent snapshot.
  • Compliance. Audit-friendly point-in-time views (“what did this data look like on March 31?”).
  • Test/dev clones. Spin up a snapshot as a fresh dataset for testing without copying anything.
What snapshots are NOT: they're not backups. They live on the same hardware as the original. If the building burns down, both go with it. Snapshots protect against logical corruption (deletion, ransomware), not physical loss.
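Under copy-on-write, a snapshot is just a frozen block map. A toy dict-based sketch, not a real filesystem:

```python
# Copy-on-write snapshot in miniature: maps of block-id -> on-disk block.
disk = {1: b"aaa", 2: b"bbb", 3: b"ccc"}   # physical blocks
live = {"A": 1, "B": 2, "C": 3}            # live dataset's block map

snapshot = dict(live)   # taking snapshot S0 copies the MAP, not the data

# Customer modifies block B: write a NEW physical block, repoint live only.
disk[4] = b"BBB"
live["B"] = 4

assert disk[snapshot["B"]] == b"bbb"   # S0 still sees the old B
assert disk[live["B"]] == b"BBB"       # live state sees B'
live = dict(snapshot)                  # rollback = repoint at S0's map
assert disk[live["B"]] == b"bbb"
```

The only extra disk spent is the one modified block, which is why hourly snapshots are affordable.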

12 Backups — copies on different hardware

A point-in-time copy on different hardware (often a different site). Survives total loss of the primary.

The 3-2-1 rule

[Diagram] 3 COPIES — production data (primary), local backup (same site), offsite backup (DR site). 2 MEDIA TYPES — disk (primary) plus tape / object storage / cloud; avoids single-vendor failure modes. 1 OFFSITE — a different building, city, or even country; survives fires, floods, power-grid outages, ransomware.

Backup types — full, incremental, differential

Full backup

A complete copy of everything. Largest. Slowest to make. Fastest to restore (single file → done).

Sun FULL · Mon FULL · Tue FULL · Wed FULL · Thu FULL · Fri FULL · Sat FULL

Storage cost: very high. Restore time: fastest. Used for: small datasets, weekly anchors.

Incremental backup

Only the changes since the last backup of any kind. Smallest daily size. Slowest to restore — you need the full + every increment.

Sun FULL · Mon INC · Tue INC · Wed INC · Thu INC · Fri INC · Sat INC

Restore Friday's data: replay Sun + Mon + Tue + Wed + Thu + Fri. Six restore steps.

Differential backup

Only the changes since the last full. Sizes grow through the week. Faster restore than incremental — just full + latest differential.

Sun FULL · Mon DIFF · Tue DIFF · Wed DIFF · Thu DIFF · Fri DIFF · Sat DIFF

Restore Friday's data: replay Sun + Fri. Two restore steps. Storage middle-ground.
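The restore-chain lengths for the three schemes, sketched with the week layout from the charts above (full on Sunday):

```python
# How many backup sets a restore must replay under each scheme.
WEEK = ["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"]

def restore_chain(day: str, scheme: str) -> list[str]:
    i = WEEK.index(day)
    if scheme == "full":
        return [day]                                    # latest full alone
    if scheme == "incremental":
        return WEEK[: i + 1]                            # full + every increment
    if scheme == "differential":
        return [WEEK[0]] if i == 0 else [WEEK[0], day]  # full + latest diff
    raise ValueError(scheme)

print(restore_chain("Fri", "incremental"))   # 6 steps
print(restore_chain("Fri", "differential"))  # 2 steps
```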

RPO and RTO — the two numbers customers ask for

[Diagram] Timeline: Last backup → ⚡ Disaster → Service restored. RPO (Recovery Point Objective) spans last backup → disaster: “how much data can I afford to lose?” RTO (Recovery Time Objective) spans disaster → service restored: “how long can I be down?”

RPO — Recovery Point Objective data loss budget

How much data are you willing to lose? Determined by backup frequency.

  • Daily backup → RPO = 24 hours
  • Hourly snapshot → RPO = 1 hour
  • Sync replication → RPO ≈ 0

RTO — Recovery Time Objective downtime budget

How long until service is restored? Determined by recovery process.

  • Restore from tape → RTO = hours-days
  • Restore from disk backup → RTO = minutes-hours
  • Hot DR site failover → RTO = seconds
Pricing tip: the tighter the RPO and RTO, the higher the cost. RPO=0 / RTO=0 (an always-on hot replica) runs 5–10× the storage cost of a daily backup with a 2-hour RTO. Customers should pick targets that match their actual business pain, not what sounds impressive.

Restore-proof drills our differentiator

Backups that are never tested are not backups — they're guesses. Every month we pick a random sample of customer data, restore it from backup, hash-verify against the original, and produce a report with timestamps and ratios. SOC 2 requires this. Most providers don't actually do it.

13 Disaster recovery — surviving site loss

Backups + a plan. The plan matters as much as the data.

Three flavours of standby

Cold standby — backup tapes/files in storage, no live infrastructure · failover: hours to days (rebuild + restore) · cost: lowest
Warm standby — hardware powered on, software installed, data replicated periodically · failover: minutes · cost: medium
Hot standby — fully running mirror, sync replication, can take traffic immediately · failover: seconds (automatic) · cost: highest (~2× primary)

Failover and failback

Failover = switching from primary to DR site when something goes wrong. Failback = switching back to primary once it's repaired. Both should be tested quarterly. Many companies have working failover and broken failback because they never practise it.

The grim truth about DR: most organisations discover their DR plan is broken during the actual disaster. The two ways to avoid this are (1) regular DR drills with full traffic cutover, and (2) automated failover that has run successfully under load. Anything else is theatre.
Part 4
Physical plant — power, cooling, and the building

14 Power & UPS — the chain that keeps everything running

Servers care about clean, continuous power. The grid does not provide this on its own.

[Diagram] GRID (utility power · unreliable) → UPS (battery backup · 5–30 min runtime) → PDU (power distribution → rack outlets) → SERVERS (dual PSU · A & B feeds). ⛽ Generator kicks in if the grid is down >30 s.

Each piece in detail

UPS — Uninterruptible Power Supply

A wall of batteries that bridges the gap when grid power dies. The runtime is short on purpose — UPS is meant to keep the lights on for the 30–60 seconds it takes the diesel generator to start, plus a buffer. Common types:

  • VRLA (lead-acid): cheap, heavy, 3–5 year life. What our 1 MW build uses.
  • Lithium-ion: 3× more expensive, half the weight, 10+ year life, less floor space. Increasingly the default.

Generator — diesel or gas

Kicks in within seconds of grid loss. Sized to run the entire facility at full load indefinitely (as long as fuel keeps arriving). The fuel logistics are a real operational concern — long grid outages have starved data centres of diesel.

PDU — Power Distribution Unit

The strip that distributes power to individual servers in a rack. Modern PDUs are “intelligent” — they meter per-outlet power draw, which is how you bill customers in colocation.

Dual feeds (A & B)

Servers have two power supplies (PSUs). Each plugs into a separate PDU, fed by a separate UPS, fed by a separate utility feed. Either side can fail without dropping the server. This is what “2N power” means.

Redundancy notation: N, N+1, 2N, 2N+1

N — just enough capacity for the load, no redundancy. Example: 1 UPS for a 100 kW load.
N+1 — one spare unit, tolerates 1 failure. Example: 2 UPS units, either can run the full load.
2N — two completely independent paths, tolerates failure of an entire path. Example: 2 separate UPS systems, each carrying the full load on its own.
2N+1 — two paths, plus one spare on each; the highest practical redundancy. Example: Tier IV facilities.

PUE — Power Usage Effectiveness

The efficiency metric the industry uses. PUE = total facility power ÷ IT power. A PUE of 1.0 is theoretical perfection. Modern hyperscale data centres run 1.1–1.2. Older enterprise sites run 1.8–2.5. Lower is better. India's hot climate makes 1.1 hard.
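The formula, applied to two hypothetical facilities (the kW figures are illustrative, not measurements):

```python
# PUE = total facility power / IT (server) power.
def pue(total_kw: float, it_kw: float) -> float:
    return total_kw / it_kw

print(f"hyperscale: {pue(1150, 1000):.2f}")   # modern, efficient cooling
print(f"enterprise: {pue(2000, 1000):.2f}")   # 1 W of cooling per W of IT
```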

15 Cooling — what nobody warns you about

Servers turn nearly 100% of their electricity into heat. A 1 MW server load produces 1 MW of heat. That heat has to go somewhere.

Hot aisle / cold aisle layout

[Diagram] COLD AISLE (18–22°C, raised floor delivers cool air) faces the rack fronts; servers pull cool air in the front and push hot air out the back into the HOT AISLE (35–40°C, exhaust extracted to CRAC).

Servers all face the cold aisle and exhaust into the hot aisle. The hot aisle air goes back to the air conditioner and starts over. Mixing hot and cold air is the most common efficiency killer. Modern designs put physical containment doors between aisles.

Cooling technologies

  • CRAC — Computer Room Air Conditioner. Big AC units around the perimeter. Standard for the last 30 years.
  • CRAH — Computer Room Air Handler. Uses chilled water from a building plant. More efficient at scale.
  • In-row cooling — AC units placed between racks, much shorter air path.
  • Liquid cooling / direct-to-chip — coolant flows over the CPU directly. Required for the densest GPU racks (40 kW+). Becoming standard for AI training.
  • Immersion cooling — entire servers submerged in non-conductive fluid. Niche but growing.
Why this matters for India: ambient temperatures of 35–40°C in summer make air cooling inefficient — you spend more energy on AC than on the servers. Captive solar + water-side economisers (using cooler night air) help. Liquid cooling becomes attractive faster than in colder climates.

16 Network topology — leaf-spine and why it replaced everything

Old data-centre networks couldn't keep up with east-west traffic. Modern ones use a different shape.

The old way — three-tier

[Diagram] Three-tier: CORE at the top → AGG 1 / AGG 2 → six ACCESS switches. East-west traffic between racks: ACCESS → AGG → CORE → AGG → ACCESS. Many hops. Bottlenecked.

The modern way — leaf-spine

[Diagram: four SPINE switches above five LEAF switches; every leaf connects to every spine. Leaf-to-leaf traffic is always exactly 2 hops (leaf → spine → leaf). Add a spine to add bandwidth; add a leaf to add ports. Linear scaling.]

Why leaf-spine wins

  • Predictable latency. Every server-to-server path is exactly 2 hops. In the 3-tier design, the hop count varies from 2 to 6 depending on which racks are talking.
  • ECMP — Equal-Cost Multi-Path. Traffic between two leaves spreads across all spines simultaneously. Bandwidth = sum of all spine links, not just one path.
  • Easy to scale. Need more bandwidth? Add a spine switch. Need more rack ports? Add a leaf. No re-engineering.
  • No spanning tree. Modern fabric protocols (BGP-EVPN, VXLAN) route around failures in milliseconds.
Our build: the 1 PB pod has just one ToR switch (single-rack — leaf-spine isn't needed yet). The 1 MW build needs leaf-spine: many 100 GbE leaf switches, multiple 400 GbE spines, BGP-EVPN underneath.
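ECMP's flow-spreading can be sketched in a few lines: the switch hashes each flow's 5-tuple and uses the result to pick an uplink, so packets of one flow stay in order on one spine while many flows fan out across all of them. This is an illustrative model only — real switch ASICs use vendor-specific hardware hashes, not SHA-256:

```python
import hashlib

def ecmp_pick(src_ip: str, dst_ip: str, src_port: int,
              dst_port: int, proto: str, n_spines: int) -> int:
    """Hash the flow 5-tuple and pick one of n_spines uplinks.
    Same flow -> same spine (no packet reordering); different
    flows spread across all spines."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    return int(hashlib.sha256(key).hexdigest(), 16) % n_spines

# Two flows between the same pair of servers may land on different
# spines -- which is how leaf-to-leaf bandwidth becomes the *sum*
# of the spine links rather than any single path:
a = ecmp_pick("10.0.1.5", "10.0.2.9", 51000, 443, "tcp", 4)
b = ecmp_pick("10.0.1.5", "10.0.2.9", 51001, 443, "tcp", 4)
print(a, b)  # each a stable value in 0..3
```

The per-flow (rather than per-packet) decision is the key design choice: it trades perfect load balance for zero reordering inside a TCP connection.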

17 Tier ratings — Uptime Institute classification

A standard the whole industry uses to describe how resilient a facility is. Memorise these — they're in every RFP.

  • Tier I — redundancy: N (basic) · ~28.8 hours downtime/year · single power and cooling path, no redundancy. Office-grade.
  • Tier II — redundancy: N+1 · ~22 hours/year · redundant components (UPS, generators) but a single path; maintenance requires shutdown.
  • Tier III — redundancy: N+1, concurrently maintainable · ~1.6 hours/year · multiple paths but only one active at a time; maintenance possible without downtime. Industry sweet spot.
  • Tier IV — redundancy: 2N or 2N+1, fault tolerant · ~0.4 hours/year · fully fault tolerant; tolerates failure of any single piece without service impact. Very expensive.

What customers actually want: Tier III is good enough for 99% of workloads and ~half the cost of Tier IV. Tier IV is for stock exchanges, defence, hospitals.
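The downtime figures above map directly to the availability percentages commonly quoted alongside the tiers (99.671%, 99.741%, 99.982%, 99.995% — these are widely cited approximations rather than part of the formal Uptime Institute definitions):

```python
HOURS_PER_YEAR = 8760  # 365 * 24

def downtime_hours(availability_pct: float) -> float:
    """Annual downtime implied by an availability percentage."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)

# Commonly quoted per-tier availabilities (approximate, not normative):
for tier, avail in [("I", 99.671), ("II", 99.741),
                    ("III", 99.982), ("IV", 99.995)]:
    print(f"Tier {tier}: {downtime_hours(avail):.1f} h/yr")
# Tier I: 28.8 h/yr ... Tier IV: 0.4 h/yr
```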

Watch the language: “Tier III-class” or “Tier III-equivalent” means a vendor built to the spec but didn't pay the Uptime Institute for certification. Real Tier III certification is independent and audit-driven. Our 1 MW build will go through formal review.

Part 5
The product — what we sell and how it scales

18 What we sell

Same platform stack underneath. Three commercial wrappers on top.

Pilot

“Two weeks to first benchmark report”
  • Customer pushes a sample workload to an isolated bucket
  • We run Jam, produce compression + restore report
  • Discounted, with benchmark-rights agreement
  • 1 storage node + control · ~50–200 TB slice
  • Outcome: convert to Production

Production

“Per-TB / month managed”
  • S3 endpoint · TLS 1.3 · AES-256 at rest
  • Per-tenant bucket isolation + IAM
  • Customer dashboard (TB stored, TB saved, ratio)
  • Monthly billing export
  • Erasure-coded redundancy (k=4,m=1 on the 1 PB pod, 25% overhead)
  • Snapshots + restore-proof drills
  • Named technical contact · SLA per workload tier
  • Anchor: M2M tier — ~50% below commercial cloud (e.g. AWS S3)
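The erasure-coded redundancy bullet above reduces to one ratio: with k data shards and m parity shards per stripe, k/(k+m) of raw capacity is usable and m/k is the overhead. A quick check against the pod's own numbers:

```python
def ec_usable_tb(raw_tb: float, k: int, m: int) -> float:
    """Erasure coding stores k data shards + m parity shards per
    stripe, so k/(k+m) of the raw capacity is usable."""
    return raw_tb * k / (k + m)

raw = 5 * 8 * 30.72               # 5 storage nodes x 8 drives x 30.72 TB
print(round(raw, 1))              # 1228.8 TB raw -- the 1.23 PB figure
print(round(ec_usable_tb(raw, 4, 1), 2))  # 983.04 TB usable (~983 TB)
```

The 25% overhead quoted in the text is m/k = 1/4, the price of surviving the loss of any one shard per stripe.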

Enterprise / Gov / Tender

“Bespoke”
  • Sovereign-data clauses (DPDP Act 2023 residency)
  • Dedicated network domain (VPN/VPC peering)
  • Customer-controlled KMS / BYOK
  • Air-gapped option (Jam on customer's own iron)
  • Project-financed work-order model (India tender)
  • Bespoke compliance reporting
  • Hosted AI inference, customer-isolated

19 1 PB → 1 MW scale-up

Nothing in the proof pod gets thrown away. Every architecture choice carries forward — the 1 MW build adds layers around it.

Dimension by dimension — 1 PB proof pod (final spec) vs 1 MW reference build:

  • Physical — pod: single cabinet in shared colo, 7–9U occupied · 1 MW: ~14 cabinets in own facility
  • Nodes — pod: 5 storage + 1 control (all 1U Supermicro) · 1 MW: ~68 storage nodes + multiple control
  • CPU — pod: EPYC 7313P, 16C, Milan, single-socket · 1 MW: EPYC Genoa or successor, mix of single and dual socket
  • Storage — pod: all-NVMe, 8 × 30.72 TB Samsung PM9A3 per node · 1 MW: hot NVMe ~5%, warm SSD ~20%, bulk HDD ~75% (tiered)
  • Raw capacity — pod: 1.23 PB raw, ~983 TB usable · 1 MW: ~50–100 PB raw at full build-out
  • Network — pod: 1 × Mellanox SN2700 100 GbE ToR (refurbished), DAC cables · 1 MW: leaf-spine, many 100 GbE leaves + 400 GbE spines
  • East-west bandwidth — pod: ~1 Tbps · 1 MW: tens of Tbps
  • Redundancy — pod: EC k=4,m=1 (25% overhead, tolerates loss of 1 node) · 1 MW: EC k=8,m=2 (25% overhead, tolerates loss of any 2 nodes)
  • Power — pod: 4–5 kW typical, ~6 kW peak from colo PDU · 1 MW: ~720 kW IT (or ~70 kW backup-tier) from captive solar @ ₹2.35/kWh
  • UPS / battery — pod: provided by colo · 1 MW: ~180 kWh VRLA usable (15-min runtime)
  • Generator — pod: provided by colo · 1 MW: diesel backup, our own fuel logistics
  • Compliance — pod: Tier III-class facility via colo · 1 MW: Tier III formal review (independent MEP audit)
  • Capex — pod: ~$60–80K (refurbished networking saves significantly) · 1 MW: $8–10M rack-level
  • Opex — pod: ~$5–8K/month (lease + power + bandwidth) · 1 MW: power (offset by solar) + maintenance + bandwidth
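As a sanity check on the pod column, capex per usable terabyte is one division away (the $70K midpoint is an assumption inside the ~$60–80K range quoted above, not a committed figure):

```python
def capex_per_usable_tb(capex_usd: float, usable_tb: float) -> float:
    """Hardware cost spread across erasure-coded usable capacity."""
    return capex_usd / usable_tb

# Pod: ~$60-80K capex over ~983 TB usable; take a $70K midpoint:
print(round(capex_per_usable_tb(70_000, 983), 1))  # 71.2 $/usable TB
```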

Carries over unchanged

  • Jam codec
  • RustFS S3 layer
  • Customer dashboards
  • Telemetry stack
  • Per-tenant isolation
  • Compliance posture
  • Operational playbook
  • Mellanox / EPYC choices

New at 1 MW

  • Tiered storage management (hot / warm / bulk)
  • Leaf-spine network design
  • UPS / battery engineering
  • Captive solar PPA
  • Generator + fuel logistics
  • Formal Tier III review
The takeaway: the spec choices at the 1 PB level are the seed crystal that defines the 1 MW build. Get them right now and everything compounds.

Where to go deeper

  • Compliance posture — DPDP, CERT-In, SOC 2, FIPS-140, telemetry-only
  • Economics — ₹/TB at the 1 PB pod, breakeven on the 1 MW build
  • Competitive picture — why hyperscalers can't replicate this in India
  • Tender mechanics — what artefacts procurement actually needs at each stage
  • Failure modes — what happens when a drive dies mid-write, when a switch dies, when the building loses power

Pick the one that's least clear and we can walk through it in equivalent depth.