Data centres, end to end
From a single drive to a sovereign-cloud build. Every concept you need to read a vendor pitch, evaluate a colo contract, or sit on a procurement call without faking it. Written for someone who has never set up infrastructure before.
Final pod configuration locked: 5 storage + 1 control node, all-NVMe Supermicro 1U, EPYC 7313P, 8 × 30.72 TB Samsung PM9A3, refurbished SN2700 100 GbE, EC k=4,m=1. 1.23 PB raw / ~983 TB usable. Supersedes the earlier three-variant analysis.
01 The simplest mental model
A storage cloud is just three things stitched together. The hardware is dumb. The software is the product.
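The three: servers, a network fabric connecting them, and software that presents them to the customer as one logical pool.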
02 Inside one server (a “node”)
Every box in the rack has the same internal anatomy. Trace bytes from the network port to the spinning disk.
What each component does, in plain terms
CPU — the brain
Runs the OS, RustFS, Jam. EPYC is AMD's data-centre CPU line; Xeon is Intel's equivalent. EPYC chips bring lots of cores (parallel workers) and lots of PCIe lanes (~128 on a single socket) — both are what you need for storage.
We picked the EPYC 7313P: 16 cores, 3.0 GHz, 155 W, Milan generation, single-socket. Why this chip:
- 16 cores is enough. Jam doesn't need many — compression is fast and mostly memory-bandwidth-bound. Paying for 32+ cores would be wasted money.
- Milan over Genoa. Genoa is the newer, faster generation but PCIe Gen 5 NVMe drives are still expensive and our PM9A3 drives are Gen 4. Milan + Gen 4 runs cooler (~20% less power) and is significantly cheaper.
- Single socket. Two-socket boards add cost, complexity, and a second NUMA node that software has to dance around. Not needed at this scale.
- “P” suffix = single-socket SKU only. Locked at one socket but cheaper than the dual-socket version.
A core is a complete CPU on its own. Threads are virtual cores — usually 2× physical with hyperthreading. NUMA = Non-Uniform Memory Access, the latency penalty when a core reaches across to RAM attached to another socket.
RAM — short-term memory
Where data sits temporarily before it lands on disk. Critical for our workload because RustFS uses it as a write buffer — incoming bytes pile up here while Jam compresses them, then flush to drives in larger chunks. Larger buffer = smoother throughput under bursty load.
ECC = Error-Correcting Code. Cosmic rays do flip RAM bits in production (real, measurable, ~1 flip per GB per year). Without ECC, those flips silently corrupt data. ECC catches and fixes them. Mandatory for production storage.
PCIe bus — the internal highway
The high-speed pipe inside the server connecting CPU to everything (NICs, NVMe drives, HBAs, GPUs). Speed is measured in lanes:
- PCIe Gen 4: ~2 GB/s per lane. A 16-lane (x16) GPU slot = 32 GB/s.
- PCIe Gen 5: ~4 GB/s per lane (double Gen 4).
- Each NIC, NVMe, HBA needs lanes. Servers compete for them — that's why EPYC's 128 lanes matter.
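To make the lane arithmetic concrete, here's a minimal sketch tallying one storage node's lane budget against the 7313P's 128 lanes. The slot widths are our assumed configuration (each U.2 drive ×4, the dual-port NIC ×16; the boot M.2 is an assumption, it isn't in the BOM):

```python
# PCIe lane budget for one storage node. Slot widths are assumptions,
# not confirmed BOM details.
LANES_AVAILABLE = 128  # EPYC 7313P, single socket

devices = {
    "8 x PM9A3 U.2 NVMe (x4 each)": 8 * 4,      # 32 lanes
    "ConnectX-5 dual-port 100 GbE (x16)": 16,
    "boot M.2 (x4, assumed)": 4,
}

used = sum(devices.values())
for name, lanes in devices.items():
    print(f"{name:38s} {lanes:3d} lanes")
print(f"{'total':38s} {used:3d} / {LANES_AVAILABLE}")
print(f"headroom: {LANES_AVAILABLE - used} lanes left over")
```

Even fully loaded, the node uses well under half the socket's lanes — that headroom is why single-socket EPYC works for storage.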
NIC — Network Interface Card
The card that connects the server to the outside world. Speed of NIC = speed at which the server can talk to anything outside it.
- 1 GbE = 1 gigabit per second = ~125 MB/s. Home internet speeds. Useless for storage.
- 10 GbE = 1.25 GB/s. Old data centre standard. A single NVMe drive can saturate it on its own.
- 25 GbE = 3.1 GB/s. Common but borderline for our workload.
- 100 GbE = 12.5 GB/s. What we want. Mellanox is the brand that dominates this tier.
HBA — Host Bus Adapter
The card that connects CPU to physical drives. Two modes:
- JBOD / passthrough: each drive shows up as an independent device. The OS (and RustFS) sees them all directly. This is what we want.
- Hardware RAID: the HBA does redundancy math itself, presents one logical “drive” to the OS. Old-school. Bad for our stack because RustFS does redundancy in software and hardware RAID would hide failures from us.
Drives — the actual storage
Three families exist in the market. We use NVMe-only across all five storage nodes.
HDD — archival
Hard Disk Drive. Spinning platters, mechanical head. ~250 MB/s sequential, 150–250 IOPS random. ~$15/TB at 20 TB. Not in our pod. May reappear at the 1 MW build for a true bulk archive tier.
SATA SSD — middle tier
Flash drives on the legacy SATA bus. ~600 MB/s ceiling. Cheaper than NVMe per TB but slower. Bottlenecks badly on writes. Not in our pod.
NVMe — our pick
Flash chips talking PCIe directly. ~6–7 GB/s sequential, 100,000+ IOPS. The PM9A3 30.72 TB drive: ~$0.06–0.10/GB at scale.
Why the PM9A3 specifically
- 30.72 TB per drive. 8 drives × 30.72 TB = 245.76 TB raw per node. Five nodes = 1.23 PB raw in 5U of rack space. Insane density.
- U.2 form factor. Hot-swappable from the front of the chassis. Replace a failed drive without taking the server down.
- TLC NAND. Triple-Level Cell — three bits per memory cell. Density vs endurance trade-off. Modern TLC is reliable enough for production. (QLC = 4 bits, denser but lower endurance, riskier.)
- 1 DWPD endurance. “Drive Writes Per Day” — you can rewrite the entire drive once per day for the warranty period (5 years) before flash wears out. For our workload (write-once, read-many), this is plenty; the worked numbers after this list show why. Higher-write workloads need 3 DWPD or “Mixed Use” drives.
- PCIe Gen 4 ×4. Each drive gets 4 lanes of Gen 4 = ~8 GB/s ceiling. Real-world ~6.5 GB/s. We don't go to Gen 5 because the cost premium isn't justified for this generation of drives.
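A quick worked check on what the 1 DWPD figure above buys at this capacity — a sketch, using the 5-year warranty term from the bullet:

```python
# Endurance budget: 1 DWPD on a 30.72 TB drive over a 5-year warranty.
capacity_tb = 30.72
dwpd = 1.0
warranty_years = 5

total_writes_pb = capacity_tb * dwpd * 365 * warranty_years / 1000
print(f"rated writes over warranty: {total_writes_pb:.1f} PB per drive")
# ~56 PB of writes per drive -- a write-once, read-many workload
# never gets anywhere near this.
```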
03 Multiple servers — the cluster
A cloud needs multiple servers, a network fabric connecting them, and software that coordinates them as one logical pool.
A note on jargon you'll hear:
- North-south traffic — bytes moving in and out of the cluster (customer ↔ servers). Goes through the border router.
- East-west traffic — bytes moving between nodes inside the cluster (replication, EC chunk distribution, rebalancing). Goes through the ToR switch. Far higher volume than north-south, which is why the ToR fabric matters so much. Our pod sustains ~1 Tbps east-west.
- Colo — short for colocation. A facility that rents you rack space, power, and network uplinks. You bring your own servers; they bring the building.
- Rack / cabinet — a metal frame that holds servers stacked vertically, typically 42–48U tall. Standard width; height is measured in U (1U = 1.75 inches). A 2U server takes up 2 slots. Our entire pod fits in 7–9U of a 42U cabinet (6 × 1U servers + 1U ToR + 1U OOB).
- DAC vs AOC cables — DAC (Direct Attach Copper) is rigid, cheap (~$50–80/cable), good for runs under 3 m. AOC (Active Optical Cable) is flexible, expensive (~$200–400), good for longer runs. Inside one cabinet, DAC wins.
- OOB — Out-of-Band management. A second, slow network (1 GbE) used to manage servers when the primary network is down. Every server has a dedicated management port (Dell calls it iDRAC, Supermicro calls it IPMI/BMC). Lets you reboot or reinstall a node remotely.
04 The software stack — bottom to top
Hardware is the bottom. The product is the top. Each layer rests on the one below it.
What each layer does, in plain terms
- Layer 0 — Hardware. The physical box. Boring, expensive, breaks occasionally.
- Layer 1 — OS. The operating system everything else runs on. AlmaLinux is a free Red Hat clone; with the FIPS-140 module it's a US-government-approved cryptography baseline. Ubuntu is the mainstream commercial Linux.
- Layer 2 — JBOD. A configuration choice on the HBA, not separate software. Tells the OS to treat each drive as its own device.
- Layer 3 — Jam. Our codec. Sits between RustFS and the disks. Every byte going to disk gets compressed first; every byte read gets decompressed first. Customer never sees Jam — they see their original data.
- Layer 4 — RustFS. The S3-compatible object server. Customers talk to this layer using standard S3 commands — the same ones they use for AWS S3 (see the example after this list). RustFS is what makes the pod look like AWS S3 from outside. Buffers writes in RAM for snappy demos.
- Layer 5 — Platform. The Strata-specific services that wrap raw storage: dashboards, billing, restore-proof, AI inference. This is where we differentiate from “cheap S3.”
- Layer 6 — Pricing tiers. The three commercial wrappers we sell.
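Because Layer 4 speaks real S3, the stock AWS SDK works against the pod unchanged — a minimal sketch (the endpoint URL, credentials, and bucket name are placeholders, not real values):

```python
import boto3

# Point the standard AWS SDK at the RustFS endpoint instead of AWS.
# Endpoint and credentials are illustrative placeholders.
s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.example-pod.internal",
    aws_access_key_id="TENANT_KEY",
    aws_secret_access_key="TENANT_SECRET",
)

s3.put_object(Bucket="tenant-bucket", Key="my-file.dat", Body=b"hello")
obj = s3.get_object(Bucket="tenant-bucket", Key="my-file.dat")
print(obj["Body"].read())  # b'hello' -- same bytes back, Jam is invisible
```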
05 Where Jam sits — the data path
Trace what happens when a customer uploads a 100 GB file.
S3 PUT my-file.dat (100 GB) → NIC → RustFS RAM buffer → Jam compress → EC chunks → NVMe.

The bottleneck wins. If the network is 25 GbE, the chain caps at 25 Gbps. If the disk is HDD, the chain caps at HDD speed. If RAM is too small, RustFS thrashes. Compression ratio is set by the codec, not the disk — HDD vs NVMe doesn't change 3.86×.
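"The bottleneck wins" is just a min() over the chain. A sketch with illustrative stage speeds — only the 100 GbE NIC and ~6.5 GB/s per drive come from the spec; the RAM and Jam figures are placeholders:

```python
# Time to ingest 100 GB: the slowest stage sets the pace for the chain.
FILE_GB = 100
RATIO = 3.86  # Jam compression ratio -- set by the codec, not the drive

stages_gbps = {                           # gigabits per second
    "NIC (100 GbE)": 100,
    "RAM buffer (placeholder)": 400,
    "Jam compress (placeholder)": 160,
    "NVMe array (8 x ~6.5 GB/s)": 8 * 52,
}

bottleneck = min(stages_gbps, key=stages_gbps.get)
seconds = FILE_GB * 8 / stages_gbps[bottleneck]
print(f"bottleneck: {bottleneck} -> {seconds:.0f} s for {FILE_GB} GB")
print(f"bytes landing on disk: {FILE_GB / RATIO:.1f} GB after compression")
```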
06 The final BOM — what we're actually building
After the three-variant analysis (HDD bulk vs NVMe demo paths), we landed on a simpler, denser, cost-optimised configuration: five identical all-NVMe storage nodes plus one control / GPU node. Same chassis everywhere.
Why this shape — five identical all-NVMe nodes
- Five nodes map perfectly onto EC k=4,m=1. One chunk per node, 25% storage overhead, survives loss of any one node. Three nodes would have forced k=2,m=1 (50% overhead) — far less efficient. (Capacity math after this list.)
- All-NVMe collapses the tiering question. No HDD/NVMe split, no “demo path vs bulk path,” no decision tree. Every node is a hot node.
- 1U Supermicro chassis is denser and cheaper. Six servers + ToR + OOB = 7–9U total. Leaves 30+ U in the cabinet for future expansion.
- Refurbished networking. SN2700 ToR and ConnectX-5 NICs both bought used with vendor warranty — significant capex saving over new SN3700C.
- DAC cables. All node-to-ToR runs are under 3 m inside the cabinet, so passive copper DAC cables work and are 4–5× cheaper than active optical.
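The headline capacity numbers fall straight out of the EC parameters — a quick check:

```python
# Pod capacity: raw vs usable under EC k=4, m=1.
drives_per_node, nodes, drive_tb = 8, 5, 30.72
k, m = 4, 1

raw_tb = drives_per_node * nodes * drive_tb   # 1228.8 TB = ~1.23 PB
usable_tb = raw_tb * k / (k + m)              # parity eats m/(k+m) of raw
print(f"raw:    {raw_tb:.1f} TB")
print(f"usable: {usable_tb:.1f} TB (overhead {m/k:.0%})")
# -> 983.0 TB usable, matching the spec sheet.
```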
Storage nodes (× 5)
| Chassis | Supermicro AS-1115HS-TNR — 1U, 8-bay U.2 NVMe front access |
| CPU | AMD EPYC 7313P — 16C / 3.0 GHz / 155 W · Milan-gen · single socket · “P” SKU |
| RAM | 256 GB DDR4-3200 ECC (4 × 64 GB RDIMMs) — upgrade to 512 GB once paying users justify it |
| Drives | 8 × Samsung PM9A3 30.72 TB U.2 NVMe — PCIe Gen 4 · TLC NAND · 1 DWPD endurance |
| NIC | Mellanox ConnectX-5 dual-port 100 GbE — refurbished with vendor warranty |
| Power | Dual hot-plug PSU (A + B feed) |
| Per-node raw | ~245.76 TB · 200 Gbps network · ~700 W typical draw |
Control / GPU node (× 1)
| Chassis + CPU + NIC | Identical to storage nodes (parts commonality, simpler ops) |
| RAM | 256 GB initially → 512 GB when GPU workload justifies |
| GPU | NVIDIA L4 — 24 GB · 72 W · ~$2–3K · handles inference workloads · upgrade to L40S only when training revenue lands |
| Drives | 2 × 1.92 TB NVMe — boot + local cache (no bulk storage on this node) |
| Hosts | RustFS coordinator · customer dashboards · billing · audit logs · AI inference · security-ops VM (Lucas's Kali pattern, runs here until a regulated tenant pays for a dedicated box) |
Networking
| ToR switch | Mellanox SN2700 — 32-port 100 GbE · refurbished (~$5–8K used vs ~$15–20K new SN3700C) |
| OOB switch | Any 1 GbE managed switch (Netgear / TP-Link enterprise, ~$200) for IPMI / BMC access |
| Cabling | 14 × 100 GbE DAC (passive copper, <3 m) + 7 × Cat6A for OOB |
Rack & power
| Cabinet | 1 × 42U — colo-provided, included in lease |
| PDUs | 2 × vertical 32A · A + B feed · colo-provided |
| Space used | 7–9U occupied (6 × 1U servers + 1U ToR + 1U OOB) |
Pod totals
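- 1.23 PB raw · ~983 TB usable (EC k=4,m=1)
- 7–9U occupied (6 × 1U servers + 1U ToR + 1U OOB)
- 4–5 kW typical / ~6 kW peak draw
- ~$60–80K capex · ~$5–8K/month opex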
07 Storage performance metrics
Three numbers describe every storage system. They mean different things and you can't substitute one for another.
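The three: throughput (MB/s or GB/s streamed), IOPS (separate operations completed per second), and latency (how long a single operation takes).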
Why these are different and why it matters
Imagine moving 1 GB of data:
- One 1 GB file at 250 MB/s sequential takes 4 seconds (HDDs are fine).
- One million 1 KB files at 200 IOPS takes 5,000 seconds = 83 minutes (HDDs are useless; you need NVMe).
Same total data. Different access pattern. Pick the drive based on the workload, not just the capacity.
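The arithmetic behind those two bullets, as a sketch you can re-run for any drive profile:

```python
# Same 1 GB of data, two access patterns, two drive profiles.
def seconds_sequential(total_mb, mb_per_s):
    return total_mb / mb_per_s

def seconds_random(n_ops, iops):
    return n_ops / iops

print(f"HDD,  one 1 GB file:   {seconds_sequential(1024, 250):8.0f} s")
print(f"HDD,  1M x 1 KB files: {seconds_random(1_000_000, 200):8.0f} s")
print(f"NVMe, 1M x 1 KB files: {seconds_random(1_000_000, 100_000):8.0f} s")
```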
Sequential vs random
Sequential
Reading or writing bytes in order, one after the other. HDDs are great at this — the head doesn't have to move.
Examples: streaming video, full-disk backups, large file uploads.
Random
Reading or writing bytes in unpredictable locations. HDDs are terrible because the mechanical head has to seek every time. NVMe doesn't care — no moving parts.
Examples: database queries, virtual machine boot disks, web app traffic.
Read vs write
SSDs and NVMe drives are usually faster at reads than writes. Write-heavy workloads need different drives (or more drives) than read-heavy ones. Vendors often quote the more flattering of the two — read the spec sheet carefully.
08 Durability vs availability — and the “nines”
Two different promises that sound similar. Customers conflate them. Don't.
Durability — “will I lose data?”
Probability that a stored byte is still there next year. Measured in “nines.”
- 9 nines (99.9999999%): ~1 object lost per billion per year.
- 11 nines (99.999999999%): AWS S3's stated durability design target. ~1 per 100 billion.
Driven by: how many copies you keep, how independent the failure modes are, how fast you detect and repair.
Availability — “can I read my data right now?”
Percentage of time the system answers requests. Measured in “nines” of uptime.
- 3 nines (99.9%): 8.76 hours downtime/year
- 4 nines (99.99%): 52 minutes/year
- 5 nines (99.999%): 5.26 minutes/year — telecoms standard
Driven by: redundant power, redundant network, software failover, geographic distribution.
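Both kinds of nines reduce to the same arithmetic — a sketch converting a nines count into downtime per year (availability) or expected annual losses (durability):

```python
# "Nines" arithmetic for availability and durability.
def downtime_minutes_per_year(nines):
    return 10 ** -nines * 365.25 * 24 * 60

def expected_losses_per_year(nines, objects):
    return objects * 10 ** -nines

for n in (3, 4, 5):
    print(f"{n} nines availability -> {downtime_minutes_per_year(n):7.2f} min/yr down")
print(f"11 nines durability, 1e9 objects -> "
      f"{expected_losses_per_year(11, 1e9):.2f} objects lost/yr")
```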
Other vocabulary you'll see in SLAs
09 Replication — multiple live copies
Keeping the same byte on N machines simultaneously, kept in sync. Simple, fast to read, expensive to store.
The basic shape
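N machines each hold a complete copy of every object. A write must land on all of them; a read can be served by any one of them. How and when the write lands is the distinction that matters: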
Synchronous vs asynchronous — the most important distinction
Synchronous replication
Customer's write isn't acknowledged until every replica has confirmed it. Strongest consistency. Slowest writes. If any replica is slow, all writes are slow.
Asynchronous replication
Primary acknowledges the customer immediately. Replicas catch up later, in the background. Fast. Can lose data if the primary dies before the bytes propagate.
Quorum / semi-synchronous
Middle ground. Acknowledge the customer once a majority of replicas confirm (e.g. 2 of 3, 3 of 5). Tolerates one slow node without sacrificing consistency. This is what modern distributed systems (Kafka, etcd, Cassandra, Spanner) actually do.
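The quorum rule in a few lines — an illustrative sketch, not RustFS internals:

```python
# Quorum write: acknowledge the client once a majority of replicas confirm.
def quorum_ack(replica_acks, n_replicas):
    """replica_acks: True/False per replica for this write."""
    needed = n_replicas // 2 + 1          # majority: 2 of 3, 3 of 5, ...
    return sum(replica_acks) >= needed

print(quorum_ack([True, True, False], 3))   # True  -- one slow node tolerated
print(quorum_ack([True, False, False], 3))  # False -- keep waiting
```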
Topology choices — primary-replica vs multi-master
Primary-replica (master-slave)
Only one node accepts writes. Replicas serve reads. If the primary dies, a failover promotes a replica.
Used by: PostgreSQL, MySQL replication, MongoDB.
Multi-master (active-active)
Any node can accept writes. Conflicts must be resolved (last-write-wins, CRDTs, vector clocks). Hard to get right.
Used by: Cassandra, DynamoDB, CouchDB.
10 Erasure coding — the math that beats replication
Math-based redundancy. Split data into chunks, add parity, recover from any subset. This is what we use.
The notation: k + m
- k = number of data chunks the original is split into
- m = number of parity chunks added
- Total stored = k + m chunks. Storage overhead = m/k.
- Tolerance = you can lose any m chunks and still recover.
How recovery works (the magic)
Reed-Solomon coding (the math RustFS uses) treats your file as numbers in a special algebra called a Galois field. The parity chunks are computed so that any k of the (k+m) chunks are enough to reconstruct the original.
If a drive holding D2 dies (take k=4, m=2 as the worked example — data chunks D1–D4, parity chunks P1–P2):
- Detect failure (RustFS health check)
- Read the remaining 5 chunks (D1, D3, D4, P1, P2) — we only need 4 of them
- Solve the linear system → recover D2
- Write D2 to a fresh drive on a different node
No human intervention. No replica conflicts. Just math.
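For our m=1 scheme, the Reed-Solomon parity degenerates to plain bytewise XOR (the same trick RAID 5 uses), so the whole recovery fits in a few lines — a sketch, not the RustFS implementation:

```python
# k=4, m=1: parity P = D1 ^ D2 ^ D3 ^ D4 (bytewise XOR).
# Any single lost chunk equals the XOR of the four survivors.
def xor_chunks(chunks):
    out = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, b in enumerate(chunk):
            out[i] ^= b
    return bytes(out)

data = [b"ABCD", b"EFGH", b"IJKL", b"MNOP"]   # 4 data chunks, 1 per node
parity = xor_chunks(data)                     # parity chunk on the 5th node

lost = data[1]                                # node holding D2 dies
survivors = [data[0], data[2], data[3], parity]
assert xor_chunks(survivors) == lost          # D2 rebuilt from the other four
print("recovered:", xor_chunks(survivors))
```

m ≥ 2 needs the full Galois-field math, but the shape of recovery is identical: read any k chunks, solve, rewrite.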
Replication vs erasure coding — the trade-off
| Property | Replication (3×) | EC k=4,m=2 | EC k=8,m=2 |
|---|---|---|---|
| Storage overhead | 200% (3× original) | 50% (1.5×) | 25% (1.25×) |
| Tolerates | Loss of 2 nodes | Loss of 2 nodes | Loss of 2 nodes |
| Read performance | Fast (one chunk = one read) | Slower (gather k chunks) | Slowest (gather 8 chunks) |
| Write performance | Fast (no math) | Slower (parity compute) | Slower (parity compute) |
| Repair traffic | 1× drive's worth | k× drive's worth | k× drive's worth |
| Best for | Hot data, small files | Bulk data, large files | Cold archives at scale |
k=4, m=1 — split into 4 data + 1 parity, distributed exactly one chunk per node. 25% storage overhead (1.25× the original size), survives loss of any single node. The five-node count was chosen specifically so this scheme fits cleanly: each node holds exactly one chunk of every object, so recovery from a node failure pulls one chunk from each of the four survivors.

At 1 MW we'll move to k=8, m=2 — same 25% overhead but tolerates loss of two nodes simultaneously, which becomes important once you're operating dozens of storage servers.

11 Snapshots — point-in-time views
A frozen view of a dataset at a moment in time, on the same hardware. Cheap thanks to “copy-on-write.”
How copy-on-write (CoW) works
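The snapshot shares every block with the live dataset at creation time; only when a block is later overwritten does the system write the new version to fresh space, leaving the snapshot's copy untouched. A toy sketch of the bookkeeping:

```python
# Toy copy-on-write: a snapshot is a frozen block MAP, not a data copy.
blocks = {0: b"alpha", 1: b"beta"}      # physical block store
live_map = {0: 0, 1: 1}                 # logical block -> physical block

snapshot_map = dict(live_map)           # snapshot = copy the map only (cheap)

# Overwrite logical block 1: allocate a new physical block, never touch the old.
blocks[2] = b"BETA-v2"
live_map[1] = 2

print(blocks[live_map[1]])       # b'BETA-v2' -- live view
print(blocks[snapshot_map[1]])   # b'beta'    -- snapshot still sees the past
```

Creating the snapshot costs almost nothing; you only pay (in space) for blocks that change afterwards.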
What snapshots are good for
- Ransomware recovery. Take hourly snapshots. If malware encrypts your files, roll back.
- “Oops I deleted that.” Customer self-service restore from a recent snapshot.
- Compliance. Audit-friendly point-in-time views (“what did this data look like on March 31?”).
- Test/dev clones. Spin up a snapshot as a fresh dataset for testing without copying anything.
12 Backups — copies on different hardware
A point-in-time copy on different hardware (often a different site). Survives total loss of the primary.
The 3-2-1 rule
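Three copies of your data, on two different types of media, with one copy offsite. A snapshot on the same pod doesn't count; the copy in a different building does.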
Backup types — full, incremental, differential
Full backup
A complete copy of everything. Largest. Slowest to make. Fastest to restore (single file → done).
Week of full backups: FULL · FULL · FULL · FULL · FULL · FULL · FULL — a complete copy every day.
Storage cost: very high. Restore time: fastest. Used for: small datasets, weekly anchors.
Incremental backup
Only the changes since the last backup of any kind. Smallest daily size. Slowest to restore — you need the full + every increment.
Week of incrementals: FULL · INC · INC · INC · INC · INC · INC — one Sunday anchor, then daily deltas.
Restore Friday's data: replay Sun + Mon + Tue + Wed + Thu + Fri. Six restore steps.
Differential backup
Only the changes since the last full. Sizes grow through the week. Faster restore than incremental — just full + latest differential.
Week of differentials: FULL · DIFF · DIFF · DIFF · DIFF · DIFF · DIFF — each differential captures everything since Sunday's full.
Restore Friday's data: replay Sun + Fri. Two restore steps. Storage middle-ground.
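The restore-chain difference in a sketch — given a week of backups, which sets must be replayed to reach Friday:

```python
# Restore chain length per strategy (FULL on Sunday, one backup per day after).
week = ["Sun", "Mon", "Tue", "Wed", "Thu", "Fri"]

def restore_chain(strategy, target_day):
    i = week.index(target_day)
    if strategy == "full":
        return [target_day]              # latest full is self-sufficient
    if strategy == "incremental":
        return week[: i + 1]             # full + every increment since
    if strategy == "differential":
        return [week[0], target_day]     # full + latest differential

for s in ("full", "incremental", "differential"):
    print(f"{s:12s} -> {' + '.join(restore_chain(s, 'Fri'))}")
```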
RPO and RTO — the two numbers customers ask for
RPO — Recovery Point Objective — the data-loss budget
How much data are you willing to lose? Determined by backup frequency.
- Daily backup → RPO = 24 hours
- Hourly snapshot → RPO = 1 hour
- Sync replication → RPO ≈ 0
RTO — Recovery Time Objective — the downtime budget
How long until service is restored? Determined by recovery process.
- Restore from tape → RTO = hours-days
- Restore from disk backup → RTO = minutes-hours
- Hot DR site failover → RTO = seconds
Restore-proof drills — our differentiator
Backups that are never tested are not backups — they're guesses. Every month we pick a random sample of customer data, restore it from backup, hash-verify against the original, and produce a report with timestamps and ratios. SOC 2 requires this. Most providers don't actually do it.
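The verify step of the drill is just hashing both sides and comparing — a minimal sketch (the sample data and the restore callable stand in for the real pipeline):

```python
import hashlib
import random

def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def restore_drill(originals: dict, restore) -> list:
    """originals: key -> source bytes. restore: restores one key from backup."""
    sample = random.sample(sorted(originals), k=min(3, len(originals)))
    return [
        (key, "PASS" if sha256(originals[key]) == sha256(restore(key)) else "FAIL")
        for key in sample
    ]

# Illustrative run with a stand-in backup that returns the original bytes.
data = {"a.dat": b"x" * 10, "b.dat": b"y" * 20, "c.dat": b"z" * 30}
print(restore_drill(data, restore=lambda key: data[key]))
```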
13 Disaster recovery — surviving site loss
Backups + a plan. The plan matters as much as the data.
Three flavours of standby
| Type | What's running | Failover time | Cost |
|---|---|---|---|
| Cold standby | Backup tapes/files in storage. No live infrastructure. | Hours to days (rebuild + restore) | Lowest |
| Warm standby | Hardware powered on, software installed, data replicated periodically. | Minutes | Medium |
| Hot standby | Fully running mirror, sync replication, can take traffic immediately. | Seconds (automatic failover) | Highest (~2× primary) |
Failover and failback
Failover = switching from primary to DR site when something goes wrong. Failback = switching back to primary once it's repaired. Both should be tested quarterly. Many companies have working failover and broken failback because they never practise it.
14 Power & UPS — the chain that keeps everything running
Servers care about clean, continuous power. The grid does not provide this on its own.
Each piece in detail
UPS — Uninterruptible Power Supply
A wall of batteries that bridges the gap when grid power dies. The runtime is short on purpose — UPS is meant to keep the lights on for the 30–60 seconds it takes the diesel generator to start, plus a buffer. Common types:
- VRLA (lead-acid): cheap, heavy, 3–5 year life. What our 1 MW build uses.
- Lithium-ion: 3× more expensive, half the weight, 10+ year life, less floor space. Increasingly the default.
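Bridge runtime is just stored energy over load — the 1 MW build's numbers from section 19 check out:

```python
# UPS bridge time: usable battery energy divided by the IT load it carries.
usable_kwh = 180     # VRLA bank, usable capacity (section 19)
it_load_kw = 720     # 1 MW build's IT load (section 19)

runtime_min = usable_kwh / it_load_kw * 60
print(f"{runtime_min:.0f} minutes of bridge time")
# 15 min -- the generator needs 30-60 s, so the margin is wide on purpose.
```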
Generator — diesel or gas
Kicks in within seconds of grid loss. Sized to run the entire facility at full load indefinitely (as long as fuel keeps arriving). The fuel logistics are a real operational concern — long grid outages have starved data centres of diesel.
PDU — Power Distribution Unit
The strip that distributes power to individual servers in a rack. Modern PDUs are “intelligent” — they meter per-outlet power draw, which is how you bill customers in colocation.
Dual feeds (A & B)
Servers have two power supplies (PSUs). Each plugs into a separate PDU, fed by a separate UPS, fed by a separate utility feed. Either side can fail without dropping the server. This is what “2N power” means.
Redundancy notation: N, N+1, 2N, 2N+1
| Notation | What it means | Example |
|---|---|---|
| N | Just enough capacity for the load. No redundancy. | 1 UPS for 100 kW load |
| N+1 | One spare unit. Tolerates 1 failure. | 2 UPS units, either can run the full load |
| 2N | Two completely independent paths. Tolerates failure of an entire path. | 2 separate UPS systems, each carrying the full load on its own |
| 2N+1 | Two paths, plus one spare on each. Highest practical redundancy. | Tier IV facilities |
PUE — Power Usage Effectiveness
The efficiency metric the industry uses. PUE = total facility power ÷ IT power. A PUE of 1.0 is theoretical perfection. Modern hyperscale data centres run 1.1–1.2. Older enterprise sites run 1.8–2.5. Lower is better (at PUE 1.2, a 1 MW IT load draws 1.2 MW from the grid — the extra 200 kW is mostly cooling). India's hot climate makes 1.1 hard.
15 Cooling — what nobody warns you about
Servers turn nearly 100% of their electricity into heat. A 1 MW server load produces 1 MW of heat. That heat has to go somewhere.
Hot aisle / cold aisle layout
Servers all face the cold aisle and exhaust into the hot aisle. The hot aisle air goes back to the air conditioner and starts over. Mixing hot and cold air is the most common efficiency killer. Modern designs put physical containment doors between aisles.
Cooling technologies
- CRAC — Computer Room Air Conditioner. Big AC units around the perimeter. Standard for the last 30 years.
- CRAH — Computer Room Air Handler. Uses chilled water from a building plant. More efficient at scale.
- In-row cooling — AC units placed between racks, much shorter air path.
- Liquid cooling / direct-to-chip — coolant flows over the CPU directly. Required for the densest GPU racks (40 kW+). Becoming standard for AI training.
- Immersion cooling — entire servers submerged in non-conductive fluid. Niche but growing.
16 Network topology — leaf-spine and why it replaced everything
Old data-centre networks couldn't keep up with east-west traffic. Modern ones use a different shape.
The old way — three-tier
The modern way — leaf-spine
Why leaf-spine wins
- Predictable latency. Every inter-server path through the fabric is the same length: leaf → spine → leaf. The 3-tier design varies from 2 to 6 hops.
- ECMP — Equal-Cost Multi-Path. Traffic between two leaves spreads across all spines simultaneously (see the sketch after this list). Bandwidth = sum of all spine links, not just one path.
- Easy to scale. Need more bandwidth? Add a spine switch. Need more rack ports? Add a leaf. No re-engineering.
- No spanning tree. Modern fabric protocols (BGP-EVPN, VXLAN) route around failures in milliseconds.
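How ECMP spreads load without reordering packets: hash each flow's 5-tuple onto a spine, so one flow sticks to one path while many flows use all spines in aggregate — a sketch:

```python
# ECMP: hash a flow's 5-tuple to pick one of the equal-cost spine paths.
import hashlib

SPINES = ["spine-1", "spine-2", "spine-3", "spine-4"]

def pick_spine(src_ip, dst_ip, src_port, dst_port, proto="tcp"):
    five_tuple = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = int.from_bytes(hashlib.sha256(five_tuple).digest()[:4], "big")
    return SPINES[digest % len(SPINES)]

# Same flow -> same spine (in-order delivery); a new flow may take another.
print(pick_spine("10.0.1.5", "10.0.2.9", 40001, 9000))
print(pick_spine("10.0.1.5", "10.0.2.9", 40002, 9000))
```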
17 Tier ratings — Uptime Institute classification
A standard the whole industry uses to describe how resilient a facility is. Memorise these — they're in every RFP.
| Tier | Redundancy | Annual downtime | What's needed |
|---|---|---|---|
| Tier I | N (basic) | ~28.8 hours | Single power and cooling path. No redundancy. Office-grade. |
| Tier II | N+1 | ~22 hours | Redundant components (UPS, generators) but single path. Maintenance requires shutdown. |
| Tier III | N+1, concurrently maintainable | ~1.6 hours | Multiple paths but only one active at a time. Can do maintenance without downtime. Industry sweet spot. |
| Tier IV | 2N or 2N+1, fault tolerant | ~0.4 hours | Fully fault tolerant. Tolerates failure of any single piece without service impact. Very expensive. |
What customers actually want: Tier III is good enough for 99% of workloads and ~half the cost of Tier IV. Tier IV is for stock exchanges, defence, hospitals.
Watch the language: “Tier III-class” or “Tier III-equivalent” means a vendor built to the spec but didn't pay the Uptime Institute for certification. Real Tier III certification is independent and audit-driven. Our 1 MW build will go through formal review.
18 What we sell
Same platform stack underneath. Three commercial wrappers on top.
Pilot
- Customer pushes a sample workload to an isolated bucket
- We run Jam, produce compression + restore report
- Discounted, with benchmark-rights agreement
- 1 storage node + control · ~50–200 TB slice
- Outcome: convert to Production
Production
- S3 endpoint · TLS 1.3 · AES-256 at rest
- Per-tenant bucket isolation + IAM
- Customer dashboard (TB stored, TB saved, ratio)
- Monthly billing export
- Erasure-coded redundancy (k=4,m=1 on the 1 PB pod, 25% overhead)
- Snapshots + restore-proof drills
- Named technical contact · SLA per workload tier
- Anchor: M2M tier — ~50% below commercial cloud (e.g. AWS S3)
Enterprise / Gov / Tender
- Sovereign-data clauses (DPDP Act 2023 residency)
- Dedicated network domain (VPN/VPC peering)
- Customer-controlled KMS / BYOK
- Air-gapped option (Jam on customer's own iron)
- Project-financed work-order model (India tender)
- Bespoke compliance reporting
- Hosted AI inference, customer-isolated
19 1 PB → 1 MW scale-up
Nothing in the proof pod gets thrown away. Every architecture choice carries forward — the 1 MW build adds layers around it.
| Dimension | 1 PB proof pod (final spec) | 1 MW reference build |
|---|---|---|
| Physical | Single cabinet in shared colo · 7–9U occupied | ~14 cabinets in own facility |
| Nodes | 5 storage + 1 control (all 1U Supermicro) | ~68 storage nodes + multiple control |
| CPU | EPYC 7313P · 16C · Milan · single-socket | EPYC Genoa or successor, mix of single & dual socket |
| Storage | All-NVMe · 8 × 30.72 TB Samsung PM9A3 per node | Hot NVMe ~5% · Warm SSD ~20% · Bulk HDD ~75% (tiered) |
| Raw capacity | 1.23 PB raw · ~983 TB usable | ~50–100 PB raw at full build-out |
| Network | 1 × Mellanox SN2700 100 GbE ToR (refurbished) · DAC cables | Leaf-spine: many 100 GbE leaves + 400 GbE spines |
| East-west bandwidth | ~1 Tbps | 10s of Tbps |
| Redundancy | EC k=4,m=1 (25% overhead, tolerates 1 node loss) | EC k=8,m=2 (25% overhead, tolerates 2 node loss) |
| Power | 4–5 kW typical / ~6 kW peak from colo PDU | ~720 kW IT (or ~70 kW backup-tier) from captive solar @ ₹2.35/kWh |
| UPS / battery | Provided by colo | ~180 kWh VRLA usable (15-min runtime) |
| Generator | Provided by colo | Diesel backup, our own fuel logistics |
| Compliance | Tier III-class facility via colo | Tier III formal review (independent MEP audit) |
| Capex | ~$60–80K (refurbished networking saves significantly) | $8–10M rack-level |
| Opex | ~$5–8K/month (lease + power + bandwidth) | Power (offset by solar) + maintenance + bandwidth |
Carries over unchanged
- ✓ Jam codec
- ✓ RustFS S3 layer
- ✓ Customer dashboards
- ✓ Telemetry stack
- ✓ Per-tenant isolation
- ✓ Compliance posture
- ✓ Operational playbook
- ✓ Mellanox / EPYC choices
New at 1 MW
- Tiered storage management (hot / warm / bulk)
- Leaf-spine network design
- UPS / battery engineering
- Captive solar PPA
- Generator + fuel logistics
- Formal Tier III review
Where to go deeper
- Compliance posture — DPDP, CERT-In, SOC 2, FIPS-140, telemetry-only
- Economics — ₹/TB at the 1 PB pod, breakeven on the 1 MW build
- Competitive picture — why hyperscalers can't replicate this in India
- Tender mechanics — what artefacts procurement actually needs at each stage
- Failure modes — what happens when a drive dies mid-write, when a switch dies, when the building loses power
Pick the one that's least clear and we can walk through it in equivalent depth.