
Give candidates a real broken environment — a CrashLoopBackOff, a failed scheduled task, a disk at 100% — and watch how they actually troubleshoot. No whiteboards. No trivia. No "explain a linked list." Just the work.
It's open source: run it yourself, free, forever.
They crushed the system-design interview. They named all four levels of cache. Then a pod went into CrashLoopBackOff in week two — and they opened a ticket instead of opening the logs.
None of it answers the one thing you need to know: when something breaks at 2am, can this person fix it?
A real environment, a real fault, a real terminal, a running clock. Exactly like the job.
A web app throwing 502s. A scheduled task failing silently. An archive job that can't auth to SQL. Choose from the open scenario library or write your own — each is a real environment with a deliberately-broken fault.
The candidate sees the system healthy first, so they know what "right" looks like. Then the fault is injected. A trouble ticket lands — exactly as vague as the ones your team actually gets. The clock starts.
Live SSH, RDP, and kubectl through the browser, no setup. Every session is recorded: not just whether they fixed it, but how they got there. Did they check the logs first, or guess? Did they trace the chain, or thrash?
No "verify" button. Just like production — they decide when it's done, and you find out if they were right. Configs and manifests are snapshotted at submit for review.
Actual EC2, actual Kubernetes, actual databases. They're not clicking through a quiz — they're in a live box that's genuinely broken.
The recording shows their reasoning. Two people fix the same disk; one ran df then du, the other blindly deleted logs. You'll see which is which.
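For a sense of what "df then du" looks like as a method, here is a minimal Python sketch of the same two steps: measure how full the filesystem is, then find which directories are eating the space before deleting anything. The function names are hypothetical and not part of the platform, and the sizes are apparent file sizes rather than the allocated blocks that du reports.

```python
import os
import shutil


def disk_usage_percent(path="/"):
    """Step 1 (like `df`): how full is the filesystem that holds `path`?"""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def largest_subdirs(root, top=5):
    """Step 2 (like `du`): total file bytes under each immediate
    subdirectory of `root`, largest first.

    Uses apparent file size (st_size), not allocated blocks, so the
    numbers can differ from what `du` prints.
    """
    sizes = {}
    for entry in os.scandir(root):
        if not entry.is_dir(follow_symlinks=False):
            continue
        total = 0
        for dirpath, _dirnames, filenames in os.walk(entry.path):
            for name in filenames:
                try:
                    total += os.lstat(os.path.join(dirpath, name)).st_size
                except OSError:
                    # File vanished or is unreadable; skip it.
                    pass
        sizes[entry.path] = total
    ranked = sorted(sizes.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top]
```

A candidate who works this way knows which directory is the problem before touching it. The one who blindly deletes logs skips step 2 entirely.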
Scenarios are folders: setup/, break/, validate/. Version them, PR them, fork someone else's. Point at any GitHub repo for community scenarios.
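Because a scenario is just a folder, checking one before you merge a PR can be a few lines. The sketch below assumes setup/, break/, and validate/ are subdirectories of the scenario folder; the helper name is hypothetical, and the project's documented schema is the real source of truth for what a valid scenario contains.

```python
from pathlib import Path

# Assumed layout: three subdirectories named after the stages on this page.
STAGES = ("setup", "break", "validate")


def missing_stages(scenario_dir):
    """Return the stage directories that are absent from a scenario folder."""
    root = Path(scenario_dir)
    return [stage for stage in STAGES if not (root / stage).is_dir()]
```

An empty list means all three stage folders are present. Anything else is a quick way to catch a half-written scenario in review.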
Compose scenarios into role-based exams — L1 SRE, L3 SRE, Cloud Engineer. Pods, then DB connectivity, then a disk, then a cross-system incident.
Self-host the whole platform. Your infra, your candidates, your scenarios, your control. No phone-home, no per-seat tax.
Not an HR product with a technical coat of paint. It's the test we wish we'd been able to give — made by people who've carried the pager.
The entire platform — backend, frontend, example scenarios — is open and self-hostable. Clone it, point it at your own AWS account, run it behind your own firewall. No license keys, no phone-home, no per-seat tax. If you'd rather own the whole stack, you can.
Built on the tools you already run: Terraform, Kubernetes, Guacamole.
Scenarios are portable folders. The schema is documented and versioned.
Self-hosting means your AWS bill, your provisioning reliability, your teardown logic, your 2am page when an environment won't spin up.
Self-hosting is free, forever.
I'm an SRE. I've sat through interviews where the candidate had every certification and couldn't tail a log.
I've also watched people with thin resumes calmly walk a broken cluster back to health. The difference never showed up on paper — only when something was actually on fire.
So we built the fire. Safely, in a box, on a timer. This is the interview I always wanted to give.
— the FaultWorx team