Accelerate AI Jobs with Real-time Resilience
Transform utilization by eliminating costly restarts.
Disruptive link failures shouldn’t bring AI training to a halt. Clockwork Workload Failover delivers job-aware resilience that absorbs NIC failures and link flaps in real time—keeping clusters productive and avoiding wasted GPU hours.

Disruptive Network Failures and Link Flaps Are Common and Expensive
Why traditional fabrics force resets instead of recovery.
At scale, even rare NIC or optical failures compound into frequent job restarts. With thousands of links in a cluster, the statistical mean time to failure is measured in minutes, not years, and the result is GPU stalls, lost hours, and mounting cost. One of the most common problems is InfiniBand/RoCE link failure. Even if each NIC-to-leaf-switch link had a mean time to failure of 5 years, the sheer number of transceivers means it would take only 26.28 minutes for the first job failure.
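The arithmetic behind that figure can be sketched in a few lines. This is a minimal illustration, not Clockwork's model: it assumes links fail independently with exponentially distributed lifetimes, so the expected time to the first failure among N links is the per-link mean time to failure divided by N. The function name and the link counts are hypothetical; the page does not state a cluster size, and 100,000 links is simply the count at which a 5-year per-link figure works out to 26.28 minutes.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def minutes_to_first_failure(link_mttf_years: float, num_links: int) -> float:
    """Expected minutes until the first of num_links links fails.

    Assumption (not from the source): links fail independently with
    exponentially distributed lifetimes, so the minimum of N lifetimes
    has mean MTTF / N.
    """
    if num_links <= 0:
        raise ValueError("num_links must be positive")
    return link_mttf_years * MINUTES_PER_YEAR / num_links


if __name__ == "__main__":
    # Hypothetical cluster sizes, chosen only to show how the figure scales.
    for n in (1_000, 10_000, 100_000):
        minutes = minutes_to_first_failure(5, n)
        print(f"{n:>7} links: {minutes:,.2f} minutes to first failure")
```

Under these assumptions, a 5-year per-link mean time to failure across 100,000 links gives 2,628,000 / 100,000 = 26.28 minutes. Real failures are often correlated, so this is best read as a rough estimate.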


“Achieving high utilization with them (GPUs) is difficult due to the high failure rate of various components, especially networking.”
Sources: “Falcon: Pinpointing and Mitigating Stragglers for Large-Scale Hybrid-Parallel Training”, 2024; “The Llama 3 Herd of Models”, 2024; “Alibaba HPN: A Data Center Network for Large Language Model Training”, ACM SIGCOMM ’24; “Gemini: Fast Failure Recovery in Distributed Training with In-Memory Checkpoints”, 2023

Clockwork’s Workload Failover Provides Resilience To Link Flaps
Sustain training momentum with resilient, job-aware networking.
Link/NIC flapping
- Before Clockwork: A NIC failure kills the job, halting training until a restart and checkpoint recovery.
- After Clockwork: Jobs stay alive. Throughput dips briefly, then recovers to full speed within a minute.

Figure: Node inbound InfiniBand throughput
- AI job resilience in action: Even with multiple NIC flaps, the workload continues running, with no restart required.
- Graceful recovery: Throughput dips briefly, then automatically restores to full speed, preserving collective progress and avoiding wasted GPU hours.

Learn More
Stop wasting GPU cycles. Start scaling smarter.
Clusters must deliver high uptime while running at maximum efficiency.
Turn your GPU clusters into a competitive advantage—not a cost center.
