Clockwork Software-Accelerated Networking for AI Workloads

Built on Breakthrough Clock Sync Technology
This white paper examines the evolution of networking, the distinctions between cloud and AI-specific networks, and the unique challenges faced by large-scale CPU and GPU clusters.
It introduces Clockwork’s innovative software-centric approach to AI fabric acceleration and monitoring, which eliminates the need for specialized hardware.

Built on a foundation of fine-grained clock synchronization and rich network telemetry, Clockwork’s solution ensures consistently high network performance, optimizing GPU utilization while providing exceptional reliability. Its comprehensive AI monitoring and rich telemetry span the entire infrastructure footprint, delivering fleet-level health insights alongside granular workload-level monitoring.


Clockwork's GPU Cloud Solution

Our approach is fundamentally different. Our unique software-based solution ensures reliability and fabric acceleration without relying on custom hardware or in-band network telemetry. Compatible with standard Ethernet switches and NICs, it can scale beyond 100,000 GPU nodes while cutting costs, boosting flexibility, and enhancing resilience.

Clockwork Job and Network Fleet Monitoring

Without real-time insight into connectivity, path quality, and message-level and job-level metrics, AI infra/ops teams cannot identify and resolve issues quickly.

Clockwork’s software:

  • Create NIC-to-NIC Probe Mesh: Small probe packets traverse these edges
  • Monitor Network Health: Probes continuously check for liveness of paths, whether or not there’s data on the paths
  • Measure NIC-to-NIC Delays:
    Synchronize the clocks at all the NICs
    Obtain accurate one-way delays for every QPair of interest
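
The measurement idea in the list above can be illustrated with a minimal Python sketch. It is not Clockwork's implementation: the `Probe` record, the `offset_ns` correction, and the three-loss liveness rule are hypothetical choices made for illustration. The only point it shows is that, once sender and receiver clocks are synchronized, a one-way delay is simply the receive timestamp minus the send timestamp.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Probe:
    """One probe sent along a mesh edge (all field names are hypothetical)."""
    src: str               # sending NIC
    dst: str               # receiving NIC
    tx_ns: int             # send time on the sender's clock, in nanoseconds
    rx_ns: Optional[int]   # receive time on the receiver's clock; None if lost


def one_way_delay_ns(probe: Probe, offset_ns: int = 0) -> Optional[int]:
    """Return the one-way delay, or None if the probe was lost.

    offset_ns is the residual receiver-minus-sender clock offset. With
    well-synchronized clocks it is close to zero, which is what makes a
    one-way (rather than round-trip) measurement meaningful.
    """
    if probe.rx_ns is None:
        return None
    return probe.rx_ns - probe.tx_ns - offset_ns


def path_is_live(probes: List[Probe], max_losses: int = 3) -> bool:
    """Treat a path as down once its last `max_losses` probes were all lost."""
    recent = probes[-max_losses:]
    if len(recent) < max_losses:
        return True  # not enough evidence to declare the path down
    return any(p.rx_ns is not None for p in recent)
```

Because the probes are sent whether or not application data is flowing, a check like `path_is_live` can flag a dead path before a job ever tries to use it.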

Clockwork Solution To Link/NIC Flapping

Link/NIC failures caused by overheating optics are a common problem in InfiniBand and RoCE networks, resulting in job crashes and more frequent checkpointing and restarts.

“Alert: Urgent | Cluster: link has flapped more than 8 times within the past hour”
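
An alert rule like the one quoted above can be expressed as a sliding-window counter. The sketch below is only an illustration of that rule (more than 8 flaps within one hour). The class name `FlapCounter` and its parameters are hypothetical and do not describe any particular monitoring product.

```python
from collections import deque


class FlapCounter:
    """Count link flaps inside a sliding time window (illustrative only)."""

    def __init__(self, threshold: int = 8, window_s: float = 3600.0):
        self.threshold = threshold   # alert when flaps exceed this count
        self.window_s = window_s     # window length in seconds (one hour)
        self.events = deque()        # timestamps of flaps still in the window

    def record_flap(self, t_s: float) -> bool:
        """Record a flap at time t_s; return True if the alert should fire."""
        self.events.append(t_s)
        # Drop flaps that have aged out of the window.
        while self.events and self.events[0] <= t_s - self.window_s:
            self.events.popleft()
        return len(self.events) > self.threshold
```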

Clockwork’s software:

  • Quickly detect link/NIC failures
  • Use an alternate path
  • Monitor health of failed paths and re-use them when they recover
  • Ensure continuous operations despite failures, leading to higher GPU utilization and faster job completion times
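
The detect, reroute, and recover cycle described in the list above can be sketched as a small state machine. This is a simplified illustration under stated assumptions, not Clockwork's actual mechanism: `PathSelector`, its preference-ordered path list, and the single-probe health signal are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class PathSelector:
    """Pick the most-preferred healthy path (hypothetical helper)."""
    paths: List[str]                              # ordered by preference
    down: Set[str] = field(default_factory=set)   # paths currently failing

    def report_probe(self, path: str, ok: bool) -> None:
        """Update path health from a probe result.

        Failed paths keep being probed, so a later successful probe
        makes the path eligible for traffic again.
        """
        if ok:
            self.down.discard(path)
        else:
            self.down.add(path)

    def active_path(self) -> Optional[str]:
        """Return the first healthy path, or None if every path is down."""
        for path in self.paths:
            if path not in self.down:
                return path
        return None
```

For example, after `report_probe("primary", False)` traffic would move to the next healthy path, and it would return to `"primary"` once a probe on that path succeeds again.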

Link/NIC Flapping: Before and After Clockwork

Without Clockwork, a NIC failure halts AI jobs entirely. With Clockwork, jobs continue at reduced throughput during a failure and quickly return to full capacity, ensuring robust resilience and uninterrupted performance.

Clockwork's Solution to Fabric Contention

Bursty traffic causes multiple data flows to collide on links and contend for bandwidth, resulting in low throughput, high latency, and degraded NCCL performance.

“Network links get saturated quickly … The last flow dictates starting of next iteration”

Clockwork’s software:

  • Use QPair-level delay measurements to detect “fabric contention” (the oversubscription of certain paths in the fabric)
  • Balance the load evenly across the network fabric to eliminate contention and increase the total throughput.
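
The two steps above, detecting contention from delay measurements and then spreading load, can be illustrated with a short sketch. The page does not describe Clockwork's actual algorithms, so this substitutes two simple stand-ins: a median-versus-baseline threshold for detection and a greedy largest-flow-first assignment for balancing. All names and the `factor` threshold are assumptions.

```python
from statistics import median
from typing import Dict, List, Set, Tuple


def contended_paths(delays_us: Dict[str, List[float]],
                    baseline_us: Dict[str, float],
                    factor: float = 2.0) -> Set[str]:
    """Flag paths whose median recent delay exceeds factor x baseline.

    delays_us maps a path to recent one-way delay samples (microseconds);
    baseline_us maps a path to its uncongested delay. Both are assumed inputs.
    """
    return {
        path for path, samples in delays_us.items()
        if samples and median(samples) > factor * baseline_us[path]
    }


def rebalance(flow_gbps: Dict[str, float],
              paths: List[str]) -> Tuple[Dict[str, str], Dict[str, float]]:
    """Greedy heuristic: place the largest flows first on the least-loaded path."""
    load = {path: 0.0 for path in paths}
    assignment: Dict[str, str] = {}
    for flow, rate in sorted(flow_gbps.items(), key=lambda kv: -kv[1]):
        target = min(paths, key=lambda p: load[p])
        assignment[flow] = target
        load[target] += rate
    return assignment, load
```

With four flows of 40, 30, 20, and 10 Gbps and two paths, this heuristic places 50 Gbps on each path, whereas a naive placement could put 70 Gbps on one path and 30 Gbps on the other.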

Fabric Contention: Before and After Clockwork

Clockwork improves both throughput and latency in high-performance networks. In an All-to-All workload under contention, it recovers throughput from 39 Gbps to 92 Gbps while reducing latency to under 50 microseconds, even with simultaneous jobs.

100% Pure Software: Accelerate any workload on any AI accelerator and network.

Clockwork’s software replaces costly hardware, enabling rapid congestion resolution while ensuring reliability, acceleration, and full network visibility to keep AI jobs running 24/7.
