The content on this page was provided by an independent third party and syndicated by XPR Media. Members of the editorial and news staff of the USA TODAY Network were not involved in the creation of this content.

Clockwork.io Introduces A New Class of Fault Tolerance to End Failure-Driven GPU Waste in AI Training

New TorchPass solution addresses a multimillion-dollar challenge in AI infrastructure; uses Live GPU Migration to keep large-scale AI training running through hardware failures instead of forcing costly restarts

PALO ALTO, CA / ACCESS Newswire / March 11, 2026 / Clockwork.io, the leader in Software-Driven AI Fabrics (a programmable, vendor-neutral software layer that optimizes large-scale GPU clusters for real-time observability, fault tolerance, and deterministic performance), today announced the general availability of TorchPass Workload Fault Tolerance. This new class of software-driven fault tolerance eliminates one of the most costly failure modes in large-scale AI training: catastrophic job restarts caused by infrastructure faults.

Delivered as a core capability of the Clockwork.io FleetIQ platform, TorchPass applies the principles of Software-Driven AI Fabrics to distributed training, using Live GPU Migration to allow workloads to continue running through GPU failures, network disruptions, driver bugs, and even full node crashes, without checkpoint restarts or lost progress.

“Companies are investing billions in next-gen chips, yet the costs of running distributed AI jobs remain grossly inflated because the ecosystem has accepted failure as a constant,” said Suresh Vasudevan, CEO of Clockwork.io. “We built TorchPass to fundamentally reject that premise. Instead of treating failure as inevitable and restarting after the fact, TorchPass makes infrastructure faults invisible to the workload: training continues through failures transparently, in software. For a typical 2,048-GPU deployment, that translates into over $6 million a year in recovered compute. This is what our Software-Driven AI Fabric approach was designed to deliver: fault-tolerant AI infrastructure.”

Dylan Patel, Founder and CEO of SemiAnalysis, agreed that large-scale training jobs are limited by interruptions.

“As Blackwell clusters roll out with an NVL72 domain, and we look to the future with Rubin Ultra’s NVL576 domain, the idea that a single GPU error or network link flap can take down an entire run is totally unacceptable,” said Patel. “TorchPass solves a huge challenge with cluster reliability: it provides transparent failover and live workload migration that keeps MFU high, which in turn drives better GPU economics.”

Why AI Training Fails at Scale

Distributed AI training remains one of the most failure-prone workloads in modern infrastructure. As cluster sizes grow, fragility increases sharply. Research from Meta FAIR shows that mean time to failure drops to 7.9 hours in a 1,024-GPU cluster and to just 1.8 hours at 16,384 GPUs. For most large AI-focused enterprises and AI clouds, failure-driven restarts are therefore inevitable, making them a major barrier to scaling AI’s impact.
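
To put the quoted mean-time-to-failure figures in operational terms, the short calculation below converts them into expected failures per day. It is a back-of-the-envelope sketch only: the function name is illustrative, and it assumes failures arrive at a steady average rate, which the cited research figures alone do not establish.

```python
def expected_failures_per_day(mttf_hours: float) -> float:
    """Average failures per 24 hours, assuming a constant failure rate."""
    if mttf_hours <= 0:
        raise ValueError("mttf_hours must be positive")
    return 24.0 / mttf_hours


# MTTF figures quoted above (hours), keyed by cluster size in GPUs.
quoted_mttf = {1_024: 7.9, 16_384: 1.8}

for gpus, mttf in quoted_mttf.items():
    rate = expected_failures_per_day(mttf)
    print(f"{gpus:>6} GPUs: ~{rate:.1f} failures/day")
```

Under that assumption, a 1,024-GPU cluster would see roughly three failures a day and a 16,384-GPU cluster about thirteen.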

Each failure forces training jobs to roll back to the most recent checkpoint, discarding minutes or hours of completed work and wasting additional time on manual intervention, reprovisioning resources and restarting training. These restarts silently cap GPU utilization, making reliability one of the largest hidden costs in AI infrastructure.

TorchPass addresses this problem proactively, resolving costly AI workload failures before the job stops or needs to restart. For enterprises running large AI workloads and for AI clouds alike, TorchPass dramatically improves workload reliability and cluster utilization. AI clouds can now address impacted GPUs while preserving the training run as planned, which translates into better customer SLAs and stronger overall economics, improving their ability to protect margin and deliver new models sooner.

“Managing compute output across large-scale GPU clusters is vital to ensuring we’re delivering reliable capacity to our customers. By using TorchPass we have the support of a company that focuses on resilience like it is a core business function: it replaces any specific failing GPU and keeps the rest of the job moving, rather than making one small problem impact our large-scale operations,” said David Power, CTO of Nscale. “In our evaluation, Live GPU Migration preserved both run continuity and throughput under real fault conditions, which is exactly what you need to deliver predictable time-to-train and a better customer experience at scale.”

How Live GPU Migration Works: Reliability Without Restart

TorchPass performs transparent, in-flight migration of impacted training ranks to spare resources when failures occur. TorchPass typically completes recovery in approximately three minutes while the training process continues uninterrupted.

It supports resilience across three failure scenarios:

  • Unplanned migration, handling sudden events such as kernel crashes, power failures, or GPU faults by reconstructing state from healthy replicas

  • Pre-emptive migration, triggered by early warning signals such as rising temperatures or ECC memory errors, enabling controlled migration before a hard failure

  • Planned migration, enabling maintenance, patching, and workload rebalancing without interrupting training

This approach reduces wasted training progress by 95%, cutting lost time from approximately three hours per day to under ten minutes in a 1,024-GPU cluster.
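
The arithmetic behind that claim can be checked directly. The sketch below applies the quoted 95% reduction to the quoted baseline of about three lost hours per day; the function name is illustrative and is not part of any Clockwork.io tooling.

```python
def remaining_lost_minutes(baseline_hours_per_day: float, reduction: float) -> float:
    """Minutes still lost per day after cutting the baseline by `reduction` (0 to 1)."""
    if not 0.0 <= reduction <= 1.0:
        raise ValueError("reduction must be between 0 and 1")
    return baseline_hours_per_day * 60.0 * (1.0 - reduction)


# Figures quoted above: ~3 hours lost per day, reduced by 95%.
lost = remaining_lost_minutes(3.0, 0.95)
print(f"~{lost:.0f} minutes lost per day")
```

A 95% cut to 180 minutes leaves about 9 minutes, which is consistent with the "under ten minutes" figure.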

Jordan Nanos, Member of Technical Staff and lead author of ClusterMAX (SemiAnalysis’ independent benchmark for large-scale AI training), stress-tested Clockwork.io TorchPass and found it delivered leading performance and efficiency for large-scale distributed training, enabling users to reduce checkpointing overhead in training. He shared the following results:

“In our testing, Clockwork.io TorchPass delivered the fastest and most efficient fault-tolerant performance for a gpt-oss-120B training run. We used TorchTitan on a Kubernetes cluster with 64x H200 GPUs. During our testing we measured job completion time (JCT) and Model FLOPs Utilization (MFU) against a standard approach (checkpoint-restart) and the leading open-source fault-tolerant training framework (TorchFT). We simulated multiple hardware failures on the cluster in order to stress test the fault-tolerant training frameworks.

When compared to checkpoint-restart, TorchPass was significantly faster to recover from failures. This reduced overall JCT and maintained high MFU. And when compared to TorchFT, TorchPass had a significantly higher MFU. This reduced overall JCT while also maintaining an equal time to recover from failures.

Using TorchPass also has a downstream effect where it provides users with an opportunity to reduce or even remove checkpointing from their training code. This means larger effective batch sizes, lower risk of out of memory errors (OOMs), and less time spent thinking about storage. For a research organization, this can ultimately mean a faster time to reach their training objective,” concluded Nanos.

Measurable Business Impact from Software-Driven Fault-Tolerance

For customers operating large AI clusters, the impact is immediate and measurable. In a typical 2,048-GPU H200 deployment, TorchPass Workload Fault Tolerance delivers over $6 million in annual savings by preventing wasted compute.

These savings come from eliminating hundreds of thousands of GPU-hours that would otherwise be lost to failure-driven restarts, cascading retries, and idle recovery time. By keeping training jobs running through infrastructure faults instead of restarting them, TorchPass converts lost GPU time into productive training, significantly improving the return on GPU investments that today often operate at just 30-50% of theoretical performance.

Enabling the Next Generation of AI Infrastructure

By making reliability a software-defined capability rather than a hardware constraint, TorchPass provides the operational confidence required to deploy next-generation, tightly coupled systems such as NVIDIA GB200 and GB300 NVL72 and future rack-scale systems, where dense architectures amplify the cost of even small failures.

TorchPass builds on Clockwork.io’s prior release of Network Fault Tolerance, which applies the same Software-Driven AI Fabric principles to network resilience by transparently rerouting traffic around link failures.

Together, these capabilities form Clockwork.io’s Software-Driven AI Fabric, a vendor-neutral software layer spanning network, compute, and storage. As modern AI workloads run on tightly coupled clusters where hundreds or thousands of processors must operate in coordinated lockstep, infrastructure behaves as a single system, where reliability and performance directly determine overall efficiency. By managing this complexity in software, Clockwork.io enables operators to run heterogeneous AI infrastructure as a unified platform, maintaining high utilization, predictable performance, and resilience while preserving the flexibility to evolve hardware and improve the economics of large-scale AI deployments.

To learn more about the launch of TorchPass, visit the Clockwork.io team in person at NVIDIA GTC from March 16-19, Booth #205, or visit https://clockwork.io.

About Clockwork.io
Clockwork.io pioneers Software-Driven AI Fabrics™, delivering a programmable software layer that makes large-scale AI clusters observable, deterministic, and resilient by design to drive continuous workload progress and peak cluster utilization. Its FleetIQ platform enables enterprises to train, deploy, and serve the world’s most demanding AI workloads faster, more reliably, and at lower cost. Companies including Uber, Wells Fargo, DCAI, Nebius, Nscale, and White Fiber trust Clockwork.io to power their AI infrastructure. Learn more at www.clockwork.io.

Media Contact
Dana Trismen
clockwork@unshakablemarketinggroup.com
650-269-7478

SOURCE: Clockwork

