The content on this page was provided by an independent third party and syndicated by XPR Media. Members of the editorial and news staff of the USA TODAY Network were not involved in the creation of this content.

Clockwork.io Introduces A New Class of Fault Tolerance to End Failure-Driven GPU Waste in AI Training

New TorchPass solution addresses a multi-million dollar challenge with AI infrastructure; uses Live GPU Migration to keep large-scale AI training running through hardware failures instead of forcing costly restarts

PALO ALTO, CA / ACCESS Newswire / March 11, 2026 / Clockwork.io, the leader in Software-Driven AI Fabrics (a programmable, vendor-neutral software layer that optimizes large-scale GPU clusters for real-time observability, fault tolerance, and deterministic performance), today announced the general availability of TorchPass Workload Fault Tolerance. This new class of software-driven fault tolerance eliminates one of the most costly failure modes in large-scale AI training: catastrophic job restarts caused by infrastructure faults.

Delivered as a core capability of the Clockwork.io FleetIQ platform, TorchPass applies the principles of Software-Driven AI Fabrics to distributed training, using Live GPU Migration to allow workloads to continue running through GPU failures, network disruptions, driver bugs, and even full node crashes, without checkpoint restarts or lost progress.

“Companies are investing billions in next-gen chips, yet the cost of running distributed AI jobs remains grossly inflated because the ecosystem has accepted failure as a constant,” said Suresh Vasudevan, CEO of Clockwork.io. “We built TorchPass to fundamentally reject that premise. Instead of treating failure as inevitable and restarting after the fact, TorchPass makes infrastructure faults invisible to the workload: training continues through failures transparently, in software. For a typical 2,048-GPU deployment, that translates into over $6 million a year in recovered compute. This is what our Software-Driven AI Fabric approach was designed to deliver: fault-tolerant AI infrastructure.”

Dylan Patel, Founder and CEO of SemiAnalysis, agreed that large-scale training jobs are limited by interruptions.

“As Blackwell clusters roll out with an NVL72 domain, and we look to the future with Rubin Ultra’s NVL576 domain, the idea that a single GPU error or network link flap can take down an entire run is totally unacceptable,” said Patel. “TorchPass solves a huge challenge with cluster reliability: it provides transparent failover and live workload migration that keeps MFU high, which in turn drives better GPU economics.”

Why AI Training Fails at Scale

Distributed AI training remains one of the most failure-prone workloads in modern infrastructure. As cluster sizes grow, fragility increases sharply. Research from Meta FAIR shows that mean time to failure drops to 7.9 hours in a 1,024-GPU cluster and to just 1.8 hours at 16,384 GPUs. For most large, AI-focused enterprises and AI clouds, failure-driven restarts are therefore inevitable, making them a major barrier to scaling AI’s impact.
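
As a rough illustration of what those mean-time-to-failure figures imply, the sketch below converts each one into an expected number of failures per day. It assumes failures arrive at a steady average rate, which is a simplification introduced here and not a claim from the Meta FAIR research; the function name is hypothetical.

```python
HOURS_PER_DAY = 24.0


def failures_per_day(mttf_hours: float) -> float:
    """Expected failures in 24 hours, assuming a steady average failure rate."""
    if mttf_hours <= 0:
        raise ValueError("mttf_hours must be positive")
    return HOURS_PER_DAY / mttf_hours


# MTTF figures quoted above (hours), keyed by cluster size in GPUs.
MTTF_HOURS = {1024: 7.9, 16384: 1.8}

for gpus, mttf in MTTF_HOURS.items():
    print(f"{gpus} GPUs: about {failures_per_day(mttf):.1f} failures per day")
```

Under that assumption, the 1,024-GPU figure works out to roughly three failures a day and the 16,384-GPU figure to roughly thirteen.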

Each failure forces training jobs to roll back to the most recent checkpoint, discarding minutes or hours of completed work and wasting additional time on manual intervention, reprovisioning resources and restarting training. These restarts silently cap GPU utilization, making reliability one of the largest hidden costs in AI infrastructure.
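
To make the cost of checkpoint rollbacks concrete, here is a minimal back-of-the-envelope model. It assumes a failure lands, on average, halfway through a checkpoint interval, and that each failure adds a fixed restart overhead. The checkpoint interval and restart overhead in the example are hypothetical placeholders chosen for illustration; they are not figures from Clockwork.io or Meta FAIR.

```python
def lost_hours_per_day(
    mttf_hours: float,
    checkpoint_interval_min: float,
    restart_overhead_min: float,
) -> float:
    """Estimate training time lost per day under checkpoint-restart.

    Assumption: on average, half a checkpoint interval of work is discarded
    per failure, plus a fixed restart overhead.
    """
    failures = 24.0 / mttf_hours
    lost_per_failure_min = checkpoint_interval_min / 2.0 + restart_overhead_min
    return failures * lost_per_failure_min / 60.0


# Hypothetical inputs: 30-minute checkpoints and a 15-minute restart.
# Only the 7.9-hour MTTF comes from the text above.
estimate = lost_hours_per_day(
    mttf_hours=7.9, checkpoint_interval_min=30.0, restart_overhead_min=15.0
)
print(f"Estimated lost time: {estimate:.2f} hours per day")
```

Longer checkpoint intervals or slower restarts raise the estimate proportionally, which is why restart-driven losses grow quickly as mean time to failure shrinks.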

TorchPass tackles this problem by resolving costly AI workload failures before the job stops or needs to restart. For enterprises running large AI workloads and for AI clouds alike, TorchPass dramatically improves workload reliability and cluster utilization. AI clouds can now address impacted GPUs while preserving the training run as planned, which translates into better customer SLAs and stronger overall economics, improving their ability to protect margin and deliver new models sooner.

“Managing compute output across large-scale GPU clusters is vital to ensuring we’re delivering reliable capacity to our customers. By using TorchPass we have the support of a company that focuses on resilience like it is a core business function: it replaces any specific failing GPU and keeps the rest of the job moving, rather than making one small problem impact our large-scale operations,” said David Power, CTO of Nscale. “In our evaluation, Live GPU Migration preserved both run continuity and throughput under real fault conditions, which is exactly what you need to deliver predictable time-to-train and a better customer experience at scale.”

How Live GPU Migration Works: Reliability Without Restart

TorchPass performs transparent, in-flight migration of impacted training ranks to spare resources when failures occur. TorchPass typically completes recovery in approximately three minutes while the training process continues uninterrupted.

It supports resilience across three failure scenarios:

  • Unplanned migration, handling sudden events such as kernel crashes, power failures, or GPU faults by reconstructing state from healthy replicas

  • Pre-emptive migration, triggered by early warning signals such as rising temperatures or ECC memory errors, enabling controlled migration before a hard failure

  • Planned migration, enabling maintenance, patching, and workload rebalancing without interrupting training
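
The three scenarios above can be sketched as a simple lookup from trigger to migration type. This is purely illustrative: the event labels and the `classify` function are hypothetical names invented for this example and do not represent the TorchPass API.

```python
from enum import Enum


class MigrationKind(Enum):
    UNPLANNED = "unplanned"
    PREEMPTIVE = "pre-emptive"
    PLANNED = "planned"


# Hypothetical event labels, grouped as in the list above.
EVENT_TO_KIND = {
    "kernel_crash": MigrationKind.UNPLANNED,
    "power_failure": MigrationKind.UNPLANNED,
    "gpu_fault": MigrationKind.UNPLANNED,
    "rising_temperature": MigrationKind.PREEMPTIVE,
    "ecc_memory_error": MigrationKind.PREEMPTIVE,
    "maintenance": MigrationKind.PLANNED,
    "patching": MigrationKind.PLANNED,
    "workload_rebalancing": MigrationKind.PLANNED,
}


def classify(event: str) -> MigrationKind:
    """Return the migration scenario for a known event label."""
    try:
        return EVENT_TO_KIND[event]
    except KeyError:
        raise ValueError(f"unknown event: {event!r}") from None


print(classify("ecc_memory_error").value)
```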

This approach reduces wasted training progress by 95%, cutting lost time from approximately three hours per day to under ten minutes in a 1,024-GPU cluster.
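
As a quick consistency check on those figures, the sketch below applies a 95% reduction to a three-hour daily baseline. The function name is hypothetical; the inputs are the numbers stated above.

```python
def remaining_lost_minutes(baseline_minutes: int, reduction_percent: int) -> float:
    """Lost time that remains after cutting a baseline by a given percentage."""
    return baseline_minutes * (100 - reduction_percent) / 100


baseline = 3 * 60  # approximately three hours per day, in minutes
print(remaining_lost_minutes(baseline, 95))  # prints 9.0
```

Nine minutes of remaining lost time per day matches the "under ten minutes" figure.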

Jordan Nanos, Member of Technical Staff and lead author of ClusterMAX, SemiAnalysis’ independent benchmark for large-scale AI training, stress-tested Clockwork.io TorchPass and found it delivered leading performance and efficiency for large-scale distributed training, enabling users to reduce checkpointing overhead in training. He shared the following results:

“In our testing, Clockwork.io TorchPass delivered the fastest and most efficient fault-tolerant performance for a gpt-oss-120B training run. We used TorchTitan on a Kubernetes cluster with 64x H200 GPUs. During our testing we measured job completion time (JCT) and Model FLOPs Utilization (MFU) against a standard approach (checkpoint-restart) and the leading open-source fault-tolerant training framework (TorchFT). We simulated multiple hardware failures on the cluster in order to stress test the fault-tolerant training frameworks.

When compared to checkpoint-restart, TorchPass was significantly faster to recover from failures. This reduced overall JCT and maintained high MFU. And when compared to TorchFT, TorchPass had a significantly higher MFU. This reduced overall JCT while also maintaining an equal time to recover from failures.

Using TorchPass also has a downstream effect where it provides users with an opportunity to reduce or even remove checkpointing from their training code. This means larger effective batch sizes, lower risk of out-of-memory errors (OOMs), and less time spent thinking about storage. For a research organization, this can ultimately mean a faster time to reach their training objective,” concluded Nanos.
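
For readers unfamiliar with the metric, Model FLOPs Utilization is commonly defined as the model FLOPs a run actually achieves per second divided by the theoretical peak of the hardware. The sketch below shows that ratio under this common definition; every number in the example is a hypothetical placeholder, not a measurement from the SemiAnalysis test or a hardware specification.

```python
def mfu(
    model_flops_per_token: float,
    tokens_per_second: float,
    num_gpus: int,
    peak_flops_per_gpu: float,
) -> float:
    """Model FLOPs Utilization: achieved model FLOPs/s over cluster peak FLOPs/s."""
    achieved = model_flops_per_token * tokens_per_second
    peak = num_gpus * peak_flops_per_gpu
    return achieved / peak


# Placeholder inputs for illustration only.
example = mfu(
    model_flops_per_token=6e10,
    tokens_per_second=4e5,
    num_gpus=64,
    peak_flops_per_gpu=1e15,
)
print(f"MFU: {example:.1%}")
```

Because time spent stalled or recomputing lost work lowers tokens per second without changing peak FLOPs, interruptions pull MFU down directly.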

Measurable Business Impact from Software-Driven Fault-Tolerance

For customers operating large AI clusters, the impact is immediate and measurable. In a typical 2,048-GPU H200 deployment, TorchPass Workload Fault Tolerance delivers over $6 million in annual savings by preventing wasted compute.

These savings come from eliminating hundreds of thousands of GPU-hours that would otherwise be lost to failure-driven restarts, cascading retries, and idle recovery time. By keeping training jobs running through infrastructure faults instead of restarting them, TorchPass converts lost GPU time into productive training, significantly improving the return on GPU investments that today often operate at just 30-50% of theoretical performance.

Enabling the Next Generation of AI Infrastructure

By making reliability a software-defined capability rather than a hardware constraint, TorchPass provides the operational confidence required to deploy next-generation, tightly coupled systems such as NVIDIA GB200 and GB300 NVL72 and future rack-scale systems, where dense architectures amplify the cost of even small failures.

TorchPass builds on Clockwork.io’s prior release of Network Fault Tolerance, which applies the same Software-Driven AI Fabric principles to network resilience by transparently rerouting traffic around link failures.

Together, these capabilities form Clockwork.io’s Software-Driven AI Fabric, a vendor-neutral software layer spanning network, compute, and storage. As modern AI workloads run on tightly coupled clusters where hundreds or thousands of processors must operate in coordinated lockstep, infrastructure behaves as a single system, where reliability and performance directly determine overall efficiency. By managing this complexity in software, Clockwork.io enables operators to run heterogeneous AI infrastructure as a unified platform, maintaining high utilization, predictable performance, and resilience while preserving the flexibility to evolve hardware and improve the economics of large-scale AI deployments.

To learn more about the launch of TorchPass, visit the Clockwork.io team in person at NVIDIA GTC from March 16-19, Booth #205, or visit https://clockwork.io.

About Clockwork.io
Clockwork.io pioneers Software-Driven AI Fabrics™, delivering a programmable software layer that makes large-scale AI clusters observable, deterministic, and resilient by design to drive continuous workload progress and peak cluster utilization. Its FleetIQ platform enables enterprises to train, deploy, and serve the world’s most demanding AI workloads faster, more reliably, and at lower cost. Companies including Uber, Wells Fargo, DCAI, Nebius, Nscale, and White Fiber trust Clockwork.io to power their AI infrastructure. Learn more at www.clockwork.io.

Media Contact
Dana Trismen
clockwork@unshakablemarketinggroup.com
650-269-7478

SOURCE: Clockwork

View the original press release on ACCESS Newswire

