White Paper

Checkpointing for Resiliency and Performance in AI Pipelines

Today’s organizations are increasingly tasked with bringing AI technologies to market quickly and predictably. To succeed, they need a high-performance, reliable architecture supporting their AI initiatives. Beyond infrastructure alone, checkpointing has become the dominant technique for maintaining resiliency in AI/ML. WEKA's high-performance checkpointing at any model size allows checkpoints to be taken more frequently during model training. Because training resumes from the most recent checkpoint, more frequent checkpoints mean less work is lost and training restarts faster when a failure occurs. The result is less impact on GPU utilization, less training downtime, and less disruption to model developers and data scientists.