September 27-30, 2021
Seattle, Washington, USA + Virtual

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for Open Source Summit + Embedded Linux Conference + OSPOCon 2021 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

This schedule is displayed in Pacific Daylight Time (UTC -7). The schedule is subject to change.

IMPORTANT NOTE: Timing of sessions and room locations are subject to change.

Wednesday, September 29 • 2:45pm - 3:35pm
(IN-PERSON) Efficient Data Parallel Distributed Training with Flyte, Spark & Horovod - Ketan Umare & Katrina Rogan, Union.ai

TensorFlow and PyTorch have become the de facto tools for building deep learning models. These frameworks provide abstractions to build the networks and manage the data transport from disk to GPUs. In real-world applications, a single machine (with one or more GPUs) is enough to train the model itself; in practice, however, the amount of available data greatly outweighs the size of the model. Data-parallel training scales model training to multiple machines, often requiring weights to be shared after each batch of training data. Horovod (an LF AI & Data project) simplifies synchronization across multiple machines. Spark, on the other hand, is used to extract, preprocess, and partition the data. Chaining these pieces together is complex and requires managing separate infrastructure for each framework. Flyte (an LF AI & Data incubating project) provides an innovative and user-friendly way to manage each of these pieces. Flyte simplifies the process of connecting the data output of a Spark transform to a training system using Horovod, while ensuring high utilization of GPUs. We will describe how we use Flyte to manage and orchestrate data-parallel training workflows end to end. We will discuss the benefits and challenges of integrating distributed data science tools in a cost-efficient and developer-friendly way.

Ketan Umare

Chief Software Architect, Union.ai
Ketan Umare is the TSC Chair for Flyte (incubating under LF AI & Data). He is also currently the Chief Software Architect at Union.ai. Previously, he held multiple senior lead roles at Lyft, Oracle, and Amazon, spanning cloud, distributed storage, mapping (map making), and machine...

Katrina Rogan

Software Engineer, Union.ai
Katrina is a software engineer who previously worked at Lyft and Google. She has experience working on data pipelines for mapping, travel search, and ad performance reporting.

Wednesday September 29, 2021 2:45pm - 3:35pm PDT
Room 502