September 27-30, 2021
Seattle, Washington, USA + Virtual



IMPORTANT NOTE: Timing of sessions and room locations are subject to change.

Monday, September 27 • 4:50pm - 5:40pm
(IN-PERSON) Enforcing Data Quality in Data Processing and ML Pipelines with Flyte and Pandera - Niels Bantilan, union.ai


Pandas has become one of the de facto libraries for manipulating tabular data in the Python ecosystem. In recent years, several projects have emerged, such as Dask, Modin, and Koalas, that aim to reproduce the Pandas API in order to ease the learning curve for scaling data processing logic. Coupled with ML orchestration tools like Flyte, these libraries let machine learning practitioners benefit from reproducibility and data lineage tracking while using the data processing tools they are familiar with. However, as powerful as dataframes are, it is often difficult to reason about their data types and statistical properties as data is reshaped from its raw form into one that’s ready for modeling. In this session, data science and machine learning practitioners will learn how to combine Flyte’s rich type system and flexible data pipeline composition syntax with Pandera’s intuitive schema-declaration API so they can spend less time worrying about the correctness of their dataframes and more time obtaining insights and training models. This talk will first introduce Pandera, a package that provides an expressive data validation API, and then dive into a practical case study to illustrate the benefits of integrating Pandera with Flyte.


Niels Bantilan

Machine Learning Software Engineer, union.ai
Niels is a machine learning engineer and core maintainer of Flyte, an open source ML orchestration tool, and the author and maintainer of Pandera, a data testing tool for dataframes. He has a Master's in Public Health with a specialization in sociomedical science and public health informatics...

Monday September 27, 2021 4:50pm - 5:40pm PDT
Elwha B