September 27-30, 2021
Seattle, Washington, USA + Virtual

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for Open Source Summit + Embedded Linux Conference + OSPOCon 2021 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

Session times are shown in Pacific Daylight Time (UTC-7). The schedule is subject to change.

IMPORTANT NOTE: Timing of sessions and room locations are subject to change.

Monday, September 27 • 2:30pm - 3:20pm
(VIRTUAL) Simplifying Testing of Spark Applications - Megan Yow, Sobeys & Han Wang, Lyft

Data practitioners use distributed computing frameworks such as Apache Spark to work with big data. One of the major pain points of Apache Spark is testability. To test even a simple code change, users have to spin up a local PySpark instance, which takes a few minutes. Some users instead submit jobs to a cluster just to test code. Even worse, libraries such as databricks-connect forward all local Spark code to a cluster for execution, so even simple tests need a running Spark cluster. This makes projects very expensive, both in wasted developer time and in unneeded cluster usage. The lack of testability also slows development cycles, which is especially costly for machine learning applications, where rapid iteration is needed to reach optimum performance.

In this demo, we introduce a library called Fugue that serves as an abstraction layer for distributed compute frameworks. Users write code in native Python or Pandas and then port it to Spark or Dask at execution time. This lets users test code much faster and free of Spark dependencies. When ready to run on a cluster, users just need to specify the execution engine (for example, Pandas or Spark). Fugue dramatically speeds up development cycles and makes data projects cheaper.
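The idea of keeping logic in native Python and choosing the engine only at execution time can be illustrated with a minimal, standard-library-only sketch. This is not Fugue's API: the `run` dispatcher, the `ENGINES` registry, and the `add_fare_per_km` function are hypothetical names invented for illustration, and only a local "native" engine is implemented. The point is that the business logic never imports Spark, so it can be tested in milliseconds.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, float]
Transform = Callable[[Iterable[Row]], List[Row]]


def add_fare_per_km(rows: Iterable[Row]) -> List[Row]:
    """Business logic in plain Python (hypothetical example, no Spark import).

    Drops rows with a non-positive distance and adds a derived column.
    """
    return [
        {**row, "fare_per_km": row["fare"] / row["km"]}
        for row in rows
        if row["km"] > 0
    ]


def _run_native(fn: Transform, rows: List[Row]) -> List[Row]:
    # Local engine: simply call the function in the current process.
    return fn(rows)


# Hypothetical engine registry. A real abstraction layer would register
# distributed backends here; this sketch only provides the local one.
ENGINES: Dict[str, Callable[[Transform, List[Row]], List[Row]]] = {
    "native": _run_native,
}


def run(fn: Transform, rows: Iterable[Row], engine: str = "native") -> List[Row]:
    """Execute fn on rows using the engine chosen at call time."""
    if engine not in ENGINES:
        raise ValueError(f"unknown engine: {engine!r}")
    return ENGINES[engine](fn, list(rows))


# A unit test needs no cluster: it runs the same logic on the local engine.
sample = [{"fare": 10.0, "km": 4.0}, {"fare": 3.0, "km": 0.0}]
result = run(add_fare_per_km, sample, engine="native")
```

Because the engine is only a string argument to the hypothetical `run` helper, the test suite and the production job would share the same `add_fare_per_km` function; only the call site would change when moving to a cluster.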

Han Wang

Machine Learning Engineer, Lyft
Han Wang is a staff Machine Learning Engineer at Lyft and author of the Fugue package.

Megan Yow

Data Scientist, Spotify
Megan Yow is a Data Scientist at Spotify. She is a contributor to Fugue, an abstraction layer that keeps your code and computation native to Python yet easily portable to Spark clusters.

Monday September 27, 2021 2:30pm - 3:20pm PDT
MeetingPlay Platform + Virtual Learning Lab