Why we’re backing Metaflow

We believe Metaflow is the best tool for implementing complex Data Science workloads with ease while keeping the rigour of Software Engineering structure and validation. Today, we explain why.

Robbie Anderson
4 min read · Apr 28, 2022

In previous articles, we’ve discussed some of the benefits of using the Aero Platform. These benefits, however, wouldn’t be possible without the core underlying technologies we use. Today, we’re going to walk through Metaflow, the data orchestration library at the heart of Aero, which we believe is going to be the next big thing in data science (why else would we use it!).

Metaflow was open-sourced by Netflix in late 2019 and is now maintained in tandem by Netflix and Outerbounds. It uses a DAG model for code orchestration, similar to Airflow, but is designed to be human-centric: all the tools required to complete a task are part of the platform itself, not bolt-on extras. The Metaflow team maintain extensive documentation and guides for using the platform.

You don’t need to add extensions for core functionality

Out of the box, Metaflow provides:

  • Seamless metadata storage to cache data between tasks and after completion
  • Dependency management through Conda (and Mamba)
  • A rich UI for tracking jobs
  • Functionality for creating branches of the same workflow, allowing integration into MLOps processes
  • Retry logic for tasks
  • Scheduling for workflows

And this is on top of the core aspects of running tasks either locally or on AWS. On other platforms such as Airflow or Luigi, you would need to bolt on many of these features yourself or develop custom solutions to fit your needs.
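To make that concrete, here is a minimal sketch of what several of these built-in features look like in a single flow. The flow and step names are our own invention; the @schedule and @retry decorators, and the artifact mechanism (anything assigned to self is persisted), are standard Metaflow.

```python
from metaflow import FlowSpec, step, retry, schedule

@schedule(daily=True)  # built-in scheduling: run once a day when deployed
class NightlyCleanFlow(FlowSpec):
    """A hypothetical flow showing Metaflow's out-of-the-box features."""

    @step
    def start(self):
        # Anything assigned to self becomes an artifact: cached between
        # steps and still queryable after the run completes.
        self.raw_rows = list(range(100))
        self.next(self.clean)

    @retry(times=3)  # built-in retry logic for flaky steps
    @step
    def clean(self):
        self.clean_rows = [r for r in self.raw_rows if r % 2 == 0]
        self.next(self.end)

    @step
    def end(self):
        print(f"kept {len(self.clean_rows)} rows")

if __name__ == "__main__":
    NightlyCleanFlow()
```

After a run, those artifacts can be read back through the Client API, e.g. Flow("NightlyCleanFlow").latest_successful_run.data.clean_rows, which is how the metadata storage in the first bullet surfaces to the user.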

It follows tried and tested software practices

Almost all data orchestration platforms use a scripting approach to defining workflows: a workflow is a collection of functions that are then gathered into a list. As a classically trained software engineer, I find this difficult to understand; concepts such as abstraction have aided developers in building complex systems for years, so why aren’t they applied here?

Metaflow turned this on its head, opting for a class-based definition of its workflows that conforms to more traditional software development, and I love it. I think it makes workflow definitions far clearer and encourages further abstraction while developing.

It enforces other software engineering principles, such as version pinning to ensure no unexpected updates break your codebase.
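As a small illustration, dependencies can be pinned per flow or per step with Metaflow’s Conda decorators; the version numbers below are placeholders, not recommendations.

```python
from metaflow import FlowSpec, step, conda, conda_base

@conda_base(python="3.9.16")  # pin the interpreter for every step
class PinnedFlow(FlowSpec):

    @conda(libraries={"pandas": "1.5.3"})  # exact version, no surprise upgrades
    @step
    def start(self):
        import pandas as pd  # resolved inside the pinned environment
        print(pd.__version__)
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    PinnedFlow()
```

Run with python pinned_flow.py --environment=conda run and every step executes in the same reproducible environment, locally or on AWS Batch.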

It’s easily extensible

This one is a bit harder to swallow for the average data scientist, but bear with me. Metaflow uses standardised AWS services to deliver most of its functionality: compute is provided by AWS Batch, event management and triggering by EventBridge, and scheduled executions by AWS Step Functions. This allows a developer to hook into these systems and add further functionality. Within my first week of using Metaflow, I’d written a Lambda Function which received events from AWS Batch and updated an internal dashboard showing the state of the executing workflows. That extensibility just isn’t possible on any other platform without modifying its source code.
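As a rough sketch of the kind of hook described above (the dashboard endpoint is hypothetical; the event shape is EventBridge’s standard “Batch Job State Change” detail), a Lambda handler for that pattern might look like this:

```python
import json
import urllib.request

DASHBOARD_URL = "https://dashboard.example.com/api/jobs"  # hypothetical endpoint

def handler(event, context):
    """Invoked by an EventBridge rule matching source=aws.batch,
    detail-type='Batch Job State Change'."""
    detail = event["detail"]
    payload = json.dumps({
        "job_id": detail["jobId"],
        "job_name": detail["jobName"],
        "status": detail["status"],  # e.g. RUNNING, SUCCEEDED, FAILED
    }).encode()
    req = urllib.request.Request(
        DASHBOARD_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)  # push the state change to the dashboard
```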

For us at Aero, this was crucial in getting up and running quickly with Metaflow, then adding further functionality and features down the line without having to understand a massive codebase.

The downsides

Infrastructure management

All this extensibility comes at a cost, however: Metaflow requires developers to maintain their own AWS infrastructure, something which many are not willing or able to do. This is where Aero comes in; we think Metaflow is the future, and we want to support as many people using it as possible.

Aero removes the requirement for managing infrastructure, allowing anyone, regardless of cloud expertise, to leverage the power of the cloud.

Reusability of classes

Other frameworks, such as Luigi and Prefect, allow the reuse of Tasks between different workflows, which can be a great way to minimise code duplication. This isn’t directly supported by Metaflow, but it can be achieved by applying software abstraction techniques and moving key logic into an external class or library, as sketched below.
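A hedged sketch of that workaround, with illustrative names: pull the shared logic into a plain Python module, then import it from as many flows as needed.

```python
# cleaning.py -- shared logic, importable by any flow
def drop_nulls(rows):
    """Remove records containing missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]
```

```python
# flow_a.py -- one of possibly many flows reusing the same logic
from metaflow import FlowSpec, step

from cleaning import drop_nulls

class FlowA(FlowSpec):

    @step
    def start(self):
        self.rows = drop_nulls([{"x": 1}, {"x": None}])
        self.next(self.end)

    @step
    def end(self):
        print(self.rows)  # [{'x': 1}]

if __name__ == "__main__":
    FlowA()
```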

At Aero, we recognise this issue is bigger than just code re-use, and we have a solution. In future updates, we are looking to support combining workflows to create a seamless pipeline of multiple flows. This would allow users to consume data from a single data-cleaning flow, for example, reducing code complexity and making monitoring easier. We will discuss this further in future blog posts exploring microservices within Data Science and Aero.

Conclusion

We think Metaflow is an amazing tool with considerable potential: it is easy to use while still providing the core features needed to implement complex workflows. We intend to make this process even smoother with Aero, removing the infrastructure burden and building in new features to enhance functionality. If there are any new features you’d like to see, tweet at @aeroplatform or follow us on Medium for more!

