Robbie Anderson
7 min read · Apr 25, 2022

There is no best Data Orchestration Platform. Part 1: From an Infrastructure Perspective

From Twilio Segment to Kubernetes, there is a mish-mash of tools to help with Data Orchestration. Let's define a set of criteria for choosing the best one for your use case.

Photo by Lachlan Donald on Unsplash

There is now a wealth of data orchestration tools, from no-code solutions all the way through to integrations with cutting-edge computation systems, and this is great. It allows you to pick exactly what you need, when you need it. It does pose a problem, though: how can you pick the right tool?

In this article, we're going to begin to evaluate the most popular platforms based on their infrastructure capabilities, and more specifically how to pick the right one for your use case. All too often, we get caught up in which platform is "best" without considering which is the right one to use. This often leads to overly complex, overly bloated systems being deployed and maintained without any real need for most of their features.

Following the principles of Occam’s razor, we’re going to try to find the simplest platform which can suit your use case, so let’s get going!

What is my use case?

Firstly, you need to decide on a use case. This can be an overarching prediction of what you intend to do, or it can be the current requirements of a project.

Getting Data

The first step in any Data Science or Machine Learning workload is obtaining data. It doesn't matter if you're building an ETL system or training models, this data still needs to be ingested into the workflow. So let's go through the fundamental questions to answer here. As most of the providers we're discussing today implement some method of code execution, there is little difference in utilising the supported services. Instead, we'll focus on finding the best tool to execute the tasks you need on the data.

Is it a custom data source?

A custom data source is one that is implemented directly by a company for its own use, as opposed to generic sources such as Google Analytics or Twitter feeds. This is important as custom sources are less likely to have support from the major no-code providers (Fivetran, Twilio Segment), so you'd have to roll your own connector instead. It's worth checking if your source has existing connectors that could reduce the burden.
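To make that concrete, rolling your own connector is usually just a small extraction function that whichever orchestrator you pick wraps as a task. The sketch below assumes a hypothetical paginated internal REST API (the /v1/orders endpoint and its bearer-token auth are made up); it only shows the general shape such a connector takes.

```python
import requests


def extract_orders(api_base: str, api_key: str) -> list[dict]:
    """Pull records from a hypothetical internal 'orders' endpoint,
    following simple page-based pagination until the API is exhausted."""
    records, page = [], 1
    while True:
        response = requests.get(
            f"{api_base}/v1/orders",  # hypothetical endpoint
            headers={"Authorization": f"Bearer {api_key}"},
            params={"page": page, "page_size": 500},
            timeout=30,
        )
        response.raise_for_status()
        batch = response.json().get("results", [])
        if not batch:
            return records
        records.extend(batch)
        page += 1
```

An Airflow operator, a Prefect task or a Metaflow step could all wrap a function like this, which is why the custom-source question matters more than the choice of orchestrator itself.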

Where can the data be accessed from?

An often overlooked requirement! If your company ensures lots of its data can only be accessed internally, you will need to make sure any tools you choose support that. This mostly eliminates hosted services such as Fivetran or (depending on setup) cloud-provider integrations.

Are you processing a data stream?

Data doesn't have to be at rest to be processed. You could have a stream of data from an IoT sensor that you want to perform operations over. This source model isn't supported by any of the major open-source orchestration platforms and is only implemented in Databricks using the Apache Spark Structured Streaming API.
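For context, here is a minimal sketch of what a streaming job looks like with the Spark Structured Streaming API. It uses Spark's built-in rate source instead of a real IoT feed and writes to the console, purely to show the shape of the code; a production job on Databricks would read from something like Kafka or an IoT hub and write to a proper sink.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sensor-stream").getOrCreate()

# The built-in "rate" source generates (timestamp, value) rows;
# a real job would read from Kafka, Kinesis or an IoT hub instead.
stream = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# A simple windowed aggregation over the incoming stream.
averages = (
    stream.withWatermark("timestamp", "1 minute")
    .groupBy(F.window("timestamp", "30 seconds"))
    .agg(F.avg("value").alias("avg_value"))
)

# Write results to the console; in production this would be a sink
# such as Delta Lake or a message queue.
query = averages.writeStream.outputMode("update").format("console").start()
query.awaitTermination()
```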

Compute + Orchestration

Probably the most discussed, and arguably the least important, aspect is how your workloads will be run. This ranges from the orchestration method (how your workload is organised) to how it is executed on a range of hardware. While there are many differences in how these platforms define workflows, we will not look into that here. Instead, let's focus on how the compute requirements and the organisational situation can affect the platform we ultimately select.

Do you need to control scaling or latency?

Compute requirements hinge on two main factors: scaling and latency. You need to understand to what extent your workload is affected by each.

Scaling can be a killer feature, often drastically reducing costs. However, it can also lead to unnecessary complexity and increased latency. If you have spiky workloads (i.e. lots of jobs submitted at once), then scaling is a key feature, but if you have a consistent workload, it may be cheaper to provision a fixed amount of compute.

Tools such as Prefect allow some scaling, but it has to be added manually, which can be frustrating. Metaflow, Luigi and Kubeflow can all be deployed onto autoscaling clusters, which set compute levels according to demand, saving money in the long run at the cost of some system latency.
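As an illustration of what "added manually" means, here is a rough Prefect 1.x sketch (the API current at the time of writing): parallelism only appears once you explicitly attach an executor, and pointing DaskExecutor at a remote cluster means provisioning and scaling that cluster yourself. The flow itself is a toy.

```python
from prefect import Flow, task
from prefect.executors import LocalDaskExecutor


@task
def fetch_partitions():
    # Hypothetical list of partitions to process in parallel.
    return ["2022-04-01", "2022-04-02", "2022-04-03"]


@task
def process(partition):
    print(f"processing {partition}")
    return partition


# Parallelism is opted into by attaching an executor; swapping in
# DaskExecutor(address="tcp://...") would push work onto an external
# Dask cluster that you have to provision and scale yourself.
with Flow("partitioned-etl", executor=LocalDaskExecutor()) as flow:
    partitions = fetch_partitions()
    process.map(partitions)

if __name__ == "__main__":
    flow.run()
```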

If scaling or latency control isn't required, all of the DAG-based tools can be run locally, which could be optimal for many workloads.

Do you need customisable workflows?

If you only need to do simple transformations, then your workflows will look considerably different from those of someone implementing a large-scale training pipeline for an ML model. There are now many tools, all of which orchestrate jobs in different ways, but they broadly fall into two categories: tools such as Metaflow, Airflow, Luigi and Dagster build directed acyclic graphs (DAGs), while tools such as Prefect and notebooks allow singular operations without defining a structure up front.
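To give a feel for the DAG-based style, below is a toy Metaflow flow: each @step is a node and self.next() defines the edges, so the workflow's structure is explicit before anything runs. It's a sketch, not a realistic pipeline.

```python
from metaflow import FlowSpec, step


class ToyTrainingFlow(FlowSpec):
    """A tiny linear DAG: start -> transform -> end."""

    @step
    def start(self):
        # In a real flow this would load data from your source of choice.
        self.records = [1, 2, 3, 4]
        self.next(self.transform)

    @step
    def transform(self):
        self.features = [r * 2 for r in self.records]
        self.next(self.end)

    @step
    def end(self):
        print(f"produced {len(self.features)} features")


if __name__ == "__main__":
    ToyTrainingFlow()
```

Airflow, Luigi and Dagster express the same idea with different syntax, but the explicit graph is always there.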

When selecting an orchestration tool, consider that DAGs are a great way to both visualise and understand a workflow, but they're pointless if you just want to perform one operation. Similarly, if you have multiple interleaving workflows, then being able to represent them easily as DAGs makes them far easier to digest.

If you're just performing simple transformations, then Fivetran or Twilio Segment may be the way to go. Anything more complex, and you may be out of luck.

Do you want a managed platform?

Large tech companies can afford to pay teams of people to keep any underlying infrastructure going. This allows developers to worry less about the infrastructure below them and focus on adding value. If you're not in this boat, you need to think about what tools are realistically available to you, and whether you'd rather hand over infrastructure responsibility to a third party instead. Using externally-hosted tools can be a great way to get off the ground quickly, but can often end up being more expensive in the years to come.

Using tools provided by cloud providers (SageMaker, Step Functions and Cloud Composer) could be a good middle ground, providing the stability we've come to expect whilst allowing some freedom in implementation choices, though these will still require considerable development time to get up and running.

Hosting

This is the most overlooked aspect of Machine Learning and Data Science workflows. We all get caught up in developing something new and exciting, without stopping to consider how it will actually end up adding value. The best model in the world is no good if it can't be utilised.

How will the workflow be deployed?

MLOps is an often-used, rarely understood term describing how a Machine Learning model makes its way from development into production using a strict deployment flow. This CI/CD model can be effective at finding bugs, spotting regressions and reducing complexity for developers, but it comes at a cost: MLOps is hard.

For any project in production, you need to decide the risk you're willing to accept. These decisions can then govern what you want to implement. If it's a "full" MLOps pipeline, then git-based tools (such as Metaflow, Prefect or Dagster) are a great choice, as they allow you to use existing CI/CD systems such as Jenkins or Atlassian Bamboo to trigger the remote execution of a workflow.
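As a rough sketch of what "trigger the remote execution" can look like in practice, a CI job in Jenkins or Bamboo checks out the repository and invokes the orchestrator's CLI once tests pass. The snippet below assumes a Metaflow flow file called training_flow.py (the file name and the choice to shell out from a Python helper are illustrative); Prefect and Dagster have their own CLI equivalents.

```python
import subprocess
import sys


def deploy_and_run(flow_file: str = "training_flow.py") -> None:
    """What a CI/CD step might do after tests pass: kick off the flow.

    Metaflow flows are executed via `python <flow>.py run`; other
    git-based tools have their own CLI equivalents.
    """
    result = subprocess.run(
        [sys.executable, flow_file, "run"],
        check=False,
    )
    if result.returncode != 0:
        sys.exit(result.returncode)


if __name__ == "__main__":
    deploy_and_run()
```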

However, if you don't have any such requirements, notebook-based tools are often the easiest to develop with, and therefore achieve value faster.

Is the workflow business-critical?

If you're powering an internal dashboard that, say, predicts daily usage of a system, then you may not need the rigour of a production system, as the damage caused by any bugs would be small. However, if this is a workflow that supports a critical function of your business, it is important to ensure minimal downtime. In a production system you need reliability and stability, so pick a battle-hardened tool such as Airflow or Luigi. These tools have better UIs for managing production-scheduled workflows and larger community support, at the cost of a clunkier user experience.
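What "battle-hardened" buys you is mostly operational plumbing: schedules, retries, alerting and a UI to watch it all. The Airflow sketch below shows the kind of defaults a production DAG typically carries; the task body, alert address and schedule are placeholders.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def refresh_dashboard_data():
    # Placeholder for the real extract/transform logic.
    print("refreshing data")


default_args = {
    "owner": "data-platform",
    "retries": 3,                          # retry transient failures
    "retry_delay": timedelta(minutes=5),
    "email_on_failure": True,              # alert someone when it breaks
    "email": ["data-alerts@example.com"],  # hypothetical alert address
}

with DAG(
    dag_id="daily_usage_forecast",
    default_args=default_args,
    start_date=datetime(2022, 1, 1),
    schedule_interval="0 6 * * *",         # every day at 06:00
    catchup=False,
) as dag:
    refresh = PythonOperator(
        task_id="refresh_dashboard_data",
        python_callable=refresh_dashboard_data,
    )
```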

Metaflow is also an interesting one to consider here. While we’ve placed it on the non-critical path, it allows developers to easily attach custom components using standard AWS services. While this isn’t out of the box, it does allow for a level of customisation to fit a use case. It’s something I’m planning on covering in future blog posts, so follow for more!
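Without spoiling that future post, the flavour of this customisation is decorator-based. A hedged sketch of pushing a single heavy step onto AWS Batch looks roughly like the following (it assumes Metaflow has been configured against your AWS account, and the resource numbers are illustrative).

```python
from metaflow import FlowSpec, batch, step


class HybridFlow(FlowSpec):
    @step
    def start(self):
        self.data = list(range(10))
        self.next(self.train)

    # Only this step is sent to AWS Batch; everything else runs locally.
    @batch(cpu=4, memory=16000)
    @step
    def train(self):
        self.result = sum(self.data)  # stand-in for real training work
        self.next(self.end)

    @step
    def end(self):
        print(self.result)


if __name__ == "__main__":
    HybridFlow()
```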

Fully-managed services, such as Databricks or cloud provider offerings, are also a good option, as they provide direct support for any issues, which can give peace of mind to a business looking to pick a platform.

Kubernetes-based tools may also be useful, as they might tie into existing knowledge in your business, allowing smoother production operation. Tools such as Kubeflow or Argo could be perfect.

Conclusion

There is a wealth of tools that can provide orchestration services, and they come in all shapes and sizes. Today we've begun outlining a checklist for selecting a new service, but this is not a complete list. If you have more items to add, please reach out in the comments or directly on Medium; I'd love to hear what you think!

Robbie Anderson

Building new Data Platforms with Aero & Senior Software Engineer at Tumelo.