Machine Learning is hard. Make it easier with Aero

Deploy, manage and scale Data Science & Machine Learning workflows on a reliable, easy-to-adopt platform, with no infrastructure to manage.

Robbie Anderson
5 min read · Feb 21, 2022
Photo by Kelvin Ang on Unsplash

We’ve released Aero! Our free trial is now available; please visit here to follow along.

Building software is hard. Building software that is scalable, maintainable, and reliable is even harder. It is estimated that 41 cents of every dollar spent on software is wasted [1]. That’s a huge amount, and even with all the work that’s gone into improving software projects, they are still inefficient at delivering outcomes. Why is this? Software is intricate: 10-year-old bugs can surface at the slightest change, wreaking havoc on running systems. A great example appeared at the turn of the year, when Microsoft Exchange hit a bug triggered by the date rolling over to 2022 [2], a bug that had been sitting undiscovered for years in one of the most used pieces of corporate software.

So, imagine taking all of the complexity of normal software development, then trying to build brand new Data Science and Machine Learning workflows on top of this intricate environment. It’s a nightmare.

Aero has set out to change that. But first, let’s see if we can identify where some of that wasted money goes in Machine Learning projects.

The resources war

As the datasets a job operates on scale up, the resources provided to that job must scale too. Otherwise, developers face countless hours debugging “Out of Memory” errors. These can be very difficult to debug, often manifesting in odd ways. Removing this burden lets developers use resources at will, ultimately saving time and increasing efficiency.

Developers wearing multiple hats

As the complexity of Machine Learning projects grows, from some analysis in a notebook to productionising workflows, more and more is being asked of developers. They are expected to act as developer, systems administrator and cloud architect, each of which draws on a near-infinite pool of knowledge. This is exemplified perfectly by the table below, which shows the most commonly required skills in job adverts for Machine Learning engineers in the UK.

The skills required for a standard machine learning engineer. This is madness!

AWS + Azure are above SQL! How is that even possible?! It’s because engineers are required to build, deploy, and manage their own environments. This requires them not only to carry out their own job but also to take on jobs they may not be qualified or trained for, or may simply not want to do.

A slow development cycle

A timeless classic, but one that’s getting more and more true for Data Science & Machine Learning workflows [3]

Need I say more here? Computers are fast, so let them do the heavy lifting. If your workflow takes 10 minutes to deploy each time you want to test it, catching each small bug takes far longer than it should. The answer is to use your local machine! Developers need an environment that can be spun up quickly on their own machine for testing, then mirrored in production down the line. This fast iteration lets developers stay in flow and maximise productivity.

The prototype/production divide

As discussed in my previous article, one of the fundamental challenges facing Machine Learning workloads is getting access to a consistent test environment to run prototypes in. Imagine you are building a simple email recommender system, which ingests some data, performs some simplistic statistics, and returns a set of predictions. You could test this locally using a small dataset and perfect it, but how do you then run it on a schedule in production? Well, you can’t on your own machine, so you push it elsewhere, such as to a server. But the workflow has never been tested in that new environment, so it now has to be verified all over again.

Development workflows like this are terrible. They require you to write code for two platforms and ensure they run in an identical manner, without any issues. As we’ve said before, software is hard — don’t make it harder on developers by adding further complexity.
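To make the email recommender example concrete, here is a minimal sketch of that kind of workflow in plain Python: ingest some data, compute simple statistics, return predictions. The sample data, the function names and the "recommend the most-clicked category" rule are all assumptions made up for illustration; none of them come from Aero or Metaflow.

```python
from collections import Counter, defaultdict

# Hypothetical click log: (user, email category) pairs.
SAMPLE_CLICKS = [
    ("ana", "sports"),
    ("ana", "sports"),
    ("ana", "tech"),
    ("ben", "travel"),
]


def ingest(rows):
    """Group raw (user, category) rows into per-user click counts."""
    counts = defaultdict(Counter)
    for user, category in rows:
        counts[user][category] += 1
    return counts


def predict(counts):
    """Recommend each user's most-clicked category (a deliberately simple statistic)."""
    return {user: clicks.most_common(1)[0][0] for user, clicks in counts.items()}


def run_workflow(rows):
    """Run the whole pipeline: ingest, then predict."""
    return predict(ingest(rows))


if __name__ == "__main__":
    print(run_workflow(SAMPLE_CLICKS))
```

Logic like this is quick to perfect on a laptop with a small dataset. The cost described above appears only when the same code has to be moved to a different environment and checked again there.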

The fix

So at the start of the article, we alluded to a new service that could help fix these problems: Aero. Aero is a platform built to take responsibility for infrastructure, security, and orchestration away from developers and allow them to focus on adding value, be that to a personal project or a company.

It is built on top of the open-source framework Metaflow [4], allowing developers to run workflows locally or via the proprietary Aero compute platform. These Python workflows can be anything, from large-scale ETL systems to a script that checks the outside windspeed; we don’t discriminate. Regardless of your cloud knowledge, you can leverage the power of the cloud using the Aero platform.

So how does Aero solve the above problems?

  • It solves the resources war by providing developers with a gateway to the powerful compute that cloud providers can supply, offering vertical and horizontal scaling of workflows as and when required.
  • Aero also allows users to focus on adding value through their code, not by maintaining and patching existing infrastructure. It’s as simple as signing in and submitting jobs with no configuration required.
  • Because Aero uses Metaflow to execute workflows, all tasks behave the same regardless of where they run, be it locally on your laptop or on the largest instance we have to offer. Any job that works as a prototype can be run or scheduled immediately and start adding value, with no alterations required. All of this allows for a seamless transition between prototype and production.

Disclaimer: I am a Director of Aero Technology, and while I am possibly biased, it’s an amazing tool that I think could revolutionise how we access compute!

References:

[1] — https://www.projectsmart.co.uk/risk-management/is-software-development-risk-costing-you-money.php

[2] — https://www.engadget.com/microsoft-exchange-year-2022-bug-fix-215225070.html

[3] — https://xkcd.com/303/

[4] — https://metaflow.org/

Robbie Anderson

Building new Data Platforms with Aero & Senior Software Engineer at Tumelo.