
From COVID to Politics — Analysing GDELT with Aero

Aero is a platform built to take responsibility for infrastructure, security and orchestration away from developers and allow them to focus on adding value, be that for a personal project or a company.

It’s built on top of the open-source Metaflow framework [1], which allows developers to run workflows locally on their machine or via our proprietary compute platform. These Python workflows can be anything, from ETL systems to a script that checks the outside wind speed — we don’t discriminate.

In our previous blog post here, we discussed many of the challenges facing Data Scientists and ML Engineers, showing some of the features Aero employs to mitigate these challenges. Today, we will go on to utilise these features in a practical example, walking you through the development of a Python workflow.

One of the classic Data Science tasks is ingesting data. Whatever you’re working on, you are likely to need supplementary data from some source. These tasks can often be cumbersome, requiring developers to host code on a remote machine, provision a database and install monitoring tools. In this walkthrough, we will show how Aero can be used as a fast and flexible platform for building Extract, Transform and Load (ETL) workflows.

The Task

Today, let’s explore how we can build ETL pipelines easily using Aero. We’ve been tasked with building a system to ingest data from the GDELT project (https://www.gdeltproject.org/), which provides a historical dataset of news articles that contain metadata such as the tone of the article and related actors. From this, we want to be able to identify which articles should be recommended to users of our super-cool new news platform, Newz.

This is the first of three parts covering this topic. Today we’ll cover:

Part 1

  • Set up an Aero account
  • Investigate the dataset
  • Build a basic Flow

Next time:

Part 2: Scaling up to the Cloud

Part 3: Scheduling our Flow

Setup

Please note that we only support Linux and Mac at this time.

Firstly, ensure you’ve created an account with us here.

In this tutorial, we will use Conda for the environment and package management. To install Conda separately, follow the instructions for your OS here: https://docs.conda.io/en/latest/miniconda.html.

To create a new environment, run the following:

conda create --name aero-tutorial python=3.8
conda activate aero-tutorial

From your Linux/Mac terminal, you can install our CLI tool — this will give you access to the platform.

pip install aeroplatform

NOTE: You’re free to use your own Python environment; just be sure to run the aero configure command to install the required dependencies.

Finally, run:

aero account login

This will provision your credentials for accessing Aero, which will expire after 24 hours. After that time, you will have to log in again.

Now, with all that out of the way, we can get started!

Getting to grips with our data

As with all data projects, it’s best to first analyse the data we will be ingesting to understand how it’s presented, what format it’s in, and how we could best read it.

The GDELT dataset is provided in a few different formats, but we will be analysing the raw dataset, which is indexed by a frequently updated .txt file, found here:

http://data.gdeltproject.org/gdeltv2/masterfilelist.txt

For our first example, let’s pick a file and evaluate it. Looking through the data, the most interesting files seem to be the .export files, so let’s pick one of those, unzip it, open it using your favourite spreadsheet application and take a look at the columns.
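If you’d prefer to pick a file programmatically, here’s a rough sketch using only the Python standard library. It assumes each line of masterfilelist.txt holds a size, a hash and a URL separated by spaces, and note that the master list is large, so it takes a moment to download:

from urllib.request import urlopen

MASTER_LIST = "http://data.gdeltproject.org/gdeltv2/masterfilelist.txt"

# Each line of the master list is assumed to look like: <size> <hash> <url>
with urlopen(MASTER_LIST) as response:
    lines = response.read().decode().splitlines()

# Keep only the .export files and print the most recent one
export_urls = [line.split()[2] for line in lines if line.endswith(".export.CSV.zip")]
print(export_urls[-1])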

As we can see, each row contains a ton of data!

For this walkthrough, let’s reduce our columns down to:

  • avgtone
  • goldsteinscale
  • sourceurl
  • actor1name
  • actor2name
  • numarticles
  • nummentions
  • numsources

So, now that we have our test dataset and know what we want from it, let’s get programming!

Let’s begin programming

To start with, we need to create the flow outline that we will expand during this tutorial. We need a class definition that inherits from Metaflow’s FlowSpec class, with two functions — start and end. These functions are represented as steps in the Metaflow state machine.

We will also need to tell Python to run this Flow, so a Python main function is required after our class definition.
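Here’s a minimal sketch of that outline; we’ve saved it as gdelt_etl.py to match the run commands below, and named the flow FetchGDELTData to match the output you’ll see shortly:

from metaflow import FlowSpec, step


class FetchGDELTData(FlowSpec):

    @step
    def start(self):
        # Every flow begins with a start step
        print("Starting the GDELT ETL flow")
        self.next(self.end)

    @step
    def end(self):
        # ...and finishes with an end step
        print("Flow complete")


if __name__ == "__main__":
    FetchGDELTData()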

Now we can run our basic flow locally, as shown below.

aero run local gdelt_etl.py

Excellent! You should see the flow execute from the start step through to the end step. Now let’s get to building that ETL system.

We plan to just pull one file for now, but it’s likely we will want to pull more data in the future — so let’s add the file URL as a parameter to the Flow. This will allow us to easily pull new datasets without editing any code. To do this, we need to add a Parameter to our flow.

We can then reference this parameter in any of our steps, and when we execute the script, Aero will require us to pass it. Alter the start step to print the parameter, as shown below, and try re-running the flow as before. What happens?
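Here’s a sketch of what that change could look like, adding a required url Parameter to the flow and printing it in the start step:

from metaflow import FlowSpec, Parameter, step


class FetchGDELTData(FlowSpec):

    # The GDELT file to ingest, supplied on the command line as --url
    url = Parameter("url",
                    help="URL of the GDELT export file to ingest",
                    required=True)

    @step
    def start(self):
        print(f"Fetching GDELT data from {self.url}")
        self.next(self.end)

    @step
    def end(self):
        print("Flow complete")


if __name__ == "__main__":
    FetchGDELTData()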

$ aero run local gdelt_etl.py
Aero executing FetchGDELTData for user:aero
Usage: aero run local gdelt_etl.py [OPTIONS]
Error: Missing option '--url'.

As we didn’t pass a parameter this time, the flow failed to execute. Alter your command to match the following:

aero run local gdelt_etl.py --args "--url http://data.gdeltproject.org/gdeltv2/20210304164500.export.CSV.zip"

Now that we can dynamically change our URL, let’s build a step to fetch the data.

The following code uses the Pandas library to pull the CSV data from the URL, uncompress it and load it into a DataFrame. Make sure you install pandas into your Conda environment before running the next step.

As there are lots of columns, let’s also create a separate file to store them. We’ve named this file gdelt_columns.py.
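Here’s a shortened sketch of what gdelt_columns.py could look like; the real file needs every GDELT export column listed in order so the names line up when we load the CSV, and the ALL_COLUMNS and REQUIRED_COLUMNS names are just ones we’ve chosen for this walkthrough:

# gdelt_columns.py

# Every column in the GDELT export format, in order (truncated here for brevity;
# the real file continues with the remaining columns)
ALL_COLUMNS = [
    "globaleventid",
    "day",
    "monthyear",
    "year",
    "fractiondate",
    "actor1code",
    "actor1name",
    # ... the rest of the GDELT export columns ...
    "sourceurl",
]

# The subset we actually want for Newz
REQUIRED_COLUMNS = [
    "avgtone",
    "goldsteinscale",
    "sourceurl",
    "actor1name",
    "actor2name",
    "numarticles",
    "nummentions",
    "numsources",
]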

Our rather long set of columns!
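And here’s a sketch of the fetch step added to our flow, assuming the gdelt_columns.py above; GDELT export files are tab-separated with no header row, so we supply the column names ourselves:

import pandas as pd
from metaflow import FlowSpec, Parameter, step

from gdelt_columns import ALL_COLUMNS


class FetchGDELTData(FlowSpec):

    url = Parameter("url",
                    help="URL of the GDELT export file to ingest",
                    required=True)

    @step
    def start(self):
        print(f"Fetching GDELT data from {self.url}")
        self.next(self.fetch_data)

    @step
    def fetch_data(self):
        # Pull the zipped, tab-separated CSV straight from the URL into a DataFrame.
        # Assigning it to self means Metaflow persists it between steps.
        self.df = pd.read_csv(self.url,
                              compression="zip",
                              sep="\t",
                              header=None,
                              names=ALL_COLUMNS)
        self.next(self.end)

    @step
    def end(self):
        print(f"Loaded {len(self.df)} rows")


if __name__ == "__main__":
    FetchGDELTData()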
Pulling the GDELT data and saving it into self.

Now that we’ve loaded our data, we need to do something with it. One of the many features of Metaflow is that it allows us to store data on the self object and then retrieve that data in later steps. This data is backed up to S3 between steps, allowing for easier debugging in the future. Let’s have a look at the data we just saved: using the code below, we can analyse the outputs of the steps from the previous execution of our Flow.
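Here’s a possible sketch of get_data.py using Metaflow’s client API, assuming the flow name and the df artifact from the earlier sketches:

# get_data.py
from metaflow import Flow

# Grab the most recent run of our flow and pull out the DataFrame
# that the fetch step saved onto self
run = Flow("FetchGDELTData").latest_run
df = run.data.df

print(df.head())
print(df.shape)

Run it with: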

python get_data.py

Let’s add a new step and perform some filtering on the dataset we just pulled — extracting all the columns we want.

To do this, we can use pandas to reduce our dataset to its required columns.
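Here’s a sketch of the complete flow as it now stands; only the filter_data step (and the self.next wiring into it) is new compared with the fetch sketch above:

import pandas as pd
from metaflow import FlowSpec, Parameter, step

from gdelt_columns import ALL_COLUMNS, REQUIRED_COLUMNS


class FetchGDELTData(FlowSpec):

    url = Parameter("url",
                    help="URL of the GDELT export file to ingest",
                    required=True)

    @step
    def start(self):
        print(f"Fetching GDELT data from {self.url}")
        self.next(self.fetch_data)

    @step
    def fetch_data(self):
        # Pull the zipped, tab-separated CSV straight from the URL into a DataFrame
        self.df = pd.read_csv(self.url,
                              compression="zip",
                              sep="\t",
                              header=None,
                              names=ALL_COLUMNS)
        self.next(self.filter_data)

    @step
    def filter_data(self):
        # Reduce the DataFrame down to just the columns we need for Newz
        self.filtered_df = self.df[REQUIRED_COLUMNS]
        self.next(self.end)

    @step
    def end(self):
        print(f"Filtered down to {len(self.filtered_df.columns)} columns")


if __name__ == "__main__":
    FetchGDELTData()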

We now have a DataFrame that contains the required columns, and we’ve saved that dataset back into S3 where it can be extracted for further analysis. That was easy!

So let’s recap what we’ve achieved:

  • Pulled a CSV from the GDELT dataset remotely
  • Parsed and loaded it into a DataFrame
  • Inspected the results using the Flow results object
  • Filtered the DataFrame to remove unnecessary columns

In doing this, we’ve not had to provision any infrastructure or push our code to any remote machine. There’s been no battle to get MySQL working, and no time spent debugging why we can’t write to DynamoDB — we’ve focused our time on solving the problem set to us.

In the next blog post, we’ll go through how Aero can push jobs to the cloud — continuing our journey of building an ETL system for GDELT data. In the meantime, visit our site for more, or email us at contact@aeroplatform.co.uk.

References

[1] — https://metaflow.org/
