
Investigating air quality & other developmental metrics
Overview
In this article I discuss a personal project that I'm in the process of building. The primary aim of this project is to refine my data engineering skills by building an end-to-end data project from scratch, while also deepening my knowledge of Apache Airflow and dbt.
My project's data question comes from personal experience. Growing up, I struggled with sinusitis, likely caused by industrial and mine-related pollution in the region where I lived. Now that I live in a different part of the country, the air quality appears to be much better.
But what does the data say? How much do empirical air quality measurements vary across cities, regions and countries, and what common trends can be observed?
My approach is as follows:
- Step 1: Build a functional MVP, with automated data flows for a single data source, covering ingestion, loading, transformation and visualisation.
- Step 2: Increase the pipeline's complexity by incorporating multiple data sources and more advanced transformations.
- Step 3: Use historical air pollution data and identified developmental metrics to train a predictive model which can be viewed via embedded visualisations or queried via an online application (ideally embedded right here, in this article).
At the time of writing, I have completed Step 1 and am working on Step 2. I will update this article as I complete Steps 2 and 3.
Tools used
- Google BigQuery
- Apache Airflow
- dbt
- Looker Studio
As mentioned above, a primary aim of this project is to deepen my understanding of Apache Airflow and dbt by building a project from scratch, while also getting to do some more work with BigQuery and Looker Studio.
Data ingestion / extraction
I selected Airflow as my ingestion tool because (1) it is a widely used, industry-standard application; (2) I wanted to get some deeper hands-on experience setting it up; and (3) it is open-source and can be self-hosted. I did briefly investigate using cloud-hosted Airflow, but found the running costs to be unjustifiably high for a small personal project like this.
Prior to this project, I had some exposure to Airflow but no hands-on experience setting it up. I have enjoyed working through the fundamentals and have learnt a lot. One downside of self-hosting an application that needs to run periodically is that my computer needs to be on whenever a run is scheduled. Two solutions I've implemented are: (1) setting the DAGs to run daily during working hours, when my machine is reliably on; and (2) making use of historical data APIs where possible, so that a missed run doesn't mean lost data. A sketch of this scheduling setup is shown below.
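Here's that sketch: a minimal DAG using Airflow's TaskFlow API. The endpoint, query parameters and task names are placeholders for illustration, not the project's actual source:

```python
from datetime import datetime

import requests
from airflow.decorators import dag, task

# Placeholder endpoint; the real source API differs.
AIR_QUALITY_API = "https://example.com/api/measurements"


@dag(
    schedule="0 10 * * *",  # daily at 10:00, while my machine is on
    start_date=datetime(2024, 1, 1),
    catchup=False,  # the historical API covers gaps, so no backfill is needed
    tags=["air-quality"],
)
def air_quality_ingestion():
    @task
    def fetch_measurements() -> list[dict]:
        # Request a rolling window of recent readings, so a missed run loses nothing.
        resp = requests.get(AIR_QUALITY_API, params={"days": 3}, timeout=30)
        resp.raise_for_status()
        return resp.json()

    @task
    def load_raw(rows: list[dict]) -> None:
        # In the real pipeline, this step writes to a BigQuery staging table.
        print(f"Fetched {len(rows)} rows")

    load_raw(fetch_measurements())


air_quality_ingestion()
```

Scheduling runs inside working hours and fetching a rolling historical window together mean the pipeline tolerates the odd day when my machine is off.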
Data warehousing
BigQuery is my go-to data warehouse for personal projects, for several reasons:
- Thanks to Google Cloud's generous free tier, BigQuery costs nothing for a small project like this, as long as you stay within the usage limits;
- It has relatively minimal setup overhead (the sketch after this list gives a sense of this);
- BigQuery is an enterprise-grade tool. Although GCP has the lowest market share of the big three cloud providers, BigQuery is still used by large organisations for demanding analytical workloads.
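As a rough illustration of that setup overhead, loading raw rows into BigQuery takes only a few lines with the official Python client. The project, dataset and table names below are illustrative, not the project's real ones:

```python
from google.cloud import bigquery

# Illustrative names; authenticate first, e.g. via `gcloud auth application-default login`.
client = bigquery.Client(project="air-quality-project")
table_id = "air-quality-project.raw.air_quality_measurements"

job_config = bigquery.LoadJobConfig(
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    autodetect=True,  # let BigQuery infer the schema for the raw layer
)

rows = [
    {"city": "Cape Town", "parameter": "pm25", "value": 8.4,
     "measured_at": "2024-01-01T10:00:00Z"},
]

# Batch load jobs are free in BigQuery; only storage and queries count against quotas.
job = client.load_table_from_json(rows, table_id, job_config=job_config)
job.result()  # wait for the load job to complete
print(f"Loaded {job.output_rows} rows into {table_id}")
```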
Data transformation
To handle transformations, I selected dbt Cloud (on its free tier). While I had prior exposure to dbt (the dbt Fundamentals course, reviewing run results daily, exploring the dbt models in my company's code repo and asking colleagues lots of detailed questions), I had no prior experience setting the tool up from scratch.
Like everyone I've met who's used dbt, I am of the view that dbt is extremely good at its job. It's a fantastic product that makes life easier for almost everyone on a data team. I particularly like how easy dbt makes it to proactively test for data errors; a small example of such a test is shown below.
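To make that concrete, here is a minimal sketch of the kind of schema tests dbt supports, declared in a model's YAML file. The model and column names are illustrative, not the project's actual models:

```yaml
# models/staging/schema.yml (illustrative names)
version: 2

models:
  - name: stg_air_quality__measurements
    description: "Cleaned air quality readings, one row per measurement."
    columns:
      - name: measurement_id
        tests:
          - unique
          - not_null
      - name: parameter
        tests:
          - accepted_values:
              values: ["pm25", "pm10", "no2", "o3", "so2", "co"]
```

Running `dbt test` then flags duplicate or missing IDs and unexpected pollutant codes before they reach the dashboards.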
BI / Visualisation
When it comes to BI tools, I have the most experience with Looker, followed by Looker Studio (formerly Google Data Studio). This is why I defaulted to using Looker Studio for this project - it's free and familiar, and the focus of this project is the pipelines. However, for Steps 2 and 3, I will likely try out another BI tool, like Tableau Public or Power BI, to challenge myself a bit further.
Video walkthrough
The video below is a quick walkthrough of Step 1 - the first functional version of this project.
I'm aiming to post a walkthrough of Step 2, the more sophisticated version which incorporates a range of data sources and transformations, in the coming week or so.
Next steps
As mentioned above, I'm currently working on Step 2, which entails incorporating a number of additional data sources and more complex data transformations. With most of the fundamental configurations already in place, I'm able to focus more on finding interesting and useful additional data sources. I'm excited to see how it comes together!
Once Step 2 is done, I'll move onto Step 3. More to come soon!