This is the fourth post in my series, Towards Open Options Chains: A Data Pipeline for Collecting Options Data at Scale. In this post, we build on our work in Part II: Foundational ETL Code and Part III: Getting Started with Airflow by converting our ETL pipeline into a Directed Acyclic Graph (DAG), which comprises the tasks and dependencies for the pipeline on Airflow.

Airflow defines DAGs as a "core concept of Airflow, collecting Tasks together, organised with dependencies and relationships to say how they should run". In practice, DAGs are Python scripts that contain the code for each step in the data pipeline, organised into a few standard code blocks. Before we dive into those blocks, it's important that we plan ahead. What's our DAG going to look like?

First, we already have our template: the ETL job we've written so far! That is, (1) query the TD Ameritrade (TDA) API, (2) process the response into the required format, and (3) load it into Postgres. However, we separate these ETL steps into distinct tasks in the DAG, because dumping all our code into one function can be problematic. Suppose that step 3 fails in the pipeline. By then, we've already retrieved data from the API and processed it, but with all three steps bundled into a single task, Airflow would re-run the entire thing.

Second, we write in the flexibility to collect data on new tickers. Consider this scenario: our DAG is currently configured to collect data on FB, and we would like to switch over to gold (GDX). Our scripts have the FB ticker hardcoded in the extract and load steps, and we don't have a GDX table yet. To resolve this, we use a TICKER variable in all our steps, and add a new task at the very beginning of the pipeline that creates a table for the ticker if it does not already exist. Then, when we change the TICKER we want to collect data on, our DAG will run without throwing errors from the relation (table) not existing in Postgres.

Therefore, the pipeline we intend to build is:

1. Create a table for the ticker in Postgres, if it does not already exist
2. Extract: query the TDA API for options data
3. Transform: process the response into the required format
4. Load: insert the data into Postgres

With the concept for the DAG defined, let's run through the code blocks. In the first block, we import the necessary libraries. datetime lets us specify dates and durations, and pendulum lets us define timezones, which are essential for scheduling the workflows at the correct time.
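To make the plan concrete, here is a minimal sketch of what such a DAG file could look like, assuming Airflow 2.x with PythonOperator. The dag_id, schedule, timezone, and the stubbed task functions are illustrative placeholders, not the series' actual code:

```python
from datetime import datetime, timedelta

import pendulum
from airflow import DAG
from airflow.operators.python import PythonOperator

TICKER = "GDX"  # switch tickers by changing this single variable


def create_table(ticker):
    """Create the options table for `ticker` in Postgres if it does not exist."""
    ...  # placeholder: CREATE TABLE IF NOT EXISTS ...


def extract(ticker):
    """Query the TDA API for the ticker's options chain."""
    ...


def transform(ticker):
    """Process the API response into the required format."""
    ...


def load(ticker):
    """Insert the processed data into Postgres."""
    ...


# pendulum supplies an explicit timezone so the schedule fires at the right local time
local_tz = pendulum.timezone("America/New_York")

with DAG(
    dag_id="options_etl",  # hypothetical name
    start_date=datetime(2022, 1, 1, tzinfo=local_tz),
    schedule_interval="@daily",
    default_args={"retries": 1, "retry_delay": timedelta(minutes=5)},
    catchup=False,
) as dag:
    t0 = PythonOperator(task_id="create_table", python_callable=create_table, op_args=[TICKER])
    t1 = PythonOperator(task_id="extract", python_callable=extract, op_args=[TICKER])
    t2 = PythonOperator(task_id="transform", python_callable=transform, op_args=[TICKER])
    t3 = PythonOperator(task_id="load", python_callable=load, op_args=[TICKER])

    # dependencies: create the table first, then run extract -> transform -> load
    t0 >> t1 >> t2 >> t3
```

Because each step is its own task, a failure in `load` lets Airflow retry just that task rather than re-querying the API and re-processing the data.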