---
title: "Architecture"
page-category: "searchable"
---
This document describes in detail the underlying architecture of Conductor, and serves as a good orientation for people wishing to contribute.

## Package Structure
Conductor consists of three packages, which we will refer to throughout this document in order to point out where specific pieces of functionality are implemented:
- [conductor](https://git.xarth.tv/ml/conductor) - core library that serves as the user's entrypoint.
- [conductor-cdk](https://git.xarth.tv/ml/conductor-cdk) - TypeScript CDK package that provisions and routes operators to AWS resources.
- [twitch-airflow-components](https://git.xarth.tv/ml/twitch-airflow-components) - standalone Airflow hooks and operators that can be used without the `conductor` packages.

![Package Structure]({{"/assets/images/package_structure.png" | relative_url}})

## DAG Lifecycle
When a user has made changes to their project and wants to deploy and test them, they run the `conductor deploy` command to build and push their project resources.
The best way to understand Conductor's architecture is to walk through what happens under the hood when the `conductor deploy` and subsequent DAG execution steps are run.

At a high level, the deploy process is as follows:
1. `project_config.py` at the root of the project is detected.
2. CDK Resources (including S3 bucket for data and ECR repository) are deployed.
3. Conductor objects defined in the `dags/` directory of the project generate Airflow DAGs and Tasks.
4. DAG objects are pickled into byte strings and uploaded to the Airflow environment.
5. Docker container is built and pushed to ECR Repository.
6. DAGs are executed by the Airflow environment.

These steps are outlined in the diagram below.
![Deploy Flow]({{"/assets/images/deploy_flow.png" | relative_url}})

The subsequent sections will dive into each of these steps in more detail.

## Project Config
Project Config is loaded in a pass-through fashion directly from the `conductor-cdk` package.
The contents of `conductor/config.py` are simply:
```py
# flake8: noqa
from conductor_cdk import AirflowConfig, EnvironmentConfig, Project, RedshiftConfig, SageMakerConfig, VPCConfig
```
Each of these classes is implemented in TypeScript and transpiled to Python via JSII.
See [Python vs Typescript CDK](https://docs.google.com/document/d/1CQN5kjtAk1wJ0Blc55429nGrjo7pk-c5bZcyxzAYvno) for a discussion on why this architectural choice was made.
The most important outcome of building the CDK in TypeScript is the ability to import CDK modules built by other teams at Twitch and leverage their functionality.

## AWS Resource Provisioning 
#### Project Level Resources
Based on the `Project` configuration above, for each project Conductor will create a set of default resources that are shared among the DAGs within the project.
The most important among these are the S3 Bucket, which is the default location where Tasks send their outputs, and the ECR Repo, which hosts the Docker container that is used to execute business logic.
Having a single S3 bucket per project makes it substantially easier to share data between the DAGs of the project.
Likewise, since container layers will be common across branches/DAGs of the same project, using one ECR repo for the project leads to substantially faster container pushes.

See [Conductor Environment and Resource Naming](https://docs.google.com/document/d/1sgnkfkgDv2IecAZ9PjitcobOTcZ-nmqN0rC8nZfHVgY) for a discussion on how these resources are named.

#### DAG Level Resources
DAG Level resources are typically infrastructural artifacts produced by DAGs, such as SageMaker Endpoints and their associated alarms.
They rely on operators to deploy the associated CDK at the correct step of the DAG, rather than being deployed as part of the `conductor deploy` CLI command.
> **Note:** This feature is under development. More details to come shortly!
{:.note}

## Conductor Operators
Conductor operators do not inherit from `airflow.models.BaseOperator`.
Rather, they pass configuration to Airflow operators from either the Airflow `provider` packages or `twitch-airflow-components`, in order to achieve specific functionality.
See [Operators]({{ "/pages/documentation/operators.html" | relative_url}}) for a list of Conductor's current operators.

Read the implementation of the [Redshift Operator](https://git.xarth.tv/ml/conductor/blob/master/conductor/operators/redshift.py) to get an idea of what typically happens when a Conductor operator is used.
Note in the return statement, copied below from the Redshift Operator, that the only object the Airflow DAG instance `self.dag` references directly is the `PostgresOperator` instance, which is provided by `twitch-airflow-components` and not by the `conductor` package.
This allows us to perform DAG serialization with a very limited number of dependencies installed on the Airflow cluster, which will be detailed in the next two sections.
```python
return RedshiftTaskWithOutputs(
    PostgresOperator(
        task_id=task_id, sql=query, connection=connection, dag=self.dag
    ),
    output_s3_prefix,
)
```

## MWAA Environment Dependencies
Updating dependencies within an MWAA environment is done by updating a `requirements.txt` file located in an S3 bucket that the cluster reads.
This process incurs a cluster restart, leading to downtime of 10-15 minutes as well as extra engineering maintenance work to pause recurring DAGs and wait for existing DAG runs to complete before upgrading.
Therefore, we want the MWAA cluster's `requirements.txt` to change as infrequently as possible.

The way we accomplish this is to decouple the code that *configures* Airflow operators from the code that *executes* them.
More specifically, among the packages listed previously, only `twitch-airflow-components` is installed on the remote Airflow environment, and the environment has no knowledge that it is running a Conductor DAG.
Compared to `conductor`, `twitch-airflow-components` is relatively stable, with very general interfaces like `PostgresOperator` that can be configured to run a large number of particular use cases.
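Because DAGs reach the cluster as pickles (step 4 of the deploy process), one way to sanity-check this decoupling is to list the modules a pickle will import when it is loaded. The sketch below is an assumption about how such a check could look, not part of Conductor; `modules_imported_by` and `unexpected_modules` are hypothetical helpers. It only inspects `GLOBAL` opcodes, which pickle protocols 0 to 3 use for class references; protocol 4 and later use `STACK_GLOBAL` and would need extra handling.

```python
import pickle
import pickletools
from fractions import Fraction
from typing import Iterable, Set


def modules_imported_by(payload: bytes) -> Set[str]:
    """Return the top-level modules referenced by GLOBAL opcodes in a pickle."""
    modules = set()
    for opcode, arg, _pos in pickletools.genops(payload):
        if opcode.name == "GLOBAL":
            # pickletools reports the argument as "module qualified_name".
            module, _name = arg.split(" ", 1)
            modules.add(module.split(".")[0])
    return modules


def unexpected_modules(payload: bytes, allowed: Iterable[str]) -> Set[str]:
    """Return referenced modules that are not in the allowed set."""
    return modules_imported_by(payload) - set(allowed)


# A stdlib object stands in for a DAG so the example runs without Airflow.
payload = pickle.dumps(Fraction(1, 2), protocol=2, fix_imports=False)
found = modules_imported_by(payload)
problems = unexpected_modules(payload, allowed={"airflow"})
```

For a real DAG, the allowed set would be whatever is installed on the cluster; a non-empty result means the pickle depends on a package the scheduler cannot import.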

## DAG Serialization
Because Conductor and the user's project dependencies are not installed on the Airflow cluster, the DAG files we upload cannot reference them.
To allow the DAGs to be processed by the remote Airflow scheduler in this setting, we pickle the underlying DAG objects and send them to the cluster's S3 bucket as serialized byte strings, each wrapped inside a Python file, shown below.

```python
import pickle
encoded = b'\x80\x03cairflow.models.dag\nDAG\nq\x00)\x81q\x01}q\x02(X\x13\x00\x00\...'
dag = pickle.loads(encoded)
```
When this file is executed by the scheduler, the `dag` variable will be detected and registered in the Airflow backend.
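A file of this shape could be generated along the following lines. This is a sketch under stated assumptions rather than Conductor's actual implementation: `render_dag_file` is a hypothetical helper, and a plain dictionary stands in for the DAG so the example runs without Airflow installed.

```python
import pickle


def render_dag_file(dag_obj, protocol: int = 3) -> str:
    """Return Python source that rebuilds `dag_obj` into a variable named `dag`."""
    encoded = pickle.dumps(dag_obj, protocol=protocol)
    # repr() of a bytes object is a valid Python bytes literal.
    return (
        "import pickle\n"
        f"encoded = {encoded!r}\n"
        "dag = pickle.loads(encoded)\n"
    )


# Stand-in for a DAG object (assumption: any picklable object works the same way).
stand_in_dag = {"dag_id": "example", "schedule": "@daily"}
source = render_dag_file(stand_in_dag)

# Simulate what happens when the generated file is executed.
namespace = {}
exec(source, namespace)
restored = namespace["dag"]
```

Unpickling still imports every module the pickled object references, so this approach only works when those modules are present on the cluster.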

## Custom Container Logic
#### Container Structure
To run custom Python business logic, we need to be able to build the user's local code into an executable container.
The default container that Conductor provides is designed to support SageMaker integrations that use a bring-your-own-container (BYOC) setup, but it can be used generically with other container services like ECS or Batch.
The files that are automatically included in the container can be found [here](https://git.xarth.tv/ml/conductor/tree/master/conductor/docker), and include generic entrypoints for processing and training, as well as a Gunicorn + NGINX serving stack to be used with SageMaker Batch Transform and SageMaker Endpoints.

The default generic compute provider that Conductor suggests is the SageMaker Processing job.
The rationale for choosing this over ECS, EKS, or Batch is that it requires the least infrastructure to spin up and provides good interfaces for handling inputs and outputs.
Since the Processing job can execute arbitrary scripts, very few single-container tasks should be precluded by this choice.

#### Dependency Handling
The Conductor Dockerfile uses poetry to manage dependencies.
Non-dev dependencies within the user's `pyproject.toml` will be installed by poetry during the container build.

In addition, the Conductor Dockerfile defines a privileged directory called `wheels_dev` that allows devs to use wheel files directly as the source of a dependency in their `pyproject.toml`.
This escape hatch is especially useful for integration testing local builds of Conductor generated during development.
In the `pyproject.toml` the dependency becomes:
```toml
[tool.poetry.dependencies]
python = "3.7.10"
twitch-conductor = {path = "wheels_dev/twitch_conductor-0.0.1-py3-none-any.whl"}
```
Poetry is then able to resolve the wheel from the `wheels_dev` directory during the Docker build.
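Before building, it can help to confirm that every path dependency actually points at a file in the project. The sketch below is a hypothetical pre-build check, not something Conductor provides; `missing_wheels` is an illustrative name, and the function takes the already-parsed `[tool.poetry.dependencies]` table as a dictionary so that no TOML parser is required.

```python
import tempfile
from pathlib import Path
from typing import Dict, List


def missing_wheels(dependencies: Dict[str, object], project_root: Path) -> List[str]:
    """Return names of path dependencies whose file is absent under project_root."""
    missing = []
    for name, spec in dependencies.items():
        # Path dependencies are inline tables such as {path = "wheels_dev/..."}.
        if isinstance(spec, dict) and "path" in spec:
            if not (project_root / spec["path"]).is_file():
                missing.append(name)
    return missing


# Parsed form of the [tool.poetry.dependencies] table shown above.
dependencies = {
    "python": "3.7.10",
    "twitch-conductor": {"path": "wheels_dev/twitch_conductor-0.0.1-py3-none-any.whl"},
}

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    before = missing_wheels(dependencies, root)  # wheel not copied yet
    (root / "wheels_dev").mkdir()
    (root / "wheels_dev" / "twitch_conductor-0.0.1-py3-none-any.whl").touch()
    after = missing_wheels(dependencies, root)  # wheel now present
```

Here `before` lists `twitch-conductor` because the wheel has not been copied into `wheels_dev`, and `after` is empty once the file exists.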
