
Build System

Motivation

In data science, analysis pipelines often consist of many complex, slow steps. This creates two challenges:

  • Iteration: It is rare to run a pipeline once, produce an analysis, and be done. Usually, one must repeatedly tweak the steps, rerun the pipeline, and reexamine the result. To avoid wasting time, it is therefore desirable that, after each change, only the steps affected by that change are rerun.
  • Lineage: Given the complexity of many data science workflows, there is considerable room for error. It is therefore desirable to be able to interrogate the final product of a workflow to trace its "lineage": the precise sequence of steps that produced it.

These challenges motivate the development of a data science build system.

Framework

The build system used in this project is based on the framework of Mokhov et al. [1, 2]. I outline this framework below.

Key Concepts

Asset

An asset is any file or directory consumed or produced by the build. In this project, examples of assets include:

Task

A task is an operation that constructs a particular asset.

Here is the source code for the Task base class:

Task Class Definition
"""
Instructions for materializing an asset.
"""

from abc import ABC, abstractmethod
from pathlib import Path

from mecfs_bio.build_system.asset.base_asset import Asset
from mecfs_bio.build_system.meta.asset_id import AssetId
from mecfs_bio.build_system.meta.meta import Meta
from mecfs_bio.build_system.rebuilder.fetch.base_fetch import Fetch
from mecfs_bio.build_system.wf.base_wf import WF


class GeneratingTask(ABC):
    """
    Instructions for materializing an asset.
    """

    @property
    @abstractmethod
    def meta(self) -> Meta:
        """
        Metadata describing the target asset.
        """
        pass

    @property
    @abstractmethod
    def deps(self) -> list["Task"]:
        """
        List of tasks whose assets are needed to produce the target asset.
        """
        pass

    @property
    def asset_id(self) -> AssetId:
        return self.meta.asset_id

    @abstractmethod
    def execute(
        self,
        scratch_dir: Path,
        fetch: Fetch,
        wf: WF,
    ) -> Asset:
        """
        Materialize the target asset, using the 'fetch' callback to access asset dependencies.
        """
        pass


Task = GeneratingTask
  • The meta property returns a metadata object describing the asset created by the Task. This metadata object must include a string id uniquely identifying the asset.
  • The deps property returns a list of prerequisite Tasks that must be executed before the current Task. The build system uses deps to construct the Task dependency graph.
  • The key method is execute. Subclasses override this method to specify how to construct their assets. The fetch parameter is a special callback passed to execute by the build system. Instead of accessing its dependencies directly, execute should request them from the build system through fetch.
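To make the role of deps and fetch concrete, here is a minimal, self-contained sketch of two toy tasks. It does not use the project's classes: ToyMeta, ToyAsset, ToyFetch, WriteTextTask, and UppercaseTask are hypothetical stand-ins invented for illustration, the wf parameter is omitted, and the hand-written fetch in run_demo merely imitates what the build system would supply.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable


# Hypothetical stand-ins for the project's Meta, Asset, and Fetch types.
@dataclass(frozen=True)
class ToyMeta:
    asset_id: str


@dataclass(frozen=True)
class ToyAsset:
    path: Path


ToyFetch = Callable[[str], ToyAsset]  # maps an asset id to a materialized asset


@dataclass
class WriteTextTask:
    """Leaf task with no dependencies: writes a fixed string to a file."""

    meta: ToyMeta
    text: str
    deps: list = field(default_factory=list)

    def execute(self, scratch_dir: Path, fetch: ToyFetch) -> ToyAsset:
        out = scratch_dir / f"{self.meta.asset_id}.txt"
        out.write_text(self.text)
        return ToyAsset(out)


@dataclass
class UppercaseTask:
    """Derived task: uppercases the asset produced by `source`."""

    meta: ToyMeta
    source: WriteTextTask

    @property
    def deps(self) -> list:
        return [self.source]

    def execute(self, scratch_dir: Path, fetch: ToyFetch) -> ToyAsset:
        # Ask the build system for the dependency rather than locating it directly.
        upstream = fetch(self.source.meta.asset_id)
        out = scratch_dir / f"{self.meta.asset_id}.txt"
        out.write_text(upstream.path.read_text().upper())
        return ToyAsset(out)


def run_demo() -> str:
    """Run both tasks by hand, in dependency order, with a dictionary-backed fetch."""
    raw = WriteTextTask(ToyMeta("raw"), "hello")
    upper = UppercaseTask(ToyMeta("upper"), raw)
    with tempfile.TemporaryDirectory() as tmp:
        scratch = Path(tmp)
        built: dict[str, ToyAsset] = {}

        def fetch(asset_id: str) -> ToyAsset:
            return built[asset_id]

        built["raw"] = raw.execute(scratch, fetch)
        built["upper"] = upper.execute(scratch, fetch)
        return built["upper"].path.read_text()
```

Because UppercaseTask only ever sees its input through fetch, the caller is free to decide where that input lives and whether it needs to be rebuilt first.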

Concrete Task subclasses are defined here.

Rebuilder

Given a Task that generates an Asset, together with a data storage object called Info, the job of a Rebuilder is to decide whether the current version of the Asset is up-to-date. If so, the Asset can be returned directly without executing the Task. If not, the Rebuilder uses the Task to materialize an up-to-date version of the Asset.

Here is the source code for the Rebuilder base class:

"""
Abstract base class for Rebuilders.  See Andrey Mokhov, Neil Mitchell, and Simon Peyton Jones. Build systems à la carte.
"""

from abc import ABC, abstractmethod
from pathlib import Path

from mecfs_bio.build_system.asset.base_asset import Asset
from mecfs_bio.build_system.rebuilder.fetch.base_fetch import Fetch
from mecfs_bio.build_system.rebuilder.metadata_to_path.base_meta_to_path import (
    MetaToPath,
)
from mecfs_bio.build_system.task.base_task import Task
from mecfs_bio.build_system.wf.base_wf import WF


class Rebuilder[Info](ABC):
    """
    Key Operations:
    - Decide whether a given asset is up-to-date using information from Info.
    - If the asset is up-to-date, return it together with Info.
    - If the asset is not up-to-date, bring it up-to-date, update Info, and return the new values of both.
    """

    @abstractmethod
    def rebuild(
        self,
        task: Task,
        asset: Asset | None,
        fetch: Fetch,
        wf: WF,
        info: Info,
        meta_to_path: MetaToPath,
    ) -> tuple[Asset, Info]:
        pass

    @classmethod
    @abstractmethod
    def save_info(cls, info: Info, path: Path):
        pass

Currently, there is one concrete implementation of Rebuilder, called the VerifyingTraceRebuilder. It uses file hashes to decide whether an Asset is up-to-date.
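The general idea of a hash-based up-to-date check can be sketched as follows. This is an illustration in the spirit of the verifying traces described by Mokhov et al., not the project's VerifyingTraceRebuilder: the rebuild signature, the Traces type, and the file_hash helper are assumptions made for this example. Here a trace records the hash of an output together with the hashes of the dependencies it was built from, and a build is skipped only when the recorded trace matches the files currently on disk.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable


def file_hash(path: Path) -> str:
    """SHA-256 digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


# Hypothetical trace store: output name -> (output hash, {dependency name: hash}).
Traces = dict[str, tuple[str, dict[str, str]]]


def rebuild(
    name: str,
    output: Path,
    deps: dict[str, Path],
    build: Callable[[], None],
    traces: Traces,
) -> bool:
    """Run `build` unless the recorded trace matches the files on disk.

    Returns True if `build` was executed, False if the output was reused.
    """
    dep_hashes = {dep: file_hash(path) for dep, path in deps.items()}
    recorded = traces.get(name)
    if (
        recorded is not None
        and output.exists()
        and recorded == (file_hash(output), dep_hashes)
    ):
        return False  # up-to-date: neither the output nor its inputs changed
    build()
    traces[name] = (file_hash(output), dep_hashes)  # record the new trace
    return True


def run_demo() -> list[bool]:
    """Build once, rebuild with no changes, then rebuild after editing the input."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "src.txt"
        out = Path(tmp) / "out.txt"
        src.write_text("a")
        traces: Traces = {}

        def build() -> None:
            out.write_text(src.read_text().upper())

        results = [rebuild("out", out, {"src": src}, build, traces)]
        results.append(rebuild("out", out, {"src": src}, build, traces))
        src.write_text("b")  # changing the input invalidates the trace
        results.append(rebuild("out", out, {"src": src}, build, traces))
        return results
```

In this sketch the traces dictionary plays the role that Info plays in the Rebuilder interface: it is the persistent record consulted to decide whether work can be skipped.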

Scheduler

Given one or more target assets requested by the user, the job of the scheduler is to determine which tasks must run, and in what order, to produce those assets. The scheduler delegates the actual running of these tasks to the Rebuilder.

Currently, there is one concrete scheduler: the topological scheduler. The topological scheduler constructs a directed acyclic graph of the dependencies of the requested assets, then traverses this graph in topological order.
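A topological traversal of this kind can be sketched with the standard library's graphlib module. This is a minimal illustration, not the project's scheduler: build_order and the plain string task names are hypothetical, and in the real system each task would be handed to the Rebuilder rather than simply listed.

```python
from graphlib import TopologicalSorter


def build_order(targets: list[str], deps: dict[str, list[str]]) -> list[str]:
    """Return every task needed for `targets`, with dependencies listed first.

    `deps` maps a task name to the names of the tasks it depends on.
    Tasks that no target depends on are left out of the result.
    """
    graph: dict[str, list[str]] = {}
    stack = list(targets)
    while stack:  # collect the transitive dependencies of the targets
        node = stack.pop()
        if node in graph:
            continue
        graph[node] = deps.get(node, [])
        stack.extend(graph[node])
    # TopologicalSorter expects {node: predecessors}; static_order() yields
    # predecessors before the nodes that need them and raises CycleError
    # if the graph contains a cycle.
    return list(TopologicalSorter(graph).static_order())


example_deps = {
    "report": ["clean", "reference"],
    "clean": ["raw"],
    "unrelated": ["raw"],
}
order = build_order(["report"], example_deps)
```

Restricting the graph to the transitive dependencies of the requested targets means that tasks such as "unrelated" above are never considered, however many tasks the project defines.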


  1. Andrey Mokhov, Neil Mitchell, and Simon Peyton Jones. Build systems à la carte. Proceedings of the ACM on Programming Languages, 2(ICFP):1–29, 2018. URL: https://dl.acm.org/doi/abs/10.1145/3236774

  2. Andrey Mokhov, Neil Mitchell, and Simon Peyton Jones. Build systems à la carte: Theory and practice. Journal of Functional Programming, 30:e11, 2020. URL: https://www.cambridge.org/core/journals/journal-of-functional-programming/article/build-systems-a-la-carte-theory-and-practice/097CE52C750E69BD16B78C318754C7A4