DVC Pipelines¶
Orchestrating data science workflows and tracking computation artefacts and their lineage using DVC.
Initialise the Project¶
In [1]:
!dvc init --subdir
Initialized DVC repository.

You can now commit the changes to git.

+---------------------------------------------------------------------+
|                                                                     |
|        DVC has enabled anonymous aggregate usage analytics.         |
|     Read the analytics documentation (and how to opt-out) here:     |
|             <https://dvc.org/doc/user-guide/analytics>              |
|                                                                     |
+---------------------------------------------------------------------+

What's next?
------------
- Check out the documentation: <https://dvc.org/doc>
- Get help and share ideas: <https://dvc.org/chat>
- Star us on GitHub: <https://github.com/iterative/dvc>
Set Up a Remote Artefact Location¶
In [2]:
!dvc remote add -d s3 s3://dvc-example-artefacts/pipelines
Setting 's3' as a default remote.
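Depending on how the bucket is configured, the remote may also need extra options, such as a region or a named AWS profile. These can be set with dvc remote modify, e.g. (the values below are placeholders, not taken from this project),

!dvc remote modify s3 region eu-west-2
!dvc remote modify s3 profile my-aws-profile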
Define the Pipeline¶
The pipeline is defined in a YAML file, which is reproduced below. This is all that is required for DVC to track the various artefacts and metrics. Note that parameters listed under params are read from params.yaml by default.
In [3]:
!cat dvc.yaml
stages:
  get_data:
    cmd: python stages/get_data.py
    deps:
      - stages/get_data.py
    outs:
      - artefacts/dataset.csv
  train_model:
    cmd: python stages/train_model.py
    deps:
      - artefacts/dataset.csv
      - stages/train_model.py
    params:
      - train.random_state
    outs:
      - artefacts/model.joblib
    metrics:
      - metrics/metrics.json:
          cache: false
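For context, below is a minimal sketch of what stages/train_model.py might look like. Only the file paths, the train.random_state parameter, and the MAE metric come from the pipeline definition; the dataset's column names and the choice of model are assumptions.

import json

import joblib
import pandas as pd
import yaml
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# DVC resolves the train.random_state param from params.yaml by default.
with open("params.yaml") as f:
    params = yaml.safe_load(f)["train"]

# Input artefact produced by the get_data stage.
data = pd.read_csv("artefacts/dataset.csv")
X, y = data[["x"]], data["y"]  # hypothetical column names
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=params["random_state"]
)

# Output artefact declared under outs in dvc.yaml.
model = LinearRegression().fit(X_train, y_train)
joblib.dump(model, "artefacts/model.joblib")

# Metrics file declared under metrics in dvc.yaml.
with open("metrics/metrics.json", "w") as f:
    mae = float(mean_absolute_error(y_test, model.predict(X_test)))
    json.dump({"MAE": mae}, f)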
The implied DAG can be visualised as follows,
In [4]:
!dvc dag
 +----------+
 | get_data |
 +----------+
       *
       *
       *
+-------------+
| train_model |
+-------------+
Run the Pipeline¶
The pipeline can be run with one command,
In [6]:
!dvc repro
Stage 'get_data' didn't change, skipping
Running stage 'train_model':
> python stages/train_model.py
Updating lock file 'dvc.lock'

To track the changes with git, run:

	git add dvc.lock

To enable auto staging, run:

	dvc config core.autostage true

Use `dvc push` to send your updates to remote storage.
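Because DVC tracks the DAG, only stages whose dependencies or parameters have changed are re-executed. For example, after editing train.random_state in params.yaml, the pending change can be inspected and the pipeline re-run, with get_data skipped once again,

!dvc params diff
!dvc repro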
Version Control the Artefacts and Metrics¶
In [9]:
!git add dvc.lock
!git commit -m "Pipeline run #1"
!dvc push
[dvc 3cc326d] Pipeline run #1
 1 file changed, 31 insertions(+)
 create mode 100644 dvc-pipelines/dvc.lock

2 files pushed
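It can also help to tag each run, so that any previous combination of code, parameters, and artefacts can be restored later by checking out the tag and letting DVC sync the workspace (the tag name below is arbitrary),

!git tag run-1

Then, at some later point,

!git checkout run-1
!dvc checkout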
Display the Metrics¶
All metrics can be retrieved with one command.
In [10]:
!dvc metrics show
Path                  MAE
metrics/metrics.json  0.07843
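Changes in metrics relative to the last committed run can also be surfaced, e.g. after another call to dvc repro,

!dvc metrics diff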
Thoughts and Conclusions¶
It's worth noting that running dvc pull on a clone of this repository will pull the latest version of all the files from S3 into the local directory. Use dvc import if the initial dataset exists in a different repo (e.g., in a dedicated DVC data registry).
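For example, on a fresh clone (the registry URL and in-repo path below are hypothetical placeholders),

!dvc pull
!dvc import https://github.com/example/data-registry data/dataset.csv -o artefacts/dataset.csv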