DVC Pipelines¶
Orchestrating data science workflows and tracking computation artefacts and their lineage using DVC.
Initialise the Project¶
In [1]:
!dvc init --subdir
Initialized DVC repository.

You can now commit the changes to git.

+---------------------------------------------------------------------+
|                                                                     |
|        DVC has enabled anonymous aggregate usage analytics.         |
|     Read the analytics documentation (and how to opt-out) here:     |
|             <https://dvc.org/doc/user-guide/analytics>              |
|                                                                     |
+---------------------------------------------------------------------+

What's next?
------------
- Check out the documentation: <https://dvc.org/doc>
- Get help and share ideas: <https://dvc.org/chat>
- Star us on GitHub: <https://github.com/iterative/dvc>
Set Up a Remote Artefact Location¶
In [2]:
!dvc remote add -d s3 s3://dvc-example-artefacts/pipelines
Setting 's3' as a default remote.
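Depending on how the bucket is configured, the remote may also need extra options, such as a region or a named AWS profile. These can be set with dvc remote modify, e.g. (the values below are placeholders, not taken from this project),

!dvc remote modify s3 region eu-west-2
!dvc remote modify s3 profile my-aws-profile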
Define the Pipeline¶
The pipeline is defined in a YAML file, which is reproduced below. This is all that is required for DVC to track the various artefacts and metrics. Note that parameters listed under params are read from params.yaml by default.
In [3]:
!cat dvc.yaml
stages:
  get_data:
    cmd: python stages/get_data.py
    deps:
      - stages/get_data.py
    outs:
      - artefacts/dataset.csv
  train_model:
    cmd: python stages/train_model.py
    deps:
      - artefacts/dataset.csv
      - stages/train_model.py
    params:
      - train.random_state
    outs:
      - artefacts/model.joblib
    metrics:
      - metrics/metrics.json:
          cache: false
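For context, below is a minimal sketch of what stages/train_model.py might look like. Only the file paths, the train.random_state parameter, and the MAE metric come from the pipeline definition; the dataset's column names and the choice of model are assumptions.

import json

import joblib
import pandas as pd
import yaml
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# DVC resolves the train.random_state param from params.yaml by default.
with open("params.yaml") as f:
    params = yaml.safe_load(f)["train"]

# Input artefact produced by the get_data stage.
data = pd.read_csv("artefacts/dataset.csv")
X, y = data[["x"]], data["y"]  # hypothetical column names
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=params["random_state"]
)

# Output artefact declared under outs in dvc.yaml.
model = LinearRegression().fit(X_train, y_train)
joblib.dump(model, "artefacts/model.joblib")

# Metrics file declared under metrics in dvc.yaml.
with open("metrics/metrics.json", "w") as f:
    mae = float(mean_absolute_error(y_test, model.predict(X_test)))
    json.dump({"MAE": mae}, f)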
The implied DAG can be visualised as follows,
In [4]:
!dvc dag
 +----------+
 | get_data |
 +----------+
       *
       *
       *
+-------------+
| train_model |
+-------------+
Run the Pipeline¶
The pipeline can be run with one command,
In [6]:
!dvc repro
Stage 'get_data' didn't change, skipping
Running stage 'train_model':
> python stages/train_model.py
Updating lock file 'dvc.lock'

To track the changes with git, run:

	git add dvc.lock

To enable auto staging, run:

	dvc config core.autostage true

Use `dvc push` to send your updates to remote storage.
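Because DVC tracks the DAG, only stages whose dependencies or parameters have changed are re-executed. For example, after editing train.random_state in params.yaml, the pending change can be inspected and the pipeline re-run, with get_data skipped once again,

!dvc params diff
!dvc repro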
Version Control the Artefacts and Metrics¶
In [9]:
!git add dvc.lock
!git commit -m "Pipeline run #1"
!dvc push
[dvc 3cc326d] Pipeline run #1
 1 file changed, 31 insertions(+)
 create mode 100644 dvc-pipelines/dvc.lock

2 files pushed
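It can also help to tag each run, so that any previous combination of code, parameters, and artefacts can be restored later by checking out the tag and letting DVC sync the workspace (the tag name below is arbitrary),

!git tag run-1

Then, at some later point,

!git checkout run-1
!dvc checkout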
Display the Metrics¶
All metrics can be retrieved with one command.
In [10]:
!dvc metrics show
Path                  MAE
metrics/metrics.json  0.07843
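Changes in metrics relative to the last committed run can also be surfaced, e.g. after another call to dvc repro,

!dvc metrics diff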
Thoughts and Conclusions¶
It's worth noting that running dvc pull on a clone of this repository will pull the latest version of all the files from S3 into the local directory. Use dvc import if the initial dataset exists in a different repo (e.g., in a dedicated DVC data registry).
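For example, on a fresh clone (the registry URL and in-repo path below are hypothetical placeholders),

!dvc pull
!dvc import https://github.com/example/data-registry data/dataset.csv -o artefacts/dataset.csv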