Data and Model Versioning¶

Exploring the fundamentals of DVC for ML artefact versioning, as outlined in the DVC tutorial.

Initialise the Project¶

In [2]:

!dvc init --subdir

Initialized DVC repository.

You can now commit the changes to git.

+---------------------------------------------------------------------+
|                                                                     |
|        DVC has enabled anonymous aggregate usage analytics.         |
|     Read the analytics documentation (and how to opt-out) here:     |
|             <https://dvc.org/doc/user-guide/analytics>              |
|                                                                     |
+---------------------------------------------------------------------+

What's next?
------------
- Check out the documentation: <https://dvc.org/doc>
- Get help and share ideas: <https://dvc.org/chat>
- Star us on GitHub: <https://github.com/iterative/dvc>

Set Up a Remote Artefact Location¶

For this demo, the remote will be an AWS S3 bucket.

In [1]:

!dvc remote add -d s3 s3://dvc-example-artefacts

Setting 's3' as a default remote.

Note that the -d flag sets this remote as the default, so that DVC commands that transfer data, such as dvc push and dvc pull, will use it unless another remote is specified.
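To make the effect of the -d flag more concrete, here is a minimal sketch that reads the default remote back out of a DVC config file. The sample text is an assumption about what .dvc/config typically looks like after the command above (it is not copied from this project), and default_remote_url is a hypothetical helper, not part of DVC.

```python
import configparser

# Assumed contents of .dvc/config after `dvc remote add -d s3 ...`.
# This layout is an illustration, not output captured from this project.
SAMPLE_CONFIG = """\
[core]
    remote = s3
['remote "s3"']
    url = s3://dvc-example-artefacts
"""


def default_remote_url(config_text):
    """Return (name, url) of the default remote found in the config text."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    # The [core] section names the default remote.
    name = parser.get("core", "remote")
    # Remote sections are assumed to be named like: 'remote "s3"'
    section = "'remote \"{}\"'".format(name)
    return name, parser.get(section, "url")


if __name__ == "__main__":
    print(default_remote_url(SAMPLE_CONFIG))
```

In practice, running dvc remote list is probably the simpler way to inspect configured remotes; the sketch only illustrates where the default is recorded.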

Start Tracking a Dataset¶

We start by creating the dataset using Pandas, but this could be a proxy for any data ingestion operation - e.g., querying a database to retrieve the latest tranche of training data.

In [9]:

import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4, 5], "z": ["a", "b", "c", "d", "e"]})
df.to_csv("datasets/example.csv", index=False)

Next, we get DVC to start tracking this new dataset.

In [10]:

!dvc add datasets/example.csv

100% Adding...|████████████████████████████████████████|1/1 [00:00, 73.55file/s]

To track the changes with git, run:

    git add datasets/.gitignore datasets/example.csv.dvc

To enable auto staging, run:

	dvc config core.autostage true

Note that datasets/example.csv will not be tracked by Git, because DVC automatically adds it to a .gitignore file in the same directory:

In [15]:

!cat datasets/.gitignore

/example.csv

Instead, we need to track the metadata file datasets/example.csv.dvc (using Git), and use dvc push to move the data to remote storage (see below).

In [16]:

!cat datasets/example.csv.dvc

outs:
- md5: 553afb5628d5a62daecac40d8442f189
  size: 35
  path: example.csv
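The md5 and size fields in the metadata file describe the content of the tracked file, which is how DVC can tell one dataset version from another. As a rough illustration, the sketch below computes an MD5 digest and byte count for a file. It assumes the recorded hash is a plain MD5 of the raw file bytes, which may not hold for every DVC version or file type, and file_fingerprint is a hypothetical helper rather than a DVC function.

```python
import hashlib


def file_fingerprint(path, chunk_size=1 << 20):
    """Return (md5_hexdigest, size_in_bytes) for the file at `path`."""
    digest = hashlib.md5()
    size = 0
    with open(path, "rb") as f:
        # Read in chunks so that large datasets do not need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
            size += len(chunk)
    return digest.hexdigest(), size
```

Calling file_fingerprint("datasets/example.csv") and comparing the result with the .dvc file is a quick sanity check, bearing the assumption above in mind.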

Don't forget to commit the changes to the metadata file!

In [ ]:

!git add datasets/.gitignore datasets/example.csv.dvc
!git commit -m "Added dataset v1"

And push them to the remote repository when the time comes.

Push Dataset to S3¶

In [14]:

!dvc push datasets/example.csv

1 file pushed

Workflow for Updating Dataset Versions¶

Assemble the latest dataset.

In [18]:

df = pd.DataFrame({"x": [5, 4, 3, 2, 1], "z": ["e", "d", "c", "b", "q"]})
df.to_csv("datasets/example.csv", index=False)

In [20]:


!dvc add datasets/example.csv
!git add datasets/example.csv.dvc
!git commit -m "Added dataset v2"
!dvc push

100% Adding...|███████████████████████████████████████|1/1 [00:00, 117.25file/s]

To track the changes with git, run:

    git add datasets/example.csv.dvc

To enable auto staging, run:

	dvc config core.autostage true
[dvc c78de9c] Added dataset v2
 1 file changed, 1 insertion(+), 1 deletion(-)
Everything is up to date.

Note that auto staging can be enabled (with dvc config core.autostage true, as suggested in the output above), so that the git add step can be omitted.
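The four-step update workflow above (dvc add, git add, git commit, dvc push) could be wrapped in a small helper so that it is harder to forget a step. This is only a sketch: dataset_update_commands and record_dataset_version are hypothetical names, and it assumes that both dvc and git are on the PATH and that the code runs from the project directory.

```python
import subprocess


def dataset_update_commands(data_path, message):
    """Return the commands used above to record a new dataset version."""
    return [
        ["dvc", "add", data_path],
        ["git", "add", data_path + ".dvc"],
        ["git", "commit", "-m", message],
        ["dvc", "push"],
    ]


def record_dataset_version(data_path, message, run=subprocess.run):
    """Run each command in order, stopping at the first failure."""
    for cmd in dataset_update_commands(data_path, message):
        # check=True raises CalledProcessError on a non-zero exit code.
        run(cmd, check=True)
```

For example, record_dataset_version("datasets/example.csv", "Added dataset v2") would repeat the cell above. The run parameter is there so that the command runner can be swapped out, for instance in a dry run.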

Fetching Old Dataset Versions¶

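Start by listing the Git commits (e.g., with git log --oneline) to find the hash of the commit that recorded the dataset version you want. Then check out that commit with Git and run dvc checkout to bring the data file in line with the metadata file.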

In [21]:

!git checkout 9360cd7
!dvc checkout
!cat datasets/example.csv

M	dvc/data_and_model_versioning.ipynb
M	pytorch/MNIST.ipynb
M	pytorch/linear_regression_sgd.ipynb
M	pytorch/logistic_regression_sgd.ipynb
M	pytorch/requirements.txt
Note: switching to '9360cd7'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

HEAD is now at 9360cd7 First DVC demo.
M       datasets/example.csv                                           
,x,z
0,1,a
1,2,b
2,3,c
3,4,d
4,5,e

And reverting back again...

In [22]:

!git checkout dvc
!dvc checkout
!cat datasets/example.csv

M	dvc/data_and_model_versioning.ipynb
M	pytorch/MNIST.ipynb
M	pytorch/linear_regression_sgd.ipynb
M	pytorch/logistic_regression_sgd.ipynb
M	pytorch/requirements.txt
Previous HEAD position was 9360cd7 First DVC demo.
Switched to branch 'dvc'
M       datasets/example.csv                                           
,x,z
0,5,e
1,4,d
2,3,c
3,2,b
4,1,q
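The two-step pattern used in this section (git checkout to select the metadata, then dvc checkout to restore the matching data) can be sketched as a single function. restore_dataset is a hypothetical helper, and it assumes that the requested version is already in the local DVC cache; if it is not, a dvc pull would probably be needed first.

```python
import subprocess


def restore_dataset(revision, data_path, run=subprocess.run):
    """Switch to `revision`, sync the DVC-tracked data and return its text."""
    # Step 1: move the .dvc metadata file to the requested Git revision.
    run(["git", "checkout", revision], check=True)
    # Step 2: make the workspace data match the checked-out metadata.
    run(["dvc", "checkout"], check=True)
    with open(data_path, encoding="utf-8") as f:
        return f.read()
```

For example, restore_dataset("9360cd7", "datasets/example.csv") would mirror the first cell in this section, and restore_dataset("dvc", "datasets/example.csv") would switch back to the branch.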

Thoughts and Conclusions¶

It feels to me as if the Git repo used by DVC might benefit from being standalone, rather than being part of a code repo.