Data and Model Versioning
Exploring the fundamentals of DVC for ML artefact versioning, as outlined in the DVC tutorial.
Initialise the Project
!dvc init --subdir
Initialized DVC repository.

You can now commit the changes to git.

+---------------------------------------------------------------------+
|                                                                     |
|        DVC has enabled anonymous aggregate usage analytics.         |
|     Read the analytics documentation (and how to opt-out) here:     |
|             <https://dvc.org/doc/user-guide/analytics>              |
|                                                                     |
+---------------------------------------------------------------------+

What's next?
------------
- Check out the documentation: <https://dvc.org/doc>
- Get help and share ideas: <https://dvc.org/chat>
- Star us on GitHub: <https://github.com/iterative/dvc>
Set Up a Remote Artefact Location
For this demo, the remote will be an AWS S3 bucket.
!dvc remote add -d s3 s3://dvc-example-artefacts
Setting 's3' as a default remote.
Note that the -d flag sets this remote as the default, so that commands such as dvc push and dvc pull use it whenever no remote is specified explicitly.
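To confirm which remote ended up as the default, the DVC config file can be inspected. The sketch below assumes the config lives at .dvc/config and uses the INI-like layout shown in EXAMPLE_CONFIG (a [core] section naming the remote, and a quoted remote section holding its URL); the helper name default_remote_url is hypothetical.

```python
import configparser

# Assumed contents of .dvc/config after `dvc remote add -d s3 ...`.
EXAMPLE_CONFIG = """\
[core]
    remote = s3
['remote "s3"']
    url = s3://dvc-example-artefacts
"""


def default_remote_url(config_text):
    """Return the URL of the default remote, or None if none is configured."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    name = parser.get("core", "remote", fallback=None)
    if name is None:
        return None
    # Remote sections are assumed to be named like: 'remote "<name>"'
    section = f"'remote \"{name}\"'"
    return parser.get(section, "url", fallback=None)


print(default_remote_url(EXAMPLE_CONFIG))
```

In a real project the text would come from reading .dvc/config rather than from the hard-coded example string.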
Start Tracking a Dataset
We start by creating the dataset using Pandas, but this is a stand-in for any data ingestion operation, e.g., querying a database to retrieve the latest tranche of training data.
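from pathlib import Path
import pandas as pd
# Make sure the target directory exists before writing the CSV.
Path("datasets").mkdir(exist_ok=True)
df = pd.DataFrame({"x": [1, 2, 3, 4, 5], "z": ["a", "b", "c", "d", "e"]})
df.to_csv("datasets/example.csv", index=False)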
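As an illustration of the database case mentioned above, here is a minimal sketch that dumps a query result to a CSV file using only the standard library. The table name training_data, the query and the helper export_query_to_csv are all hypothetical, and SQLite stands in for whatever database is actually in use.

```python
import csv
import sqlite3


def export_query_to_csv(conn, query, csv_path):
    """Run `query` on `conn`, write the rows (plus a header) to `csv_path`,
    and return the number of data rows written."""
    cursor = conn.execute(query)
    columns = [col[0] for col in cursor.description]
    rows = cursor.fetchall()
    # A fixed "\n" terminator keeps the file bytes identical across
    # platforms, which matters when the file is tracked by content hash.
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f, lineterminator="\n")
        writer.writerow(columns)
        writer.writerows(rows)
    return len(rows)
```

The resulting file can then be tracked with dvc add in exactly the same way as the Pandas-generated one.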
Next, we get DVC to start tracking this new dataset.
!dvc add datasets/example.csv
100% Adding...|████████████████████████████████████████|1/1 [00:00, 73.55file/s]
To track the changes with git, run:
git add datasets/.gitignore datasets/example.csv.dvc
To enable auto staging, run:
dvc config core.autostage true
Note that datasets/example.csv will not be tracked by Git: DVC automatically adds it to a .gitignore file in the same directory.
!cat datasets/.gitignore
/example.csv
Instead, we need to track the metadata file datasets/example.csv.dvc
(using Git), and use dvc push
to move the data to remote storage (see below).
!cat datasets/example.csv.dvc
outs:
- md5: 553afb5628d5a62daecac40d8442f189
  size: 35
  path: example.csv
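The md5 and size fields are what tie the small metadata file to the actual data. As a rough sketch, the check below compares a workspace file against the values recorded in a .dvc file. It assumes a single-file output hashed with a plain MD5 of the file bytes (older DVC versions normalised line endings of text files before hashing, so results may differ for files with Windows line endings); the helper names are hypothetical and the metadata is parsed with regular expressions rather than a YAML library to keep the example dependency-free.

```python
import hashlib
import os
import re


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def recorded_md5_and_size(dvc_text):
    """Extract the first md5 and size values from the text of a .dvc file."""
    md5 = re.search(r"\bmd5:\s*([0-9a-f]{32})\b", dvc_text)
    size = re.search(r"\bsize:\s*(\d+)", dvc_text)
    return (
        md5.group(1) if md5 else None,
        int(size.group(1)) if size else None,
    )


def matches_metadata(data_path, dvc_text):
    """True if the file's size and MD5 equal those recorded in the metadata."""
    md5, size = recorded_md5_and_size(dvc_text)
    return os.path.getsize(data_path) == size and file_md5(data_path) == md5
```

In practice dvc status reports the same information; the sketch is only meant to show what the metadata file encodes.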
Don't forget to commit the changes to the metadata file!
!git add datasets/.gitignore datasets/example.csv.dvc
!git commit -m "Added dataset v1"
And push them to the remote repository when the time comes.
Push Dataset to S3
!dvc push datasets/example.csv
1 file pushed
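Pushed files are stored in the remote under names derived from their content hash, not their original filename. The sketch below shows the mapping under the assumption that the first two hex characters of the MD5 form a directory and the remainder forms the object name; newer DVC versions are assumed to add a files/md5/ prefix, modelled here by the optional prefix argument. The helper name remote_object_key is hypothetical.

```python
def remote_object_key(md5, prefix=""):
    """Map a content hash to its assumed object key inside the remote."""
    return f"{prefix}{md5[:2]}/{md5[2:]}"


# Key for the dataset hash recorded in datasets/example.csv.dvc.
print(remote_object_key("553afb5628d5a62daecac40d8442f189"))
```

Because the key depends only on the content, pushing an unchanged file a second time transfers nothing.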
Workflow for Updating Dataset Versions
Assemble the latest dataset.
df = pd.DataFrame({"x": [5, 4, 3, 2, 1], "z": ["e", "d", "c", "b", "q"]})
df.to_csv("datasets/example.csv", index=False)
!dvc add datasets/example.csv
!git add datasets/example.csv.dvc
!git commit -m "Added dataset v2"
!dvc push
100% Adding...|███████████████████████████████████████|1/1 [00:00, 117.25file/s]
To track the changes with git, run:
git add datasets/example.csv.dvc
To enable auto staging, run:
dvc config core.autostage true
[dvc c78de9c] Added dataset v2
1 file changed, 1 insertion(+), 1 deletion(-)
Everything is up to date.
Note that if auto staging is enabled (dvc config core.autostage true), DVC stages the updated .dvc file itself, so the explicit git add can be omitted.
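The four-step update workflow above (dvc add, git add, git commit, dvc push) can be wrapped in a small helper so that it is always run in the same order. This is a minimal sketch that assumes git and dvc are on the PATH and that the working directory is the DVC project root; the function names are hypothetical.

```python
import subprocess


def dataset_update_commands(data_path, message):
    """Return the commands that record and publish a new dataset version."""
    return [
        ["dvc", "add", data_path],
        ["git", "add", f"{data_path}.dvc"],
        ["git", "commit", "-m", message],
        ["dvc", "push"],
    ]


def run_dataset_update(data_path, message, dry_run=False):
    """Run the update workflow, or only print the commands if dry_run is set."""
    for cmd in dataset_update_commands(data_path, message):
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops the workflow at the first failing command.
            subprocess.run(cmd, check=True)
```

For example, run_dataset_update("datasets/example.csv", "Added dataset v3", dry_run=True) prints the commands without executing them. If the dataset has not changed, git commit exits with an error because there is nothing to commit, and the helper stops before pushing.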
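Fetching Old Dataset Versions
Find the Git commit that recorded the version you want (for example, with git log --oneline), check it out, and then run dvc checkout to sync the data file with the restored .dvc metadata.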
!git checkout 9360cd7
!dvc checkout
!cat datasets/example.csv
M dvc/data_and_model_versioning.ipynb
M pytorch/MNIST.ipynb
M pytorch/linear_regression_sgd.ipynb
M pytorch/logistic_regression_sgd.ipynb
M pytorch/requirements.txt
Note: switching to '9360cd7'.
You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.
If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:
git switch -c <new-branch-name>
Or undo this operation with:
git switch -
Turn off this advice by setting config variable advice.detachedHead to false
HEAD is now at 9360cd7 First DVC demo.
M datasets/example.csv
,x,z
0,1,a
1,2,b
2,3,c
3,4,d
4,5,e
And switching back to the latest version:
!git checkout dvc
!dvc checkout
!cat datasets/example.csv
M dvc/data_and_model_versioning.ipynb
M pytorch/MNIST.ipynb
M pytorch/linear_regression_sgd.ipynb
M pytorch/logistic_regression_sgd.ipynb
M pytorch/requirements.txt
Previous HEAD position was 9360cd7 First DVC demo.
Switched to branch 'dvc'
M datasets/example.csv
,x,z
0,5,e
1,4,d
2,3,c
3,2,b
4,1,q
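Checking out a whole commit detaches HEAD and touches every file in the repo. When only one dataset needs to be rolled back, an alternative is to restore just its .dvc file from the chosen revision and let DVC sync the data. The sketch below assumes git and dvc are on the PATH and that the metadata file sits next to the data file with a .dvc suffix; the function names are hypothetical.

```python
import subprocess


def restore_file_commands(revision, data_path):
    """Return the commands that restore one tracked file from `revision`."""
    dvc_file = f"{data_path}.dvc"
    return [
        # Restore only the metadata file; HEAD stays on the current branch.
        ["git", "checkout", revision, "--", dvc_file],
        # Sync the workspace copy of the data with the restored metadata.
        ["dvc", "checkout", dvc_file],
    ]


def restore_file(revision, data_path):
    """Run the restore commands, stopping at the first failure."""
    for cmd in restore_file_commands(revision, data_path):
        subprocess.run(cmd, check=True)
```

Calling restore_file("9360cd7", "datasets/example.csv") would bring back the first version, and restore_file("HEAD", "datasets/example.csv") would return to the version recorded in the latest commit. The data must be present in the local cache, otherwise a dvc pull is needed first.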
Thoughts and Conclusions
It feels to me as if the Git repo used by DVC might benefit from being standalone, rather than being part of a code repo.