Data science and machine learning are iterative processes. It is never possible to successfully complete a data science project in a single pass. A data scientist constantly tries new ideas and changes steps of their pipeline:

This is only a small episode in a data scientist’s daily life, and it is what makes our job different from a regular engineering job.

Business context, ML algorithm knowledge and intuition all help you find a good model faster. But you never know for sure which ideas will bring you the most value.

This is why iteration time is a critical parameter in the data science process. The quicker you iterate, the more ideas you can check and the better the model you can build.

A data science iteration tool

To speed up iterations in data science projects, we have created an open source tool called Data Version Control, or DVC.org.

DVC takes care of the dependencies between the commands that you run, the generated data files, and the code files, and allows you to easily reproduce any step of your research when files change.

You can think of DVC as a Makefile for a data science project, even though you do not create a file explicitly. DVC tracks dependencies in your data science projects when you run data processing or modeling code through a special command, dvc run.

dvc run works as a proxy for your commands. This allows DVC to track input and output files, construct the dependency graph (DAG), and store the command and its parameters for future reproduction.

The command that produces data/Posts.tsv is automatically chained with the next command, because the file data/Posts.tsv is an output of the former and an input of the latter:

	# Split training and testing dataset. Two output files.
	# 0.33 is the test dataset splitting ratio.
	# 20170426 is a seed for randomization.
	$ dvc run python code/split_train_test.py \
	                 data/Posts.tsv 0.33 20170426 \
	                 data/Posts-train.tsv data/Posts-test.tsv

DVC derives the dependencies automatically by looking at the list of parameters (even if your code ignores them) and noting the file changes before and after running the command.
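
To make the idea concrete, here is a minimal Python sketch of how a proxy command could derive dependencies this way. It is not DVC's actual implementation: the function names `snapshot` and `run_and_track` and the layout of the returned record are hypothetical, and the sketch assumes that all tracked files live under a single directory.

```python
import hashlib
import os
import subprocess
import sys
import tempfile


def snapshot(root):
    """Map every file path under root to a SHA-256 digest of its content."""
    state = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as fh:
                state[path] = hashlib.sha256(fh.read()).hexdigest()
    return state


def run_and_track(cmd, root):
    """Run cmd and guess its inputs and outputs (hypothetical helper)."""
    before = snapshot(root)
    subprocess.run(cmd, check=True)
    after = snapshot(root)
    # Outputs: files that are new or whose content changed during the run.
    outputs = sorted(p for p, h in after.items() if before.get(p) != h)
    # Inputs: arguments naming files that already existed and were not rewritten.
    inputs = sorted(a for a in cmd if a in before and a not in outputs)
    return {"cmd": cmd, "inputs": inputs, "outputs": outputs}


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        src = os.path.join(root, "in.txt")
        dst = os.path.join(root, "out.txt")
        with open(src, "w") as fh:
            fh.write("hello\n")
        # A stand-in for a data processing step: copy src to dst in upper case.
        code = "import sys; open(sys.argv[2], 'w').write(open(sys.argv[1]).read().upper())"
        record = run_and_track([sys.executable, "-c", code, src, dst], root)
        print(record["inputs"], record["outputs"])
```

Hashing every file is the simplest way to detect changes, but it is slow on large datasets; a real tool would more likely compare timestamps or sizes first.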

If you change one of your dependencies (data or code), all the affected steps of the pipeline will be reproduced:

	# Change the data preparation code.
	$ vi code/xml_to_tsv.py

	# Reproduce.
	$ dvc repro data/Posts-train.tsv
	Reproducing run command for data item data/Posts.tsv.
	Reproducing run command for data item data/Posts-train.tsv.
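
The selection of affected steps can be sketched in a few lines of Python. This is an illustration only, not DVC's code: the `STAGES` table and the function `stale_outputs` are hypothetical, with file names borrowed from the example above. As a simplification, the table is keyed by output file, so a command with two outputs appears twice.

```python
# Hypothetical stage table: each output file maps to the files it depends on.
STAGES = {
    "data/Posts.tsv": ["code/xml_to_tsv.py", "data/Posts.xml"],
    "data/Posts-train.tsv": ["code/split_train_test.py", "data/Posts.tsv"],
    "data/Posts-test.tsv": ["code/split_train_test.py", "data/Posts.tsv"],
}


def stale_outputs(target, changed, stages=STAGES):
    """Return the outputs to rebuild for target, in dependency order."""
    order = []
    seen = {}  # memo: file path -> whether it is stale

    def visit(path):
        """Return True if path changed or depends on something that changed."""
        if path in seen:
            return seen[path]
        deps = stages.get(path, [])
        # Evaluate every dependency (no short-circuit) so upstream steps
        # are appended to the order before the steps that consume them.
        dep_stale = [visit(dep) for dep in deps]
        stale = path in changed or any(dep_stale)
        seen[path] = stale
        if stale and path in stages:
            order.append(path)
        return stale

    visit(target)
    return order


if __name__ == "__main__":
    # Editing the data preparation code invalidates both downstream steps.
    print(stale_outputs("data/Posts-train.tsv", {"code/xml_to_tsv.py"}))
```

Steps whose dependencies did not change are left out of the result, which is what keeps each iteration short.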

The pipeline might have many steps that form an acyclic graph of dependencies. Below is an example of a canonical machine learning pipeline (more details in the DVC tutorials):

	# Install DVC.
	$ pip install dvc

	# Initialize a DVC repository.
	$ dvc init

	# Download a file and put it in the data/ directory.
	$ dvc import https://s3-us-west-2.amazonaws.com/dvc-share/so/25K/Posts.xml.tgz data/

	# Extract XML from the archive.
	$ dvc run tar zxf data/Posts.xml.tgz -C data/

	# Prepare data.
	$ dvc run python code/xml_to_tsv.py data/Posts.xml data/Posts.tsv python

	# Split training and testing dataset. Two output files.
	# 0.33 is the test dataset splitting ratio. 20170426 is a seed for randomization.
	$ dvc run python code/split_train_test.py data/Posts.tsv 0.33 20170426 data/Posts-train.tsv data/Posts-test.tsv

	# Extract features from text data. Two TSV inputs and two pickle matrix outputs.
	$ dvc run python code/featurization.py data/Posts-train.tsv data/Posts-test.tsv data/matrix-train.p data/matrix-test.p

	# Train the ML model on the training dataset. 20170426 is another seed value.
	$ dvc run python code/train_model.py data/matrix-train.p 20170426 data/model.p

	# Evaluate the model on the testing dataset.
	$ dvc run python code/evaluate.py data/model.p data/matrix-test.p data/evaluation.txt

	# The result.
	$ cat data/evaluation.txt
	AUC: 0.596182

Why are regular pipeline tools not enough?

Regular pipeline tools like Airflow and Luigi are good for representing static and fault-tolerant workflows. A huge portion of their functionality is created for monitoring, optimization and fault tolerance. These are very important and business-critical problems. However, these problems are irrelevant to data scientists’ daily lives.

Data scientists need a lightweight, dynamic workflow management system. In contrast to traditional Airflow-like systems, DVC reflects the process of researching and looking for a great model (and pipeline), not optimizing and monitoring an existing one. This is why DVC is a good fit for iterative machine learning processes. Once a good model has been found with DVC, the result can be incorporated into a data engineering pipeline (Luigi or Airflow).

Pipelines and data sharing

In addition to pipeline description, data reproduction and its dynamic nature, DVC has one more important feature: it was designed in accordance with software engineering best practices. DVC is based on Git. It keeps the code and stores the DAG in the Git repository, which allows you to share your research results. But it moves the actual file content outside the Git repository (into the .cache directory, which DVC includes in .gitignore), since Git is not designed to accommodate large data files.
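
One way to picture keeping large files out of Git is a content-addressed cache. The Python sketch below is an illustration built on assumptions, not a description of DVC's internals: the function `move_to_cache`, the hash-based file naming, and the use of a symbolic link are all hypothetical choices, and only the `.cache` directory name comes from the text above. It also assumes a system where symbolic links are available.

```python
import hashlib
import os
import shutil
import tempfile


def move_to_cache(path, cache_dir):
    """Move path into cache_dir under its content hash; leave a symlink behind."""
    with open(path, "rb") as fh:
        digest = hashlib.sha256(fh.read()).hexdigest()
    os.makedirs(cache_dir, exist_ok=True)
    cached = os.path.join(cache_dir, digest)
    if os.path.exists(cached):
        os.remove(path)  # identical content is already cached
    else:
        shutil.move(path, cached)
    # In this sketch only the small link stays in the working tree.
    os.symlink(os.path.abspath(cached), path)
    return cached


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as repo:
        data = os.path.join(repo, "Posts.tsv")
        with open(data, "w") as fh:
            fh.write("id\ttext\n1\thello\n")
        cached = move_to_cache(data, os.path.join(repo, ".cache"))
        print(os.path.islink(data), os.path.basename(cached)[:8])
        with open(data) as fh:  # reading through the link still works
            print(fh.read().splitlines()[0])
```

Naming cached files by their content hash means that two identical files are stored only once, and a changed file never overwrites an earlier version.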

The data files can be shared between data scientists through cloud storage using a single command.

Conclusion

The productivity of data scientists can be improved by speeding up the iteration process, and DVC takes care of this.

We are very interested in your opinion and feedback. Please post your comments here or contact us on Twitter (FullStackML).