{"componentChunkName":"component---src-templates-blog-post-tsx","path":"/how-a-data-scientist-can-improve-his-productivity","result":{"data":{"markdownRemark":{"id":"04248599-1e93-52f2-b646-b908d381a3c6","excerpt":"<p>Data science and machine learning are iterative processes. It is never possible\nto successfully complete a data science project in a single…</p>","html":"<p>Data science and machine learning are iterative processes. It is never possible\nto successfully complete a data science project in a single pass. A data\nscientist constantly tries new ideas and changes steps of his pipeline:</p>\n<ol>\n<li>extract new features and accidentally find noise in the data;</li>\n<li>clean up the noise, find one more promising feature;</li>\n<li>extract the new feature;</li>\n<li>rebuild and validate the model, realize that the learning algorithm\nparameters are not perfect for the new feature set;</li>\n<li>change machine learning algorithm parameters and retrain the model;</li>\n<li>find the ineffective feature subset and remove it from the feature set;</li>\n<li>try a few more new features;</li>\n<li>try another ML algorithm. And then a data format change is required.</li>\n</ol>\n<p>This is only a small episode in a data scientist’s daily life and it is what\nmakes our job different from a regular engineering job.</p>\n<p>Business context, ML algorithm knowledge and intuition all help you to find a\ngood model faster. 
But you never know for sure which ideas will bring you the\nbest value.</p>\n<p>This is why iteration time is a critical parameter in the data science process.\nThe quicker you iterate, the more ideas you can check and the better the model you can build.</p>\n<blockquote>\n<p>“A well-engineered pipeline gets data scientists iterating much faster, which\ncan be a big competitive edge” From\n<a href=\"http://blog.untrod.com/2012/10/engineering-practices-in-data-science.html\">Engineering Practices in Data Science</a>\nby Chris Clark.</p>\n</blockquote>\n<h2>A data science iteration tool</h2>\n<p>To speed up the iterations in data science projects, we have created an open\nsource tool: <a href=\"http://dvc.org\">data version control</a>, or <a href=\"http://dvc.org\">DVC.org</a>.</p>\n<p>DVC takes care of dependencies between the commands that you run, generated data\nfiles, and code files, and allows you to easily reproduce any step of your\nresearch when files change.</p>\n<p>You can think of DVC as a Makefile for a data science project, even though you\ndo not create such a file explicitly. DVC tracks dependencies in your data science\nprojects when you run data processing or modeling code through a special\ncommand:</p>\n<html><head></head><body><div class=\"gatsby-highlight\" data-language=\"dvc\"><pre class=\"language-dvc\"><code class=\"language-dvc\"><span class=\"token line\"><span class=\"token input\">$ </span><span class=\"token dvc\">dvc run</span> python code/xml_to_tsv.py <span class=\"token punctuation\">\\</span>\n                 data/Posts.xml data/Posts.tsv</span></code></pre></div></body></html>\n<p><html><head></head><body><code class=\"language-text\">dvc run</code></body></html> works as a proxy for your commands. 
This allows DVC to track input and\noutput files, construct the dependency graph\n(<a href=\"https://en.wikipedia.org/wiki/Directed_acyclic_graph\">DAG</a>), and store the\ncommand and its parameters so that the command can be reproduced later.</p>\n<p>The previous command is automatically chained with the next command because\nthe file <html><head></head><body><code class=\"language-text\">data/Posts.tsv</code></body></html> is an output of the previous command and an input\nto the next one:</p>\n<html><head></head><body><div class=\"gatsby-highlight\" data-language=\"dvc\"><pre class=\"language-dvc\"><code class=\"language-dvc\"><span class=\"token comment\"># Split training and testing dataset. Two output files.</span>\n<span class=\"token comment\"># 0.33 is the test dataset splitting ratio.</span>\n<span class=\"token comment\"># 20170426 is a seed for randomization.</span>\n<span class=\"token line\"><span class=\"token input\">$ </span><span class=\"token dvc\">dvc run</span> python code/split_train_test.py <span class=\"token punctuation\">\\</span>\n                 data/Posts.tsv <span class=\"token number\">0.33</span> <span class=\"token number\">20170426</span> <span class=\"token punctuation\">\\</span>\n                 data/Posts-train.tsv data/Posts-test.tsv</span></code></pre></div></body></html>\n<p>DVC derives the dependencies automatically by looking at the list of\nparameters (even if your code ignores the parameters) and noting the file\nchanges before and after running the command.</p>\n<p>If you change one of your dependencies (data or code), then all the affected\nsteps of the pipeline will be reproduced:</p>\n<html><head></head><body><div class=\"gatsby-highlight\" data-language=\"dvc\"><pre class=\"language-dvc\"><code class=\"language-dvc\"><span class=\"token comment\"># Change the data preparation code.</span>\n<span class=\"token line\"><span class=\"token input\">$ </span><span class=\"token command\">vi</span> 
code/xml_to_tsv.py\n</span>\n<span class=\"token comment\"># Reproduce.</span>\n<span class=\"token line\"><span class=\"token input\">$ </span><span class=\"token dvc\">dvc repro</span> data/Posts-train.tsv\n</span>Reproducing run command for data item data/Posts.tsv.\nReproducing run command for data item data/Posts-train.tsv.</code></pre></div></body></html>\n<p>This is a powerful way of quickly iterating through your pipeline.</p>\n<p>The pipeline might have many steps and various forms of acyclic dependencies between\nthem. Below is an example of a canonical machine learning pipeline (more\ndetails in <a href=\"https://dvc.org/doc/tutorials\">the DVC tutorials</a>):</p>\n<p><html><head></head><body><div id=\"gist47206784\" class=\"gist\">\n    <div class=\"gist-file\">\n      <div class=\"gist-data\">\n        <div class=\"js-gist-file-update-container js-task-list-container file-box\">\n  <div id=\"file-dvc_pipeline-sh\" class=\"file\">\n    \n\n  <div itemprop=\"text\" class=\"Box-body p-0 blob-wrapper data type-shell\">\n      \n<table class=\"highlight tab-size js-file-line-container\" data-tab-size=\"8\">\n      <tbody><tr>\n        <td id=\"file-dvc_pipeline-sh-L1\" class=\"blob-num js-line-number\" data-line-number=\"1\"></td>\n        <td id=\"file-dvc_pipeline-sh-LC1\" class=\"blob-code blob-code-inner js-file-line\"><span class=\"pl-c\"><span class=\"pl-c\">#</span> Install DVC</span></td>\n      </tr>\n      <tr>\n        <td id=\"file-dvc_pipeline-sh-L2\" class=\"blob-num js-line-number\" data-line-number=\"2\"></td>\n        <td id=\"file-dvc_pipeline-sh-LC2\" class=\"blob-code blob-code-inner js-file-line\">$ pip install dvc</td>\n      </tr>\n      <tr>\n        <td id=\"file-dvc_pipeline-sh-L3\" class=\"blob-num js-line-number\" data-line-number=\"3\"></td>\n        <td id=\"file-dvc_pipeline-sh-LC3\" class=\"blob-code blob-code-inner js-file-line\">\n</td>\n      </tr>\n      <tr>\n        <td id=\"file-dvc_pipeline-sh-L4\" class=\"blob-num 
js-line-number\" data-line-number=\"4\"></td>\n        <td id=\"file-dvc_pipeline-sh-LC4\" class=\"blob-code blob-code-inner js-file-line\"><span class=\"pl-c\"><span class=\"pl-c\">#</span> Initialize DVC repository</span></td>\n      </tr>\n      <tr>\n        <td id=\"file-dvc_pipeline-sh-L5\" class=\"blob-num js-line-number\" data-line-number=\"5\"></td>\n        <td id=\"file-dvc_pipeline-sh-LC5\" class=\"blob-code blob-code-inner js-file-line\">$ dvc init</td>\n      </tr>\n      <tr>\n        <td id=\"file-dvc_pipeline-sh-L6\" class=\"blob-num js-line-number\" data-line-number=\"6\"></td>\n        <td id=\"file-dvc_pipeline-sh-LC6\" class=\"blob-code blob-code-inner js-file-line\">\n</td>\n      </tr>\n      <tr>\n        <td id=\"file-dvc_pipeline-sh-L7\" class=\"blob-num js-line-number\" data-line-number=\"7\"></td>\n        <td id=\"file-dvc_pipeline-sh-LC7\" class=\"blob-code blob-code-inner js-file-line\"><span class=\"pl-c\"><span class=\"pl-c\">#</span> Download a file and put to data/ directory.</span></td>\n      </tr>\n      <tr>\n        <td id=\"file-dvc_pipeline-sh-L8\" class=\"blob-num js-line-number\" data-line-number=\"8\"></td>\n        <td id=\"file-dvc_pipeline-sh-LC8\" class=\"blob-code blob-code-inner js-file-line\">$ dvc import https://s3-us-west-2.amazonaws.com/dvc-share/so/25K/Posts.xml.tgz data/</td>\n      </tr>\n      <tr>\n        <td id=\"file-dvc_pipeline-sh-L9\" class=\"blob-num js-line-number\" data-line-number=\"9\"></td>\n        <td id=\"file-dvc_pipeline-sh-LC9\" class=\"blob-code blob-code-inner js-file-line\">\n</td>\n      </tr>\n      <tr>\n        <td id=\"file-dvc_pipeline-sh-L10\" class=\"blob-num js-line-number\" data-line-number=\"10\"></td>\n        <td id=\"file-dvc_pipeline-sh-LC10\" class=\"blob-code blob-code-inner js-file-line\"><span class=\"pl-c\"><span class=\"pl-c\">#</span> Extract XML from the archive.</span></td>\n      </tr>\n      <tr>\n        <td id=\"file-dvc_pipeline-sh-L11\" class=\"blob-num 
js-line-number\" data-line-number=\"11\"></td>\n        <td id=\"file-dvc_pipeline-sh-LC11\" class=\"blob-code blob-code-inner js-file-line\">$ dvc run tar zxf data/Posts.xml.tgz -C data/</td>\n      </tr>\n      <tr>\n        <td id=\"file-dvc_pipeline-sh-L12\" class=\"blob-num js-line-number\" data-line-number=\"12\"></td>\n        <td id=\"file-dvc_pipeline-sh-LC12\" class=\"blob-code blob-code-inner js-file-line\">\n</td>\n      </tr>\n      <tr>\n        <td id=\"file-dvc_pipeline-sh-L13\" class=\"blob-num js-line-number\" data-line-number=\"13\"></td>\n        <td id=\"file-dvc_pipeline-sh-LC13\" class=\"blob-code blob-code-inner js-file-line\"><span class=\"pl-c\"><span class=\"pl-c\">#</span> Prepare data.</span></td>\n      </tr>\n      <tr>\n        <td id=\"file-dvc_pipeline-sh-L14\" class=\"blob-num js-line-number\" data-line-number=\"14\"></td>\n        <td id=\"file-dvc_pipeline-sh-LC14\" class=\"blob-code blob-code-inner js-file-line\">$ dvc run python code/xml_to_tsv.py data/Posts.xml data/Posts.tsv python</td>\n      </tr>\n      <tr>\n        <td id=\"file-dvc_pipeline-sh-L15\" class=\"blob-num js-line-number\" data-line-number=\"15\"></td>\n        <td id=\"file-dvc_pipeline-sh-LC15\" class=\"blob-code blob-code-inner js-file-line\">\n</td>\n      </tr>\n      <tr>\n        <td id=\"file-dvc_pipeline-sh-L16\" class=\"blob-num js-line-number\" data-line-number=\"16\"></td>\n        <td id=\"file-dvc_pipeline-sh-LC16\" class=\"blob-code blob-code-inner js-file-line\"><span class=\"pl-c\"><span class=\"pl-c\">#</span> Split training and testing dataset. Two output files.</span></td>\n      </tr>\n      <tr>\n        <td id=\"file-dvc_pipeline-sh-L17\" class=\"blob-num js-line-number\" data-line-number=\"17\"></td>\n        <td id=\"file-dvc_pipeline-sh-LC17\" class=\"blob-code blob-code-inner js-file-line\"><span class=\"pl-c\"><span class=\"pl-c\">#</span> 0.33 is the test dataset splitting ratio. 
20170426 is a seed for randomization.</span></td>\n      </tr>\n      <tr>\n        <td id=\"file-dvc_pipeline-sh-L18\" class=\"blob-num js-line-number\" data-line-number=\"18\"></td>\n        <td id=\"file-dvc_pipeline-sh-LC18\" class=\"blob-code blob-code-inner js-file-line\">$ dvc run python code/split_train_test.py data/Posts.tsv 0.33 20170426 data/Posts-train.tsv data/Posts-test.tsv</td>\n      </tr>\n      <tr>\n        <td id=\"file-dvc_pipeline-sh-L19\" class=\"blob-num js-line-number\" data-line-number=\"19\"></td>\n        <td id=\"file-dvc_pipeline-sh-LC19\" class=\"blob-code blob-code-inner js-file-line\">\n</td>\n      </tr>\n      <tr>\n        <td id=\"file-dvc_pipeline-sh-L20\" class=\"blob-num js-line-number\" data-line-number=\"20\"></td>\n        <td id=\"file-dvc_pipeline-sh-LC20\" class=\"blob-code blob-code-inner js-file-line\"><span class=\"pl-c\"><span class=\"pl-c\">#</span> Extract features from text data. Two TSV inputs and two pickle matrixes outputs.</span></td>\n      </tr>\n      <tr>\n        <td id=\"file-dvc_pipeline-sh-L21\" class=\"blob-num js-line-number\" data-line-number=\"21\"></td>\n        <td id=\"file-dvc_pipeline-sh-LC21\" class=\"blob-code blob-code-inner js-file-line\">$ dvc run python code/featurization.py data/Posts-train.tsv data/Posts-test.tsv data/matrix-train.p data/matrix-test.p</td>\n      </tr>\n      <tr>\n        <td id=\"file-dvc_pipeline-sh-L22\" class=\"blob-num js-line-number\" data-line-number=\"22\"></td>\n        <td id=\"file-dvc_pipeline-sh-LC22\" class=\"blob-code blob-code-inner js-file-line\">\n</td>\n      </tr>\n      <tr>\n        <td id=\"file-dvc_pipeline-sh-L23\" class=\"blob-num js-line-number\" data-line-number=\"23\"></td>\n        <td id=\"file-dvc_pipeline-sh-LC23\" class=\"blob-code blob-code-inner js-file-line\"><span class=\"pl-c\"><span class=\"pl-c\">#</span> Train ML model out of the training dataset. 
20170426 is another seed value.</span></td>\n      </tr>\n      <tr>\n        <td id=\"file-dvc_pipeline-sh-L24\" class=\"blob-num js-line-number\" data-line-number=\"24\"></td>\n        <td id=\"file-dvc_pipeline-sh-LC24\" class=\"blob-code blob-code-inner js-file-line\">$ dvc run python code/train_model.py data/matrix-train.p 20170426 data/model.p</td>\n      </tr>\n      <tr>\n        <td id=\"file-dvc_pipeline-sh-L25\" class=\"blob-num js-line-number\" data-line-number=\"25\"></td>\n        <td id=\"file-dvc_pipeline-sh-LC25\" class=\"blob-code blob-code-inner js-file-line\">\n</td>\n      </tr>\n      <tr>\n        <td id=\"file-dvc_pipeline-sh-L26\" class=\"blob-num js-line-number\" data-line-number=\"26\"></td>\n        <td id=\"file-dvc_pipeline-sh-LC26\" class=\"blob-code blob-code-inner js-file-line\"><span class=\"pl-c\"><span class=\"pl-c\">#</span> Evaluate the model by the testing dataset.</span></td>\n      </tr>\n      <tr>\n        <td id=\"file-dvc_pipeline-sh-L27\" class=\"blob-num js-line-number\" data-line-number=\"27\"></td>\n        <td id=\"file-dvc_pipeline-sh-LC27\" class=\"blob-code blob-code-inner js-file-line\">$ dvc run python code/evaluate.py data/model.p data/matrix-test.p data/evaluation.txt</td>\n      </tr>\n      <tr>\n        <td id=\"file-dvc_pipeline-sh-L28\" class=\"blob-num js-line-number\" data-line-number=\"28\"></td>\n        <td id=\"file-dvc_pipeline-sh-LC28\" class=\"blob-code blob-code-inner js-file-line\">\n</td>\n      </tr>\n      <tr>\n        <td id=\"file-dvc_pipeline-sh-L29\" class=\"blob-num js-line-number\" data-line-number=\"29\"></td>\n        <td id=\"file-dvc_pipeline-sh-LC29\" class=\"blob-code blob-code-inner js-file-line\"><span class=\"pl-c\"><span class=\"pl-c\">#</span> The result.</span></td>\n      </tr>\n      <tr>\n        <td id=\"file-dvc_pipeline-sh-L30\" class=\"blob-num js-line-number\" data-line-number=\"30\"></td>\n        <td id=\"file-dvc_pipeline-sh-LC30\" class=\"blob-code 
blob-code-inner js-file-line\">$ cat data/evaluation.txt</td>\n      </tr>\n      <tr>\n        <td id=\"file-dvc_pipeline-sh-L31\" class=\"blob-num js-line-number\" data-line-number=\"31\"></td>\n        <td id=\"file-dvc_pipeline-sh-LC31\" class=\"blob-code blob-code-inner js-file-line\">AUC: 0.596182</td>\n      </tr>\n</tbody></table>\n\n\n  </div>\n\n  </div>\n</div>\n\n      </div>\n      <div class=\"gist-meta\">\n        <a href=\"https://gist.github.com/dmpetrov/7704a5156bdc32c7379580a61e2fe3b6/raw/166cf09a233861902f1765e9179c1dce556fdcf5/dvc_pipeline.sh\" style=\"float:right\">view raw</a>\n        <a href=\"https://gist.github.com/dmpetrov/7704a5156bdc32c7379580a61e2fe3b6#file-dvc_pipeline-sh\">dvc_pipeline.sh</a>\n        hosted with ❤ by <a href=\"https://github.com\">GitHub</a>\n      </div>\n    </div>\n</div></body></html></p>\n<h2>Why are regular pipeline tools not enough?</h2>\n<blockquote>\n<p>“Workflows are expected to be mostly static or slowly changing.” (See\n<a href=\"https://airflow.incubator.apache.org/\">Airflow</a>.)</p>\n</blockquote>\n<p>Regular pipeline tools like <a href=\"http://airflow.incubator.apache.org\">Airflow</a> and\n<a href=\"https://github.com/spotify/luigi\">Luigi</a> are good for representing static and\nfault-tolerant workflows. A huge portion of their functionality exists for\nmonitoring, optimization and fault tolerance. These are very important and\nbusiness-critical problems. However, these problems are irrelevant to data\nscientists’ daily lives.</p>\n<p>Data scientists need a lightweight, dynamic workflow management system. In\ncontrast to traditional Airflow-like systems, DVC reflects the process of\nresearching and looking for a great model (and pipeline), not optimizing and\nmonitoring an existing one. This is why DVC is a good fit for iterative machine\nlearning processes. 
Once a good model has been discovered with DVC, the result can\nbe incorporated into a data engineering pipeline (Luigi or Airflow).</p>\n<h2>Pipelines and data sharing</h2>\n<p>In addition to pipeline description, data reproduction and its dynamic nature, DVC\nhas one more important feature. It was designed in accordance with the best\nsoftware engineering practices. DVC is based on Git. It keeps the code and stores the\nDAG in the Git repository, which allows you to share your research results. However,\nit moves the actual file content outside the Git repository (into the <html><head></head><body><code class=\"language-text\">.cache</code></body></html>\ndirectory, which DVC adds to <html><head></head><body><code class=\"language-text\">.gitignore</code></body></html>) since Git is not designed to\naccommodate large data files.</p>\n<p>The data files can be shared between data scientists through cloud storage\nusing a simple command:</p>\n<html><head></head><body><div class=\"gatsby-highlight\" data-language=\"dvc\"><pre class=\"language-dvc\"><code class=\"language-dvc\"><span class=\"token comment\"># Data scientist 1 syncs data to the cloud.</span>\n<span class=\"token line\"><span class=\"token input\">$ </span><span class=\"token command\">dvc</span> <span class=\"token function\">sync</span> data/</span></code></pre></div></body></html>\n<p><html><head></head><body><span class=\"gatsby-resp-image-wrapper\" style=\"position: relative; display: block; margin-left: auto; margin-right: auto;  max-width: 307px;\">\n      <a class=\"gatsby-resp-image-link\" href=\"/static/6890171452971f3e3cd847014a526e03/937a5/git-server-or-github.jpg\" style=\"display: block\" target=\"_blank\" rel=\"noopener\">\n    <span class=\"gatsby-resp-image-background-image\" style=\"padding-bottom: 58.63192182410424%; position: relative; bottom: 0; left: 0; background-image: 
url(&#x27;data:image/jpeg;base64,/9j/2wBDABALDA4MChAODQ4SERATGCgaGBYWGDEjJR0oOjM9PDkzODdASFxOQERXRTc4UG1RV19iZ2hnPk1xeXBkeFxlZ2P/2wBDARESEhgVGC8aGi9jQjhCY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2P/wgARCAAMABQDASIAAhEBAxEB/8QAGQAAAQUAAAAAAAAAAAAAAAAAAAECAwQF/8QAFgEBAQEAAAAAAAAAAAAAAAAAAgED/9oADAMBAAIQAxAAAAHSitIdQeUf/8QAGRAAAgMBAAAAAAAAAAAAAAAAAQIDEhMR/9oACAEBAAEFArtqsknImJXNb4oAihR//8QAFxEAAwEAAAAAAAAAAAAAAAAAAAERMf/aAAgBAwEBPwGoen//xAAUEQEAAAAAAAAAAAAAAAAAAAAQ/9oACAECAQE/AT//xAAbEAACAgMBAAAAAAAAAAAAAAAAAREhAjFBEv/aAAgBAQAGPwJrhlLWuFnqLHWyEf/EABoQAQEBAQEBAQAAAAAAAAAAAAERADEhQVH/2gAIAQEAAT8h6V6/JmM6eKNR1b93u6MHBII10Ww3/9oADAMBAAIAAwAAABBLD//EABoRAAICAwAAAAAAAAAAAAAAAAABESExQVH/2gAIAQMBAT8QpxzRBtB//8QAFxEAAwEAAAAAAAAAAAAAAAAAAAERMf/aAAgBAgEBPxCMWH//xAAeEAEAAgEEAwAAAAAAAAAAAAABABEhMUFRYZGhsf/aAAgBAQABPxA0Iml8GO5R6JtUb6fsburTRfqC1O83u4XxK0AB3ED5xb28z//Z&#x27;); background-size: cover; display: block;\"></span>\n  <picture>\n        <source srcset=\"/static/6890171452971f3e3cd847014a526e03/c54d4/git-server-or-github.webp 175w, /static/6890171452971f3e3cd847014a526e03/a3432/git-server-or-github.webp 350w, /static/6890171452971f3e3cd847014a526e03/5316f/git-server-or-github.webp 614w\" sizes=\"(max-width: 614px) 100vw, 614px\" type=\"image/webp\">\n        <source srcset=\"/static/6890171452971f3e3cd847014a526e03/8dc06/git-server-or-github.jpg 175w, /static/6890171452971f3e3cd847014a526e03/f4417/git-server-or-github.jpg 350w, /static/6890171452971f3e3cd847014a526e03/937a5/git-server-or-github.jpg 614w\" sizes=\"(max-width: 614px) 100vw, 614px\" type=\"image/jpeg\">\n        <img class=\"gatsby-resp-image-image\" src=\"/static/6890171452971f3e3cd847014a526e03/937a5/git-server-or-github.jpg\" alt=\"git server or github\" title=\"git server or github\" loading=\"lazy\" style=\"width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;\">\n      </picture>\n  </a>\n    </span></body></html></p>\n<p>Currently, AWS S3 and GCP 
storage are supported by DVC.</p>\n<h2>Conclusion</h2>\n<p>The productivity of data scientists can be improved by speeding up iteration\nprocesses, and the DVC tool takes care of this.</p>\n<p>We are very interested in your opinion and feedback. Please post your comments\nhere or contact us on Twitter: <a href=\"https://twitter.com/FullStackML\">FullStackML</a>.</p>\n<p>If you found this tool useful, <strong>please “star” the\n<a href=\"https://github.com/iterative/dvc\">DVC GitHub repository</a></strong>.</p>","timeToRead":6,"fields":{"slug":"/how-a-data-scientist-can-improve-his-productivity"},"frontmatter":{"title":"How A Data Scientist Can Improve His Productivity","date":"May 15, 2017","description":"Data science and machine learning are iterative processes. It is never\npossible to successfully complete a data science project in a single pass.\n","descriptionLong":"Iteration time is a critical parameter in the data science process. The\nquicker you iterate, the more ideas you can check and the better the model you can build.\nThe productivity of data scientists can be improved by speeding up iteration\nprocesses, and the DVC tool takes care of this.\n","tags":["DVC","Productivity","Python","Tutorial"],"commentsUrl":"https://discuss.dvc.org/t/how-a-data-scientist-can-improve-their-productivity/301","author":{"childMarkdownRemark":{"frontmatter":{"name":"Dmitry 
Petrov","avatar":{"childImageSharp":{"fixed":{"base64":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAUCAIAAAAC64paAAAACXBIWXMAAAsSAAALEgHS3X78AAAEvklEQVQ4yw3O51NaBwAA8Ne79pqkPTM0tmZ48bKapK02RnOKUWOMmCDKEIEyFSdxRFHBATgeiiibJ0t4bGQ8dkAEjEQlejYf+qHXu/41zd3vD/gB3kQ0mt1zRQLeeCya2Tc6nTq1OuzbpY1MtuCpvz1rqKp/3YAhjy5IWCO8WhQa1dyB7WER6P1ERh8QyaQd4YAziuTPPkdyab3FaNWpA/Ydg1q5CYKCKd776TlS/zh3cY3MGatGoeuaMOgucjMag6fRgcTHbDKfjeb2js7P0pmUZQsUcTlTDDIb84rW3tLb2S4aH1Gur00KV9/SBmubO+pednb0MDEkMm1kBIhk07lCPnawnzz6pJcuzVEw4DB1kYVfHSA7Nld926qFflp3U8PKohBLH3hc01yDamt8jUXjCO0EIuBLJQKpuCMaCqWSomHmnsf639lhAbHlfXDWY41ZtGGjcoJK3JifZ7L7y+788rjq+Ve1DS31La8BbzSkg3fkMOzx2JHtzX++nH453C/EAge71ozTmLBoEUgmmxrkD7CJBFLRtdLbdx81vsWzxqfHRauAPxHRmSHhlsxpM+cRx7/nJ3/n059DzrxL/2FbagcFO0szkHgGWgPxXbgfr5eV3blPYA/xJDL+hgKAHFZQLVtSyf2RYMYPn8R8MCjQvGOAjM4ZfOvo2yYhE78xxY2YtFM04g8/375+q6KTweHLFMuQAVjXyafFgg295tN5wW9QuhWrmplhMZOgnBjg0wjysV54ZW5bLEjBhgEK6WLpjeKb5fVvutjvZ7F0NmD2OeQmLeSCj85Pz/aQlE2fNKvtyzOHHrNthS8boOwu8UyLE/tuC5tK+e5yccnN8pLyitKKB1dvlQP+VARJRw1eZzx/WDg+OI95onKRdqDHxB9eondxX9XoRun+9VmHTPR7ZdX3RVcvl5bduP/oCaoRhe0EdhNI+ii7E3SHMulIKnGWDKR1EiGpdZmJU71j8HCv1jj44OacCVy4W1l96WpxUclP9/6obe35s2+WDxi9dovfCbnh3OmJIxyO79rzdpVDPI5sCc2zw0b+kEcqsC6NL0yO3Xv+4kLRlUtXih/W1LdT6czpaUBh1YOaLZkJSh0f2sKIXKMMaSUJw9ppCA5siRwrvFkGnlBXPcTm3q998e2Fi1/nVQ0vW7qIT5rRgHgLHBNM8iUirRtW2SygBgpaYckgRTnFkU/0MVvrn1bcpr9po2FopJ7eS9eufXPhYkn5nbuVT4sfPAL8vojd6t1UqZwefyKaiSO5wsEX6TQfV/OYWP+U1FTHZ5J5DBLq12cKKbypsOJIVAq1D0tm9nJngcLBX9n4id+xn018zsaPM7GjRCDbjaUMEroX2UTpKGvtHau/o43yprsHx7FbP0SQI6/jwG3LBn2fgFyqkI4d241RuymOeHIhVybszjBIQ7g2ymA3a76PsjzCIbZ211Q2E7CDcqkbUiHG7ZhC6lVt7gLp2FHE/9GkDui2vD7HPqyP+exp1bqNjOWwSZNjbF4viYuqxqEbSaDQpFMienVYrw6ti+F5nhZIhvN+V0Yj82hkLpclaVQHzbqQQR1QSR0K0LYyp5mb3Fidh1YE0LYqaIRiJl1UpwzKJU5wced/Vc5FEciJlIEAAAAASUVORK5CYII=","width":40,"height":40,"src":"/static/77b5dd3dd92976d1bfd5b3f8a8c6fa01/4d3a4/dmitry_petrov.png","srcSet":"/static/77b5dd3dd92976d1bfd5b3f8a8c6fa01/4d3a4/dmitry_petrov.png 
1x,\n/static/77b5dd3dd92976d1bfd5b3f8a8c6fa01/4c8bc/dmitry_petrov.png 1.5x,\n/static/77b5dd3dd92976d1bfd5b3f8a8c6fa01/c0e17/dmitry_petrov.png 2x","srcWebp":"/static/77b5dd3dd92976d1bfd5b3f8a8c6fa01/e145b/dmitry_petrov.webp","srcSetWebp":"/static/77b5dd3dd92976d1bfd5b3f8a8c6fa01/e145b/dmitry_petrov.webp 1x,\n/static/77b5dd3dd92976d1bfd5b3f8a8c6fa01/0d42c/dmitry_petrov.webp 1.5x,\n/static/77b5dd3dd92976d1bfd5b3f8a8c6fa01/f46db/dmitry_petrov.webp 2x"}}}}}},"picture":{"childImageSharp":{"fluid":{"base64":"data:image/jpeg;base64,/9j/2wBDABALDA4MChAODQ4SERATGCgaGBYWGDEjJR0oOjM9PDkzODdASFxOQERXRTc4UG1RV19iZ2hnPk1xeXBkeFxlZ2P/2wBDARESEhgVGC8aGi9jQjhCY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2P/wgARCAANABQDASIAAhEBAxEB/8QAGAAAAwEBAAAAAAAAAAAAAAAAAAMFAgT/xAAWAQEBAQAAAAAAAAAAAAAAAAABAAL/2gAMAwEAAhADEAAAAWtkdujZNC//xAAYEAEAAwEAAAAAAAAAAAAAAAABAAIDEv/aAAgBAQABBQKsscjDVm2tgdbM/8QAFREBAQAAAAAAAAAAAAAAAAAAACH/2gAIAQMBAT8BiP/EABYRAAMAAAAAAAAAAAAAAAAAAAABEf/aAAgBAgEBPwGMjP/EABsQAAEEAwAAAAAAAAAAAAAAAAABMUGRAlFh/9oACAEBAAY/AnQ5sfEihEIo/8QAHBABAAIDAAMAAAAAAAAAAAAAAQAhEUFRMWGR/9oACAEBAAE/IUZCnsum+E8rJ9w+vlCcmHSRDLD/2gAMAwEAAgADAAAAEBAv/8QAFhEAAwAAAAAAAAAAAAAAAAAAABFR/9oACAEDAQE/EHA4P//EABURAQEAAAAAAAAAAAAAAAAAAABh/9oACAECAQE/EKKP/8QAHhABAQACAQUBAAAAAAAAAAAAAREAITFhcZGx0fD/2gAIAQEAAT8QqHeCGOBCFAEbxzJgLCWkUfGBw/R0x8A1QJoJ7cmVMmx8z//Z","aspectRatio":1.499388004895961,"src":"/static/24507c6d2f27fa69b97f6ea7c01ee0d5/6fdf8/post-image.jpg","srcSet":"/static/24507c6d2f27fa69b97f6ea7c01ee0d5/9fc73/post-image.jpg 213w,\n/static/24507c6d2f27fa69b97f6ea7c01ee0d5/ee221/post-image.jpg 425w,\n/static/24507c6d2f27fa69b97f6ea7c01ee0d5/6fdf8/post-image.jpg 850w,\n/static/24507c6d2f27fa69b97f6ea7c01ee0d5/15102/post-image.jpg 1225w","srcWebp":"/static/24507c6d2f27fa69b97f6ea7c01ee0d5/5c1d9/post-image.webp","srcSetWebp":"/static/24507c6d2f27fa69b97f6ea7c01ee0d5/99b2d/post-image.webp 213w,\n/static/24507c6d2f27fa69b97f6ea7c01ee0d5/23220/post-image.webp 
425w,\n/static/24507c6d2f27fa69b97f6ea7c01ee0d5/5c1d9/post-image.webp 850w,\n/static/24507c6d2f27fa69b97f6ea7c01ee0d5/a5363/post-image.webp 1225w","sizes":"(max-width: 850px) 100vw, 850px","presentationWidth":850}}},"pictureComment":null}}},"pageContext":{"next":{"fields":{"slug":"/r-code-and-reproducible-model-development-with-dvc"},"frontmatter":{"title":"R code and reproducible model development with DVC"}},"previous":null,"currentPage":20,"slug":"/how-a-data-scientist-can-improve-his-productivity"}}}