<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Data Version Control · DVC]]></title><description><![CDATA[Data Version Control Blog. We write about machine learning workflow. From data versioning and processing to model productionization. We share our news, findings, interesting reads, community takeaways.]]></description><link>https://blog.dvc.org</link><generator>GatsbyJS</generator><lastBuildDate>Mon, 17 Feb 2020 18:03:26 GMT</lastBuildDate><item><title><![CDATA[February '20 DVC❤️Heartbeat]]></title><link>https://blog.dvc.org/february-20-dvc-heartbeat</link><guid isPermaLink="false">https://blog.dvc.org/february-20-dvc-heartbeat</guid><pubDate>Mon, 10 Feb 2020 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Welcome to the February Heartbeat! This month’s featured image is a DVC pipeline
&lt;a href=&quot;https://medium.com/nlp-trend-and-review-en/use-dvc-to-version-control-ml-dl-models-bef61dbfe477&quot;&gt;created by one of our users&lt;/a&gt;,
which &lt;em&gt;we&lt;/em&gt; think resembles a valentine. Here are some more highlights from our
team and our community:&lt;/p&gt;
&lt;h2&gt;News&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Our team is growing!&lt;/strong&gt; In early January, DVC gained two new folks: engineer
&lt;a href=&quot;https://github.com/skshetry&quot;&gt;Saugat Pachhai&lt;/a&gt; and data scientist
&lt;a href=&quot;https://twitter.com/andronovhopf&quot;&gt;Elle O’Brien&lt;/a&gt;. Saugat, based in Nepal, will
be contributing to core DVC. Elle (that’s me!), currently in San Francisco, will
be leading data science projects and outreach with DVC.&lt;/p&gt;
&lt;p&gt;We’re &lt;strong&gt;gearing up for a spring full of talks&lt;/strong&gt; about DVC projects, including
new up-and-coming features for data cataloging and continuous integration. Here
are just a few events that have been added to our schedule:&lt;/p&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;section class=&quot;elp-content-holder&quot;&gt;
      &lt;a href=&quot;https://www.mlprague.com/#schedule-saturday&quot; class=&quot;external-link-preview&quot;&gt;
          &lt;div class=&quot;elp-description-holder&quot;&gt;
            &lt;h4 class=&quot;elp-title&quot;&gt;Machine Learning Prague - March 19&lt;/h4&gt;
            &lt;div class=&quot;elp-description&quot;&gt;DVC engineer Pawel Redzynski will talk about open source tools for versioning machine learning projects.&lt;/div&gt;
            &lt;div class=&quot;elp-link&quot;&gt;mlprague.com&lt;/div&gt;
          &lt;/div&gt;
           &lt;div class=&quot;elp-image-holder&quot;&gt;
                &lt;img src=&quot;/uploads/images/2020-02-10/mlprague.jpg&quot; alt=&quot;Machine Learning Prague - March 19&quot;&gt;
            &lt;/div&gt;
      &lt;/a&gt;
    &lt;/section&gt;
    &lt;/body&gt;&lt;/html&gt;&lt;/body&gt;&lt;/html&gt;&lt;/p&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;section class=&quot;elp-content-holder&quot;&gt;
      &lt;a href=&quot;https://divops.org/&quot; class=&quot;external-link-preview&quot;&gt;
          &lt;div class=&quot;elp-description-holder&quot;&gt;
            &lt;h4 class=&quot;elp-title&quot;&gt;DivOps 2020 - March 24&lt;/h4&gt;
            &lt;div class=&quot;elp-description&quot;&gt;Elle O&apos;Brien is talking about open source software in the growing field of MLOps at this international, remote conference.&lt;/div&gt;
            &lt;div class=&quot;elp-link&quot;&gt;https://divops.org/&lt;/div&gt;
          &lt;/div&gt;
           &lt;div class=&quot;elp-image-holder&quot;&gt;
                &lt;img src=&quot;/uploads/images/2020-02-10/divops_logo.png&quot; alt=&quot;DivOps 2020 - March 24&quot;&gt;
            &lt;/div&gt;
      &lt;/a&gt;
    &lt;/section&gt;
    &lt;/body&gt;&lt;/html&gt;&lt;/body&gt;&lt;/html&gt;&lt;/p&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;section class=&quot;elp-content-holder&quot;&gt;
      &lt;a href=&quot;https://www.widsconference.org/&quot; class=&quot;external-link-preview&quot;&gt;
          &lt;div class=&quot;elp-description-holder&quot;&gt;
            &lt;h4 class=&quot;elp-title&quot;&gt;Women in Data Science San Diego - May 9&lt;/h4&gt;
            &lt;div class=&quot;elp-description&quot;&gt;Elle O&apos;Brien will be delivering a keynote talk about data catalogs and feature stores.&lt;/div&gt;
            &lt;div class=&quot;elp-link&quot;&gt;https://www.widsconference.org/&lt;/div&gt;
          &lt;/div&gt;
           &lt;div class=&quot;elp-image-holder&quot;&gt;
                &lt;img src=&quot;/uploads/images/2020-02-10/wids.jpeg&quot; alt=&quot;Women in Data Science San Diego - May 9&quot;&gt;
            &lt;/div&gt;
      &lt;/a&gt;
    &lt;/section&gt;
    &lt;/body&gt;&lt;/html&gt;&lt;/body&gt;&lt;/html&gt;&lt;/p&gt;
&lt;p&gt;-Elle O’Brien was recently accepted to give a keynote at
&lt;a href=&quot;https://www.widsconference.org/&quot;&gt;Women in Data Science&lt;/a&gt; San Diego on May 9. The
talk is called “Packaging data and machine learning models for sharing.”&lt;/p&gt;
&lt;p&gt;-Elle will also be speaking at &lt;a href=&quot;https://divops.org/&quot;&gt;DivOps&lt;/a&gt;, a new online
conference about (you guessed it) DevOps, on March 27.&lt;/p&gt;
&lt;p&gt;Look out for more conference announcements soon on our &lt;strong&gt;brand new community
page!&lt;/strong&gt; We’ve &lt;a href=&quot;https://dvc.org/community&quot;&gt;just launched a new hub&lt;/a&gt; for sharing
events, goings-on, and ways to contribute to DVC.&lt;/p&gt;
&lt;h2&gt;From the community&lt;/h2&gt;
&lt;p&gt;Our users continue to put awesome things on the internet. Like this AI blogger
who isn’t afraid to wear his heart on his sleeve.&lt;/p&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;section class=&quot;elp-content-holder&quot;&gt;
      &lt;a href=&quot;https://medium.com/@matlihan/my-favorite-data-science-tool-is-dvc-data-version-control-e6ab8aed24d2&quot; class=&quot;external-link-preview&quot;&gt;
          &lt;div class=&quot;elp-description-holder&quot;&gt;
            &lt;h4 class=&quot;elp-title&quot;&gt;My favorite data science tool is DVC - Data Version Control&lt;/h4&gt;
            &lt;div class=&quot;elp-description&quot;&gt;by Musa Atlıhan&lt;/div&gt;
            &lt;div class=&quot;elp-link&quot;&gt;medium.com&lt;/div&gt;
          &lt;/div&gt;
           &lt;div class=&quot;elp-image-holder&quot;&gt;
                &lt;img src=&quot;/uploads/images/2020-02-10/musa_atlihan.jpeg&quot; alt=&quot;My favorite data science tool is DVC - Data Version Control&quot;&gt;
            &lt;/div&gt;
      &lt;/a&gt;
    &lt;/section&gt;
    &lt;/body&gt;&lt;/html&gt;&lt;/body&gt;&lt;/html&gt;&lt;/p&gt;
&lt;p&gt;Musa Atlihan writes:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;From my experience, whether it is a real-world data science project or it is a
data science competition, there are two major key components for success.
Those components are API simplicity and reproducible pipelines. Since data
science means experimenting a lot in a limited time frame, first, we need
machine learning tools with simplicity and second, we need
reliable/reproducible machine learning pipelines. Thanks to tools like Keras,
LightGBM, and fastai we already have simple yet powerful tools for rapid model
development. And thanks to DVC, we are building large projects with
reproducible pipelines very easily.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It’s cool how Musa puts DVC in context with libraries for model building. In a
way, the libraries that have made it easier than ever to iterate through
different model architectures have increased the need for reproducibility in
proportion.&lt;/p&gt;
&lt;p&gt;Meanwhile in Germany, superusers Marcel Mikl and Bert Besser wrote
&lt;a href=&quot;https://blog.codecentric.de/en/2019/03/walkthrough-dvc/&quot;&gt;another&lt;/a&gt; seriously
comprehensive article about DVC for Codecentric. Marcel and Bert walk readers
through the steps to &lt;strong&gt;build a custom machine learning training pipeline with
remote computing resources&lt;/strong&gt; like GCP and AWS. It’s an excellent guide to
configuring model training with attention to &lt;em&gt;automation&lt;/em&gt; and &lt;em&gt;collaboration&lt;/em&gt;.
We give them 🦉🦉🦉🦉🦉 out of 5.&lt;/p&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;section class=&quot;elp-content-holder&quot;&gt;
      &lt;a href=&quot;https://blog.codecentric.de/en/2020/01/remote-training-gitlab-ci-dvc/&quot; class=&quot;external-link-preview&quot;&gt;
          &lt;div class=&quot;elp-description-holder&quot;&gt;
            &lt;h4 class=&quot;elp-title&quot;&gt;Remote training with GitLab-CI and DVC&lt;/h4&gt;
            &lt;div class=&quot;elp-description&quot;&gt;by Marcel Mikl and Bert Besser&lt;/div&gt;
            &lt;div class=&quot;elp-link&quot;&gt;blog.codecentric.de&lt;/div&gt;
          &lt;/div&gt;
           &lt;div class=&quot;elp-image-holder&quot;&gt;
                &lt;img src=&quot;/uploads/images/2020-02-10/marcel.png&quot; alt=&quot;Remote training with GitLab-CI and DVC&quot;&gt;
            &lt;/div&gt;
      &lt;/a&gt;
    &lt;/section&gt;
    &lt;/body&gt;&lt;/html&gt;&lt;/body&gt;&lt;/html&gt;&lt;/p&gt;
&lt;p&gt;Here are a few more stories on our radar:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;AI Singapore shares their method for AI development and deployment.&lt;/strong&gt; This
&lt;a href=&quot;https://makerspace.aisingapore.org/2020/01/agile-ai-engineering-in-aisg/&quot;&gt;blog about how Agile informs their processes&lt;/a&gt;
for continuous integration and delivery includes data versioning.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Toucan AI dispenses advice for ML engineers.&lt;/strong&gt; This
&lt;a href=&quot;https://toucanai.com/blog/post/building-production-ml/&quot;&gt;blog for practitioners&lt;/a&gt;
discusses questions like, “When to work on ML vs. the processes that surround
ML”. It covers how DVC is used for model versioning in the exploration stage
of ML.&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;DVC at the University.&lt;/strong&gt; A recent
&lt;a href=&quot;https://arxiv.org/pdf/1912.01706.pdf&quot;&gt;pre-print from natural language processing researchers at Université Laval&lt;/a&gt;
explains how DVC facilitated dataset access for collaborators.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;“In our case, the original dataset takes up to 6 Gigabytes. The previous way
of retrieving the dataset over the network with a standard 20 Mbits/sec
internet connexion took up to an hour to complete (including uncompressing
the data). Using DVC reduced the retrieval time of the dataset to 3 minutes
over the network with the same internet connexion.”&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Thanks for sharing; this is a lovely result. Oh, and last…&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;DVC is a job requirement&lt;/strong&gt;! We celebrated a small milestone when we stumbled
across a listing for a data engineer to support R&amp;#x26;D at
&lt;a href=&quot;https://www.elvie.com/en-us/&quot;&gt;Elvie&lt;/a&gt;, a maker of tech for women’s health
(pretty neat mission). The decorations on the job posting are ours 😎&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;span class=&quot;gatsby-resp-image-wrapper&quot; style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto;  max-width: 470px;&quot;&gt;
      &lt;a class=&quot;gatsby-resp-image-link&quot; href=&quot;/static/f0e8a9d4e7525ba2c56504833e14c3cd/4362d/elvie.png&quot; style=&quot;display: block&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;
    &lt;span class=&quot;gatsby-resp-image-background-image&quot; style=&quot;padding-bottom: 83.82978723404257%; position: relative; bottom: 0; left: 0; background-image: url(&amp;#x27;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAARCAYAAADdRIy+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAACVUlEQVQ4y6VUy47TQBDMN/EXHPgNfgQJCXFHnOECh9VeuLMckIggWtiEtXASO++337HHzxRdkzibXQSLYKRS2z097aqebjdKb4180kTSOgegsNtBsPsncDWCs4dwXzxA/P4p/ncxacM/fwT/7WOU8pKmObI0hVLqt0gP+1mWIUkSDb5XVbVnWAbfUcY2kqzCYjHHZrMRu8B0OsVkMsF8PsdyudSWWK1W2r9er3XMbDbT+3me7xNqqoI0VfB9H67rwnEcnZiWB5mEflr6GVfbXyRjV6EqS02dQfwaD9KSKUFG3KOfrOg7ZXi6GvXtkDIPjkYjfYB2OBwIhhjYNkZix+MxLMvSYExRFLp+k7mNzo8vSDN1kCxJWeQgCDQo1fM8+GGk4foBvCBEFCdwxM+9KIqw3W51qUy7jYtP7xAn0U1CfomsbGEzGAzQ65qwjW+wOpcwO19hXrVgXV+hd92WGAvdXg+GYWjJSRKLwnRfw1pyKXVkMsoaSuIo8FFMDWT9z8jHHWl+A7nVQjG4lK7woURRLqDsm1vBbclhGGopvkiinCQrNGKVIowVwm2MUPZZlloywec8zw5tw7GRpqRkMuv3+1oyC981TfRFmiU+R27Zk9Zh/dhCTEoCdd1v9aG+5aKUYPcYTLiHBAQngYfqKbk7QcdJ2SkHxcZAHLoYT+fHVmE9T7Fvo+GxB3kZ9eQQR4b56APS5jOo9iv4C0u3hus6unU4CWR6KvPen0NmvIFqPkd6+RLYTv/qj/LH31cxuoD6+ARZ5zUq5em55iXdd/Duc71+Asu4ECrn2prNAAAAAElFTkSuQmCC&amp;#x27;); background-size: cover; display: block;&quot;&gt;&lt;/span&gt;
  &lt;picture&gt;
        &lt;source srcset=&quot;/static/f0e8a9d4e7525ba2c56504833e14c3cd/c54d4/elvie.webp 175w, /static/f0e8a9d4e7525ba2c56504833e14c3cd/a3432/elvie.webp 350w, /static/f0e8a9d4e7525ba2c56504833e14c3cd/426ac/elvie.webp 700w, /static/f0e8a9d4e7525ba2c56504833e14c3cd/e8e7c/elvie.webp 940w&quot; sizes=&quot;(max-width: 700px) 100vw, 700px&quot; type=&quot;image/webp&quot;&gt;
        &lt;source srcset=&quot;/static/f0e8a9d4e7525ba2c56504833e14c3cd/17006/elvie.png 175w, /static/f0e8a9d4e7525ba2c56504833e14c3cd/d6f3f/elvie.png 350w, /static/f0e8a9d4e7525ba2c56504833e14c3cd/69344/elvie.png 700w, /static/f0e8a9d4e7525ba2c56504833e14c3cd/4362d/elvie.png 940w&quot; sizes=&quot;(max-width: 700px) 100vw, 700px&quot; type=&quot;image/png&quot;&gt;
        &lt;img class=&quot;gatsby-resp-image-image&quot; src=&quot;/static/f0e8a9d4e7525ba2c56504833e14c3cd/69344/elvie.png&quot; alt=&quot;elvie&quot; title=&quot;elvie&quot; loading=&quot;lazy&quot; style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;&gt;
      &lt;/picture&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/body&gt;&lt;/html&gt;&lt;em&gt;A
&lt;a href=&quot;https://www.jobstoday.co.uk/job/40530810/data-engineer/?TrackID=8&quot;&gt;job advertisement&lt;/a&gt;
featuring DVC.&lt;/em&gt;&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Join DVC for Google Summer of Code 2020]]></title><link>https://blog.dvc.org/gsoc-ideas-2020</link><guid isPermaLink="false">https://blog.dvc.org/gsoc-ideas-2020</guid><pubDate>Tue, 04 Feb 2020 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Announcement, announcement! After a successful experience with
&lt;a href=&quot;https://developers.google.com/season-of-docs&quot;&gt;Google Season of Docs&lt;/a&gt; in 2019,
we’re putting out a call for students to apply to work with DVC as part of
&lt;a href=&quot;https://summerofcode.withgoogle.com/&quot;&gt;Google Summer of Code&lt;/a&gt;. If you want to
make a dent in open source software development with mentorship from our team,
read on.&lt;/p&gt;
&lt;h2&gt;Prerequisites to apply&lt;/h2&gt;
&lt;p&gt;Besides the general requirements to apply to Google Summer of Code, there are a
few skills we look for in applicants.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Python experience.&lt;/strong&gt; All of our core development is done in Python, so we
prefer candidates that are experienced in Python. However, we will consider
applicants who are very strong in another language and familiar with Python
basics.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Git experience.&lt;/strong&gt; Git is also a key part of DVC development, as DVC is
built around Git; that said, for certain projects (rated as “Beginner”) a
surface-level knowledge of Git will be sufficient.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;People skills.&lt;/strong&gt; Beyond technical fundamentals, we put a high value on
communication skills: the ability to report and document your experiments and
findings, to work kindly with teammates, and to explain your goals and work
clearly.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;If you like our mission but aren’t sure if you’re sufficiently prepared, please
be in touch anyway. We’d love to hear from you.&lt;/p&gt;
&lt;h2&gt;Project ideas&lt;/h2&gt;
&lt;p&gt;Below are several project ideas that are an immediate priority for the core DVC
team. Of course, we welcome students to create their own proposals, even if they
differ from our ideas. Projects will be primarily mentored by co-founders
&lt;a href=&quot;https://github.com/dmpetrov&quot;&gt;Dmitry Petrov&lt;/a&gt; and
&lt;a href=&quot;https://github.com/shcheklein&quot;&gt;Ivan Shcheklein&lt;/a&gt;.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Migrate to the latest v3 API to improve Google Drive support.&lt;/strong&gt; Our
organization is a co-maintainer of the PyDrive library in collaboration with
a team at Google. The PyDrive library is now several years old and still
relies on the v2 protocol. We would like to migrate to v3, which we expect
will boost performance for many DVC use cases (e.g. the ability to filter
fields being retrieved from our API, etc). For this project, we’re looking
for a student to work with us to prepare the next major version of the
PyDrive library, as well as making important changes to the core DVC code to
support it. Because PyDrive is broadly used outside of DVC, this project is a
chance to work on a library of widespread interest to the Python community.
&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;br&gt;&lt;/body&gt;&lt;/html&gt; &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;br&gt;&lt;/body&gt;&lt;/html&gt; &lt;em&gt;Skills required:&lt;/em&gt; Python, Git, experience with APIs &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;br&gt;&lt;/body&gt;&lt;/html&gt;
&lt;em&gt;Difficulty rating:&lt;/em&gt; Beginner-Medium &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;br&gt;&lt;/body&gt;&lt;/html&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Introducing parallelism to DVC.&lt;/strong&gt; One of DVC’s features is the ability to
create pipelines, linking data repositories with code to process data, train
models, and evaluate model metrics. Once a DVC pipeline is created, the
pipeline can be shared and re-run in a systematic and entirely reproducible
way. Currently, DVC executes pipelines sequentially, even though some steps
may be run in parallel (such as data preprocessing). We would like to support
parallelization for pipeline steps specified by the user. Furthermore, we’ll
need to support building flags into DVC commands that specify the level of
parallelization (CPU, GPU or memory). &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;br&gt;&lt;/body&gt;&lt;/html&gt; &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;br&gt;&lt;/body&gt;&lt;/html&gt; &lt;em&gt;Skills required:&lt;/em&gt;
Python, Git. Some experience with parallelization and/or scientific computing
would be helpful but not required. &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;br&gt;&lt;/body&gt;&lt;/html&gt; &lt;em&gt;Difficulty rating:&lt;/em&gt; Advanced
&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;br&gt;&lt;/body&gt;&lt;/html&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Developing use cases for data registries and ML model zoos.&lt;/strong&gt; A new DVC
functionality that we’re particularly excited about is &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;summon&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;, a method
that can turn remotely-hosted machine learning artifacts such as datasets,
trained models, and more into objects in the user’s local environment (such
as a Jupyter notebook). This is a foundation for creating data catalogs of
data-frames and machine learning model zoos on top of Git repositories and
cloud storages (like GCS or S3). We need to identify and implement model zoos
(think PyTorch Hub, the Caffe Model Zoo, or the TensorFlow DeepLab Model Zoo)
and data registries for types that are not supported by DVC yet. Currently,
we’ve tested &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;summon&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; with PyTorch image segmentation models and Pandas
dataframes. We’re looking for students to explore other possible use cases.
&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;br&gt;&lt;/body&gt;&lt;/html&gt; &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;br&gt;&lt;/body&gt;&lt;/html&gt; &lt;em&gt;Skills required:&lt;/em&gt; Python, Git, and some machine learning or
data science experience &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;br&gt;&lt;/body&gt;&lt;/html&gt; &lt;em&gt;Difficulty rating:&lt;/em&gt; Beginner-Medium &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;br&gt;&lt;/body&gt;&lt;/html&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Continuous delivery for JetBrains TeamCity.&lt;/strong&gt; Continuous integration and
continuous delivery (CI/CD) for ML projects is an area where we see
&lt;a href=&quot;https://martinfowler.com/articles/cd4ml.html&quot;&gt;DVC make a big impact&lt;/a&gt;,
specifically by delivering datasets and ML models into CI/CD pipelines.
While there are many cases when DVC is used inside GitHub Actions and GitLab
CI, you will be transferring this experience to another type of CI/CD system,
&lt;a href=&quot;https://www.jetbrains.com/teamcity/&quot;&gt;JetBrains TeamCity&lt;/a&gt;. We’re working to
integrate DVC’s model and dataset versioning into TeamCity’s CI/CD toolkit.
This project would be ideal for a student looking to explore the growing
field of MLOps, an offshoot of DevOps with the specifics of ML projects at
the center. &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;br&gt;&lt;/body&gt;&lt;/html&gt; &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;br&gt;&lt;/body&gt;&lt;/html&gt; &lt;em&gt;Skills required:&lt;/em&gt; Python, Git, bash scripting. It
would be nice, but not necessary, to have some experience with CI/CD tools
and developer workflow automation. &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;br&gt;&lt;/body&gt;&lt;/html&gt; &lt;em&gt;Difficulty rating:&lt;/em&gt;
Medium-Advanced &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;br&gt;&lt;/body&gt;&lt;/html&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;DVC performance testing framework.&lt;/strong&gt; Performance is a core value of DVC. We
will be creating a performance monitoring and testing framework where new
scenarios (e.g., unit testing) can be populated. The framework should reflect
all performance improvements and degradations for each of the DVC releases.
It would be especially compelling if testing could be integrated with our
GitHub workflow (CI/CD). This is a great opportunity for a student to learn
about DVC and versioning in-depth and contribute to its stability. &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;br&gt;&lt;/body&gt;&lt;/html&gt;
&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;br&gt;&lt;/body&gt;&lt;/html&gt; &lt;em&gt;Skills required:&lt;/em&gt; Python, Git, bash scripting. &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;br&gt;&lt;/body&gt;&lt;/html&gt; &lt;em&gt;Difficulty
rating:&lt;/em&gt; Medium-Advanced &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;br&gt;&lt;/body&gt;&lt;/html&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;If you’d like to apply&lt;/h2&gt;
&lt;p&gt;Please refer to the
&lt;a href=&quot;https://summerofcode.withgoogle.com/&quot;&gt;Google Summer of Code&lt;/a&gt; application guides
for specifics of the program. Students who want to learn more about DVC and our
worldwide community of contributors will learn the most by visiting our
&lt;a href=&quot;https://dvc.org/chat&quot;&gt;Discord channel&lt;/a&gt;,
&lt;a href=&quot;https://github.com/iterative/dvc&quot;&gt;GitHub repository&lt;/a&gt;, and
&lt;a href=&quot;https://discuss.dvc.org/&quot;&gt;Forum&lt;/a&gt;. We are available to discuss project proposals
from interested students and can be reached by &lt;a href=&quot;mailto:support@dvc.org&quot;&gt;email&lt;/a&gt; or on
our Discord channel.&lt;/p&gt;</content:encoded></item><item><title><![CDATA[January '20 Community Gems]]></title><link>https://blog.dvc.org/january-20-community-gems</link><guid isPermaLink="false">https://blog.dvc.org/january-20-community-gems</guid><pubDate>Mon, 20 Jan 2020 00:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;Discord gems&lt;/h2&gt;
&lt;p&gt;There’s a lot of action in our Discord channel these days. Ruslan, DVC’s core
maintainer, said it best with a gif.&lt;/p&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;blockquote class=&quot;twitter-tweet&quot; data-dnt=&quot;true&quot;&gt;&lt;p lang=&quot;en&quot; dir=&quot;ltr&quot;&gt;How it feels when &lt;a href=&quot;https://twitter.com/DVCorg&quot;&gt;@DVCorg&lt;/a&gt; team is handling multiple conversations on Discord at the same time. &lt;a href=&quot;https://t.co/QrLusdWYml&quot;&gt;https://t.co/QrLusdWYml&lt;/a&gt;&lt;/p&gt;— 🦉 Ruslan Kuprieiev (@rkuprieiev) &lt;a href=&quot;https://twitter.com/rkuprieiev/status/1144008869414342658&quot;&gt;June 26, 2019&lt;/a&gt;&lt;/blockquote&gt;&lt;/body&gt;&lt;/html&gt;&lt;/p&gt;
&lt;p&gt;It’s a lot to keep up with, so here are some highlights. We think these are
useful, good-to-know, and interesting conversations between DVC developers and
users.&lt;/p&gt;
&lt;h3&gt;Q: &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/563406153334128681/657590900754612284&quot;&gt;What pros does DVC have compared to Git LFS?&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;For an in-depth answer, check out this
&lt;a href=&quot;https://stackoverflow.com/questions/58541260/difference-between-git-lfs-and-dvc&quot;&gt;Stack Overflow discussion&lt;/a&gt;.
But in brief, with DVC you don’t need a special server, and you can use nearly
any kind of storage (S3, Google Cloud Storage, Azure Blobs, your own server,
etc.) without a fuss. There are also no limits on the size of the data that you
can store, unlike with GitHub. With Git LFS, there are some general LFS server
limits, too. DVC has additional features for sharing your data (e.g.,
&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc import&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;) and has pipeline support, so it does much more than LFS. Plus, we
have flexible and quick checkouts, as we utilize different link types (reflinks,
symlinks, and hardlinks). We think there are lots of advantages; of course, the
usefulness will depend on your particular needs.&lt;/p&gt;
&lt;h3&gt;Q: &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/563406153334128681/656016145119182849&quot;&gt;How do I use DVC with SSH remote storage?&lt;/a&gt; I usually connect with a .pem key file. How do I do the same with DVC?&lt;/h3&gt;
&lt;p&gt;DVC is built to work with the SSH protocol to access remote storage (we provide
some
&lt;a href=&quot;https://dvc.org/doc/user-guide/external-dependencies#ssh&quot;&gt;examples in our official documentation&lt;/a&gt;).
When SSH requires a key file, try this:&lt;/p&gt;
&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;dvc&quot;&gt;&lt;pre class=&quot;language-dvc&quot;&gt;&lt;code class=&quot;language-dvc&quot;&gt;&lt;span class=&quot;token line&quot;&gt;&lt;span class=&quot;token input&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;token dvc&quot;&gt;dvc remote modify&lt;/span&gt; myremote keyfile &lt;span class=&quot;token operator&quot;&gt;&amp;#x3C;&lt;/span&gt;path to *.pem&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/body&gt;&lt;/html&gt;
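&lt;p&gt;As a fuller sketch, the remote has to exist before it can be modified. Assuming a hypothetical host &lt;code class=&quot;language-text&quot;&gt;example.com&lt;/code&gt;, user &lt;code class=&quot;language-text&quot;&gt;ubuntu&lt;/code&gt;, and placeholder paths (none of these come from the original question), the whole setup might look like this:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;dvc&quot;&gt;&lt;pre class=&quot;language-dvc&quot;&gt;&lt;code class=&quot;language-dvc&quot;&gt;# Hypothetical host, user, and paths; substitute your own
$ dvc remote add myremote ssh://ubuntu@example.com/home/ubuntu/dvc-storage
# Point the remote at the private key used for the SSH connection
$ dvc remote modify myremote keyfile /path/to/key.pem
# Upload DVC-tracked data to the SSH remote
$ dvc push -r myremote&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;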
&lt;h3&gt;Q: &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/563406153334128681/651098762466426891&quot;&gt;If you train a TensorFlow model that creates multiple checkpoint files, how do you establish them as dependencies in the DVC pipeline?&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;You can specify a directory as a dependency/output in your DVC pipeline, and
store checkpointed models in that directory. It might look like this:&lt;/p&gt;
&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;dvc&quot;&gt;&lt;pre class=&quot;language-dvc&quot;&gt;&lt;code class=&quot;language-dvc&quot;&gt;&lt;span class=&quot;token line&quot;&gt;&lt;span class=&quot;token input&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;token dvc&quot;&gt;dvc run&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;
     -f train.dvc &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;
     -d data &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;
     -d code/train.py &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;
     -o models python code/train.py&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/body&gt;&lt;/html&gt;
&lt;p&gt;where &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;models&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; is a directory created for checkpoint files. If you would like to
preserve your models in the data directory, though, then you would need to
specify them one by one. You can do this with bash:&lt;/p&gt;
&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;dvc&quot;&gt;&lt;pre class=&quot;language-dvc&quot;&gt;&lt;code class=&quot;language-dvc&quot;&gt;&lt;span class=&quot;token line&quot;&gt;&lt;span class=&quot;token input&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;token dvc&quot;&gt;dvc run&lt;/span&gt; &lt;span class=&quot;token variable&quot;&gt;&lt;span class=&quot;token variable&quot;&gt;$(&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;token for-or-select variable&quot;&gt;file&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;in&lt;/span&gt; data/*.gz&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;do&lt;/span&gt; &lt;span class=&quot;token builtin class-name&quot;&gt;echo&lt;/span&gt; -n &quot;-d $file &quot;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;done&lt;/span&gt;&lt;span class=&quot;token variable&quot;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/body&gt;&lt;/html&gt;
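&lt;p&gt;A variation on the same idea, assuming Bash, collects the flags in an array so that each file name stays a separate argument even if it contains spaces. The stage file name, script path, and output directory below are hypothetical, loosely based on the earlier example:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;dvc&quot;&gt;&lt;pre class=&quot;language-dvc&quot;&gt;&lt;code class=&quot;language-dvc&quot;&gt;# Build one -d flag per data file in a Bash array
$ deps=()
$ for file in data/*.gz; do deps+=(-d &quot;$file&quot;); done
# Hypothetical stage: adjust the stage file, script, and output to your project
$ dvc run -f train.dvc &quot;${deps[@]}&quot; -d code/train.py -o data/models python code/train.py&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;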
&lt;p&gt;Be careful, though: if you declare checkpoint files to be an output of the DVC
pipeline, you won’t be able to re-run the pipeline using those checkpoint files
to initialize weights for model training. This would introduce circularity, as
your output would become your input.&lt;/p&gt;
&lt;p&gt;Also keep in mind that whenever you re-run a pipeline with &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc repro&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;, outputs
are deleted and then regenerated. If you don’t wish to automatically delete
outputs, there is a &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;--persist&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; flag (see discussion
&lt;a href=&quot;https://github.com/iterative/dvc/issues/1214&quot;&gt;here&lt;/a&gt; and
&lt;a href=&quot;https://github.com/iterative/dvc/issues/1884&quot;&gt;here&lt;/a&gt;), although we don’t
currently provide technical support for it.&lt;/p&gt;
&lt;p&gt;Finally, remember that setting something as a dependency (&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;-d&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;) doesn’t mean it
is automatically tracked by DVC, so be sure to &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc add&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; your data files at the
start!&lt;/p&gt;
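&lt;p&gt;As a minimal sketch (not an official DVC recipe), the dependency loop from this answer can be wrapped in a small shell function. The name &lt;code class=&quot;language-text&quot;&gt;dep_flags&lt;/code&gt; is hypothetical; it only builds the &lt;code class=&quot;language-text&quot;&gt;-d&lt;/code&gt; arguments and assumes your file paths contain no spaces:&lt;/p&gt;

```shell
# Hypothetical helper: print one "-d PATH" pair per argument,
# each followed by a space so the pairs never run together.
dep_flags() {
  for file in "$@"; do
    printf '%s %s ' -d "$file"
  done
}

# Example (assumes data/*.gz exist and were tracked with dvc add;
# your-command is a placeholder for the stage command):
#   dvc run $(dep_flags data/*.gz) your-command
```

&lt;p&gt;Because the substitution is left unquoted, the shell splits the output on whitespace, which is why paths with spaces would not survive this approach.&lt;/p&gt;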
&lt;h3&gt;Q: &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/485596304961962003/655012135973158942&quot;&gt;Is it possible to use the same cache directory for multiple DVC repos that are used in parallel?&lt;/a&gt; Or do I need external software to prevent potential race conditions?&lt;/h3&gt;
&lt;p&gt;This is absolutely possible, and you don’t need any external software to safely
use multiple DVC repos in parallel. With DVC, cache operations are atomic. The
only exception is cleaning the cache with &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc gc&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;, which you should only run
when no one else is working on a shared project that is referenced in your cache
(and also, be sure to use the &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;--projects&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; flag
&lt;a href=&quot;https://dvc.org/doc/command-reference/gc&quot;&gt;as described in our docs&lt;/a&gt;). For more
about using multiple DVC repos in parallel, check out some discussions
&lt;a href=&quot;https://discuss.dvc.org/t/setup-dvc-to-work-with-shared-data-on-nas-server/180&quot;&gt;here&lt;/a&gt;
and &lt;a href=&quot;https://dvc.org/doc/use-cases/shared-development-server&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
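&lt;p&gt;For illustration, here is a rough sketch of pointing a repo at a shared cache. The function name and the example path are assumptions, not something from the discussions linked above; the two commands it wraps are &lt;code class=&quot;language-text&quot;&gt;dvc cache dir&lt;/code&gt; and &lt;code class=&quot;language-text&quot;&gt;dvc config cache.shared group&lt;/code&gt;:&lt;/p&gt;

```shell
# Hypothetical helper: point the current DVC repo at a shared cache directory.
# Run it once inside each repo that should use the same cache.
use_shared_cache() {
  shared_cache="$1"                            # e.g. /home/shared/dvc-cache (assumed path)
  dvc cache dir "$shared_cache" || return 1    # keep cached files outside the repo
  dvc config cache.shared group                # give new cache files group permissions
}

# Example:
#   cd my-project
#   use_shared_cache /home/shared/dvc-cache
```

&lt;p&gt;Whether group permissions are the right choice depends on how your team’s accounts are set up, so treat this as a starting point.&lt;/p&gt;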
&lt;h3&gt;Q: &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/485596304961962003/652380507832844328&quot;&gt;What are some strategies for reproducibility if parts of our model training pipeline are run on our organization’s HPC?&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Using DVC for version control is entirely compatible with using remote computing
resources, like high-performance computing (HPC), in your model training
pipeline. We think a great example of using DVC with parallel computing is
provided by &lt;a href=&quot;http://www.peterfogh.dk/&quot;&gt;Peter Fogh&lt;/a&gt;. Take a
&lt;a href=&quot;https://github.com/PeterFogh/dvc_dask_use_case&quot;&gt;look at his repo&lt;/a&gt; for a
detailed use case. Please keep us posted about how HPC works in your pipeline,
as we’ll be eager to pass on any insights to the community.&lt;/p&gt;
&lt;h3&gt;Q: Say I have a Git repository with multiple projects inside (one classification, one object detection, etc.). &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/563406153334128681/646760832616890408&quot;&gt;Is it possible to tell DVC to just pull data for one particular project?&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Absolutely! DVC supports pulling data from different DVC-files. An example would
be having two project subdirectories in your Git repo, &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;classification&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; and
&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;detection&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;. You could use &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc pull -R classification&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; to only pull files in
that project to your workspace.&lt;/p&gt;
&lt;p&gt;If you prefer to be even more granular, you can &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc add&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; files individually.
Then you can use &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc pull &amp;#x3C;filename&gt;.dvc&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; to retrieve the outputs specified
only by that file.&lt;/p&gt;
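&lt;p&gt;The two pull styles described above can be sketched as a pair of shell helpers. The function names and the example paths are hypothetical, chosen only for illustration:&lt;/p&gt;

```shell
# Hypothetical helpers wrapping the two pull styles described above.

# Pull everything tracked by DVC-files under one project directory.
pull_project() {
  dvc pull -R "$1"
}

# Pull only the outputs described by a single DVC-file.
pull_file() {
  dvc pull "$1.dvc"
}

# Example (paths are assumptions):
#   pull_project classification
#   pull_file detection/images.zip
```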
&lt;h3&gt;Q: &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/563406153334128681/623234659098296348&quot;&gt;Is it possible to set an S3 remote without the use of AWS credentials with DVC?&lt;/a&gt; I want to publicly host a dataset so that everybody who clones my code repo can just run &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc pull&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; to fetch the dataset.&lt;/h3&gt;
&lt;p&gt;Yes, and we love the idea of publicly hosting a dataset. There are a few ways to
do it with DVC. We use one method in our own DVC project repository on GitHub.
If you run &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;git clone https://github.com/iterative/dvc&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; and then &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc pull&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;,
you’ll see that DVC is downloading data from an HTTP remote, which is
actually just an S3 bucket that we’ve granted public HTTP read access to.&lt;/p&gt;
&lt;p&gt;So you would need to configure two remotes, each pointing to the same S3
bucket through a different protocol. Like this:&lt;/p&gt;
&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;dvc&quot;&gt;&lt;pre class=&quot;language-dvc&quot;&gt;&lt;code class=&quot;language-dvc&quot;&gt;&lt;span class=&quot;token line&quot;&gt;&lt;span class=&quot;token input&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;token dvc&quot;&gt;dvc remote add&lt;/span&gt; -d --local myremote s3://bucket/path
&lt;/span&gt;&lt;span class=&quot;token line&quot;&gt;&lt;span class=&quot;token input&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;token dvc&quot;&gt;dvc remote add&lt;/span&gt; -d mypublicremote http://s3-external-1.amazonaws.com/bucket/path&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/body&gt;&lt;/html&gt;
&lt;p&gt;Here’s why this works: the &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;-d&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; flag sets the default remote, and the &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;--local&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;
flag writes that setting to a local configuration file, which overrides the
project’s shared settings on your machine and isn’t shared through Git (you
can read more about this
&lt;a href=&quot;https://dvc.org/doc/command-reference/remote/add#remote-add&quot;&gt;in our docs&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;This means that even though you and users from the public are accessing the
stored dataset by different protocols (S3 and HTTP), you’ll all run the same
command: &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc pull&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;.&lt;/p&gt;</content:encoded></item><item><title><![CDATA[January '20 DVC❤️Heartbeat]]></title><link>https://blog.dvc.org/january-20-dvc-heartbeat</link><guid isPermaLink="false">https://blog.dvc.org/january-20-dvc-heartbeat</guid><pubDate>Fri, 17 Jan 2020 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Welcome to the New Year! Time for a recap of the last few weeks of activity in
the DVC community.&lt;/p&gt;
&lt;h2&gt;News&lt;/h2&gt;
&lt;p&gt;We were honored to be named a &lt;a href=&quot;https://ods.ai/awards/2019/&quot;&gt;Project of the Year&lt;/a&gt;
by Open Data Science, Russia’s largest community of data scientists and machine
learning practitioners. Check out our ⭐️incredibly shiny trophy⭐️!&lt;/p&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;blockquote class=&quot;twitter-tweet&quot; data-dnt=&quot;true&quot;&gt;&lt;p lang=&quot;en&quot; dir=&quot;ltr&quot;&gt;DVC is the &quot;project of the year&quot; according to &lt;a href=&quot;https://twitter.com/odsai_en&quot;&gt;@odsai_en&lt;/a&gt;!&lt;br&gt;😱🏆🎉&lt;br&gt;Open Data Science the largest DS community we know, with over 40K active members, great courses and it&apos;s own conf Data Fest.&lt;br&gt;Many thanks to the organizers and voters!&lt;br&gt;This is the best surprize gift for the team!!🥳 &lt;a href=&quot;https://t.co/LZgewjM582&quot;&gt;pic.twitter.com/LZgewjM582&lt;/a&gt;&lt;/p&gt;— 🦉DVC (@DVCorg) &lt;a href=&quot;https://twitter.com/DVCorg/status/1209544709930016768&quot;&gt;December 24, 2019&lt;/a&gt;&lt;/blockquote&gt;&lt;/body&gt;&lt;/html&gt;&lt;/p&gt;
&lt;p&gt;DVC hit &lt;strong&gt;100 individual contributors&lt;/strong&gt; on GitHub! To celebrate our
100&lt;sup&gt;th&lt;/sup&gt; contributor, &lt;a href=&quot;https://github.com/verasativa/&quot;&gt;Vera Sativa&lt;/a&gt;, we
sent her $500 to use on any educational opportunity and her own DeeVee (that’s
our rainbow owl). We also awarded educational mini-grants to two of DVC’s
biggest contributors, &lt;a href=&quot;https://twitter.com/tweetiko&quot;&gt;Vít Novotný&lt;/a&gt; and
&lt;a href=&quot;https://twitter.com/david_prihoda&quot;&gt;David Příhoda&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;span class=&quot;gatsby-resp-image-wrapper&quot; style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto;  max-width: 612px;&quot;&gt;
      &lt;a class=&quot;gatsby-resp-image-link&quot; href=&quot;/static/78b685e283d679c8ebe518ea17520f6d/75999/odd_with_deevee.png&quot; style=&quot;display: block&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;
    &lt;span class=&quot;gatsby-resp-image-background-image&quot; style=&quot;padding-bottom: 103.921568627451%; position: relative; bottom: 0; left: 0; background-image: url(&amp;#x27;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAVCAYAAABG1c6oAAAACXBIWXMAABYlAAAWJQFJUiTwAAAFyklEQVQ4yx2Ue0zUVxqGf0mTZpvspdq6UFcjFQRUkHItVIGBGYbhOhdwBhgYxwHHGWS4dBhgBgGFAYeRmzAUUBAUBKNQ2eXSi6bbakRTu922aXbV2krjbruujRtdk02a7D77kz+efMk553vPyfe+OYIlN4JR525GXVJqtTEYsqMwZoTRbk6g1ZzEXlk4JblvUl6QSHdVGl6bnMpiKZX7MjhaoeS4XUZPbTadDj3W4kwEpTSK3gYjp1stDNRoGSjLpNVWiMeh5bhTS6kulfpqIz53Oed7Kzk/WMtwp52JwSMsjR9lqt/F+xdPcXn5XU6OjSGYCyT0NedzZbSedk00u4ICKMyIoSw/krdNUprLtmEp3kNTXT5fztfwyYKDbtdeWhx5zE/U8N4ZDx9fPMGVCwNU28wIpzpKGHIbGXXk48vcilcRxSF5CJ2GzczVh+MrfxWzKpRPTpt4crWKn1bexlcrZbariIfXmsX1cj7ymXiv56A4hn0Io0c0nDycz0BtDj5rCh5bHhUFSXTZEhhtlmLKCsJaqmXQY+d0by1Dx+14PS7qzBpxLDr6PZWMT/SwdHGEo047wohbfGGrAVOxGr1BS5SmkJrMCJKy1IRmFiPbEYAqV4lUU0yWXIrFaEJ70E1pypvYS/eRkZXDvoI8Orv7yJAlIjgtCm561QwbErClR665OWNNRq/LRCIJY7FWRVlaNPKoIBx5qbj3pxOXkkZuXBgtNgNJkeH0GLKZHOohTxGPUKQIYtahYLwsFVPsb8mNCUIZE4Lf+t/w4osvcCA7EWl0GHp5Al6djLlWMzu2BxD5O3/Sg7eSvXsjJ6t385cpBdO9RQgN+igGqxRUl6SJmQthy4ZX2Pl6ANv916GO3YY8LJR6jYKk8FCsfutY6W5ElaMg0s+fvZGBtJWHcc6dxq0ZFXM+OUJhZpS4KMNcmCySSlzwa7wRGEDYltfoUCdwtUXKh2Nm9MkS3Bs3MVNnRbY7DkEQ8N/wKyqzQsWYJePzHuJUcwGCPj0abVa0aIqCFn0Kmjc28/IvXlpreM68PY1p0dW0iBisv15Pv05NSODra3ubXl1P054QDknlzHWamGyzILRrI7GJgT6gl+HOj+KgRkabs5pmh42OJjufnmzk1vlRps7NMDM1yefXV+jr6iQxNhxZ3C58pWFcqnuLG74CpvtMCLl5evrtpSw2G1AmxtLrrgOe8fOzx2L9H/96+oTVhz/xj8dPefrvZzz7+b9c/ehDBo9YOVFnpFLMrEm3ixGXhpFuI0JwvJwMYyO26iMkyPNod1Xwnyf/ZOnyIhrbfqbmz/Po0SMe/PiQu3e/4bvVB1x+fwl3lZ6uBhPlpYUUFxXQesxJZXcwQkLgy6hSdhCvPoBUbWRuuJ379+8SoZKQVKJka3o8V29c48zEOE5nA/O/X+CDpUto9+ykviSLCqMKs0mH7x0PRmcoQnzoBkrTtpCfEsymkEjxOyrhzu2vkRiUBIliCYXZfPHl59y8cZ0bK9e4d+8O87PT+K37JVv8X0ElOlxqzKfxcAMWVxbCrqANqCVBaCQhbI7LwWQoYPXbe9z67DqeoS5ufnqdHx6s8rcH9/n+29v8uHqbhbkpQrcFEBy4EYM8BUdZCc66CmrayxDeighAKdmOVrYTv/BUdLo8/v6d2Pj9Nzz+YZX7d77ir1/c5Os/XePPKx/w1coyVy6N4rLmU29W0mBWUa1LZ
/JYEx0dLgTpngiKxBxqUrYTn5hKlXU/yzND/HF2kOXJ4yyOH2NhrINLIy1cGGhgptfOdFc1k53ljHdYGG4x0ddQzDlfCyc8jWKw92ZRd6iIEk0yZTmxDLZYGPHWsXSqmcWRRuYHXSJifecw7/qczPY/F63lrLeCM6Lo2c4KznbVMCNe1tdWjVAuOtTrtjPmrWWmz8Efhg+zMNrG8ulW5ocaudhfL9KwxoXnYuKZqW47454KxtzmNSY85Uz21NDVdJD/A4BKz4t1MwtxAAAAAElFTkSuQmCC&amp;#x27;); background-size: cover; display: block;&quot;&gt;&lt;/span&gt;
  &lt;picture&gt;
        &lt;source srcset=&quot;/static/78b685e283d679c8ebe518ea17520f6d/c54d4/odd_with_deevee.webp 175w, /static/78b685e283d679c8ebe518ea17520f6d/a3432/odd_with_deevee.webp 350w, /static/78b685e283d679c8ebe518ea17520f6d/426ac/odd_with_deevee.webp 700w, /static/78b685e283d679c8ebe518ea17520f6d/c139f/odd_with_deevee.webp 1050w, /static/78b685e283d679c8ebe518ea17520f6d/caef1/odd_with_deevee.webp 1224w&quot; sizes=&quot;(max-width: 700px) 100vw, 700px&quot; type=&quot;image/webp&quot;&gt;
        &lt;source srcset=&quot;/static/78b685e283d679c8ebe518ea17520f6d/17006/odd_with_deevee.png 175w, /static/78b685e283d679c8ebe518ea17520f6d/d6f3f/odd_with_deevee.png 350w, /static/78b685e283d679c8ebe518ea17520f6d/69344/odd_with_deevee.png 700w, /static/78b685e283d679c8ebe518ea17520f6d/b1f9d/odd_with_deevee.png 1050w, /static/78b685e283d679c8ebe518ea17520f6d/75999/odd_with_deevee.png 1224w&quot; sizes=&quot;(max-width: 700px) 100vw, 700px&quot; type=&quot;image/png&quot;&gt;
        &lt;img class=&quot;gatsby-resp-image-image&quot; src=&quot;/static/78b685e283d679c8ebe518ea17520f6d/69344/odd_with_deevee.png&quot; alt=&quot;odd with deevee&quot; title=&quot;odd with deevee&quot; loading=&quot;lazy&quot; style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;&gt;
      &lt;/picture&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/body&gt;&lt;/html&gt;&lt;em&gt;Vera (center, flashing a
peace sign) thanked us with this lovely picture of DeeVee and her team,
&lt;a href=&quot;https://odd.co/en/&quot;&gt;Odd Industries&lt;/a&gt;. They are making some extremely neat tools
for construction teams using computer vision.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;We were at PyData LA!&lt;/strong&gt; Our fearless leader
&lt;a href=&quot;https://www.youtube.com/watch?v=7Wsd6V0k4Oc&quot;&gt;Dmitry gave a talk&lt;/a&gt; and we set up
a busy booth to meet with the Pythonistas of Los Angeles. It was a cold and
blustery day, but visitors kept showing up to our semi-outdoor booth. We’re sure
they came for the open source version control and not the donuts.&lt;/p&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;span class=&quot;gatsby-resp-image-wrapper&quot; style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto;  max-width: 512px;&quot;&gt;
      &lt;a class=&quot;gatsby-resp-image-link&quot; href=&quot;/static/c827a7148f442ec7b39f79659a697878/e937d/py_data1.jpg&quot; style=&quot;display: block&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;
    &lt;span class=&quot;gatsby-resp-image-background-image&quot; style=&quot;padding-bottom: 75%; position: relative; bottom: 0; left: 0; background-image: url(&amp;#x27;data:image/jpeg;base64,/9j/2wBDABALDA4MChAODQ4SERATGCgaGBYWGDEjJR0oOjM9PDkzODdASFxOQERXRTc4UG1RV19iZ2hnPk1xeXBkeFxlZ2P/2wBDARESEhgVGC8aGi9jQjhCY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2P/wgARCAAPABQDASIAAhEBAxEB/8QAFwABAQEBAAAAAAAAAAAAAAAABAABA//EABUBAQEAAAAAAAAAAAAAAAAAAAIA/9oADAMBAAIQAxAAAAHlpWyFJmP/xAAZEAADAQEBAAAAAAAAAAAAAAABAgMSABH/2gAIAQEAAQUC9y1JbdwyslChAFuadEP/xAAWEQEBAQAAAAAAAAAAAAAAAAAAARH/2gAIAQMBAT8BrI//xAAWEQADAAAAAAAAAAAAAAAAAAAAARH/2gAIAQIBAT8BSpT/xAAeEAACAQMFAAAAAAAAAAAAAAAAARECIYESMUFRYf/aAAgBAQAGPwKlvLpNUxT6Qro7RbkjfJ//xAAZEAEBAQEBAQAAAAAAAAAAAAABEQAhQeH/2gAIAQEAAT8hi0oTWZde9dczB8k0DOnTSz5aRiPEG//aAAwDAQACAAMAAAAQMB//xAAWEQEBAQAAAAAAAAAAAAAAAAABABH/2gAIAQMBAT8QwM1//8QAFhEBAQEAAAAAAAAAAAAAAAAAAQAh/9oACAECAQE/EHGMINv/xAAcEAEBAAIDAQEAAAAAAAAAAAABEQAxIWFxQbH/2gAIAQEAAT8QLoSRRHT3k4SarL8XOCeEWxMsGwrL37i5YL7xJtxuFHAKeOf/2Q==&amp;#x27;); background-size: cover; display: block;&quot;&gt;&lt;/span&gt;
  &lt;picture&gt;
        &lt;source srcset=&quot;/static/c827a7148f442ec7b39f79659a697878/c54d4/py_data1.webp 175w, /static/c827a7148f442ec7b39f79659a697878/a3432/py_data1.webp 350w, /static/c827a7148f442ec7b39f79659a697878/426ac/py_data1.webp 700w, /static/c827a7148f442ec7b39f79659a697878/a9a89/py_data1.webp 1024w&quot; sizes=&quot;(max-width: 700px) 100vw, 700px&quot; type=&quot;image/webp&quot;&gt;
        &lt;source srcset=&quot;/static/c827a7148f442ec7b39f79659a697878/8dc06/py_data1.jpg 175w, /static/c827a7148f442ec7b39f79659a697878/f4417/py_data1.jpg 350w, /static/c827a7148f442ec7b39f79659a697878/571ad/py_data1.jpg 700w, /static/c827a7148f442ec7b39f79659a697878/e937d/py_data1.jpg 1024w&quot; sizes=&quot;(max-width: 700px) 100vw, 700px&quot; type=&quot;image/jpeg&quot;&gt;
        &lt;img class=&quot;gatsby-resp-image-image&quot; src=&quot;/static/c827a7148f442ec7b39f79659a697878/571ad/py_data1.jpg&quot; alt=&quot;py data1&quot; title=&quot;py data1&quot; loading=&quot;lazy&quot; style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;&gt;
      &lt;/picture&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/body&gt;&lt;/html&gt;
&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;span class=&quot;gatsby-resp-image-wrapper&quot; style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto;  max-width: 512px;&quot;&gt;
      &lt;a class=&quot;gatsby-resp-image-link&quot; href=&quot;/static/76308821da8925b6cf7540b9b0b1ea3f/e937d/py_data2.jpg&quot; style=&quot;display: block&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;
    &lt;span class=&quot;gatsby-resp-image-background-image&quot; style=&quot;padding-bottom: 75%; position: relative; bottom: 0; left: 0; background-image: url(&amp;#x27;data:image/jpeg;base64,/9j/2wBDABALDA4MChAODQ4SERATGCgaGBYWGDEjJR0oOjM9PDkzODdASFxOQERXRTc4UG1RV19iZ2hnPk1xeXBkeFxlZ2P/2wBDARESEhgVGC8aGi9jQjhCY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2P/wgARCAAPABQDASIAAhEBAxEB/8QAFwAAAwEAAAAAAAAAAAAAAAAAAAMEBf/EABUBAQEAAAAAAAAAAAAAAAAAAAIB/9oADAMBAAIQAxAAAAF886St4mDf/8QAHBAAAgICAwAAAAAAAAAAAAAAAgMAEQEEEiEi/9oACAEBAAEFAmvuBsZ5AfRpoBXFh5//xAAWEQEBAQAAAAAAAAAAAAAAAAAAEQH/2gAIAQMBAT8Bmo//xAAVEQEBAAAAAAAAAAAAAAAAAAAAEf/aAAgBAgEBPwGq/8QAGRAAAgMBAAAAAAAAAAAAAAAAASEAEBIi/9oACAEBAAY/Ahgqj0A4HHX/xAAZEAADAQEBAAAAAAAAAAAAAAAAASERUTH/2gAIAQEAAT8hodG+DK12dNWLSzaY3kD9qGq7T//aAAwDAQACAAMAAAAQg8//xAAWEQADAAAAAAAAAAAAAAAAAAAQESH/2gAIAQMBAT8Qow//xAAWEQEBAQAAAAAAAAAAAAAAAAABABH/2gAIAQIBAT8QBmSL/8QAGxABAQACAwEAAAAAAAAAAAAAAREAITFBYYH/2gAIAQEAAT8QIBDqOxNI4OyoFYb45xD0Jj9cDxqmr3cUG1VmbHI0KdHGf//Z&amp;#x27;); background-size: cover; display: block;&quot;&gt;&lt;/span&gt;
  &lt;picture&gt;
        &lt;source srcset=&quot;/static/76308821da8925b6cf7540b9b0b1ea3f/c54d4/py_data2.webp 175w, /static/76308821da8925b6cf7540b9b0b1ea3f/a3432/py_data2.webp 350w, /static/76308821da8925b6cf7540b9b0b1ea3f/426ac/py_data2.webp 700w, /static/76308821da8925b6cf7540b9b0b1ea3f/a9a89/py_data2.webp 1024w&quot; sizes=&quot;(max-width: 700px) 100vw, 700px&quot; type=&quot;image/webp&quot;&gt;
        &lt;source srcset=&quot;/static/76308821da8925b6cf7540b9b0b1ea3f/8dc06/py_data2.jpg 175w, /static/76308821da8925b6cf7540b9b0b1ea3f/f4417/py_data2.jpg 350w, /static/76308821da8925b6cf7540b9b0b1ea3f/571ad/py_data2.jpg 700w, /static/76308821da8925b6cf7540b9b0b1ea3f/e937d/py_data2.jpg 1024w&quot; sizes=&quot;(max-width: 700px) 100vw, 700px&quot; type=&quot;image/jpeg&quot;&gt;
        &lt;img class=&quot;gatsby-resp-image-image&quot; src=&quot;/static/76308821da8925b6cf7540b9b0b1ea3f/571ad/py_data2.jpg&quot; alt=&quot;py data2&quot; title=&quot;py data2&quot; loading=&quot;lazy&quot; style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;&gt;
      &lt;/picture&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/body&gt;&lt;/html&gt; &lt;em&gt;The DVC team and PyData
volunteers who heroically staffed our booth in the rain.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Our engineer and technical writer Jorge reported:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We were super happy to meet all kinds of data professionals and enthusiasts in
several fields who are learning and adopting DVC with their teams – including
several working with privacy-sensitive medical records, very cool!&lt;/p&gt;
&lt;/blockquote&gt;
&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;hr&gt;&lt;/body&gt;&lt;/html&gt;
&lt;h2&gt;From the community&lt;/h2&gt;
&lt;p&gt;Here are some rumblings from the machine learning (ML) and data science
community that got us talking.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;A machine learning software wishlist.&lt;/strong&gt; Computer scientist and writer
&lt;a href=&quot;https://twitter.com/chipro&quot;&gt;Chip Huyen&lt;/a&gt; tweeted about her ML software wishlist
and kicked off a big community discussion.&lt;/p&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;blockquote class=&quot;twitter-tweet&quot; data-dnt=&quot;true&quot;&gt;&lt;p lang=&quot;en&quot; dir=&quot;ltr&quot;&gt;I&apos;ve been thinking about the software stack for machine learning. Tools I&apos;d love to see.&lt;br&gt;&lt;br&gt;1. Pip for pretrained models.&lt;br&gt;2. Version control for datasets.&lt;br&gt;3. GPU-friendly CI. Travis CI, Circe CI don&apos;t support GPUs. Jenkins is a pain.&lt;br&gt;4. Fast dataframes. Why is Pandas so slow?&lt;/p&gt;— Chip Huyen (@chipro) &lt;a href=&quot;https://twitter.com/chipro/status/1202815757593108480&quot;&gt;December 6, 2019&lt;/a&gt;&lt;/blockquote&gt;&lt;/body&gt;&lt;/html&gt;&lt;/p&gt;
&lt;p&gt;Her tweet resonated with a lot of practitioners, who were eager to discuss the
solutions they’d tried. Among the many thoughtful replies and recommendations,
we were thrilled to see DVC mentioned.&lt;/p&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;blockquote class=&quot;twitter-tweet&quot; data-dnt=&quot;true&quot;&gt;&lt;p lang=&quot;en&quot; dir=&quot;ltr&quot;&gt;We&apos;re using &lt;a href=&quot;https://twitter.com/DVCorg&quot;&gt;@DVCorg&lt;/a&gt; for 2) and it works great. 🙂&lt;/p&gt;— Kristijan Ivancic (@kristijan_ivanc) &lt;a href=&quot;https://twitter.com/kristijan_ivanc/status/1202879739716870144&quot;&gt;December 6, 2019&lt;/a&gt;&lt;/blockquote&gt;&lt;/body&gt;&lt;/html&gt;&lt;/p&gt;
&lt;p&gt;If you haven’t already, definitely check out Chip’s
&lt;a href=&quot;https://twitter.com/chipro/status/1202815757593108480&quot;&gt;thread&lt;/a&gt;, and follow her
on Twitter for more excellent, accessible content about ML engineering. We’re
thinking hard about these ideas and hope the discussion continues on- and
offline.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;A gentle intro to DVC for data scientists.&lt;/strong&gt; Scientist
&lt;a href=&quot;https://twitter.com/andronovhopf&quot;&gt;Elle O’Brien&lt;/a&gt; published a code walkthrough
about using DVC to make an image classification project more reproducible.
Specifically, the blog is a case study about version control when a dataset
grows over time. If you’re looking for a DVC tutorial geared for data
scientists, this might be up your alley.&lt;/p&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;section class=&quot;elp-content-holder&quot;&gt;
      &lt;a href=&quot;https://towardsdatascience.com/start-version-controlling-your-machine-learning-datasets-2b872e109856&quot; class=&quot;external-link-preview&quot;&gt;
          &lt;div class=&quot;elp-description-holder&quot;&gt;
            &lt;h4 class=&quot;elp-title&quot;&gt;Start Version Controlling your Machine Learning Datasets&lt;/h4&gt;
            &lt;div class=&quot;elp-description&quot;&gt;Make your machine learning and data science projects reproducible with open source tools.&lt;/div&gt;
            &lt;div class=&quot;elp-link&quot;&gt;medium.com&lt;/div&gt;
          &lt;/div&gt;
           &lt;div class=&quot;elp-image-holder&quot;&gt;
                &lt;img src=&quot;/uploads/images/2020-01-17/medium_1.png&quot; alt=&quot;Start Version Controlling your Machine Learning Datasets&quot;&gt;
            &lt;/div&gt;
      &lt;/a&gt;
    &lt;/section&gt;
    &lt;/body&gt;&lt;/html&gt;&lt;/body&gt;&lt;/html&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ideas for data scientists to level up their code.&lt;/strong&gt; Machine learning engineer
Andrew Greatorex posted a blog called “Down with technical debt! Clean Python
for data scientists.” Andrew highlights something we can easily relate to: the
“science” part of data science, which encourages experimentation and
flexibility, sometimes means less emphasis on readable, shareable code. Andrew
writes:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;“I’m hoping to shed light on some of the ways that more fledgling data
scientists can write cleaner Python code and better structure small scale
projects, with the important side effect of reducing the amount of technical
debt you inadvertently burden on yourself and your team.”&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;In this blog, DVC gets a shout-out as Andrew’s preferred data versioning tool,
used in conjunction with Git for versioning Python code. Thanks!&lt;/p&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;section class=&quot;elp-content-holder&quot;&gt;
      &lt;a href=&quot;https://towardsdatascience.com/down-with-technical-debt-clean-python-for-data-scientists-aa7592eff7fc&quot; class=&quot;external-link-preview&quot;&gt;
          &lt;div class=&quot;elp-description-holder&quot;&gt;
            &lt;h4 class=&quot;elp-title&quot;&gt;Down with technical debt! Clean Python for data scientists.&lt;/h4&gt;
            &lt;div class=&quot;elp-description&quot;&gt;&lt;/div&gt;
            &lt;div class=&quot;elp-link&quot;&gt;medium.com&lt;/div&gt;
          &lt;/div&gt;
           &lt;div class=&quot;elp-image-holder&quot;&gt;
                &lt;img src=&quot;/uploads/images/2020-01-17/medium_2.png&quot; alt=&quot;Down with technical debt! Clean Python for data scientists.&quot;&gt;
            &lt;/div&gt;
      &lt;/a&gt;
    &lt;/section&gt;
    &lt;/body&gt;&lt;/html&gt;&lt;/body&gt;&lt;/html&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;An introduction to MLOps.&lt;/strong&gt; Engineer
&lt;a href=&quot;https://twitter.com/elfouly_sharif&quot;&gt;Sharif Elfouly&lt;/a&gt; wrote an approachable guide
to thinking about MLOps, the growing field around making ML projects run
efficiently from experimentation to production. He summarises why managing ML
projects can be fundamentally different than traditional software development:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;“The main difference between traditional software and ML is that you don’t
only have the code. You also have data, models, and experiments. Writing
traditional software is relatively straightforward but in ML you need to try
out a lot of different things to find the best and fastest model for your
use-case. You have a lot of different model types to choose from and every
single one of them has its specific hyperparameters. Even if you work alone
this can get out of hand pretty quickly.”&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Sharif gives some recommendations for tools that work especially well for ML,
and he writes that DVC is the “perfect combination for versioning your code and
data.” Thanks, Sharif! We think you’re perfect, too.&lt;/p&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;section class=&quot;elp-content-holder&quot;&gt;
      &lt;a href=&quot;https://towardsdatascience.com/down-with-technical-debt-clean-python-for-data-scientists-aa7592eff7fc&quot; class=&quot;external-link-preview&quot;&gt;
          &lt;div class=&quot;elp-description-holder&quot;&gt;
            &lt;h4 class=&quot;elp-title&quot;&gt;MLOps Done Right&lt;/h4&gt;
            &lt;div class=&quot;elp-description&quot;&gt;What is MLOps? Why is it so important? How to do it right!&lt;/div&gt;
            &lt;div class=&quot;elp-link&quot;&gt;medium.com&lt;/div&gt;
          &lt;/div&gt;
           &lt;div class=&quot;elp-image-holder&quot;&gt;
                &lt;img src=&quot;/uploads/images/2020-01-17/medium_3.png&quot; alt=&quot;MLOps Done Right&quot;&gt;
            &lt;/div&gt;
      &lt;/a&gt;
    &lt;/section&gt;
    &lt;/body&gt;&lt;/html&gt;&lt;/body&gt;&lt;/html&gt;&lt;/p&gt;
&lt;p&gt;That’s a wrap for January. We’ll see you next month with more updates!&lt;/p&gt;</content:encoded></item><item><title><![CDATA[November ’19 DVC❤️Heartbeat]]></title><link>https://blog.dvc.org/november-19-dvc-heartbeat</link><guid isPermaLink="false">https://blog.dvc.org/november-19-dvc-heartbeat</guid><pubDate>Sat, 14 Dec 2019 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;The past few months have been so busy and full of great events! We love how
involved our community is and can’t wait to share more with you:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;We organized our very first
&lt;a href=&quot;https://www.meetup.com/San-Francisco-Machine-Learning-Meetup/events/264846847/&quot;&gt;meetup&lt;/a&gt;!
So many great conversations, new use cases and insights! Many thanks to
&lt;a href=&quot;https://www.linkedin.com/in/daniel-fischetti-4a6592bb/&quot;&gt;Dan Fischetti&lt;/a&gt; from
&lt;a href=&quot;https://standard.ai/&quot;&gt;Standard Cognition&lt;/a&gt;, who joined our Dmitry Petrov on
stage. Watch the recording here.&lt;/p&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;iframe width=&quot;100%&quot; height=&quot;315&quot; src=&quot;https://www.youtube-nocookie.com/embed/RHQXK7EC0jI?rel=0&quot; frameborder=&quot;0&quot; allow=&quot;accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture&quot; allowfullscreen&gt;&lt;/iframe&gt;&lt;/body&gt;&lt;/html&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://blog.dataversioncontrol.com/dvc-org-for-hacktoberfest-2019-ce5320151a0c&quot;&gt;Hacktoberfest&lt;/a&gt;
was a great exercise for the DVC team on many levels, and we really enjoyed
supporting new contributors. Kudos to
&lt;a href=&quot;https://twitter.com/explorer_07&quot;&gt;Nabanita Dash&lt;/a&gt; for organizing a cool
DVC-themed hackathon!&lt;/p&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;blockquote class=&quot;twitter-tweet&quot; data-dnt=&quot;true&quot;&gt;&lt;p lang=&quot;en&quot; dir=&quot;ltr&quot;&gt;Our open source event Hacktoberfest-themed meet-up was a success. Thanks to &lt;a href=&quot;https://twitter.com/DVCorg&quot;&gt;@DVCorg&lt;/a&gt; and it&apos;s mentors for all the hard work. &lt;br&gt;Some of our attendees made their first PR on DVC and got them merged. Kudos to the team! &lt;br&gt;PS: 🍕 was the second best thing of the evening. &lt;a href=&quot;https://t.co/zAWC0TVlPd&quot;&gt;pic.twitter.com/zAWC0TVlPd&lt;/a&gt;&lt;/p&gt;— Programming Society IIIT-Bh (@psociiit) &lt;a href=&quot;https://twitter.com/psociiit/status/1185150096792535040&quot;&gt;October 18, 2019&lt;/a&gt;&lt;/blockquote&gt;&lt;/body&gt;&lt;/html&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;We’ve crossed the 4k stars mark on &lt;a href=&quot;https://github.com/iterative/dvc&quot;&gt;GitHub&lt;/a&gt;!&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;DVC participated in the
&lt;a href=&quot;https://twitter.com/FossMec/status/1192866498324254720&quot;&gt;Devsprints&lt;/a&gt; (thank
you, &lt;a href=&quot;https://twitter.com/kurianbenoy2&quot;&gt;Kurian Benoy&lt;/a&gt;, for the intro!), and we
were happy to jump in and help with some mentoring.&lt;/p&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;blockquote class=&quot;twitter-tweet&quot; data-dnt=&quot;true&quot;&gt;&lt;p lang=&quot;en&quot; dir=&quot;ltr&quot;&gt;Thank you &lt;a href=&quot;https://twitter.com/DVCorg&quot;&gt;@DVCorg&lt;/a&gt; for participating in the Devsprints, by &lt;a href=&quot;https://twitter.com/FossMec&quot;&gt;@FossMEC&lt;/a&gt; and &lt;a href=&quot;https://twitter.com/excelmec&quot;&gt;@excelmec&lt;/a&gt;. We had &lt;a href=&quot;https://twitter.com/shcheklein&quot;&gt;@shcheklein&lt;/a&gt; who joined us all the way from SF and explained how open source is boosting the future. Srinidhi and &lt;a href=&quot;https://twitter.com/kurianbenoy2&quot;&gt;@kurianbenoy2&lt;/a&gt; helped participants get started to contributing to the project.&lt;/p&gt;— FOSS MEC (@FossMec) &lt;a href=&quot;https://twitter.com/FossMec/status/1192866498324254720&quot;&gt;November 8, 2019&lt;/a&gt;&lt;/blockquote&gt;&lt;/body&gt;&lt;/html&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;span class=&quot;gatsby-resp-image-wrapper&quot; style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto;  max-width: 700px;&quot;&gt;
      &lt;a class=&quot;gatsby-resp-image-link&quot; href=&quot;/static/1fe957ddccf9aa3e7bb643d8e8ea8bed/8923b/devsprints.png&quot; style=&quot;display: block&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;
    &lt;span class=&quot;gatsby-resp-image-background-image&quot; style=&quot;padding-bottom: 78.62068965517241%; position: relative; bottom: 0; left: 0; background-image: url(&amp;#x27;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAQCAIAAACZeshMAAAACXBIWXMAAAsSAAALEgHS3X78AAABxklEQVQoz4VS247TMBTMH7HdpnXuiRNf4ksaN91LF4EQEggJLU/7ABIIiVd+gZ9ljLtRtFq60sQ5ic+xZ86ciBSUCmPHq16P27Q6g01SJnkj9cikGdz1Kk4i1jDHVFp2ZcOREZMCWG/zgPC5BI4ASFZjjR6Or//c39dMT9d3yjqhBoCyvuWK9ZYyBTRdHwL8R1xTkZUUl0UkrUhS8d6iUmoX8ppOYu2EzqsOVAPSgs4rbgYiMLmIM2WsGQ5M7takDMRi4oHgTBeihilpHMn82RuC1FNxKHsSLLd88Zt3Hx++/WTSmt3E+0H0A5cWKtBSL6/1/OtWhBhqEQBlw3zD8KxJIaTWdq/sHhlhG7KTf8L+B39zvMkvVqnkSijXieFyGyx5Qe2JdquNvrkieYPWz719sexUfPz84cvvH7RV7nCHOYN4gHrD+oXJHkFRy3VFReAVxdv81WU62d24v1V2iklFzkqdBfvi8Ery2k9LVj2aXDzxZjZ/tsoPydv3n75+/wUPOm7AB7KLGnPOEJQ1K54BEhjmzN9ctbITmGGth0NvHKxG50LzQoDUEDwCnwyz7Ys3SbGKs50buT7W7fQs2wXtYjlkfwE2Qu5MC8RIwAAAAABJRU5ErkJggg==&amp;#x27;); background-size: cover; display: block;&quot;&gt;&lt;/span&gt;
  &lt;picture&gt;
        &lt;source srcset=&quot;/static/1fe957ddccf9aa3e7bb643d8e8ea8bed/c54d4/devsprints.webp 175w, /static/1fe957ddccf9aa3e7bb643d8e8ea8bed/a3432/devsprints.webp 350w, /static/1fe957ddccf9aa3e7bb643d8e8ea8bed/426ac/devsprints.webp 700w, /static/1fe957ddccf9aa3e7bb643d8e8ea8bed/c139f/devsprints.webp 1050w, /static/1fe957ddccf9aa3e7bb643d8e8ea8bed/7f403/devsprints.webp 1400w, /static/1fe957ddccf9aa3e7bb643d8e8ea8bed/a7dc3/devsprints.webp 1450w&quot; sizes=&quot;(max-width: 700px) 100vw, 700px&quot; type=&quot;image/webp&quot;&gt;
        &lt;source srcset=&quot;/static/1fe957ddccf9aa3e7bb643d8e8ea8bed/17006/devsprints.png 175w, /static/1fe957ddccf9aa3e7bb643d8e8ea8bed/d6f3f/devsprints.png 350w, /static/1fe957ddccf9aa3e7bb643d8e8ea8bed/69344/devsprints.png 700w, /static/1fe957ddccf9aa3e7bb643d8e8ea8bed/b1f9d/devsprints.png 1050w, /static/1fe957ddccf9aa3e7bb643d8e8ea8bed/3fc71/devsprints.png 1400w, /static/1fe957ddccf9aa3e7bb643d8e8ea8bed/8923b/devsprints.png 1450w&quot; sizes=&quot;(max-width: 700px) 100vw, 700px&quot; type=&quot;image/png&quot;&gt;
        &lt;img class=&quot;gatsby-resp-image-image&quot; src=&quot;/static/1fe957ddccf9aa3e7bb643d8e8ea8bed/69344/devsprints.png&quot; alt=&quot;devsprints&quot; title=&quot;devsprints&quot; loading=&quot;lazy&quot; style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;&gt;
      &lt;/picture&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/body&gt;&lt;/html&gt;&lt;em&gt;Devsprints participants on our
&lt;a href=&quot;http://dvc.org/chat&quot;&gt;Discord&lt;/a&gt; channel&lt;/em&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;DVC became part of the default
&lt;a href=&quot;https://formulae.brew.sh/formula/dvc&quot;&gt;Homebrew formulae&lt;/a&gt;! So now you can
install it as easily as &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;brew install dvc&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;!&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;We helped two aspiring speakers deliver their very first conference talks.
&lt;a href=&quot;https://twitter.com/kurianbenoy2/status/1183427495342694401?s=20&quot;&gt;Kurian Benoy&lt;/a&gt;
spoke at &lt;a href=&quot;https://in.pycon.org/2019/&quot;&gt;PyCon India&lt;/a&gt; and
&lt;a href=&quot;https://www.linkedin.com/in/aman-sharma606/&quot;&gt;Aman Sharma&lt;/a&gt; spoke at
&lt;a href=&quot;https://scipy.in/2019#speakers&quot;&gt;SciPy India&lt;/a&gt;. &lt;strong&gt;Supporting speakers is
something we are passionate about, so if you have ever wanted to give a talk on a
DVC-related topic, we are here to help. Just
&lt;a href=&quot;https://dvc.org/support&quot;&gt;let us know&lt;/a&gt;!&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;iframe width=&quot;100%&quot; height=&quot;315&quot; src=&quot;https://www.youtube-nocookie.com/embed/Ipzf6oQqQpo?rel=0&quot; frameborder=&quot;0&quot; allow=&quot;accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture&quot; allowfullscreen&gt;&lt;/iframe&gt;&lt;/body&gt;&lt;/html&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;Our own &lt;a href=&quot;https://twitter.com/FullStackML&quot;&gt;Dmitry Petrov&lt;/a&gt; went to Europe to
speak at the
&lt;a href=&quot;https://osseu19.sched.com/speaker/dmitry35&quot;&gt;Open Source Summit Europe&lt;/a&gt; in
Lyon and &lt;a href=&quot;https://www.highload.ru/moscow/2019/abstracts/6032&quot;&gt;Highload++&lt;/a&gt; in
Moscow, and made a stop in Berlin to co-host a
&lt;a href=&quot;https://www.meetup.com/codecentric-Berlin/events/265555810/&quot;&gt;meetup&lt;/a&gt; with our
favourite AI folks from &lt;a href=&quot;https://www.codecentric.de/&quot;&gt;Codecentric&lt;/a&gt;!&lt;/li&gt;
&lt;/ul&gt;
&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;hr&gt;&lt;/body&gt;&lt;/html&gt;
&lt;p&gt;Here are some of the great pieces of content around DVC and ML ops that we
discovered in October and November:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://www.deploymachinelearning.com/&quot;&gt;Deploy Machine Learning Models with Django&lt;/a&gt;
by Piotr Płoński.&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote&gt;
&lt;p&gt;…building your ML system has a great advantage — it is tailored to your needs.
It has all features that are needed in your ML system and can be as complex as
you wish. This tutorial is for readers who are familiar with ML and would like
to learn how to build ML web services.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;section class=&quot;elp-content-holder&quot;&gt;
      &lt;a href=&quot;https://www.deploymachinelearning.com/&quot; class=&quot;external-link-preview&quot;&gt;
          &lt;div class=&quot;elp-description-holder&quot;&gt;
            &lt;h4 class=&quot;elp-title&quot;&gt;Deploy Machine Learning Models with Django&lt;/h4&gt;
            &lt;div class=&quot;elp-description&quot;&gt;Version 1.0 (04/11/2019) Piotr Płoński The demand for Machine Learning (ML) applications is growing. Many resources…&lt;/div&gt;
            &lt;div class=&quot;elp-link&quot;&gt;deploymachinelearning.com&lt;/div&gt;
          &lt;/div&gt;
           &lt;div class=&quot;elp-image-holder&quot;&gt;
                &lt;img src=&quot;/uploads/images/2019-12-14/deploy-machine-learning-models.png&quot; alt=&quot;Deploy Machine Learning Models with Django&quot;&gt;
            &lt;/div&gt;
      &lt;/a&gt;
    &lt;/section&gt;
    &lt;/body&gt;&lt;/html&gt;&lt;/body&gt;&lt;/html&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://towardsdatascience.com/how-to-manage-your-machine-learning-workflow-with-dvc-weights-biases-and-docker-5529ea4e59e0&quot;&gt;How to Manage Your Machine Learning Workflow with DVC, Weights &amp;#x26; Biases, and Docker&lt;/a&gt;
by &lt;a href=&quot;https://towardsdatascience.com/@james_aka_yale&quot;&gt;James Le&lt;/a&gt;.&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote&gt;
&lt;p&gt;In this article, I want to show 3 powerful tools to simplify and scale up
machine learning development within an organization by making it easy to
track, reproduce, manage, and deploy models.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;section class=&quot;elp-content-holder&quot;&gt;
      &lt;a href=&quot;https://towardsdatascience.com/how-to-manage-your-machine-learning-workflow-with-dvc-weights-biases-and-docker-5529ea4e59e0&quot; class=&quot;external-link-preview&quot;&gt;
          &lt;div class=&quot;elp-description-holder&quot;&gt;
            &lt;h4 class=&quot;elp-title&quot;&gt;How to Manage Your Machine Learning Workflow with DVC, Weights &amp;#x26; Biases, and Docker&lt;/h4&gt;
            &lt;div class=&quot;elp-description&quot;&gt;Managing a machine learning workflow is hard!&lt;/div&gt;
            &lt;div class=&quot;elp-link&quot;&gt;towardsdatascience.com&lt;/div&gt;
          &lt;/div&gt;
           &lt;div class=&quot;elp-image-holder&quot;&gt;
                &lt;img src=&quot;/uploads/images/2019-12-14/how-to-manage-your-machine-learning-workflow.jpeg&quot; alt=&quot;How to Manage Your Machine Learning Workflow with DVC, Weights &amp;#x26; Biases, and Docker&quot;&gt;
            &lt;/div&gt;
      &lt;/a&gt;
    &lt;/section&gt;
    &lt;/body&gt;&lt;/html&gt;&lt;/body&gt;&lt;/html&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://towardsdatascience.com/creating-a-solid-data-science-development-environment-60df14ce3a34&quot;&gt;Creating a solid Data Science development environment&lt;/a&gt;
by
&lt;a href=&quot;https://towardsdatascience.com/@gabrielsgoncalves&quot;&gt;Gabriel dos Santos Goncalves&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote&gt;
&lt;p&gt;We do believe that Data Science is a field that can become even more mature by
using best practices in project development and that Conda, Git, DVC, and
JupyterLab are key components of this new approach&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;section class=&quot;elp-content-holder&quot;&gt;
      &lt;a href=&quot;https://towardsdatascience.com/creating-a-solid-data-science-development-environment-60df14ce3a34&quot; class=&quot;external-link-preview&quot;&gt;
          &lt;div class=&quot;elp-description-holder&quot;&gt;
            &lt;h4 class=&quot;elp-title&quot;&gt;Creating a solid Data Science development environment&lt;/h4&gt;
            &lt;div class=&quot;elp-description&quot;&gt;How to organize and replicate your development environment using Conda, Git, DVC, and JupyterLab.&lt;/div&gt;
            &lt;div class=&quot;elp-link&quot;&gt;towardsdatascience.com&lt;/div&gt;
          &lt;/div&gt;
           &lt;div class=&quot;elp-image-holder&quot;&gt;
                &lt;img src=&quot;/uploads/images/2019-12-14/creating-solid-data-science-dev-env.png&quot; alt=&quot;Creating a solid Data Science development environment&quot;&gt;
            &lt;/div&gt;
      &lt;/a&gt;
    &lt;/section&gt;
    &lt;/body&gt;&lt;/html&gt;&lt;/body&gt;&lt;/html&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/y-data-stories/creating-reproducible-data-science-workflows-with-dvc-3bf058e9797b&quot;&gt;Creating reproducible data science workflows with DVC&lt;/a&gt;
by &lt;a href=&quot;https://medium.com/@glib.ivashkevych&quot;&gt;Gleb Ivashkevich&lt;/a&gt;.&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote&gt;
&lt;p&gt;DVC is a powerful tool and we covered only the fundamentals of it.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;section class=&quot;elp-content-holder&quot;&gt;
      &lt;a href=&quot;https://medium.com/y-data-stories/creating-reproducible-data-science-workflows-with-dvc-3bf058e9797b&quot; class=&quot;external-link-preview&quot;&gt;
          &lt;div class=&quot;elp-description-holder&quot;&gt;
            &lt;h4 class=&quot;elp-title&quot;&gt;Creating reproducible data science workflows with DVC&lt;/h4&gt;
            &lt;div class=&quot;elp-description&quot;&gt;“Getting started” tutorial into DVC to make a structure and order in your daily ML routine&lt;/div&gt;
            &lt;div class=&quot;elp-link&quot;&gt;medium.com&lt;/div&gt;
          &lt;/div&gt;
           &lt;div class=&quot;elp-image-holder&quot;&gt;
                &lt;img src=&quot;/uploads/images/2019-12-14/creating-reproducible-data-science-workflows.jpeg&quot; alt=&quot;Creating reproducible data science workflows with DVC&quot;&gt;
            &lt;/div&gt;
      &lt;/a&gt;
    &lt;/section&gt;
    &lt;/body&gt;&lt;/html&gt;&lt;/body&gt;&lt;/html&gt;&lt;/p&gt;
&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;hr&gt;&lt;/body&gt;&lt;/html&gt;
&lt;h2&gt;Discord gems&lt;/h2&gt;
&lt;p&gt;There are lots of hidden gems in our Discord community discussions. Sometimes
they are scattered all over the channels and hard to track down.&lt;/p&gt;
&lt;p&gt;We sift through the issues and discussions to share the most
interesting takeaways with you.&lt;/p&gt;
&lt;h3&gt;Q: When you do a &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc import&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; you get the state of the data in the original repo at that moment in time from that repo, right? &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/563406153334128681/618744949277458462&quot;&gt;The overall state of that repo (e.g. Git &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;commit id&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; (hash)) is not preserved upon import, right?&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;On the contrary, DVC relies on the Git &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;commit id&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; (hash) to determine the state of
the data as well as the code. The Git &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;commit id&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; (hash) is saved in the DVC-file upon
import. The data itself is copied/downloaded into the DVC repo cache but is not
pushed to the remote, so DVC does not create duplicates. The commit hash is saved to
provide reproducibility: even if the source repo &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;HEAD&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; has changed, your import
stays the same until you run &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc update&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; (the command that advances it when needed) or redo &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc import&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;.&lt;/p&gt;
&lt;h3&gt;Q: I’m trying to understand if DVC is an appropriate solution for storing data under GDPR requirements. &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/485596304961962003/621057268145848340&quot;&gt;That means that permanent deletion of files with sensitive data needs to be fully supported.&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Yes, in this sense DVC is not very different from using bare S3, SSH or any
other storage where you can go and just delete data. DVC can add a bit of
overhead when locating a specific file to delete, but otherwise it’s all the same:
you will be able to delete any file you want. Read more details in
&lt;a href=&quot;https://discordapp.com/channels/485586884165107732/485596304961962003/621062105524862987&quot;&gt;this discussion&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;Q: &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/485596304961962003/621591769766821888&quot;&gt;Is there anyway to get the remote url for specific DVC-files?&lt;/a&gt; Say, I have a DVC-file &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;foo.png.dvc&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; — is there a command that will show the remote url, something like &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc get-remote-url foo.png.dvc&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; which will return e.g. the Azure url to download.&lt;/h3&gt;
&lt;p&gt;There is no special command for that, but if you are using Python, you could use
our API specifically designed for that:&lt;/p&gt;
&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;python&quot;&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;from&lt;/span&gt; dvc&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;api &lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; get_url

url &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; get_url&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;path&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
              repo&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;https://github.com/user/proj&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
              rev&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;mybranch&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/body&gt;&lt;/html&gt;
&lt;p&gt;So you could just as well wrap this in a small script and use it from the CLI.&lt;/p&gt;
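&lt;p&gt;As a rough illustration of such a wrapper, here is a minimal sketch. The function names and the command-line flags are hypothetical choices made for this example, not part of DVC; the only DVC call it relies on is &lt;code class=&quot;language-text&quot;&gt;dvc.api.get_url&lt;/code&gt; with the same keyword arguments as in the snippet above.&lt;/p&gt;

```python
"""Hypothetical CLI wrapper around dvc.api.get_url (names and flags are illustrative)."""
import argparse


def build_parser():
    # The flags mirror the keyword arguments accepted by dvc.api.get_url.
    parser = argparse.ArgumentParser(
        description="Print the remote storage URL of a DVC-tracked file."
    )
    parser.add_argument("path", help="path to the tracked file inside the project")
    parser.add_argument("--repo", default=None, help="Git URL or local path of the DVC project")
    parser.add_argument("--rev", default=None, help="Git revision: branch, tag or commit hash")
    parser.add_argument("--remote", default=None, help="name of the DVC remote to resolve against")
    return parser


def main(argv=None):
    args = build_parser().parse_args(argv)
    # Imported lazily so that --help still works where DVC is not installed.
    from dvc.api import get_url

    print(get_url(args.path, repo=args.repo, rev=args.rev, remote=args.remote))
    return 0


# To run this file as a script, end it with: raise SystemExit(main())
```

&lt;p&gt;Saved under an assumed name such as dvc_get_url.py, with the entry-point line from the last comment added, it could be invoked as: python dvc_get_url.py foo.png --repo https://github.com/user/proj --rev mybranch&lt;/p&gt;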
&lt;h3&gt;Q: &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/563406153334128681/619244714071425035&quot;&gt;Can DVC be integrated with MS Active Directory (AD) authentication for controlling access?&lt;/a&gt; The GDPR requirements would force me to use such a system to manage access.&lt;/h3&gt;
&lt;p&gt;Short answer: no (as of the date of publishing this Heartbeat issue). The good news is that
it should be very easy to add, so we would welcome a contribution :) Azure has a
connection argument for AD, and quick googling shows this
&lt;a href=&quot;https://github.com/AzureAD/azure-activedirectory-library-for-python&quot;&gt;library&lt;/a&gt;,
which is probably what is needed.&lt;/p&gt;
&lt;h3&gt;Q: &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/485596304961962003/625124341201502209&quot;&gt;How do I uninstall DVC from Mac installed as a package?&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;When DVC is installed from a plain &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;.pkg&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;, it is a bit tricky to uninstall, so we usually
recommend using something like brew cask instead if you really need the binary
package. Try running these commands:&lt;/p&gt;
&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;dvc&quot;&gt;&lt;pre class=&quot;language-dvc&quot;&gt;&lt;code class=&quot;language-dvc&quot;&gt;&lt;span class=&quot;token line&quot;&gt;&lt;span class=&quot;token input&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;token command&quot;&gt;sudo&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;rm&lt;/span&gt; -rf /usr/local/bin/dvc
&lt;/span&gt;&lt;span class=&quot;token line&quot;&gt;&lt;span class=&quot;token input&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;token command&quot;&gt;sudo&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;rm&lt;/span&gt; -rf /usr/local/lib/dvc
&lt;/span&gt;&lt;span class=&quot;token line&quot;&gt;&lt;span class=&quot;token input&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;token command&quot;&gt;sudo&lt;/span&gt; pkgutil --forget com.iterative.dvc&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/body&gt;&lt;/html&gt;
&lt;p&gt;to uninstall the package.&lt;/p&gt;
&lt;h3&gt;Q: We are using SSH remote to store data, but the problem is that everyone within the project has different username on the remote machine and thus we cannot set it in the config file (that is committed to Git). &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/563406153334128681/619420070111608848&quot;&gt;Is there a way to add just host and path, without the username?&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Yes, you should use the &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;--local&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; or &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;--global&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; config options to set the user per
project or per user, without sharing (committing) it to Git:&lt;/p&gt;
&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;dvc&quot;&gt;&lt;pre class=&quot;language-dvc&quot;&gt;&lt;code class=&quot;language-dvc&quot;&gt;&lt;span class=&quot;token line&quot;&gt;&lt;span class=&quot;token input&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;token dvc&quot;&gt;dvc remote modify&lt;/span&gt; myremote --local user myuser&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/body&gt;&lt;/html&gt;
&lt;p&gt;or&lt;/p&gt;
&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;dvc&quot;&gt;&lt;pre class=&quot;language-dvc&quot;&gt;&lt;code class=&quot;language-dvc&quot;&gt;&lt;span class=&quot;token line&quot;&gt;&lt;span class=&quot;token input&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;token dvc&quot;&gt;dvc remote modify&lt;/span&gt; myremote --global user myuser&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/body&gt;&lt;/html&gt;
&lt;h3&gt;Q: &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/485596304961962003/628227197592797191&quot;&gt;I still get the &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;SSL ERROR&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; when I try to perform a dvc push with or without &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;use_ssl = false&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;&lt;/a&gt;?&lt;/h3&gt;
&lt;p&gt;A simple environment variable like this:&lt;/p&gt;
&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;dvc&quot;&gt;&lt;pre class=&quot;language-dvc&quot;&gt;&lt;code class=&quot;language-dvc&quot;&gt;&lt;span class=&quot;token line&quot;&gt;&lt;span class=&quot;token input&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;token assign-left variable&quot;&gt;AWS_CA_BUNDLE&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;/path/to/cert/cert.crt dvc push&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/body&gt;&lt;/html&gt;
&lt;p&gt;should do the trick for now; we plan to fix the ca_bundle option soon.&lt;/p&gt;
&lt;h3&gt;Q: I have just finished a lengthy &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc repro&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; and I’m happy with the result. However, I realized that I didn’t specify a dependency which I needed (and obviously is used in the computation). &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/563406153334128681/620572187841265675&quot;&gt;Can I somehow fix it?&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;You can add the dependency to the stage file without rerunning/reproducing the stage.
Rerunning is not needed, as this additional dependency hasn’t changed.&lt;/p&gt;
&lt;p&gt;You would need to edit the DVC-file. In the deps section add:&lt;/p&gt;
&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;yaml&quot;&gt;&lt;pre class=&quot;language-yaml&quot;&gt;&lt;code class=&quot;language-yaml&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;path&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; not/included/file/path&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/body&gt;&lt;/html&gt;
&lt;p&gt;and run &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc commit file.dvc&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; to save the changes without running the pipeline again.
See an example
&lt;a href=&quot;https://discordapp.com/channels/485586884165107732/563406153334128681/620641530075414570&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;Q: For some reason &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/485596304961962003/629704961868955648&quot;&gt;we need to always specify the remote name when doing a &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc push&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;&lt;/a&gt; e.g., &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc push -r upstream&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; as opposed to &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc push&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; (mind no additional arguments).&lt;/h3&gt;
&lt;p&gt;You can mark a “default” remote:&lt;/p&gt;
&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;dvc&quot;&gt;&lt;pre class=&quot;language-dvc&quot;&gt;&lt;code class=&quot;language-dvc&quot;&gt;&lt;span class=&quot;token line&quot;&gt;&lt;span class=&quot;token input&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;token dvc&quot;&gt;dvc remote add&lt;/span&gt; -d remote /path/to/my/main/remote&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/body&gt;&lt;/html&gt;
&lt;p&gt;Then, &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc push&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; (and other commands like &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc pull&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;) will know to use the
default remote.&lt;/p&gt;
&lt;h3&gt;Q: &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/563406153334128681/620715145374466048&quot;&gt;If I want stage B to run after stage A, but the stage A has no output, can I specify A’s DVC-file as B’s dependency?&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;No, at least at the time of publishing this. You could use a phony output
though, e.g. make stage A output some dummy file and make B depend on it.
Please consider creating or upvoting a relevant issue on our GitHub if you’d
like this to be implemented.&lt;/p&gt;
&lt;h3&gt;Q: I’m just getting started with DVC, but I’d like to use it for multiple developers to access the data and share models and code. &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/563406153334128681/598867829785362452&quot;&gt;I do own the server, but I’m not sure how to use DVC with SSH remote?&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Please refer to
&lt;a href=&quot;https://discuss.dvc.org/t/how-do-i-use-dvc-with-ssh-remote/279/2&quot;&gt;this answer&lt;/a&gt;
on the DVC forum and check the documentation for the
&lt;a href=&quot;https://dvc.org/doc/command-reference/remote/add&quot;&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc remote add&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;&lt;/a&gt; and
&lt;a href=&quot;https://dvc.org/doc/command-reference/remote/modify&quot;&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc remote modify&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;&lt;/a&gt;
commands to see more options and details.&lt;/p&gt;
&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;hr&gt;&lt;/body&gt;&lt;/html&gt;
&lt;p&gt;If you have any questions, concerns or ideas, let us know in the comments below
or connect with the DVC team &lt;a href=&quot;https://dvc.org/support&quot;&gt;here&lt;/a&gt;. Our
&lt;a href=&quot;https://twitter.com/DVCorg&quot;&gt;DMs on Twitter&lt;/a&gt; are always open, too.&lt;/p&gt;</content:encoded></item><item><title><![CDATA[October ’19 DVC❤️Heartbeat]]></title><link>https://blog.dvc.org/october-19-dvc-heartbeat</link><guid isPermaLink="false">https://blog.dvc.org/october-19-dvc-heartbeat</guid><pubDate>Tue, 05 Nov 2019 00:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;News and links&lt;/h2&gt;
&lt;p&gt;Autumn is a great season for new beginnings and there is so much we love about
it this year. Here are some of the highlights:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Co-hosting our
&lt;a href=&quot;https://www.meetup.com/San-Francisco-Machine-Learning-Meetup/events/264846847/&quot;&gt;first ever meetup&lt;/a&gt;!
Our &lt;a href=&quot;https://twitter.com/FullStackML&quot;&gt;Dmitry Petrov&lt;/a&gt; partnered with
&lt;a href=&quot;https://www.linkedin.com/in/daniel-fischetti-4a6592bb/&quot;&gt;Dan Fischetti&lt;/a&gt; from
&lt;a href=&quot;https://twitter.com/standardAI&quot;&gt;Standard Cognition&lt;/a&gt; to discuss open-source
tools to version control Machine Learning models and experiments. The
recording is now available below.&lt;/p&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;iframe width=&quot;100%&quot; height=&quot;315&quot; src=&quot;https://www.youtube-nocookie.com/embed/RHQXK7EC0jI?rel=0&quot; frameborder=&quot;0&quot; allow=&quot;accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture&quot; allowfullscreen&gt;&lt;/iframe&gt;&lt;/body&gt;&lt;/html&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://blog.dataversioncontrol.com/dvc-org-for-hacktoberfest-2019-ce5320151a0c&quot;&gt;Getting ready for the Hacktoberfest&lt;/a&gt;
and having the whole team get together to pick up and label nice issues and be
ready to support the contributors.&lt;/li&gt;
&lt;li&gt;Discovering some really cool blogposts, talks and tutorials from our users all
over the world: check
&lt;a href=&quot;https://blog.octo.com/mise-en-application-de-dvc-sur-un-projet-de-machine-learning/&quot;&gt;this blogpost in French&lt;/a&gt;
or
&lt;a href=&quot;https://jupyter-tutorial.readthedocs.io/de/latest/reproduce/dvc/init.html&quot;&gt;this tutorial in German&lt;/a&gt;!&lt;/li&gt;
&lt;li&gt;Having a great time working with a &lt;a href=&quot;https://github.com/dashohoxha&quot;&gt;tech writer&lt;/a&gt;
brought to us by the
&lt;a href=&quot;https://developers.google.com/season-of-docs&quot;&gt;Google Season of Docs&lt;/a&gt; program.
Check out these
&lt;a href=&quot;https://dvc.org/doc/tutorials/interactive&quot;&gt;interactive tutorials&lt;/a&gt; we’ve
created together.&lt;/li&gt;
&lt;li&gt;Having a heated internal discussion about Discord vs Slack support/community
channels. If you are on the fence like us, have a look at
&lt;a href=&quot;https://internals.rust-lang.org/t/exploring-new-communication-channels/7859&quot;&gt;this discussion&lt;/a&gt;
in the Rust community, so helpful.&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Seeing &lt;a href=&quot;https://twitter.com/FullStackML&quot;&gt;Dmitry Petrov&lt;/a&gt; being really happy one
day:&lt;/p&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;blockquote class=&quot;twitter-tweet&quot; data-dnt=&quot;true&quot;&gt;&lt;p lang=&quot;en&quot; dir=&quot;ltr&quot;&gt;.&lt;a href=&quot;https://twitter.com/martinfowler&quot;&gt;@martinfowler&lt;/a&gt;&apos;s books and his website were always the source of programming wisdom 💎 His Refactoring book is the first book I recommend to developers.&lt;br&gt;&lt;br&gt;Now they write about ML lifecycle and automation. I’m especially excited because they use &lt;a href=&quot;https://twitter.com/DVCorg&quot;&gt;@DVCorg&lt;/a&gt; that we’ve created. &lt;a href=&quot;https://t.co/HwswZqjOsb&quot;&gt;https://t.co/HwswZqjOsb&lt;/a&gt;&lt;/p&gt;— Dmitry Petrov (@FullStackML) &lt;a href=&quot;https://twitter.com/FullStackML/status/1169403554290814976&quot;&gt;September 5, 2019&lt;/a&gt;&lt;/blockquote&gt;&lt;/body&gt;&lt;/html&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;hr&gt;&lt;/body&gt;&lt;/html&gt;
&lt;p&gt;We at &lt;a href=&quot;https://dvc.org&quot;&gt;DVC.org&lt;/a&gt; are so happy every time we discover an article
featuring DVC or addressing one of the burning ML issues we are trying to solve.
Here are some of the links that caught our eye this past month:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Continuous Delivery for Machine Learning by
&lt;a href=&quot;https://twitter.com/dtsato&quot;&gt;Danilo Sato&lt;/a&gt;,
&lt;a href=&quot;https://twitter.com/arifwider&quot;&gt;Arif Wider&lt;/a&gt;,
&lt;a href=&quot;https://twitter.com/intellification&quot;&gt;Christoph Windheuser&lt;/a&gt; and curated by
&lt;a href=&quot;https://martinfowler.com/&quot;&gt;Martin Fowler&lt;/a&gt;.&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote&gt;
&lt;p&gt;As Machine Learning techniques continue to evolve and perform more complex
tasks, so is evolving our knowledge of how to manage and deliver such
applications to production. By bringing and extending the principles and
practices from Continuous Delivery, we can better manage the risks of
releasing changes to Machine Learning applications in a safe and reliable way.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;section class=&quot;elp-content-holder&quot;&gt;
      &lt;a href=&quot;https://martinfowler.com/articles/cd4ml.html&quot; class=&quot;external-link-preview&quot;&gt;
          &lt;div class=&quot;elp-description-holder&quot;&gt;
            &lt;h4 class=&quot;elp-title&quot;&gt;Continuous Delivery for Machine Learning&lt;/h4&gt;
            &lt;div class=&quot;elp-description&quot;&gt;bio I am a consultant at ThoughtWorks Germany, where I am leading our data and machine learning activities. I enjoy…&lt;/div&gt;
            &lt;div class=&quot;elp-link&quot;&gt;martinfowler.com&lt;/div&gt;
          &lt;/div&gt;
           &lt;div class=&quot;elp-image-holder&quot;&gt;
                &lt;img src=&quot;/uploads/images/2019-11-05/continuous-delivery-for-machine-learning.png&quot; alt=&quot;Continuous Delivery for Machine Learning&quot;&gt;
            &lt;/div&gt;
      &lt;/a&gt;
    &lt;/section&gt;
    &lt;/body&gt;&lt;/html&gt;&lt;/body&gt;&lt;/html&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/signaturit-tech-blog/the-path-to-identity-validation-2-3-4f698b2ffae9&quot;&gt;The Path to Identity Validation&lt;/a&gt;
by &lt;a href=&quot;https://medium.com/@victor.segura&quot;&gt;Víctor Segura&lt;/a&gt;.&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote&gt;
&lt;p&gt;So, the first question is clear: how to choose the optimal hardware for neural
networks? Secondly, assuming that we have the appropriate infrastructure, how
to build the machine learning ecosystem to train our models efficiently and
not die trying? At &lt;strong&gt;Signaturit&lt;/strong&gt;, we have the solution ;)&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;section class=&quot;elp-content-holder&quot;&gt;
      &lt;a href=&quot;https://medium.com/signaturit-tech-blog/the-path-to-identity-validation-2-3-4f698b2ffae9&quot; class=&quot;external-link-preview&quot;&gt;
          &lt;div class=&quot;elp-description-holder&quot;&gt;
            &lt;h4 class=&quot;elp-title&quot;&gt;The Path to Identity Validation (2/3)&lt;/h4&gt;
            &lt;div class=&quot;elp-description&quot;&gt;How to start your own machine learning project?&lt;/div&gt;
            &lt;div class=&quot;elp-link&quot;&gt;medium.com&lt;/div&gt;
          &lt;/div&gt;
           &lt;div class=&quot;elp-image-holder&quot;&gt;
                &lt;img src=&quot;/uploads/images/2019-11-05/the-path-to-identity-validation.jpeg&quot; alt=&quot;The Path to Identity Validation (2/3)&quot;&gt;
            &lt;/div&gt;
      &lt;/a&gt;
    &lt;/section&gt;
    &lt;/body&gt;&lt;/html&gt;&lt;/body&gt;&lt;/html&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Talk:
&lt;a href=&quot;https://pretalx.com/pyconuk-2019/talk/GCLBFH/&quot;&gt;Managing Big Data in Machine Learning projects&lt;/a&gt;
by &lt;a href=&quot;https://twitter.com/vvasworld&quot;&gt;V Vishnu Anirudh&lt;/a&gt; at
&lt;a href=&quot;https://2019.pyconuk.org/&quot;&gt;PyCon UK 2019&lt;/a&gt;.&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote&gt;
&lt;p&gt;My talk will focus on Version Control Systems (VCS) for big-data projects.
With the advent of Machine Learning (ML) , the development teams find it
increasingly difficult to manage and collaborate on projects that deal with
huge amounts of data and ML models apart from just source code.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;iframe width=&quot;100%&quot; height=&quot;315&quot; src=&quot;https://www.youtube-nocookie.com/embed/4XpHk85_x0E?rel=0&quot; frameborder=&quot;0&quot; allow=&quot;accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture&quot; allowfullscreen&gt;&lt;/iframe&gt;&lt;/body&gt;&lt;/html&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Podcast: TWIML Talk #295
&lt;a href=&quot;https://twimlai.com/twiml-talk-295-managing-deep-learning-experiments-with-lukas-biewald/&quot;&gt;Managing Deep Learning Experiments&lt;/a&gt;
with &lt;a href=&quot;https://twitter.com/l2k&quot;&gt;Lukas Biewald&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote&gt;
&lt;p&gt;Seeing a need for reproducibility in deep learning experiments, Lukas founded
Weights &amp;#x26; Biases. In this episode we discuss his experiment tracking tool, how
it works, the components that make it unique in the ML marketplace and the
open, collaborative culture that Lukas promotes. Listen to Lukas delve into
how he got his start in deep learning experiments, what his experiment
tracking used to look like, the current Weights &amp;#x26; Biases business success
strategy, and what his team is working on today.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;section class=&quot;elp-content-holder&quot;&gt;
      &lt;a href=&quot;https://twimlai.com/twiml-talk-295-managing-deep-learning-experiments-with-lukas-biewald/&quot; class=&quot;external-link-preview&quot;&gt;
          &lt;div class=&quot;elp-description-holder&quot;&gt;
            &lt;h4 class=&quot;elp-title&quot;&gt;Managing Deep Learning Experiments with Lukas Biewald — Talk #295&lt;/h4&gt;
            &lt;div class=&quot;elp-description&quot;&gt;Today we are joined by Lukas Biewald, CEO and Co-Founder of Weights &amp;#x26; Biases. Lukas, previously CEO and Founder of…&lt;/div&gt;
            &lt;div class=&quot;elp-link&quot;&gt;twimlai.com&lt;/div&gt;
          &lt;/div&gt;
           &lt;div class=&quot;elp-image-holder&quot;&gt;
                &lt;img src=&quot;/uploads/images/2019-11-05/managing-deep-learning-experiments.jpeg&quot; alt=&quot;Managing Deep Learning Experiments with Lukas Biewald — Talk #295&quot;&gt;
            &lt;/div&gt;
      &lt;/a&gt;
    &lt;/section&gt;
    &lt;/body&gt;&lt;/html&gt;&lt;/body&gt;&lt;/html&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Discord gems&lt;/h2&gt;
&lt;p&gt;There are lots of hidden gems in our Discord community discussions. Sometimes
they are scattered all over the channels and hard to track down.&lt;/p&gt;
&lt;p&gt;We sift through the issues and discussions and share the most
interesting takeaways with you.&lt;/p&gt;
&lt;h3&gt;Q: I’ve just run a &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc run&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; step, and realised I forgot to declare an output file. &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/485596304961962003/593743448020877323&quot;&gt;Is there a way to add an output file without rerunning the (computationally expensive) step/stage?&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;If you’ve already run it, you can open the created DVC-file in an editor
and add an entry to the &lt;code class=&quot;language-text&quot;&gt;outs&lt;/code&gt; field. After that, run &lt;code class=&quot;language-text&quot;&gt;dvc commit my.dvc&lt;/code&gt;, and
it will save the checksums and data without re-running your command.
Alternatively, &lt;code class=&quot;language-text&quot;&gt;dvc run --no-exec&lt;/code&gt; followed by a commit would also work, without
modifying the DVC-file by hand.&lt;/p&gt;
&lt;h3&gt;Q: &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/485596304961962003/593869598651318282&quot;&gt;For metric files do I have to use dvc run to set a metric or can I do it some other way?&lt;/a&gt; Can I use metrics functionality without the need to setup and manage DVC cache and remote storage?&lt;/h3&gt;
&lt;p&gt;Any file that is under DVC control (e.g. added with &lt;code class=&quot;language-text&quot;&gt;dvc add&lt;/code&gt; or an output in
&lt;code class=&quot;language-text&quot;&gt;dvc run -o&lt;/code&gt;) can be made a metric file with &lt;code class=&quot;language-text&quot;&gt;dvc metrics add file&lt;/code&gt;. Alternatively,
&lt;code class=&quot;language-text&quot;&gt;dvc run -M file&lt;/code&gt; makes &lt;code class=&quot;language-text&quot;&gt;file&lt;/code&gt; a metric without caching it. This means
&lt;code class=&quot;language-text&quot;&gt;dvc metrics show&lt;/code&gt; can be used while the file is still versioned by Git.&lt;/p&gt;
&lt;h3&gt;Q: &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/485596304961962003/595586670498283520&quot;&gt;Is there a way not to add the full (Azure) connection string to the .dvc/config file that is being checked into Git for using dvc remotes&lt;/a&gt;? I think it’s quite unhealthy to have secrets checked in SCM.&lt;/h3&gt;
&lt;p&gt;There are two options: use the &lt;code class=&quot;language-text&quot;&gt;AZURE_STORAGE_CONNECTION_STRING&lt;/code&gt; environment
variable, or use the &lt;code class=&quot;language-text&quot;&gt;--local&lt;/code&gt; flag, which writes the setting to &lt;code class=&quot;language-text&quot;&gt;.dvc/config.local&lt;/code&gt;.
That file is listed in &lt;code class=&quot;language-text&quot;&gt;.gitignore&lt;/code&gt;, so Git does not track it and your secrets
are not exposed.&lt;/p&gt;
&lt;h3&gt;Q: &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/485596304961962003/601068667131920385&quot;&gt;I would like to know if it is possible to manage files under DVC whilst keeping them in their original locations (e.g. on a network drive in a given folder structure)&lt;/a&gt;? &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/485596304961962003/615278138896941101&quot;&gt;If I want to add a large file to be tracked by DVC, and it is in a bucket on S3 or GCS, can I do that without downloading it locally?&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Yes, you are probably looking for external dependencies and outputs. The
&lt;a href=&quot;https://dvc.org/doc/user-guide/managing-external-data&quot;&gt;documentation on managing external data&lt;/a&gt;
is a good place to start.&lt;/p&gt;
&lt;h3&gt;Q: &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/485596304961962003/606388040377565215&quot;&gt;How do I setup DVC so that NAS (e.g. Synology) acts as a shared DVC cache?&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Using a NAS (e.g. over NFS) is a very common scenario for DVC. In short, you use
&lt;code class=&quot;language-text&quot;&gt;dvc cache dir&lt;/code&gt; to set up a cache externally. Set the cache type to use symlinks and
enable protected mode. We are preparing a
&lt;a href=&quot;https://github.com/iterative/dvc.org/blob/31c5d424c6530bb793af69c2af578d2b8a374d02/static/docs/use-cases/shared-storage-on-nfs.md&quot;&gt;document&lt;/a&gt;
on how to set up NFS as a shared cache, but I think it can be applied to any
NAS.&lt;/p&gt;
&lt;h3&gt;Q: So I have some data that is in the hundreds of gigs. &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/485596304961962003/608013531010301952&quot;&gt;If I enable symlink, hardlink strategy and cache protecting, will DVC automatically choose this strategy over copying when trying to use dvc add&lt;/a&gt;?&lt;/h3&gt;
&lt;p&gt;Yes, it will! To clarify: with those settings, &lt;code class=&quot;language-text&quot;&gt;dvc add&lt;/code&gt; moves your data
to the cache and then creates a
hardlink from your cache to your workspace.&lt;/p&gt;
&lt;p&gt;Unless your cache directory and your workspace are on different file systems,
the move should be instant. You can find more information
&lt;a href=&quot;https://dvc.org/doc/user-guide/large-dataset-optimization&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;Q: My repo’s DVC is “busy and locked” and I’m not sure how it got that way and how to remove/diagnose the lock. &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/485596304961962003/608392956679815168&quot;&gt;Any suggestions?&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;DVC uses a lock file to prevent running two commands at the same time. The lock
&lt;a href=&quot;https://dvc.org/doc/user-guide/dvc-files-and-directories#dvc-files-and-directories&quot;&gt;file&lt;/a&gt;
is under the &lt;code class=&quot;language-text&quot;&gt;.dvc&lt;/code&gt; directory. If no DVC commands are running and you are still
getting this error, it is safe to remove this file manually to resolve the issue.&lt;/p&gt;
&lt;h3&gt;Q: &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/485596304961962003/611209851757920266&quot;&gt;I’m trying to understand how does DVC remote add work in case of a local folder and what is the best workflow when data is outside of your project root?&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;When using DVC, in most cases we assume that your data will be somewhere under
the project root. There is an option to use so-called
&lt;a href=&quot;https://dvc.org/doc/user-guide/managing-external-data&quot;&gt;external dependencies&lt;/a&gt;,
which is data that is usually too big to be stored under your project root. But
if you operate on data of a reasonable size, I would recommend
starting by putting the data somewhere under the project root. Remotes are usually
places where you store your data, but it is DVC’s job to move your data around.
If you want to keep your current setup, where the data lives in a different
place than your project, you will need to refer to the data with full paths. So, for
example:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;You are in &lt;code class=&quot;language-text&quot;&gt;/home/gabriel/myproject&lt;/code&gt; and you have initialized a DVC and Git
repository.&lt;/li&gt;
&lt;li&gt;You have &lt;code class=&quot;language-text&quot;&gt;featurize.py&lt;/code&gt; in your project dir, and want to use data to produce
some features and then &lt;code class=&quot;language-text&quot;&gt;train.py&lt;/code&gt; to train a model.&lt;/li&gt;
&lt;li&gt;Run the command:&lt;/li&gt;
&lt;/ol&gt;
&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;dvc&quot;&gt;&lt;pre class=&quot;language-dvc&quot;&gt;&lt;code class=&quot;language-dvc&quot;&gt;&lt;span class=&quot;token line&quot;&gt;&lt;span class=&quot;token input&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;token dvc&quot;&gt;dvc run&lt;/span&gt; -d /research_data/myproject/videos &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;
          -o /research_data/myproject/features &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;
          python featurize.py&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/body&gt;&lt;/html&gt;
&lt;p&gt;This tells DVC that you use &lt;code class=&quot;language-text&quot;&gt;/research_data/myproject/videos&lt;/code&gt; to featurize and
write the output to your features dir. Note that your code must know
those paths; they can be hardcoded inside &lt;code class=&quot;language-text&quot;&gt;featurize.py&lt;/code&gt;. The point of &lt;code class=&quot;language-text&quot;&gt;dvc run&lt;/code&gt;
is just to tell DVC which artifacts belong to the currently defined step of the ML
pipeline.&lt;/p&gt;
&lt;h3&gt;Q: When I run &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;du&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; command to check how much space DVC project consumes I see that it duplicates/copies data. &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/485596304961962003/613935477896249364&quot;&gt;It’s very space and time consuming to copy large data files, is there a way to avoid that?&lt;/a&gt; It takes too long to add large files to DVC.&lt;/h3&gt;
&lt;p&gt;Yes! You don’t have to copy files with DVC. There are two reasons
why &lt;code class=&quot;language-text&quot;&gt;du&lt;/code&gt; can report that data under DVC control takes double the space.
First, &lt;code class=&quot;language-text&quot;&gt;du&lt;/code&gt; can be inaccurate when the underlying file system supports reflinks (XFS on
Linux, APFS on Mac, etc.). This is actually the best scenario, since no copying
happens and no DVC settings need to change. The second case means
that copy semantics, the default, is being used. You can turn it off by setting the cache
type to &lt;code class=&quot;language-text&quot;&gt;symlink&lt;/code&gt; or &lt;code class=&quot;language-text&quot;&gt;hardlink&lt;/code&gt;. Read more on this
&lt;a href=&quot;https://dvc.org/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
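&lt;p&gt;To see why a hard link does not duplicate data, here is a minimal sketch using only standard shell tools. No DVC commands are involved, and the temporary directory and file names are made up for illustration. It only assumes a file system that supports hard links.&lt;/p&gt;

```shell
# Illustration only: plain coreutils, hypothetical paths, no DVC involved.
demo=$(mktemp -d)
mkdir "$demo/cache" "$demo/workspace"

# Stand-in for a 1 MiB data file stored in a cache directory.
# (dd prints a short transfer summary to stderr.)
dd if=/dev/zero of="$demo/cache/data.bin" bs=1024 count=1024

# A hard link is a second name for the same data; nothing is copied.
ln "$demo/cache/data.bin" "$demo/workspace/data.bin"

# Both names report the same inode number in the first column.
ls -i "$demo/cache/data.bin" "$demo/workspace/data.bin"

# du counts hard-linked data once per run, so the total stays near 1 MiB.
du -sk "$demo"
```

&lt;p&gt;Symlinks behave similarly from a disk usage point of view, since the workspace entry is only a pointer to the cached file.&lt;/p&gt;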
&lt;h3&gt;Q: &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/485596304961962003/615479227189559323&quot;&gt;How can I detach a file from DVC control?&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Just removing the corresponding DVC-file and running &lt;code class=&quot;language-text&quot;&gt;dvc gc&lt;/code&gt; after that should
be enough. It’ll stop tracking the data file and clean the local cache that
might still contain it. Note! Don’t forget to run &lt;code class=&quot;language-text&quot;&gt;dvc unprotect&lt;/code&gt; if you use an
advanced &lt;a href=&quot;https://dvc.org/doc/user-guide/large-dataset-optimization&quot;&gt;DVC setup with symlinks and hardlinks&lt;/a&gt;
(that is, the &lt;code class=&quot;language-text&quot;&gt;cache.type&lt;/code&gt; config option is not the default). If the &lt;code class=&quot;language-text&quot;&gt;dvc gc&lt;/code&gt; behavior is not
granular enough, you can manually find the file by its checksum from the DVC-file in
&lt;code class=&quot;language-text&quot;&gt;.dvc/cache&lt;/code&gt; and remote storage. Learn
&lt;a href=&quot;https://dvc.org/doc/user-guide/dvc-files-and-directories#structure-of-cache-directory&quot;&gt;here&lt;/a&gt;
how they are organized.&lt;/p&gt;
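&lt;p&gt;As a rough sketch of that manual lookup: assuming the cache layout described in the linked documentation, where the first two characters of a checksum name a subdirectory and the remaining characters name the file, the path can be derived as follows. The checksum value is made up; copy the real one from the md5 field of your DVC-file.&lt;/p&gt;

```shell
# Hypothetical checksum; in practice copy it from the md5 field in the DVC-file.
md5="d8acabbfd4ee51c95da5d7628c7ef74b"

# Assumed layout: .dvc/cache/[first 2 characters]/[remaining characters]
prefix=$(printf '%s' "$md5" | cut -c1-2)
rest=$(printf '%s' "$md5" | cut -c3-)
cache_file=".dvc/cache/$prefix/$rest"

echo "$cache_file"
```

&lt;p&gt;Check the resulting path before deleting anything, since removing the wrong cache entry loses data.&lt;/p&gt;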
&lt;h3&gt;Q: &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/485596304961962003/621057268145848340&quot;&gt;I’m trying to understand if DVC is an appropriate solution for storing data under GDPR requirements.&lt;/a&gt; That means that permanent deletion of files with sensitive data needs to be fully supported.&lt;/h3&gt;
&lt;p&gt;Yes, in this sense DVC is not very different from using bare S3, SSH or any
other storage where you can go and just delete data. DVC adds a bit of
overhead when locating a specific file to delete, but otherwise it’s all the same:
you will be able to delete any file you want. See more details on how you
can retrospectively edit directories under DVC control
&lt;a href=&quot;https://discordapp.com/channels/485586884165107732/485596304961962003/621062105524862987&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;If you have any questions, concerns or ideas, let us know in the comments below
or connect with the DVC team &lt;a href=&quot;https://dvc.org/support&quot;&gt;here&lt;/a&gt;. Our
&lt;a href=&quot;https://twitter.com/DVCorg&quot;&gt;DMs on Twitter&lt;/a&gt; are always open, too.&lt;/p&gt;</content:encoded></item><item><title><![CDATA[DVC.org for Hacktoberfest 2019]]></title><link>https://blog.dvc.org/dvc-org-for-hacktoberfest-2019</link><guid isPermaLink="false">https://blog.dvc.org/dvc-org-for-hacktoberfest-2019</guid><pubDate>Tue, 08 Oct 2019 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;a href=&quot;https://hacktoberfest.digitalocean.com/&quot;&gt;Hacktoberfest&lt;/a&gt; is a month-long
program that celebrates open source and encourages you to contribute to open
source projects (and rewards you with stickers and a cool T-shirt!). Whether
you’re a seasoned contributor or looking for projects to contribute to for the
first time, you’re welcome to participate!&lt;/p&gt;
&lt;p&gt;It is the 6th season of Hacktoberfest and the 2nd year of participation for
the DVC.org team. We really enjoyed it in 2018, and this year we are upping our game
with our own cool stickers, special edition T-shirts and a
&lt;a href=&quot;https://github.com/iterative/dvc/labels/hacktoberfest&quot;&gt;collection of carefully picked tickets&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;How to participate?&lt;/h3&gt;
&lt;p&gt;If you haven’t started your Hacktoberfest challenge yet, now is just the right
time: you have 3 weeks left to submit PRs and get your swag! Here are some
important details:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Hacktoberfest is open to everyone in the global community.&lt;/li&gt;
&lt;li&gt;You can sign up anytime between October 1 and October 31. Make sure to sign up
on the
&lt;a href=&quot;https://hacktoberfest.digitalocean.com/&quot;&gt;official Hacktoberfest website&lt;/a&gt; for
your PRs to count.&lt;/li&gt;
&lt;li&gt;To get a shirt, you must make 4 legit pull requests (PRs) between October 1–31
in any time zone.&lt;/li&gt;
&lt;li&gt;Pull requests can be made in any public GitHub-hosted repositories/projects,
not just the ones highlighted.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And a special addition from the DVC.org team:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Look through the list of
&lt;a href=&quot;https://github.com/iterative/dvc/labels/hacktoberfest&quot;&gt;DVC Hacktoberfest tickets&lt;/a&gt;
or the list of
&lt;a href=&quot;https://github.com/iterative/dvc/labels/good%20first%20issue&quot;&gt;good DVC first issues&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Make a PR to DVC and get our stickers.&lt;/li&gt;
&lt;li&gt;Close three issues for DVC and get a special DVC T-shirt.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Why contribute to DVC?&lt;/h3&gt;
&lt;p&gt;&lt;a href=&quot;http://dvc.org&quot;&gt;DVC&lt;/a&gt; (Data Version Control) is a relatively young open source
project. It was started in late 2017 by a data scientist and an engineer to fill
the gaps in ML process tooling. Nowadays DVC is growing pretty fast, and
though our in-house team is quite small, we have our contributors (more
than 100 across both code and docs) to thank for developing DVC with us.&lt;/p&gt;
&lt;p&gt;DVC is participating in Hacktoberfest for the 2nd year in a row to bring more people
into open source, to learn from them and to give back by sharing our own
experience. This year we decided to focus on a single topic that is important to us:
improving UI/UX.&lt;/p&gt;
&lt;p&gt;As our contributors and maintainers were sifting through the feature requests,
bugs, and improvements to create a good
&lt;a href=&quot;https://github.com/iterative/dvc/labels/hacktoberfest&quot;&gt;list of Hacktoberfest tickets&lt;/a&gt;,
we noticed that the UI/UX label on GitHub kept popping up again and again. DVC is a
command line tool, and improving UI/UX in our case means making decisions on how
to name command options, where and when to use
&lt;a href=&quot;https://github.com/iterative/dvc/issues/2498&quot;&gt;confirmation prompts&lt;/a&gt; and/or
where to abort execution, what exactly a user would expect to see in the output, how
to test it later, etc.&lt;/p&gt;
&lt;p&gt;Why does improving UI/UX appear to be so important for DVC at this stage? Perhaps
because the project is more mature now and we are ready to spend more time on
polishing it. Or maybe because it is still too engineering-focused and we used
to disregard or de-prioritize all this ‘fancy’ stuff. Or it is because we just lack
experience in creating good CLI UI/UX!&lt;/p&gt;
&lt;p&gt;One way or another, those are great reasons to focus on improving UI (in a broader
sense than just GUI), improving docs, creating a powerful, consistent experience
for our users and increasing the accessibility of DVC.&lt;/p&gt;
&lt;p&gt;This is how
&lt;a href=&quot;https://devcenter.heroku.com/articles/cli-style-guide&quot;&gt;Heroku’s CLI style guide&lt;/a&gt;
starts:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Heroku CLI plugins should provide a clear user experience, targeted primarily
for human readability and usability, which delights the user, while at the
same time supporting advanced users and output formats. This article provides
a clear direction for designing delightful CLI plugins.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;At DVC we are building user experience in line with these principles too, but we
also have our own challenges. And here we turn for help to the global open
source community and all the contributors out there.&lt;/p&gt;
&lt;p&gt;For all of us who have a heart for open source — let’s discuss, contribute,
learn, take the technologies forward and build something great together!&lt;/p&gt;
&lt;p&gt;Happy hacking!&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;We are happy to hear from you &lt;a href=&quot;https://dvc.org/support&quot;&gt;here&lt;/a&gt;. Our
&lt;a href=&quot;https://twitter.com/DVCorg&quot;&gt;DMs on Twitter&lt;/a&gt; are always open, too!&lt;/p&gt;</content:encoded></item><item><title><![CDATA[September ’19 DVC❤️Heartbeat]]></title><link>https://blog.dvc.org/september-19-dvc-heartbeat</link><guid isPermaLink="false">https://blog.dvc.org/september-19-dvc-heartbeat</guid><pubDate>Thu, 26 Sep 2019 00:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;News and links&lt;/h2&gt;
&lt;p&gt;We are super excited to co-host our very first
&lt;strong&gt;&lt;a href=&quot;https://www.meetup.com/San-Francisco-Machine-Learning-Meetup/events/264846847/&quot;&gt;meetup in San Francisco on October 10&lt;/a&gt;&lt;/strong&gt;!
We will gather at the brand new Dropbox HQ office at 6:30 pm to discuss
open-source tools to version control ML models and experiments.
&lt;a href=&quot;https://twitter.com/FullStackML&quot;&gt;Dmitry Petrov&lt;/a&gt; is teaming up with
&lt;a href=&quot;https://www.linkedin.com/in/daniel-fischetti-4a6592bb/&quot;&gt;Daniel Fischetti&lt;/a&gt; from
&lt;a href=&quot;https://standard.ai/&quot;&gt;Standard Cognition&lt;/a&gt; to discuss best ML practices. Join us
and save your spot now:&lt;/p&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;section class=&quot;elp-content-holder&quot;&gt;
      &lt;a href=&quot;https://www.meetup.com/San-Francisco-Machine-Learning-Meetup/events/264846847/&quot; class=&quot;external-link-preview&quot;&gt;
          &lt;div class=&quot;elp-description-holder&quot;&gt;
            &lt;h4 class=&quot;elp-title&quot;&gt;Open-source tools to version control Machine Learning models and experiments&lt;/h4&gt;
            &lt;div class=&quot;elp-description&quot;&gt;AI and ML are becoming an essential part of the engineering and data science everyday workflow. ML teams need new tools…&lt;/div&gt;
            &lt;div class=&quot;elp-link&quot;&gt;meetup.com&lt;/div&gt;
          &lt;/div&gt;
           &lt;div class=&quot;elp-image-holder&quot;&gt;
                &lt;img src=&quot;/uploads/images/2019-09-26/open-source-tools-to-version-control.png&quot; alt=&quot;Open-source tools to version control Machine Learning models and experiments&quot;&gt;
            &lt;/div&gt;
      &lt;/a&gt;
    &lt;/section&gt;
    &lt;/body&gt;&lt;/html&gt;&lt;/body&gt;&lt;/html&gt;&lt;/p&gt;
&lt;p&gt;If you are not in SF on this date and happen to be in Europe — don’t miss the
PyCon DE &amp;#x26; PyData Berlin 2019 joint event on October 9–11. We cannot make it to
Berlin this year, but we were thrilled to discover 2 independent talks featuring
DVC by
&lt;a href=&quot;https://de.pycon.org/program/pydata-ppgwxl-version-control-for-data-science-alessia-marcolini/&quot;&gt;Alessia Marcolini&lt;/a&gt;
and
&lt;a href=&quot;https://de.pycon.org/program/pydata-cwmae7-tools-that-help-you-get-your-experiments-under-control-katharina-rasch/&quot;&gt;Katharina Rasch&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Some other highlights of the end of summer:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Our users and contributors keep creating fantastic pieces of content around
DVC (sharing some links below, but it’s only a fraction of what we have in
stock — can’t be more happy and humbled about it!).&lt;/li&gt;
&lt;li&gt;We’ve reached 79 contributors to
&lt;a href=&quot;https://github.com/iterative/dvc&quot;&gt;DVC core project&lt;/a&gt; and 74 contributors to
&lt;a href=&quot;https://github.com/iterative/dvc.org&quot;&gt;DVC documentation&lt;/a&gt; (and have something
special in mind to celebrate our 100th contributors).&lt;/li&gt;
&lt;li&gt;We enjoyed working with all the talented
&lt;a href=&quot;https://developers.google.com/season-of-docs/&quot;&gt;Google Season of Docs&lt;/a&gt;
applicants and are now moving to the next stage with our chosen tech writer
&lt;a href=&quot;http://dashohoxha.fs.al/&quot;&gt;Dashamir Hoxha&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;We’ve crossed the 3,000 stars mark on GitHub
(&lt;a href=&quot;https://github.com/iterative/dvc&quot;&gt;over 3,500 now&lt;/a&gt;). Thank you for your
support!&lt;/p&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;blockquote class=&quot;twitter-tweet&quot; data-dnt=&quot;true&quot;&gt;&lt;p lang=&quot;en&quot; dir=&quot;ltr&quot;&gt;&lt;a href=&quot;https://t.co/vhkN3zWzjT&quot;&gt;https://t.co/vhkN3zWzjT&lt;/a&gt; just hit 3000 stars on &lt;a href=&quot;https://twitter.com/hashtag/Github?src=hash&amp;#x26;ref_src=twsrc%5Etfw&quot;&gt;#Github&lt;/a&gt;! &lt;a href=&quot;https://t.co/AILppwghuu&quot;&gt;https://t.co/AILppwghuu&lt;/a&gt; &lt;br&gt;Thank you for your trust, your contributions and your insights🤝&lt;br&gt;We are beyond happy to have you with us on this exciting journey🚀 &lt;a href=&quot;https://t.co/dwokD2v7t7&quot;&gt;pic.twitter.com/dwokD2v7t7&lt;/a&gt;&lt;/p&gt;— 🦉DVC (@DVCorg) &lt;a href=&quot;https://twitter.com/DVCorg/status/1147220439472545793&quot;&gt;July 5, 2019&lt;/a&gt;&lt;/blockquote&gt;&lt;/body&gt;&lt;/html&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;We had a great time at the
&lt;a href=&quot;https://events.linuxfoundation.org/events/open-source-summit-north-america-2019/program/&quot;&gt;Open Source Summit&lt;/a&gt;
by the Linux Foundation in San Diego, speaking on stage, running a booth, and
chatting with the amazing open-source crowd out there.&lt;/p&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;blockquote class=&quot;twitter-tweet&quot; data-dnt=&quot;true&quot;&gt;&lt;p lang=&quot;en&quot; dir=&quot;ltr&quot;&gt;Love all &lt;a href=&quot;https://twitter.com/DVCorg&quot;&gt;@DVCorg&lt;/a&gt; booth buzz at &lt;a href=&quot;https://twitter.com/hashtag/OSSummit?src=hash&amp;#x26;ref_src=twsrc%5Etfw&quot;&gt;#OSSummit&lt;/a&gt;! 🎉&lt;br&gt;Stop by and grab some cool swag 🌈and participate in our easy fun contest to win a Jetson Nano, the coolest fuzzy owls and a bunch of other staff! 🤩 &lt;a href=&quot;https://t.co/MIzfilhrRJ&quot;&gt;pic.twitter.com/MIzfilhrRJ&lt;/a&gt;&lt;/p&gt;— Svetlana Grinchenko (@a142hr) &lt;a href=&quot;https://twitter.com/a142hr/status/1164256520235675648&quot;&gt;August 21, 2019&lt;/a&gt;&lt;/blockquote&gt;&lt;/body&gt;&lt;/html&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;span class=&quot;gatsby-resp-image-wrapper&quot; style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto;  max-width: 700px;&quot;&gt;
      &lt;a class=&quot;gatsby-resp-image-link&quot; href=&quot;/static/ccbbea0b26a9ac64744739bf7a5ee8b5/31ed7/open-source-summit-by-linux-foundation.jpg&quot; style=&quot;display: block&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;
    &lt;span class=&quot;gatsby-resp-image-background-image&quot; style=&quot;padding-bottom: 75%; position: relative; bottom: 0; left: 0; background-image: url(&amp;#x27;data:image/jpeg;base64,/9j/2wBDABALDA4MChAODQ4SERATGCgaGBYWGDEjJR0oOjM9PDkzODdASFxOQERXRTc4UG1RV19iZ2hnPk1xeXBkeFxlZ2P/2wBDARESEhgVGC8aGi9jQjhCY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2P/wgARCAAPABQDASIAAhEBAxEB/8QAGAAAAgMAAAAAAAAAAAAAAAAAAAQBAwX/xAAVAQEBAAAAAAAAAAAAAAAAAAABAv/aAAwDAQACEAMQAAABgdRirDME/8QAGhAAAgMBAQAAAAAAAAAAAAAAAQMAAhMREv/aAAgBAQABBQJCreLHjdVRXcmnQms//8QAFBEBAAAAAAAAAAAAAAAAAAAAEP/aAAgBAwEBPwE//8QAFBEBAAAAAAAAAAAAAAAAAAAAEP/aAAgBAgEBPwE//8QAHBAAAwABBQAAAAAAAAAAAAAAAAERAhIhQWGB/9oACAEBAAY/AuF0zG+wVE2aXtkQ/8QAGRABAAMBAQAAAAAAAAAAAAAAAQARIVFB/9oACAEBAAE/IQ3tFcIx7UXpkauJyNaCwPJsYY02uz//2gAMAwEAAgADAAAAEBTP/8QAGREAAgMBAAAAAAAAAAAAAAAAAAERITFR/9oACAEDAQE/EIiuDV4f/8QAGBEBAQADAAAAAAAAAAAAAAAAAQARIWH/2gAIAQIBAT8QcbYe3//EAB8QAQACAQMFAAAAAAAAAAAAAAEAEUEhMaFhcYHB8P/aAAgBAQABPxA5cMoKx6j0Vtg5dyLDUgC4fXN+hPMrKBemwMiHWoDtBq06ufE//9k=&amp;#x27;); background-size: cover; display: block;&quot;&gt;&lt;/span&gt;
  &lt;picture&gt;
        &lt;source srcset=&quot;/static/ccbbea0b26a9ac64744739bf7a5ee8b5/c54d4/open-source-summit-by-linux-foundation.webp 175w, /static/ccbbea0b26a9ac64744739bf7a5ee8b5/a3432/open-source-summit-by-linux-foundation.webp 350w, /static/ccbbea0b26a9ac64744739bf7a5ee8b5/426ac/open-source-summit-by-linux-foundation.webp 700w, /static/ccbbea0b26a9ac64744739bf7a5ee8b5/c139f/open-source-summit-by-linux-foundation.webp 1050w, /static/ccbbea0b26a9ac64744739bf7a5ee8b5/7f403/open-source-summit-by-linux-foundation.webp 1400w, /static/ccbbea0b26a9ac64744739bf7a5ee8b5/775df/open-source-summit-by-linux-foundation.webp 2600w&quot; sizes=&quot;(max-width: 700px) 100vw, 700px&quot; type=&quot;image/webp&quot;&gt;
        &lt;source srcset=&quot;/static/ccbbea0b26a9ac64744739bf7a5ee8b5/8dc06/open-source-summit-by-linux-foundation.jpg 175w, /static/ccbbea0b26a9ac64744739bf7a5ee8b5/f4417/open-source-summit-by-linux-foundation.jpg 350w, /static/ccbbea0b26a9ac64744739bf7a5ee8b5/571ad/open-source-summit-by-linux-foundation.jpg 700w, /static/ccbbea0b26a9ac64744739bf7a5ee8b5/566e2/open-source-summit-by-linux-foundation.jpg 1050w, /static/ccbbea0b26a9ac64744739bf7a5ee8b5/3a5dd/open-source-summit-by-linux-foundation.jpg 1400w, /static/ccbbea0b26a9ac64744739bf7a5ee8b5/31ed7/open-source-summit-by-linux-foundation.jpg 2600w&quot; sizes=&quot;(max-width: 700px) 100vw, 700px&quot; type=&quot;image/jpeg&quot;&gt;
        &lt;img class=&quot;gatsby-resp-image-image&quot; src=&quot;/static/ccbbea0b26a9ac64744739bf7a5ee8b5/571ad/open-source-summit-by-linux-foundation.jpg&quot; alt=&quot;open source summit by linux foundation&quot; title=&quot;open source summit by linux foundation&quot; loading=&quot;lazy&quot; style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;&gt;
      &lt;/picture&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/body&gt;&lt;/html&gt;&lt;/p&gt;
&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;hr&gt;&lt;/body&gt;&lt;/html&gt;
&lt;p&gt;Here are some of the great pieces of content around DVC and ML ops that we
discovered in July and August:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;A great, insightful discussion on Twitter about versioning ML projects, started
by &lt;a href=&quot;https://medium.com/@NathanBenaich&quot;&gt;Nathan Benaich&lt;/a&gt;.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;blockquote class=&quot;twitter-tweet&quot; data-dnt=&quot;true&quot;&gt;&lt;p lang=&quot;en&quot; dir=&quot;ltr&quot;&gt;🙏Question to ML friends: How do you go about version control for your ML projects (data, models, and intermediate steps in your data pipelines)? Have you built your own tools? Are using something open source? Or a SaaS? Or does this come bundled with your ML infra products? Thx!&lt;/p&gt;— Nathan Benaich (@NathanBenaich) &lt;a href=&quot;https://twitter.com/NathanBenaich/status/1151815916512010242&quot;&gt;July 18, 2019&lt;/a&gt;&lt;/blockquote&gt;&lt;/body&gt;&lt;/html&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/ixorthink/our-machine-learning-workflow-dvc-mlflow-and-training-in-docker-containers-5b9c80cdf804&quot;&gt;Our Machine Learning Workflow: DVC, MLFlow and Training in Docker Containers&lt;/a&gt;
by &lt;a href=&quot;https://medium.com/@ward.vanlaer&quot;&gt;Ward Van Laer&lt;/a&gt;.&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote&gt;
&lt;p&gt;It is possible to manage your work flow using open-source and free tools.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;section class=&quot;elp-content-holder&quot;&gt;
      &lt;a href=&quot;https://medium.com/ixorthink/our-machine-learning-workflow-dvc-mlflow-and-training-in-docker-containers-5b9c80cdf804&quot; class=&quot;external-link-preview&quot;&gt;
          &lt;div class=&quot;elp-description-holder&quot;&gt;
            &lt;h4 class=&quot;elp-title&quot;&gt;Our Machine Learning Workflow: DVC, MLFlow and Training in Docker Containers&lt;/h4&gt;
            &lt;div class=&quot;elp-description&quot;&gt;Googling for machine learning frameworks to version data, track python models etc.. I was surprised to see that these…&lt;/div&gt;
            &lt;div class=&quot;elp-link&quot;&gt;medium.com&lt;/div&gt;
          &lt;/div&gt;
           &lt;div class=&quot;elp-image-holder&quot;&gt;
                &lt;img src=&quot;/uploads/images/2019-09-26/our-machine-learning-workflow.jpeg&quot; alt=&quot;Our Machine Learning Workflow: DVC, MLFlow and Training in Docker Containers&quot;&gt;
            &lt;/div&gt;
      &lt;/a&gt;
    &lt;/section&gt;
    &lt;/body&gt;&lt;/html&gt;&lt;/body&gt;&lt;/html&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/qonto-engineering/using-dvc-to-create-an-efficient-version-control-system-for-data-projects-96efd94355fe&quot;&gt;Using DVC to create an efficient version control system for data projects&lt;/a&gt;
by &lt;a href=&quot;https://medium.com/@basile_16101&quot;&gt;Basile Guerrapin&lt;/a&gt;.&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote&gt;
&lt;p&gt;DVC brought versioning for inputs, intermediate files and algorithm models to
the VAT auto-detection project and this drastically increased our
&lt;strong&gt;productivity&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;section class=&quot;elp-content-holder&quot;&gt;
      &lt;a href=&quot;https://medium.com/qonto-engineering/using-dvc-to-create-an-efficient-version-control-system-for-data-projects-96efd94355fe&quot; class=&quot;external-link-preview&quot;&gt;
          &lt;div class=&quot;elp-description-holder&quot;&gt;
            &lt;h4 class=&quot;elp-title&quot;&gt;Using DVC to create an efficient version control system for data projects&lt;/h4&gt;
            &lt;div class=&quot;elp-description&quot;&gt;At first we were looking for a tool to help us dealing with production data files such as trained machine learning…&lt;/div&gt;
            &lt;div class=&quot;elp-link&quot;&gt;medium.com&lt;/div&gt;
          &lt;/div&gt;
           &lt;div class=&quot;elp-image-holder&quot;&gt;
                &lt;img src=&quot;/uploads/images/2019-09-26/using-dvc-to-create-an-efficient-vcs.png&quot; alt=&quot;Using DVC to create an efficient version control system for data projects&quot;&gt;
            &lt;/div&gt;
      &lt;/a&gt;
    &lt;/section&gt;
    &lt;/body&gt;&lt;/html&gt;&lt;/body&gt;&lt;/html&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://techsparx.com/software-development/ai/dvc/versioning-example.html&quot;&gt;Managing versioned machine learning datasets in DVC, and easily share ML projects with colleagues&lt;/a&gt;
by &lt;a href=&quot;https://twitter.com/7genblogger&quot;&gt;David Herron&lt;/a&gt;.&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote&gt;
&lt;p&gt;In this tutorial we will go over a simple image classifier. We will learn how
DVC works in a machine learning project, how it optimizes reproducing results
when the project is changed, and how to share the project with colleagues.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;section class=&quot;elp-content-holder&quot;&gt;
      &lt;a href=&quot;https://techsparx.com/software-development/ai/dvc/versioning-example.html&quot; class=&quot;external-link-preview&quot;&gt;
          &lt;div class=&quot;elp-description-holder&quot;&gt;
            &lt;h4 class=&quot;elp-title&quot;&gt;Managing versioned machine learning datasets in DVC, and easily share ML projects with colleagues&lt;/h4&gt;
            &lt;div class=&quot;elp-description&quot;&gt;Software Development Artificial Intelligence Data Version Control (DVC) Managing versioned machine learning datasets in…&lt;/div&gt;
            &lt;div class=&quot;elp-link&quot;&gt;techsparx.com&lt;/div&gt;
          &lt;/div&gt;
           &lt;div class=&quot;elp-image-holder&quot;&gt;
                &lt;img src=&quot;/uploads/images/2019-09-26/managing-versioned-machine-learning-datasets.jpeg&quot; alt=&quot;Managing versioned machine learning datasets in DVC, and easily share ML projects with colleagues&quot;&gt;
            &lt;/div&gt;
      &lt;/a&gt;
    &lt;/section&gt;
    &lt;/body&gt;&lt;/html&gt;&lt;/body&gt;&lt;/html&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://towardsdatascience.com/how-to-use-data-version-control-dvc-in-a-machine-learning-project-a78245c0185&quot;&gt;How to use data version control (dvc) in a machine learning project&lt;/a&gt;
by &lt;a href=&quot;https://towardsdatascience.com/@matthiasbitzer94&quot;&gt;Matthias Bitzer&lt;/a&gt;.&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote&gt;
&lt;p&gt;To illustrate the use of dvc in a machine learning context, we assume that our
data is divided into train, test and validation folders by default, with the
amount of data increasing over time either through an active learning cycle or
by manually adding new data.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;section class=&quot;elp-content-holder&quot;&gt;
      &lt;a href=&quot;https://towardsdatascience.com/how-to-use-data-version-control-dvc-in-a-machine-learning-project-a78245c0185&quot; class=&quot;external-link-preview&quot;&gt;
          &lt;div class=&quot;elp-description-holder&quot;&gt;
            &lt;h4 class=&quot;elp-title&quot;&gt;How to use data version control (dvc) in a machine learning project&lt;/h4&gt;
            &lt;div class=&quot;elp-description&quot;&gt;When working in a productive machine learning project you probably deal with a tone of data and several models. To keep…&lt;/div&gt;
            &lt;div class=&quot;elp-link&quot;&gt;towardsdatascience.com&lt;/div&gt;
          &lt;/div&gt;
           &lt;div class=&quot;elp-image-holder&quot;&gt;
                &lt;img src=&quot;/uploads/images/2019-09-26/how-to-use-data-version-control.jpeg&quot; alt=&quot;How to use data version control (dvc) in a machine learning project&quot;&gt;
            &lt;/div&gt;
      &lt;/a&gt;
    &lt;/section&gt;
    &lt;/body&gt;&lt;/html&gt;&lt;/body&gt;&lt;/html&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://towardsdatascience.com/version-control-ml-model-4adb2db5f87c&quot;&gt;Version Control ML Model&lt;/a&gt;
by &lt;a href=&quot;https://towardsdatascience.com/@TianchenW&quot;&gt;Tianchen Wu&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote&gt;
&lt;p&gt;This post presents a solution to version control machine learning models with
git and dvc (&lt;a href=&quot;https://dvc.org/doc/tutorial&quot;&gt;Data Version Control&lt;/a&gt;).&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;section class=&quot;elp-content-holder&quot;&gt;
      &lt;a href=&quot;https://towardsdatascience.com/version-control-ml-model-4adb2db5f87c&quot; class=&quot;external-link-preview&quot;&gt;
          &lt;div class=&quot;elp-description-holder&quot;&gt;
            &lt;h4 class=&quot;elp-title&quot;&gt;Version Control ML Model&lt;/h4&gt;
            &lt;div class=&quot;elp-description&quot;&gt;Machine Learning operations (let’s call it MLOps under the current buzzword pattern xxOps) are quite different from…&lt;/div&gt;
            &lt;div class=&quot;elp-link&quot;&gt;towardsdatascience.com&lt;/div&gt;
          &lt;/div&gt;
           &lt;div class=&quot;elp-image-holder&quot;&gt;
                &lt;img src=&quot;/uploads/images/2019-09-26/version-control-ml-model.png&quot; alt=&quot;Version Control ML Model&quot;&gt;
            &lt;/div&gt;
      &lt;/a&gt;
    &lt;/section&gt;
    &lt;/body&gt;&lt;/html&gt;&lt;/body&gt;&lt;/html&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://dev.to/robogeek/reflinks-vs-symlinks-vs-hard-links-and-how-they-can-help-machine-learning-projects-1cj4&quot;&gt;Reflinks vs symlinks vs hard links, and how they can help machine learning projects&lt;/a&gt;
by &lt;a href=&quot;https://medium.com/@7genblogger&quot;&gt;David Herron&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote&gt;
&lt;p&gt;In this blog post we’ll go over the details of using links, some cool new
stuff in modern file systems (reflinks), and an example of how DVC (Data
Version Control, &lt;a href=&quot;https://dvc.org/&quot;&gt;https://dvc.org/&lt;/a&gt;) leverages this.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;section class=&quot;elp-content-holder&quot;&gt;
      &lt;a href=&quot;https://dev.to/robogeek/reflinks-vs-symlinks-vs-hard-links-and-how-they-can-help-machine-learning-projects-1cj4&quot; class=&quot;external-link-preview&quot;&gt;
          &lt;div class=&quot;elp-description-holder&quot;&gt;
            &lt;h4 class=&quot;elp-title&quot;&gt;Reflinks vs symlinks vs hard links, and how they can help machine learning projects&lt;/h4&gt;
            &lt;div class=&quot;elp-description&quot;&gt;Hard links and symbolic links have been available since time immemorial, and we use them all the time without even…&lt;/div&gt;
            &lt;div class=&quot;elp-link&quot;&gt;dev.to&lt;/div&gt;
          &lt;/div&gt;
           &lt;div class=&quot;elp-image-holder&quot;&gt;
                &lt;img src=&quot;/uploads/images/2019-09-26/reflinks-vs-symlinks-vs-hard-links.jpeg&quot; alt=&quot;Reflinks vs symlinks vs hard links, and how they can help machine learning projects&quot;&gt;
            &lt;/div&gt;
      &lt;/a&gt;
    &lt;/section&gt;
    &lt;/body&gt;&lt;/html&gt;&lt;/body&gt;&lt;/html&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://blog.codecentric.de/en/2019/08/dvc-dependency-management/&quot;&gt;DVC dependency management — a guide&lt;/a&gt;
by &lt;a href=&quot;https://blog.codecentric.de/en/author/bert-besser/&quot;&gt;Bert Besser&lt;/a&gt; and
&lt;a href=&quot;https://blog.codecentric.de/en/author/veronika-schindler/&quot;&gt;Veronika Schwan&lt;/a&gt;.&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote&gt;
&lt;p&gt;This post is a follow-up to
&lt;a href=&quot;https://blog.codecentric.de/en/2019/03/walkthrough-dvc/&quot;&gt;A walkthrough of DVC&lt;/a&gt;
that deals with managing dependencies between DVC projects. In particular,
this follow-up is about importing specific versions of an artifact (e.g. a
trained model or a dataset) from one DVC project into another.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;section class=&quot;elp-content-holder&quot;&gt;
      &lt;a href=&quot;https://blog.codecentric.de/en/2019/08/dvc-dependency-management/&quot; class=&quot;external-link-preview&quot;&gt;
          &lt;div class=&quot;elp-description-holder&quot;&gt;
            &lt;h4 class=&quot;elp-title&quot;&gt;DVC dependency management - a guide - codecentric AG Blog&lt;/h4&gt;
            &lt;div class=&quot;elp-description&quot;&gt;This post is a follow-up to A walkthrough of DVC that deals with managing dependencies between DVC projects. In…&lt;/div&gt;
            &lt;div class=&quot;elp-link&quot;&gt;blog.codecentric.de&lt;/div&gt;
          &lt;/div&gt;
           &lt;div class=&quot;elp-image-holder&quot;&gt;
                &lt;img src=&quot;/uploads/images/2019-09-26/dvc-org.png&quot; alt=&quot;DVC dependency management - a guide - codecentric AG Blog&quot;&gt;
            &lt;/div&gt;
      &lt;/a&gt;
    &lt;/section&gt;
    &lt;/body&gt;&lt;/html&gt;&lt;/body&gt;&lt;/html&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/@czeslaw.szubert/effective-ml-teams-lessons-learned-6a6e761bc283&quot;&gt;Effective ML Teams — Lessons Learned&lt;/a&gt;
by &lt;a href=&quot;https://medium.com/@czeslaw.szubert&quot;&gt;Czeslaw Szubert&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote&gt;
&lt;p&gt;In this post I’ll present lessons learned on how to setup successful ML teams
and what you need to devise an effective enterprise ML strategy.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;section class=&quot;elp-content-holder&quot;&gt;
      &lt;a href=&quot;https://medium.com/@czeslaw.szubert/effective-ml-teams-lessons-learned-6a6e761bc283&quot; class=&quot;external-link-preview&quot;&gt;
          &lt;div class=&quot;elp-description-holder&quot;&gt;
            &lt;h4 class=&quot;elp-title&quot;&gt;Effective ML Teams — Lessons Learned&lt;/h4&gt;
            &lt;div class=&quot;elp-description&quot;&gt;Machine Learning and Artificial Intelligence has entered our everyday lives — from Virtual Assistants built into each…&lt;/div&gt;
            &lt;div class=&quot;elp-link&quot;&gt;medium.com&lt;/div&gt;
          &lt;/div&gt;
           &lt;div class=&quot;elp-image-holder&quot;&gt;
                &lt;img src=&quot;/uploads/images/2019-09-26/effective-ml-teams.jpeg&quot; alt=&quot;Effective ML Teams — Lessons Learned&quot;&gt;
            &lt;/div&gt;
      &lt;/a&gt;
    &lt;/section&gt;
    &lt;/body&gt;&lt;/html&gt;&lt;/body&gt;&lt;/html&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://www.esentri.com/lessons-learned-from-training-a-german-speech-recognition-model/&quot;&gt;Lessons learned from training a German Speech Recognition model&lt;/a&gt;
by &lt;a href=&quot;https://www.linkedin.com/in/dschoenleber/&quot;&gt;David Schönleber&lt;/a&gt;.&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote&gt;
&lt;p&gt;Setting up a documentation-by-design workflow and using appropriate tools
where needed, e.g. &lt;em&gt;MLFlow&lt;/em&gt; and &lt;em&gt;dvc,&lt;/em&gt; can be a real deal-breaker.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;section class=&quot;elp-content-holder&quot;&gt;
      &lt;a href=&quot;https://www.esentri.com/lessons-learned-from-training-a-german-speech-recognition-model/&quot; class=&quot;external-link-preview&quot;&gt;
          &lt;div class=&quot;elp-description-holder&quot;&gt;
            &lt;h4 class=&quot;elp-title&quot;&gt;Lessons Learned from Training a German Speech Recognition Model - esentri AG&lt;/h4&gt;
            &lt;div class=&quot;elp-description&quot;&gt;This post is the first of a two-part series. In this first part, I address learnings from a recent project in which I…&lt;/div&gt;
            &lt;div class=&quot;elp-link&quot;&gt;esentri.com&lt;/div&gt;
          &lt;/div&gt;
           &lt;div class=&quot;elp-image-holder&quot;&gt;
                &lt;img src=&quot;/uploads/images/2019-09-26/lessons-learned-from-training.jpeg&quot; alt=&quot;Lessons Learned from Training a German Speech Recognition Model - esentri AG&quot;&gt;
            &lt;/div&gt;
      &lt;/a&gt;
    &lt;/section&gt;
    &lt;/body&gt;&lt;/html&gt;&lt;/body&gt;&lt;/html&gt;&lt;/p&gt;
&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;hr&gt;&lt;/body&gt;&lt;/html&gt;
&lt;h2&gt;Discord gems&lt;/h2&gt;
&lt;p&gt;There are lots of hidden gems in our Discord community discussions. Sometimes
they are scattered all over the channels and hard to track down.&lt;/p&gt;
&lt;p&gt;We sift through the issues and discussions and share the most
interesting takeaways with you.&lt;/p&gt;
&lt;h3&gt;Q: I’m getting an error message while trying to use AWS S3 storage: &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;ERROR: failed to push data to the cloud — Unable to locate credentials.&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/563406153334128681/587792932061577218&quot;&gt;Any ideas what’s happening?&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Most likely you haven’t configured your S3 credentials/AWS account yet. Please
read the full documentation on the AWS website. In short, you need to do the
following:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://portal.aws.amazon.com/gp/aws/developer/registration/index.html&quot;&gt;Create your AWS account.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Log in to your AWS Management Console.&lt;/li&gt;
&lt;li&gt;Click on your user name at the top right of the page.&lt;/li&gt;
&lt;li&gt;Click on the Security Credentials link from the drop-down menu.&lt;/li&gt;
&lt;li&gt;Find the Access Credentials section, and copy the latest &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;Access Key ID&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;.&lt;/li&gt;
&lt;li&gt;Click on the Show link in the same row, and copy the &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;Secret Access Key&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Follow
&lt;a href=&quot;https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html&quot;&gt;this link&lt;/a&gt;
to set up your environment.&lt;/p&gt;
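&lt;p&gt;If the error persists after setup, it can help to check where a key pair would be picked up from. The following is a minimal Python sketch, not part of DVC or the AWS tooling. The helper name &lt;code&gt;find_aws_credentials&lt;/code&gt; is hypothetical, and it assumes only two common locations: the &lt;code&gt;AWS_ACCESS_KEY_ID&lt;/code&gt; / &lt;code&gt;AWS_SECRET_ACCESS_KEY&lt;/code&gt; environment variables and the &lt;code&gt;[default]&lt;/code&gt; profile of &lt;code&gt;~/.aws/credentials&lt;/code&gt;. Real AWS clients consult more sources than these two.&lt;/p&gt;

```python
import configparser
import os
from pathlib import Path


def find_aws_credentials(credentials_file=None, env=None):
    """Return a description of where a default AWS key pair was found, or None.

    Hypothetical helper for troubleshooting: it looks only at environment
    variables and at the [default] profile of the shared credentials file.
    """
    env = os.environ if env is None else env
    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        return "environment variables"

    if credentials_file is None:
        credentials_file = Path.home() / ".aws" / "credentials"
    path = Path(credentials_file)

    # interpolation=None keeps '%' characters in values from being interpreted.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read(path)  # a missing file is silently skipped
    if parser.has_section("default"):
        section = parser["default"]
        if section.get("aws_access_key_id") and section.get("aws_secret_access_key"):
            return str(path)
    return None


if __name__ == "__main__":
    location = find_aws_credentials()
    print("Credentials found in:", location or "neither location checked")
```

&lt;p&gt;If the script reports that neither location holds a key pair, the configuration step from the AWS guide linked above has probably not been completed for the current user.&lt;/p&gt;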
&lt;h3&gt;Q: I added data with &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc add&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; or &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc run&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; and see that it takes twice what it was before (with &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;du&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; command). &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/563406153334128681/595402051203235861&quot;&gt;Does it mean that DVC copies data that is added under its control? How do I prevent this from happening?&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;In short: by default, DVC copies the files from your working
directory to the cache (duplicating the data is the safest option). If reflinks
(copy-on-write) are available on your file system, DVC
will use them instead, which is as safe as copying. You can also configure DVC
to use hardlinks/symlinks to save space and time, but this requires
enabling the protected mode (making data files in the workspace read-only). Read
more details &lt;a href=&quot;https://dvc.org/doc/user-guide/large-dataset-optimization&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
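&lt;p&gt;To see why copies double the disk usage while links do not, and why linked files need to be read-only, here is a small Python sketch that uses only the standard library. It does not call DVC; the file names are placeholders, with &lt;code&gt;cached.bin&lt;/code&gt; standing in for a cache entry. It assumes a file system that supports hard links.&lt;/p&gt;

```python
import os
import shutil
import tempfile

workdir = tempfile.mkdtemp()
cached = os.path.join(workdir, "cached.bin")  # stands in for a cache entry
copied = os.path.join(workdir, "copied.bin")  # workspace file made by copying
linked = os.path.join(workdir, "linked.bin")  # workspace file made by hard-linking

with open(cached, "wb") as f:
    f.write(b"x" * 1024)

shutil.copyfile(cached, copied)  # a second, independent copy of the bytes
os.link(cached, linked)          # a second name for the same bytes on disk

print("copy shares storage with cache:", os.path.samefile(cached, copied))
print("hard link shares storage with cache:", os.path.samefile(cached, linked))

# Appending through the hard link also changes the "cache" entry, which is
# why link-based setups keep workspace files read-only.
with open(linked, "ab") as f:
    f.write(b"y")

print("cache entry size after editing the link:", os.path.getsize(cached))
print("independent copy size:", os.path.getsize(copied))
```

&lt;p&gt;The copy stays untouched when the linked file is edited, while the cache entry does not. That is the trade-off between safety and disk space described above.&lt;/p&gt;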
&lt;h3&gt;Q: &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/563406153334128681/599345778703597568&quot;&gt;How concurrent-friendly is the cache? And different remotes? Is it safe to have several containers/nodes fill the same cache at the same time?&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;It is safe, and a shared cache is a very common use case for DVC. See
&lt;a href=&quot;https://discuss.dvc.org/t/share-nas-data-in-server/180/12&quot;&gt;this thread&lt;/a&gt;,
for example.&lt;/p&gt;
&lt;h3&gt;Q: &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/563406153334128681/603890677176336394&quot;&gt;What is the proper way to exit the ASCII visualization?&lt;/a&gt; (when you run the &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc pipeline show&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; command).&lt;/h3&gt;
&lt;p&gt;See this
&lt;a href=&quot;https://dvc.org/doc/commands-reference/pipeline/show#options&quot;&gt;document&lt;/a&gt;. To
navigate, use arrows or W, A, S, D keys. To exit, press Q.&lt;/p&gt;
&lt;h3&gt;Q: &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/563406153334128681/606197026488844338&quot;&gt;Is there an issue if I set my &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;cache.s3&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; external cache to my default remote?&lt;/a&gt; I don’t quite understand what an external cache is for other than I have to have it for external outputs.&lt;/h3&gt;
&lt;p&gt;The short answer is that we suggest keeping them separate to avoid possible
checksum overlaps. A checksum on S3 might theoretically overlap with one of our checksums
(while the content of the file is different), so it could be dangerous. The
chances of losing data are pretty slim, but we would not risk it. Right now, we
are working on making sure no overlaps are possible.&lt;/p&gt;
&lt;h3&gt;Q: &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/563406153334128681/606425815139221504&quot;&gt;What’s the right procedure to move a step .dvc file around the project?&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Assuming the file was created with &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc run&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;, there are a few possible ways.
The obvious one is to delete the file and create a new one with
&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc run --no-exec -f file/path/and/name.dvc&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;. Another possibility is to
rename/move the file and then edit it manually. See
&lt;a href=&quot;https://dvc.org/doc/user-guide/dvc-file-format&quot;&gt;this document&lt;/a&gt;, which describes
how DVC-files are organized. Whichever method you use, you can run
&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc commit file.dvc&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; to save changes without running the command again.&lt;/p&gt;
&lt;h3&gt;Q: &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/563406153334128681/606917839688957952&quot;&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc status&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; doesn’t seem to report things that need to be dvc pushed, is that by design?&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Try &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc status --cloud&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; or &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc status --remote &amp;#x3C;your-remote&gt;&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;
to compare your local cache with a remote one. By default, the command only compares the
“working directory” with your local cache (to check whether something should be
reproduced and saved or not).&lt;/p&gt;
&lt;h3&gt;Q: &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/563406153334128681/608701494035873792&quot;&gt;What kind of files can you put into &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc metrics&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;?&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The file can be in any format; &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc metrics show&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; will try to interpret the
format and output it in the best possible way. Also, if you are using &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;csv&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; or
&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;json&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;, you can use the &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;--xpath&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; flag to query specific measurements. &lt;strong&gt;In
general, you can make any file a metric file and put any content into it; DVC is
not opinionated about it.&lt;/strong&gt; Usually, though, these are files that measure the
performance/accuracy of your model and capture the configuration of experiments.
The idea is to use &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc metrics show&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; to display all your metrics across
experiments so you can decide which combination (of features,
parameters, algorithms, architecture, etc.) works best.&lt;/p&gt;
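&lt;p&gt;As an illustration, an evaluation script could write its results to a small JSON file like the one produced below. This is only a sketch: the helper name &lt;code&gt;write_metrics&lt;/code&gt;, the file name &lt;code&gt;eval.json&lt;/code&gt; and the numbers are made up for the example, and registering the file as a metric with DVC is a separate step that is not shown.&lt;/p&gt;

```python
import json
import os
import tempfile
from pathlib import Path


def write_metrics(path, **values):
    """Write metric values to a JSON file and return its path.

    Hypothetical helper: DVC does not require this layout, it is just one
    simple shape that is easy to query by key.
    """
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(values, indent=2, sort_keys=True) + "\n")
    return path


# Placeholder numbers; in practice they come from your evaluation step.
# A temporary directory is used here so the example leaves no files behind
# in your project.
output = write_metrics(
    os.path.join(tempfile.mkdtemp(), "metrics", "eval.json"),
    accuracy=0.91,
    auc=0.87,
    learning_rate=0.001,
)
print(output.read_text())
```

&lt;p&gt;Keeping both scores and the settings that produced them in the same file makes it easier to compare experiments side by side.&lt;/p&gt;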
&lt;h3&gt;Q: &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/563406153334128681/613639458000207902&quot;&gt;Does DVC take into account the timestamp of a file, or does the MD5 depend only on the file’s actual content (bits)?&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;DVC takes into account only content (bits) of a file to calculate hashes that
are saved into DVC-files.&lt;/p&gt;
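&lt;p&gt;The difference between content and timestamp can be checked with a few lines of Python. This is a sketch of the idea using the standard &lt;code&gt;hashlib&lt;/code&gt; module, not DVC’s own hashing code, which may treat some files differently. The helper name &lt;code&gt;file_md5&lt;/code&gt; and the sample file are assumptions made for the example.&lt;/p&gt;

```python
import hashlib
import os
import tempfile


def file_md5(path, chunk_size=1024 * 1024):
    """Return the MD5 hex digest of a file's bytes, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


path = os.path.join(tempfile.mkdtemp(), "data.csv")
with open(path, "w") as f:
    f.write("a,b\n1,2\n")

before = file_md5(path)

os.utime(path, (0, 0))  # change only the access/modification timestamps
after_touch = file_md5(path)

with open(path, "a") as f:
    f.write("3,4\n")  # change the actual content
after_edit = file_md5(path)

print("same hash after timestamp change:", before == after_touch)
print("same hash after content change:", before == after_edit)
```

&lt;p&gt;Only the second change produces a different digest, because the hash is computed from the bytes alone.&lt;/p&gt;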
&lt;h3&gt;Q: &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/563406153334128681/616421757808541721&quot;&gt;Similar to &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc gc&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; is there a command to garbage collect from the remote?&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc gc --remote NAME&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; does this, but you should be extra careful, because
it will remove everything that is not currently “in use” (by the working
directory). Also, please check this
&lt;a href=&quot;https://github.com/iterative/dvc/issues/2325&quot;&gt;issue&lt;/a&gt;: the semantics of this
command might have changed by the time you read this.&lt;/p&gt;
&lt;h3&gt;Q: &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/485596304961962003/591237578209099786&quot;&gt;How do I use and configure remote storage on IBM Cloud Object Storage?&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Since it’s S3 compatible, specifying &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;endpointurl&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; (exact URL depends on the
&lt;a href=&quot;https://cloud.ibm.com/docs/services/cloud-object-storage?topic=cloud-object-storage-endpoints&quot;&gt;region&lt;/a&gt;)
is the way to go:&lt;/p&gt;
&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;dvc&quot;&gt;&lt;pre class=&quot;language-dvc&quot;&gt;&lt;code class=&quot;language-dvc&quot;&gt;&lt;span class=&quot;token line&quot;&gt;&lt;span class=&quot;token input&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;token dvc&quot;&gt;dvc remote add&lt;/span&gt; -d mybucket s3://path/to/dir
&lt;/span&gt;&lt;span class=&quot;token line&quot;&gt;&lt;span class=&quot;token input&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;token dvc&quot;&gt;dvc remote modify&lt;/span&gt; mybucket &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;
                    endpointurl &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;
                    https://s3.eu.cloud-object-storage.appdomain.cloud&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/body&gt;&lt;/html&gt;
&lt;h3&gt;Q: &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/485596304961962003/592958360903483403&quot;&gt;How can I push data from a client to a Google Cloud bucket using DVC?&lt;/a&gt; I just want to know how I can set the credentials.&lt;/h3&gt;
&lt;p&gt;You can do it by setting an environment variable that points to your credentials
path, like:&lt;/p&gt;
&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;dvc&quot;&gt;&lt;pre class=&quot;language-dvc&quot;&gt;&lt;code class=&quot;language-dvc&quot;&gt;&lt;span class=&quot;token line&quot;&gt;&lt;span class=&quot;token input&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;token command&quot;&gt;export&lt;/span&gt; &lt;span class=&quot;token assign-left variable&quot;&gt;GOOGLE_APPLICATION_CREDENTIALS&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;path/to/credentials&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/body&gt;&lt;/html&gt;
&lt;p&gt;It is also possible to set this variable via &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc config&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;:&lt;/p&gt;
&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;dvc&quot;&gt;&lt;pre class=&quot;language-dvc&quot;&gt;&lt;code class=&quot;language-dvc&quot;&gt;&lt;span class=&quot;token line&quot;&gt;&lt;span class=&quot;token input&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;token dvc&quot;&gt;dvc remote modify&lt;/span&gt; myremote credentialpath /path/to/my/creds&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/body&gt;&lt;/html&gt;
&lt;p&gt;where &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;myremote&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; is your remote name.&lt;/p&gt;
&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;hr&gt;&lt;/body&gt;&lt;/html&gt;
&lt;p&gt;If you have any questions, concerns or ideas, let us know in the comments below
or connect with the DVC team &lt;a href=&quot;https://dvc.org/support&quot;&gt;here&lt;/a&gt;. Our
&lt;a href=&quot;https://twitter.com/DVCorg&quot;&gt;DMs on Twitter&lt;/a&gt; are always open, too.&lt;/p&gt;</content:encoded></item><item><title><![CDATA[July ’19 DVC❤️Heartbeat]]></title><link>https://blog.dvc.org/july-19-dvc-heartbeat</link><guid isPermaLink="false">https://blog.dvc.org/july-19-dvc-heartbeat</guid><pubDate>Thu, 01 Aug 2019 00:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;News and links&lt;/h2&gt;
&lt;p&gt;As we continue to grow DVC together with our fantastic contributors, we enjoy
more and more insights, discussions, and articles either created or brought to
us by our community. We feel it is the right time to start sharing more of your
news, your stories and your discoveries. New Heartbeat is here!&lt;/p&gt;
&lt;p&gt;Speaking of our own news — next month DVC team is going to the
&lt;a href=&quot;https://events.linuxfoundation.org/events/open-source-summit-north-america-2019/&quot;&gt;Open Source North America Summit&lt;/a&gt;.
It is taking place in San Diego on August 21–23.
&lt;a href=&quot;https://ossna19.sched.com/speaker/dmitry35&quot;&gt;Dmitry&lt;/a&gt; and
&lt;a href=&quot;https://ossna19.sched.com/speaker/svetlanagrinchenko&quot;&gt;Sveta&lt;/a&gt; will be giving
talks and we will run a booth. So looking forward to it! Stop by for a chat and
some cool swag. And if you are in San Diego on those days and want to catch up —
please let us know &lt;a href=&quot;http://dvc.org/support&quot;&gt;here&lt;/a&gt; or on Twitter!&lt;/p&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;section class=&quot;elp-content-holder&quot;&gt;
      &lt;a href=&quot;https://ossna19.sched.com/event/PUVv/open-source-tools-for-ml-experiments-management-dmitry-petrov-ruslan-kuprieiev-iterative-ai&quot; class=&quot;external-link-preview&quot;&gt;
          &lt;div class=&quot;elp-description-holder&quot;&gt;
            &lt;h4 class=&quot;elp-title&quot;&gt;Open Source Summit + ELC North America 2019: Open Source Tools for ML Experiments Man...&lt;/h4&gt;
            &lt;div class=&quot;elp-description&quot;&gt;Speakers Software Engineer, Iterative AI Ruslan is a Software Engineer at Iterative AI. Previously he worked on live…&lt;/div&gt;
            &lt;div class=&quot;elp-link&quot;&gt;ossna19.sched.com&lt;/div&gt;
          &lt;/div&gt;
           &lt;div class=&quot;elp-image-holder&quot;&gt;
                &lt;img src=&quot;/uploads/images/2019-08-01/open-source-north-america-summit.png&quot; alt=&quot;Open Source Summit + ELC North America 2019: Open Source Tools for ML Experiments Man...&quot;&gt;
            &lt;/div&gt;
      &lt;/a&gt;
    &lt;/section&gt;
    &lt;/body&gt;&lt;/html&gt;&lt;/body&gt;&lt;/html&gt;&lt;/p&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;section class=&quot;elp-content-holder&quot;&gt;
      &lt;a href=&quot;https://ossna19.sched.com/event/PWNk/speaker-preparation-simple-steps-with-a-tremendous-impact-svetlana-grinchenko-dvcorg&quot; class=&quot;external-link-preview&quot;&gt;
          &lt;div class=&quot;elp-description-holder&quot;&gt;
            &lt;h4 class=&quot;elp-title&quot;&gt;Open Source Summit + ELC North America 2019: Speaker Preparation: Simple Steps with a...&lt;/h4&gt;
            &lt;div class=&quot;elp-description&quot;&gt;Speakers Head of Developer Relations, DVC.org Svetlana is driving developer relations and community at DVC.org…&lt;/div&gt;
            &lt;div class=&quot;elp-link&quot;&gt;ossna19.sched.com&lt;/div&gt;
          &lt;/div&gt;
           &lt;div class=&quot;elp-image-holder&quot;&gt;
                &lt;img src=&quot;/uploads/images/2019-08-01/open-source-north-america-summit.png&quot; alt=&quot;Open Source Summit + ELC North America 2019: Speaker Preparation: Simple Steps with a...&quot;&gt;
            &lt;/div&gt;
      &lt;/a&gt;
    &lt;/section&gt;
    &lt;/body&gt;&lt;/html&gt;&lt;/body&gt;&lt;/html&gt;&lt;/p&gt;
&lt;p&gt;Every month our team is excited to discover new great pieces of content
addressing some of the burning ML issues. Here are some of the links that caught
our eye in June:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://dev.to/robogeek/principled-machine-learning-4eho&quot;&gt;Principled Machine Learning: Practices and Tools for Efficient Collaboration&lt;/a&gt;
by &lt;a href=&quot;https://medium.com/@7genblogger&quot;&gt;David Herron&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;section class=&quot;elp-content-holder&quot;&gt;
      &lt;a href=&quot;https://dev.to/robogeek/principled-machine-learning-4eho&quot; class=&quot;external-link-preview&quot;&gt;
          &lt;div class=&quot;elp-description-holder&quot;&gt;
            &lt;h4 class=&quot;elp-title&quot;&gt;Principled Machine Learning: Practices and Tools for Efficient Collaboration&lt;/h4&gt;
            &lt;div class=&quot;elp-description&quot;&gt;Machine learning projects are often harder than they should be. The code to train an ML model is just software, and we…&lt;/div&gt;
            &lt;div class=&quot;elp-link&quot;&gt;dev.to&lt;/div&gt;
          &lt;/div&gt;
           &lt;div class=&quot;elp-image-holder&quot;&gt;
                &lt;img src=&quot;/uploads/images/2019-08-01/principled-machine-learning.jpeg&quot; alt=&quot;Principled Machine Learning: Practices and Tools for Efficient Collaboration&quot;&gt;
            &lt;/div&gt;
      &lt;/a&gt;
    &lt;/section&gt;
    &lt;/body&gt;&lt;/html&gt;&lt;/body&gt;&lt;/html&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;As we’ve seen in this article some tools and practices can be borrowed from
regular software engineering. However, the needs of machine learning projects
dictate tools that better fit the purpose.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;First
&lt;a href=&quot;http://ml-repa.ru/&quot;&gt;ML-REPA&lt;/a&gt; &lt;a href=&quot;http://ml-repa.ru/page6697700.html&quot;&gt;Meetup: Reproducible ML experiments&lt;/a&gt;
hosted by &lt;a href=&quot;https://www.raiffeisen-digital.ru/?utm_referrer=&quot;&gt;Raiffeisen DGTL&lt;/a&gt;
— check out the video and slide decks.&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;section class=&quot;elp-content-holder&quot;&gt;
      &lt;a href=&quot;http://ml-repa.ru/&quot; class=&quot;external-link-preview&quot;&gt;
          &lt;div class=&quot;elp-description-holder&quot;&gt;
            &lt;h4 class=&quot;elp-title&quot;&gt;Machine Learning REPA&lt;/h4&gt;
            &lt;div class=&quot;elp-description&quot;&gt;Announcements of events, projects, tool reviews and case studies about ML projects, experiment management, automation and…&lt;/div&gt;
            &lt;div class=&quot;elp-link&quot;&gt;ml-repa.ru&lt;/div&gt;
          &lt;/div&gt;
           &lt;div class=&quot;elp-image-holder&quot;&gt;
                &lt;img src=&quot;/uploads/images/2019-08-01/machine-learning-repa.png&quot; alt=&quot;Machine Learning REPA&quot;&gt;
            &lt;/div&gt;
      &lt;/a&gt;
    &lt;/section&gt;
    &lt;/body&gt;&lt;/html&gt;&lt;/body&gt;&lt;/html&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;http://ml-repa.ru/&quot;&gt;ML-REPA&lt;/a&gt; is a fantastic new resource for
Russian-speaking folks interested in Reproducibility, Experiments and Pipelines
Automation. Curated by &lt;a href=&quot;https://twitter.com/mnrozhkov&quot;&gt;Mikhail Rozhkov&lt;/a&gt; and
highly recommended by our team.&lt;/p&gt;
&lt;h3&gt;&lt;a href=&quot;https://www.reddit.com/r/MachineLearning/comments/bx0apm/d_how_do_you_manage_your_machine_learning/&quot;&gt;How do you manage your machine learning experiments?&lt;/a&gt; discussion on Reddit is full of insights.&lt;/h3&gt;
&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;blockquote class=&quot;reddit-card&quot; data-card-created=&quot;1576789144&quot;&gt;&lt;a href=&quot;https://www.reddit.com/r/MachineLearning/comments/bx0apm/d_how_do_you_manage_your_machine_learning/&quot;&gt;[D] How do you manage your machine learning experiments?&lt;/a&gt; from &lt;a href=&quot;http://www.reddit.com/r/MachineLearning&quot;&gt;r/MachineLearning&lt;/a&gt;&lt;/blockquote&gt;&lt;/body&gt;&lt;/html&gt;
&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;hr&gt;&lt;/body&gt;&lt;/html&gt;
&lt;h2&gt;Discord gems&lt;/h2&gt;
&lt;p&gt;There are lots of hidden gems in our Discord community discussions. Sometimes
they are scattered all over the channels and hard to track down.&lt;/p&gt;
&lt;p&gt;We are sifting through the issues and discussions and sharing with you the most
interesting takeaways.&lt;/p&gt;
&lt;h3&gt;Q: I have within one git repository different folders with very different content (basically different projects, or content I want to have different permissions to), and I thought about using different buckets in AWS as remotes. &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/485596304961962003/575718048330416158&quot;&gt;I’m not sure if it’s possible with DVC to store some files in some remote, and some other files in some other remote, is it?&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;You can definitely add more than one remote (see
&lt;a href=&quot;https://dvc.org/doc/commands-reference/remote-add&quot;&gt;dvc remote add&lt;/a&gt;), and
&lt;a href=&quot;https://dvc.org/doc/commands-reference/push&quot;&gt;dvc push&lt;/a&gt; has a &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;-r&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; option to
pick which one to send the cached data files (deps, outs, etc.) to. We would not
recommend doing this, though. It complicates the commands you have to run: you
will need to remember to specify a remote name for every command that deals with
data (&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;push&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;, &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;pull&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;, &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;gc&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;, &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;fetch&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;, &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;status&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;, etc.). Please leave a comment in
the relevant issue &lt;a href=&quot;https://github.com/iterative/dvc/issues/2095&quot;&gt;here&lt;/a&gt; if this
case is important for you.&lt;/p&gt;
&lt;h3&gt;Q: &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/485596304961962003/578532350221352987&quot;&gt;Is it possible with DVC to have multiple (a few) metric files and compare them all at once?&lt;/a&gt; For example, we’d like to treat as metrics the loss of a neural network training process (loss as a &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;-M&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; output of a training stage) and, separately, the accuracy of the NN on a test set (another &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;-M&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; output of an eval stage).&lt;/h3&gt;
&lt;p&gt;Yes, it is totally fine to use &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;-M&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; in different stages. &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc metrics show&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; will
just show both metrics.&lt;/p&gt;
&lt;h3&gt;Q: &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/485596304961962003/577362750443880449&quot;&gt;I have a scenario where an artifacts (data) folder is created by the dvc run command via the &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;-o&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; flag. I have manually added another file into or modified the artifacts folder but when I do &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc push&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; nothing happens, is there any way around this?&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Let’s first do a quick recap on how DVC handles data files (you can definitely
find more information on the &lt;a href=&quot;http://dvc.org/docs&quot;&gt;DVC documentation site&lt;/a&gt;).&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;When you do &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc add&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;, &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc run&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; or &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc import&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;, DVC puts artifacts (in the case
of &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc run&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;, artifacts == outputs produced by the command) into the &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;.dvc/cache&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;
directory (the default cache location). You don’t see this happening because
&lt;a href=&quot;https://dvc.org/doc/user-guide/large-dataset-optimization&quot;&gt;DVC keeps links&lt;/a&gt;
(or in certain cases creates a copy) to these files/directories.&lt;/li&gt;
&lt;li&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc push&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; does not move files from the workspace (that is, what you see) to the
remote storage; it always moves files/directories that are already in the cache
(the default is .dvc/cache).&lt;/li&gt;
&lt;li&gt;So, now you’ve added a file manually, or made some other modifications. But
these files are not in the cache yet. The analogy would be &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;git commit&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;. You
change the file, you do &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;git commit&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;, and only after that can you push something
to a Git server (GitHub/GitLab, etc.). The difference is that DVC does the commit
(moves files to the cache) automatically in certain cases, such as &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc add&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;, &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc run&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;,
etc.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;There is an explicit command, &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc commit&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;, that you should run if you want to
enforce the change to the output produced by &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc run&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;. This command will update
the corresponding DVC-files (.dvc extension) and will move the data to the cache. After
that you should be able to run &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc push&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; to save your data on the external
storage.&lt;/p&gt;
&lt;p&gt;Note that when you do an explicit commit like this, you are potentially “breaking”
reproducibility, in the sense that there is no longer any guarantee that your
directory can be produced by &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc run&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;/&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc repro&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;, since you changed it
manually.&lt;/p&gt;
&lt;h3&gt;Q: &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/485596304961962003/578898899469729796&quot;&gt;I’d like to transform my dataset in-place to avoid copying it, but I can’t use &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc run&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; to do this because it doesn’t allow the same directory as an output and a dependency.&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;You could do this in one step (one stage), so that getting your data and
modifying it happen in a single stage. That way you don’t depend on the data folder;
you depend only on your download-and-modify script.&lt;/p&gt;
&lt;h3&gt;Q: &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/485596304961962003/579283950778712076&quot;&gt;Can anyone tell me what this error message is about?&lt;/a&gt; “To avoid unpredictable behavior, rerun command with non overlapping outs paths.”&lt;/h3&gt;
&lt;p&gt;Most likely it means that there is a DVC-file that has the same output twice,
or that there are two DVC-files that share the same output file.&lt;/p&gt;
&lt;h3&gt;Q: &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/485596304961962003/580176327701823498&quot;&gt;I’m getting “No such file or directory” error when I do &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc run&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; or &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc repro&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;&lt;/a&gt;. The command runs fine if I don’t use DVC.&lt;/h3&gt;
&lt;p&gt;That happens because dvc run tries to ensure that your command is the one
creating your output, so it removes existing outputs before executing the command.
That way, when you run &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc repro&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; later, it will be able to fully reproduce the
output. So you need to make the script create the directory or file itself.&lt;/p&gt;
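&lt;p&gt;As a rough illustration of the advice above, a script can create its own output directory before writing to it. This is only a sketch: the function, directory and file names are hypothetical and not part of DVC.&lt;/p&gt;

```python
import os
import tempfile


def write_output(out_dir, name, text):
    """Create out_dir if it is missing, then write the output file."""
    os.makedirs(out_dir, exist_ok=True)  # safe whether or not the directory exists
    out_path = os.path.join(out_dir, name)
    with open(out_path, "w", encoding="utf-8") as handle:
        handle.write(text)
    return out_path


def demo():
    """Write into a directory that does not exist yet, as after outputs are removed."""
    with tempfile.TemporaryDirectory() as workdir:
        out_dir = os.path.join(workdir, "artifacts")  # hypothetical output folder
        out_path = write_output(out_dir, "model.txt", "weights")
        with open(out_path, encoding="utf-8") as handle:
            return handle.read()
```

&lt;p&gt;With exist_ok=True the same script works both on a fresh checkout and after the output folder has been removed.&lt;/p&gt;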
&lt;h3&gt;Q: &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/485596304961962003/581256265234251776&quot;&gt;I’m implementing a CI/CD and I would like to simplify my CI/CD or even my training code (keeping them cloud agnostic) by using &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc pull&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; inside my Docker container when initializing a training job. &lt;/a&gt; Can DVC be used in this way?&lt;/h3&gt;
&lt;p&gt;Yes, it’s definitely a valid case for DVC. There are different ways of
organizing the storage that training machines use to access data, from the
very simple (using a local storage volume and pulling data from the remote
storage every time) to using NAS or EFS to store a shared DVC cache.&lt;/p&gt;
&lt;h3&gt;Q: &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/563406153334128681/598866528984891403&quot;&gt;I was able to follow the getting started examples, however now I am trying to push my data to Github, I keep getting the following error: “ERROR: failed to push data to the cloud — upload is not supported by https remote”.&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;HTTP remotes do not support upload yet. The example Get Started repository uses
HTTP to keep it read-only and to abstract the actual storage provider we are using
internally. If you check the remote URL, you should see that it is an
S3 bucket, and AWS provides an HTTP endpoint to read data from it.&lt;/p&gt;
&lt;h3&gt;Q: I’m looking to configure AWS S3 as a storage for DVC. I’ve set up the remotes and initialized dvc in the git repository. I tried testing it by pushing a dataset in the form of an excel file. The command completed without any issues but this is what I’m seeing in S3. &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/485596304961962003/585967551708921856&quot;&gt;DVC seems to have created a subdirectory in the intended directory called “35” where it placed this file with a strange name.&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;This is not an issue; it is an implementation detail. There is currently no way to
upload files with their original filenames (in this case, the S3 bucket will
have the file &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;data.csv&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; but under another name, &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;20/893143…&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;). The reason behind
this decision is that we want to store a file only once, no matter how many
dataset versions it’s used in. Also, it’s a reliable way to uniquely identify
the file: you don’t have to worry that someone has created a file with
the same name (path) but different content.&lt;/p&gt;
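&lt;p&gt;The idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not DVC’s code: the helper names are hypothetical, the in-memory dict stands in for a bucket, and the two-character directory split is inferred from the names shown in the question.&lt;/p&gt;

```python
import hashlib


def cache_relpath(content):
    """Map file bytes to a hash-derived relative path: 2 hex chars, then the rest."""
    checksum = hashlib.md5(content).hexdigest()
    return checksum[:2] + "/" + checksum[2:]


def store(objects, content):
    """Put content into a dict keyed by its hash path; identical bytes share one entry."""
    key = cache_relpath(content)
    objects[key] = content
    return key
```

&lt;p&gt;Storing the same bytes twice yields the same key, so only one copy is kept, while any change to the content produces a different key.&lt;/p&gt;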
&lt;h3&gt;Q: &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/563406153334128681/587730054893666326&quot;&gt;Is it possible to only have a shared ‘local’ cache and no remote?&lt;/a&gt; I’m trying to figure out how to use this in a 40 node cluster which already has very fast NFS storage across all the nodes. Not storing everything twice seems desirable. Esp. for the multi-TB input data&lt;/h3&gt;
&lt;p&gt;Yes, and it’s actually a very common use case. All you need to do is
use the dvc cache dir command to set up an external cache. There are a few caveats,
though. Please read
&lt;a href=&quot;https://discuss.dvc.org/t/share-nas-data-in-server/180/4?u=shcheklein&quot;&gt;this link&lt;/a&gt;
for an example of the workflow.&lt;/p&gt;
&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;hr&gt;&lt;/body&gt;&lt;/html&gt;
&lt;p&gt;If you have any questions, concerns or ideas, let us know in the comments below
or connect with the DVC team &lt;a href=&quot;https://dvc.org/support&quot;&gt;here&lt;/a&gt;. Our
&lt;a href=&quot;https://twitter.com/DVCorg&quot;&gt;DMs on Twitter&lt;/a&gt; are always open, too.&lt;/p&gt;</content:encoded></item><item><title><![CDATA[June ’19 DVC❤️Heartbeat]]></title><link>https://blog.dvc.org/june-19-dvc-heartbeat</link><guid isPermaLink="false">https://blog.dvc.org/june-19-dvc-heartbeat</guid><pubDate>Wed, 26 Jun 2019 00:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;News and links&lt;/h2&gt;
&lt;p&gt;We want to start by saying to our users, contributors, and community members how
grateful we are for the fantastic work you are doing contributing to DVC, giving
talks about DVC, sharing your feedback, use cases and your concerns. A huge
thank you to each of you from the DVC team!&lt;/p&gt;
&lt;p&gt;We would love to give back and support any positive initiative around DVC — just
let us know &lt;a href=&quot;https://dvc.org/support&quot;&gt;here&lt;/a&gt; and we will send you a bunch of cool
swag, connect to a tech expert or find another way to support your project. Our
&lt;a href=&quot;https://twitter.com/DVCorg&quot;&gt;DMs on Twitter&lt;/a&gt; are open, too.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;And if you have 4 minutes to spare, we are conducting our first
&lt;a href=&quot;https://docs.google.com/forms/d/1tmn8YHLUkeSi5AIq4DGJi28iZy9HTazl6DWKe3Hxpnc/edit?ts=5cfc47c2&quot;&gt;DVC user survey&lt;/a&gt;
and would love to hear from you!&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Aside from admiring great DVC-related content from our users we have one more
reason to particularly enjoy the past month — DVC team went to Cleveland to
attend &lt;a href=&quot;https://us.pycon.org/2019/about/&quot;&gt;PyCon 2019&lt;/a&gt; and it was a blast!&lt;/p&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;span class=&quot;gatsby-resp-image-wrapper&quot; style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto;  max-width: 700px;&quot;&gt;
      &lt;a class=&quot;gatsby-resp-image-link&quot; href=&quot;/static/b123f78f23b67bb29be863d7452154a3/2d501/cleveland-to-attend-pycon-2019.jpg&quot; style=&quot;display: block&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;
    &lt;span class=&quot;gatsby-resp-image-background-image&quot; style=&quot;padding-bottom: 133.33333333333331%; position: relative; bottom: 0; left: 0; background-image: url(&amp;#x27;data:image/jpeg;base64,/9j/2wBDABALDA4MChAODQ4SERATGCgaGBYWGDEjJR0oOjM9PDkzODdASFxOQERXRTc4UG1RV19iZ2hnPk1xeXBkeFxlZ2P/2wBDARESEhgVGC8aGi9jQjhCY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2P/wgARCAAbABQDASIAAhEBAxEB/8QAGAABAQEBAQAAAAAAAAAAAAAABAADBQL/xAAWAQEBAQAAAAAAAAAAAAAAAAABAAL/2gAMAwEAAhADEAAAAdR9UIjsJOoMi8vmbF//xAAcEAADAAMAAwAAAAAAAAAAAAABAgMAERIEMTL/2gAIAQEAAQUCWSMGOj0cWSc1UGrDlp+WglehdzJtz+rYff8A/8QAFhEAAwAAAAAAAAAAAAAAAAAAABAR/9oACAEDAQE/AXD/xAAXEQEAAwAAAAAAAAAAAAAAAAABABAR/9oACAECAQE/AQvZ/8QAHBAAAgEFAQAAAAAAAAAAAAAAAAECEBESITFB/9oACAEBAAY/AlLPw4cEONtDRG/TKOlakSNP/8QAHxAAAgICAQUAAAAAAAAAAAAAAREAITFBURBhcYHw/9oACAEBAAE/IUQ2FKbqgU4T4JeoQROuTHAYYlytQDtohqConYHzGjswAWcmAPncdL//2gAMAwEAAgADAAAAELAHj//EABcRAQEBAQAAAAAAAAAAAAAAAAEAEBH/2gAIAQMBAT8QZXOL/8QAFxEAAwEAAAAAAAAAAAAAAAAAAAERIf/aAAgBAgEBPxBTIix4Wf/EAB8QAQADAAIBBQAAAAAAAAAAAAEAESExUXFBYYGRof/aAAgBAQABPxBfBk9DI+GpqtPuJW7skWDlNX0eYbyBauAl2vm44FToR5gb4K2oLR75UTWHrqFCz5MmwKNrnMFxZZ+kpYOKgLc//9k=&amp;#x27;); background-size: cover; display: block;&quot;&gt;&lt;/span&gt;
  &lt;picture&gt;
        &lt;source srcset=&quot;/static/b123f78f23b67bb29be863d7452154a3/c54d4/cleveland-to-attend-pycon-2019.webp 175w, /static/b123f78f23b67bb29be863d7452154a3/a3432/cleveland-to-attend-pycon-2019.webp 350w, /static/b123f78f23b67bb29be863d7452154a3/426ac/cleveland-to-attend-pycon-2019.webp 700w, /static/b123f78f23b67bb29be863d7452154a3/c139f/cleveland-to-attend-pycon-2019.webp 1050w, /static/b123f78f23b67bb29be863d7452154a3/7f403/cleveland-to-attend-pycon-2019.webp 1400w, /static/b123f78f23b67bb29be863d7452154a3/e72c3/cleveland-to-attend-pycon-2019.webp 3000w&quot; sizes=&quot;(max-width: 700px) 100vw, 700px&quot; type=&quot;image/webp&quot;&gt;
        &lt;source srcset=&quot;/static/b123f78f23b67bb29be863d7452154a3/8dc06/cleveland-to-attend-pycon-2019.jpg 175w, /static/b123f78f23b67bb29be863d7452154a3/f4417/cleveland-to-attend-pycon-2019.jpg 350w, /static/b123f78f23b67bb29be863d7452154a3/571ad/cleveland-to-attend-pycon-2019.jpg 700w, /static/b123f78f23b67bb29be863d7452154a3/566e2/cleveland-to-attend-pycon-2019.jpg 1050w, /static/b123f78f23b67bb29be863d7452154a3/3a5dd/cleveland-to-attend-pycon-2019.jpg 1400w, /static/b123f78f23b67bb29be863d7452154a3/2d501/cleveland-to-attend-pycon-2019.jpg 3000w&quot; sizes=&quot;(max-width: 700px) 100vw, 700px&quot; type=&quot;image/jpeg&quot;&gt;
        &lt;img class=&quot;gatsby-resp-image-image&quot; src=&quot;/static/b123f78f23b67bb29be863d7452154a3/571ad/cleveland-to-attend-pycon-2019.jpg&quot; alt=&quot;cleveland to attend pycon 2019&quot; title=&quot;cleveland to attend pycon 2019&quot; loading=&quot;lazy&quot; style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;&gt;
      &lt;/picture&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/body&gt;&lt;/html&gt; &lt;em&gt;Amazing
&lt;a href=&quot;https://github.com/sureL&quot;&gt;Jennifer&lt;/a&gt; and her artwork for our
&lt;a href=&quot;https://twitter.com/hashtag/SupportOpenSource&quot;&gt;SupportOpenSource&lt;/a&gt; contest&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;We had it all. Running our first ever conference booth, leading an impromptu
unconference discussion and arranging some cool
&lt;a href=&quot;https://twitter.com/hashtag/SupportOpenSource?src=hashtag_click&quot;&gt;#SupportOpenSource&lt;/a&gt;
activities were great! Last-minute accommodation cancellations, booth equipment
delivery issues, and being late for our very own talk were not so great. We will be
sharing more about it in a separate blog post soon.&lt;/p&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;iframe width=&quot;100%&quot; height=&quot;315&quot; src=&quot;https://www.youtube-nocookie.com/embed/jkfh2PM5Sz8?rel=0&quot; frameborder=&quot;0&quot; allow=&quot;accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture&quot; allowfullscreen&gt;&lt;/iframe&gt;&lt;/body&gt;&lt;/html&gt;&lt;/p&gt;
&lt;p&gt;Here is &lt;a href=&quot;https://twitter.com/FullStackML&quot;&gt;Dmitry Petrov&lt;/a&gt;’s PyCon
&lt;a href=&quot;https://www.youtube.com/watch?v=jkfh2PM5Sz8&quot;&gt;talk&lt;/a&gt; and
&lt;a href=&quot;https://docs.google.com/presentation/d/1CYt0w8WoZAXiQEtVDVDsTnQumzdZx91v32MwEK20R-E/edit&quot;&gt;slides&lt;/a&gt;
on Machine learning model and dataset versioning practices.&lt;/p&gt;
&lt;p&gt;We absolutely loved being at PyCon and can’t wait for our next conference!&lt;/p&gt;
&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;hr&gt;&lt;/body&gt;&lt;/html&gt;
&lt;p&gt;Our team is so happy every time we discover an article featuring DVC or
addressing one of the burning ML issues we are trying to solve. Here are some of
the links that caught our eye this past month:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://towardsdatascience.com/the-rise-of-dataops-from-the-ashes-of-data-governance-da3e0c3ac2c4&quot;&gt;The Rise of DataOps (from the ashes of Data Governance)&lt;/a&gt;
by &lt;a href=&quot;https://towardsdatascience.com/@ryanwgross&quot;&gt;Ryan Gross&lt;/a&gt;.&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A brilliant, comprehensive read on current data management issues. It might
be the best article we have ever read on this subject. Every word strongly
resonates with our vision and the ideas behind DVC. Highly recommended by the DVC team!&lt;/p&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;section class=&quot;elp-content-holder&quot;&gt;
      &lt;a href=&quot;https://towardsdatascience.com/the-rise-of-dataops-from-the-ashes-of-data-governance-da3e0c3ac2c4&quot; class=&quot;external-link-preview&quot;&gt;
          &lt;div class=&quot;elp-description-holder&quot;&gt;
            &lt;h4 class=&quot;elp-title&quot;&gt;The Rise of DataOps (from the ashes of Data Governance)&lt;/h4&gt;
            &lt;div class=&quot;elp-description&quot;&gt;Legacy Data Governance is broken in the ML era. Let’s rebuild it as an engineering discipline to drive…&lt;/div&gt;
            &lt;div class=&quot;elp-link&quot;&gt;towardsdatascience.com&lt;/div&gt;
          &lt;/div&gt;
           &lt;div class=&quot;elp-image-holder&quot;&gt;
                &lt;img src=&quot;/uploads/images/2019-06-26/the-rise-of-data-ops.png&quot; alt=&quot;The Rise of DataOps (from the ashes of Data Governance)&quot;&gt;
            &lt;/div&gt;
      &lt;/a&gt;
    &lt;/section&gt;
    &lt;/body&gt;&lt;/html&gt;&lt;/body&gt;&lt;/html&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Legacy Data Governance is broken in the ML era. Let’s rebuild it as an
engineering discipline. At the end of the transformation, data governance will
look a lot more like DevOps, with data stewards, scientists, and engineers
working closely together to codify the governance policies.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/@christopher.samiullah/first-impressions-of-data-science-version-control-dvc-fe96ab29cdda&quot;&gt;First Impressions of Data Science Version Control (DVC)&lt;/a&gt;
by &lt;a href=&quot;https://christophergs.github.io/&quot;&gt;Christopher Samiullah&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;section class=&quot;elp-content-holder&quot;&gt;
      &lt;a href=&quot;https://medium.com/@christopher.samiullah/first-impressions-of-data-science-version-control-dvc-fe96ab29cdda&quot; class=&quot;external-link-preview&quot;&gt;
          &lt;div class=&quot;elp-description-holder&quot;&gt;
            &lt;h4 class=&quot;elp-title&quot;&gt;First Impressions of Data Science Version Control (DVC)&lt;/h4&gt;
            &lt;div class=&quot;elp-description&quot;&gt;A Powerful New Machine Learning Tool&lt;/div&gt;
            &lt;div class=&quot;elp-link&quot;&gt;medium.com&lt;/div&gt;
          &lt;/div&gt;
           &lt;div class=&quot;elp-image-holder&quot;&gt;
                &lt;img src=&quot;/uploads/images/2019-06-26/first-impressions-of-data-science-version-control.png&quot; alt=&quot;First Impressions of Data Science Version Control (DVC)&quot;&gt;
            &lt;/div&gt;
      &lt;/a&gt;
    &lt;/section&gt;
    &lt;/body&gt;&lt;/html&gt;&lt;/body&gt;&lt;/html&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In 2019, we tend to find organizations using a mix of git, Makefiles, ad hoc
scripts and reference files to try and achieve reproducibility. DVC enters
this mix offering a cleaner solution, specifically targeting Data Science
challenges.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://github.com/peopledoc/mlv-tools-tutorial&quot;&gt;Versioning and Reproducibility with MLV-tools and DVC&lt;/a&gt;:
&lt;a href=&quot;https://peopledoc.github.io/mlv-tools-tutorial/talks/pyData/presentation.html#/&quot;&gt;Talk&lt;/a&gt;
and
&lt;a href=&quot;https://peopledoc.github.io/mlv-tools-tutorial/talks/workshop/presentation.html#/&quot;&gt;Tutorial&lt;/a&gt;
by &lt;a href=&quot;https://github.com/sbracaloni&quot;&gt;Stéphanie Bracaloni&lt;/a&gt; and
&lt;a href=&quot;https://github.com/SdgJlbl&quot;&gt;Sarah Diot-Girard&lt;/a&gt;.&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;span class=&quot;gatsby-resp-image-wrapper&quot; style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto;  max-width: 700px;&quot;&gt;
      &lt;a class=&quot;gatsby-resp-image-link&quot; href=&quot;/static/72397df92519affe8d30d67d72539d3f/2feb5/versioning-and-reproducibility-with-mlv-tools.png&quot; style=&quot;display: block&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;
    &lt;span class=&quot;gatsby-resp-image-background-image&quot; style=&quot;padding-bottom: 45.45454545454545%; position: relative; bottom: 0; left: 0; background-image: url(&amp;#x27;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAJCAIAAAC9o5sfAAAACXBIWXMAAAsSAAALEgHS3X78AAABxklEQVQoz02Qy27TQBSG8wxtigIFGid2Q1tQBR43q+LYTpukCbHHHl8S26kdx21Um/iaNhUFse8KiS1LtrwBD8LbMC6oivRr9J0z5zsjTalGgSPuiCCZSoV+DZjDJrP/ltk7ZHYIQB0w1RogKEDtM/jE/Xqj4K0y/S+lFzvg4B0epcubdLUOyL3Cabxhnr+k8V58i7dg59k2qL8qFhEk2NwA/+XyBmgyhsB7Au92ujbfclvHDnvsCoJ7cjrhWJdjHY6ddno2x5632KnAO2ei2W57RJUtPdkCZ/352LqZTCPLSRC8tr07016NrcS0Y/P8E47tLpx5bHlfDCP3rkJrEo+MFUUJhdzt+pqaLe/H33/3psGFOvrsBMnXH7LlBkhdueG1F/pQyq3ZnZ/MNT2SYappOUnyD3LPl8Wb22+jn3/eB7ee9GFpOov7X8PZ4hIO8/FkMYvmCCWakfjxpabHBT/KnY6PC0VO9VGMVJwEszjIVbWYU2Amy6mqx4qcQSnTjEhR1l/uXuj6EqFMktL+MIBKrKpZX7zqi8FADDEPYSjCj1iASjSQQoRSPE+RQgn/eI1oNXZPdqk2Dm49wgML6811flpp/gXV7JpbWQkJgwAAAABJRU5ErkJggg==&amp;#x27;); background-size: cover; display: block;&quot;&gt;&lt;/span&gt;
  &lt;picture&gt;
        &lt;source srcset=&quot;/static/72397df92519affe8d30d67d72539d3f/c54d4/versioning-and-reproducibility-with-mlv-tools.webp 175w, /static/72397df92519affe8d30d67d72539d3f/a3432/versioning-and-reproducibility-with-mlv-tools.webp 350w, /static/72397df92519affe8d30d67d72539d3f/426ac/versioning-and-reproducibility-with-mlv-tools.webp 700w, /static/72397df92519affe8d30d67d72539d3f/c139f/versioning-and-reproducibility-with-mlv-tools.webp 1050w, /static/72397df92519affe8d30d67d72539d3f/7f403/versioning-and-reproducibility-with-mlv-tools.webp 1400w, /static/72397df92519affe8d30d67d72539d3f/4b6df/versioning-and-reproducibility-with-mlv-tools.webp 2266w&quot; sizes=&quot;(max-width: 700px) 100vw, 700px&quot; type=&quot;image/webp&quot;&gt;
        &lt;source srcset=&quot;/static/72397df92519affe8d30d67d72539d3f/17006/versioning-and-reproducibility-with-mlv-tools.png 175w, /static/72397df92519affe8d30d67d72539d3f/d6f3f/versioning-and-reproducibility-with-mlv-tools.png 350w, /static/72397df92519affe8d30d67d72539d3f/69344/versioning-and-reproducibility-with-mlv-tools.png 700w, /static/72397df92519affe8d30d67d72539d3f/b1f9d/versioning-and-reproducibility-with-mlv-tools.png 1050w, /static/72397df92519affe8d30d67d72539d3f/3fc71/versioning-and-reproducibility-with-mlv-tools.png 1400w, /static/72397df92519affe8d30d67d72539d3f/2feb5/versioning-and-reproducibility-with-mlv-tools.png 2266w&quot; sizes=&quot;(max-width: 700px) 100vw, 700px&quot; type=&quot;image/png&quot;&gt;
        &lt;img class=&quot;gatsby-resp-image-image&quot; src=&quot;/static/72397df92519affe8d30d67d72539d3f/69344/versioning-and-reproducibility-with-mlv-tools.png&quot; alt=&quot;versioning and reproducibility with mlv tools&quot; title=&quot;versioning and reproducibility with mlv tools&quot; loading=&quot;lazy&quot; style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;&gt;
      &lt;/picture&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/body&gt;&lt;/html&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://www.oreilly.com/ideas/becoming-a-machine-learning-company-means-investing-in-foundational-technologies&quot;&gt;Becoming a machine learning company means investing in foundational technologies&lt;/a&gt;
by &lt;a href=&quot;https://www.oreilly.com/people/4e7ad-ben-lorica&quot;&gt;Ben Lorica&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;section class=&quot;elp-content-holder&quot;&gt;
      &lt;a href=&quot;https://www.oreilly.com/ideas/becoming-a-machine-learning-company-means-investing-in-foundational-technologies&quot; class=&quot;external-link-preview&quot;&gt;
          &lt;div class=&quot;elp-description-holder&quot;&gt;
            &lt;h4 class=&quot;elp-title&quot;&gt;Becoming a machine learning company means investing in foundational technologies&lt;/h4&gt;
            &lt;div class=&quot;elp-description&quot;&gt;Get expert knowledge on the tools and technologies you need to put your data strategies to work. Join us at the…&lt;/div&gt;
            &lt;div class=&quot;elp-link&quot;&gt;oreilly.com&lt;/div&gt;
          &lt;/div&gt;
           &lt;div class=&quot;elp-image-holder&quot;&gt;
                &lt;img src=&quot;/uploads/images/2019-06-26/becoming-a-machine-learning-company.jpeg&quot; alt=&quot;Becoming a machine learning company means investing in foundational technologies&quot;&gt;
            &lt;/div&gt;
      &lt;/a&gt;
    &lt;/section&gt;
    &lt;/body&gt;&lt;/html&gt;&lt;/body&gt;&lt;/html&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;With an eye toward the growing importance of machine learning, we recently
completed
&lt;a href=&quot;https://www.oreilly.com/data/free/evolving-data-infrastructure.csp&quot;&gt;a data infrastructure survey&lt;/a&gt;
that drew more than 3,200 respondents.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;hr&gt;&lt;/body&gt;&lt;/html&gt;
&lt;h2&gt;Discord gems&lt;/h2&gt;
&lt;p&gt;There are lots of hidden gems in our Discord community discussions. Sometimes
they are scattered all over the channels and hard to track down.&lt;/p&gt;
&lt;p&gt;We are sifting through the issues and discussions and sharing the most
interesting takeaways with you.&lt;/p&gt;
&lt;h3&gt;Q: &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/563406153334128681/575655655629651968&quot;&gt;Does DVC support Azure Data Lake Gen1?&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Azure Data Lake is HDFS-compatible, and DVC supports HDFS remotes. Give it a try
and let us know if you hit any problems &lt;a href=&quot;https://dvc.org/chat&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;Q: &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/563406153334128681/575681811401801748&quot;&gt;An excellent discussion on versioning tabular (SQL) data.&lt;/a&gt; Do you know of any tools that deal better with SQL-specific versioning?&lt;/h3&gt;
&lt;p&gt;It’s a wide topic. The actual solution might depend on a specific scenario and
what exactly needs to be versioned. DVC does not provide any special
functionality on top of databases to version their content.&lt;/p&gt;
&lt;p&gt;Depending on your use case, our recommendation would be to run the SQL query
and export the result to a file (CSV or TSV, for example) that can then be used
for analysis. This file can be put under DVC control. Alternatively, in certain
cases the source files (that are used to populate the databases) can be put
under DVC control, so that we keep versions of them or track incoming updates.&lt;/p&gt;
&lt;p&gt;Read the
&lt;a href=&quot;https://discordapp.com/channels/485586884165107732/563406153334128681/575681811401801748&quot;&gt;discussion&lt;/a&gt;
to learn more.&lt;/p&gt;
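&lt;p&gt;A minimal sketch of the export step, assuming a SQLite database and only
Python’s standard library. The function name, the query, and the file path are
hypothetical; they only illustrate turning a query result into a file that DVC
can track.&lt;/p&gt;

```python
import csv
import sqlite3


def export_query_to_csv(conn, query, csv_path):
    """Run `query` on `conn` and write the header plus all rows to `csv_path`.

    Hypothetical helper for illustration; not part of DVC.
    """
    cursor = conn.execute(query)
    # cursor.description holds one tuple per column; the first item is its name.
    header = [column[0] for column in cursor.description]
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(cursor)
    return csv_path
```

&lt;p&gt;The resulting file could then be tracked with &lt;code class=&quot;language-text&quot;&gt;dvc add&lt;/code&gt;,
so that each export becomes a versioned snapshot of the query result.&lt;/p&gt;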
&lt;h3&gt;Q: &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/563406153334128681/575686711821205504&quot;&gt;How does DVC do the versioning between binary files?&lt;/a&gt; Is there a binary diff, similar to git? Or is every version stored distinctly in full?&lt;/h3&gt;
&lt;p&gt;DVC just saves every file as is; we don’t use binary diffs right now.
However, adding just a few files to a directory with 10M files won’t duplicate
the whole directory, since we treat every file inside as a separate
entity.&lt;/p&gt;
&lt;h3&gt;Q: &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/563406153334128681/576160840701575169&quot;&gt;Is there a way to pass parameters from e.g. &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc repro&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; to stages?&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The simplest option is to create a config file (JSON or whatnot) that your
scripts read and your stages depend on.&lt;/p&gt;
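&lt;p&gt;As a rough sketch of that idea, a stage script could load its settings from a
JSON file. The file name &lt;code class=&quot;language-text&quot;&gt;params.json&lt;/code&gt; and the keys
&lt;code class=&quot;language-text&quot;&gt;learning_rate&lt;/code&gt; and &lt;code class=&quot;language-text&quot;&gt;epochs&lt;/code&gt;
are assumptions made up for this example.&lt;/p&gt;

```python
import json


def load_params(path="params.json"):
    """Read stage parameters from a JSON config file (hypothetical file name)."""
    with open(path) as f:
        return json.load(f)


def train(params):
    # Placeholder for real training logic: it only reports the settings
    # it would use. The keys are hypothetical.
    return "lr={learning_rate} epochs={epochs}".format(**params)
```

&lt;p&gt;If the config file is declared as a dependency of the stage (for example with
the &lt;code class=&quot;language-text&quot;&gt;-d&lt;/code&gt; option of &lt;code class=&quot;language-text&quot;&gt;dvc run&lt;/code&gt;),
editing a value in it should cause the stage to be re-executed on the next
&lt;code class=&quot;language-text&quot;&gt;dvc repro&lt;/code&gt;.&lt;/p&gt;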
&lt;h3&gt;Q: &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/563406153334128681/577852740034625576&quot;&gt;What is the best way to get cached output files from different branches simultaneously?&lt;/a&gt; For example, cached tensorboard files from different branches to compare experiments.&lt;/h3&gt;
&lt;p&gt;There is a way to do that through our (still not officially released) API pretty
easily. Here is an
&lt;a href=&quot;https://cdn.discordapp.com/attachments/563406153334128681/577894682722304030/dvc_get_output_files.py&quot;&gt;example script&lt;/a&gt;
showing how it could be done.&lt;/p&gt;
&lt;h3&gt;Q: &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/563406153334128681/583949033685516299&quot;&gt;Docker and DVC.&lt;/a&gt; To be able to push/pull data, we need to run a git clone to get DVC-files and remote definitions, but we worry that would make the container quite heavy (since it contains our entire project history).&lt;/h3&gt;
&lt;p&gt;You can do &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;git clone --depth 1&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;, which will not download any history except the
latest commit.&lt;/p&gt;
&lt;h3&gt;Q: &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/485596304961962003/574133734136086559&quot;&gt;After DVC pushing the same file, it creates multiple copies of the same file. Is that how it’s supposed to work?&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;If you are pushing the same file, no copies are pushed or saved in the
cache. DVC uses checksums to identify files, so if you add the same file
once again, it will detect that the file is already in the local cache and
won’t copy it there again. The same goes for dvc push: if it sees that a cache
file with that checksum already exists on your remote, it won’t upload it again.&lt;/p&gt;
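&lt;p&gt;The idea can be illustrated with a toy content-addressed cache. This is not
DVC’s actual implementation: the flat cache layout and the function names are
assumptions, and MD5 is used here simply as the checksum.&lt;/p&gt;

```python
import hashlib
import os
import shutil


def file_md5(path, chunk_size=1024 * 1024):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def add_to_cache(path, cache_dir):
    """Copy `path` into `cache_dir` under its checksum unless it is already there.

    Returns (checksum, copied); `copied` is False when the content was a duplicate.
    Hypothetical helper for illustration only.
    """
    checksum = file_md5(path)
    target = os.path.join(cache_dir, checksum)
    if os.path.exists(target):
        # Same content already cached: nothing to copy.
        return checksum, False
    os.makedirs(cache_dir, exist_ok=True)
    shutil.copyfile(path, target)
    return checksum, True
```

&lt;p&gt;Adding the same file twice yields the same checksum, so the second call finds
the existing entry and skips the copy. A push to a remote can apply the same
existence check before uploading.&lt;/p&gt;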
&lt;h3&gt;Q: &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/485596304961962003/574941227624169492&quot;&gt;How do I uninstall DVC on Mac (installed via &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;pkg&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; installer)?&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Something like this should work:&lt;/p&gt;
&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;dvc&quot;&gt;&lt;pre class=&quot;language-dvc&quot;&gt;&lt;code class=&quot;language-dvc&quot;&gt;&lt;span class=&quot;token line&quot;&gt;&lt;span class=&quot;token input&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;token command&quot;&gt;which&lt;/span&gt; dvc
&lt;/span&gt;/usr/local/bin/dvc

&lt;span class=&quot;token line&quot;&gt;&lt;span class=&quot;token input&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;token command&quot;&gt;ls&lt;/span&gt; -la /usr/local/bin/dvc
&lt;/span&gt;/usr/local/bin/dvc -&gt; /usr/local/lib/dvc/dvc

&lt;span class=&quot;token line&quot;&gt;&lt;span class=&quot;token input&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;token command&quot;&gt;sudo&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;rm&lt;/span&gt; -f /usr/local/bin/dvc
&lt;/span&gt;&lt;span class=&quot;token line&quot;&gt;&lt;span class=&quot;token input&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;token command&quot;&gt;sudo&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;rm&lt;/span&gt; -rf /usr/local/lib/dvc
&lt;/span&gt;&lt;span class=&quot;token line&quot;&gt;&lt;span class=&quot;token input&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;token command&quot;&gt;sudo&lt;/span&gt; pkgutil --forget com.iterative.dvc&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/body&gt;&lt;/html&gt;
&lt;h3&gt;Q: &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/485596304961962003/575236576309674024&quot;&gt;How do I pull from a public S3 bucket (that contains DVC remote)?&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Just add the public URL of the bucket as an HTTP remote. See
&lt;a href=&quot;https://github.com/iterative/example-get-started/blob/master/.dvc/config&quot;&gt;here&lt;/a&gt;
for an example.
&lt;a href=&quot;https://remote.dvc.org/get-started&quot;&gt;https://remote.dvc.org/get-started&lt;/a&gt; is made
to redirect to the S3 bucket anyone can read from.&lt;/p&gt;
&lt;h3&gt;Q: &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/485596304961962003/575535709490905101&quot;&gt;I’m getting the same error over and over about locking:&lt;/a&gt; &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;ERROR: failed to lock before running a command — cannot perform the cmd since DVC is busy and locked. Please retry the command later.&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;&lt;/h3&gt;
&lt;p&gt;Most likely this happens because DVC is running on an NFS mount that has some
configuration problems. There is a
&lt;a href=&quot;https://github.com/iterative/dvc/issues/1918&quot;&gt;well-known problem with DVC on NFS&lt;/a&gt;:
sometimes it hangs while trying to lock a file. The usual workaround is to keep
the DVC cache on NFS, but run the project (git clone, DVC metafiles, etc.) on
the local file system. Read
&lt;a href=&quot;https://discuss.dvc.org/t/share-nas-data-in-server/180/4?u=shcheklein&quot;&gt;this answer&lt;/a&gt;
to see how it can be set up.&lt;/p&gt;
&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;hr&gt;&lt;/body&gt;&lt;/html&gt;
&lt;p&gt;If you have any questions, concerns or ideas, let us know in the comments below
or connect with the DVC team &lt;a href=&quot;https://dvc.org/support&quot;&gt;here&lt;/a&gt;. Our
&lt;a href=&quot;https://twitter.com/DVCorg&quot;&gt;DMs on Twitter&lt;/a&gt; are open, too.&lt;/p&gt;</content:encoded></item><item><title><![CDATA[May ’19 DVC❤️Heartbeat]]></title><link>https://blog.dvc.org/may-19-dvc-heartbeat</link><guid isPermaLink="false">https://blog.dvc.org/may-19-dvc-heartbeat</guid><pubDate>Tue, 21 May 2019 00:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;News and links&lt;/h2&gt;
&lt;p&gt;This section of DVC Heartbeat is growing with every new issue, and this is
already quite a good piece of news!&lt;/p&gt;
&lt;p&gt;One of the most exciting things we want to share this month is the acceptance of DVC
into the &lt;a href=&quot;https://developers.google.com/season-of-docs/&quot;&gt;Google Season of Docs&lt;/a&gt;.
It is a new and unique program sponsored by Google that pairs technical writers
with open source projects to collaborate and improve the open source project
documentation. You can find an outline of the DVC vision and project ideas in
&lt;a href=&quot;https://blog.dataversioncontrol.com/dvc-project-ideas-for-google-summer-of-docs-2019-defe3a73b248&quot;&gt;this dedicated blogpost&lt;/a&gt;
and check the
&lt;a href=&quot;https://developers.google.com/season-of-docs/docs/participants/&quot;&gt;full list of participating open source organizations&lt;/a&gt;.
Technically the
&lt;a href=&quot;https://developers.google.com/season-of-docs/docs/timeline&quot;&gt;program is starting in a few months&lt;/a&gt;,
but there is already a fantastic increase in the number of commits and
contributors, and we absolutely love it!&lt;/p&gt;
&lt;p&gt;The other important milestone for us was the first offline meeting of our
distributed remote team. Working side by side and having non-Zoom meetings with
the team was amazing. Joining forces to prepare for the upcoming conferences
turned out to be the most valuable, educational and uniting experience for the
whole team.&lt;/p&gt;
&lt;p&gt;It’s a shame that our tech lead was unable to join us due to another visa
denial. We do hope he will finally make it to the USA for the next big
conference.&lt;/p&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;span class=&quot;gatsby-resp-image-wrapper&quot; style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto;  max-width: 700px;&quot;&gt;
      &lt;a class=&quot;gatsby-resp-image-link&quot; href=&quot;/static/060f8f204b833689b1569a4162d67e3d/6d894/the-world-is-changing.png&quot; style=&quot;display: block&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;
    &lt;span class=&quot;gatsby-resp-image-background-image&quot; style=&quot;padding-bottom: 58.801955990220044%; position: relative; bottom: 0; left: 0; background-image: url(&amp;#x27;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAMCAIAAADtbgqsAAAACXBIWXMAAAsSAAALEgHS3X78AAAB+UlEQVQoz3VSiW7aQBT0L7QY0xASzA0xCYe9mCOcpsQ4uIGkVdKqPggYjA0BQwqNFKnqn/cZQ0SlVhpZs7M7b988LxbgDEpYxFpT1FsH6xNAvDWjb54z4orqWHhOdf8f2ElND3NmpGmmxCVZ0/2VcbhhXHSsZHsR5+dvZnyLQ2Kb3Uh+T3/HkeSiJZyRYIlvFQdETvGwCpGTCdYm7pwM8LB7c6ypZcRZlNNizdE5P0l8HEcaGnxT19MLwaT4SRT0tglKvDkCMckbUW54XFRxZJtH2e4m1npKCha6+0m1rcTV/Ky9oLtrurdhbjfBugH5U50lxc9hLsUvL6G67i1ApwpGtbSUMM2Ks/ytVX1YJ9sGkExnhm7mbM8qfl6d8RP22yv79ZW5fylJvyrq7/PuKsxPfZyJEawKwdw5Bd+mtYMh2VaQHc+NJCKvkvxTRFgAAlez2PUSkBBXIX6O/T39PiR54wTb300YQWnFFlkVZxSckV2MDGRndjGKvzpKf1rS3R/BxuQdLZFVPS0u0+LKXx3D0Q/5frb7nBQWznli96sObvYWH/2VwW6b7R8V+v7yAIhdHaknl4+e/GFrezN05S0NAzU91NCPS0OnSbI2hnYcDiIM+bQyOioM/mE+LWvRpgmvzVfWHEOYM+C1Oru+Sw1Kk9Wxbd5f/gdV/LcK4QQLOwAAAABJRU5ErkJggg==&amp;#x27;); background-size: cover; display: block;&quot;&gt;&lt;/span&gt;
  &lt;picture&gt;
        &lt;source srcset=&quot;/static/060f8f204b833689b1569a4162d67e3d/c54d4/the-world-is-changing.webp 175w, /static/060f8f204b833689b1569a4162d67e3d/a3432/the-world-is-changing.webp 350w, /static/060f8f204b833689b1569a4162d67e3d/426ac/the-world-is-changing.webp 700w, /static/060f8f204b833689b1569a4162d67e3d/c139f/the-world-is-changing.webp 1050w, /static/060f8f204b833689b1569a4162d67e3d/7f403/the-world-is-changing.webp 1400w, /static/060f8f204b833689b1569a4162d67e3d/2ec87/the-world-is-changing.webp 1636w&quot; sizes=&quot;(max-width: 700px) 100vw, 700px&quot; type=&quot;image/webp&quot;&gt;
        &lt;source srcset=&quot;/static/060f8f204b833689b1569a4162d67e3d/17006/the-world-is-changing.png 175w, /static/060f8f204b833689b1569a4162d67e3d/d6f3f/the-world-is-changing.png 350w, /static/060f8f204b833689b1569a4162d67e3d/69344/the-world-is-changing.png 700w, /static/060f8f204b833689b1569a4162d67e3d/b1f9d/the-world-is-changing.png 1050w, /static/060f8f204b833689b1569a4162d67e3d/3fc71/the-world-is-changing.png 1400w, /static/060f8f204b833689b1569a4162d67e3d/6d894/the-world-is-changing.png 1636w&quot; sizes=&quot;(max-width: 700px) 100vw, 700px&quot; type=&quot;image/png&quot;&gt;
        &lt;img class=&quot;gatsby-resp-image-image&quot; src=&quot;/static/060f8f204b833689b1569a4162d67e3d/69344/the-world-is-changing.png&quot; alt=&quot;the world is changing&quot; title=&quot;the world is changing&quot; loading=&quot;lazy&quot; style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;&gt;
      &lt;/picture&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/body&gt;&lt;/html&gt;&lt;/p&gt;
&lt;p&gt;While we were busy finalizing all the PyCon 2019 prep, our own
&lt;a href=&quot;https://twitter.com/FullStackML&quot;&gt;Dmitry Petrov&lt;/a&gt; flew to New York to speak at
the
&lt;a href=&quot;https://conferences.oreilly.com/artificial-intelligence/ai-ny&quot;&gt;O’Reilly AI Conference&lt;/a&gt;
about the
&lt;a href=&quot;https://www.oreilly.com/library/view/artificial-intelligence-conference/9781492050544/video324691.html&quot;&gt;Open Source tools for Machine Learning Models and Datasets versioning&lt;/a&gt;.
Unfortunately, the video is available to registered users only (with a free
trial option), but you can have a look at Dmitry’s slides
&lt;a href=&quot;https://www.slideshare.net/DmitryPetrov15/dvc-oreilly-artificial-intelligence-conference-2019-new-york&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;span class=&quot;gatsby-resp-image-wrapper&quot; style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto;  max-width: 404px;&quot;&gt;
      &lt;a class=&quot;gatsby-resp-image-link&quot; href=&quot;/static/bee9b4ed9981db1bf7eb9db8450fc8d1/38b39/iterative-ai-twitter.png&quot; style=&quot;display: block&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;
    &lt;span class=&quot;gatsby-resp-image-background-image&quot; style=&quot;padding-bottom: 25.247524752475247%; position: relative; bottom: 0; left: 0; background-image: url(&amp;#x27;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAFCAIAAADKYVtkAAAACXBIWXMAAAsSAAALEgHS3X78AAAA2klEQVQY02VQ226DMAzl/z9p0l76tKcxKYxEo0xlpWkLTQm5X0iYNdaHaUeWjmX52McujI/BO29t237WGDdNczz215FdhtvAJgipDZtmPsuU8voXxW5Pd0/Puj+gmpTlW1W9v5YlPV+Usc4aQIzROr+ktP5DceXygzROiRoTTAgsRwh9dZ1UpuOZijSY9URvpKVVfWCzBU1+OCg2yjnf75OUSgghgbSajBh1nO3Cw8rHkVWY01NwHprB/2bkVwxqbWx+zIwp9ursktsqi1DiBSVl88/ZPgR4ASTfmIMa4VKNNzMAAAAASUVORK5CYII=&amp;#x27;); background-size: cover; display: block;&quot;&gt;&lt;/span&gt;
  &lt;picture&gt;
        &lt;source srcset=&quot;/static/bee9b4ed9981db1bf7eb9db8450fc8d1/c54d4/iterative-ai-twitter.webp 175w, /static/bee9b4ed9981db1bf7eb9db8450fc8d1/a3432/iterative-ai-twitter.webp 350w, /static/bee9b4ed9981db1bf7eb9db8450fc8d1/426ac/iterative-ai-twitter.webp 700w, /static/bee9b4ed9981db1bf7eb9db8450fc8d1/2b269/iterative-ai-twitter.webp 808w&quot; sizes=&quot;(max-width: 700px) 100vw, 700px&quot; type=&quot;image/webp&quot;&gt;
        &lt;source srcset=&quot;/static/bee9b4ed9981db1bf7eb9db8450fc8d1/17006/iterative-ai-twitter.png 175w, /static/bee9b4ed9981db1bf7eb9db8450fc8d1/d6f3f/iterative-ai-twitter.png 350w, /static/bee9b4ed9981db1bf7eb9db8450fc8d1/69344/iterative-ai-twitter.png 700w, /static/bee9b4ed9981db1bf7eb9db8450fc8d1/38b39/iterative-ai-twitter.png 808w&quot; sizes=&quot;(max-width: 700px) 100vw, 700px&quot; type=&quot;image/png&quot;&gt;
        &lt;img class=&quot;gatsby-resp-image-image&quot; src=&quot;/static/bee9b4ed9981db1bf7eb9db8450fc8d1/69344/iterative-ai-twitter.png&quot; alt=&quot;iterative ai twitter&quot; title=&quot;iterative ai twitter&quot; loading=&quot;lazy&quot; style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;&gt;
      &lt;/picture&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/body&gt;&lt;/html&gt;&lt;/p&gt;
&lt;p&gt;We renamed our Twitter account! Our old handle was a bit misleading, so we moved from
@Iterativeai to &lt;a href=&quot;https://twitter.com/DVCorg&quot;&gt;@DVCorg&lt;/a&gt; (though we are keeping the old one for
future projects).&lt;/p&gt;
&lt;p&gt;Our team is so happy every time we discover an article featuring DVC or
addressing one of the burning ML issues we are trying to solve. Here are some of
our favorite links from the past month:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://www.pythonpodcast.com/data-version-control-episode-206/&quot;&gt;Version Control For Your Machine Learning Projects — Episode 206&lt;/a&gt;&lt;/strong&gt;
by &lt;strong&gt;&lt;a href=&quot;https://www.linkedin.com/in/tmacey/&quot;&gt;Tobias Macey&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;section class=&quot;elp-content-holder&quot;&gt;
      &lt;a href=&quot;https://www.pythonpodcast.com/data-version-control-episode-206/&quot; class=&quot;external-link-preview&quot;&gt;
          &lt;div class=&quot;elp-description-holder&quot;&gt;
            &lt;h4 class=&quot;elp-title&quot;&gt;Version Control For Machine Learning Projects&lt;/h4&gt;
            &lt;div class=&quot;elp-description&quot;&gt;An interview with the creator of DVC about how it improves collaboration and reduces duplicate effort on data science…&lt;/div&gt;
            &lt;div class=&quot;elp-link&quot;&gt;pythonpodcast.com&lt;/div&gt;
          &lt;/div&gt;
           &lt;div class=&quot;elp-image-holder&quot;&gt;
                &lt;img src=&quot;/uploads/images/2019-05-21/version-control-for-your-machine-learning-projects.png&quot; alt=&quot;Version Control For Machine Learning Projects&quot;&gt;
            &lt;/div&gt;
      &lt;/a&gt;
    &lt;/section&gt;
    &lt;/body&gt;&lt;/html&gt;&lt;/body&gt;&lt;/html&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Version control has become table stakes for any software team, but for machine
learning projects there has been no good answer for tracking all of the data
that goes into building and training models, and the output of the models
themselves. To address that need Dmitry Petrov built the Data Version Control
project known as DVC. In this episode he explains how it simplifies
communication between data scientists, reduces duplicated effort, and
simplifies concerns around reproducing and rebuilding models at different
stages of the projects lifecycle.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Here is an
&lt;a href=&quot;https://towardsdatascience.com/data-version-control-with-dvc-what-do-the-authors-have-to-say-3c3b10f27ee&quot;&gt;article&lt;/a&gt;
by &lt;a href=&quot;https://medium.com/@faviovazquez&quot;&gt;Favio Vázquez&lt;/a&gt; with a transcript of this
podcast episode.&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;section class=&quot;elp-content-holder&quot;&gt;
      &lt;a href=&quot;https://towardsdatascience.com/data-version-control-with-dvc-what-do-the-authors-have-to-say-3c3b10f27ee&quot; class=&quot;external-link-preview&quot;&gt;
          &lt;div class=&quot;elp-description-holder&quot;&gt;
            &lt;h4 class=&quot;elp-title&quot;&gt;Data version control with DVC. What do the authors have to say?&lt;/h4&gt;
            &lt;div class=&quot;elp-description&quot;&gt;Data versioning is one of the most ignored features in data science projects, but that has to change. Here I’ll discuss…&lt;/div&gt;
            &lt;div class=&quot;elp-link&quot;&gt;towardsdatascience.com&lt;/div&gt;
          &lt;/div&gt;
           &lt;div class=&quot;elp-image-holder&quot;&gt;
                &lt;img src=&quot;/uploads/images/2019-05-21/data-version-control-with-dvc.png&quot; alt=&quot;Data version control with DVC. What do the authors have to say?&quot;&gt;
            &lt;/div&gt;
      &lt;/a&gt;
    &lt;/section&gt;
    &lt;/body&gt;&lt;/html&gt;&lt;/body&gt;&lt;/html&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://towardsdatascience.com/why-git-and-git-lfs-is-not-enough-to-solve-the-machine-learning-reproducibility-crisis-f733b49e96e8&quot;&gt;Why Git and Git-LFS is not enough to solve the Machine Learning Reproducibility crisis&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;section class=&quot;elp-content-holder&quot;&gt;
      &lt;a href=&quot;https://towardsdatascience.com/why-git-and-git-lfs-is-not-enough-to-solve-the-machine-learning-reproducibility-crisis-f733b49e96e8&quot; class=&quot;external-link-preview&quot;&gt;
          &lt;div class=&quot;elp-description-holder&quot;&gt;
            &lt;h4 class=&quot;elp-title&quot;&gt;Why Git and Git-LFS is not enough to solve the Machine Learning Reproducibility crisis&lt;/h4&gt;
            &lt;div class=&quot;elp-description&quot;&gt;Some claim the machine learning field is in a crisis due to software tooling that’s insufficient to ensure repeatable…&lt;/div&gt;
            &lt;div class=&quot;elp-link&quot;&gt;towardsdatascience.com&lt;/div&gt;
          &lt;/div&gt;
           &lt;div class=&quot;elp-image-holder&quot;&gt;
                &lt;img src=&quot;/uploads/images/2019-05-21/why-git-and-git-lfs-is-not-enough.jpeg&quot; alt=&quot;Why Git and Git-LFS is not enough to solve the Machine Learning Reproducibility crisis&quot;&gt;
            &lt;/div&gt;
      &lt;/a&gt;
    &lt;/section&gt;
    &lt;/body&gt;&lt;/html&gt;&lt;/body&gt;&lt;/html&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;With Git-LFS your team has better control over the data, because it is now
version controlled. Does that mean the problem is solved? Earlier we said the
“&lt;em&gt;key issue is the training data&lt;/em&gt;”, but that was a lie. Sort of. Yes keeping
the data under version control is a big improvement. But is the lack of
version control of the data files the entire problem? No.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;hr&gt;&lt;/body&gt;&lt;/html&gt;
&lt;h2&gt;Discord gems&lt;/h2&gt;
&lt;p&gt;There are lots of hidden gems in our Discord community discussions. Sometimes
they are scattered all over the channels and hard to track down.&lt;/p&gt;
&lt;p&gt;We sift through the issues and discussions and share the most interesting
takeaways with you.&lt;/p&gt;
&lt;h3&gt;Q: This might be &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/485598848111083531/572960640122224640&quot;&gt;a favourite gem of ours &lt;/a&gt; — our engineers are so fast that someone assumed they were bots.&lt;/h3&gt;
&lt;p&gt;We feared that too until we met them in person. They appeared to be real (unless
bots also love Ramen now)!&lt;/p&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;span class=&quot;gatsby-resp-image-wrapper&quot; style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto;  max-width: 700px;&quot;&gt;
      &lt;a class=&quot;gatsby-resp-image-link&quot; href=&quot;/static/4926411413e184b4531924e6c0aeaf02/e0305/bots-also-love-ramen-now.png&quot; style=&quot;display: block&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;
    &lt;span class=&quot;gatsby-resp-image-background-image&quot; style=&quot;padding-bottom: 76.90387016229712%; position: relative; bottom: 0; left: 0; background-image: url(&amp;#x27;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAPCAIAAABr+ngCAAAACXBIWXMAAAsSAAALEgHS3X78AAABfklEQVQoz5WTXUvDMBSG+3f0Qpxd06T5PEmapp3adptzMAYOvVLEW3++pymIV7LBQzmkvJwnp6fZ9/vudNhw3VQCjLaUa8LUmWRfr+NxN8R2jPf9Nq6ktAU9O3y9qO4KGaC1dVwQWTB5QedYe7Dgg+HKMwHnJ6cwIbxkyoeoXU+5JRd1/v54fjlsqKxppcpKX9b58zQc98PjuH3o15TDZZ2vbtkdEc4HZSPOuaDyl3MG5gCsdSCUFdpV0s7888HL6YIpLKXSpl5v9+PQSwjaNohxDRMmJwL7L5PFshTIZFQKNF2mXcjwHRM+dE84cOOitgF8TPlo6864Fnyb6nbCt8q2+3UbashLmS2KilAN5rHkgBs6kbRZKlg6wSfDw1RTAUpjoXFA2TBuQuhowZIbR/JkmOAJ8Yf5CnJe4ez09r57fglm1cQH26whDKiH5qgq7ajdPdZorqCJMW56lxM5T2sK3+RVUUHdYaYDv0KSsMEn5Q53js3a0yEIqf/+Nj/OONkeIe6EBgAAAABJRU5ErkJggg==&amp;#x27;); background-size: cover; display: block;&quot;&gt;&lt;/span&gt;
  &lt;picture&gt;
        &lt;source srcset=&quot;/static/4926411413e184b4531924e6c0aeaf02/c54d4/bots-also-love-ramen-now.webp 175w, /static/4926411413e184b4531924e6c0aeaf02/a3432/bots-also-love-ramen-now.webp 350w, /static/4926411413e184b4531924e6c0aeaf02/426ac/bots-also-love-ramen-now.webp 700w, /static/4926411413e184b4531924e6c0aeaf02/c139f/bots-also-love-ramen-now.webp 1050w, /static/4926411413e184b4531924e6c0aeaf02/7f403/bots-also-love-ramen-now.webp 1400w, /static/4926411413e184b4531924e6c0aeaf02/e2173/bots-also-love-ramen-now.webp 1602w&quot; sizes=&quot;(max-width: 700px) 100vw, 700px&quot; type=&quot;image/webp&quot;&gt;
        &lt;source srcset=&quot;/static/4926411413e184b4531924e6c0aeaf02/17006/bots-also-love-ramen-now.png 175w, /static/4926411413e184b4531924e6c0aeaf02/d6f3f/bots-also-love-ramen-now.png 350w, /static/4926411413e184b4531924e6c0aeaf02/69344/bots-also-love-ramen-now.png 700w, /static/4926411413e184b4531924e6c0aeaf02/b1f9d/bots-also-love-ramen-now.png 1050w, /static/4926411413e184b4531924e6c0aeaf02/3fc71/bots-also-love-ramen-now.png 1400w, /static/4926411413e184b4531924e6c0aeaf02/e0305/bots-also-love-ramen-now.png 1602w&quot; sizes=&quot;(max-width: 700px) 100vw, 700px&quot; type=&quot;image/png&quot;&gt;
        &lt;img class=&quot;gatsby-resp-image-image&quot; src=&quot;/static/4926411413e184b4531924e6c0aeaf02/69344/bots-also-love-ramen-now.png&quot; alt=&quot;bots also love ramen now&quot; title=&quot;bots also love ramen now&quot; loading=&quot;lazy&quot; style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;&gt;
      &lt;/picture&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/body&gt;&lt;/html&gt;&lt;/p&gt;
&lt;h3&gt;Q: &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/485596304961962003/572974117351849997&quot;&gt;Is this the best way to track data with DVC when code and data are separate?&lt;/a&gt; Having been burned by this a couple of times, i.e. accidentally pushing large files to GitHub, I now keep my code and data separate.&lt;/h3&gt;
&lt;p&gt;Every time you run &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc add&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; to start tracking a data artifact, its path is
automatically added to the &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;.gitignore&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; file. As a result, it is hard to commit
it to Git by mistake: you would need to explicitly modify the &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;.gitignore&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;
first. If all you need is to track data artifacts that live outside the
repository, the feature for that is called
&lt;a href=&quot;https://dvc.org/doc/user-guide/external-outputs&quot;&gt;external outputs&lt;/a&gt;.
It is usually used when you have data on S3 or SSH and don’t want to pull it
into your workspace, but it also works when your data is located on the same
machine outside of the repository.&lt;/p&gt;
&lt;h3&gt;Q: &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/485596304961962003/571342592508428289&quot;&gt;How do I wrap a step that downloads a file/directory into a DVC stage?&lt;/a&gt; I want to ensure that it runs only if the file has not been downloaded yet.&lt;/h3&gt;
&lt;p&gt;Use &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc import&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; to track and download the remote data. The data is downloaded the first
time, and again on the next &lt;code class=&quot;language-text&quot;&gt;dvc repro&lt;/code&gt; if it has changed remotely. If you don’t want to track
remote changes (that is, you want to lock the data after it is downloaded), use &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc run&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; with a
dummy dependency (any text file that you do not touch will do) and a command that runs the actual
wget/curl to get the data.&lt;/p&gt;
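&lt;p&gt;As a rough sketch of the second approach, assuming the &lt;code class=&quot;language-text&quot;&gt;dvc run&lt;/code&gt; syntax available at the time of writing, the stage could look like the following. The file names &lt;code class=&quot;language-text&quot;&gt;download.dep&lt;/code&gt;, &lt;code class=&quot;language-text&quot;&gt;download.dvc&lt;/code&gt; and &lt;code class=&quot;language-text&quot;&gt;data.zip&lt;/code&gt;, as well as the URL, are hypothetical placeholders.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;dvc&quot;&gt;&lt;pre class=&quot;language-dvc&quot;&gt;&lt;code class=&quot;language-dvc&quot;&gt;# Create a dummy dependency that you never modify (hypothetical name)
$ echo placeholder &gt; download.dep

# Wrap the download in a stage; data.zip becomes a tracked output
$ dvc run -f download.dvc -d download.dep -o data.zip \
          wget https://example.com/data.zip -O data.zip&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Because the dummy dependency never changes, &lt;code class=&quot;language-text&quot;&gt;dvc repro download.dvc&lt;/code&gt; should treat the stage as up to date and skip the download, unless the output itself is missing or has been modified.&lt;/p&gt;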
&lt;h3&gt;Q: &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/485596304961962003/570943786151313408&quot;&gt;How do I show a pipeline that does not have a default Dvcfile?&lt;/a&gt; (e.g. I assigned all files names manually with &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;-f&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; in the &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc run&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; command and I just don’t have &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;Dvcfile&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; anymore)&lt;/h3&gt;
&lt;p&gt;Almost any command in DVC that deals with pipelines (set of DVC-files) accepts a
single stage as a target, for example:&lt;/p&gt;
&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;dvc&quot;&gt;&lt;pre class=&quot;language-dvc&quot;&gt;&lt;code class=&quot;language-dvc&quot;&gt;&lt;span class=&quot;token line&quot;&gt;&lt;span class=&quot;token input&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;token dvc&quot;&gt;dvc pipeline show&lt;/span&gt; --ascii model.dvc&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/body&gt;&lt;/html&gt;
&lt;h3&gt;Q: &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/485596304961962003/570843482218823682&quot;&gt;DVC hangs or I’m getting &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;database is locked&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; issue&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;It’s a well-known problem with NFS and CIFS (Azure): they do not properly
support file locks, which the SQLite engine requires to operate. The easiest
workaround is to not create a DVC project on a network-attached partition. In
certain cases the problem can be fixed by changing the mount options; check
&lt;a href=&quot;https://discordapp.com/channels/485586884165107732/485596304961962003/570276668694855690&quot;&gt;this discussion&lt;/a&gt;
for the Azure ML Service.&lt;/p&gt;
&lt;h3&gt;Q: &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/485596304961962003/570091809594671126&quot;&gt;How do I use DVC if I use a separate drive to store the data and a small/fast SSD to run computations?&lt;/a&gt; I don’t have enough space to bring data to my working space.&lt;/h3&gt;
&lt;p&gt;An excellent question! The short answer is:&lt;/p&gt;
&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;dvc&quot;&gt;&lt;pre class=&quot;language-dvc&quot;&gt;&lt;code class=&quot;language-dvc&quot;&gt;&lt;span class=&quot;token comment&quot;&gt;# To move your data cache to a big partition&lt;/span&gt;
&lt;span class=&quot;token line&quot;&gt;&lt;span class=&quot;token input&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;token dvc&quot;&gt;dvc cache dir&lt;/span&gt; --local /path/to/an/external/partition
&lt;/span&gt;
&lt;span class=&quot;token comment&quot;&gt;# To enable symlinks/hardlinks to avoid actual copying&lt;/span&gt;
&lt;span class=&quot;token line&quot;&gt;&lt;span class=&quot;token input&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;token dvc&quot;&gt;dvc config&lt;/span&gt; cache.type reflink,hardlink,symlink,copy
&lt;/span&gt;
&lt;span class=&quot;token comment&quot;&gt;# To protect the cache&lt;/span&gt;
&lt;span class=&quot;token line&quot;&gt;&lt;span class=&quot;token input&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;token dvc&quot;&gt;dvc config&lt;/span&gt; cache.protected &lt;span class=&quot;token boolean&quot;&gt;true&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/body&gt;&lt;/html&gt;
&lt;p&gt;The last one is highly recommended to make links in your working space read-only
to avoid corrupting the cache. Read more about different link types
&lt;a href=&quot;https://dvc.org/doc/user-guide/large-dataset-optimization&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;To add your data to the DVC cache for the first time, clone the repository on a
big partition and run &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc add&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; to add your data. Then you can run &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;git pull&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; and
&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc pull&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; on a small partition and DVC will create all the necessary links.&lt;/p&gt;
&lt;h3&gt;Q: &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/485596304961962003/571335064374345749&quot;&gt;Why am I getting a &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;Paths for outs overlap&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; error when I run &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc add&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; or &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc run&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;?&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Usually it means that a parent directory of one of the arguments for &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc add&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; /
&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc run&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; is already tracked. For example, you have already added the whole datasets
directory, and now you are trying to add a subdirectory, which is
already tracked as part of the datasets one. There is no need to do that. You could run
&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc add datasets&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; or &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc repro datasets.dvc&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; to save changes.&lt;/p&gt;
&lt;h3&gt;Q: &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/485596304961962003/567310354766495747&quot;&gt;I’m getting &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;ascii codec can’t encode character&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; error on DVC commands when I deal with unicode file names&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;&lt;a href=&quot;https://perlgeek.de/en/article/set-up-a-clean-utf8-environment&quot;&gt;Check the locale settings you have&lt;/a&gt;
(&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;locale&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; command in Linux). Python expects a locale that can handle unicode
printing. Usually it’s solved with these commands: &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;export LC_ALL=en_US.UTF-8&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;
and &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;export LANG=en_US.UTF-8&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;. You can place those exports into &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;.bashrc&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; or
another file that defines your environment.&lt;/p&gt;
&lt;h3&gt;Q: &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/485596304961962003/563149775340568576&quot;&gt;Does DVC use the same logins &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;aws-cli&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; has when using an S3 bucket as its repo/remote storage&lt;/a&gt;?&lt;/h3&gt;
&lt;p&gt;In short: yes, but it can also be configured. By default, DVC uses either your
default profile (from &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;~/.aws/*&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;) or your environment variables. If you need more
flexibility (e.g. you need to use different credentials for different projects),
check out
&lt;a href=&quot;https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-profiles.html&quot;&gt;this guide&lt;/a&gt;
to configure custom AWS profiles, which you can then use with DVC through
these
&lt;a href=&quot;https://dvc.org/doc/commands-reference/remote-add#options&quot;&gt;remote options&lt;/a&gt;.&lt;/p&gt;
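&lt;p&gt;A minimal sketch of that setup, assuming an S3 remote: the profile name &lt;code class=&quot;language-text&quot;&gt;myprofile&lt;/code&gt;, the remote name &lt;code class=&quot;language-text&quot;&gt;myremote&lt;/code&gt; and the bucket path are hypothetical placeholders.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;dvc&quot;&gt;&lt;pre class=&quot;language-dvc&quot;&gt;&lt;code class=&quot;language-dvc&quot;&gt;# Create a named AWS profile (prompts for the access keys)
$ aws configure --profile myprofile

# Add an S3 remote and tell DVC to use that profile for it
$ dvc remote add myremote s3://mybucket/path
$ dvc remote modify myremote profile myprofile&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;With this in place, each project can point its remote at a different profile without touching your default credentials.&lt;/p&gt;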
&lt;h3&gt;Q: &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/485596304961962003/566000729505136661&quot;&gt;How can I output multiple metrics from a single file?&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Let’s say I have the following in a file:&lt;/p&gt;
&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;json&quot;&gt;&lt;pre class=&quot;language-json&quot;&gt;&lt;code class=&quot;language-json&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
  &quot;AUC_RATIO&quot;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
      &quot;train&quot;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;0.8922748258797667&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
      &quot;valid&quot;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;0.8561602726251776&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
      &quot;xval&quot;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;0.8843431199314923&lt;/span&gt;
    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/body&gt;&lt;/html&gt;
&lt;p&gt;How can I show both &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;train&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; and &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;valid&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; without &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;xval&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;?&lt;/p&gt;
&lt;p&gt;You can use the &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;--xpath&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; option of the &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc metrics show&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; command and provide multiple
attribute names to it:&lt;/p&gt;
&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;dvc&quot;&gt;&lt;pre class=&quot;language-dvc&quot;&gt;&lt;code class=&quot;language-dvc&quot;&gt;&lt;span class=&quot;token line&quot;&gt;&lt;span class=&quot;token input&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;token dvc&quot;&gt;dvc metrics show&lt;/span&gt; metrics.json &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;
                  --type json &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;
                  --xpath AUC_RATIO&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;train,valid&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;
&lt;/span&gt;    metrics.json:
                 0.89227482588
                 0.856160272625&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/body&gt;&lt;/html&gt;
&lt;h3&gt;Q: &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/485596304961962003/566314479499870211&quot;&gt;What is the quickest way to add a new dependency to a DVC-file?&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;There are a few options to add a new dependency:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Open the DVC-file in your favorite editor and add the dependency there
without an md5. DVC will understand that the stage has changed, and will re-run
it and recalculate the md5 checksums during the next &lt;code class=&quot;language-text&quot;&gt;dvc repro&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Use &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc run --no-exec&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;. It will rewrite the existing file
for you with the new parameters.&lt;/li&gt;
&lt;/ul&gt;
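&lt;p&gt;The second option might look roughly like this. The stage file, script, dependency and output names are hypothetical, and depending on your DVC version you may be asked to confirm overwriting the existing DVC-file.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;dvc&quot;&gt;&lt;pre class=&quot;language-dvc&quot;&gt;&lt;code class=&quot;language-dvc&quot;&gt;# Re-declare the stage with the extra dependency;
# --no-exec writes the DVC-file without running the command
$ dvc run --no-exec -f train.dvc \
          -d train.py -d data/features -d new_dependency.txt \
          -o model.pkl \
          python train.py

# Run the stage and record the checksums
$ dvc repro train.dvc&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;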
&lt;h3&gt;Q: &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/485596304961962003/566315265646788628&quot;&gt;Is there a way to add a dependency to a python package, so it runs a stage again if it imported the updated library?&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The only recommended way so far is to make DVC aware of your
package’s version. One way to do that is to create a separate stage that
dynamically prints the version of that specific package into a file that
your stage depends on:&lt;/p&gt;
&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;dvc&quot;&gt;&lt;pre class=&quot;language-dvc&quot;&gt;&lt;code class=&quot;language-dvc&quot;&gt;&lt;span class=&quot;token line&quot;&gt;&lt;span class=&quot;token input&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;token dvc&quot;&gt;dvc run&lt;/span&gt; -o mypkgver &apos;pip show mypkg &lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt; mypkgver&apos;
&lt;/span&gt;&lt;span class=&quot;token line&quot;&gt;&lt;span class=&quot;token input&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;token dvc&quot;&gt;dvc run&lt;/span&gt; -d mypkgver -d &lt;span class=&quot;token punctuation&quot;&gt;..&lt;/span&gt;. -o &lt;span class=&quot;token punctuation&quot;&gt;..&lt;/span&gt; mycmd&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/body&gt;&lt;/html&gt;
&lt;h3&gt;Q: &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/485596304961962003/564807276146458624&quot;&gt;Is there any way to forcibly recompute the hashes of dependencies in a pipeline DVC-file?&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;E.g. I made some whitespace/comment changes in my code and I want to tell DVC
“it’s ok, you don’t have to recompute everything”.&lt;/p&gt;
&lt;p&gt;Yes, you could run &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc commit -f&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;. It will save all current checksums without
re-running your commands.&lt;/p&gt;
&lt;h3&gt;Q: &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/485596304961962003/563352000281182218&quot;&gt;I have projects that use data that’s stored in S3. I never have data locally to use &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc push&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;, but I would like to have this data version controlled.&lt;/a&gt; Is there a way to use the features of DVC in this use case?&lt;/h3&gt;
&lt;p&gt;Yes! These DVC features are called
&lt;a href=&quot;https://dvc.org/doc/user-guide/external-outputs&quot;&gt;external outputs&lt;/a&gt; and
&lt;a href=&quot;https://dvc.org/doc/user-guide/external-dependencies&quot;&gt;external dependencies&lt;/a&gt;.
You can use one of them or both to track, process, and version your data on a
cloud storage without downloading it locally.&lt;/p&gt;
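&lt;p&gt;As an illustration only, here is a sketch of tracking and processing data that stays in S3. It assumes the external cache setup described in the external outputs guide; the bucket, remote and file names are hypothetical placeholders.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;dvc&quot;&gt;&lt;pre class=&quot;language-dvc&quot;&gt;&lt;code class=&quot;language-dvc&quot;&gt;# Set up an external cache in the same bucket
$ dvc remote add s3cache s3://mybucket/cache
$ dvc config cache.s3 s3cache

# Track an existing file without downloading it
$ dvc add s3://mybucket/existing-data

# A stage whose dependency and output both live in S3
$ dvc run -d s3://mybucket/raw.csv -o s3://mybucket/clean.csv \
          aws s3 cp s3://mybucket/raw.csv s3://mybucket/clean.csv&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;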
&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;hr&gt;&lt;/body&gt;&lt;/html&gt;
&lt;p&gt;If you have any questions, concerns or ideas, let us know
&lt;a href=&quot;https://dvc.org/support&quot;&gt;here&lt;/a&gt; and our stellar team will get back to you in no
time!&lt;/p&gt;</content:encoded></item><item><title><![CDATA[DVC project ideas for Google Season of Docs 2019]]></title><link>https://blog.dvc.org/dvc-project-ideas-for-google-summer-of-docs-2019</link><guid isPermaLink="false">https://blog.dvc.org/dvc-project-ideas-for-google-summer-of-docs-2019</guid><pubDate>Tue, 23 Apr 2019 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;We strongly believe that well-shaped documentation is key for making the product
truly open. We have been investing lots of time and energy in improving our docs
lately. As a team made up of 90% engineers, we are eager to welcome writers into
our team and our community. We are happy to share our experience, introduce them
to the world of open source and machine learning best practices, guide them through
the open source contribution process and work together on improving our documentation.&lt;/p&gt;
&lt;p&gt;DVC was started in late 2017 by a data scientist and an engineer. It is now
growing pretty fast and though our in-house team is quite small, we have to
thank our contributors (more than 80 in both code and docs) for developing DVC
with us. When working with DVC, the technical writer will not only get lots of
hands-on experience in writing technical docs, but will also be immersed in the DVC
community, a warm and welcoming gathering of ML and DS enthusiasts and an
invaluable source of inspiration and expertise in ML engineering.&lt;/p&gt;
&lt;h3&gt;About DVC&lt;/h3&gt;
&lt;p&gt;DVC is the brainchild of a data scientist and an engineer. It was created to
fill the gaps in ML process tooling and evolved into a successful open
source project.&lt;/p&gt;
&lt;p&gt;ML brings changes in development and research processes. These ML processes
require new tools for data versioning, ML pipeline versioning, resource
management for model training and other needs that haven’t been formalized.
Traditional software development tools do not fully cover ML teams’ needs, but
there are no good alternatives. This forces engineers to develop custom
toolsets to manage data files, keep track of ML experiments and connect data and
source code together. The ML process becomes very fragile and requires tons of
tribal knowledge.&lt;/p&gt;
&lt;p&gt;We have been working on &lt;a href=&quot;http://DVC.org&quot;&gt;DVC&lt;/a&gt; by adopting best ML practices and
turning them into a Git-like command line tool. DVC versions multi-gigabyte
datasets and ML models and makes them shareable and reproducible. The tool helps to
organize a more rigorous process around datasets and the data derivatives. Your
favorite cloud storage (S3, GCS, or a bare metal SSH server) can be used with
DVC as a data file backend.&lt;/p&gt;
&lt;p&gt;If you are interested in learning a little bit more about DVC and its journey,
here is a great interview with the DVC creator in Episode 206 of
Podcast.__init__. Listen to it
&lt;a href=&quot;https://www.pythonpodcast.com/data-version-control-episode-206/&quot;&gt;HERE&lt;/a&gt; or read
the transcript
&lt;a href=&quot;https://towardsdatascience.com/data-version-control-with-dvc-what-do-the-authors-have-to-say-3c3b10f27ee&quot;&gt;HERE&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;The state of DVC documentation&lt;/h3&gt;
&lt;p&gt;DVC is a pretty young project, developed and maintained solely by engineers.
Like many open source projects, we started from the bottom, and for a long time our
&lt;a href=&quot;https://dvc.org/doc&quot;&gt;documentation&lt;/a&gt; was a bunch of bits and pieces. Nowadays
improving documentation is one of our top priorities. We moved to a new
documentation engine built in-house and started working with several technical
writers. Certain parts have been tremendously improved recently, e.g.
&lt;a href=&quot;https://dvc.org/doc/get-started&quot;&gt;Get Started&lt;/a&gt; and
&lt;a href=&quot;https://dvc.org/doc/commands-reference/fetch&quot;&gt;certain parts of Commands Reference&lt;/a&gt;.
So far most of our documentation has been written by the engineering
team, and there is a need to improve the overall structure and make some parts
friendlier from a new user’s perspective. We have mostly complete
&lt;a href=&quot;https://dvc.org/doc/commands-reference&quot;&gt;reference documentation&lt;/a&gt; for each
command, although some functions are missing good actionable examples. We also
have a &lt;a href=&quot;https://dvc.org/doc/user-guide/dvc-files-and-directories&quot;&gt;User Guide&lt;/a&gt;;
however, it is not in very good shape. We strive to make our documentation
clear and comprehensive for users of various backgrounds and proficiency levels,
and this is where we need a fresh perspective.&lt;/p&gt;
&lt;h3&gt;How DVC documentation is built&lt;/h3&gt;
&lt;p&gt;We have an open, Apache 2.0 licensed GitHub repository for the
&lt;a href=&quot;https://github.com/iterative/dvc.org&quot;&gt;DVC website&lt;/a&gt;, the documentation engine
and the &lt;a href=&quot;https://github.com/iterative/dvc.org&quot;&gt;documentation files&lt;/a&gt;. The website,
including the documentation engine (built in-house), is built with
Node.js + React.&lt;/p&gt;
&lt;p&gt;Each documentation page is a static Markdown file in the repository, e.g.
&lt;a href=&quot;https://github.com/iterative/dvc.org/blob/master/static/docs/get-started/example-versioning.md.&quot;&gt;example here&lt;/a&gt;.
It is rendered dynamically in the browser, so no preprocessing is required. This
means that tech writers or contributors need to write or edit a Markdown file,
create a pull request and merge it into the master branch of the
&lt;a href=&quot;https://github.com/iterative/dvc.org&quot;&gt;repository&lt;/a&gt;. The complete
&lt;a href=&quot;https://github.com/iterative/dvc.org/blob/master/README.md#contributing&quot;&gt;documentation contributing guide&lt;/a&gt;
describes the directory structure and locations for the different documentation
parts.&lt;/p&gt;
&lt;h3&gt;DVC’s approach to documentation work&lt;/h3&gt;
&lt;p&gt;Documentation tasks and issues are maintained on our docs’ GitHub
&lt;a href=&quot;https://github.com/iterative/dvc.org/issues&quot;&gt;issue tracker&lt;/a&gt;. Changes to the
documentation are made via pull requests on GitHub and go through our standard
review process, which is the same for documentation and code. A technical writer
would be trained in working with our current development process. It generally
means that tech writers or contributors need to write or edit a Markdown file, use
Git and GitHub to create a pull request and publish it. The documentation
&lt;a href=&quot;https://github.com/iterative/dvc.org/blob/master/README.md#contributing&quot;&gt;contributing guide&lt;/a&gt;
includes style conventions and other details. Documentation is considered as
important as code. The engineering team has a policy to write or update the
relevant sections when something new is released. If it’s something too involved,
engineers may create a ticket and ask for help. There is one maintainer who is
responsible for doing final reviews and merging the changes. In this sense, our
documentation is very similar to any other open source project.&lt;/p&gt;
&lt;h2&gt;Project ideas for GSoD’19&lt;/h2&gt;
&lt;p&gt;We identified a number of ideas to work on, and they fall into two major
topics. Both topics are pretty broad and we don’t expect to cover them
completely during this GSoD, but hopefully we can make some
progress.&lt;/p&gt;
&lt;p&gt;First of all, we want to bring more structure and logic to our documentation to
improve the user onboarding experience. The goal is for a new user to have a clear
path they can follow and understand what takeaways each part of the
documentation provides. In particular, improving how
&lt;a href=&quot;https://dvc.org/doc/get-started&quot;&gt;Get Started&lt;/a&gt;,
&lt;a href=&quot;https://dvc.org/doc/tutorial&quot;&gt;Tutorials&lt;/a&gt; and
&lt;a href=&quot;https://dvc.org/doc/get-started/example-versioning&quot;&gt;Examples&lt;/a&gt; relate to each
other, restructuring the existing &lt;a href=&quot;https://dvc.org/doc/user-guide&quot;&gt;User Guide&lt;/a&gt;
to explain basic concepts, and writing more use cases that resonate with ML
engineers and data scientists.&lt;/p&gt;
&lt;p&gt;The other issue we would like to tackle is improving and expanding the existing
reference docs: command descriptions, examples, etc. It involves filling in
the gaps and developing new sections, similar to
&lt;a href=&quot;https://dvc.org/doc/commands-reference/fetch&quot;&gt;this one&lt;/a&gt;. We would also love to
see more illustrative materials.&lt;/p&gt;
&lt;h3&gt;Project 1: Improving and expanding User Guide&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Description and details:&lt;/strong&gt; Reviewing, restructuring and filling major gaps in
the User Guide (introductory parts of the basic concepts of DVC), e.g. have a
look at &lt;a href=&quot;https://github.com/iterative/dvc.org/issues/144&quot;&gt;this ticket&lt;/a&gt; or
&lt;a href=&quot;https://github.com/iterative/dvc.org/issues/53&quot;&gt;this one&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Mentors&lt;/strong&gt;: &lt;a href=&quot;https://github.com/shcheklein&quot;&gt;@shcheklein&lt;/a&gt; and
&lt;a href=&quot;https://github.com/dmpetrov&quot;&gt;@dmpetrov&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;Project 2: Expanding and developing new tutorials and use cases&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Description and details:&lt;/strong&gt; We already have some requests for more tutorials,
e.g. &lt;a href=&quot;https://github.com/iterative/dvc.org/issues/96&quot;&gt;this ticket&lt;/a&gt;. Here is
another good &lt;a href=&quot;https://github.com/iterative/dvc.org/issues/194&quot;&gt;use case request&lt;/a&gt;.
If you are going to work on this project, you will need some domain knowledge,
preferably some basic ML or data science experience.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Mentors&lt;/strong&gt;: &lt;a href=&quot;https://github.com/shcheklein&quot;&gt;@shcheklein&lt;/a&gt; and
&lt;a href=&quot;https://github.com/dmpetrov&quot;&gt;@dmpetrov&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;Project 3: Improving new user onboarding&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Description and details:&lt;/strong&gt; Analyze and restructure user walkthrough across
&lt;a href=&quot;https://dvc.org/doc/get-started&quot;&gt;Get started&lt;/a&gt;,
&lt;a href=&quot;https://dvc.org/doc/tutorial&quot;&gt;Tutorials&lt;/a&gt; and
&lt;a href=&quot;https://dvc.org/doc/get-started/example-versioning&quot;&gt;Examples&lt;/a&gt;. These three have
one thing in common — hands-on experience with DVC. If you choose this project,
we will work together to come up with a better location for the Examples (to
move them out of the Get Started shadow), and a better location for the
Tutorials (to reference external tutorials that were developed by our community
members and published on different platforms).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Mentors&lt;/strong&gt;: &lt;a href=&quot;https://github.com/shcheklein&quot;&gt;@shcheklein&lt;/a&gt; and
&lt;a href=&quot;https://github.com/dmpetrov&quot;&gt;@dmpetrov&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;Project 4: Improving commands reference&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Description and details:&lt;/strong&gt; We will work on improving our
&lt;a href=&quot;https://dvc.org/doc/commands-reference&quot;&gt;Commands reference&lt;/a&gt; section. This
includes expanding and filling in the gaps. One of the biggest pain points right
now is the examples. Users want them to be
&lt;a href=&quot;https://github.com/iterative/dvc.org/issues/198&quot;&gt;easy to run and try&lt;/a&gt;, and there
is a lot to be done in terms of improvement. We have a good example of how it
should be done &lt;a href=&quot;https://dvc.org/doc/commands-reference/fetch&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Mentors&lt;/strong&gt;: &lt;a href=&quot;https://github.com/shcheklein&quot;&gt;@shcheklein&lt;/a&gt; and
&lt;a href=&quot;https://github.com/dmpetrov&quot;&gt;@dmpetrov&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;Project 5: Describe and integrate “DVC packages”&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Description and details:&lt;/strong&gt; Describe the brand new feature “DVC packages” and
integrate it with the rest of the documentation. We have been working hard to
release a few new commands to help with datasets management (have a look at
&lt;a href=&quot;https://github.com/iterative/dvc/issues/1487&quot;&gt;this ticket&lt;/a&gt;). It’s a major
feature that deserves its place in the Get Started, Use cases, Commands
Reference, etc.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Mentors&lt;/strong&gt;: &lt;a href=&quot;https://github.com/shcheklein&quot;&gt;@shcheklein&lt;/a&gt; and
&lt;a href=&quot;https://github.com/dmpetrov&quot;&gt;@dmpetrov&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The ideas outlined above are just examples of what we can work on. We are
open to any other suggestions and would like to work together with the
technical writer to make the contribution experience both useful and enjoyable
for all parties involved. If you have any suggestions or questions, we would love
to hear from you: DVC.org/support and our DMs on
&lt;a href=&quot;https://twitter.com/DVCorg&quot;&gt;Twitter&lt;/a&gt; are always open!&lt;/p&gt;
&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;hr&gt;&lt;/body&gt;&lt;/html&gt;
&lt;p&gt;Special thanks to &lt;a href=&quot;https://numfocus.org/&quot;&gt;NumFOCUS&lt;/a&gt; for the ideas list
inspiration.&lt;/p&gt;
&lt;p&gt;If you are a tech writer, check the
&lt;a href=&quot;https://developers.google.com/season-of-docs/docs/tech-writer-guide&quot;&gt;Technical writer guide&lt;/a&gt;.
Starting April 30, 2019, you can see the list of participating open source
organizations on the &lt;a href=&quot;https://g.co/seasonofdocs&quot;&gt;Season of Docs website&lt;/a&gt;. The
application period for technical writers opens on &lt;strong&gt;May 29, 2019&lt;/strong&gt; and ends on
June 28, 2019.&lt;/p&gt;</content:encoded></item><item><title><![CDATA[April ’19 DVC❤️Heartbeat]]></title><link>https://blog.dvc.org/april-19-dvc-heartbeat</link><guid isPermaLink="false">https://blog.dvc.org/april-19-dvc-heartbeat</guid><pubDate>Thu, 18 Apr 2019 00:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;News and links&lt;/h2&gt;
&lt;p&gt;We have some exciting news to share this month!&lt;/p&gt;
&lt;p&gt;DVC is going to &lt;a href=&quot;https://us.pycon.org/2019/&quot;&gt;PyCon 2019&lt;/a&gt;! It is the first
conference that we are attending as a team, and when we say ‘team’, we mean it. Our
engineers are flying from all over the globe to get together offline and catch
up with fellow Pythonistas.&lt;/p&gt;
&lt;p&gt;The &lt;a href=&quot;https://us.pycon.org/2019/schedule/talks/list/&quot;&gt;speaker pipeline&lt;/a&gt; is
amazing! DVC creator Dmitry Petrov is giving a talk on
&lt;a href=&quot;https://us.pycon.org/2019/schedule/presentation/176/&quot;&gt;Machine learning model and dataset versioning practices&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Stop by our booth at the Startup Row on Saturday, May 4, reach out and let us
know that you are willing to chat, or simply find a person with a huge DVC owl
on their shirt!&lt;/p&gt;
&lt;p&gt;Speaking of the owls — DVC has done some rebranding recently and we love our new
logo. Special thanks to &lt;a href=&quot;https://99designs.com/&quot;&gt;99designs.com&lt;/a&gt; for building a
great platform for finding trusted designers.&lt;/p&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;span class=&quot;gatsby-resp-image-wrapper&quot; style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto;  max-width: 700px;&quot;&gt;
      &lt;a class=&quot;gatsby-resp-image-link&quot; href=&quot;/static/91d26fd1613290e118c7a4ad1fc5a088/d947d/trusted-designers.png&quot; style=&quot;display: block&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;
    &lt;span class=&quot;gatsby-resp-image-background-image&quot; style=&quot;padding-bottom: 97.33333333333333%; position: relative; bottom: 0; left: 0; background-image: url(&amp;#x27;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAATCAIAAAAf7rriAAAACXBIWXMAAAsSAAALEgHS3X78AAAEgUlEQVQ4yw2RWVeSCQBAv58zr3OqmcnMRoFPRHaE2ASJZBMxSBZBUEACVGIHRSEE6nNcghQXXFFBURRXTOtYcNJsppnjnHmYeR1f7tO9TxeYe+PIrCby+YWtvZWx+Ot44vWg2+JRS+dCltP85PnBdDxgcclFC+PB96db25tzZ8Vs+dN+6aJQKh0BY05dcnwgl19Y21my+Fxd+k6VTEK/X9FFAodVHKeQLHj4M+ve3VYiup3PbRc2R/UtUWXjeL8iZmoDLo6nPl3sXF0Wr6+KJ6fZQj4VG3YyQKRZKrr+8+TgYJWLqqc9rJI+4ehVsgloaH93JZeZLe4v769AwMHW6PmHfLl0Ui4fHR5tZFMQFHaaNO0Rm7aQ9B0thiI2nYTVGPK8yM5HT7Nv9zeTxWzicCn28WQNKG5Dh8cb5dLx9fVZOj0zA/mHA45enXwlflsOZRO+FORUCnleq3ZzNnK8GC3MhVYhezLQkxlzAfPJ4Mbu6qfy8eXvH3KpxOK7N06b0W3W+M0dchZNQiEpb8mgG+WtW+tTNzel4s7M/KincKsG7UAdSMETOYo+D7Qw4+2zety2lxZdj7yV8WtNEwwuxaE5MFgTFmdsl4QDjnx+7eR0e2N5Ij37WqdSAEgkGYkgotF0DEjEInDMBlobr7lDwGUjwDY8VkMj8VF1FDiSCKI5VPpjTMNjDMXn7I3H3HqlFMBxxJQWBew+glZHCFqNRkMXogJORdQJUUgxqlaGreeAIOzeIxISNx0fOf+wA8Ve4UDCejqZy60AFT/cqf6x8lEFSH0Ax1eC4mYek0CG36lUkzA8EC5DI3nIWsQDsFPWWthd+HL1/t//Lt32l3wWZzQ2CPCZ9Jr7MDRIYtXia+AkSWtbM5UOu1vFQyDYMLgIhcRXPCJjSL1diuVkcGl2ZGsl5rMb6ATyC6UY+PJlj0VmSKn0NBSYi4cHLd2P6wmIX6qljTQxGvWcgKEhkJia+rC7x6SW+Po178ZcfDZbwGTEISfwcSk+4uirvlc9Pxr+drEb9ToQlbWNGLzd0GE3KB269n5dewMK30ShNjOZapn4mYCPgWMDbvOwywic7a5efzs3qzoISDyHyhY08WRCoYz7xKySRv0vBiwaaMD8W6iXSWioq6oj1GJhP1WLONzI0EuliAv88Xfp7GRrc3r8/Hgz8cp/9fVsajKiaRVY1M/9Vm3E3RP19KxN+QP9WhaRImAwRewmPp3h7TeaVM+Azdlo+m1oIzFSzEwf5eZu/vm8u7Po6unolrYM9neFHIaBXt3ypGfQqr69TYEhsA9rcFUwo1ou5bABq1ru0ygDyhZ/pzhiVky49YlXNq9ZY3wu8Vq0Xmunz6wZH7JOBq0iBv0pmdyIxRNrQGt3p0rIBaYK+wf54ll6Yi0zmUmPpVORyTe+iK/XYVAb5RK/VTdk6w459Jkpb6izeVD1dEjb4pbzgiZF2KYDdvZWPl9clg+Xvn/N33w/+ut67yA3n3obc5u0JmWbrUvh0CsgrykF2dd/69set65H9LMu+ahBGFY//R8dTmrJc0Ny0wAAAABJRU5ErkJggg==&amp;#x27;); background-size: cover; display: block;&quot;&gt;&lt;/span&gt;
  &lt;picture&gt;
        &lt;source srcset=&quot;/static/91d26fd1613290e118c7a4ad1fc5a088/c54d4/trusted-designers.webp 175w, /static/91d26fd1613290e118c7a4ad1fc5a088/a3432/trusted-designers.webp 350w, /static/91d26fd1613290e118c7a4ad1fc5a088/426ac/trusted-designers.webp 700w, /static/91d26fd1613290e118c7a4ad1fc5a088/c139f/trusted-designers.webp 1050w, /static/91d26fd1613290e118c7a4ad1fc5a088/7f403/trusted-designers.webp 1400w, /static/91d26fd1613290e118c7a4ad1fc5a088/44758/trusted-designers.webp 1650w&quot; sizes=&quot;(max-width: 700px) 100vw, 700px&quot; type=&quot;image/webp&quot;&gt;
        &lt;source srcset=&quot;/static/91d26fd1613290e118c7a4ad1fc5a088/17006/trusted-designers.png 175w, /static/91d26fd1613290e118c7a4ad1fc5a088/d6f3f/trusted-designers.png 350w, /static/91d26fd1613290e118c7a4ad1fc5a088/69344/trusted-designers.png 700w, /static/91d26fd1613290e118c7a4ad1fc5a088/b1f9d/trusted-designers.png 1050w, /static/91d26fd1613290e118c7a4ad1fc5a088/3fc71/trusted-designers.png 1400w, /static/91d26fd1613290e118c7a4ad1fc5a088/d947d/trusted-designers.png 1650w&quot; sizes=&quot;(max-width: 700px) 100vw, 700px&quot; type=&quot;image/png&quot;&gt;
        &lt;img class=&quot;gatsby-resp-image-image&quot; src=&quot;/static/91d26fd1613290e118c7a4ad1fc5a088/69344/trusted-designers.png&quot; alt=&quot;trusted designers&quot; title=&quot;trusted designers&quot; loading=&quot;lazy&quot; style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;&gt;
      &lt;/picture&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/body&gt;&lt;/html&gt;&lt;/p&gt;
&lt;p&gt;DVC is moving fast (almost as fast as my two-year-old). We do our best to keep
up and totally love all the buzz in our community channels lately!&lt;/p&gt;
&lt;p&gt;Here are a number of interesting reads that caught our eye:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://blog.codecentric.de/en/2019/03/walkthrough-dvc/&quot;&gt;A walkthrough of DVC&lt;/a&gt;
by &lt;a href=&quot;https://www.linkedin.com/in/bert-besser-284564182/&quot;&gt;Bert Besser&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;section class=&quot;elp-content-holder&quot;&gt;
      &lt;a href=&quot;https://blog.codecentric.de/en/2019/03/walkthrough-dvc/&quot; class=&quot;external-link-preview&quot;&gt;
          &lt;div class=&quot;elp-description-holder&quot;&gt;
            &lt;h4 class=&quot;elp-title&quot;&gt;A walkthrough of DVC — codecentric AG Blog&lt;/h4&gt;
            &lt;div class=&quot;elp-description&quot;&gt;This post is on how to systematially organize Machine Learning (ML) model development. A model’s performance improves…&lt;/div&gt;
            &lt;div class=&quot;elp-link&quot;&gt;blog.codecentric.de&lt;/div&gt;
          &lt;/div&gt;
           &lt;div class=&quot;elp-image-holder&quot;&gt;
                &lt;img src=&quot;/uploads/images/2019-04-18/walkthrough-of-dvc.png&quot; alt=&quot;A walkthrough of DVC — codecentric AG Blog&quot;&gt;
            &lt;/div&gt;
      &lt;/a&gt;
    &lt;/section&gt;
    &lt;/body&gt;&lt;/html&gt;&lt;/body&gt;&lt;/html&gt;&lt;/p&gt;
&lt;p&gt;A great article about using DVC with Docker in a fairly advanced scenario. If
you haven’t had a chance to try &lt;a href=&quot;http://dvc.org/&quot;&gt;DVC.org&lt;/a&gt; yet, this is a great
comprehensive read on why you should do so right away.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://github.com/EthicalML/state-of-mlops-2019&quot;&gt;The state of machine learning operations&lt;/a&gt;
by &lt;a href=&quot;https://www.linkedin.com/in/axsaucedo/&quot;&gt;Alejandro Saucedo&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;section class=&quot;elp-content-holder&quot;&gt;
      &lt;a href=&quot;https://github.com/EthicalML/state-of-mlops-2019&quot; class=&quot;external-link-preview&quot;&gt;
          &lt;div class=&quot;elp-description-holder&quot;&gt;
            &lt;h4 class=&quot;elp-title&quot;&gt;The state of machine learning operations&lt;/h4&gt;
            &lt;div class=&quot;elp-description&quot;&gt;Contribute to EthicalML/state-of-mlops-2019 development by creating an account on GitHub.&lt;/div&gt;
            &lt;div class=&quot;elp-link&quot;&gt;github.com&lt;/div&gt;
          &lt;/div&gt;
           &lt;div class=&quot;elp-image-holder&quot;&gt;
                &lt;img src=&quot;/uploads/images/2019-04-18/the-state-of-machine-learning-operations.jpeg&quot; alt=&quot;The state of machine learning operations&quot;&gt;
            &lt;/div&gt;
      &lt;/a&gt;
    &lt;/section&gt;
    &lt;/body&gt;&lt;/html&gt;&lt;/body&gt;&lt;/html&gt;&lt;/p&gt;
&lt;p&gt;A short (only 8 minutes!) and inspiring talk by Alejandro Saucedo at FOSDEM.
Alejandro covers the key trends in machine learning operations, as well as the most
recent open source tools and frameworks. Focused on reproducibility, monitoring
and explainability, this lightning talk is a great snapshot of the current state
of ML operations.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hackernoon.com/interview-with-kaggle-grandmaster-senior-cv-engineer-at-lyft-dr-vladimir-i-iglovikov-9938e1fc7c&quot;&gt;Interview with Kaggle Grandmaster, Senior Computer Vision Engineer at Lyft: Dr. Vladimir I. Iglovikov&lt;/a&gt;
by &lt;a href=&quot;https://twitter.com/bhutanisanyam1&quot;&gt;Sanyam Bhutani&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;section class=&quot;elp-content-holder&quot;&gt;
      &lt;a href=&quot;https://hackernoon.com/interview-with-kaggle-grandmaster-senior-cv-engineer-at-lyft-dr-vladimir-i-iglovikov-9938e1fc7c&quot; class=&quot;external-link-preview&quot;&gt;
          &lt;div class=&quot;elp-description-holder&quot;&gt;
            &lt;h4 class=&quot;elp-title&quot;&gt;Interview with Kaggle Grandmaster, Senior Computer Vision Engineer at Lyft: Dr. Vladimir I. Iglovikov&lt;/h4&gt;
            &lt;div class=&quot;elp-description&quot;&gt;Part 24 of The series where I interview my heroes.&lt;/div&gt;
            &lt;div class=&quot;elp-link&quot;&gt;hackernoon.com&lt;/div&gt;
          &lt;/div&gt;
           &lt;div class=&quot;elp-image-holder&quot;&gt;
                &lt;img src=&quot;/uploads/images/2019-04-18/interview-with-kaggle-grandmaster.jpeg&quot; alt=&quot;Interview with Kaggle Grandmaster, Senior Computer Vision Engineer at Lyft: Dr. Vladimir I. Iglovikov&quot;&gt;
            &lt;/div&gt;
      &lt;/a&gt;
    &lt;/section&gt;
    &lt;/body&gt;&lt;/html&gt;&lt;/body&gt;&lt;/html&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;There is no way you will become Kaggle Master and not learn how to approach
anew, the unknown problem in a fast hacking way with a very high number of
iterations per unit of time. This skill in the world of competitive learning
is the question of survival&lt;/p&gt;
&lt;/blockquote&gt;
&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;hr&gt;&lt;/body&gt;&lt;/html&gt;
&lt;h2&gt;Discord gems&lt;/h2&gt;
&lt;p&gt;There are lots of hidden gems in our Discord community discussions. Sometimes
they are scattered all over the channels and hard to track down.&lt;/p&gt;
&lt;p&gt;We sift through the issues and discussions to share the most
interesting takeaways with you.&lt;/p&gt;
&lt;h3&gt;Q: &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/485596304961962003/552098155861114891&quot;&gt;What are the system requirements to install DVC (type of operating system, dependencies of another application (as GIT), memory, cpu, etc).&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;DVC supports Windows, macOS, and Linux, and runs on Python 2 and 3.&lt;/li&gt;
&lt;li&gt;There are no specific CPU or RAM requirements: it’s a lightweight command line tool and
should be able to run pretty much everywhere you can run Python.&lt;/li&gt;
&lt;li&gt;It depends on a few Python libraries that it installs as dependencies (they
are specified in the
&lt;a href=&quot;https://github.com/iterative/dvc/blob/master/requirements.txt&quot;&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;requirements.txt&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;It does not depend on Git and theoretically could be run without any SCM.
Running it on top of a Git repository, however, is recommended and lets you
actually save the history of datasets, models, etc. (even though it does
not put them into Git directly).&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Q: &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/485596304961962003/560212552638791706&quot;&gt;Do I have to buy a server license to run DVC, do you have this?&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;No server licenses for DVC. It is 100% free and open source.&lt;/p&gt;
&lt;h3&gt;Q: &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/485596304961962003/560154903331340289&quot;&gt;What is the storage limit when using DVC?&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;I am trying to version control datasets and models with &gt;10 GB (Potentially even
bigger). Can DVC handle this?&lt;/p&gt;
&lt;p&gt;There is no limit enforced by DVC itself. It depends on the size of your
local or &lt;a href=&quot;https://dvc.org/doc/commands-reference/remote&quot;&gt;remote storage&lt;/a&gt;. You
need enough space available on S3, your SSH server, or whatever other storage you
use to keep the data files, models, and their versions that you would
like to store.&lt;/p&gt;
&lt;h3&gt;Q: &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/485596304961962003/553731815228178433&quot;&gt;How does DVC know the sequence of stages to run&lt;/a&gt;?&lt;/h3&gt;
&lt;p&gt;How does it connect them? Does it see that there is a dependency which is
outputted from the first run?&lt;/p&gt;
&lt;p&gt;DVC figures out the pipeline by looking at the dependencies and outputs of the
stages. For example, having the following:&lt;/p&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;div id=&quot;gist95747345&quot; class=&quot;gist&quot;&gt;
    &lt;div class=&quot;gist-file&quot;&gt;
      &lt;div class=&quot;gist-data&quot;&gt;
        &lt;div class=&quot;js-gist-file-update-container js-task-list-container file-box&quot;&gt;
  &lt;div id=&quot;file-heartbeat-dvc-run-2019-04-sh&quot; class=&quot;file&quot;&gt;
    

  &lt;div itemprop=&quot;text&quot; class=&quot;Box-body p-0 blob-wrapper data type-shell&quot;&gt;
      
&lt;table class=&quot;highlight tab-size js-file-line-container&quot; data-tab-size=&quot;8&quot;&gt;
      &lt;tbody&gt;&lt;tr&gt;
        &lt;td id=&quot;file-heartbeat-dvc-run-2019-04-sh-L1&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;1&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-heartbeat-dvc-run-2019-04-sh-LC1&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;$ dvc run -f download.dvc \&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-heartbeat-dvc-run-2019-04-sh-L2&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;2&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-heartbeat-dvc-run-2019-04-sh-LC2&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;          -o joke.txt \&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-heartbeat-dvc-run-2019-04-sh-L3&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;3&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-heartbeat-dvc-run-2019-04-sh-LC3&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;          &lt;span class=&quot;pl-s&quot;&gt;&lt;span class=&quot;pl-pds&quot;&gt;&quot;&lt;/span&gt;curl https://geek-jokes.sameerkumar.website/api &gt; joke.txt&lt;span class=&quot;pl-pds&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-heartbeat-dvc-run-2019-04-sh-L4&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;4&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-heartbeat-dvc-run-2019-04-sh-LC4&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;$ dvc run -f duplicate.dvc \&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-heartbeat-dvc-run-2019-04-sh-L5&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;5&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-heartbeat-dvc-run-2019-04-sh-LC5&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;          -d joke.txt \&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-heartbeat-dvc-run-2019-04-sh-L6&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;6&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-heartbeat-dvc-run-2019-04-sh-LC6&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;          -o duplicate.txt \&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-heartbeat-dvc-run-2019-04-sh-L7&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;7&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-heartbeat-dvc-run-2019-04-sh-LC7&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;          &lt;span class=&quot;pl-s&quot;&gt;&lt;span class=&quot;pl-pds&quot;&gt;&quot;&lt;/span&gt;cat joke.txt joke.txt &gt; duplicate.txt&lt;span class=&quot;pl-pds&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt;&lt;/td&gt;
      &lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;


  &lt;/div&gt;

  &lt;/div&gt;
&lt;/div&gt;

      &lt;/div&gt;
      &lt;div class=&quot;gist-meta&quot;&gt;
        &lt;a href=&quot;https://gist.github.com/SvetaGr/a2a28fbc9db0a675422785bc5f925e14/raw/3802fa1b440a2b798568e0cac1be81ae10dd2acd/heartbeat-dvc-run-2019-04.sh&quot; style=&quot;float:right&quot;&gt;view raw&lt;/a&gt;
        &lt;a href=&quot;https://gist.github.com/SvetaGr/a2a28fbc9db0a675422785bc5f925e14#file-heartbeat-dvc-run-2019-04-sh&quot;&gt;heartbeat-dvc-run-2019-04.sh&lt;/a&gt;
        hosted with ❤ by &lt;a href=&quot;https://github.com&quot;&gt;GitHub&lt;/a&gt;
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;&lt;/body&gt;&lt;/html&gt;&lt;/p&gt;
&lt;p&gt;you will end up with two stages: &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;download.dvc&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; and &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;duplicate.dvc&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;. The
download stage has &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;joke.txt&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; as an output. The duplicate stage declares
&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;joke.txt&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; as a dependency. Because both refer to the same file, DVC detects the link and creates
a pipeline by joining those stages.&lt;/p&gt;
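&lt;p&gt;To make the idea concrete, here is a minimal Python sketch of how an execution order can be derived from declared dependencies and outputs. It is an illustration only, not DVC’s actual implementation: the dictionary layout and the stage_order helper are assumptions made for this example, and it relies on the standard library graphlib module (Python 3.9+).&lt;/p&gt;

```python
from graphlib import TopologicalSorter

# Hypothetical in-memory description of the two stages from the example.
# DVC reads this information from the stage files; the dict layout used
# here is an assumption made purely for illustration.
stages = {
    "download.dvc": {"deps": [], "outs": ["joke.txt"]},
    "duplicate.dvc": {"deps": ["joke.txt"], "outs": ["duplicate.txt"]},
}


def stage_order(stages):
    """Return stage names ordered so that producers come before consumers."""
    # Map every output path to the stage that produces it.
    producer = {
        out: name for name, stage in stages.items() for out in stage["outs"]
    }
    # A stage depends on whichever stage produces one of its dependencies.
    graph = {
        name: {producer[dep] for dep in stage["deps"] if dep in producer}
        for name, stage in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())


print(stage_order(stages))  # ['download.dvc', 'duplicate.dvc']
```

&lt;p&gt;The only link between the two stages is the shared path joke.txt: it is an output of one stage and a dependency of the other, which is enough to order them.&lt;/p&gt;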
&lt;p&gt;You can inspect the content of each stage file (they are human
readable); the format is described
&lt;a href=&quot;https://dvc.org/doc/user-guide/dvc-file-format&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;Q: &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/485596304961962003/560022999848321026&quot;&gt;Is it possible to use the same data of a remote in two different repositories?&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;(e.g. running &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc pull -r my_remote&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; in one repo pulls some data, and running the
same command in a different Git repo should pull the same data)&lt;/p&gt;
&lt;p&gt;Yes! It’s a frequent scenario for multiple repos to share remotes and even the local
cache. A DVC-file serves as a link to the actual data. If you add the same DVC-file
(e.g. &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;data.dvc&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;) to the new repo and run &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc pull -r remotename data.dvc&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;,
it will fetch the data. You have to use &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc remote add&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; first to specify the
coordinates of the remote storage you would like to share in every project.
Alternatively (check out the question below), you could use &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;--global&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; to
specify a single default remote (and/or cache dir) per machine.&lt;/p&gt;
&lt;h3&gt;Q: &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/485586884165107734/559653121228275727&quot;&gt;Could I set a global remote server, instead of config in each project?&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Use &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;--global&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; when you specify the remote settings. The remote will then be visible
to all projects on the same machine. The &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;--global&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; option saves the remote configuration to
the global config (e.g. &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;~/.config/dvc/config&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;) instead of the per-project one,
&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;.dvc/config&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;. See more details
&lt;a href=&quot;https://dvc.org/doc/commands-reference/remote-add&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
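&lt;p&gt;As a rough illustration of how a global config and a per-project config can be layered, the Python sketch below reads two INI-style texts and lets the more specific one win. Everything in it is an assumption made for the example: the remote names, the URLs, the section layout, and the effective_default_remote helper. It shows the general precedence idea and is not DVC’s own config loader.&lt;/p&gt;

```python
import configparser

# Hypothetical config contents; remote names and URLs are made up.
GLOBAL_CONFIG = """
[core]
remote = shared

['remote "shared"']
url = s3://example-bucket/dvc-storage
"""

PROJECT_CONFIG = """
[core]
remote = project-only

['remote "project-only"']
url = /mnt/storage/project
"""


def effective_default_remote(*config_texts):
    """Read configs from least to most specific; later values override earlier ones."""
    parser = configparser.ConfigParser()
    for text in config_texts:
        parser.read_string(text)
    name = parser["core"]["remote"]
    # Section names keep the quotes exactly as written in the config text.
    url = parser[f"'remote \"{name}\"'"]["url"]
    return name, url


# Only the global config exists: every project sees the shared remote.
print(effective_default_remote(GLOBAL_CONFIG))
# A project config is present too: its default remote takes precedence.
print(effective_default_remote(GLOBAL_CONFIG, PROJECT_CONFIG))
```

&lt;p&gt;With only the global text, the default remote is the shared one; adding the project text switches the default while the shared remote stays defined and available by name.&lt;/p&gt;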
&lt;h3&gt;Q: &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/485596304961962003/554679392823934977&quot;&gt;How do I version a large dataset in S3 or any other storage?&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;We recommend skimming through our
&lt;a href=&quot;https://dvc.org/doc/get-started&quot;&gt;get started&lt;/a&gt; tutorial. To summarize the data
versioning process of DVC:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You create stage files (aka DVC-files) by adding or importing files (&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc add&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; /
&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc import&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;), or by running a command that generates files:&lt;/li&gt;
&lt;/ul&gt;
&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;dvc&quot;&gt;&lt;pre class=&quot;language-dvc&quot;&gt;&lt;code class=&quot;language-dvc&quot;&gt;&lt;span class=&quot;token line&quot;&gt;&lt;span class=&quot;token input&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;token dvc&quot;&gt;dvc run&lt;/span&gt; --out file.csv &lt;span class=&quot;token string&quot;&gt;&quot;wget https://example.com/file.csv&quot;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/body&gt;&lt;/html&gt;
&lt;ul&gt;
&lt;li&gt;These stage files are tracked by &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;git&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;&lt;/li&gt;
&lt;li&gt;You use git to retrieve previous stage files (e.g. &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;git checkout v1.0&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;)&lt;/li&gt;
&lt;li&gt;Then use &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc checkout&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; to retrieve all the files referenced by those stage files&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All your files (in every version) are stored in the &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;.dvc/cache&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;
directory, which you sync with remote file storage (for example, an S3 bucket) using the
&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc push&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; or &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc pull&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; commands. These are analogous to &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;git push&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; / &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;git pull&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;, but
instead of syncing your &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;.git&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; directory, you are syncing the cache inside your &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;.dvc&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; directory
with the remote storage.&lt;/p&gt;
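&lt;p&gt;The reason several versions of the same file can live side by side is that the cache is addressed by content. The Python sketch below illustrates that idea with MD5 checksums and a two-character directory prefix. The helper names (cache_path, save_version, load_version) and the sample data are hypothetical, and the sketch leaves out everything a real tool does around linking, directories, and remotes.&lt;/p&gt;

```python
import hashlib
import tempfile
from pathlib import Path


def cache_path(cache_dir, checksum):
    """Location of a cached object: first two hex characters, then the rest."""
    return Path(cache_dir) / checksum[:2] / checksum[2:]


def save_version(cache_dir, content):
    """Store content under its MD5 checksum (once) and return the checksum."""
    checksum = hashlib.md5(content).hexdigest()
    target = cache_path(cache_dir, checksum)
    target.parent.mkdir(parents=True, exist_ok=True)
    if not target.exists():
        target.write_bytes(content)
    return checksum


def load_version(cache_dir, checksum):
    """Read back the content identified by a checksum."""
    return cache_path(cache_dir, checksum).read_bytes()


with tempfile.TemporaryDirectory() as cache_dir:
    # Two versions of the "same" dataset file get two different checksums.
    v1 = save_version(cache_dir, b"id,label\n1,cat\n")
    v2 = save_version(cache_dir, b"id,label\n1,cat\n2,dog\n")
    # A small pointer file kept in Git only needs to record v1 or v2.
    print(load_version(cache_dir, v1))
    print(load_version(cache_dir, v2))
```

&lt;p&gt;Checking out an older Git revision brings back the older checksum, and that checksum is all that is needed to find the matching content in the cache.&lt;/p&gt;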
&lt;h3&gt;Q: &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/485596304961962003/558216007684980736&quot;&gt;How do I move/rename a DVC-file?&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;If you need to move your DVC-file somewhere, it is pretty easy, even if done
manually:&lt;/p&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;div id=&quot;gist95752643&quot; class=&quot;gist&quot;&gt;
    &lt;div class=&quot;gist-file&quot;&gt;
      &lt;div class=&quot;gist-data&quot;&gt;
        &lt;div class=&quot;js-gist-file-update-container js-task-list-container file-box&quot;&gt;
  &lt;div id=&quot;file-heartbeat-dvc-rename-sh&quot; class=&quot;file&quot;&gt;
    

  &lt;div itemprop=&quot;text&quot; class=&quot;Box-body p-0 blob-wrapper data type-shell&quot;&gt;
      
&lt;table class=&quot;highlight tab-size js-file-line-container&quot; data-tab-size=&quot;8&quot;&gt;
      &lt;tbody&gt;&lt;tr&gt;
        &lt;td id=&quot;file-heartbeat-dvc-rename-sh-L1&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;1&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-heartbeat-dvc-rename-sh-LC1&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;$ mv my.dvc data/my.dvc&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-heartbeat-dvc-rename-sh-L2&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;2&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-heartbeat-dvc-rename-sh-LC2&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-c&quot;&gt;&lt;span class=&quot;pl-c&quot;&gt;#&lt;/span&gt; and now open data/my.dvc with your favorite editor and change wdir in it to &apos;wdir: ../&apos;.&lt;/span&gt;&lt;/td&gt;
      &lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;


  &lt;/div&gt;

  &lt;/div&gt;
&lt;/div&gt;

      &lt;/div&gt;
      &lt;div class=&quot;gist-meta&quot;&gt;
        &lt;a href=&quot;https://gist.github.com/SvetaGr/b25a5b45773bf94d36e60d48462502f4/raw/b9f920208a50afb55bda6c7527081babfcc323fe/heartbeat-dvc-rename.sh&quot; style=&quot;float:right&quot;&gt;view raw&lt;/a&gt;
        &lt;a href=&quot;https://gist.github.com/SvetaGr/b25a5b45773bf94d36e60d48462502f4#file-heartbeat-dvc-rename-sh&quot;&gt;heartbeat-dvc-rename.sh&lt;/a&gt;
        hosted with ❤ by &lt;a href=&quot;https://github.com&quot;&gt;GitHub&lt;/a&gt;
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;&lt;/body&gt;&lt;/html&gt;&lt;/p&gt;
&lt;h3&gt;Q: &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/485596304961962003/555431645402890255&quot;&gt;I performed &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc push&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; of a file to a remote. On the remote there is created a directory called &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;8f&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; with a file inside called &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;2ec34faf91ff15ef64abf3fbffa7ee&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;. The original CSV file doesn’t appear on the remote. Is that expected behaviour?&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;This is expected behavior. DVC saves each file under a name derived from its
checksum in order to prevent duplication. If you delete the “pushed” file in your
project directory and perform &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc pull&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;, DVC will take care of pulling the file
and restoring it under its “original” name.&lt;/p&gt;
&lt;p&gt;Below are some details about how DVC cache works, just to illustrate the logic.
When you add a data source:&lt;/p&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;div id=&quot;gist95752678&quot; class=&quot;gist&quot;&gt;
    &lt;div class=&quot;gist-file&quot;&gt;
      &lt;div class=&quot;gist-data&quot;&gt;
        &lt;div class=&quot;js-gist-file-update-container js-task-list-container file-box&quot;&gt;
  &lt;div id=&quot;file-heartbeat-remote-file-naming-sh&quot; class=&quot;file&quot;&gt;
    

  &lt;div itemprop=&quot;text&quot; class=&quot;Box-body p-0 blob-wrapper data type-shell&quot;&gt;
      
&lt;table class=&quot;highlight tab-size js-file-line-container&quot; data-tab-size=&quot;8&quot;&gt;
      &lt;tbody&gt;&lt;tr&gt;
        &lt;td id=&quot;file-heartbeat-remote-file-naming-sh-L1&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;1&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-heartbeat-remote-file-naming-sh-LC1&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;$ &lt;span class=&quot;pl-c1&quot;&gt;echo&lt;/span&gt; &lt;span class=&quot;pl-s&quot;&gt;&lt;span class=&quot;pl-pds&quot;&gt;&quot;&lt;/span&gt;foo&lt;span class=&quot;pl-pds&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;&gt;&lt;/span&gt; data.txt&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-heartbeat-remote-file-naming-sh-L2&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;2&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-heartbeat-remote-file-naming-sh-LC2&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;$ dvc add data.txt&lt;/td&gt;
      &lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;


  &lt;/div&gt;

  &lt;/div&gt;
&lt;/div&gt;

      &lt;/div&gt;
      &lt;div class=&quot;gist-meta&quot;&gt;
        &lt;a href=&quot;https://gist.github.com/SvetaGr/b69fa8ce36bcce00ecd69e7f2d7ccd2e/raw/34017336326e3773f2e3a490e1f66265025f8c81/heartbeat-remote-file-naming.sh&quot; style=&quot;float:right&quot;&gt;view raw&lt;/a&gt;
        &lt;a href=&quot;https://gist.github.com/SvetaGr/b69fa8ce36bcce00ecd69e7f2d7ccd2e#file-heartbeat-remote-file-naming-sh&quot;&gt;heartbeat-remote-file-naming.sh&lt;/a&gt;
        hosted with ❤ by &lt;a href=&quot;https://github.com&quot;&gt;GitHub&lt;/a&gt;
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;&lt;/body&gt;&lt;/html&gt;&lt;/p&gt;
&lt;p&gt;DVC computes the (md5) checksum of the file and generates a DVC-file with the related
information:&lt;/p&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;div id=&quot;gist95752688&quot; class=&quot;gist&quot;&gt;
    &lt;div class=&quot;gist-file&quot;&gt;
      &lt;div class=&quot;gist-data&quot;&gt;
        &lt;div class=&quot;js-gist-file-update-container js-task-list-container file-box&quot;&gt;
  &lt;div id=&quot;file-heartbeat-dvc-file-2019-04-yaml&quot; class=&quot;file&quot;&gt;
    

  &lt;div itemprop=&quot;text&quot; class=&quot;Box-body p-0 blob-wrapper data type-yaml&quot;&gt;
      
&lt;table class=&quot;highlight tab-size js-file-line-container&quot; data-tab-size=&quot;8&quot;&gt;
      &lt;tbody&gt;&lt;tr&gt;
        &lt;td id=&quot;file-heartbeat-dvc-file-2019-04-yaml-L1&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;1&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-heartbeat-dvc-file-2019-04-yaml-LC1&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-ent&quot;&gt;md5&lt;/span&gt;: &lt;span class=&quot;pl-s&quot;&gt;3bccbf004063977442029334c3448687&lt;/span&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-heartbeat-dvc-file-2019-04-yaml-L2&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;2&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-heartbeat-dvc-file-2019-04-yaml-LC2&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-ent&quot;&gt;outs&lt;/span&gt;:&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-heartbeat-dvc-file-2019-04-yaml-L3&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;3&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-heartbeat-dvc-file-2019-04-yaml-LC3&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;- &lt;span class=&quot;pl-ent&quot;&gt;cache&lt;/span&gt;: &lt;span class=&quot;pl-c1&quot;&gt;true&lt;/span&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-heartbeat-dvc-file-2019-04-yaml-L4&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;4&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-heartbeat-dvc-file-2019-04-yaml-LC4&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;  &lt;span class=&quot;pl-ent&quot;&gt;md5&lt;/span&gt;: &lt;span class=&quot;pl-s&quot;&gt;d3b07384d113edec49eaa6238ad5ff00&lt;/span&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-heartbeat-dvc-file-2019-04-yaml-L5&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;5&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-heartbeat-dvc-file-2019-04-yaml-LC5&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;  &lt;span class=&quot;pl-ent&quot;&gt;metric&lt;/span&gt;: &lt;span class=&quot;pl-c1&quot;&gt;false&lt;/span&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-heartbeat-dvc-file-2019-04-yaml-L6&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;6&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-heartbeat-dvc-file-2019-04-yaml-LC6&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;  &lt;span class=&quot;pl-ent&quot;&gt;path&lt;/span&gt;: &lt;span class=&quot;pl-s&quot;&gt;data.txt&lt;/span&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-heartbeat-dvc-file-2019-04-yaml-L7&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;7&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-heartbeat-dvc-file-2019-04-yaml-LC7&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-ent&quot;&gt;wdir&lt;/span&gt;: &lt;span class=&quot;pl-s&quot;&gt;..&lt;/span&gt;&lt;/td&gt;
      &lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;


  &lt;/div&gt;

  &lt;/div&gt;
&lt;/div&gt;

      &lt;/div&gt;
      &lt;div class=&quot;gist-meta&quot;&gt;
        &lt;a href=&quot;https://gist.github.com/SvetaGr/110ae76df929654ec573ea9e4b1e1980/raw/3ccd7b7ab89e1e4246c1d8c83d6051df2379bd6d/heartbeat-dvc-file-2019-04.yaml&quot; style=&quot;float:right&quot;&gt;view raw&lt;/a&gt;
        &lt;a href=&quot;https://gist.github.com/SvetaGr/110ae76df929654ec573ea9e4b1e1980#file-heartbeat-dvc-file-2019-04-yaml&quot;&gt;heartbeat-dvc-file-2019-04.yaml&lt;/a&gt;
        hosted with ❤ by &lt;a href=&quot;https://github.com&quot;&gt;GitHub&lt;/a&gt;
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;&lt;/body&gt;&lt;/html&gt;&lt;/p&gt;
&lt;p&gt;The original file is moved to the cache, and a link or copy (depending on your
filesystem) is created to replace it in your workspace:&lt;/p&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;div id=&quot;gist95752708&quot; class=&quot;gist&quot;&gt;
    &lt;div class=&quot;gist-file&quot;&gt;
      &lt;div class=&quot;gist-data&quot;&gt;
        &lt;div class=&quot;js-gist-file-update-container js-task-list-container file-box&quot;&gt;
  &lt;div id=&quot;file-heartbeat-cache-structure-2019-04-sh&quot; class=&quot;file&quot;&gt;
    

  &lt;div itemprop=&quot;text&quot; class=&quot;Box-body p-0 blob-wrapper data type-shell&quot;&gt;
      
&lt;table class=&quot;highlight tab-size js-file-line-container&quot; data-tab-size=&quot;8&quot;&gt;
      &lt;tbody&gt;&lt;tr&gt;
        &lt;td id=&quot;file-heartbeat-cache-structure-2019-04-sh-L1&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;1&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-heartbeat-cache-structure-2019-04-sh-LC1&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;.dvc/cache&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-heartbeat-cache-structure-2019-04-sh-L2&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;2&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-heartbeat-cache-structure-2019-04-sh-LC2&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;└── d3&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-heartbeat-cache-structure-2019-04-sh-L3&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;3&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-heartbeat-cache-structure-2019-04-sh-LC3&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;    └── b07384d113edec49eaa6238ad5ff00&lt;/td&gt;
      &lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;


  &lt;/div&gt;

  &lt;/div&gt;
&lt;/div&gt;

      &lt;/div&gt;
      &lt;div class=&quot;gist-meta&quot;&gt;
        &lt;a href=&quot;https://gist.github.com/SvetaGr/133cb93e5a21c6f21a86f8709ed39ea9/raw/540aa50da9bb891da01030a8877688b74eecc20e/heartbeat-cache-structure-2019-04.sh&quot; style=&quot;float:right&quot;&gt;view raw&lt;/a&gt;
        &lt;a href=&quot;https://gist.github.com/SvetaGr/133cb93e5a21c6f21a86f8709ed39ea9#file-heartbeat-cache-structure-2019-04-sh&quot;&gt;heartbeat-cache-structure-2019-04.sh&lt;/a&gt;
        hosted with ❤ by &lt;a href=&quot;https://github.com&quot;&gt;GitHub&lt;/a&gt;
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;&lt;/body&gt;&lt;/html&gt;&lt;/p&gt;
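&lt;p&gt;To make the naming scheme concrete, here is a minimal Python sketch that reproduces the path shown above from the file contents. The helper name &lt;code class=&quot;language-text&quot;&gt;cache_path&lt;/code&gt; is hypothetical and this is not DVC’s own code; it only assumes the layout from the listing (the first two characters of the md5 checksum name the directory, the remaining characters name the file).&lt;/p&gt;

```python
import hashlib
import posixpath


def cache_path(content, cache_dir=".dvc/cache"):
    """Return the cache location for some file contents (illustrative helper, not a DVC API)."""
    checksum = hashlib.md5(content).hexdigest()
    # The first two hex characters name the directory, the remaining 30 name the file.
    return posixpath.join(cache_dir, checksum[:2], checksum[2:])


# `echo "foo"` writes the three letters followed by a newline character.
print(cache_path(b"foo\n"))
```

&lt;p&gt;Because the name depends only on the contents, two files with identical data map to the same cache entry, which is how duplication is avoided.&lt;/p&gt;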
&lt;h3&gt;Q: &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/485586884165107734/553570391000481802&quot;&gt;Is it possible to integrate dvc with our in-house tools developed in Python?&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Absolutely! There are three ways you could interact with DVC:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Use &lt;a href=&quot;https://docs.python.org/3/library/subprocess.html&quot;&gt;subprocess&lt;/a&gt; to launch
DVC&lt;/li&gt;
&lt;li&gt;Use &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;from dvc.main import main&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; and call it with the regular CLI arguments, like
&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;ret = main([&apos;add&apos;, &apos;foo&apos;])&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;&lt;/li&gt;
&lt;li&gt;Use our internal API (see &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc/repo&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; and &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc/command&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; in our source to get a
grasp of it). It is not officially public yet, and we don’t have any special
docs for it, but it is fairly stable and could definitely be used for a POC.
We’ll add docs and all the official stuff for it in the not-so-distant
future.&lt;/li&gt;
&lt;/ol&gt;
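&lt;p&gt;As a rough illustration of the first option, the sketch below wraps a command-line call with &lt;code class=&quot;language-text&quot;&gt;subprocess&lt;/code&gt;. The helper name &lt;code class=&quot;language-text&quot;&gt;run_cli&lt;/code&gt; and its &lt;code class=&quot;language-text&quot;&gt;executable&lt;/code&gt; parameter are assumptions made for this example, not part of DVC. The demo call runs the Python interpreter so that the snippet works even where DVC is not installed.&lt;/p&gt;

```python
import subprocess
import sys


def run_cli(args, executable="dvc"):
    """Run a command-line tool and return (exit code, combined output).

    Illustrative helper: `executable` defaults to "dvc", which is assumed
    to be installed and available on PATH.
    """
    result = subprocess.run(
        [executable, *args],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return result.returncode, result.stdout


# Demo with the Python interpreter itself, so no DVC installation is needed.
code, output = run_cli(["--version"], executable=sys.executable)
print(code, output.strip())

# With DVC installed, the equivalent of `dvc add foo` would be:
#     code, output = run_cli(["add", "foo"])
```

&lt;p&gt;Checking the returned exit code lets an in-house tool decide how to react when a DVC command fails.&lt;/p&gt;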
&lt;h3&gt;Q: &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/485586884165107734/555750217522216990&quot;&gt;Can I still track the linkage between data and model without using &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc run&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;&lt;/a&gt; and a graph of tasks? Basically, I would like an extremely minimal DVC invasion into my Git repo for an existing machine learning application.&lt;/h3&gt;
&lt;p&gt;There are two options:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Use &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc add&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; to track models and/or input datasets. It should be enough if
you use &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;git commit&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; on the DVC files produced by &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc add&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;. This is the very
minimum you can get with DVC, and it does not require using &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc run&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;. Check the
first part (up to the Pipelines/Add transformations section) of the DVC
&lt;a href=&quot;https://dvc.org/doc/get-started&quot;&gt;get started&lt;/a&gt; guide.&lt;/li&gt;
&lt;li&gt;You could use &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;--no-exec&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; in &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc run&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; and then just &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc commit&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; and
&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;git commit&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; the results. That way you’ll get your DVC files with all the
linkages, without having to actually run your commands through DVC.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;If you have any questions, concerns or ideas, let us know
&lt;a href=&quot;https://dvc.org/support&quot;&gt;here&lt;/a&gt; and our stellar team will get back to you in no
time.&lt;/p&gt;</content:encoded></item><item><title><![CDATA[March ’19 DVC❤️Heartbeat]]></title><link>https://blog.dvc.org/march-19-dvc-heartbeat</link><guid isPermaLink="false">https://blog.dvc.org/march-19-dvc-heartbeat</guid><pubDate>Tue, 05 Mar 2019 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;This is the very first issue of the DVC❤️Heartbeat. Every month we will be
sharing our news, findings, interesting reads, community takeaways, and
everything along the way.&lt;/p&gt;
&lt;p&gt;Some of those are related to our brainchild &lt;a href=&quot;https://dvc.org&quot;&gt;DVC&lt;/a&gt; and its
journey. The others are a collection of exciting stories and ideas centered
around ML best practices and workflow.&lt;/p&gt;
&lt;h2&gt;News and links&lt;/h2&gt;
&lt;p&gt;We read a ton of articles and posts every day, and here are a few that caught our
eye. They are well written, offer a different perspective, and are definitely worth
checking out.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://veekaybee.github.io/2019/02/13/data-science-is-different/&quot;&gt;Data science is different now&lt;/a&gt;
by &lt;a href=&quot;https://veekaybee.github.io/&quot;&gt;Vicki Boykis&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;section class=&quot;elp-content-holder&quot;&gt;
      &lt;a href=&quot;https://veekaybee.github.io/2019/02/13/data-science-is-different/&quot; class=&quot;external-link-preview&quot;&gt;
          &lt;div class=&quot;elp-description-holder&quot;&gt;
            &lt;h4 class=&quot;elp-title&quot;&gt;Data science is different now&lt;/h4&gt;
            &lt;div class=&quot;elp-description&quot;&gt;Woman holding a balance, Vermeer 1664 What do you think of when you read the phrase &apos;data science&apos;? It&apos;s probably some…&lt;/div&gt;
            &lt;div class=&quot;elp-link&quot;&gt;veekaybee.github.io&lt;/div&gt;
          &lt;/div&gt;
           &lt;div class=&quot;elp-image-holder&quot;&gt;
                &lt;img src=&quot;/uploads/images/2019-03-05/data-science-is-different-now.png&quot; alt=&quot;Data science is different now&quot;&gt;
            &lt;/div&gt;
      &lt;/a&gt;
    &lt;/section&gt;
    &lt;/body&gt;&lt;/html&gt;&lt;/body&gt;&lt;/html&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;What is becoming clear is that, in the late stage of the hype cycle, data
science is asymptotically moving closer to engineering, and the
&lt;a href=&quot;https://www.youtube.com/watch?v=frQeK8xo9Ls&quot;&gt;skills that data scientists need&lt;/a&gt;
moving forward are less visualization and statistics-based, and
&lt;a href=&quot;https://tech.trivago.com/2018/12/03/teardown-rebuild-migrating-from-hive-to-pyspark/&quot;&gt;more in line with traditional computer science curricula&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://emilygorcenski.com/post/data-versioning/&quot;&gt;Data Versioning&lt;/a&gt; by
&lt;a href=&quot;https://emilygorcenski.com/&quot;&gt;Emily F. Gorcenski&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;section class=&quot;elp-content-holder&quot;&gt;
      &lt;a href=&quot;https://emilygorcenski.com/post/data-versioning/&quot; class=&quot;external-link-preview&quot;&gt;
          &lt;div class=&quot;elp-description-holder&quot;&gt;
            &lt;h4 class=&quot;elp-title&quot;&gt;Data Versioning&lt;/h4&gt;
            &lt;div class=&quot;elp-description&quot;&gt;Productionizing machine learning/AI/data science is a challenge. Not only are the outputs of machine-learning…&lt;/div&gt;
            &lt;div class=&quot;elp-link&quot;&gt;emilygorcenski.com&lt;/div&gt;
          &lt;/div&gt;
           &lt;div class=&quot;elp-image-holder&quot;&gt;
                &lt;img src=&quot;/uploads/images/2019-03-05/data-versioning.jpeg&quot; alt=&quot;Data Versioning&quot;&gt;
            &lt;/div&gt;
      &lt;/a&gt;
    &lt;/section&gt;
    &lt;/body&gt;&lt;/html&gt;&lt;/body&gt;&lt;/html&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I want to explore how the degrees of freedom in versioning machine learning
systems poses a unique challenge. I’ll identify four key axes on which machine
learning systems have a notion of version, along with some brief
recommendations for how to simplify this a bit.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://blog.mi.hdm-stuttgart.de/index.php/2019/02/26/reproducibility-in-ml/&quot;&gt;Reproducibility in Machine Learning&lt;/a&gt;
by &lt;a href=&quot;https://blog.mi.hdm-stuttgart.de/index.php/author/pf023/&quot;&gt;Pascal Fecht&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;section class=&quot;elp-content-holder&quot;&gt;
      &lt;a href=&quot;https://blog.mi.hdm-stuttgart.de/index.php/2019/02/26/reproducibility-in-ml/&quot; class=&quot;external-link-preview&quot;&gt;
          &lt;div class=&quot;elp-description-holder&quot;&gt;
            &lt;h4 class=&quot;elp-title&quot;&gt;Reproducibility in Machine Learning | Computer Science Blog&lt;/h4&gt;
            &lt;div class=&quot;elp-description&quot;&gt;The rise of Machine Learning has led to changes across all areas of computer science. From a very abstract point of…&lt;/div&gt;
            &lt;div class=&quot;elp-link&quot;&gt;blog.mi.hdm-stuttgart.de&lt;/div&gt;
          &lt;/div&gt;
           &lt;div class=&quot;elp-image-holder&quot;&gt;
                &lt;img src=&quot;/uploads/images/2019-03-05/reproducibility-in-machine-learning.jpeg&quot; alt=&quot;Reproducibility in Machine Learning | Computer Science Blog&quot;&gt;
            &lt;/div&gt;
      &lt;/a&gt;
    &lt;/section&gt;
    &lt;/body&gt;&lt;/html&gt;&lt;/body&gt;&lt;/html&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;…the objective of this post is not to philosophize about the dangers and
dark sides of AI. In fact, this post aims to work out common challenges in
reproducibility for machine learning and shows programming differences to
other areas of Computer Science. Secondly, we will see practices and workflows
to create a higher grade of reproducibility in machine learning algorithms.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;hr&gt;&lt;/body&gt;&lt;/html&gt;
&lt;h2&gt;Discord gems&lt;/h2&gt;
&lt;p&gt;There are lots of hidden gems in our Discord community discussions. Sometimes
they are scattered all over the channels and hard to track down.&lt;/p&gt;
&lt;p&gt;We will be sifting through the issues and discussions and sharing the most
interesting takeaways.&lt;/p&gt;
&lt;h3&gt;Q: &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/485586884165107734/541622187296161816&quot;&gt;Edit and define DVC files manually, in a Makefile style&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;There is no separate guide for that, but it is very straightforward. See the
&lt;a href=&quot;https://dvc.org/doc/user-guide/dvc-file-format&quot;&gt;DVC file format&lt;/a&gt; description
for what a DVC-file looks like inside. All that &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc add&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; or &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc run&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; does is
compute the &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;md5&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; fields in it. You could write your DVC-file by hand
and then run &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc repro&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;, which will run the command (if any) and compute all the needed
checksums. &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/485586884165107734/541622187296161816&quot;&gt;Read more&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;Q: &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/485586884165107734/547424240677158915&quot;&gt;Best practices to define the code dependencies&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;There’s a ton of code in that project, and it’s very non-trivial to define the
code dependencies for my training stage — there are a lot of imports going on,
the training code is distributed across many modules,
&lt;a href=&quot;https://discordapp.com/channels/485586884165107732/485586884165107734/547424240677158915&quot;&gt;read more&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;Q: &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/485586884165107734/548495589428428801&quot;&gt;Azure data lake support&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;DVC officially supports only regular Azure blob storage. Gen1 Data Lake should
be accessible through the same interface, so configuring a regular Azure remote for
DVC should work. It seems that Gen2 Data Lake
&lt;a href=&quot;https://discordapp.com/channels/485586884165107732/485586884165107734/550546413197590539&quot;&gt;has disabled&lt;/a&gt;
the blob API. If you know more details about the difference between Gen1 and Gen2,
feel free to join &lt;a href=&quot;https://dvc.org/chat&quot;&gt;our community&lt;/a&gt; and share this
knowledge.&lt;/p&gt;
&lt;h3&gt;Q: &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/485596304961962003/542390986299539459&quot;&gt;What licence DVC is released under&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Apache 2.0, one of the &lt;a href=&quot;https://opensource.org/licenses&quot;&gt;most common&lt;/a&gt; and
permissive OSS licences.&lt;/p&gt;
&lt;h3&gt;Q: Setting up S3 compatible remote&lt;/h3&gt;
&lt;p&gt;(&lt;a href=&quot;https://discordapp.com/channels/485586884165107732/485596304961962003/543445798868746278&quot;&gt;Localstack&lt;/a&gt;,
&lt;a href=&quot;https://discordapp.com/channels/485586884165107732/485596304961962003/541466951474479115&quot;&gt;wasabi&lt;/a&gt;)&lt;/p&gt;
&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;dvc&quot;&gt;&lt;pre class=&quot;language-dvc&quot;&gt;&lt;code class=&quot;language-dvc&quot;&gt;&lt;span class=&quot;token line&quot;&gt;&lt;span class=&quot;token input&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;token dvc&quot;&gt;dvc remote add&lt;/span&gt; upstream s3://my-bucket
&lt;/span&gt;&lt;span class=&quot;token line&quot;&gt;&lt;span class=&quot;token input&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;token dvc&quot;&gt;dvc remote modify&lt;/span&gt; upstream region REGION_NAME
&lt;/span&gt;&lt;span class=&quot;token line&quot;&gt;&lt;span class=&quot;token input&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;token dvc&quot;&gt;dvc remote modify&lt;/span&gt; upstream endpointurl &lt;span class=&quot;token operator&quot;&gt;&amp;#x3C;&lt;/span&gt;url&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/body&gt;&lt;/html&gt;
&lt;p&gt;Find and click the &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;S3 API compatible storage&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; on
&lt;a href=&quot;https://dvc.org/doc/commands-reference/remote-add&quot;&gt;this page&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;Q: &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/485596304961962003/543914550173368332&quot;&gt;Why does DVC create and update the &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;.gitignore&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; file?&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;DVC adds the data files it tracks to that file so that you don’t
accidentally add them to Git as well. You can open it with any text editor
and see your data files listed there.&lt;/p&gt;
&lt;h3&gt;Q: &lt;a href=&quot;https://discordapp.com/channels/485586884165107732/485596304961962003/545562334983356426&quot;&gt;Managing data and pipelines with DVC on HDFS&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;With DVC, you can connect your data sources on HDFS to the pipeline in
your local project by simply specifying them as external dependencies. For
example, let’s say your script &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;process.cmd&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; works on an input file on HDFS and
then downloads a result to your local workspace. With DVC, it could look
something like:&lt;/p&gt;
&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;dvc&quot;&gt;&lt;pre class=&quot;language-dvc&quot;&gt;&lt;code class=&quot;language-dvc&quot;&gt;&lt;span class=&quot;token line&quot;&gt;&lt;span class=&quot;token input&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;token dvc&quot;&gt;dvc run&lt;/span&gt; -d hdfs://example.com/home/shared/input &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;
          -d process.cmd &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;
          -o output process.cmd&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/body&gt;&lt;/html&gt;
&lt;p&gt;&lt;a href=&quot;https://discordapp.com/channels/485586884165107732/485596304961962003/545562334983356426&quot;&gt;read more&lt;/a&gt;.&lt;/p&gt;
&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;hr&gt;&lt;/body&gt;&lt;/html&gt;
&lt;p&gt;If you have any questions, concerns or ideas, let us know
&lt;a href=&quot;https://dvc.org/support&quot;&gt;here&lt;/a&gt; and our stellar team will get back to you in no
time.&lt;/p&gt;</content:encoded></item><item><title><![CDATA[ML best practices in PyTorch dev conf 2018]]></title><link>https://blog.dvc.org/ml-best-practices-in-pytorch-dev-conf-2018</link><guid isPermaLink="false">https://blog.dvc.org/ml-best-practices-in-pytorch-dev-conf-2018</guid><pubDate>Thu, 18 Oct 2018 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;The issues discussed included applying traditional software development
techniques like unit testing, CI/CD systems, automated deployment, version
control, and more to the ML field. In this blog post, we will go over the best
practices ideas from PTDC-18 and the future of ML tool developments.&lt;/p&gt;
&lt;h2&gt;1. Engineering practices from PyTorch developers&lt;/h2&gt;
&lt;p&gt;In the PTDC-18
&lt;a href=&quot;https://www.facebook.com/pytorch/videos/482401942168584/&quot;&gt;keynote speech&lt;/a&gt;,
&lt;strong&gt;Jerome Pesenti&lt;/strong&gt; described the motivation and goals of PyTorch project and
what the future of machine learning looks like.&lt;/p&gt;
&lt;h3&gt;1.1. ML tooling future&lt;/h3&gt;
&lt;p&gt;Regarding the future of ML, Jerome envisioned “streamlined development, more
accessible tools, breakthrough hardware, and more”. Talking about the huge
gap between software engineering and ML engineering, Pesenti said:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Machine learning engineering is where we were in Software Engineering 20 years
ago. A lot of things still need to be invented. We need to figure out what
testing means, what CD (continuous delivery) means, we need to develop tools
and environments that people can develop &lt;strong&gt;robust ML that does not have too
many biases&lt;/strong&gt; and does not overfit.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;In that gap live many opportunities to develop new tools and services. We in
the ML ecosystem are called upon to implement the future of machine learning
tools. Traditional software engineering has many useful tools and techniques
which can either be repurposed for Machine Learning development or used as a
source for ideas in developing new tools.&lt;/p&gt;
&lt;h3&gt;1.2. PyTorch motivation&lt;/h3&gt;
&lt;p&gt;PyTorch 1.0 implements one important engineering principle: “a seamless
transition from AI research to production”. It helps to move AI technology from
research into production as quickly as possible. To do that, a few
challenges were solved:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Write code once&lt;/strong&gt; — no need to rewrite or re-optimize code to go from
research to prod.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Performance&lt;/strong&gt; — training models on large datasets.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Other languages&lt;/strong&gt; — not only Python which is great for prototyping but also
C++ and other languages.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scaling&lt;/strong&gt; — deploy PyTorch at scale more easily.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;2. Engineering practices for software 2.0&lt;/h2&gt;
&lt;h3&gt;2.1. Melting of software 2.0 and software 1.0&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Andrej Karpathy&lt;/strong&gt; from Tesla AI had a
&lt;a href=&quot;https://www.facebook.com/pytorch/videos/169366590639145/&quot;&gt;dedicated talk&lt;/a&gt; about
best engineering practices in ML. He drew a contrast between traditional
software development (software 1.0) and software utilizing Machine Learning
techniques (software 2.0), saying that&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;“software 2.0 code also has new feature demands, contains bugs, and requires
iterations.”&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Meaning that ML development has a lifecycle similar to traditional software:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;“When you are working with these [neural] networks &lt;strong&gt;in production&lt;/strong&gt; you are
doing much more than that [training and measuring models]. You maintaining the
codebase and that codebase is alive is just like 1.0 code.”&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Machine Learning models need to grow and develop feature-by-feature, bugs need
to be found and fixed, and repeatable processes are a must, as in earlier non-ML
software development practices.&lt;/p&gt;
&lt;h3&gt;2.2. Software 2.0 best practices&lt;/h3&gt;
&lt;p&gt;Karpathy went on to describe how software 1.0 best practices can be used in
software 2.0 (ML modeling):&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Test-driven development&lt;/strong&gt; — test/train dataset separation is not enough
since it describes only expected performance. Edge cases have to be tested to
ensure the model performs as required. That requires incorporating more
examples in datasets, or changing model architecture, or changing
optimization functions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Continuous Integration and Continuous Delivery&lt;/strong&gt; (CI/CD) — Intelligent use
of CI/CD can propel a team into rapid agile development of software systems.
The phases of CI/CD jobs include: 1) ML model auto re-training when the code or
dataset changes; 2) running unit tests; 3) easy access to the latest model; 4)
auto-deployment to test and/or production systems.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Version Control&lt;/strong&gt; — track all the changes in datasets (labels), not only
code.&lt;/li&gt;
&lt;li&gt;Train a &lt;strong&gt;single model&lt;/strong&gt; from scratch every time without using other
pre-trained models. (External pre-trained models don’t count as far as I
understand.) A chain of fine-tuned models very quickly disintegrates the
codebase. In software 1.0 a single &lt;strong&gt;monorepo&lt;/strong&gt; is an analog of a single
model, which also helps to avoid disintegration.&lt;/li&gt;
&lt;/ol&gt;
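&lt;p&gt;The first item in the list can be made concrete as a set of named edge-case checks that run alongside the aggregate test-set metric. The following is only a minimal sketch: &lt;code class=&quot;language-text&quot;&gt;predict&lt;/code&gt; is a hypothetical stand-in for a trained model, and the cases are invented for illustration, not taken from the talk.&lt;/p&gt;

```python
def predict(text):
    """Hypothetical stand-in for a trained model: flags posts that mention Python."""
    return "python" in text.lower()


# Each edge case is listed explicitly, so a regression points at a specific
# behavior instead of a small shift in an aggregate metric.
EDGE_CASES = [
    ("", False),                           # empty input must not crash
    ("PYTHON in capitals", True),          # case-insensitivity
    ("monty   python\tsketch", True),      # unusual whitespace
    ("R only, no other language", False),  # no mention at all
]


def run_edge_cases(model, cases):
    """Return the (input, expected, actual) triples for which the model failed."""
    failures = []
    for text, expected in cases:
        actual = model(text)
        if actual != expected:
            failures.append((text, expected, actual))
    return failures


failures = run_edge_cases(predict, EDGE_CASES)
print("failed edge cases:", failures)
```

&lt;p&gt;A CI job could run such a check after every re-training and refuse to publish a model for which the list of failures is not empty.&lt;/p&gt;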
&lt;p&gt;This list of best practices shows how serious Tesla AI is about robust software,
which is not surprising in the self-driving car field. Any company needs these
practices in order to organize a manageable ML development process.&lt;/p&gt;
&lt;h2&gt;3. Data file-centric tools&lt;/h2&gt;
&lt;p&gt;Frameworks and libraries like PyTorch are a significant step forward in machine
learning tooling and in bringing best practices to the field. However, frameworks and
libraries might not be enough for many of the ML best practices. For example,
dataset versioning, ML model versioning, continuous integration (CI) and
continuous delivery (CD) require manipulating and transferring data files.
These can be done in a &lt;strong&gt;more efficient and natural way by data management
tools&lt;/strong&gt; and storage systems rather than libraries.&lt;/p&gt;
&lt;p&gt;The need for a machine learning artifact manipulation tool with a &lt;strong&gt;data
file-centric philosophy&lt;/strong&gt; was the major motivation behind the open source project
that we created: Data Version Control (DVC), or &lt;a href=&quot;http://dvc.org&quot;&gt;DVC.org&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;DVC connects Git with data files and machine learning pipelines which helps keep
version control on machine learning models and datasets using familiar Git
semantics coupled with the power of cloud storage systems such as Amazon’s S3,
Google’s GCS, Microsoft’s Azure or bare-metal servers accessed by SSH.&lt;/p&gt;
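&lt;p&gt;One way to picture Git semantics for data files is a small pointer file that holds a checksum of the data: the pointer is committed to Git, while the data itself lives in a cache or in remote storage. The sketch below illustrates that idea only. It is not DVC's implementation; the &lt;code class=&quot;language-text&quot;&gt;.pointer.json&lt;/code&gt; suffix, the JSON layout, the SHA-256 choice and the function names are all assumptions made for this example.&lt;/p&gt;

```python
import hashlib
import json
import os
import tempfile


def file_checksum(path, chunk_size=1024 * 1024):
    """Compute a SHA-256 checksum of a file without loading it all into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fd:
        for chunk in iter(lambda: fd.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_pointer(data_path):
    """Write a small JSON pointer file next to the data file.

    The pointer is what would be committed to Git. Its name and layout
    are hypothetical, chosen for this sketch only.
    """
    pointer_path = data_path + ".pointer.json"
    pointer = {"path": os.path.basename(data_path), "sha256": file_checksum(data_path)}
    with open(pointer_path, "w") as fd:
        json.dump(pointer, fd)
    return pointer_path


def is_up_to_date(data_path, pointer_path):
    """Check whether the data file still matches the recorded checksum."""
    with open(pointer_path) as fd:
        pointer = json.load(fd)
    return pointer["sha256"] == file_checksum(data_path)


# Demo on a temporary file: the pointer matches until the data changes.
workdir = tempfile.mkdtemp()
data_path = os.path.join(workdir, "posts.xml")
with open(data_path, "w") as fd:
    fd.write("version 1")
pointer_path = write_pointer(data_path)
before_change = is_up_to_date(data_path, pointer_path)

with open(data_path, "w") as fd:
    fd.write("version 2")
after_change = is_up_to_date(data_path, pointer_path)

print(before_change, after_change)
```

&lt;p&gt;Because the pointer is a tiny text file, Git can diff, branch and merge it like any other source file, while the large data file never enters the Git history.&lt;/p&gt;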
&lt;p&gt;If PyTorch helps in organizing code inside an ML project, then data-centric tools
like DVC help organize different pieces of ML projects into a single workflow.
The machine learning future requires both types of tools: code level and data
file level.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Thus far only the first steps have been taken toward using machine learning
tooling and the best machine learning practices. Mostly large companies are
using these practices because they faced the problems a while ago. Best
practices should be embraced by the entire industry which will help to bring
machine learning to a new, higher level.&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Best practices of orchestrating Python and R code in ML projects]]></title><link>https://blog.dvc.org/best-practices-of-orchestrating-python-and-r-code-in-ml-projects</link><guid isPermaLink="false">https://blog.dvc.org/best-practices-of-orchestrating-python-and-r-code-in-ml-projects</guid><pubDate>Tue, 26 Sep 2017 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Besides Git and shell scripting, additional tools have been developed to facilitate the
development of predictive models in multi-language environments. For fast data
exchange between R and Python, let’s use the binary data file format
&lt;a href=&quot;https://blog.rstudio.com/2016/03/29/feather/&quot;&gt;Feather&lt;/a&gt;. Another
language-agnostic tool, &lt;a href=&quot;http://dvc.org&quot;&gt;DVC&lt;/a&gt;, can make the research reproducible: let’s
use DVC to orchestrate R and Python code instead of regular shell scripts.&lt;/p&gt;
&lt;h2&gt;Machine learning with R and Python&lt;/h2&gt;
&lt;p&gt;Both R and Python have powerful libraries/packages for predictive
modeling. Algorithms used for classification or regression are usually
implemented in both languages; some scientists use R while others
prefer Python. In the example explained in the previous
&lt;a href=&quot;https://blog.dataversioncontrol.com/r-code-and-reproducible-model-development-with-dvc-1507a0e3687b&quot;&gt;tutorial&lt;/a&gt;,
the target variable was a binary output and logistic regression was used as the training
algorithm. Another algorithm that could be used for prediction is the
popular &lt;a href=&quot;https://en.wikipedia.org/wiki/Random_forest&quot;&gt;Random Forest algorithm&lt;/a&gt;,
which is implemented in both programming languages. For performance reasons, it
was decided that the Random Forest classifier should be implemented in Python (it
shows better performance than the random forest package in R).&lt;/p&gt;
&lt;h2&gt;R example used for DVC demo&lt;/h2&gt;
&lt;p&gt;We will use the same example from the previous blog
&lt;a href=&quot;https://blog.dataversioncontrol.com/r-code-and-reproducible-model-development-with-dvc-1507a0e3687b&quot;&gt;story&lt;/a&gt;,
add some Python code and explain how Feather and DVC can simplify the
development process in this combined environment.&lt;/p&gt;
&lt;p&gt;Let’s briefly recall the R code from the previous tutorial:&lt;/p&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;span class=&quot;gatsby-resp-image-wrapper&quot; style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto;  max-width: 335px;&quot;&gt;
      &lt;a class=&quot;gatsby-resp-image-link&quot; href=&quot;/static/68824bc8c4ac0c84edf737da9f1bfa01/9be56/r-jobs.png&quot; style=&quot;display: block&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;
    &lt;span class=&quot;gatsby-resp-image-background-image&quot; style=&quot;padding-bottom: 78.65671641791046%; position: relative; bottom: 0; left: 0; background-image: url(&amp;#x27;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAQCAYAAAAWGF8bAAAACXBIWXMAAAsSAAALEgHS3X78AAADs0lEQVQ4y22UXVAbVRTH70y77GzVwVl19MUnp/rmm31xnNGxdcqItAOD+qC+MU3F2nxtMULIB0lKEgqjEFoSyBJKS5qQhFKVggntlI9WGpJAdpOUJ18qmVZJcIb1KV7PvSlQRx9+c8499//fvXvuSdDZkTTWiBmsHc1g3R5rdC2MrWFHJI+7owUaz0fz+JvxdawV9zUkEsgzCEg9sqroRtOK5ZqkmINZxQTQHOgKSUp3NKe4pwo7zlh+h6wtoarOCnFXawvLyrlARtH4UwrqDqY4dyjNXbqR5S5OZ7nvY2vc0A8SNwDxE9tNrmXsIfv60ZbnD7/7BX/EnGVPmH7ijIEVzj+T43w/StQ3CL7eyQxHnoVOX1oxtnrvG1uH/s1X3qSxfTzTaRhf79B5Fx1nBhJua0ju6ICaZmTV+CXVPK2vgtzxEu6b38bueBn3zpdxTwKg+TYevruD/fcU7FvewT7IAyt/YfEXBQ/c+ZNqiJf4LiTKNBKQIyILzmhOsIerEXomnIcaiabguuCI5PQdl1NtlolMW+dEVt95ldSq+7awBMhUb5+UBQeAGqxxvt6a4E/a5vl6S4L/0Byn8SOoNdvj/AdOqfbNk8Krbxxtea25f6O20fYzfwL2GoAmxy2+yX6Laom/EXJ0dni1ohPTFa2YopCcADdWgZurwKkr8Oa/4RRArmK8ul7R+lN7+n1PinoQNLgIV1+EMSh2hSWaw3js1jZdsVwRPmvTNVX4DR5Kc/g8qt/3kbpUbBvLFJFhaInRDS4wmoEFRri4yHzrXWLM/ruMeyLJqPtma9Dp4oFnDr9fi155i0ef/nqgwRCrcV25z/QGk0wPaAygV/ffYbSeBepFX/uSHjilR/2EM7CGkaFAOzymoNSvH77nM4jLYldI7teLqb19gvopLwEZYw+x6fompRNwzj7CF+YeY9fsY9wX/x17F8vYc3sL+5dhjJbK+Lv5P7Dr5iPcAxo3YJ7e3PMTyE9PBQ2lQLNV1mtZFfRKZZ+UVA4A+nOq/fJqK8EclE7ZwxLd6wYNXJZKP5qmvl0/+twxx35smWHf006xdW3T7IuNoyw65mOfrfezz9UNsujt5EH0wjs8eunIy6j5wcFDxz1sTd0waLzsIdAQ3/Fz02yTaYb9zD7HkrEpacV0Cd70H2AcSvAHsOWMymU40Tbc8JYQ+H8tjE8JxqaENOIaNlyRsSP6AFsnC7gLsEcK2BzKY0s4j51TG9h1fYNGoiE1W6Sqsz6BrNsnZKwPZPE/BKccZhNrfagAAAAASUVORK5CYII=&amp;#x27;); background-size: cover; display: block;&quot;&gt;&lt;/span&gt;
  &lt;picture&gt;
        &lt;source srcset=&quot;/static/68824bc8c4ac0c84edf737da9f1bfa01/c54d4/r-jobs.webp 175w, /static/68824bc8c4ac0c84edf737da9f1bfa01/a3432/r-jobs.webp 350w, /static/68824bc8c4ac0c84edf737da9f1bfa01/6cceb/r-jobs.webp 670w&quot; sizes=&quot;(max-width: 670px) 100vw, 670px&quot; type=&quot;image/webp&quot;&gt;
        &lt;source srcset=&quot;/static/68824bc8c4ac0c84edf737da9f1bfa01/17006/r-jobs.png 175w, /static/68824bc8c4ac0c84edf737da9f1bfa01/d6f3f/r-jobs.png 350w, /static/68824bc8c4ac0c84edf737da9f1bfa01/9be56/r-jobs.png 670w&quot; sizes=&quot;(max-width: 670px) 100vw, 670px&quot; type=&quot;image/png&quot;&gt;
        &lt;img class=&quot;gatsby-resp-image-image&quot; src=&quot;/static/68824bc8c4ac0c84edf737da9f1bfa01/9be56/r-jobs.png&quot; alt=&quot;R Jobs&quot; title=&quot;R Jobs&quot; loading=&quot;lazy&quot; style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;&gt;
      &lt;/picture&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/body&gt;&lt;/html&gt;&lt;em&gt;R Jobs&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;The input data is an XML file of StackOverflow posts. Predictive variables are
created from the post text: the relative importance
&lt;a href=&quot;https://en.wikipedia.org/wiki/Tf%E2%80%93idf&quot;&gt;tf-idf&lt;/a&gt; of words among all
available posts is calculated. The tf-idf matrices are used to predict the binary
target with lasso logistic regression. AUC is calculated on the test set and
serves as the evaluation metric.&lt;/p&gt;
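&lt;p&gt;For readers who have not met AUC before, it can be read as the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition in plain Python. It is meant as an illustration on invented toy data, not as the evaluation code of this tutorial, and the pairwise loop is only practical for small inputs.&lt;/p&gt;

```python
def auc(labels, scores):
    """Area under the ROC curve computed by pairwise comparison.

    Counts the fraction of (positive, negative) pairs in which the positive
    example has the higher score; ties count as one half.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data: two negative and two positive examples with predicted scores.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc(labels, scores))  # 3 of the 4 pairs are ordered correctly: 0.75
```

&lt;p&gt;A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which makes the metric convenient for comparing models across runs.&lt;/p&gt;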
&lt;p&gt;Instead of using logistic regression in R, we will write Python jobs in which we
will try to use random forest as the training model. Train_model.R and evaluate.R
will be replaced with appropriate Python jobs.&lt;/p&gt;
&lt;p&gt;The R code can be seen
&lt;a href=&quot;https://blog.dataversioncontrol.com/r-code-and-reproducible-model-development-with-dvc-1507a0e3687b&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Code for &lt;code class=&quot;language-text&quot;&gt;train_model_Python.py&lt;/code&gt; is presented below:&lt;/p&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;div id=&quot;gist73527556&quot; class=&quot;gist&quot;&gt;
    &lt;div class=&quot;gist-file&quot;&gt;
      &lt;div class=&quot;gist-data&quot;&gt;
        &lt;div class=&quot;js-gist-file-update-container js-task-list-container file-box&quot;&gt;
  &lt;div id=&quot;file-train_model_python-py&quot; class=&quot;file&quot;&gt;
    

  &lt;div itemprop=&quot;text&quot; class=&quot;Box-body p-0 blob-wrapper data type-python&quot;&gt;
      
&lt;table class=&quot;highlight tab-size js-file-line-container&quot; data-tab-size=&quot;8&quot;&gt;
      &lt;tbody&gt;&lt;tr&gt;
        &lt;td id=&quot;file-train_model_python-py-L1&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;1&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_model_python-py-LC1&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-k&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;pl-s1&quot;&gt;numpy&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;pl-s1&quot;&gt;np&lt;/span&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_model_python-py-L2&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;2&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_model_python-py-LC2&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-k&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;pl-s1&quot;&gt;sklearn&lt;/span&gt;.&lt;span class=&quot;pl-s1&quot;&gt;ensemble&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;pl-v&quot;&gt;RandomForestClassifier&lt;/span&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_model_python-py-L3&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;3&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_model_python-py-LC3&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-k&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;pl-s1&quot;&gt;sys&lt;/span&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_model_python-py-L4&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;4&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_model_python-py-LC4&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-k&quot;&gt;try&lt;/span&gt;: &lt;span class=&quot;pl-k&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;pl-s1&quot;&gt;cPickle&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;pl-s1&quot;&gt;pickle&lt;/span&gt;   &lt;span class=&quot;pl-c&quot;&gt;# python2&lt;/span&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_model_python-py-L5&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;5&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_model_python-py-LC5&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-k&quot;&gt;except&lt;/span&gt;: &lt;span class=&quot;pl-k&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;pl-s1&quot;&gt;pickle&lt;/span&gt;           &lt;span class=&quot;pl-c&quot;&gt;# python3&lt;/span&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_model_python-py-L6&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;6&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_model_python-py-LC6&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-k&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;pl-s1&quot;&gt;scipy&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;pl-s1&quot;&gt;sparse&lt;/span&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_model_python-py-L7&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;7&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_model_python-py-LC7&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-k&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;pl-s1&quot;&gt;numpy&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;pl-s1&quot;&gt;loadtxt&lt;/span&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_model_python-py-L8&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;8&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_model_python-py-LC8&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-k&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;pl-s1&quot;&gt;feather&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;pl-s1&quot;&gt;ft&lt;/span&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_model_python-py-L9&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;9&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_model_python-py-LC9&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_model_python-py-L10&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;10&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_model_python-py-LC10&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;pl-en&quot;&gt;len&lt;/span&gt;(&lt;span class=&quot;pl-s1&quot;&gt;sys&lt;/span&gt;.&lt;span class=&quot;pl-s1&quot;&gt;argv&lt;/span&gt;) &lt;span class=&quot;pl-c1&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;pl-c1&quot;&gt;4&lt;/span&gt;:&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_model_python-py-L11&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;11&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_model_python-py-LC11&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;    &lt;span class=&quot;pl-s1&quot;&gt;sys&lt;/span&gt;.&lt;span class=&quot;pl-s1&quot;&gt;stderr&lt;/span&gt;.&lt;span class=&quot;pl-en&quot;&gt;write&lt;/span&gt;(&lt;span class=&quot;pl-s&quot;&gt;&apos;Arguments error. Usage:&lt;span class=&quot;pl-cce&quot;&gt;\n&lt;/span&gt;&apos;&lt;/span&gt;)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_model_python-py-L12&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;12&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_model_python-py-LC12&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;    &lt;span class=&quot;pl-s1&quot;&gt;sys&lt;/span&gt;.&lt;span class=&quot;pl-s1&quot;&gt;stderr&lt;/span&gt;.&lt;span class=&quot;pl-en&quot;&gt;write&lt;/span&gt;(&lt;span class=&quot;pl-s&quot;&gt;&apos;&lt;span class=&quot;pl-cce&quot;&gt;\t&lt;/span&gt;python train_model_Python.py INPUT_MATRIX_FILE SEED OUTPUT_MODEL_FILE&lt;span class=&quot;pl-cce&quot;&gt;\n&lt;/span&gt;&apos;&lt;/span&gt;)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_model_python-py-L13&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;13&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_model_python-py-LC13&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;    &lt;span class=&quot;pl-s1&quot;&gt;sys&lt;/span&gt;.&lt;span class=&quot;pl-en&quot;&gt;exit&lt;/span&gt;(&lt;span class=&quot;pl-c1&quot;&gt;1&lt;/span&gt;)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_model_python-py-L14&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;14&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_model_python-py-LC14&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_model_python-py-L15&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;15&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_model_python-py-LC15&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-s1&quot;&gt;input&lt;/span&gt; &lt;span class=&quot;pl-c1&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;pl-s1&quot;&gt;sys&lt;/span&gt;.&lt;span class=&quot;pl-s1&quot;&gt;argv&lt;/span&gt;[&lt;span class=&quot;pl-c1&quot;&gt;1&lt;/span&gt;]&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_model_python-py-L16&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;16&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_model_python-py-LC16&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-s1&quot;&gt;seed&lt;/span&gt; &lt;span class=&quot;pl-c1&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;pl-en&quot;&gt;int&lt;/span&gt;(&lt;span class=&quot;pl-s1&quot;&gt;sys&lt;/span&gt;.&lt;span class=&quot;pl-s1&quot;&gt;argv&lt;/span&gt;[&lt;span class=&quot;pl-c1&quot;&gt;2&lt;/span&gt;])&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_model_python-py-L17&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;17&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_model_python-py-LC17&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-s1&quot;&gt;output&lt;/span&gt; &lt;span class=&quot;pl-c1&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;pl-s1&quot;&gt;sys&lt;/span&gt;.&lt;span class=&quot;pl-s1&quot;&gt;argv&lt;/span&gt;[&lt;span class=&quot;pl-c1&quot;&gt;3&lt;/span&gt;]&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_model_python-py-L18&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;18&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_model_python-py-LC18&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_model_python-py-L19&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;19&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_model_python-py-LC19&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-s1&quot;&gt;df&lt;/span&gt; &lt;span class=&quot;pl-c1&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;pl-s1&quot;&gt;ft&lt;/span&gt;.&lt;span class=&quot;pl-en&quot;&gt;read_dataframe&lt;/span&gt;(&lt;span class=&quot;pl-s1&quot;&gt;input&lt;/span&gt;)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_model_python-py-L20&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;20&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_model_python-py-LC20&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-s1&quot;&gt;labels&lt;/span&gt; &lt;span class=&quot;pl-c1&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;pl-s1&quot;&gt;df&lt;/span&gt;.&lt;span class=&quot;pl-s1&quot;&gt;loc&lt;/span&gt;[:,&lt;span class=&quot;pl-s&quot;&gt;&apos;label&apos;&lt;/span&gt;]&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_model_python-py-L21&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;21&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_model_python-py-LC21&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-s1&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;pl-c1&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;pl-s1&quot;&gt;df&lt;/span&gt;.&lt;span class=&quot;pl-s1&quot;&gt;loc&lt;/span&gt;[:, &lt;span class=&quot;pl-s1&quot;&gt;df&lt;/span&gt;.&lt;span class=&quot;pl-s1&quot;&gt;columns&lt;/span&gt; &lt;span class=&quot;pl-c1&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;pl-s&quot;&gt;&apos;label&apos;&lt;/span&gt;]&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_model_python-py-L22&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;22&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_model_python-py-LC22&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_model_python-py-L23&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;23&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_model_python-py-LC23&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-s1&quot;&gt;clf&lt;/span&gt; &lt;span class=&quot;pl-c1&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;pl-v&quot;&gt;RandomForestClassifier&lt;/span&gt;(&lt;span class=&quot;pl-s1&quot;&gt;n_estimators&lt;/span&gt;&lt;span class=&quot;pl-c1&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;pl-c1&quot;&gt;100&lt;/span&gt;, &lt;span class=&quot;pl-s1&quot;&gt;n_jobs&lt;/span&gt;&lt;span class=&quot;pl-c1&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;pl-c1&quot;&gt;2&lt;/span&gt;, &lt;span class=&quot;pl-s1&quot;&gt;random_state&lt;/span&gt;&lt;span class=&quot;pl-c1&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;pl-s1&quot;&gt;seed&lt;/span&gt;)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_model_python-py-L24&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;24&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_model_python-py-LC24&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-s1&quot;&gt;clf&lt;/span&gt;.&lt;span class=&quot;pl-en&quot;&gt;fit&lt;/span&gt;(&lt;span class=&quot;pl-s1&quot;&gt;x&lt;/span&gt;, &lt;span class=&quot;pl-s1&quot;&gt;labels&lt;/span&gt;.&lt;span class=&quot;pl-s1&quot;&gt;ix&lt;/span&gt;[:,&lt;span class=&quot;pl-c1&quot;&gt;0&lt;/span&gt;])&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_model_python-py-L25&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;25&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_model_python-py-LC25&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_model_python-py-L26&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;26&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_model_python-py-LC26&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-k&quot;&gt;with&lt;/span&gt; &lt;span class=&quot;pl-en&quot;&gt;open&lt;/span&gt;(&lt;span class=&quot;pl-s1&quot;&gt;output&lt;/span&gt;, &lt;span class=&quot;pl-s&quot;&gt;&apos;wb&apos;&lt;/span&gt;) &lt;span class=&quot;pl-k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;pl-s1&quot;&gt;fd&lt;/span&gt;:&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_model_python-py-L27&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;27&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_model_python-py-LC27&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;    &lt;span class=&quot;pl-s1&quot;&gt;pickle&lt;/span&gt;.&lt;span class=&quot;pl-en&quot;&gt;dump&lt;/span&gt;(&lt;span class=&quot;pl-s1&quot;&gt;clf&lt;/span&gt;, &lt;span class=&quot;pl-s1&quot;&gt;fd&lt;/span&gt;)&lt;/td&gt;
      &lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;


  &lt;/div&gt;

  &lt;/div&gt;
&lt;/div&gt;

      &lt;/div&gt;
      &lt;div class=&quot;gist-meta&quot;&gt;
        &lt;a href=&quot;https://gist.github.com/Zoldin/b312897cc492608feef1eaeae7f6eabc/raw/8dad0f69067945b9b84f8d90a8cdbe52694e36f8/train_model_Python.py&quot; style=&quot;float:right&quot;&gt;view raw&lt;/a&gt;
        &lt;a href=&quot;https://gist.github.com/Zoldin/b312897cc492608feef1eaeae7f6eabc#file-train_model_python-py&quot;&gt;train_model_Python.py&lt;/a&gt;
        hosted with ❤ by &lt;a href=&quot;https://github.com&quot;&gt;GitHub&lt;/a&gt;
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;&lt;/body&gt;&lt;/html&gt;&lt;/p&gt;
&lt;p&gt;Here is the code for &lt;code class=&quot;language-text&quot;&gt;evaluation_python_model.py&lt;/code&gt;:&lt;/p&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;div id=&quot;gist73527649&quot; class=&quot;gist&quot;&gt;
    &lt;div class=&quot;gist-file&quot;&gt;
      &lt;div class=&quot;gist-data&quot;&gt;
        &lt;div class=&quot;js-gist-file-update-container js-task-list-container file-box&quot;&gt;
  &lt;div id=&quot;file-evaluation_python_model-py&quot; class=&quot;file&quot;&gt;
    

  &lt;div itemprop=&quot;text&quot; class=&quot;Box-body p-0 blob-wrapper data type-python&quot;&gt;
      
&lt;table class=&quot;highlight tab-size js-file-line-container&quot; data-tab-size=&quot;8&quot;&gt;
      &lt;tbody&gt;&lt;tr&gt;
        &lt;td id=&quot;file-evaluation_python_model-py-L1&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;1&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-evaluation_python_model-py-LC1&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-k&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;pl-s1&quot;&gt;sklearn&lt;/span&gt;.&lt;span class=&quot;pl-s1&quot;&gt;metrics&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;pl-s1&quot;&gt;precision_recall_curve&lt;/span&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-evaluation_python_model-py-L2&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;2&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-evaluation_python_model-py-LC2&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-k&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;pl-s1&quot;&gt;sys&lt;/span&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-evaluation_python_model-py-L3&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;3&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-evaluation_python_model-py-LC3&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-k&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;pl-s1&quot;&gt;sklearn&lt;/span&gt;.&lt;span class=&quot;pl-s1&quot;&gt;metrics&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;pl-s1&quot;&gt;metrics&lt;/span&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-evaluation_python_model-py-L4&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;4&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-evaluation_python_model-py-LC4&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-k&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;pl-s1&quot;&gt;scipy&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;pl-s1&quot;&gt;sparse&lt;/span&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-evaluation_python_model-py-L5&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;5&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-evaluation_python_model-py-LC5&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-k&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;pl-s1&quot;&gt;numpy&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;pl-s1&quot;&gt;loadtxt&lt;/span&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-evaluation_python_model-py-L6&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;6&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-evaluation_python_model-py-LC6&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-k&quot;&gt;try&lt;/span&gt;: &lt;span class=&quot;pl-k&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;pl-s1&quot;&gt;cPickle&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;pl-s1&quot;&gt;pickle&lt;/span&gt;   &lt;span class=&quot;pl-c&quot;&gt;# python2&lt;/span&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-evaluation_python_model-py-L7&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;7&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-evaluation_python_model-py-LC7&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-k&quot;&gt;except&lt;/span&gt; &lt;span class=&quot;pl-v&quot;&gt;ImportError&lt;/span&gt;: &lt;span class=&quot;pl-k&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;pl-s1&quot;&gt;pickle&lt;/span&gt;   &lt;span class=&quot;pl-c&quot;&gt;# python3&lt;/span&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-evaluation_python_model-py-L8&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;8&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-evaluation_python_model-py-LC8&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-k&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;pl-s1&quot;&gt;feather&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;pl-s1&quot;&gt;ft&lt;/span&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-evaluation_python_model-py-L9&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;9&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-evaluation_python_model-py-LC9&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-evaluation_python_model-py-L10&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;10&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-evaluation_python_model-py-LC10&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;pl-en&quot;&gt;len&lt;/span&gt;(&lt;span class=&quot;pl-s1&quot;&gt;sys&lt;/span&gt;.&lt;span class=&quot;pl-s1&quot;&gt;argv&lt;/span&gt;) &lt;span class=&quot;pl-c1&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;pl-c1&quot;&gt;4&lt;/span&gt;:&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-evaluation_python_model-py-L11&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;11&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-evaluation_python_model-py-LC11&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;    &lt;span class=&quot;pl-s1&quot;&gt;sys&lt;/span&gt;.&lt;span class=&quot;pl-s1&quot;&gt;stderr&lt;/span&gt;.&lt;span class=&quot;pl-en&quot;&gt;write&lt;/span&gt;(&lt;span class=&quot;pl-s&quot;&gt;&apos;Arguments error. Usage:&lt;span class=&quot;pl-cce&quot;&gt;\n&lt;/span&gt;&apos;&lt;/span&gt;)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-evaluation_python_model-py-L12&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;12&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-evaluation_python_model-py-LC12&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;    &lt;span class=&quot;pl-s1&quot;&gt;sys&lt;/span&gt;.&lt;span class=&quot;pl-s1&quot;&gt;stderr&lt;/span&gt;.&lt;span class=&quot;pl-en&quot;&gt;write&lt;/span&gt;(&lt;span class=&quot;pl-s&quot;&gt;&apos;&lt;span class=&quot;pl-cce&quot;&gt;\t&lt;/span&gt;python metrics.py MODEL_FILE TEST_MATRIX METRICS_FILE&lt;span class=&quot;pl-cce&quot;&gt;\n&lt;/span&gt;&apos;&lt;/span&gt;)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-evaluation_python_model-py-L13&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;13&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-evaluation_python_model-py-LC13&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;    &lt;span class=&quot;pl-s1&quot;&gt;sys&lt;/span&gt;.&lt;span class=&quot;pl-en&quot;&gt;exit&lt;/span&gt;(&lt;span class=&quot;pl-c1&quot;&gt;1&lt;/span&gt;)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-evaluation_python_model-py-L14&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;14&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-evaluation_python_model-py-LC14&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-evaluation_python_model-py-L15&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;15&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-evaluation_python_model-py-LC15&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-s1&quot;&gt;model_file&lt;/span&gt; &lt;span class=&quot;pl-c1&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;pl-s1&quot;&gt;sys&lt;/span&gt;.&lt;span class=&quot;pl-s1&quot;&gt;argv&lt;/span&gt;[&lt;span class=&quot;pl-c1&quot;&gt;1&lt;/span&gt;]&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-evaluation_python_model-py-L16&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;16&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-evaluation_python_model-py-LC16&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-s1&quot;&gt;test_matrix_file&lt;/span&gt; &lt;span class=&quot;pl-c1&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;pl-s1&quot;&gt;sys&lt;/span&gt;.&lt;span class=&quot;pl-s1&quot;&gt;argv&lt;/span&gt;[&lt;span class=&quot;pl-c1&quot;&gt;2&lt;/span&gt;]&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-evaluation_python_model-py-L17&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;17&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-evaluation_python_model-py-LC17&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-s1&quot;&gt;metrics_file&lt;/span&gt; &lt;span class=&quot;pl-c1&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;pl-s1&quot;&gt;sys&lt;/span&gt;.&lt;span class=&quot;pl-s1&quot;&gt;argv&lt;/span&gt;[&lt;span class=&quot;pl-c1&quot;&gt;3&lt;/span&gt;]&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-evaluation_python_model-py-L18&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;18&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-evaluation_python_model-py-LC18&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-evaluation_python_model-py-L19&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;19&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-evaluation_python_model-py-LC19&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-k&quot;&gt;with&lt;/span&gt; &lt;span class=&quot;pl-en&quot;&gt;open&lt;/span&gt;(&lt;span class=&quot;pl-s1&quot;&gt;model_file&lt;/span&gt;, &lt;span class=&quot;pl-s&quot;&gt;&apos;rb&apos;&lt;/span&gt;) &lt;span class=&quot;pl-k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;pl-s1&quot;&gt;fd&lt;/span&gt;:&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-evaluation_python_model-py-L20&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;20&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-evaluation_python_model-py-LC20&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;    &lt;span class=&quot;pl-s1&quot;&gt;model&lt;/span&gt; &lt;span class=&quot;pl-c1&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;pl-s1&quot;&gt;pickle&lt;/span&gt;.&lt;span class=&quot;pl-en&quot;&gt;load&lt;/span&gt;(&lt;span class=&quot;pl-s1&quot;&gt;fd&lt;/span&gt;)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-evaluation_python_model-py-L21&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;21&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-evaluation_python_model-py-LC21&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-evaluation_python_model-py-L22&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;22&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-evaluation_python_model-py-LC22&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-s1&quot;&gt;df&lt;/span&gt; &lt;span class=&quot;pl-c1&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;pl-s1&quot;&gt;ft&lt;/span&gt;.&lt;span class=&quot;pl-en&quot;&gt;read_dataframe&lt;/span&gt;(&lt;span class=&quot;pl-s1&quot;&gt;test_matrix_file&lt;/span&gt;)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-evaluation_python_model-py-L23&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;23&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-evaluation_python_model-py-LC23&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-s1&quot;&gt;labels&lt;/span&gt; &lt;span class=&quot;pl-c1&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;pl-s1&quot;&gt;df&lt;/span&gt;.&lt;span class=&quot;pl-s1&quot;&gt;loc&lt;/span&gt;[:,&lt;span class=&quot;pl-s&quot;&gt;&apos;label&apos;&lt;/span&gt;]&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-evaluation_python_model-py-L24&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;24&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-evaluation_python_model-py-LC24&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-s1&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;pl-c1&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;pl-s1&quot;&gt;df&lt;/span&gt;.&lt;span class=&quot;pl-s1&quot;&gt;loc&lt;/span&gt;[:, &lt;span class=&quot;pl-s1&quot;&gt;df&lt;/span&gt;.&lt;span class=&quot;pl-s1&quot;&gt;columns&lt;/span&gt; &lt;span class=&quot;pl-c1&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;pl-s&quot;&gt;&apos;label&apos;&lt;/span&gt;]&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-evaluation_python_model-py-L25&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;25&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-evaluation_python_model-py-LC25&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-s1&quot;&gt;predictions_by_class&lt;/span&gt; &lt;span class=&quot;pl-c1&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;pl-s1&quot;&gt;model&lt;/span&gt;.&lt;span class=&quot;pl-en&quot;&gt;predict_proba&lt;/span&gt;(&lt;span class=&quot;pl-s1&quot;&gt;x&lt;/span&gt;)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-evaluation_python_model-py-L26&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;26&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-evaluation_python_model-py-LC26&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-s1&quot;&gt;predictions&lt;/span&gt; &lt;span class=&quot;pl-c1&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;pl-s1&quot;&gt;predictions_by_class&lt;/span&gt;[:,&lt;span class=&quot;pl-c1&quot;&gt;1&lt;/span&gt;]&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-evaluation_python_model-py-L27&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;27&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-evaluation_python_model-py-LC27&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-evaluation_python_model-py-L28&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;28&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-evaluation_python_model-py-LC28&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-s1&quot;&gt;precision&lt;/span&gt;, &lt;span class=&quot;pl-s1&quot;&gt;recall&lt;/span&gt;, &lt;span class=&quot;pl-s1&quot;&gt;thresholds&lt;/span&gt; &lt;span class=&quot;pl-c1&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;pl-en&quot;&gt;precision_recall_curve&lt;/span&gt;(&lt;span class=&quot;pl-s1&quot;&gt;labels&lt;/span&gt;, &lt;span class=&quot;pl-s1&quot;&gt;predictions&lt;/span&gt;)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-evaluation_python_model-py-L29&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;29&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-evaluation_python_model-py-LC29&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-evaluation_python_model-py-L30&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;30&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-evaluation_python_model-py-LC30&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-s1&quot;&gt;auc&lt;/span&gt; &lt;span class=&quot;pl-c1&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;pl-s1&quot;&gt;metrics&lt;/span&gt;.&lt;span class=&quot;pl-en&quot;&gt;auc&lt;/span&gt;(&lt;span class=&quot;pl-s1&quot;&gt;recall&lt;/span&gt;, &lt;span class=&quot;pl-s1&quot;&gt;precision&lt;/span&gt;)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-evaluation_python_model-py-L31&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;31&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-evaluation_python_model-py-LC31&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-c&quot;&gt;#print(&apos;AUC={}&apos;.format(metrics.auc(recall, precision)))&lt;/span&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-evaluation_python_model-py-L32&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;32&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-evaluation_python_model-py-LC32&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-k&quot;&gt;with&lt;/span&gt; &lt;span class=&quot;pl-en&quot;&gt;open&lt;/span&gt;(&lt;span class=&quot;pl-s1&quot;&gt;metrics_file&lt;/span&gt;, &lt;span class=&quot;pl-s&quot;&gt;&apos;w&apos;&lt;/span&gt;) &lt;span class=&quot;pl-k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;pl-s1&quot;&gt;fd&lt;/span&gt;:&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-evaluation_python_model-py-L33&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;33&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-evaluation_python_model-py-LC33&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;    &lt;span class=&quot;pl-s1&quot;&gt;fd&lt;/span&gt;.&lt;span class=&quot;pl-en&quot;&gt;write&lt;/span&gt;(&lt;span class=&quot;pl-s&quot;&gt;&apos;AUC: {:4f}&lt;span class=&quot;pl-cce&quot;&gt;\n&lt;/span&gt;&apos;&lt;/span&gt;.&lt;span class=&quot;pl-en&quot;&gt;format&lt;/span&gt;(&lt;span class=&quot;pl-s1&quot;&gt;auc&lt;/span&gt;))&lt;/td&gt;
      &lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;


  &lt;/div&gt;

  &lt;/div&gt;
&lt;/div&gt;

      &lt;/div&gt;
      &lt;div class=&quot;gist-meta&quot;&gt;
        &lt;a href=&quot;https://gist.github.com/Zoldin/9eef13632d0a9039fe9b0dba376516a4/raw/8b8837f0d5640e0c208ea1c4910d655d933b9bd0/evaluation_python_model.py&quot; style=&quot;float:right&quot;&gt;view raw&lt;/a&gt;
        &lt;a href=&quot;https://gist.github.com/Zoldin/9eef13632d0a9039fe9b0dba376516a4#file-evaluation_python_model-py&quot;&gt;evaluation_python_model.py&lt;/a&gt;
        hosted with ❤ by &lt;a href=&quot;https://github.com&quot;&gt;GitHub&lt;/a&gt;
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;&lt;/body&gt;&lt;/html&gt;&lt;/p&gt;
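&lt;p&gt;The evaluation script calls metrics.auc(recall, precision), which in scikit-learn integrates the curve with the trapezoidal rule. The following is a minimal pure-Python sketch of that rule, not the library code: the function name trapezoid_auc and the three curve points are made up for illustration, and the sketch assumes the x values are monotonic.&lt;/p&gt;

```python
def trapezoid_auc(xs, ys):
    """Area under the piecewise-linear curve through the points (xs[i], ys[i]).

    Assumes xs is monotonic (all increasing or all decreasing), as recall is
    in a precision-recall curve.
    """
    if len(xs) != len(ys) or len(xs) in (0, 1):
        raise ValueError("need two or more points, with xs and ys of equal length")
    area = 0.0
    for i in range(1, len(xs)):
        width = xs[i] - xs[i - 1]
        mean_height = (ys[i] + ys[i - 1]) / 2.0
        area += width * mean_height
    # Decreasing xs give negative widths, so take the absolute value.
    return abs(area)


# Made-up curve: recall falls from 1.0 to 0.0 while precision rises.
recall = [1.0, 0.5, 0.0]
precision = [0.5, 1.0, 1.0]
print("AUC: {:.4f}".format(trapezoid_auc(recall, precision)))  # AUC: 0.8750
```

&lt;p&gt;Passing recall as the x axis and precision as the y axis mirrors the argument order used in the script.&lt;/p&gt;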
&lt;p&gt;Let’s download the R and Python code shown above by cloning the
&lt;a href=&quot;https://github.com/Zoldin/R_AND_DVC&quot;&gt;GitHub&lt;/a&gt; repository:&lt;/p&gt;
&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;dvc&quot;&gt;&lt;pre class=&quot;language-dvc&quot;&gt;&lt;code class=&quot;language-dvc&quot;&gt;&lt;span class=&quot;token line&quot;&gt;&lt;span class=&quot;token input&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;token command&quot;&gt;mkdir&lt;/span&gt; R_DVC_GITHUB_CODE
&lt;/span&gt;&lt;span class=&quot;token line&quot;&gt;&lt;span class=&quot;token input&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;token command&quot;&gt;cd&lt;/span&gt; R_DVC_GITHUB_CODE
&lt;/span&gt;
&lt;span class=&quot;token line&quot;&gt;&lt;span class=&quot;token input&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;token git&quot;&gt;git clone&lt;/span&gt; https://github.com/Zoldin/R_AND_DVC&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/body&gt;&lt;/html&gt;
&lt;p&gt;The dependency graph of this data science project looks like this:&lt;/p&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;span class=&quot;gatsby-resp-image-wrapper&quot; style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto;  max-width: 250.5px;&quot;&gt;
      &lt;a class=&quot;gatsby-resp-image-link&quot; href=&quot;/static/fbd7192868b16c9a421107083e2dd45b/f55b8/our-dependency-graph.png&quot; style=&quot;display: block&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;
    &lt;span class=&quot;gatsby-resp-image-background-image&quot; style=&quot;padding-bottom: 199.20159680638722%; position: relative; bottom: 0; left: 0; background-image: url(&amp;#x27;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAoCAYAAAD+MdrbAAAACXBIWXMAAAsSAAALEgHS3X78AAAIoklEQVRIx4VWC1BTZxb+gygBIuEVEgh5kgd5kZvcPCDEEAIFAmJEoAK2dSoCCQ+LqAUF5SEPFRBBIAUlylbXF9aKb1EEJIC0VWtbca2P1VnbnZ3W2c7YnZ3dWe+GewVLDfbPnPn+nP+c757z/+f89wLwm1HYOYZifGsLiiHN77pzOqIJar2Sp9YauRq1XhC2ZoWP8MAKgqqmzMXeiwDN7s2orbl1GDgdM6ToKAA4/73ePoStgCir45FYZf7+lEaCr2+Ht+eMyXun2oEue9/bySwddnyh1U4psd4klbZ973f0GUIwZFmgzC224L4HiOfH1im/wk47ucg67jNtX/P5Q2BpG3mTsOgVYVHHWMQ663hokdUeUvjJKOeDLTZFmrlBn1HcpskobtWaWy6HFndPMvP22qNyW4bQaM17ht4kLGi3Y5F2jGnMbdcZea0jQfmd40HLs6tV6TnVmpQ1VdqU3AY4u/EyPb99jJzTel2bu2eIMO2T13LtTUKHM7bBHWOEgs5xRkHnGN3SPsKoHERIWVu6hea2Qcbm/p/JBdYxOrpuHafErWlHfXJ2X3O+jzOkvx9uAPg5wNXZ2uq6887JohsxsviWPhzwQEBiR7V3wpFaogI20HTRyUJd9DLWkvJt+OSebeTEljOu+p13cLk91ajPssbrr4kSEhJQ/Nohhk3WOQ9J7FKS9MfciDG7ggMM7X5+CQcFpKKTIveZ9YyNtSDNqEfnJpPptWN8fDwACILOqTVfenJrxySc+ls8RvNTNtf2LyZrVb2BVXISYvX8m8Gq/5bDqx0T8+pvMGb8TWtLXhE5SIzvm9G5Jt2yYCAOuEoqL0H8XTf58py9ZHXaBpViRYlMnrYJUpnWRahNhUskWdUCfscPhLDtw4b0LR0+9Urggkb70YcYKc+0DEUSlebLdAe+gs4nwcwau4JVep7Pz7epuAUHFdz8Awq+eb9akGvVsEtOhgVXjnIFe6ZgsTaGRAnwCZj2F6rluNk0l/Dk3nK2iAjReb7hnHB3Yvm4J7X1rpfnMcTd5/D/PKZlcR/iju9H8JTOhwRSw73FkugiH2lgKBHmKPwi2fEeKJHRdsbN2H40RJf0LkeTlB6iTc7k6FNX09O6jtM3Hz2w0Fk17NgX6GLab6bpTDp6eGIsV5P0Toh2WRwtsbuYiZ3y/j6i7sxVf8PhE/5Rtk9JhlODAfE9ZxajizZHBp87ZL1DSh1y7FVGXSUETd8aUvTx9/wTDq0iRZ9aRUq05RMBOJ03++TlALhkm0yLEQ7AIRDAfV8qc31cInSdsIgXTq4PX/RtiWLhZL544YONIldkOXYQf05NxxfpzYvmpJDQfRzorL0A2ncBIDYB4NeP+HIaxgI5dXYKu9ZOYd1FFlOMZoi6ajuXdQUhTus59aMUbsNIEK9hyHuaI9KajzP2WF6TchomUWTWTXiL68bUwppRJr9+ki0uPSWActoixFk1eunqHTppTlt4aMUAV1A3wRbW2lmsqlEDuWJkwRsbzam8giJj1zd+4oYJJX/7hF/Ijjtk4cY+GpSxVSlbuVXhEGVYZrWMXXY5iFX7BTmkZtyfse16DG3rMNrjyHTFTFDHQbpsJU6mNQGVXOWSYDTixaUn2KLNp0WiLf0CYcW5UH7TLZbo46NiqPYqX9Z6hy3fPhAqrzovlFddEMN119Bu0ehMuGR45dwoSV7+3jQKleysVP4LHAflRC/
xcwuco5gKmlqQCWUxKBGsQJaKQw5RC6jBagEzFV5LQ4QIDmC5zPE5AT1bIFNoqUFaMTlQwqazlEJGqFrCyJCt8QbPKc8XDXKuKQ6GHeL2Sj4N6YEOcvdLDwqGhCMQQkfQvekLPwmSkhLBdTV2Rb0IfuHdH3pW2h72Cdcm6WV3Sru5B6Be8W3+N3wgyzqCBUFD8Iga8RiX3iAiXMQNVa6KBgiMRTc49RjFHxk/YvY+CA7xQxbeXnrby5HJIoSFOOkqL+ADFgH8b1VdYGJ2/mXwVyj+Svt1VhdVqp+1D6fGAHBW9pfIU/I7okH4mbCC3xHbLDoWeQa+JzkN3ZVelD2STBt2466gJSEkC8HLwJeo8zUD9g5Zl7fOozmr2XFgCHgJOdYOw6OWj1RVS6oUHfrNqpbYMrg5dqOiwViuaItzPCgVTc/xi1+eip2qDLudLRbLDHo4BDcb8lXlE+Ul1X3BVdUj4WnlLYlNdlFhk1+Cu+FzEb2yQTkgA2IogHAfIgUgoIKCOcY63oz5ZueEfzQYdDYnhMchzPzHb8OuvaK8dc4JBxSPXC6q7uEuKO/h2pM+Q28QQX0Yib6HTYuHjSSdPJa/RJpI82+keJMbA1ni3TLU2ViWOH+EYdIwFAN3BWHYRMUHNlI5/rsDuCGNHFHwTrowoInMozQGole92253kLshb4bQ/a0pO8jm/M8ozvRI2ZAyWxqeeUTwCP4JFGYUv44wfx5CShPWlrhuAJQVGtQoJcfo+v7aLM+UwmR8as5yr/g0g0dUSqR79Aqte0yK3j0rM4tQ01r59oOYgv+D4k8CRzeIEBdH56D3nZyqYqcyPiBO1yQS5tDDiMtMWc07LkX+FSunyB/wA6qHuj+pBzUH1APakxFfKavDrEu74LPaI+H28C712cjT6juGYeUzNtqayqfO0x4Mf4rijYhfvC7Iv4svgxp05dCu6HJpk65SvidqB9Sl2ipt0a2XV+r3yc4bJ+GfxdP2w/A8hEeS7qHYZBrEXYx5Sh2J+ydtOO558JDhefBg+jP2wNK/MYdinwfb415Qr7zzd2Z/zH20Rs/FPpo/7ZvCJ84XmGC6bFx+r74v+mV+sn7DAwAOrcc+Iv+B4GBjqr+RDFFN4igqzIZoOn4ELTlIQY4RRZGE1c3YV1h7Mrikf/yWdmvZib24vvjMhbqtXMdcvymJbSlSyrPzObxss4RZUqqjl1ckBTXVop1A2VmNG76Lpfx/QzntZYIry2kAAAAASUVORK5CYII=&amp;#x27;); background-size: cover; display: block;&quot;&gt;&lt;/span&gt;
  &lt;picture&gt;
        &lt;source srcset=&quot;/static/fbd7192868b16c9a421107083e2dd45b/c54d4/our-dependency-graph.webp 175w, /static/fbd7192868b16c9a421107083e2dd45b/a3432/our-dependency-graph.webp 350w, /static/fbd7192868b16c9a421107083e2dd45b/52d01/our-dependency-graph.webp 501w&quot; sizes=&quot;(max-width: 501px) 100vw, 501px&quot; type=&quot;image/webp&quot;&gt;
        &lt;source srcset=&quot;/static/fbd7192868b16c9a421107083e2dd45b/17006/our-dependency-graph.png 175w, /static/fbd7192868b16c9a421107083e2dd45b/d6f3f/our-dependency-graph.png 350w, /static/fbd7192868b16c9a421107083e2dd45b/f55b8/our-dependency-graph.png 501w&quot; sizes=&quot;(max-width: 501px) 100vw, 501px&quot; type=&quot;image/png&quot;&gt;
        &lt;img class=&quot;gatsby-resp-image-image&quot; src=&quot;/static/fbd7192868b16c9a421107083e2dd45b/f55b8/our-dependency-graph.png&quot; alt=&quot;R (marked red) and Python (marked pink) jobs in one project&quot; title=&quot;R (marked red) and Python (marked pink) jobs in one project&quot; loading=&quot;lazy&quot; style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;&gt;
      &lt;/picture&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/body&gt;&lt;/html&gt;&lt;em&gt;R
(marked red) and Python (marked pink) jobs in one project&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Now let’s see how to speed up and simplify the process flow with the
Feather API and the reproducibility features of data version control.&lt;/p&gt;
&lt;h2&gt;Feather API&lt;/h2&gt;
&lt;p&gt;The Feather API is designed to improve metadata and data interchange between R and
Python. It provides fast import and export of data frames between the two environments
and preserves metadata, which is an improvement over exchanging data via
csv/txt files. In our example, a Python job reads a binary input file
that was produced in R with the Feather API.&lt;/p&gt;
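&lt;p&gt;To see why preserved metadata matters, here is a minimal Python sketch of the problem with plain csv exchange. It uses only the standard library and does not call Feather itself; the rows are made-up illustration data.&lt;/p&gt;

```python
import csv
import io

# Hypothetical rows standing in for a small data frame: (id, score, label).
rows = [(1, 0.25, True), (2, 0.75, False)]

# Write the rows to an in-memory CSV "file", then read them back.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
buffer.seek(0)
restored = [tuple(record) for record in csv.reader(buffer)]

# Every value comes back as a string; the int, float, and bool types are gone.
original_types = [type(value).__name__ for value in rows[0]]
restored_types = [type(value).__name__ for value in restored[0]]
print(original_types)  # ['int', 'float', 'bool']
print(restored_types)  # ['str', 'str', 'str']
```

&lt;p&gt;With csv, the reading side has to guess or re-declare every column type, whereas a format that stores column types alongside the data avoids that step.&lt;/p&gt;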
&lt;p&gt;Let’s install the Feather library in both environments.&lt;/p&gt;
&lt;p&gt;For Python 3 on Linux, you can install it from the command line with pip3:&lt;/p&gt;
&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;dvc&quot;&gt;&lt;pre class=&quot;language-dvc&quot;&gt;&lt;code class=&quot;language-dvc&quot;&gt;&lt;span class=&quot;token line&quot;&gt;&lt;span class=&quot;token input&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;token command&quot;&gt;sudo&lt;/span&gt; pip3 &lt;span class=&quot;token function&quot;&gt;install&lt;/span&gt; feather-format&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/body&gt;&lt;/html&gt;
&lt;p&gt;For R, install the feather package (note that the package name must be quoted):&lt;/p&gt;
&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;r&quot;&gt;&lt;pre class=&quot;language-r&quot;&gt;&lt;code class=&quot;language-r&quot;&gt;install.packages&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;feather&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/body&gt;&lt;/html&gt;
&lt;p&gt;After successful installation we can use Feather for data exchange.&lt;/p&gt;
&lt;p&gt;Below is the R code for exporting data frames with Feather (featurization.R):&lt;/p&gt;
&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;r&quot;&gt;&lt;pre class=&quot;language-r&quot;&gt;&lt;code class=&quot;language-r&quot;&gt;library&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;feather&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

write_feather&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;dtm_train_tfidf&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;args&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
write_feather&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;dtm_test_tfidf&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;args&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
print&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;Two data frame were created with Feather - one for train and one for test data set&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/body&gt;&lt;/html&gt;
&lt;p&gt;Python code for reading the Feather binary input files (train_model_python.py):&lt;/p&gt;
&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;python&quot;&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; sys

&lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; feather &lt;span class=&quot;token keyword&quot;&gt;as&lt;/span&gt; ft

&lt;span class=&quot;token comment&quot;&gt;# Path to the Feather file written by featurization.R&lt;/span&gt;
input_file &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; sys&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;argv&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;
df &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; ft&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;read_dataframe&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;input_file&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/body&gt;&lt;/html&gt;
&lt;h2&gt;Dependency graph with R and Python combined&lt;/h2&gt;
&lt;p&gt;The next question is why we need DVC at all: why not just use shell scripting?
DVC automatically derives the dependencies between the steps and builds
&lt;a href=&quot;https://en.wikipedia.org/wiki/Directed_acyclic_graph&quot;&gt;the dependency graph (DAG)&lt;/a&gt;
transparently to the user. The graph is used to reproduce only the parts of your
pipeline that were affected by recent changes, so we don’t have to work out
which steps need to be repeated after each change.&lt;/p&gt;
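&lt;p&gt;The idea of rerunning only the affected steps can be illustrated with a toy Python sketch. It is not how DVC is implemented: the stage names, the STAGES table, and the stages_to_rerun function are hypothetical, and treating each script as an input of its own stage is an assumption. The file names follow the pipeline commands used in this post.&lt;/p&gt;

```python
from collections import deque

# Toy model of the pipeline: each stage maps to (input files, output files).
# Stages are listed in pipeline order, so dict order doubles as run order.
STAGES = {
    "extract": (["data/Posts.xml.tgz"], ["data/Posts.xml"]),
    "parse": (["data/Posts.xml", "code/parsingxml.R"], ["data/Posts.csv"]),
    "split": (["data/Posts.csv", "code/train_test_spliting.R"],
              ["data/train_post.csv", "data/test_post.csv"]),
    "featurize": (["data/train_post.csv", "data/test_post.csv",
                   "code/featurization.R"],
                  ["data/matrix_train.feather", "data/matrix_test.feather"]),
    "train": (["data/matrix_train.feather", "code/train_model_python.py"],
              ["data/model.p"]),
    "evaluate": (["data/model.p", "data/matrix_test.feather",
                  "code/evaluate_python_mdl.py"],
                 ["data/evaluation_python.txt"]),
}


def stages_to_rerun(changed_file):
    """Return the stages affected by a changed file, in pipeline order."""
    affected = set()
    pending = deque([changed_file])
    while pending:
        path = pending.popleft()
        for name, (inputs, outputs) in STAGES.items():
            if path in inputs and name not in affected:
                affected.add(name)
                pending.extend(outputs)  # outputs of a stale stage are stale too
    return [name for name in STAGES if name in affected]


print(stages_to_rerun("code/train_model_python.py"))  # ['train', 'evaluate']
print(stages_to_rerun("data/Posts.csv"))  # ['split', 'featurize', 'train', 'evaluate']
```

&lt;p&gt;Changing only the training script marks just the train and evaluate stages as stale, so the earlier R stages would not need to run again.&lt;/p&gt;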
&lt;p&gt;First, we use the &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc run&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; command to execute all the jobs related to our
model development. In this phase DVC records the dependencies that will be used in
the reproducibility phase:&lt;/p&gt;
&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;dvc&quot;&gt;&lt;pre class=&quot;language-dvc&quot;&gt;&lt;code class=&quot;language-dvc&quot;&gt;&lt;span class=&quot;token line&quot;&gt;&lt;span class=&quot;token input&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;token dvc&quot;&gt;dvc import&lt;/span&gt; https://s3-us-west-2.amazonaws.com/dvc-share/so/25K/Posts.xml.tgz &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;
            data/
&lt;/span&gt;
&lt;span class=&quot;token line&quot;&gt;&lt;span class=&quot;token input&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;token dvc&quot;&gt;dvc run&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;tar&lt;/span&gt; zxf data/Posts.xml.tgz -C data/
&lt;/span&gt;
&lt;span class=&quot;token line&quot;&gt;&lt;span class=&quot;token input&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;token dvc&quot;&gt;dvc run&lt;/span&gt; Rscript code/parsingxml.R &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;
                  data/Posts.xml data/Posts.csv
&lt;/span&gt;
&lt;span class=&quot;token line&quot;&gt;&lt;span class=&quot;token input&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;token dvc&quot;&gt;dvc run&lt;/span&gt; Rscript code/train_test_spliting.R &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;
                  data/Posts.csv &lt;span class=&quot;token number&quot;&gt;0.33&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;20170426&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;
                  data/train_post.csv data/test_post.csv
&lt;/span&gt;
&lt;span class=&quot;token line&quot;&gt;&lt;span class=&quot;token input&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;token dvc&quot;&gt;dvc run&lt;/span&gt; Rscript code/featurization.R &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;
                  data/train_post.csv &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;
                  data/test_post.csv data/matrix_train.feather &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;
                  data/matrix_test.feather
&lt;/span&gt;
&lt;span class=&quot;token line&quot;&gt;&lt;span class=&quot;token input&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;token dvc&quot;&gt;dvc run&lt;/span&gt; python3 code/train_model_python.py &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;
                  data/matrix_train.feather &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;
                  &lt;span class=&quot;token number&quot;&gt;20170426&lt;/span&gt; data/model.p
&lt;/span&gt;
&lt;span class=&quot;token line&quot;&gt;&lt;span class=&quot;token input&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;token dvc&quot;&gt;dvc run&lt;/span&gt; python3 code/evaluate_python_mdl.py &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;
                  data/model.p data/matrix_test.feather &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;
                  data/evaluation_python.txt&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/body&gt;&lt;/html&gt;
&lt;p&gt;After these commands, the jobs have been executed and added to the DAG. The result (the AUC
metric) is written to the evaluation_python.txt file:&lt;/p&gt;
&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;dvc&quot;&gt;&lt;pre class=&quot;language-dvc&quot;&gt;&lt;code class=&quot;language-dvc&quot;&gt;&lt;span class=&quot;token line&quot;&gt;&lt;span class=&quot;token input&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;token command&quot;&gt;cat&lt;/span&gt; data/evaluation_python.txt
&lt;/span&gt;AUC: 0.741432&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/body&gt;&lt;/html&gt;
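&lt;p&gt;For readers less familiar with the metric: AUC can be read as the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one. The sketch below computes it by direct pair counting in plain Python. The function name pairwise_auc and the toy labels and scores are illustrative assumptions and are not taken from the project code.&lt;/p&gt;

```python
def pairwise_auc(labels, scores):
    """Return AUC as the share of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a pair.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")

    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Toy data (made up): four labelled examples with model scores.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = pairwise_auc(toy_labels, toy_scores)
print(f"AUC: {toy_auc:.2f}")
```

&lt;p&gt;Three of the four positive/negative pairs are ordered correctly here, so the toy AUC is 0.75. Pair counting is quadratic in the number of examples, so for real data sets a library routine is the more practical choice.&lt;/p&gt;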
&lt;p&gt;It is possible to improve our result with the random forest algorithm.&lt;/p&gt;
&lt;p&gt;We can increase the number of trees in the random forest classifier from 100 to
500:&lt;/p&gt;
&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;python&quot;&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;clf &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; RandomForestClassifier&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;n_estimators&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;500&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
                             n_jobs&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
                             random_state&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;seed&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
clf&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;fit&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;x&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; labels&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/body&gt;&lt;/html&gt;
&lt;p&gt;After the changes (in &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;train_model_python.py&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;) are committed, the &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc repro&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; command
re-executes all jobs necessary to reproduce &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;evaluation_python.txt&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;. We
don’t need to worry about which jobs to run or in which order.&lt;/p&gt;
&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;dvc&quot;&gt;&lt;pre class=&quot;language-dvc&quot;&gt;&lt;code class=&quot;language-dvc&quot;&gt;&lt;span class=&quot;token line&quot;&gt;&lt;span class=&quot;token input&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;token git&quot;&gt;git add&lt;/span&gt; &lt;span class=&quot;token builtin class-name&quot;&gt;.&lt;/span&gt;
&lt;/span&gt;&lt;span class=&quot;token line&quot;&gt;&lt;span class=&quot;token input&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;token git&quot;&gt;git commit&lt;/span&gt;
&lt;/span&gt;[master a65f346] Random forest classifier — more trees added
    1 file changed, 1 insertion(+), 1 deletion(-)

&lt;span class=&quot;token line&quot;&gt;&lt;span class=&quot;token input&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;token dvc&quot;&gt;dvc repro&lt;/span&gt; data/evaluation_python.txt
&lt;/span&gt;
Reproducing run command for data item data/model.p. Args: python3 code/train_model_python.py data/matrix_train.txt 20170426 data/model.p
Reproducing run command for data item data/evaluation_python.txt. Args: python3 code/evaluate_python_mdl.py data/model.p data/matrix_test.txt data/evaluation_python.txt
Data item “data/evaluation_python.txt” was reproduced.&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/body&gt;&lt;/html&gt;
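&lt;p&gt;The idea behind &quot;which jobs to run and in which order&quot; can be sketched in a few lines of Python. This is a simplified illustration, not DVC’s actual implementation: the stage names (split, featurize, train, evaluate) and the STAGES table are hypothetical and simply mirror the four commands shown earlier, with each script treated as an input of its own stage.&lt;/p&gt;

```python
# Hypothetical stage table: stage name -> (input files, output files).
STAGES = {
    "split": (
        ["code/train_test_spliting.R", "data/Posts.csv"],
        ["data/train_post.csv", "data/test_post.csv"],
    ),
    "featurize": (
        ["code/featurization.R", "data/train_post.csv", "data/test_post.csv"],
        ["data/matrix_train.feather", "data/matrix_test.feather"],
    ),
    "train": (
        ["code/train_model_python.py", "data/matrix_train.feather"],
        ["data/model.p"],
    ),
    "evaluate": (
        ["code/evaluate_python_mdl.py", "data/model.p",
         "data/matrix_test.feather"],
        ["data/evaluation_python.txt"],
    ),
}


def stages_to_rerun(stages, changed_files):
    """Return the stages affected by changed_files, upstream stages first."""
    dirty_files = set(changed_files)
    affected = set()
    # Phase 1: propagate changes downstream until nothing new is marked.
    grew = True
    while grew:
        grew = False
        for name, (inputs, outputs) in stages.items():
            if name not in affected and dirty_files.intersection(inputs):
                affected.add(name)
                dirty_files.update(outputs)
                grew = True
    # Phase 2: schedule a stage only when no unscheduled affected stage
    # still has to produce one of its inputs.
    ordered = []
    remaining = set(affected)
    while remaining:
        pending = {out for name in remaining for out in stages[name][1]}
        ready = sorted(
            name for name in remaining
            if not pending.intersection(stages[name][0])
        )
        if not ready:
            raise ValueError("cycle detected among stages")
        ordered.extend(ready)
        remaining.difference_update(ready)
    return ordered


rerun = stages_to_rerun(STAGES, ["code/train_model_python.py"])
print(rerun)
```

&lt;p&gt;With only the training script changed, the sketch selects the train and evaluate stages, in that order, which is the same pair of jobs that appears in the reproduction log above. The split and featurize stages are left alone because none of their inputs changed.&lt;/p&gt;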
&lt;p&gt;Besides code versioning, DVC also takes care of data versioning. For example, if we
change the data sets &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;train_post.csv&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; and &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;test_post.csv&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; (by using a different splitting
ratio), DVC will know that the data sets have changed, and &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc repro&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; will re-execute
all jobs necessary for evaluation_python.txt.&lt;/p&gt;
&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;dvc&quot;&gt;&lt;pre class=&quot;language-dvc&quot;&gt;&lt;code class=&quot;language-dvc&quot;&gt;&lt;span class=&quot;token line&quot;&gt;&lt;span class=&quot;token input&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;token dvc&quot;&gt;dvc run&lt;/span&gt; Rscript code/train_test_spliting.R &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;
                  data/Posts.csv &lt;span class=&quot;token number&quot;&gt;0.15&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;20170426&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;
                  data/train_post.csv &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;
                  data/test_post.csv&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/body&gt;&lt;/html&gt;
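&lt;p&gt;How can a tool notice that a data file changed after a re-split? One general approach is to compare a checksum of the file’s content against a previously recorded value. The sketch below illustrates that idea only. It is not a description of DVC’s internals, and the helper names file_checksum and changed_files, the choice of SHA-256, and the throwaway demo file are all assumptions made for the example.&lt;/p&gt;

```python
import hashlib
import os
import tempfile


def file_checksum(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in bounded chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)
    return digest.hexdigest()


def changed_files(paths, recorded):
    """Return the paths whose current checksum differs from the recorded one."""
    return [p for p in paths if recorded.get(p) != file_checksum(p)]


# Demo on a throwaway file standing in for data/train_post.csv.
with tempfile.TemporaryDirectory() as workdir:
    train_path = os.path.join(workdir, "train_post.csv")
    with open(train_path, "w") as handle:
        handle.write("id,text\n1,first split\n")
    recorded = {train_path: file_checksum(train_path)}
    unchanged = changed_files([train_path], recorded)

    # Simulate a new split by rewriting the file with different content.
    with open(train_path, "w") as handle:
        handle.write("id,text\n1,second split\n")
    changed = changed_files([train_path], recorded)

print(len(unchanged), len(changed))
```

&lt;p&gt;Comparing content rather than modification times means that rewriting a file with identical content is not treated as a change, so downstream jobs are not re-run needlessly.&lt;/p&gt;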
&lt;p&gt;Re-executed jobs are marked in red:&lt;/p&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;span class=&quot;gatsby-resp-image-wrapper&quot; style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto;  max-width: 250.5px;&quot;&gt;
      &lt;a class=&quot;gatsby-resp-image-link&quot; href=&quot;/static/10053d985ed8b13cfb9b560ee5d2cc37/f55b8/re-executed-jobs.png&quot; style=&quot;display: block&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;
    &lt;span class=&quot;gatsby-resp-image-background-image&quot; style=&quot;padding-bottom: 199.4011976047904%; position: relative; bottom: 0; left: 0; background-image: url(&amp;#x27;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAoCAYAAAD+MdrbAAAACXBIWXMAAAsSAAALEgHS3X78AAAIcklEQVRIx4VWC1ST5xn+QkQTQiCX/89/SfLnnj/kBoSEawgJFwuIgSoWBXRFKwID6rzNa6LobG0rVkFRQFartuK9R+tWnVOHyFad7qw9azs70bVunR5PndOeVYVvyR+iUmL3nvOd9/0uz/O93/u97/f/ADwjTR0DjC7a/DajNRvLufqtubFpboc+3Vmky0zPSbDMniI09kyNTVvjizrSAwHd0sysLfP/CkSUpm0XnnbmAxbSJhTGrgLx1hYtSi0RIvhbfJFgKz8mvGTVkVPgudI4QlbffoHT2NGPL9h+GV265UvxezdgrHtGXXLlinek+7+AvMUdn4mbAvON2wYEwfXr9v0Z1G7+3VjC5u0DYZ3x6vYBurnjgqZ5R7/2J8t32qfVveae8bO2zOnzNzvr3z5FL+i6qGzYeiG3aMWHjLf5Sz4YS1jffj7saeYrm84p5rSelc7dcp701qxOnTpnTWZpjT+rdO5rtpc3fETN3dyH1W45n1O84kNeEONZdHQsYW1bP6Nf2XKeP6/9gqK2vT8APKdYfnIILV+8zThn02nFosO3sNr289S8tn7F3LZ+PIx9cc2vI8cxTPpD4QAgDih2pLmY/E2RybJeD5Hltx5gAQDBxHafoGDv2nh7Sq48O9drdHm8KudKP7e4ayXm3nCM7Vx/Oaqhs4LBuPy/eUpUPKmY0YPBWCzePmoTb3c6OvEwJz5vg1wysQMVT95tQpYfS+CE58sWvwnmVxQwdnV19VNgSUlJwCnI2BL/x7Fyf5+VWH2JFq+/piG23lWRFevy8OaDSUj7f5SY/4pO5uuzKFsGqDB+9kJ/yCj8LwSTaxpCbk//KfvsFE60yXcyWbX+kp6e1YrZyuanJpbOTzaXLUi2eZvTU0oaXPRLfoPo9cFYQ8u53Epfh2BPlTkqiK/xNYVItd4yRmOUUqTmAaGx4+9y/S8GHIYVH9Hm5l1pxld3241NuxzmpnfSrA2dWYYlH1h16wb0xra/2hKduSglw9AgPiUtg/XkmC59ksCmtQiSFLQ4U5fKxfx/4Kna/hInPgS5+L6hGHw/jEEDtuAE5Cg6B2Nla//ET3HXCRMJWmDTpyE5lIfLEE365fEJxe29GtfkGdrM4nJ1VskMTc5Ls6kpnfsVS97vGRcpG37ebWZ7O+up7FKXIq3Qo80sKVC7XiyUFXY1K0OvS/eheM/xs4hnb6/Y1fMuMungSXR6Ry8/OPdvdXkggXggeA4IJgDIDWHK2utj83urUe+BCnHJ/irEc2gmWtLdGAeEh54mZVEgcStyXHHgRCsL3L0aBU7/NhqcOhkNeg9Eg/d7x4PDR6PBgUPR4PSpaHANMhfhXr6Ns6nKP37UEQo7D4C8be+ChY2LmJh+g+Oi+yIRcV8gwO7HxWEB7/h+Pi/pPT5PA9kg/kF8HH5PJMK/RcXkbYkoPsgxd20pq6irPkRYWl0FoFDI2FAkEkKJJG2Iz1fA2FjVIIIYDlPajJ0ypXsPpXYdlanS7whEWsjnq4b5fCUUi91QSkaNCfQdPFTnkMcTQ4JwQBRFApvgNwiC2qEzObq0RnuX3uzYpTUl38NIKUTEOEQQSWDz3Ec4/rTGf6+hQUVKFstDZ4Jspzuqfl4d55rJrLqp1xv/SdP0bb1eD3VqxecalfG2ktJBJaX6h0ZNX9NoDYMajfEWTcuDPNMS01ne1BdGeylDRUKjTiOJ+HJAyArn7CjR
achR/S+UWnZlcoaSzCgitam5uNbukaltbtVMSwYVgLMYrjBnODwIPi7b5qJ0KfmYgk6iNKm5igS7Wz3LnhcHvpXKx5/T0La9Frtuj9Wm3pXk0O5KdCT0GS1WqNEwid3nLQVzZs8GZxKTGMJ7pFR4PMFi7Uh06HYnp2l2BjHWFMvntFELHod3J0QTYKKJ22fQxUEVweTV/uC4J4eZH75ymdFfKUKJDfmABUWc6KszK+OgRRcNFeIIVaVSioBBx3l26Otn7D+S8pCHUtmTsRkbW5+sb0TUAHwnFmffJkkzxHDTPgwvOIOTzruk1HoXw5IfCYXW4MJPRwBmFAXDUiljXywqYnTLyhUxPfX1rGB8h+12AL7EyYY3TLbsPdoEzw6DtaBDayrYrLcWdWvNhf8SIuXhy3A57AyBNcnNaJ/PF9YxYTsUGwRxfEeQCQ8x3HhHJrNco/X263p9ylWdPuMrrTZ5XHJifOCALLi1C1zHMObWh9ls4F+1KjLh/xO9Qq2zagz8cP9+HFO6YGlLS2TCRwgS9YAgWQ8IgvUYlTA1+Ugml0AUocrMqWhaej5tySyWBUpMMEwQqm+syQzu2LTy53tYW1vL6McEEUoJkpzwGCc0EMd198Qi0wMcN0IM0w3hOBKO6dXe3jAh1+db9fwjDuP4qP4sj4fb7HJNCPeP4DirorISLJw6JfT3tXJljG+1/8fJhgKpcVOpZC6At3TpuOIFC3g1y5Zxqurq+Ckzq2OsVVUca3U1N2X6dG5NzcuxMFKNj3oHuKHvDWSz2YGl0SMNFElQnV8oEAbsqEAbF2xhTPfIT+pYL1E0RIaiMQ8xLOczQpr5CSF13sSI1IMSYtIVFHPewMn0i1JF1h1Ekgfj49UMTiBgRfZOLB55uYWCWziRt1ZtdL6ponPeUOidrRqTaweltb+l0Gcv01lcZzD5C5AXY2TCxI97DqFCEXoEDAksKJVKIYbKIIHJICqWQlOCCqqUCigJjEkQMjCu+F4mY3444UhJRpTrak3kCbkUC7yQY74f30skP3IhJAmGRryEMSBqicGKsp1eWVJ2MWUyOyirLUc53lVGpDjyJY8wgrm9v+FSMDySw5EvhqJGjo9EfS2V5XwsV5acUOmSNlrs8h69mb4kU2YPSqnJgTRj8uwhTrCGPv2EwfwPlt2a2JOlvaMAAAAASUVORK5CYII=&amp;#x27;); background-size: cover; display: block;&quot;&gt;&lt;/span&gt;
  &lt;picture&gt;
        &lt;source srcset=&quot;/static/10053d985ed8b13cfb9b560ee5d2cc37/c54d4/re-executed-jobs.webp 175w, /static/10053d985ed8b13cfb9b560ee5d2cc37/a3432/re-executed-jobs.webp 350w, /static/10053d985ed8b13cfb9b560ee5d2cc37/52d01/re-executed-jobs.webp 501w&quot; sizes=&quot;(max-width: 501px) 100vw, 501px&quot; type=&quot;image/webp&quot;&gt;
        &lt;source srcset=&quot;/static/10053d985ed8b13cfb9b560ee5d2cc37/17006/re-executed-jobs.png 175w, /static/10053d985ed8b13cfb9b560ee5d2cc37/d6f3f/re-executed-jobs.png 350w, /static/10053d985ed8b13cfb9b560ee5d2cc37/f55b8/re-executed-jobs.png 501w&quot; sizes=&quot;(max-width: 501px) 100vw, 501px&quot; type=&quot;image/png&quot;&gt;
        &lt;img class=&quot;gatsby-resp-image-image&quot; src=&quot;/static/10053d985ed8b13cfb9b560ee5d2cc37/f55b8/re-executed-jobs.png&quot; alt=&quot;re executed jobs&quot; title=&quot;re executed jobs&quot; loading=&quot;lazy&quot; style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;&gt;
      &lt;/picture&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/body&gt;&lt;/html&gt;&lt;/p&gt;
&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;dvc&quot;&gt;&lt;pre class=&quot;language-dvc&quot;&gt;&lt;code class=&quot;language-dvc&quot;&gt;&lt;span class=&quot;token line&quot;&gt;&lt;span class=&quot;token input&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;token dvc&quot;&gt;dvc run&lt;/span&gt; Rscript code/train_test_spliting.R &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;
                  data/Posts.csv &lt;span class=&quot;token number&quot;&gt;0.15&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;20170426&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;
                  data/train_post.csv &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;
                  data/test_post.csv
&lt;/span&gt;
&lt;span class=&quot;token line&quot;&gt;&lt;span class=&quot;token input&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;token dvc&quot;&gt;dvc repro&lt;/span&gt; data/evaluation_python.txt
&lt;/span&gt;
Reproducing run command for data item data/matrix_train.txt. Args: Rscript --vanilla code/featurization.R data/train_post.csv data/test_post.csv data/matrix_train.txt data/matrix_test.txt
Reproducing run command for data item data/model.p. Args: python3 code/train_model_python.py data/matrix_train.txt 20170426 data/model.p
Reproducing run command for data item data/evaluation_python.txt. Args: python3 code/evaluate_python_mdl.py data/model.p data/matrix_test.txt data/evaluation_python.txt

Data item “data/evaluation_python.txt” was reproduced.

&lt;span class=&quot;token line&quot;&gt;&lt;span class=&quot;token input&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;token command&quot;&gt;cat&lt;/span&gt; data/evaluation_python.txt
&lt;/span&gt;AUC: 0.793145&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/body&gt;&lt;/html&gt;
&lt;p&gt;The new AUC is 0.793145, an improvement over the previous
iteration.&lt;/p&gt;
&lt;h2&gt;Summary&lt;/h2&gt;
&lt;p&gt;Data science projects often combine R and Python code.
Additional tools besides git and shell scripting have been developed to facilitate the
development of predictive models in multi-language environments. Using a data
version control system for reproducibility and Feather for data interoperability
helps you orchestrate R and Python code in a single environment.&lt;/p&gt;</content:encoded></item><item><title><![CDATA[ML Model Ensembling with Fast Iterations]]></title><link>https://blog.dvc.org/ml-model-ensembling-with-fast-iterations</link><guid isPermaLink="false">https://blog.dvc.org/ml-model-ensembling-with-fast-iterations</guid><pubDate>Wed, 23 Aug 2017 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;In a model ensembling setup, the final prediction is a composite of predictions
from individual machine learning algorithms. To make the best model composite,
you have to try dozens of combinations of weights for the model set. It takes a
lot of time to come up with the best one. That is why the iteration speed is
crucial in the ML model ensembling. We are going to make our research
reproducible by using &lt;a href=&quot;http://dvc.org&quot;&gt;Data Version Control&lt;/a&gt; tool -
(&lt;a href=&quot;http://dvc.org&quot;&gt;DVC&lt;/a&gt;). It provides the ability to quickly re-run and replicate
the ML prediction result by executing just a single command &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc repro&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;.&lt;/p&gt;
&lt;p&gt;As we will demonstrate, DVC is a good tool for tackling the common technical
challenges of building pipelines for ensemble learning.&lt;/p&gt;
&lt;h2&gt;Project Overview&lt;/h2&gt;
&lt;p&gt;In this case, we will build an R-based solution to the
supervised-learning regression problem of predicting wine sales for the
&lt;a href=&quot;https://inclass.kaggle.com/c/pred-411-2016-04-u3-wine/&quot;&gt;Predict Wine Sales&lt;/a&gt;
Kaggle competition.&lt;/p&gt;
&lt;p&gt;An ensemble prediction methodology will be used in the project. A weighted
ensemble of three models will be implemented, trained, and used for prediction
(namely, Linear Regression, &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;GBM&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;, and &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;XGBoost&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;).&lt;/p&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;span class=&quot;gatsby-resp-image-wrapper&quot; style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto;  max-width: 435px;&quot;&gt;
      &lt;a class=&quot;gatsby-resp-image-link&quot; href=&quot;/static/eb9050a712d4a3f7fd006686b1f41fe2/93314/ensemble-prediction-methodology.png&quot; style=&quot;display: block&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;
    &lt;span class=&quot;gatsby-resp-image-background-image&quot; style=&quot;padding-bottom: 68.62068965517241%; position: relative; bottom: 0; left: 0; background-image: url(&amp;#x27;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAOCAIAAACgpqunAAAACXBIWXMAAAsSAAALEgHS3X78AAACPklEQVQoz12SSW/TQBTH8yV76ZEDFw5UiEOFkBASqir1xAWJj1AhlXSjVNDQJaRLGtIlJe5GS53YcdzE22ye5Q2euE1T3mH83pv/b/625xX0fSCE4yTRI8EYC8JISjnaRAhRyvK8kD8opaX1jd3qvuO6WQkA2bpTrf7c2b28/DPsdH3/24/1X4dHcWxsCnk3SdDK19XPxWLTsnJpZrhZLs8vLFS2d4aw3WotLi8tLi13PO/BWQhxenFdPzkPwvBOCnBzY+8fNp22M4QTlNQbpwfHTUrJPTzY0MjRne2H0iRC+VUAdV+aPjg1CNt5aWAh4cDT1TO7Vq/sudpDRp0eH+HSKvk0RTfWuPU76/Spe9TbO2l+bPyds8JDLpmBAwpv1tKJYu/lFzKxhNevMwcVvnt7NT5mPX3WHh+L3s9kslrn+3TlydTW5HTl+Yf6i5D5hfx7iNK4b+OrEgYtpXEGlkLoQ2MWohBSczdKcQKctDdJcMpAalAGTqVGQsdhD3lnCc9K0EoqgkUS8ZuqJAiYgYVKieK0vUWjS6JSqVIDJ6lOmLLd224sQgbZQVpwGQR9z3daPRbFKow0ZAYRYrHdT9xeF7MIy9jACjThQKSmoUsCezBQKvvVmkf6fFZLrLXpSZBUxLRbo2lAJVYgC6PXoANLO2WDcom6EnkcXZxnCfalkgOBwPp6RYN8NJ7DyA9JHF5625p/dTL32ipONsozHXwrBtuPxIX/0QGtBCQuj900dljspNkr3DmPjpDW/wDEnAvQG1Y76gAAAABJRU5ErkJggg==&amp;#x27;); background-size: cover; display: block;&quot;&gt;&lt;/span&gt;
  &lt;picture&gt;
        &lt;source srcset=&quot;/static/eb9050a712d4a3f7fd006686b1f41fe2/c54d4/ensemble-prediction-methodology.webp 175w, /static/eb9050a712d4a3f7fd006686b1f41fe2/a3432/ensemble-prediction-methodology.webp 350w, /static/eb9050a712d4a3f7fd006686b1f41fe2/426ac/ensemble-prediction-methodology.webp 700w, /static/eb9050a712d4a3f7fd006686b1f41fe2/bf818/ensemble-prediction-methodology.webp 870w&quot; sizes=&quot;(max-width: 700px) 100vw, 700px&quot; type=&quot;image/webp&quot;&gt;
        &lt;source srcset=&quot;/static/eb9050a712d4a3f7fd006686b1f41fe2/17006/ensemble-prediction-methodology.png 175w, /static/eb9050a712d4a3f7fd006686b1f41fe2/d6f3f/ensemble-prediction-methodology.png 350w, /static/eb9050a712d4a3f7fd006686b1f41fe2/69344/ensemble-prediction-methodology.png 700w, /static/eb9050a712d4a3f7fd006686b1f41fe2/93314/ensemble-prediction-methodology.png 870w&quot; sizes=&quot;(max-width: 700px) 100vw, 700px&quot; type=&quot;image/png&quot;&gt;
        &lt;img class=&quot;gatsby-resp-image-image&quot; src=&quot;/static/eb9050a712d4a3f7fd006686b1f41fe2/69344/ensemble-prediction-methodology.png&quot; alt=&quot;ensemble prediction methodology&quot; title=&quot;ensemble prediction methodology&quot; loading=&quot;lazy&quot; style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;&gt;
      &lt;/picture&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/body&gt;&lt;/html&gt;&lt;/p&gt;
&lt;p&gt;If properly designed and used, ensemble prediction can perform much better than
the predictions of the individual machine learning models that make up the ensemble.&lt;/p&gt;
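&lt;p&gt;A weighted ensemble of this kind boils down to a weighted average of the per-model predictions. The following Python sketch shows the arithmetic under stated assumptions: the function weighted_ensemble, the prediction values, and the weights are all made up for illustration, and the project’s own R implementation may differ.&lt;/p&gt;

```python
def weighted_ensemble(predictions, weights):
    """Blend per-model predictions into one list using normalized weights."""
    if set(predictions) != set(weights):
        raise ValueError("every model needs exactly one weight")
    total = sum(weights.values())
    if total == 0:
        raise ValueError("weights must not sum to zero")
    lengths = {len(values) for values in predictions.values()}
    if len(lengths) != 1:
        raise ValueError("all models must predict the same number of rows")

    n_rows = lengths.pop()
    blended = [0.0] * n_rows
    for model, values in predictions.items():
        share = weights[model] / total  # this model's share of the blend
        for i, value in enumerate(values):
            blended[i] += share * value
    return blended


# Made-up predictions for three rows from the three model families.
toy_predictions = {
    "LR": [2.0, 4.0, 6.0],
    "GBM": [4.0, 4.0, 4.0],
    "XGBoost": [6.0, 4.0, 2.0],
}
toy_weights = {"LR": 1.0, "GBM": 1.0, "XGBoost": 2.0}
blend = weighted_ensemble(toy_predictions, toy_weights)
print(blend)
```

&lt;p&gt;Dividing each weight by the total keeps the blended values on the same scale as the individual predictions, so the weights can be tuned freely without having to sum to one.&lt;/p&gt;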
&lt;p&gt;Prediction results will be delivered as an output CSV file in the format
specified in the requirements of the
&lt;a href=&quot;https://inclass.kaggle.com/c/pred-411-2016-04-u3-wine/&quot;&gt;Predict Wine Sales&lt;/a&gt;
Kaggle competition (the so-called Kaggle submission file).&lt;/p&gt;
&lt;h2&gt;Important Pre-Requisites&lt;/h2&gt;
&lt;p&gt;In order to try the materials of this
&lt;a href=&quot;https://github.com/gvyshnya/DVC_R_Ensemble&quot;&gt;repository&lt;/a&gt; in your environment,
the following software should be installed on your machine:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;em&gt;Python 3&lt;/em&gt;&lt;/strong&gt; runtime environment for your OS (it is required to run DVC
commands in the batch files)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;em&gt;DVC&lt;/em&gt;&lt;/strong&gt; itself (you can install it as a Python package by running the
standard command at your command-line prompt: &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;pip install dvc&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;em&gt;R&lt;/em&gt;&lt;/strong&gt; &lt;strong&gt;&lt;em&gt;3.4.x&lt;/em&gt;&lt;/strong&gt; runtime environment for your OS&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;em&gt;git&lt;/em&gt;&lt;/strong&gt; command-line client application for your OS&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Technical Challenges&lt;/h2&gt;
&lt;p&gt;The technical challenges of building the ML pipeline for this project were to
meet the business requirements below:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Ability to conditionally trigger execution of 3 different ML prediction models&lt;/li&gt;
&lt;li&gt;Ability to conditionally trigger model ensemble prediction based on
predictions of those 3 individual models&lt;/li&gt;
&lt;li&gt;Ability to specify weights of each of the individual model predictions in the
ensemble&lt;/li&gt;
&lt;li&gt;Fast redeployment and re-run of the ML pipeline upon frequent
reconfiguration and model tweaks&lt;/li&gt;
&lt;li&gt;Reproducibility of the pipeline and forecasting results across multiple
machines and team members&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The following sections explain how these challenges are addressed in the
design of the ML pipeline for this project.&lt;/p&gt;
&lt;h2&gt;ML Pipeline&lt;/h2&gt;
&lt;p&gt;The ML pipeline for this project is presented in the diagram below:&lt;/p&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;span class=&quot;gatsby-resp-image-wrapper&quot; style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto;  max-width: 365.5px;&quot;&gt;
      &lt;a class=&quot;gatsby-resp-image-link&quot; href=&quot;/static/9cf20fd774b97331a5c6e17a1e92115b/b2a6b/ml-pipeline.png&quot; style=&quot;display: block&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;
    &lt;span class=&quot;gatsby-resp-image-background-image&quot; style=&quot;padding-bottom: 74.00820793433653%; position: relative; bottom: 0; left: 0; background-image: url(&amp;#x27;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAPCAYAAADkmO9VAAAACXBIWXMAAAsSAAALEgHS3X78AAADbUlEQVQ4y41UTWwbVRB+diIVDpUqilROSKBeAKlnLjQIAUIccqGAQLSihwYhOPYCFwRSCEVpJBCoFIWKlkIvgEMLTpzUsb32/tj747XXrrPZ9c/uev23/iH22oljO8PuOiSHSlXnafVmZ+Z9mvlm3kPDZuRbGAowaBJCJXPzRHnz70inFISKvDKsZlcG3QoGVdnrQZaoWc5lyH+hB8qoRSzDTgL6daKuCbdOGTkfa5bDIPEekOIe6DcoaOvrK3Ysw/CoV157MKCpeY7CFj49agSesf9hRD8He+xUM3/npYHhPw0j5kXokycdH+TRsLqEHloAll1NxXfxX82/WJGXf2wUVq+01LvXWopvxvbHY6uoW/Y7sTGpgxKFbZRUdxCb7SIPXhyD9Gt+1DdCE7OXFi1AZUIXvXXo8dCr4dCt4gCDFFhcJuzYqz/9iXolL1qLG/clk1S2XYeZ1VYP9Ubw/YERnO1ovi+2y3c/260G5qAVOmP7ht0MEhiHTrSEq+5bvtSRX/7hjrzx0WW3bSubMAYJF8gDwJjGfMBo3HzwHjaPy+RXbDG+QKn0u2NKAG3UwDkcz/deFSujvGzARqY0SlObnRcOQDiNdnYRIhOcwjRKuzpITRE2jAw0oA6JIpe0/efOXnDfDrCTzpmceSat9iDE5UEsD4DPmYuHWSnEgU4XqBlOpedJKXKJzlNfWvoCo0bfsX2/Lv3munrztgNIpGpvJnJtWCPFYUoxgc93fjgAiRbwQ/ACeT6WI2exdGCW2Ax/Tsn4nGVzOLzw8Yzr6++vT45L7k5L1b1tuQZG1gBIF3cX7is5anonOJWtt2ELxPqG85nWsmhwuvz2W2fdn8x9OlnZI11/YNwjN7zc47+HCsc8Ien4OrN5FOCb/TKV/aYcQ4jX2YvpqrBI56JX4hr7naVfixcZZw7fO3/O/VADHdUxpLerjs7q5LOpOn/aL/heZjRySqhxU0yROPl/LF9hTiUr3Gt8iXmd12NPO9038cdgh56GXuRJJwgXQ85+R/7ZTYgRPqknIJzGICQEIVUWAEsHnUF94qnjj1ISQa8xq0BnY8CpzGXb3i4FsEHTvvMBcb8R4y5/eOMVl1UmZY9LOIUNI/fCA7klAavQzhNz4nmXO5ajQngGt3hlLUB6zrZv6evXh60YWNc08B8bAFkvKDFRaQAAAABJRU5ErkJggg==&amp;#x27;); background-size: cover; display: block;&quot;&gt;&lt;/span&gt;
  &lt;picture&gt;
        &lt;source srcset=&quot;/static/9cf20fd774b97331a5c6e17a1e92115b/c54d4/ml-pipeline.webp 175w, /static/9cf20fd774b97331a5c6e17a1e92115b/a3432/ml-pipeline.webp 350w, /static/9cf20fd774b97331a5c6e17a1e92115b/426ac/ml-pipeline.webp 700w, /static/9cf20fd774b97331a5c6e17a1e92115b/feeb6/ml-pipeline.webp 731w&quot; sizes=&quot;(max-width: 700px) 100vw, 700px&quot; type=&quot;image/webp&quot;&gt;
        &lt;source srcset=&quot;/static/9cf20fd774b97331a5c6e17a1e92115b/17006/ml-pipeline.png 175w, /static/9cf20fd774b97331a5c6e17a1e92115b/d6f3f/ml-pipeline.png 350w, /static/9cf20fd774b97331a5c6e17a1e92115b/69344/ml-pipeline.png 700w, /static/9cf20fd774b97331a5c6e17a1e92115b/b2a6b/ml-pipeline.png 731w&quot; sizes=&quot;(max-width: 700px) 100vw, 700px&quot; type=&quot;image/png&quot;&gt;
        &lt;img class=&quot;gatsby-resp-image-image&quot; src=&quot;/static/9cf20fd774b97331a5c6e17a1e92115b/69344/ml-pipeline.png&quot; alt=&quot;ml pipeline&quot; title=&quot;ml pipeline&quot; loading=&quot;lazy&quot; style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;&gt;
      &lt;/picture&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/body&gt;&lt;/html&gt;&lt;/p&gt;
&lt;p&gt;As you can see, the essential implementation of the solution is as follows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://gist.github.com/gvyshnya/443424775b0150baac774cc6cf3cb1cc&quot;&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;preprocessing.R&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;&lt;/a&gt;
handles all aspects of data manipulation and pre-processing (reading the training
and testing data sets, removing outliers, imputing NAs, etc.) and stores the
refined training and testing data as new files for reuse by the model scripts&lt;/li&gt;
&lt;li&gt;3 model scripts implement training and forecasting algorithms for each of the
models selected for this project
(&lt;a href=&quot;https://gist.github.com/gvyshnya/7ec76316c24bc1b4f595ef1256f52d3a&quot;&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;LR.R&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;&lt;/a&gt;,
&lt;a href=&quot;https://gist.github.com/gvyshnya/50e5ea3efa9771d2e7cc121c2f1a04e4&quot;&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;GBM.R&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;&lt;/a&gt;,
&lt;a href=&quot;https://gist.github.com/gvyshnya/2e5799863f02fec652c194020da82dd3&quot;&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;xgboost.R&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://gist.github.com/gvyshnya/84379d6a68fd085fe3a26aabad453e55&quot;&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;ensemble.R&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;&lt;/a&gt;
is responsible for the weighted ensemble prediction and the final output of
the Kaggle submission file&lt;/li&gt;
&lt;li&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;config.R&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; is responsible for all of the conditional logic switches needed in
the pipeline (it is sourced by all of the modeling and ensemble
prediction scripts)&lt;/li&gt;
&lt;/ul&gt;
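&lt;p&gt;The switch mechanism can be pictured with a small Python analogue. The project itself does this in R, so the sketch below is only an illustration of the pattern: the cfg_run_* names follow the variables in config.R, while the CONFIG dictionary, the stand-in MODEL_RUNNERS, and the function run_enabled_models are hypothetical.&lt;/p&gt;

```python
# Hypothetical Python analogue of the switches kept in config.R.
CONFIG = {
    "cfg_run_LR": 1,
    "cfg_run_GBM": 0,  # 0 skips fitting and predicting for this model
    "cfg_run_xgboost": 1,
}

# Stand-in model runners; the real scripts train and predict in R.
MODEL_RUNNERS = {
    "LR": lambda: "LR predictions",
    "GBM": lambda: "GBM predictions",
    "xgboost": lambda: "xgboost predictions",
}


def run_enabled_models(config, runners):
    """Run only the models whose cfg_run_* switch is set to 1."""
    results = {}
    for name, runner in runners.items():
        # A missing switch is treated as disabled.
        if config.get(f"cfg_run_{name}", 0) == 1:
            results[name] = runner()
    return results


results = run_enabled_models(CONFIG, MODEL_RUNNERS)
print(sorted(results))
```

&lt;p&gt;Keeping the switches in one shared module means a model can be turned off in a single place, and every script that reads the configuration sees the same decision.&lt;/p&gt;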
&lt;p&gt;A special note on the lack of feature engineering in this project: it
was a deliberate decision based on the specifics of the dataset. The
existing features were quite instrumental in predicting the target values ‘as is’.
Therefore, it was decided to follow the well-known
&lt;a href=&quot;https://en.wikipedia.org/wiki/Pareto_principle&quot;&gt;Pareto principle&lt;/a&gt; (interpreted
as “&lt;strong&gt;&lt;em&gt;20% of efforts address 80% of issues&lt;/em&gt;&lt;/strong&gt;”, in this case) and not to spend
more time on it.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;em&gt;Note&lt;/em&gt;&lt;/strong&gt;: all &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;R&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; and batch files mentioned throughout this blog post are
available online in a separate GitHub
&lt;a href=&quot;https://github.com/gvyshnya/DVC_R_Ensemble&quot;&gt;repository&lt;/a&gt;. There you can also
review more details on the implementation of each of the machine learning
prediction models.&lt;/p&gt;
&lt;h3&gt;Pipeline Configuration Management&lt;/h3&gt;
&lt;p&gt;All of the essential tweaks to the conditional machine learning pipeline for this
project are managed by a configuration file. For ease of use across the solution,
it was implemented as an R code module (&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;config.R&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;) that is included in all model
training and forecasting scripts. The respective parameters (assigned as R
variables) are thus retrieved by the runnable scripts, which trigger the
corresponding conditional logic.&lt;/p&gt;
&lt;p&gt;This file is not intended to run from a command line (unlike the rest of the R
scripts in the project).&lt;/p&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;div id=&quot;gist73938264&quot; class=&quot;gist&quot;&gt;
    &lt;div class=&quot;gist-file&quot;&gt;
      &lt;div class=&quot;gist-data&quot;&gt;
        &lt;div class=&quot;js-gist-file-update-container js-task-list-container file-box&quot;&gt;
  &lt;div id=&quot;file-config-r&quot; class=&quot;file&quot;&gt;
    

  &lt;div itemprop=&quot;text&quot; class=&quot;Box-body p-0 blob-wrapper data type-r&quot;&gt;
      
&lt;table class=&quot;highlight tab-size js-file-line-container&quot; data-tab-size=&quot;8&quot;&gt;
      &lt;tbody&gt;&lt;tr&gt;
        &lt;td id=&quot;file-config-r-L1&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;1&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-config-r-LC1&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-c&quot;&gt;&lt;span class=&quot;pl-c&quot;&gt;#&lt;/span&gt; Competition: https://inclass.kaggle.com/c/pred-411-2016-04-u3-wine/&lt;/span&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-config-r-L2&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;2&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-config-r-LC2&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-c&quot;&gt;&lt;span class=&quot;pl-c&quot;&gt;#&lt;/span&gt; This is a configuration file to the entire solution &lt;/span&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-config-r-L3&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;3&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-config-r-LC3&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-config-r-L4&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;4&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-config-r-LC4&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-c&quot;&gt;&lt;span class=&quot;pl-c&quot;&gt;#&lt;/span&gt; LR.R specific settings&lt;/span&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-config-r-L5&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;5&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-config-r-LC5&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-smi&quot;&gt;cfg_run_LR&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;&amp;#x3C;-&lt;/span&gt; &lt;span class=&quot;pl-c1&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;pl-c&quot;&gt;&lt;span class=&quot;pl-c&quot;&gt;#&lt;/span&gt; if set to 0, LR model will not fit, and its prediction will not be calculated in the batch mode&lt;/span&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-config-r-L6&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;6&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-config-r-LC6&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-config-r-L7&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;7&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-config-r-LC7&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-c&quot;&gt;&lt;span class=&quot;pl-c&quot;&gt;#&lt;/span&gt; GBM.R specific settings&lt;/span&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-config-r-L8&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;8&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-config-r-LC8&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-smi&quot;&gt;cfg_run_GBM&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;&amp;#x3C;-&lt;/span&gt; &lt;span class=&quot;pl-c1&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;pl-c&quot;&gt;&lt;span class=&quot;pl-c&quot;&gt;#&lt;/span&gt; if set to 0, GBM model will not fit, and its prediction will not be calculated in the batch mode&lt;/span&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-config-r-L9&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;9&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-config-r-LC9&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-config-r-L10&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;10&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-config-r-LC10&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-c&quot;&gt;&lt;span class=&quot;pl-c&quot;&gt;#&lt;/span&gt; xgboost.R specific settings&lt;/span&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-config-r-L11&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;11&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-config-r-LC11&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-smi&quot;&gt;cfg_run_xgboost&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;&amp;#x3C;-&lt;/span&gt; &lt;span class=&quot;pl-c1&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;pl-c&quot;&gt;&lt;span class=&quot;pl-c&quot;&gt;#&lt;/span&gt; if set to 0, xgboost model will not fit, and its prediction will not be calculated in the batch mode&lt;/span&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-config-r-L12&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;12&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-config-r-LC12&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-config-r-L13&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;13&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-config-r-LC13&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-c&quot;&gt;&lt;span class=&quot;pl-c&quot;&gt;#&lt;/span&gt; ensemble.R specific settings&lt;/span&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-config-r-L14&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;14&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-config-r-LC14&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-smi&quot;&gt;cfg_run_ensemble&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;&amp;#x3C;-&lt;/span&gt; &lt;span class=&quot;pl-c1&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;pl-c&quot;&gt;&lt;span class=&quot;pl-c&quot;&gt;#&lt;/span&gt; if set to 0, the ensemble will not predict, and ensemble prediction will not be created&lt;/span&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-config-r-L15&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;15&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-config-r-LC15&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-config-r-L16&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;16&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-config-r-LC16&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-c&quot;&gt;&lt;span class=&quot;pl-c&quot;&gt;#&lt;/span&gt; ensemble components&lt;/span&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-config-r-L17&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;17&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-config-r-LC17&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-smi&quot;&gt;cfg_model_predictions&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;&amp;#x3C;-&lt;/span&gt; c(&lt;span class=&quot;pl-s&quot;&gt;&lt;span class=&quot;pl-pds&quot;&gt;&quot;&lt;/span&gt;data/submission_LR.csv&lt;span class=&quot;pl-pds&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt;, &lt;span class=&quot;pl-s&quot;&gt;&lt;span class=&quot;pl-pds&quot;&gt;&quot;&lt;/span&gt;data/submission_GBM.csv&lt;span class=&quot;pl-pds&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt;, &lt;span class=&quot;pl-s&quot;&gt;&lt;span class=&quot;pl-pds&quot;&gt;&quot;&lt;/span&gt;data/submission_XGBOOST.csv&lt;span class=&quot;pl-pds&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt;)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-config-r-L18&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;18&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-config-r-LC18&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-c&quot;&gt;&lt;span class=&quot;pl-c&quot;&gt;#&lt;/span&gt; element weights mapped to the cfg_model_predictions elements above&lt;/span&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-config-r-L19&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;19&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-config-r-LC19&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-smi&quot;&gt;cfg_model_weights&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;&amp;#x3C;-&lt;/span&gt; c(&lt;span class=&quot;pl-c1&quot;&gt;1&lt;/span&gt;,&lt;span class=&quot;pl-c1&quot;&gt;1&lt;/span&gt;,&lt;span class=&quot;pl-c1&quot;&gt;1&lt;/span&gt;) &lt;span class=&quot;pl-c&quot;&gt;&lt;span class=&quot;pl-c&quot;&gt;#&lt;/span&gt; weights of predictions of the models in the ensemble&lt;/span&gt;&lt;/td&gt;
      &lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;


  &lt;/div&gt;

  &lt;/div&gt;
&lt;/div&gt;

      &lt;/div&gt;
      &lt;div class=&quot;gist-meta&quot;&gt;
        &lt;a href=&quot;https://gist.github.com/gvyshnya/918e94b06ebf222f6bb56ed26a5f44ee/raw/e274919657607fdfd67a2fb6354e40ff0c4173e9/config.R&quot; style=&quot;float:right&quot;&gt;view raw&lt;/a&gt;
        &lt;a href=&quot;https://gist.github.com/gvyshnya/918e94b06ebf222f6bb56ed26a5f44ee#file-config-r&quot;&gt;config.R&lt;/a&gt;
        hosted with ❤ by &lt;a href=&quot;https://github.com&quot;&gt;GitHub&lt;/a&gt;
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;&lt;/body&gt;&lt;/html&gt;&lt;/p&gt;
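&lt;p&gt;To make the role of &lt;code class=&quot;language-text&quot;&gt;cfg_model_weights&lt;/code&gt; concrete, here is a minimal sketch of a weighted ensemble for a single test row. It assumes the ensemble is a weighted arithmetic mean of the model predictions, which may differ from what &lt;code class=&quot;language-text&quot;&gt;ensemble.R&lt;/code&gt; actually does. The &lt;code class=&quot;language-text&quot;&gt;weighted_mean&lt;/code&gt; helper and the prediction values are made up for illustration.&lt;/p&gt;

```shell
#!/bin/sh
# Hypothetical sketch: weighted mean of three model predictions for one row.
# Assumption: the ensemble is a weighted arithmetic mean; ensemble.R may differ.

# weighted_mean "p1 p2 p3" "w1 w2 w3" prints sum(w_i * p_i) / sum(w_i)
weighted_mean() {
    echo "$1 $2" | awk '{
        n = NF / 2                  # first half: predictions, second half: weights
        num = 0; den = 0
        for (i = n; i >= 1; i--) {
            num += $i * $(i + n)    # prediction times its weight
            den += $(i + n)
        }
        printf "%.2f\n", num / den
    }'
}

# Made-up predictions from LR, GBM and xgboost for a single test row.
equal=$(weighted_mean "2 4 3" "1 1 1")       # equal weights, as in cfg_model_weights
gbm_heavy=$(weighted_mean "2 4 3" "1 2 1")   # GBM counts twice as much
echo "equal weights: $equal"
echo "GBM-heavy:     $gbm_heavy"
```

&lt;p&gt;With equal weights the three predictions are simply averaged. Raising the GBM weight pulls the ensemble prediction toward the GBM value.&lt;/p&gt;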
&lt;h3&gt;Why Do We Need DVC?&lt;/h3&gt;
&lt;p&gt;No ML model reaches sound prediction accuracy on the first attempt. You
have to keep adjusting your algorithm and model implementations based on
cross-validation results until the predictions are good enough. This is
especially true in ensemble learning, where you constantly tweak not only the
parameters of the individual prediction models but also the settings of the
ensemble itself:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;changing ensemble composition — adding or removing individual prediction
models&lt;/li&gt;
&lt;li&gt;changing model prediction weights in the resulting ensemble prediction&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In this situation, DVC helps you manage your ensemble ML pipeline reliably.
Consider the following real-world scenario:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A team member changes the settings of the &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;GBM&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; model and resubmits its
implementation to the repository (this is emulated by the commit
&lt;a href=&quot;https://github.com/gvyshnya/DVC_R_Ensemble/commit/27825d0732f72f07e7e4e48548ddb8a8604103f0&quot;&gt;#8604103f0&lt;/a&gt;,
checksum &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;27825d0&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;)&lt;/li&gt;
&lt;li&gt;You rerun the entire ML pipeline on your computer, to get the newest
predictions from &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;GBM&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; as well as the updated final ensemble prediction&lt;/li&gt;
&lt;li&gt;The prediction results are still not optimal, so someone changes the
weights of the individual models in the ensemble, assigning &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;GBM&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; a
higher weight than &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;xgboost&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; and &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;LR&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;&lt;/li&gt;
&lt;li&gt;After the ensemble setup changes are committed (and the updated &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;config.R&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; appears in
the repository, as emulated by the commit
&lt;a href=&quot;https://github.com/gvyshnya/DVC_R_Ensemble/commit/5bcbe115afcb24886abb4734ff2da42eb97612ce&quot;&gt;#eb97612ce&lt;/a&gt;,
checksum &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;5bcbe11&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;), you rerun the model predictions and the final ensemble
prediction on your machine once again&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;To handle all of the changes above, you simply keep running your
&lt;strong&gt;DVC&lt;/strong&gt; commands per the script developed (see the section below). You do
not have to remember or explicitly know which changes were made to the project
codebase or its pipeline configuration. &lt;strong&gt;DVC&lt;/strong&gt; will automatically check out the
latest changes from the repo and make sure it runs only those steps in
the pipeline that were affected by the recent changes in the code modules.&lt;/p&gt;
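&lt;p&gt;The idea of rerunning a step only when its inputs changed can be sketched with plain checksums. The script below is an illustration under stated assumptions, not DVC’s implementation: the &lt;code class=&quot;language-text&quot;&gt;needs_rerun&lt;/code&gt; and &lt;code class=&quot;language-text&quot;&gt;run_step&lt;/code&gt; helpers and the stamp file are hypothetical, and the step itself is replaced by a message.&lt;/p&gt;

```shell
#!/bin/sh
# Hypothetical sketch of checksum-based change detection; not how DVC is implemented.
workdir=$(mktemp -d)

# needs_rerun DEP STAMP: succeed when the checksum of DEP differs from the one stored in STAMP.
needs_rerun() {
    current=$(cksum "$1" | cut -d ' ' -f 1)
    previous=""
    if [ -f "$2" ]; then
        previous=$(cat "$2")
    fi
    [ "$current" != "$previous" ]
}

# run_step DEP STAMP: report "rerun" and record the new checksum, or report "skip".
run_step() {
    if needs_rerun "$1" "$2"; then
        cksum "$1" | cut -d ' ' -f 1 > "$2"
        echo "rerun"
    else
        echo "skip"
    fi
}

config="$workdir/config.R"
stamp="$workdir/ensemble.stamp"

echo 'cfg_model_weights = c(1,1,1)' > "$config"
first=$(run_step "$config" "$stamp")     # no checksum recorded yet
second=$(run_step "$config" "$stamp")    # nothing changed since the last run
echo 'cfg_model_weights = c(1,2,1)' > "$config"
third=$(run_step "$config" "$stamp")     # the weights were edited
echo "$first $second $third"
rm -r "$workdir"
```

&lt;p&gt;The second call is skipped because the configuration file is byte-for-byte identical to the recorded state. Only the edit to the weights triggers another run.&lt;/p&gt;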
&lt;h3&gt;Orchestrating the Pipeline: DVC Command File&lt;/h3&gt;
&lt;p&gt;After developing the individual R scripts needed by the different steps of our
machine learning pipeline, we orchestrate them using DVC.&lt;/p&gt;
&lt;p&gt;Below is a batch file illustrating how DVC manages the steps of the machine
learning process for this project:&lt;/p&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;div id=&quot;gist73940214&quot; class=&quot;gist&quot;&gt;
    &lt;div class=&quot;gist-file&quot;&gt;
      &lt;div class=&quot;gist-data&quot;&gt;
        &lt;div class=&quot;js-gist-file-update-container js-task-list-container file-box&quot;&gt;
  &lt;div id=&quot;file-dvc-bat&quot; class=&quot;file&quot;&gt;
    

  &lt;div itemprop=&quot;text&quot; class=&quot;Box-body p-0 blob-wrapper data type-batchfile&quot;&gt;
      
&lt;table class=&quot;highlight tab-size js-file-line-container&quot; data-tab-size=&quot;8&quot;&gt;
      &lt;tbody&gt;&lt;tr&gt;
        &lt;td id=&quot;file-dvc-bat-L1&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;1&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-dvc-bat-LC1&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;# This is a DVC-based script to manage machine-learning pipeline &lt;span class=&quot;pl-k&quot;&gt;for&lt;/span&gt; a project per&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-dvc-bat-L2&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;2&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-dvc-bat-LC2&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;# https://inclass.kaggle.com/c/pred-411-2016-04-u3-wine/&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-dvc-bat-L3&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;3&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-dvc-bat-LC3&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-dvc-bat-L4&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;4&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-dvc-bat-LC4&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-k&quot;&gt;mkdir&lt;/span&gt; R_DVC_GITHUB_CODE&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-dvc-bat-L5&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;5&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-dvc-bat-LC5&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-k&quot;&gt;cd&lt;/span&gt; R_DVC_GITHUB_CODE&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-dvc-bat-L6&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;6&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-dvc-bat-LC6&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-dvc-bat-L7&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;7&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-dvc-bat-LC7&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;# clone the github repo with the code&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-dvc-bat-L8&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;8&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-dvc-bat-LC8&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;git clone https://github.com/gvyshnya/DVC_R_Ensemble&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-dvc-bat-L9&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;9&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-dvc-bat-LC9&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-dvc-bat-L10&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;10&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-dvc-bat-LC10&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;# initialize DVC&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-dvc-bat-L11&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;11&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-dvc-bat-LC11&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;$ dvc init&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-dvc-bat-L12&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;12&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-dvc-bat-LC12&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-dvc-bat-L13&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;13&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-dvc-bat-LC13&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;# import data&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-dvc-bat-L14&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;14&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-dvc-bat-LC14&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;$ dvc import https://inclass.kaggle.com/c/pred-411-2016-04-u3-wine/download/wine.csv data/&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-dvc-bat-L15&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;15&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-dvc-bat-LC15&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;$ dvc import https://inclass.kaggle.com/c/pred-411-2016-04-u3-wine/download/wine_test.csv data/&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-dvc-bat-L16&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;16&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-dvc-bat-LC16&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-dvc-bat-L17&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;17&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-dvc-bat-LC17&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;# run data pre-processing&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-dvc-bat-L18&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;18&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-dvc-bat-LC18&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;$ dvc run Rscript --vanilla code/preprocessing.R data/wine.csv data/wine_test.csv data/training_imputed.csv data/testing_imputed.csv&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-dvc-bat-L19&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;19&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-dvc-bat-LC19&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-dvc-bat-L20&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;20&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-dvc-bat-LC20&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;# run LR model fit and forecasting&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-dvc-bat-L21&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;21&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-dvc-bat-LC21&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;$ dvc run Rscript --vanilla code/LR.R data/training_imputed.csv data/testing_imputed.csv 0.7 &lt;span class=&quot;pl-c1&quot;&gt;825&lt;/span&gt; data/submission_LR.csv code/config.R&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-dvc-bat-L22&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;22&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-dvc-bat-LC22&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-dvc-bat-L23&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;23&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-dvc-bat-LC23&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;# run GBM model fit and forecasting&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-dvc-bat-L24&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;24&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-dvc-bat-LC24&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;$ dvc run Rscript --vanilla code/GBM.R data/training_imputed.csv data/testing_imputed.csv &lt;span class=&quot;pl-c1&quot;&gt;5000&lt;/span&gt; &lt;span class=&quot;pl-c1&quot;&gt;10&lt;/span&gt; &lt;span class=&quot;pl-c1&quot;&gt;4&lt;/span&gt; &lt;span class=&quot;pl-c1&quot;&gt;25&lt;/span&gt; data/submission_GBM.csv code/config.R&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-dvc-bat-L25&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;25&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-dvc-bat-LC25&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-dvc-bat-L26&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;26&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-dvc-bat-LC26&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;# run XGBOOST model fit and forecasting&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-dvc-bat-L27&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;27&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-dvc-bat-LC27&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;$ dvc run Rscript --vanilla code/xgboost.R data/training_imputed.csv data/testing_imputed.csv &lt;span class=&quot;pl-c1&quot;&gt;1000&lt;/span&gt; &lt;span class=&quot;pl-c1&quot;&gt;10&lt;/span&gt; 0.0001 1.0 data/submission_xgboost.csv code/config.R&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-dvc-bat-L28&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;28&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-dvc-bat-LC28&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-dvc-bat-L29&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;29&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-dvc-bat-LC29&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;# prepare ensemble submission&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-dvc-bat-L30&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;30&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-dvc-bat-LC30&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;# Note: please make sure to &lt;span class=&quot;pl-k&quot;&gt;edit&lt;/span&gt; your code/config.R to &lt;span class=&quot;pl-k&quot;&gt;set&lt;/span&gt; up the references to the predictions from each model according&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-dvc-bat-L31&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;31&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-dvc-bat-LC31&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;# to the names of output files on the steps above&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-dvc-bat-L32&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;32&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-dvc-bat-LC32&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;$ dvc run Rscript --vanilla code/ensemble.R data/submission_ensemble.csv code/config.R&lt;/td&gt;
      &lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;


  &lt;/div&gt;

  &lt;/div&gt;
&lt;/div&gt;

      &lt;/div&gt;
      &lt;div class=&quot;gist-meta&quot;&gt;
        &lt;a href=&quot;https://gist.github.com/gvyshnya/7f1b8262e3eb7a8b3c16dbfd8cf98644/raw/4818eab6c2f99722110a37c7d2c509c78ce4240a/dvc.bat&quot; style=&quot;float:right&quot;&gt;view raw&lt;/a&gt;
        &lt;a href=&quot;https://gist.github.com/gvyshnya/7f1b8262e3eb7a8b3c16dbfd8cf98644#file-dvc-bat&quot;&gt;dvc.bat&lt;/a&gt;
        hosted with ❤ by &lt;a href=&quot;https://github.com&quot;&gt;GitHub&lt;/a&gt;
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;&lt;/body&gt;&lt;/html&gt;&lt;/p&gt;
&lt;p&gt;If you then further edit the ensemble configuration in &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;code/config.R&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;, you
can rely on DVC’s automatic dependency resolution and tracking to rebuild the
ensemble prediction as follows:&lt;/p&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;div id=&quot;gist74997297&quot; class=&quot;gist&quot;&gt;
    &lt;div class=&quot;gist-file&quot;&gt;
      &lt;div class=&quot;gist-data&quot;&gt;
        &lt;div class=&quot;js-gist-file-update-container js-task-list-container file-box&quot;&gt;
  &lt;div id=&quot;file-dvc-repro-code&quot; class=&quot;file&quot;&gt;
    

  &lt;div itemprop=&quot;text&quot; class=&quot;Box-body p-0 blob-wrapper data type-text&quot;&gt;
      
&lt;table class=&quot;highlight tab-size js-file-line-container&quot; data-tab-size=&quot;8&quot;&gt;
      &lt;tbody&gt;&lt;tr&gt;
        &lt;td id=&quot;file-dvc-repro-code-L1&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;1&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-dvc-repro-code-LC1&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;# Improve ensemble configuration&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-dvc-repro-code-L2&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;2&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-dvc-repro-code-LC2&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;$ vi code/config.R&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-dvc-repro-code-L3&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;3&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-dvc-repro-code-LC3&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-dvc-repro-code-L4&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;4&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-dvc-repro-code-LC4&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;# Commit all the changes.&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-dvc-repro-code-L5&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;5&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-dvc-repro-code-LC5&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;$ git commit -am &quot;Updated weights of the models in the ensemble&quot;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-dvc-repro-code-L6&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;6&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-dvc-repro-code-LC6&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-dvc-repro-code-L7&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;7&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-dvc-repro-code-LC7&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;# Reproduce the ensemble prediction&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-dvc-repro-code-L8&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;8&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-dvc-repro-code-LC8&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;$ dvc repro data/submission_ensemble.csv&lt;/td&gt;
      &lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;


  &lt;/div&gt;

  &lt;/div&gt;
&lt;/div&gt;

      &lt;/div&gt;
      &lt;div class=&quot;gist-meta&quot;&gt;
        &lt;a href=&quot;https://gist.github.com/gvyshnya/9d80e51ba3d7aa5bd37d100ed82376ee/raw/4367adacf7f6d78ad223289c52737588441fabcb/dvc%20repro%20code&quot; style=&quot;float:right&quot;&gt;view raw&lt;/a&gt;
        &lt;a href=&quot;https://gist.github.com/gvyshnya/9d80e51ba3d7aa5bd37d100ed82376ee#file-dvc-repro-code&quot;&gt;dvc repro code&lt;/a&gt;
        hosted with ❤ by &lt;a href=&quot;https://github.com&quot;&gt;GitHub&lt;/a&gt;
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;&lt;/body&gt;&lt;/html&gt;&lt;/p&gt;
&lt;h2&gt;Summary&lt;/h2&gt;
&lt;p&gt;In this blog post, we worked through the process of building an ensemble
prediction pipeline using DVC. The key features of that pipeline were as
follows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;em&gt;reproducibility&lt;/em&gt;&lt;/strong&gt;: everybody on the team can run it in their own environment&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;em&gt;separation of data and code&lt;/em&gt;&lt;/strong&gt;: this ensures everyone always runs the
latest versions of the pipeline jobs with the most up-to-date ‘golden copy’ of
the training and testing data sets&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A helpful side effect of using DVC is that you no longer need to keep track of
what changed at every step of modifying your project scripts or the pipeline
configuration. Because DVC maintains the dependency graph (DAG) automatically,
it triggers only the steps affected by a particular change within the pipeline
job setup. This, in turn, makes it possible to iterate quickly through the
entire ML pipeline.&lt;/p&gt;
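&lt;p&gt;How a change propagates through such a dependency graph can be sketched in a few lines. The edge list below is a simplified, assumed reading of the pipeline in this post, not output produced by DVC, and the &lt;code class=&quot;language-text&quot;&gt;affected_outputs&lt;/code&gt; helper is hypothetical.&lt;/p&gt;

```shell
#!/bin/sh
# Hypothetical sketch of propagating a change through a dependency graph.
# The edge list is an assumed, simplified view of the pipeline, not DVC output.

# Each line reads "dependency output".
edges='data/wine.csv data/training_imputed.csv
data/training_imputed.csv data/submission_LR.csv
data/training_imputed.csv data/submission_GBM.csv
data/training_imputed.csv data/submission_xgboost.csv
code/GBM.R data/submission_GBM.csv
data/submission_LR.csv data/submission_ensemble.csv
data/submission_GBM.csv data/submission_ensemble.csv
data/submission_xgboost.csv data/submission_ensemble.csv'

# affected_outputs FILE: print, in sorted order, every output downstream of FILE.
affected_outputs() {
    printf '%s\n' "$edges" | awk -v start="$1" '
        { dep[NR] = $1; out[NR] = $2 }
        END {
            dirty[start] = 1
            grew = 1
            while (grew) {                  # repeat until no new output is marked
                grew = 0
                for (i = NR; i >= 1; i--) {
                    if (dep[i] in dirty) {
                        if (!(out[i] in dirty)) { dirty[out[i]] = 1; grew = 1 }
                    }
                }
            }
            for (f in dirty) if (f != start) print f
        }' | LC_ALL=C sort
}

gbm_change=$(affected_outputs code/GBM.R)
echo "outputs to rebuild after editing code/GBM.R:"
echo "$gbm_change"
```

&lt;p&gt;In this simplified graph, editing the GBM script marks only the GBM submission and the ensemble submission for rebuilding. The LR and xgboost predictions stay untouched.&lt;/p&gt;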
&lt;blockquote&gt;
&lt;p&gt;As DVC brings proven engineering practices to often suboptimal and messy ML
processes and helps a typical Data Science project team eliminate a
big chunk of common
&lt;a href=&quot;https://blog.dataversioncontrol.com/data-version-control-in-analytics-devops-paradigm-35a880e99133&quot;&gt;DevOps overheads&lt;/a&gt;,
I found it extremely useful to leverage DVC in industrial data science and
predictive analytics projects.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;Further Reading&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/Ensemble_learning&quot;&gt;Ensemble Learning and Prediction Introduction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://blog.dataversioncontrol.com/data-version-control-beta-release-iterative-machine-learning-a7faf7c8be67&quot;&gt;Using DVC in Machine Learning projects in Python&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://blog.dataversioncontrol.com/r-code-and-reproducible-model-development-with-dvc-1507a0e3687b&quot;&gt;Using DVC in Machine Learning projects in R&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://mlwave.com/kaggle-ensembling-guide/&quot;&gt;Kaggle Ensembling Guide&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;</content:encoded></item><item><title><![CDATA[Data Version Control in Analytics DevOps Paradigm]]></title><link>https://blog.dvc.org/data-version-control-in-analytics-devops-paradigm</link><guid isPermaLink="false">https://blog.dvc.org/data-version-control-in-analytics-devops-paradigm</guid><pubDate>Thu, 27 Jul 2017 00:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;Data Science and DevOps Convergence&lt;/h2&gt;
&lt;p&gt;The primary mission of DevOps is to help teams resolve various Tech Ops
infrastructure, tooling, and pipeline issues.&lt;/p&gt;
&lt;p&gt;On the other hand, as mentioned in the conceptual review by
&lt;a href=&quot;https://www.forbes.com/sites/teradata/2016/11/14/devops-for-data-science-why-analytics-ops-is-key-to-value/&quot;&gt;Forbes&lt;/a&gt;
in November 2016, industrial analytics is no longer going to be driven by data
scientists alone. It requires an investment in DevOps skills, practices, and
supporting technology to move analytics out of the lab and into the business.
There are even
&lt;a href=&quot;https://www.computing.co.uk/ctg/news/2433095/a-lot-of-companies-will-stop-hiring-data-scientists-when-they-realise-that-the-majority-bring-no-value-says-data-scientist&quot;&gt;voices&lt;/a&gt;
calling on Data Scientists to concentrate on agile methodology and DevOps if they
want to keep their jobs in the long run.&lt;/p&gt;
&lt;h2&gt;Why DevOps Matters&lt;/h2&gt;
&lt;p&gt;The eternal dream of almost every Data Scientist today is to spend all (well,
almost all) of their time in the office exploring new datasets, engineering
decisive new features, and inventing and validating cool new algorithms and
strategies. However, reality is often different. One of the unfortunate daily
routines of a Data Scientist’s work is raw data pre-processing. It usually comes
down to the following challenges:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Pull all kinds of necessary data from a variety of sources&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Internal data sources like ERP, CRM, POS systems, or data from online
e-commerce platforms&lt;/li&gt;
&lt;li&gt;External data, like weather, public holidays, Google trends etc.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Extract, transform, and load the data&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Relate and join the data sources&lt;/li&gt;
&lt;li&gt;Aggregate and transform the data&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Avoid technical and performance drawbacks&lt;/strong&gt; when everything ends up in
“one big table” at the end&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Facilitate continuous machine learning and decision-making in a
business-ready framework&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Utilize historic data to train the machine learning models and algorithms&lt;/li&gt;
&lt;li&gt;Use the current, up-to-date data for decision-making&lt;/li&gt;
&lt;li&gt;Export the resulting decisions/recommendations for review by business
stakeholders, either back into the ERP system or into some other data warehouse&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Another big challenge is to organize &lt;strong&gt;collaboration and data/model sharing&lt;/strong&gt;
inside and across the boundaries of teams of Data Scientists and Software
Engineers.&lt;/p&gt;
&lt;p&gt;DevOps skills as well as effective instruments will certainly be beneficial for
industrial Data Scientists as they can address the above-mentioned challenges in
a self-service manner.&lt;/p&gt;
&lt;h2&gt;Can DVC Be a Solution?&lt;/h2&gt;
&lt;p&gt;&lt;a href=&quot;https://dvc.org&quot;&gt;Data Version Control&lt;/a&gt;, or simply DVC, comes onto the scene
whenever you start looking for effective DevOps-for-Analytics instruments.&lt;/p&gt;
&lt;p&gt;DVC is an open source tool for data science projects. It makes your data science
projects reproducible by automatically building a data dependency graph (DAG).
Your code and its dependencies can be easily shared through Git, and your data
through cloud storage (AWS S3, GCP), all in a single DVC environment.&lt;/p&gt;
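&lt;p&gt;To make the idea of a dependency graph more concrete, here is a minimal sketch
in R. It is a toy illustration under stated assumptions, not DVC’s
implementation: the stage names, the file names and the
&lt;code&gt;stale_stages&lt;/code&gt; helper are all hypothetical, and the stages are
assumed to be listed in a valid run order. Each stage declares the files it reads
and writes, and a change to one file marks every downstream stage as needing a
re-run.&lt;/p&gt;

```r
# Toy pipeline: each stage lists the files it reads (deps) and writes (outs).
# Stage and file names are hypothetical, chosen only for illustration.
stages = list(
  extract  = list(deps = c("raw.xml"),               outs = c("posts.csv")),
  split    = list(deps = c("posts.csv"),             outs = c("train.csv", "test.csv")),
  train    = list(deps = c("train.csv"),             outs = c("model.rds")),
  evaluate = list(deps = c("model.rds", "test.csv"), outs = c("metrics.txt"))
)

# Return the names of the stages that must be re-run after the files in
# `changed` were modified. Assumes `stages` is listed in a valid run order.
stale_stages = function(stages, changed) {
  dirty = changed
  stale = character(0)
  for (name in names(stages)) {
    stage = stages[[name]]
    if (any(stage$deps %in% dirty)) {
      stale = c(stale, name)
      # The outputs of a re-run stage count as changed for later stages.
      dirty = union(dirty, stage$outs)
    }
  }
  stale
}

print(stale_stages(stages, "train.csv"))  # only "train" and "evaluate"
print(stale_stages(stages, "raw.xml"))    # all four stages
```

&lt;p&gt;In this sketch, changing only the training data leaves the extraction and
splitting stages untouched, which is the kind of saving a dependency graph is
meant to provide.&lt;/p&gt;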
&lt;blockquote&gt;
&lt;p&gt;Although DVC was
&lt;a href=&quot;https://dvc.org/doc/understanding-dvc/what-is-dvc&quot;&gt;originally&lt;/a&gt;
created for machine learning developers and data scientists, it has proved
useful beyond that audience. Since it brings proven engineering practices to the
not well-defined ML process, I discovered it to have enormous potential as an
Analytical DevOps instrument.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It clearly helps to manage a large share of the DevOps issues in a Data
Scientist’s daily routine:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Pull all kinds of necessary data from a variety of sources&lt;/strong&gt;. Once you
configure and script your data extraction jobs with DVC, they will be
persistent and operable across your data and service infrastructure.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Extract, transform, and load the data&lt;/strong&gt;. ETL is going to be easy and
repeatable once you configure it with DVC scripting. It will become a solid
pipeline to operate without major supportive effort. Moreover, it will track
all changes and trigger an alert for updates in the pipeline steps via DAG.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Facilitate continuous machine learning and decision-making.&lt;/strong&gt; Part of
the pipeline scripted with DVC can be jobs that upload data back to any
transactional system (like ERP, ERM, CRM etc.), warehouse or data mart. The
results will then be exposed to business stakeholders so that they can make
intelligent data-driven decisions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Share your algorithms and data&lt;/strong&gt;. Machine Learning modeling is an iterative
process, and it is extremely important to keep track of your steps, the
dependencies between the steps, the dependencies between your code and data
files, and all the arguments your code runs with. This becomes even more
important and complicated in a team environment, where collaboration between
data scientists takes a serious amount of the team’s effort. DVC is the tool to
help you with it.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;One of the ‘juicy’ features of DVC is its ability to support multiple technology
stacks. Whether you prefer R or use promising Python-based implementations for
your industrial data products, DVC will be able to support your pipeline
properly. You can see it in action for both
&lt;a href=&quot;https://blog.dvc.org/how-a-data-scientist-can-improve-his-productivity&quot;&gt;Python-based&lt;/a&gt;
and
&lt;a href=&quot;https://blog.dvc.org/r-code-and-reproducible-model-development-with-dvc&quot;&gt;R-based&lt;/a&gt;
technical stacks.&lt;/p&gt;
&lt;p&gt;As such, DVC is going to be one of the tools you will enjoy using if/when you
embark on building a continual analytical environment for your system or across
your organization.&lt;/p&gt;
&lt;h2&gt;Continual Analytical Environment and DevOps&lt;/h2&gt;
&lt;p&gt;Building a production pipeline is quite different from building a
machine-learning prototype on a local laptop. Many teams and companies face
challenges there.&lt;/p&gt;
&lt;p&gt;At the bare minimum, the following requirements must be met when you move your
solution into production:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Periodic re-training of the models/algorithms&lt;/li&gt;
&lt;li&gt;Ease of re-deployment and configuration changes in the system&lt;/li&gt;
&lt;li&gt;Efficient, high-performance real-time scoring of new out-of-sample
observations&lt;/li&gt;
&lt;li&gt;Ability to monitor model performance over time&lt;/li&gt;
&lt;li&gt;Adaptive ETL and ability to manage new data feeds and transactional systems
as data sources for AI and machine learning tools&lt;/li&gt;
&lt;li&gt;Scaling to really big data operations&lt;/li&gt;
&lt;li&gt;Security and authorized access levels to different areas of the analytical
systems&lt;/li&gt;
&lt;li&gt;Solid backup and recovery processes/tools&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This goes into the territory traditionally inhabited by DevOps. Data Scientists
should ideally learn to handle part of those requirements themselves, or at
least be well-informed consultants to classical DevOps gurus.&lt;/p&gt;
&lt;p&gt;DVC can help in many aspects of the production scenario above, as it can
orchestrate relevant tools and instruments through its scripting. In such a
setup, DVC scripts will be a shareable manifestation (and implementation) of your
production pipeline, where each step can be transparently reviewed, easily
maintained, and changed as needed over time.&lt;/p&gt;
&lt;h2&gt;Will DevOps Be Captivating?&lt;/h2&gt;
&lt;p&gt;If you are interested in understanding the ever-growing role of DevOps in
modern Data Science and predictive analytics in business, here are some good
resources for your review:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href=&quot;https://www.forbes.com/sites/teradata/2016/11/14/devops-for-data-science-why-analytics-ops-is-key-to-value/&quot;&gt;DevOps For Data Science: Why Analytics Ops Is Key To Value&lt;/a&gt;
(Forbes, Nov 14, 2016)&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.packtpub.com/books/content/bridging-gap-between-data-science-and-devops&quot;&gt;Bridging the Gap Between Data Science and DevOps&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://devops.com/devops-life-better-data-scientists/&quot;&gt;Is DevOps Making Life Better for Data Scientists?&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;In any case, DVC is going to be a useful instrument for filling the multiple gaps
between classical, old-school in-lab data science practices and the growing
demands of business to build solid DevOps processes and workflows to streamline
mature and persistent data analytics.&lt;/p&gt;</content:encoded></item><item><title><![CDATA[R code and reproducible model development with DVC]]></title><link>https://blog.dvc.org/r-code-and-reproducible-model-development-with-dvc</link><guid isPermaLink="false">https://blog.dvc.org/r-code-and-reproducible-model-development-with-dvc</guid><pubDate>Mon, 24 Jul 2017 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;a href=&quot;https://dvc.org&quot;&gt;DVC&lt;/a&gt; or Data Version Control tool — its idea is to track
file/data dependencies during model development in order to facilitate
reproducibility and to version data files. Most of the
&lt;a href=&quot;https://dvc.org/doc/tutorials&quot;&gt;DVC tutorials&lt;/a&gt; provide good examples of using
DVC with Python. However, I realized that DVC is a
&lt;a href=&quot;https://en.wikipedia.org/wiki/Language-agnostic&quot;&gt;language agnostic&lt;/a&gt; tool and
can be used with any programming language. In this blog post, we will see how to
use DVC in R projects.&lt;/p&gt;
&lt;h2&gt;R coding — keep it simple and readable&lt;/h2&gt;
&lt;p&gt;Model development is always a combination of the steps presented below:&lt;/p&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;span class=&quot;gatsby-resp-image-wrapper&quot; style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto;  max-width: 342px;&quot;&gt;
      &lt;a class=&quot;gatsby-resp-image-link&quot; href=&quot;/static/fdf37f71d0c9ecd4d9f1b7f0ec446abf/3dead/development-steps.png&quot; style=&quot;display: block&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;
    &lt;span class=&quot;gatsby-resp-image-background-image&quot; style=&quot;padding-bottom: 27.046783625730995%; position: relative; bottom: 0; left: 0; background-image: url(&amp;#x27;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAFCAYAAABFA8wzAAAACXBIWXMAAAsSAAALEgHS3X78AAABgUlEQVQY00VPO0sjURgddAXRYlX0f1hoL1usICrYqfgqBNElsCsSxEYLQcFOxEbwFYiCMIKPJDObzNw7c19z585MoojPVBb+B4P5vFZ+cOA7cDgPw2eyruLkI6ncwn/EU0yEj0n5Bu7uH6GIxRgmYvvh6VnzB8A02LddPhonNxCqGJiQL7kSXfApB4+wOqG8bpQcVDs3zXfHcYEE8V/s+feWZYEMQ/B4NEl5sINcF1wNyuUhZmoid30Nx8cZwBg/69CUaZqQyWRqtm3XDBlGYGsDpSIgQi0xLl4rlTJUq1XwmJqmTOzFUQRSSt2CnxARTyVxAhhh4Fy8FUr+P6tQgGz2BBzHAYPwMK/issVlhCyXjWi+L/Svp1Mb8V+IBH+kSpiG79FgUWv6uFRIa1zd+KDoySHCghLymSUClTe+b7lxbJP8MJp2G7oXLpqH19yW1SPZmt6TLb3zZ609c6ed4+t22+xWseN3+rJjcOWqa2Al/3Nmw2rvX853dk3kmr9cPgEq+g6Kf1zjjgAAAABJRU5ErkJggg==&amp;#x27;); background-size: cover; display: block;&quot;&gt;&lt;/span&gt;
  &lt;picture&gt;
        &lt;source srcset=&quot;/static/fdf37f71d0c9ecd4d9f1b7f0ec446abf/c54d4/development-steps.webp 175w, /static/fdf37f71d0c9ecd4d9f1b7f0ec446abf/a3432/development-steps.webp 350w, /static/fdf37f71d0c9ecd4d9f1b7f0ec446abf/e213b/development-steps.webp 684w&quot; sizes=&quot;(max-width: 684px) 100vw, 684px&quot; type=&quot;image/webp&quot;&gt;
        &lt;source srcset=&quot;/static/fdf37f71d0c9ecd4d9f1b7f0ec446abf/17006/development-steps.png 175w, /static/fdf37f71d0c9ecd4d9f1b7f0ec446abf/d6f3f/development-steps.png 350w, /static/fdf37f71d0c9ecd4d9f1b7f0ec446abf/3dead/development-steps.png 684w&quot; sizes=&quot;(max-width: 684px) 100vw, 684px&quot; type=&quot;image/png&quot;&gt;
        &lt;img class=&quot;gatsby-resp-image-image&quot; src=&quot;/static/fdf37f71d0c9ecd4d9f1b7f0ec446abf/3dead/development-steps.png&quot; alt=&quot;Model development process&quot; title=&quot;Model development process&quot; loading=&quot;lazy&quot; style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;&gt;
      &lt;/picture&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/body&gt;&lt;/html&gt;
&lt;em&gt;Model development process&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Because the process is iterative, it is very important to build good coding
and organizational habits. For example, instead of having one big R file with
all the code, it is better to split the code into several logical files, each
responsible for one small piece of work. It is smart to track the development
history with
&lt;a href=&quot;https://git-scm.com/book/en/v2/Getting-Started-About-Version-Control&quot;&gt;git&lt;/a&gt;.
Writing &lt;em&gt;“reusable code”&lt;/em&gt; is a nice skill to have. Putting comments in
the code can make our lives easier.&lt;/p&gt;
&lt;p&gt;Besides git, the next step toward further improvement is to try out DVC.
Every time a change is committed to the code or the data sets, DVC can
reproduce the results with just one command in a bash shell on Linux (or in a
Windows environment). It memorizes the dependencies among files and code, so it
can repeat all the necessary steps for us, and we do not have to worry about
their order.&lt;/p&gt;
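&lt;p&gt;As a rough illustration of how a tool can notice that a step needs to be
repeated, the sketch below compares file checksums in R. It is a simplified
assumption about the mechanism rather than DVC’s actual code: the file name and
the &lt;code&gt;snapshot&lt;/code&gt; and &lt;code&gt;needs_rerun&lt;/code&gt; helpers are
hypothetical, and only &lt;code&gt;tools::md5sum&lt;/code&gt; comes from R itself.&lt;/p&gt;

```r
# Sketch: decide whether a step must run again by comparing file checksums.
# The file name and helper functions are hypothetical.
workdir = tempfile("dvc_sketch_")
dir.create(workdir)
data_file = file.path(workdir, "Posts.csv")
writeLines(c("ID,label,text", "1,0,hello"), data_file)

# Record a checksum for every dependency of a step.
snapshot = function(files) unname(tools::md5sum(files))

# A step is out of date when any checksum differs from the recorded one.
needs_rerun = function(files, recorded) !identical(snapshot(files), recorded)

recorded = snapshot(data_file)
print(needs_rerun(data_file, recorded))  # FALSE: nothing has changed yet

# Simulate an edit to the data set.
writeLines(c("ID,label,text", "1,1,hello"), data_file)
print(needs_rerun(data_file, recorded))  # TRUE: the dependency changed
```

&lt;p&gt;Comparing content checksums rather than timestamps means that merely touching
a file does not trigger a re-run; only a real change in its content does.&lt;/p&gt;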
&lt;h2&gt;R example — data and code clarification&lt;/h2&gt;
&lt;p&gt;We’ll take a Python example from the
&lt;a href=&quot;https://dvc.org/doc/tutorials/deep&quot;&gt;DVC tutorial&lt;/a&gt; (written by Dmitry Petrov)
and rewrite that code in R. With this example, we’ll show how DVC can help
during development and what it is capable of.&lt;/p&gt;
&lt;p&gt;First, let’s initialize git and DVC for this example and run our code for
the first time. After that, we will simulate some changes in the code and see
how DVC handles reproducibility.&lt;/p&gt;
&lt;p&gt;The R code can be downloaded from the
&lt;a href=&quot;https://github.com/Zoldin/R_AND_DVC&quot;&gt;GitHub repository&lt;/a&gt;. A brief explanation of
the scripts is presented below:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;parsingxml.R&lt;/strong&gt;: takes the XML file that we downloaded from the web and
creates the corresponding CSV file.&lt;/p&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;div id=&quot;gist71114089&quot; class=&quot;gist&quot;&gt;
    &lt;div class=&quot;gist-file&quot;&gt;
      &lt;div class=&quot;gist-data&quot;&gt;
        &lt;div class=&quot;js-gist-file-update-container js-task-list-container file-box&quot;&gt;
  &lt;div id=&quot;file-parsingxml-r&quot; class=&quot;file&quot;&gt;
    

  &lt;div itemprop=&quot;text&quot; class=&quot;Box-body p-0 blob-wrapper data type-r&quot;&gt;
      
&lt;table class=&quot;highlight tab-size js-file-line-container&quot; data-tab-size=&quot;8&quot;&gt;
      &lt;tbody&gt;&lt;tr&gt;
        &lt;td id=&quot;file-parsingxml-r-L1&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;1&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-parsingxml-r-LC1&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-c&quot;&gt;&lt;span class=&quot;pl-c&quot;&gt;#&lt;/span&gt;!/usr/bin/Rscript&lt;/span&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-parsingxml-r-L2&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;2&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-parsingxml-r-LC2&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;library(&lt;span class=&quot;pl-smi&quot;&gt;XML&lt;/span&gt;)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-parsingxml-r-L3&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;3&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-parsingxml-r-LC3&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-parsingxml-r-L4&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;4&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-parsingxml-r-LC4&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-v&quot;&gt;args&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;=&lt;/span&gt; commandArgs(&lt;span class=&quot;pl-v&quot;&gt;trailingOnly&lt;/span&gt;&lt;span class=&quot;pl-k&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;pl-c1&quot;&gt;TRUE&lt;/span&gt;)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-parsingxml-r-L5&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;5&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-parsingxml-r-LC5&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-k&quot;&gt;if&lt;/span&gt; (&lt;span class=&quot;pl-k&quot;&gt;!&lt;/span&gt;length(&lt;span class=&quot;pl-smi&quot;&gt;args&lt;/span&gt;)&lt;span class=&quot;pl-k&quot;&gt;==&lt;/span&gt;&lt;span class=&quot;pl-c1&quot;&gt;2&lt;/span&gt;) {&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-parsingxml-r-L6&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;6&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-parsingxml-r-LC6&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;  stop(&lt;span class=&quot;pl-s&quot;&gt;&lt;span class=&quot;pl-pds&quot;&gt;&quot;&lt;/span&gt;Two arguments must be supplied (input file name, output file name - csv ext).\n&lt;span class=&quot;pl-pds&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt;, &lt;span class=&quot;pl-v&quot;&gt;call.&lt;/span&gt;&lt;span class=&quot;pl-k&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;pl-c1&quot;&gt;FALSE&lt;/span&gt;)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-parsingxml-r-L7&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;7&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-parsingxml-r-LC7&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;} &lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-parsingxml-r-L8&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;8&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-parsingxml-r-LC8&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-parsingxml-r-L9&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;9&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-parsingxml-r-LC9&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-parsingxml-r-L10&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;10&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-parsingxml-r-LC10&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-c&quot;&gt;&lt;span class=&quot;pl-c&quot;&gt;#&lt;/span&gt;read XML line by line&lt;/span&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-parsingxml-r-L11&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;11&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-parsingxml-r-LC11&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-smi&quot;&gt;con&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;&amp;#x3C;-&lt;/span&gt; file(&lt;span class=&quot;pl-smi&quot;&gt;args&lt;/span&gt;[&lt;span class=&quot;pl-c1&quot;&gt;1&lt;/span&gt;], &lt;span class=&quot;pl-s&quot;&gt;&lt;span class=&quot;pl-pds&quot;&gt;&quot;&lt;/span&gt;r&lt;span class=&quot;pl-pds&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt;)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-parsingxml-r-L12&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;12&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-parsingxml-r-LC12&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-smi&quot;&gt;lines&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;&amp;#x3C;-&lt;/span&gt; readLines(&lt;span class=&quot;pl-smi&quot;&gt;con&lt;/span&gt;, &lt;span class=&quot;pl-k&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;pl-c1&quot;&gt;1&lt;/span&gt;)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-parsingxml-r-L13&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;13&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-parsingxml-r-LC13&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-smi&quot;&gt;test&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;&amp;#x3C;-&lt;/span&gt; lapply(&lt;span class=&quot;pl-smi&quot;&gt;lines&lt;/span&gt;,&lt;span class=&quot;pl-k&quot;&gt;function&lt;/span&gt;(&lt;span class=&quot;pl-smi&quot;&gt;x&lt;/span&gt;){&lt;span class=&quot;pl-k&quot;&gt;return&lt;/span&gt;(xmlTreeParse(&lt;span class=&quot;pl-smi&quot;&gt;x&lt;/span&gt;,&lt;span class=&quot;pl-v&quot;&gt;useInternalNodes&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;pl-c1&quot;&gt;TRUE&lt;/span&gt;))})&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-parsingxml-r-L14&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;14&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-parsingxml-r-LC14&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-parsingxml-r-L15&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;15&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-parsingxml-r-LC15&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-c&quot;&gt;&lt;span class=&quot;pl-c&quot;&gt;#&lt;/span&gt;parsing XML to get variables&lt;/span&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-parsingxml-r-L16&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;16&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-parsingxml-r-LC16&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-smi&quot;&gt;ID&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;&amp;#x3C;-&lt;/span&gt; as.numeric(sapply(&lt;span class=&quot;pl-smi&quot;&gt;test&lt;/span&gt;,&lt;span class=&quot;pl-k&quot;&gt;function&lt;/span&gt;(&lt;span class=&quot;pl-smi&quot;&gt;x&lt;/span&gt;){&lt;span class=&quot;pl-k&quot;&gt;return&lt;/span&gt;(xpathSApply(&lt;span class=&quot;pl-smi&quot;&gt;x&lt;/span&gt;, &lt;span class=&quot;pl-s&quot;&gt;&lt;span class=&quot;pl-pds&quot;&gt;&quot;&lt;/span&gt;//row&lt;span class=&quot;pl-pds&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt;,&lt;span class=&quot;pl-smi&quot;&gt;xmlGetAttr&lt;/span&gt;, &lt;span class=&quot;pl-s&quot;&gt;&lt;span class=&quot;pl-pds&quot;&gt;&quot;&lt;/span&gt;Id&lt;span class=&quot;pl-pds&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt;))}))&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-parsingxml-r-L17&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;17&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-parsingxml-r-LC17&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-smi&quot;&gt;Tags&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;&amp;#x3C;-&lt;/span&gt; sapply(&lt;span class=&quot;pl-smi&quot;&gt;test&lt;/span&gt;,&lt;span class=&quot;pl-k&quot;&gt;function&lt;/span&gt;(&lt;span class=&quot;pl-smi&quot;&gt;x&lt;/span&gt;){&lt;span class=&quot;pl-k&quot;&gt;return&lt;/span&gt;(xpathSApply(&lt;span class=&quot;pl-smi&quot;&gt;x&lt;/span&gt;, &lt;span class=&quot;pl-s&quot;&gt;&lt;span class=&quot;pl-pds&quot;&gt;&quot;&lt;/span&gt;//row&lt;span class=&quot;pl-pds&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt;,&lt;span class=&quot;pl-smi&quot;&gt;xmlGetAttr&lt;/span&gt;, &lt;span class=&quot;pl-s&quot;&gt;&lt;span class=&quot;pl-pds&quot;&gt;&quot;&lt;/span&gt;Tags&lt;span class=&quot;pl-pds&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt;))})&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-parsingxml-r-L18&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;18&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-parsingxml-r-LC18&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-smi&quot;&gt;Title&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;&amp;#x3C;-&lt;/span&gt; as.character(sapply(&lt;span class=&quot;pl-smi&quot;&gt;test&lt;/span&gt;,&lt;span class=&quot;pl-k&quot;&gt;function&lt;/span&gt;(&lt;span class=&quot;pl-smi&quot;&gt;x&lt;/span&gt;){&lt;span class=&quot;pl-k&quot;&gt;return&lt;/span&gt;(xpathSApply(&lt;span class=&quot;pl-smi&quot;&gt;x&lt;/span&gt;, &lt;span class=&quot;pl-s&quot;&gt;&lt;span class=&quot;pl-pds&quot;&gt;&quot;&lt;/span&gt;//row&lt;span class=&quot;pl-pds&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt;,&lt;span class=&quot;pl-smi&quot;&gt;xmlGetAttr&lt;/span&gt;, &lt;span class=&quot;pl-s&quot;&gt;&lt;span class=&quot;pl-pds&quot;&gt;&quot;&lt;/span&gt;Title&lt;span class=&quot;pl-pds&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt;))}))&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-parsingxml-r-L19&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;19&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-parsingxml-r-LC19&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-smi&quot;&gt;Body&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;&amp;#x3C;-&lt;/span&gt; as.character(sapply(&lt;span class=&quot;pl-smi&quot;&gt;test&lt;/span&gt;,&lt;span class=&quot;pl-k&quot;&gt;function&lt;/span&gt;(&lt;span class=&quot;pl-smi&quot;&gt;x&lt;/span&gt;){&lt;span class=&quot;pl-k&quot;&gt;return&lt;/span&gt;(xpathSApply(&lt;span class=&quot;pl-smi&quot;&gt;x&lt;/span&gt;, &lt;span class=&quot;pl-s&quot;&gt;&lt;span class=&quot;pl-pds&quot;&gt;&quot;&lt;/span&gt;//row&lt;span class=&quot;pl-pds&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt;,&lt;span class=&quot;pl-smi&quot;&gt;xmlGetAttr&lt;/span&gt;, &lt;span class=&quot;pl-s&quot;&gt;&lt;span class=&quot;pl-pds&quot;&gt;&quot;&lt;/span&gt;Body&lt;span class=&quot;pl-pds&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt;))}))&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-parsingxml-r-L20&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;20&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-parsingxml-r-LC20&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-v&quot;&gt;text&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;=&lt;/span&gt; paste(&lt;span class=&quot;pl-smi&quot;&gt;Title&lt;/span&gt;,&lt;span class=&quot;pl-smi&quot;&gt;Body&lt;/span&gt;)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-parsingxml-r-L21&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;21&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-parsingxml-r-LC21&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-parsingxml-r-L22&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;22&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-parsingxml-r-LC22&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-v&quot;&gt;label&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;=&lt;/span&gt; as.numeric(sapply(&lt;span class=&quot;pl-smi&quot;&gt;Tags&lt;/span&gt;,&lt;span class=&quot;pl-k&quot;&gt;function&lt;/span&gt;(&lt;span class=&quot;pl-smi&quot;&gt;x&lt;/span&gt;){&lt;span class=&quot;pl-k&quot;&gt;return&lt;/span&gt;(grep(&lt;span class=&quot;pl-s&quot;&gt;&lt;span class=&quot;pl-pds&quot;&gt;&quot;&lt;/span&gt;python&lt;span class=&quot;pl-pds&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt;,&lt;span class=&quot;pl-smi&quot;&gt;x&lt;/span&gt;))}))&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-parsingxml-r-L23&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;23&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-parsingxml-r-LC23&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-smi&quot;&gt;label&lt;/span&gt;[is.na(&lt;span class=&quot;pl-smi&quot;&gt;label&lt;/span&gt;)]&lt;span class=&quot;pl-k&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;pl-c1&quot;&gt;0&lt;/span&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-parsingxml-r-L24&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;24&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-parsingxml-r-LC24&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-parsingxml-r-L25&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;25&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-parsingxml-r-LC25&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-c&quot;&gt;&lt;span class=&quot;pl-c&quot;&gt;#&lt;/span&gt;final data frame for export&lt;/span&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-parsingxml-r-L26&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;26&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-parsingxml-r-LC26&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-smi&quot;&gt;df&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;&amp;#x3C;-&lt;/span&gt; as.data.frame(cbind(&lt;span class=&quot;pl-smi&quot;&gt;ID&lt;/span&gt;,&lt;span class=&quot;pl-smi&quot;&gt;label&lt;/span&gt;,&lt;span class=&quot;pl-smi&quot;&gt;text&lt;/span&gt;),&lt;span class=&quot;pl-v&quot;&gt;stringsAsFactors&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;pl-c1&quot;&gt;FALSE&lt;/span&gt;)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-parsingxml-r-L27&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;27&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-parsingxml-r-LC27&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-smi&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;pl-k&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;pl-v&quot;&gt;ID&lt;/span&gt;&lt;span class=&quot;pl-k&quot;&gt;=&lt;/span&gt;as.numeric(&lt;span class=&quot;pl-smi&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;pl-k&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;pl-smi&quot;&gt;ID&lt;/span&gt;)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-parsingxml-r-L28&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;28&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-parsingxml-r-LC28&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-smi&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;pl-k&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;pl-v&quot;&gt;label&lt;/span&gt;&lt;span class=&quot;pl-k&quot;&gt;=&lt;/span&gt;as.numeric(&lt;span class=&quot;pl-smi&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;pl-k&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;pl-smi&quot;&gt;label&lt;/span&gt;)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-parsingxml-r-L29&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;29&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-parsingxml-r-LC29&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-c&quot;&gt;&lt;span class=&quot;pl-c&quot;&gt;#&lt;/span&gt;write to csv&lt;/span&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-parsingxml-r-L30&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;30&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-parsingxml-r-LC30&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;write.csv(&lt;span class=&quot;pl-smi&quot;&gt;df&lt;/span&gt;, &lt;span class=&quot;pl-v&quot;&gt;file&lt;/span&gt;&lt;span class=&quot;pl-k&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;pl-smi&quot;&gt;args&lt;/span&gt;[&lt;span class=&quot;pl-c1&quot;&gt;2&lt;/span&gt;],&lt;span class=&quot;pl-v&quot;&gt;row.names&lt;/span&gt;&lt;span class=&quot;pl-k&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;pl-c1&quot;&gt;FALSE&lt;/span&gt;)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-parsingxml-r-L31&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;31&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-parsingxml-r-LC31&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;print(&lt;span class=&quot;pl-s&quot;&gt;&lt;span class=&quot;pl-pds&quot;&gt;&quot;&lt;/span&gt;output file created....&lt;span class=&quot;pl-pds&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt;)&lt;/td&gt;
      &lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;


  &lt;/div&gt;

  &lt;/div&gt;
&lt;/div&gt;

      &lt;/div&gt;
      &lt;div class=&quot;gist-meta&quot;&gt;
        &lt;a href=&quot;https://gist.github.com/Zoldin/47536af63182a0e8daf37a7b989e2e8d/raw/98b259ade11132ad87e9c4f476b7561b184cf041/parsingxml.R&quot; style=&quot;float:right&quot;&gt;view raw&lt;/a&gt;
        &lt;a href=&quot;https://gist.github.com/Zoldin/47536af63182a0e8daf37a7b989e2e8d#file-parsingxml-r&quot;&gt;parsingxml.R&lt;/a&gt;
        hosted with ❤ by &lt;a href=&quot;https://github.com&quot;&gt;GitHub&lt;/a&gt;
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;&lt;/body&gt;&lt;/html&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;train_test_splitting.R&lt;/strong&gt;: stratified sampling by the target variable (here we
create the train and test data sets)&lt;/p&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;div id=&quot;gist71114469&quot; class=&quot;gist&quot;&gt;
    &lt;div class=&quot;gist-file&quot;&gt;
      &lt;div class=&quot;gist-data&quot;&gt;
        &lt;div class=&quot;js-gist-file-update-container js-task-list-container file-box&quot;&gt;
  &lt;div id=&quot;file-train_test_splitting-r&quot; class=&quot;file&quot;&gt;
    

  &lt;div itemprop=&quot;text&quot; class=&quot;Box-body p-0 blob-wrapper data type-r&quot;&gt;
      
&lt;table class=&quot;highlight tab-size js-file-line-container&quot; data-tab-size=&quot;8&quot;&gt;
      &lt;tbody&gt;&lt;tr&gt;
        &lt;td id=&quot;file-train_test_splitting-r-L1&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;1&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_test_splitting-r-LC1&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-c&quot;&gt;&lt;span class=&quot;pl-c&quot;&gt;#&lt;/span&gt;!/usr/bin/Rscript&lt;/span&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_test_splitting-r-L2&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;2&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_test_splitting-r-LC2&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;library(&lt;span class=&quot;pl-smi&quot;&gt;caret&lt;/span&gt;)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_test_splitting-r-L3&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;3&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_test_splitting-r-LC3&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_test_splitting-r-L4&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;4&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_test_splitting-r-LC4&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-v&quot;&gt;args&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;=&lt;/span&gt; commandArgs(&lt;span class=&quot;pl-v&quot;&gt;trailingOnly&lt;/span&gt;&lt;span class=&quot;pl-k&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;pl-c1&quot;&gt;TRUE&lt;/span&gt;)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_test_splitting-r-L5&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;5&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_test_splitting-r-LC5&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_test_splitting-r-L6&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;6&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_test_splitting-r-LC6&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-k&quot;&gt;if&lt;/span&gt; (length(&lt;span class=&quot;pl-smi&quot;&gt;args&lt;/span&gt;) &lt;span class=&quot;pl-k&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;pl-c1&quot;&gt;5&lt;/span&gt;) {&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_test_splitting-r-L7&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;7&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_test_splitting-r-LC7&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;  stop(&lt;span class=&quot;pl-s&quot;&gt;&lt;span class=&quot;pl-pds&quot;&gt;&quot;&lt;/span&gt;Five arguments must be supplied (input file name, test set split ratio, seed, train output file name, test output file name).\n&lt;span class=&quot;pl-pds&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt;, &lt;span class=&quot;pl-v&quot;&gt;call.&lt;/span&gt;&lt;span class=&quot;pl-k&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;pl-c1&quot;&gt;FALSE&lt;/span&gt;)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_test_splitting-r-L8&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;8&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_test_splitting-r-LC8&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;} &lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_test_splitting-r-L9&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;9&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_test_splitting-r-LC9&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_test_splitting-r-L10&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;10&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_test_splitting-r-LC10&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_test_splitting-r-L11&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;11&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_test_splitting-r-LC11&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;set.seed(as.numeric(&lt;span class=&quot;pl-smi&quot;&gt;args&lt;/span&gt;[&lt;span class=&quot;pl-c1&quot;&gt;3&lt;/span&gt;]))&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_test_splitting-r-L12&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;12&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_test_splitting-r-LC12&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_test_splitting-r-L13&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;13&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_test_splitting-r-LC13&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-smi&quot;&gt;df&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;&amp;#x3C;-&lt;/span&gt; read.csv(&lt;span class=&quot;pl-smi&quot;&gt;args&lt;/span&gt;[&lt;span class=&quot;pl-c1&quot;&gt;1&lt;/span&gt;],&lt;span class=&quot;pl-v&quot;&gt;stringsAsFactors&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;pl-c1&quot;&gt;FALSE&lt;/span&gt;)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_test_splitting-r-L14&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;14&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_test_splitting-r-LC14&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_test_splitting-r-L15&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;15&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_test_splitting-r-LC15&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-smi&quot;&gt;test.index&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;&amp;#x3C;-&lt;/span&gt; createDataPartition(&lt;span class=&quot;pl-smi&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;pl-k&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;pl-smi&quot;&gt;label&lt;/span&gt;, &lt;span class=&quot;pl-v&quot;&gt;p&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;=&lt;/span&gt; as.numeric(&lt;span class=&quot;pl-smi&quot;&gt;args&lt;/span&gt;[&lt;span class=&quot;pl-c1&quot;&gt;2&lt;/span&gt;]), &lt;span class=&quot;pl-v&quot;&gt;list&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;pl-c1&quot;&gt;FALSE&lt;/span&gt;)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_test_splitting-r-L16&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;16&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_test_splitting-r-LC16&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_test_splitting-r-L17&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;17&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_test_splitting-r-LC17&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_test_splitting-r-L18&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;18&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_test_splitting-r-LC18&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-smi&quot;&gt;train&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;&amp;#x3C;-&lt;/span&gt; &lt;span class=&quot;pl-smi&quot;&gt;df&lt;/span&gt;[&lt;span class=&quot;pl-k&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;pl-smi&quot;&gt;test.index&lt;/span&gt;,]&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_test_splitting-r-L19&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;19&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_test_splitting-r-LC19&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-smi&quot;&gt;test&lt;/span&gt;  &lt;span class=&quot;pl-k&quot;&gt;&amp;#x3C;-&lt;/span&gt; &lt;span class=&quot;pl-smi&quot;&gt;df&lt;/span&gt;[&lt;span class=&quot;pl-smi&quot;&gt;test.index&lt;/span&gt;,]&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_test_splitting-r-L20&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;20&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_test_splitting-r-LC20&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_test_splitting-r-L21&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;21&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_test_splitting-r-LC21&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_test_splitting-r-L22&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;22&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_test_splitting-r-LC22&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;write.csv(&lt;span class=&quot;pl-smi&quot;&gt;train&lt;/span&gt;, &lt;span class=&quot;pl-v&quot;&gt;file&lt;/span&gt;&lt;span class=&quot;pl-k&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;pl-smi&quot;&gt;args&lt;/span&gt;[&lt;span class=&quot;pl-c1&quot;&gt;4&lt;/span&gt;],&lt;span class=&quot;pl-v&quot;&gt;row.names&lt;/span&gt;&lt;span class=&quot;pl-k&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;pl-c1&quot;&gt;FALSE&lt;/span&gt;)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_test_splitting-r-L23&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;23&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_test_splitting-r-LC23&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;write.csv(&lt;span class=&quot;pl-smi&quot;&gt;test&lt;/span&gt;, &lt;span class=&quot;pl-v&quot;&gt;file&lt;/span&gt;&lt;span class=&quot;pl-k&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;pl-smi&quot;&gt;args&lt;/span&gt;[&lt;span class=&quot;pl-c1&quot;&gt;5&lt;/span&gt;],&lt;span class=&quot;pl-v&quot;&gt;row.names&lt;/span&gt;&lt;span class=&quot;pl-k&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;pl-c1&quot;&gt;FALSE&lt;/span&gt;)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_test_splitting-r-L24&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;24&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_test_splitting-r-LC24&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;print(&lt;span class=&quot;pl-s&quot;&gt;&lt;span class=&quot;pl-pds&quot;&gt;&quot;&lt;/span&gt;train/test files created....&lt;span class=&quot;pl-pds&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt;)&lt;/td&gt;
      &lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;


  &lt;/div&gt;

  &lt;/div&gt;
&lt;/div&gt;

      &lt;/div&gt;
      &lt;div class=&quot;gist-meta&quot;&gt;
        &lt;a href=&quot;https://gist.github.com/Zoldin/7591c47ce5988cbe087e0038c9a850b9/raw/e2106c39bad8a4ae04e41658bd287ea94ff7437a/train_test_splitting.R&quot; style=&quot;float:right&quot;&gt;view raw&lt;/a&gt;
        &lt;a href=&quot;https://gist.github.com/Zoldin/7591c47ce5988cbe087e0038c9a850b9#file-train_test_splitting-r&quot;&gt;train_test_splitting.R&lt;/a&gt;
        hosted with ❤ by &lt;a href=&quot;https://github.com&quot;&gt;GitHub&lt;/a&gt;
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;&lt;/body&gt;&lt;/html&gt;&lt;/p&gt;
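&lt;p&gt;To make the idea of stratified sampling concrete, here is a minimal base R sketch of what the split above does conceptually. It is not how &lt;code&gt;createDataPartition&lt;/code&gt; is implemented in caret; the helper name &lt;code&gt;stratified_index&lt;/code&gt; and the toy data frame are assumptions made up for illustration.&lt;/p&gt;

```r
# Hypothetical helper (not part of caret): draw a fraction p of the row
# indices from every class, so each class keeps its share in the sample.
stratified_index = function(labels, p) {
  # Group row positions by class label.
  groups = split(seq_along(labels), labels)
  picked = lapply(groups, function(idx) {
    n = ceiling(p * length(idx))
    # Index through sample.int() so a one-element group is handled safely.
    idx[sample.int(length(idx), n)]
  })
  sort(unlist(picked, use.names = FALSE))
}

# Toy data, invented for this example: six rows of class 0, four of class 1.
set.seed(42)
toy = data.frame(ID = 1:10, label = rep(c(0, 1), times = c(6, 4)))

test.index = stratified_index(toy$label, p = 0.5)
test = toy[test.index, ]
train = toy[-test.index, ]

# Each class contributes half of its rows to the test set.
print(table(test$label))
print(table(train$label))
```

&lt;p&gt;Because rows are drawn per class, the 6:4 class ratio of the toy data is preserved in both parts (3:2 in each), which is the property the script relies on when it splits by &lt;code&gt;df$label&lt;/code&gt;.&lt;/p&gt;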
&lt;p&gt;&lt;strong&gt;featurization.R&lt;/strong&gt;: text mining and tf-idf matrix creation. In this step we
create the predictive variables.&lt;/p&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;div id=&quot;gist71113907&quot; class=&quot;gist&quot;&gt;
    &lt;div class=&quot;gist-file&quot;&gt;
      &lt;div class=&quot;gist-data&quot;&gt;
        &lt;div class=&quot;js-gist-file-update-container js-task-list-container file-box&quot;&gt;
  &lt;div id=&quot;file-featurization-r&quot; class=&quot;file&quot;&gt;
    

  &lt;div itemprop=&quot;text&quot; class=&quot;Box-body p-0 blob-wrapper data type-r&quot;&gt;
      
&lt;table class=&quot;highlight tab-size js-file-line-container&quot; data-tab-size=&quot;8&quot;&gt;
      &lt;tbody&gt;&lt;tr&gt;
        &lt;td id=&quot;file-featurization-r-L1&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;1&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-featurization-r-LC1&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-c&quot;&gt;&lt;span class=&quot;pl-c&quot;&gt;#&lt;/span&gt;!/usr/bin/Rscript&lt;/span&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-featurization-r-L2&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;2&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-featurization-r-LC2&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;library(&lt;span class=&quot;pl-smi&quot;&gt;text2vec&lt;/span&gt;)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-featurization-r-L3&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;3&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-featurization-r-LC3&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;library(&lt;span class=&quot;pl-smi&quot;&gt;MASS&lt;/span&gt;)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-featurization-r-L4&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;4&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-featurization-r-LC4&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;library(&lt;span class=&quot;pl-smi&quot;&gt;Matrix&lt;/span&gt;)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-featurization-r-L5&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;5&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-featurization-r-LC5&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-featurization-r-L6&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;6&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-featurization-r-LC6&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-v&quot;&gt;args&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;=&lt;/span&gt; commandArgs(&lt;span class=&quot;pl-v&quot;&gt;trailingOnly&lt;/span&gt;&lt;span class=&quot;pl-k&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;pl-c1&quot;&gt;TRUE&lt;/span&gt;)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-featurization-r-L7&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;7&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-featurization-r-LC7&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-featurization-r-L8&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;8&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-featurization-r-LC8&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-k&quot;&gt;if&lt;/span&gt; (length(&lt;span class=&quot;pl-smi&quot;&gt;args&lt;/span&gt;) &lt;span class=&quot;pl-k&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;pl-c1&quot;&gt;4&lt;/span&gt;) {&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-featurization-r-L9&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;9&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-featurization-r-LC9&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;  stop(&lt;span class=&quot;pl-s&quot;&gt;&lt;span class=&quot;pl-pds&quot;&gt;&quot;&lt;/span&gt;Four arguments must be supplied (train file (csv format), test file (csv format), train output file name and test output file name - txt files).\n&lt;span class=&quot;pl-pds&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt;, &lt;span class=&quot;pl-v&quot;&gt;call.&lt;/span&gt;&lt;span class=&quot;pl-k&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;pl-c1&quot;&gt;FALSE&lt;/span&gt;)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-featurization-r-L10&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;10&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-featurization-r-LC10&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;} &lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-featurization-r-L11&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;11&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-featurization-r-LC11&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-featurization-r-L12&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;12&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-featurization-r-LC12&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-c&quot;&gt;&lt;span class=&quot;pl-c&quot;&gt;#&lt;/span&gt;read input files&lt;/span&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-featurization-r-L13&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;13&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-featurization-r-LC13&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-v&quot;&gt;df_train&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;=&lt;/span&gt; read.csv(&lt;span class=&quot;pl-smi&quot;&gt;args&lt;/span&gt;[&lt;span class=&quot;pl-c1&quot;&gt;1&lt;/span&gt;],&lt;span class=&quot;pl-v&quot;&gt;stringsAsFactors&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;pl-c1&quot;&gt;FALSE&lt;/span&gt;)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-featurization-r-L14&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;14&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-featurization-r-LC14&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-v&quot;&gt;df_test&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;=&lt;/span&gt; read.csv(&lt;span class=&quot;pl-smi&quot;&gt;args&lt;/span&gt;[&lt;span class=&quot;pl-c1&quot;&gt;2&lt;/span&gt;],&lt;span class=&quot;pl-v&quot;&gt;stringsAsFactors&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;pl-c1&quot;&gt;FALSE&lt;/span&gt;)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-featurization-r-L15&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;15&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-featurization-r-LC15&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-featurization-r-L16&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;16&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-featurization-r-LC16&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-c&quot;&gt;&lt;span class=&quot;pl-c&quot;&gt;#&lt;/span&gt;create vocabulary - words&lt;/span&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-featurization-r-L17&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;17&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-featurization-r-LC17&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-v&quot;&gt;prep_fun&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;pl-smi&quot;&gt;tolower&lt;/span&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-featurization-r-L18&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;18&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-featurization-r-LC18&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-v&quot;&gt;tok_fun&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;pl-smi&quot;&gt;word_tokenizer&lt;/span&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-featurization-r-L19&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;19&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-featurization-r-LC19&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-featurization-r-L20&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;20&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-featurization-r-LC20&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-v&quot;&gt;it_train&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;=&lt;/span&gt; itoken(&lt;span class=&quot;pl-smi&quot;&gt;df_train&lt;/span&gt;&lt;span class=&quot;pl-k&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;pl-smi&quot;&gt;text&lt;/span&gt;,  &lt;span class=&quot;pl-v&quot;&gt;preprocessor&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;pl-smi&quot;&gt;prep_fun&lt;/span&gt;,  &lt;span class=&quot;pl-v&quot;&gt;tokenizer&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;pl-smi&quot;&gt;tok_fun&lt;/span&gt;,  &lt;span class=&quot;pl-v&quot;&gt;ids&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;pl-smi&quot;&gt;df_train&lt;/span&gt;&lt;span class=&quot;pl-k&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;pl-smi&quot;&gt;ID&lt;/span&gt;, &lt;span class=&quot;pl-v&quot;&gt;progressbar&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;pl-c1&quot;&gt;FALSE&lt;/span&gt;)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-featurization-r-L21&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;21&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-featurization-r-LC21&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-v&quot;&gt;vocab&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;=&lt;/span&gt; create_vocabulary(&lt;span class=&quot;pl-smi&quot;&gt;it_train&lt;/span&gt;,&lt;span class=&quot;pl-v&quot;&gt;stopwords&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;pl-smi&quot;&gt;stop_words&lt;/span&gt;)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-featurization-r-L22&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;22&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-featurization-r-LC22&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-featurization-r-L23&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;23&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-featurization-r-LC23&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-c&quot;&gt;&lt;span class=&quot;pl-c&quot;&gt;#&lt;/span&gt;clean vocabulary - use only 5000 terms&lt;/span&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-featurization-r-L24&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;24&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-featurization-r-LC24&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-smi&quot;&gt;pruned_vocab&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;&amp;#x3C;-&lt;/span&gt; prune_vocabulary(&lt;span class=&quot;pl-smi&quot;&gt;vocab&lt;/span&gt;, &lt;span class=&quot;pl-v&quot;&gt;max_number_of_terms&lt;/span&gt;&lt;span class=&quot;pl-k&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;pl-c1&quot;&gt;5000&lt;/span&gt;)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-featurization-r-L25&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;25&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-featurization-r-LC25&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-featurization-r-L26&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;26&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-featurization-r-LC26&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-v&quot;&gt;vectorizer&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;=&lt;/span&gt; vocab_vectorizer(&lt;span class=&quot;pl-smi&quot;&gt;pruned_vocab&lt;/span&gt;)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-featurization-r-L27&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;27&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-featurization-r-LC27&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-v&quot;&gt;dtm_train&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;=&lt;/span&gt; create_dtm(&lt;span class=&quot;pl-smi&quot;&gt;it_train&lt;/span&gt;, &lt;span class=&quot;pl-smi&quot;&gt;vectorizer&lt;/span&gt;)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-featurization-r-L28&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;28&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-featurization-r-LC28&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-featurization-r-L29&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;29&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-featurization-r-LC29&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-c&quot;&gt;&lt;span class=&quot;pl-c&quot;&gt;#&lt;/span&gt;create tf-idf for train data set&lt;/span&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-featurization-r-L30&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;30&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-featurization-r-LC30&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-v&quot;&gt;tfidf&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;pl-smi&quot;&gt;TfIdf&lt;/span&gt;&lt;span class=&quot;pl-k&quot;&gt;$&lt;/span&gt;new()&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-featurization-r-L31&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;31&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-featurization-r-LC31&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-v&quot;&gt;dtm_train_tfidf&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;=&lt;/span&gt; fit_transform(&lt;span class=&quot;pl-smi&quot;&gt;dtm_train&lt;/span&gt;, &lt;span class=&quot;pl-smi&quot;&gt;tfidf&lt;/span&gt;)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-featurization-r-L32&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;32&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-featurization-r-LC32&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-featurization-r-L33&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;33&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-featurization-r-LC33&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-c&quot;&gt;&lt;span class=&quot;pl-c&quot;&gt;#&lt;/span&gt;create test tf-idf - use vocabulary that is built on train&lt;/span&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-featurization-r-L34&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;34&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-featurization-r-LC34&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-v&quot;&gt;it_test&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;=&lt;/span&gt; itoken(&lt;span class=&quot;pl-smi&quot;&gt;df_test&lt;/span&gt;&lt;span class=&quot;pl-k&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;pl-smi&quot;&gt;text&lt;/span&gt;,  &lt;span class=&quot;pl-v&quot;&gt;preprocessor&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;pl-smi&quot;&gt;prep_fun&lt;/span&gt;,  &lt;span class=&quot;pl-v&quot;&gt;tokenizer&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;pl-smi&quot;&gt;tok_fun&lt;/span&gt;,  &lt;span class=&quot;pl-v&quot;&gt;ids&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;pl-smi&quot;&gt;df_test&lt;/span&gt;&lt;span class=&quot;pl-k&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;pl-smi&quot;&gt;ID&lt;/span&gt;, &lt;span class=&quot;pl-v&quot;&gt;progressbar&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;pl-c1&quot;&gt;FALSE&lt;/span&gt;)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-featurization-r-L35&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;35&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-featurization-r-LC35&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-v&quot;&gt;dtm_test_tfidf&lt;/span&gt;  &lt;span class=&quot;pl-k&quot;&gt;=&lt;/span&gt; create_dtm(&lt;span class=&quot;pl-smi&quot;&gt;it_test&lt;/span&gt;, &lt;span class=&quot;pl-smi&quot;&gt;vectorizer&lt;/span&gt;) %&lt;span class=&quot;pl-k&quot;&gt;&gt;&lt;/span&gt;% &lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-featurization-r-L36&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;36&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-featurization-r-LC36&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;  transform(&lt;span class=&quot;pl-smi&quot;&gt;tfidf&lt;/span&gt;)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-featurization-r-L37&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;37&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-featurization-r-LC37&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-c&quot;&gt;&lt;span class=&quot;pl-c&quot;&gt;#&lt;/span&gt; add the label as the first column of each matrix&lt;/span&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-featurization-r-L38&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;38&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-featurization-r-LC38&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-smi&quot;&gt;dtm_train_tfidf&lt;/span&gt;&lt;span class=&quot;pl-k&quot;&gt;&amp;#x3C;-&lt;/span&gt; Matrix(cbind(&lt;span class=&quot;pl-v&quot;&gt;label&lt;/span&gt;&lt;span class=&quot;pl-k&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;pl-smi&quot;&gt;df_train&lt;/span&gt;&lt;span class=&quot;pl-k&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;pl-smi&quot;&gt;label&lt;/span&gt;,&lt;span class=&quot;pl-smi&quot;&gt;dtm_train_tfidf&lt;/span&gt;),&lt;span class=&quot;pl-v&quot;&gt;sparse&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;pl-c1&quot;&gt;TRUE&lt;/span&gt;)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-featurization-r-L39&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;39&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-featurization-r-LC39&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-smi&quot;&gt;dtm_test_tfidf&lt;/span&gt;&lt;span class=&quot;pl-k&quot;&gt;&amp;#x3C;-&lt;/span&gt; Matrix(cbind(&lt;span class=&quot;pl-v&quot;&gt;label&lt;/span&gt;&lt;span class=&quot;pl-k&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;pl-smi&quot;&gt;df_test&lt;/span&gt;&lt;span class=&quot;pl-k&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;pl-smi&quot;&gt;label&lt;/span&gt;,&lt;span class=&quot;pl-smi&quot;&gt;dtm_test_tfidf&lt;/span&gt;),&lt;span class=&quot;pl-v&quot;&gt;sparse&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;pl-c1&quot;&gt;TRUE&lt;/span&gt;)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-featurization-r-L40&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;40&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-featurization-r-LC40&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-featurization-r-L41&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;41&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-featurization-r-LC41&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-c&quot;&gt;&lt;span class=&quot;pl-c&quot;&gt;#&lt;/span&gt; write output -  tf-idf matrices&lt;/span&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-featurization-r-L42&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;42&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-featurization-r-LC42&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;writeMM(&lt;span class=&quot;pl-smi&quot;&gt;dtm_train_tfidf&lt;/span&gt;,&lt;span class=&quot;pl-smi&quot;&gt;args&lt;/span&gt;[&lt;span class=&quot;pl-c1&quot;&gt;3&lt;/span&gt;])&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-featurization-r-L43&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;43&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-featurization-r-LC43&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;writeMM(&lt;span class=&quot;pl-smi&quot;&gt;dtm_test_tfidf&lt;/span&gt;,&lt;span class=&quot;pl-smi&quot;&gt;args&lt;/span&gt;[&lt;span class=&quot;pl-c1&quot;&gt;4&lt;/span&gt;])&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-featurization-r-L44&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;44&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-featurization-r-LC44&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-featurization-r-L45&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;45&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-featurization-r-LC45&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;print(&lt;span class=&quot;pl-s&quot;&gt;&lt;span class=&quot;pl-pds&quot;&gt;&quot;&lt;/span&gt;Two matrices were created - one for train and one for test data set&lt;span class=&quot;pl-pds&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt;)&lt;/td&gt;
      &lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;


  &lt;/div&gt;

  &lt;/div&gt;
&lt;/div&gt;

      &lt;/div&gt;
      &lt;div class=&quot;gist-meta&quot;&gt;
        &lt;a href=&quot;https://gist.github.com/Zoldin/9e79c047fd8ad7aa6596b0682aca83c6/raw/2787bc21fa8b2591ca09102f38f544eb5d6cf032/featurization.R&quot; style=&quot;float:right&quot;&gt;view raw&lt;/a&gt;
        &lt;a href=&quot;https://gist.github.com/Zoldin/9e79c047fd8ad7aa6596b0682aca83c6#file-featurization-r&quot;&gt;featurization.R&lt;/a&gt;
        hosted with ❤ by &lt;a href=&quot;https://github.com&quot;&gt;GitHub&lt;/a&gt;
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;&lt;/body&gt;&lt;/html&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;train_model.R&lt;/strong&gt;: using the features created above, we fit a logistic regression
with a LASSO (L1) penalty.&lt;/p&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;div id=&quot;gist71114340&quot; class=&quot;gist&quot;&gt;
    &lt;div class=&quot;gist-file&quot;&gt;
      &lt;div class=&quot;gist-data&quot;&gt;
        &lt;div class=&quot;js-gist-file-update-container js-task-list-container file-box&quot;&gt;
  &lt;div id=&quot;file-train_model-r&quot; class=&quot;file&quot;&gt;
    

  &lt;div itemprop=&quot;text&quot; class=&quot;Box-body p-0 blob-wrapper data type-r&quot;&gt;
      
&lt;table class=&quot;highlight tab-size js-file-line-container&quot; data-tab-size=&quot;8&quot;&gt;
      &lt;tbody&gt;&lt;tr&gt;
        &lt;td id=&quot;file-train_model-r-L1&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;1&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_model-r-LC1&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-c&quot;&gt;&lt;span class=&quot;pl-c&quot;&gt;#&lt;/span&gt;!/usr/bin/Rscript&lt;/span&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_model-r-L2&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;2&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_model-r-LC2&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;library(&lt;span class=&quot;pl-smi&quot;&gt;Matrix&lt;/span&gt;)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_model-r-L3&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;3&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_model-r-LC3&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;library(&lt;span class=&quot;pl-smi&quot;&gt;glmnet&lt;/span&gt;)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_model-r-L4&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;4&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_model-r-LC4&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_model-r-L5&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;5&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_model-r-LC5&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;    &lt;span class=&quot;pl-c&quot;&gt;&lt;span class=&quot;pl-c&quot;&gt;#&lt;/span&gt; three arguments need to be provided: train file (.txt, matrix), seed, and output name for the RData file&lt;/span&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_model-r-L6&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;6&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_model-r-LC6&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_model-r-L7&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;7&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_model-r-LC7&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-v&quot;&gt;args&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;=&lt;/span&gt; commandArgs(&lt;span class=&quot;pl-v&quot;&gt;trailingOnly&lt;/span&gt;&lt;span class=&quot;pl-k&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;pl-c1&quot;&gt;TRUE&lt;/span&gt;)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_model-r-L8&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;8&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_model-r-LC8&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_model-r-L9&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;9&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_model-r-LC9&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-k&quot;&gt;if&lt;/span&gt; (length(&lt;span class=&quot;pl-smi&quot;&gt;args&lt;/span&gt;) &lt;span class=&quot;pl-k&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;pl-c1&quot;&gt;3&lt;/span&gt;) {&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_model-r-L10&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;10&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_model-r-LC10&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;  stop(&lt;span class=&quot;pl-s&quot;&gt;&lt;span class=&quot;pl-pds&quot;&gt;&quot;&lt;/span&gt;Three arguments must be supplied (train file (.txt, matrix), seed and RData model file name).&lt;span class=&quot;pl-pds&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt;, &lt;span class=&quot;pl-v&quot;&gt;call.&lt;/span&gt;&lt;span class=&quot;pl-k&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;pl-c1&quot;&gt;FALSE&lt;/span&gt;)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_model-r-L11&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;11&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_model-r-LC11&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;} &lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_model-r-L12&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;12&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_model-r-LC12&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_model-r-L13&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;13&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_model-r-LC13&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-c&quot;&gt;&lt;span class=&quot;pl-c&quot;&gt;#&lt;/span&gt;read train data set &lt;/span&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_model-r-L14&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;14&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_model-r-LC14&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-v&quot;&gt;trainMM&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;=&lt;/span&gt; readMM(&lt;span class=&quot;pl-smi&quot;&gt;args&lt;/span&gt;[&lt;span class=&quot;pl-c1&quot;&gt;1&lt;/span&gt;])&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_model-r-L15&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;15&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_model-r-LC15&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;set.seed(as.numeric(&lt;span class=&quot;pl-smi&quot;&gt;args&lt;/span&gt;[&lt;span class=&quot;pl-c1&quot;&gt;2&lt;/span&gt;]))&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_model-r-L16&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;16&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_model-r-LC16&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_model-r-L17&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;17&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_model-r-LC17&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-c&quot;&gt;&lt;span class=&quot;pl-c&quot;&gt;#&lt;/span&gt;use regular matrix, not sparse&lt;/span&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_model-r-L18&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;18&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_model-r-LC18&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-smi&quot;&gt;trainMM_reg&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;&amp;#x3C;-&lt;/span&gt; as.matrix(&lt;span class=&quot;pl-smi&quot;&gt;trainMM&lt;/span&gt;)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_model-r-L19&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;19&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_model-r-LC19&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_model-r-L20&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;20&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_model-r-LC20&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-v&quot;&gt;t1&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;=&lt;/span&gt; Sys.time()&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_model-r-L21&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;21&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_model-r-LC21&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;print(&lt;span class=&quot;pl-s&quot;&gt;&lt;span class=&quot;pl-pds&quot;&gt;&quot;&lt;/span&gt;Started to train the model... &lt;span class=&quot;pl-pds&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt;)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_model-r-L22&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;22&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_model-r-LC22&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-v&quot;&gt;glmnet_classifier&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;=&lt;/span&gt; cv.glmnet(&lt;span class=&quot;pl-v&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;pl-smi&quot;&gt;trainMM_reg&lt;/span&gt;[,&lt;span class=&quot;pl-c1&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;pl-k&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;pl-c1&quot;&gt;500&lt;/span&gt;], &lt;span class=&quot;pl-v&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;pl-smi&quot;&gt;trainMM_reg&lt;/span&gt;[,&lt;span class=&quot;pl-c1&quot;&gt;1&lt;/span&gt;], &lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_model-r-L23&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;23&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_model-r-LC23&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;                              &lt;span class=&quot;pl-v&quot;&gt;family&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;pl-s&quot;&gt;&lt;span class=&quot;pl-pds&quot;&gt;&apos;&lt;/span&gt;binomial&lt;span class=&quot;pl-pds&quot;&gt;&apos;&lt;/span&gt;&lt;/span&gt;, &lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_model-r-L24&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;24&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_model-r-LC24&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;                              &lt;span class=&quot;pl-c&quot;&gt;&lt;span class=&quot;pl-c&quot;&gt;#&lt;/span&gt; L1 penalty&lt;/span&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_model-r-L25&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;25&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_model-r-LC25&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;                              &lt;span class=&quot;pl-v&quot;&gt;alpha&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;pl-c1&quot;&gt;1&lt;/span&gt;,&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_model-r-L26&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;26&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_model-r-LC26&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;                              &lt;span class=&quot;pl-c&quot;&gt;&lt;span class=&quot;pl-c&quot;&gt;#&lt;/span&gt; interested in the area under ROC curve&lt;/span&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_model-r-L27&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;27&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_model-r-LC27&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;                              &lt;span class=&quot;pl-v&quot;&gt;type.measure&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;pl-s&quot;&gt;&lt;span class=&quot;pl-pds&quot;&gt;&quot;&lt;/span&gt;auc&lt;span class=&quot;pl-pds&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt;,&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_model-r-L28&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;28&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_model-r-LC28&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;                              &lt;span class=&quot;pl-c&quot;&gt;&lt;span class=&quot;pl-c&quot;&gt;#&lt;/span&gt; 5-fold cross-validation&lt;/span&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_model-r-L29&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;29&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_model-r-LC29&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;                              &lt;span class=&quot;pl-v&quot;&gt;nfolds&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;pl-c1&quot;&gt;5&lt;/span&gt;,&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_model-r-L30&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;30&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_model-r-LC30&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;                              &lt;span class=&quot;pl-c&quot;&gt;&lt;span class=&quot;pl-c&quot;&gt;#&lt;/span&gt; a higher convergence threshold is less accurate but trains faster&lt;/span&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_model-r-L31&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;31&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_model-r-LC31&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;                              &lt;span class=&quot;pl-v&quot;&gt;thresh&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;pl-c1&quot;&gt;1e-3&lt;/span&gt;,&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_model-r-L32&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;32&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_model-r-LC32&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;                              &lt;span class=&quot;pl-c&quot;&gt;&lt;span class=&quot;pl-c&quot;&gt;#&lt;/span&gt; likewise, a lower iteration cap for faster training&lt;/span&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_model-r-L33&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;33&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_model-r-LC33&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;                              &lt;span class=&quot;pl-v&quot;&gt;maxit&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;pl-c1&quot;&gt;1e3&lt;/span&gt;)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_model-r-L34&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;34&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_model-r-LC34&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;print(&lt;span class=&quot;pl-s&quot;&gt;&lt;span class=&quot;pl-pds&quot;&gt;&quot;&lt;/span&gt;Model generated...&lt;span class=&quot;pl-pds&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt;)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_model-r-L35&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;35&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_model-r-LC35&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;print(difftime(Sys.time(), &lt;span class=&quot;pl-smi&quot;&gt;t1&lt;/span&gt;, &lt;span class=&quot;pl-v&quot;&gt;units&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;pl-s&quot;&gt;&lt;span class=&quot;pl-pds&quot;&gt;&apos;&lt;/span&gt;sec&lt;span class=&quot;pl-pds&quot;&gt;&apos;&lt;/span&gt;&lt;/span&gt;))&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_model-r-L36&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;36&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_model-r-LC36&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_model-r-L37&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;37&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_model-r-LC37&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-v&quot;&gt;preds&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;=&lt;/span&gt; predict(&lt;span class=&quot;pl-smi&quot;&gt;glmnet_classifier&lt;/span&gt;, &lt;span class=&quot;pl-smi&quot;&gt;trainMM_reg&lt;/span&gt;[,&lt;span class=&quot;pl-c1&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;pl-k&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;pl-c1&quot;&gt;500&lt;/span&gt;], &lt;span class=&quot;pl-v&quot;&gt;type&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;pl-s&quot;&gt;&lt;span class=&quot;pl-pds&quot;&gt;&apos;&lt;/span&gt;response&lt;span class=&quot;pl-pds&quot;&gt;&apos;&lt;/span&gt;&lt;/span&gt;)[,&lt;span class=&quot;pl-c1&quot;&gt;1&lt;/span&gt;]&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_model-r-L38&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;38&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_model-r-LC38&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_model-r-L39&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;39&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_model-r-LC39&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;print(&lt;span class=&quot;pl-s&quot;&gt;&lt;span class=&quot;pl-pds&quot;&gt;&quot;&lt;/span&gt;AUC for the train... &lt;span class=&quot;pl-pds&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt;)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_model-r-L40&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;40&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_model-r-LC40&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-e&quot;&gt;glmnet&lt;/span&gt;&lt;span class=&quot;pl-k&quot;&gt;:::&lt;/span&gt;auc(&lt;span class=&quot;pl-smi&quot;&gt;trainMM_reg&lt;/span&gt;[,&lt;span class=&quot;pl-c1&quot;&gt;1&lt;/span&gt;], &lt;span class=&quot;pl-smi&quot;&gt;preds&lt;/span&gt;)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_model-r-L41&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;41&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_model-r-LC41&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-train_model-r-L42&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;42&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-train_model-r-LC42&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;save(&lt;span class=&quot;pl-smi&quot;&gt;glmnet_classifier&lt;/span&gt;,&lt;span class=&quot;pl-v&quot;&gt;file&lt;/span&gt;&lt;span class=&quot;pl-k&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;pl-smi&quot;&gt;args&lt;/span&gt;[&lt;span class=&quot;pl-c1&quot;&gt;3&lt;/span&gt;])&lt;/td&gt;
      &lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;


  &lt;/div&gt;

  &lt;/div&gt;
&lt;/div&gt;

      &lt;/div&gt;
      &lt;div class=&quot;gist-meta&quot;&gt;
        &lt;a href=&quot;https://gist.github.com/Zoldin/1617b39f2acbde3cd486616ac442e7cf/raw/5f12bfcec59aeddd8428f9d9c571a243c2302ae6/train_model.R&quot; style=&quot;float:right&quot;&gt;view raw&lt;/a&gt;
        &lt;a href=&quot;https://gist.github.com/Zoldin/1617b39f2acbde3cd486616ac442e7cf#file-train_model-r&quot;&gt;train_model.R&lt;/a&gt;
        hosted with ❤ by &lt;a href=&quot;https://github.com&quot;&gt;GitHub&lt;/a&gt;
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;&lt;/body&gt;&lt;/html&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;evaluate.R&lt;/strong&gt;: using the trained model, we predict the target on the test data set.
The final output is the AUC, which serves as the evaluation metric.&lt;/p&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;div id=&quot;gist71113477&quot; class=&quot;gist&quot;&gt;
    &lt;div class=&quot;gist-file&quot;&gt;
      &lt;div class=&quot;gist-data&quot;&gt;
        &lt;div class=&quot;js-gist-file-update-container js-task-list-container file-box&quot;&gt;
  &lt;div id=&quot;file-evaluate-r&quot; class=&quot;file&quot;&gt;
    

  &lt;div itemprop=&quot;text&quot; class=&quot;Box-body p-0 blob-wrapper data type-r&quot;&gt;
      
&lt;table class=&quot;highlight tab-size js-file-line-container&quot; data-tab-size=&quot;8&quot;&gt;
      &lt;tbody&gt;&lt;tr&gt;
        &lt;td id=&quot;file-evaluate-r-L1&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;1&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-evaluate-r-LC1&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-c&quot;&gt;&lt;span class=&quot;pl-c&quot;&gt;#&lt;/span&gt;!/usr/bin/Rscript&lt;/span&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-evaluate-r-L2&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;2&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-evaluate-r-LC2&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;library(&lt;span class=&quot;pl-smi&quot;&gt;Matrix&lt;/span&gt;)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-evaluate-r-L3&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;3&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-evaluate-r-LC3&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;library(&lt;span class=&quot;pl-smi&quot;&gt;glmnet&lt;/span&gt;)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-evaluate-r-L4&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;4&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-evaluate-r-LC4&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-evaluate-r-L5&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;5&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-evaluate-r-LC5&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-v&quot;&gt;args&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;=&lt;/span&gt; commandArgs(&lt;span class=&quot;pl-v&quot;&gt;trailingOnly&lt;/span&gt;&lt;span class=&quot;pl-k&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;pl-c1&quot;&gt;TRUE&lt;/span&gt;)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-evaluate-r-L6&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;6&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-evaluate-r-LC6&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-evaluate-r-L7&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;7&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-evaluate-r-LC7&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-k&quot;&gt;if&lt;/span&gt; (length(&lt;span class=&quot;pl-smi&quot;&gt;args&lt;/span&gt;) &lt;span class=&quot;pl-k&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;pl-c1&quot;&gt;3&lt;/span&gt;) {&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-evaluate-r-L8&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;8&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-evaluate-r-LC8&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;  stop(&lt;span class=&quot;pl-s&quot;&gt;&lt;span class=&quot;pl-pds&quot;&gt;&quot;&lt;/span&gt;Three arguments must be supplied (file name where model is stored (RData name), test file (.txt, matrix) and file name for AUC output).&lt;span class=&quot;pl-pds&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt;, &lt;span class=&quot;pl-v&quot;&gt;call.&lt;/span&gt;&lt;span class=&quot;pl-k&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;pl-c1&quot;&gt;FALSE&lt;/span&gt;)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-evaluate-r-L9&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;9&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-evaluate-r-LC9&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;} &lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-evaluate-r-L10&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;10&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-evaluate-r-LC10&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-evaluate-r-L11&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;11&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-evaluate-r-LC11&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-c&quot;&gt;&lt;span class=&quot;pl-c&quot;&gt;#&lt;/span&gt;read test data set and model &lt;/span&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-evaluate-r-L12&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;12&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-evaluate-r-LC12&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;load(&lt;span class=&quot;pl-smi&quot;&gt;args&lt;/span&gt;[&lt;span class=&quot;pl-c1&quot;&gt;1&lt;/span&gt;])&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-evaluate-r-L13&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;13&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-evaluate-r-LC13&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-v&quot;&gt;testMM&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;=&lt;/span&gt; readMM(&lt;span class=&quot;pl-smi&quot;&gt;args&lt;/span&gt;[&lt;span class=&quot;pl-c1&quot;&gt;2&lt;/span&gt;])&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-evaluate-r-L14&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;14&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-evaluate-r-LC14&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-smi&quot;&gt;testMM_reg&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;&amp;#x3C;-&lt;/span&gt; as.matrix(&lt;span class=&quot;pl-smi&quot;&gt;testMM&lt;/span&gt;)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-evaluate-r-L15&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;15&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-evaluate-r-LC15&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-evaluate-r-L16&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;16&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-evaluate-r-LC16&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-c&quot;&gt;&lt;span class=&quot;pl-c&quot;&gt;#&lt;/span&gt;predict test data&lt;/span&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-evaluate-r-L17&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;17&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-evaluate-r-LC17&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-v&quot;&gt;preds&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;=&lt;/span&gt; predict(&lt;span class=&quot;pl-smi&quot;&gt;glmnet_classifier&lt;/span&gt;, &lt;span class=&quot;pl-smi&quot;&gt;testMM_reg&lt;/span&gt;[,&lt;span class=&quot;pl-c1&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;pl-k&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;pl-c1&quot;&gt;500&lt;/span&gt;] , &lt;span class=&quot;pl-v&quot;&gt;type&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;pl-s&quot;&gt;&lt;span class=&quot;pl-pds&quot;&gt;&apos;&lt;/span&gt;response&lt;span class=&quot;pl-pds&quot;&gt;&apos;&lt;/span&gt;&lt;/span&gt;)[, &lt;span class=&quot;pl-c1&quot;&gt;1&lt;/span&gt;]&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-evaluate-r-L18&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;18&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-evaluate-r-LC18&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt; &lt;span class=&quot;pl-e&quot;&gt;glmnet&lt;/span&gt;&lt;span class=&quot;pl-k&quot;&gt;:::&lt;/span&gt;auc(&lt;span class=&quot;pl-smi&quot;&gt;testMM_reg&lt;/span&gt;[,&lt;span class=&quot;pl-c1&quot;&gt;1&lt;/span&gt;], &lt;span class=&quot;pl-smi&quot;&gt;preds&lt;/span&gt;)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-evaluate-r-L19&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;19&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-evaluate-r-LC19&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-evaluate-r-L20&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;20&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-evaluate-r-LC20&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;&lt;span class=&quot;pl-c&quot;&gt;&lt;span class=&quot;pl-c&quot;&gt;#&lt;/span&gt;write AUC into txt file&lt;/span&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-evaluate-r-L21&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;21&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-evaluate-r-LC21&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;write.table(&lt;span class=&quot;pl-v&quot;&gt;file&lt;/span&gt;&lt;span class=&quot;pl-k&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;pl-smi&quot;&gt;args&lt;/span&gt;[&lt;span class=&quot;pl-c1&quot;&gt;3&lt;/span&gt;],paste(&lt;span class=&quot;pl-s&quot;&gt;&lt;span class=&quot;pl-pds&quot;&gt;&apos;&lt;/span&gt;AUC for the test file is : &lt;span class=&quot;pl-pds&quot;&gt;&apos;&lt;/span&gt;&lt;/span&gt;,&lt;span class=&quot;pl-e&quot;&gt;glmnet&lt;/span&gt;&lt;span class=&quot;pl-k&quot;&gt;:::&lt;/span&gt;auc(&lt;span class=&quot;pl-smi&quot;&gt;testMM_reg&lt;/span&gt;[,&lt;span class=&quot;pl-c1&quot;&gt;1&lt;/span&gt;], &lt;span class=&quot;pl-smi&quot;&gt;preds&lt;/span&gt;)),&lt;span class=&quot;pl-v&quot;&gt;row.names&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;pl-c1&quot;&gt;FALSE&lt;/span&gt;,&lt;span class=&quot;pl-v&quot;&gt;col.names&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;pl-c1&quot;&gt;FALSE&lt;/span&gt;)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-evaluate-r-L22&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;22&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-evaluate-r-LC22&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td id=&quot;file-evaluate-r-L23&quot; class=&quot;blob-num js-line-number&quot; data-line-number=&quot;23&quot;&gt;&lt;/td&gt;
        &lt;td id=&quot;file-evaluate-r-LC23&quot; class=&quot;blob-code blob-code-inner js-file-line&quot;&gt;
&lt;/td&gt;
      &lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;


  &lt;/div&gt;

  &lt;/div&gt;
&lt;/div&gt;

      &lt;/div&gt;
      &lt;div class=&quot;gist-meta&quot;&gt;
        &lt;a href=&quot;https://gist.github.com/Zoldin/bfc2d4ee449098a9ff64b99c3326e61d#file-evaluate-r&quot;&gt;evaluate.r&lt;/a&gt;
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;&lt;/body&gt;&lt;/html&gt;&lt;/p&gt;
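&lt;p&gt;The evaluation script reports AUC through glmnet’s internal &lt;code class=&quot;language-text&quot;&gt;auc&lt;/code&gt; helper. As a rough illustration of what that number measures, and not a reimplementation of glmnet’s code, the Python sketch below computes AUC as the fraction of positive/negative pairs that the scores rank correctly. It assumes 0/1 labels, as the R script’s use of the first matrix column suggests; the function name and the toy data are hypothetical.&lt;/p&gt;

```python
def auc(labels, scores):
    """Probability that a random positive example outranks a random negative one."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0  # positive ranked above negative
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Toy data, made up for illustration (not taken from the post's dataset).
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.2, 0.4, 0.6]
toy_auc = auc(toy_labels, toy_scores)
```

&lt;p&gt;On the toy data three of the four positive/negative pairs are ranked correctly, so &lt;code class=&quot;language-text&quot;&gt;toy_auc&lt;/code&gt; is 0.75. The pairwise loop is quadratic and only meant to make the definition visible; real libraries typically use a rank-based formula.&lt;/p&gt;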
&lt;p&gt;First, we download the R scripts shown above into a new folder by cloning the
repository:&lt;/p&gt;
&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;dvc&quot;&gt;&lt;pre class=&quot;language-dvc&quot;&gt;&lt;code class=&quot;language-dvc&quot;&gt;&lt;span class=&quot;token line&quot;&gt;&lt;span class=&quot;token input&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;token command&quot;&gt;mkdir&lt;/span&gt; R_DVC_GITHUB_CODE
&lt;/span&gt;&lt;span class=&quot;token line&quot;&gt;&lt;span class=&quot;token input&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;token command&quot;&gt;cd&lt;/span&gt; R_DVC_GITHUB_CODE
&lt;/span&gt;
&lt;span class=&quot;token line&quot;&gt;&lt;span class=&quot;token input&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;token git&quot;&gt;git clone&lt;/span&gt; https://github.com/Zoldin/R_AND_DVC&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/body&gt;&lt;/html&gt;
&lt;h2&gt;DVC installation and initialization&lt;/h2&gt;
&lt;p&gt;At first sight it seemed that DVC would not be compatible with R, because
DVC is written in Python and therefore requires Python packages and the pip
package manager. Nevertheless, the tool can be used with any programming
language: it is language agnostic, and as such it works well with
R.&lt;/p&gt;
&lt;p&gt;DVC installation:&lt;/p&gt;
&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;dvc&quot;&gt;&lt;pre class=&quot;language-dvc&quot;&gt;&lt;code class=&quot;language-dvc&quot;&gt;&lt;span class=&quot;token line&quot;&gt;&lt;span class=&quot;token input&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;token command&quot;&gt;pip3&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;install&lt;/span&gt; dvc
&lt;/span&gt;&lt;span class=&quot;token line&quot;&gt;&lt;span class=&quot;token input&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;token dvc&quot;&gt;dvc init&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/body&gt;&lt;/html&gt;
&lt;p&gt;The commands below execute five R scripts with &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc run&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;. Each script is started
with some arguments: input and output file names and other parameters (seed,
splitting ratio, etc.). It is important to use &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc run&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;: with this command the R
scripts enter the pipeline (a DAG).&lt;/p&gt;
&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;dvc&quot;&gt;&lt;pre class=&quot;language-dvc&quot;&gt;&lt;code class=&quot;language-dvc&quot;&gt;&lt;span class=&quot;token line&quot;&gt;&lt;span class=&quot;token input&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;token dvc&quot;&gt;dvc import&lt;/span&gt; https://s3-us-west-2.amazonaws.com/dvc-share/so/25K/Posts.xml.tgz &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;
             data/
&lt;/span&gt;
&lt;span class=&quot;token comment&quot;&gt;# Extract XML from the archive.&lt;/span&gt;
&lt;span class=&quot;token line&quot;&gt;&lt;span class=&quot;token input&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;token dvc&quot;&gt;dvc run&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;tar&lt;/span&gt; zxf data/Posts.xml.tgz -C data/
&lt;/span&gt;
&lt;span class=&quot;token comment&quot;&gt;# Prepare data.&lt;/span&gt;
&lt;span class=&quot;token line&quot;&gt;&lt;span class=&quot;token input&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;token dvc&quot;&gt;dvc run&lt;/span&gt; Rscript code/parsingxml.R &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;
                  data/Posts.xml &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;
                  data/Posts.csv
&lt;/span&gt;
&lt;span class=&quot;token comment&quot;&gt;# Split training and testing dataset. Two output files.&lt;/span&gt;
&lt;span class=&quot;token comment&quot;&gt;# 0.33 is the test dataset splitting ratio.&lt;/span&gt;
&lt;span class=&quot;token comment&quot;&gt;# 20170426 is a seed for randomization.&lt;/span&gt;
&lt;span class=&quot;token line&quot;&gt;&lt;span class=&quot;token input&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;token dvc&quot;&gt;dvc run&lt;/span&gt; Rscript code/train_test_spliting.R &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;
                  data/Posts.csv &lt;span class=&quot;token number&quot;&gt;0.33&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;20170426&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;
                  data/train_post.csv &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;
                  data/test_post.csv
&lt;/span&gt;
&lt;span class=&quot;token comment&quot;&gt;# Extract features from text data.&lt;/span&gt;
&lt;span class=&quot;token comment&quot;&gt;# Two CSV inputs and two matrix outputs (text files).&lt;/span&gt;
&lt;span class=&quot;token line&quot;&gt;&lt;span class=&quot;token input&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;token dvc&quot;&gt;dvc run&lt;/span&gt; Rscript code/featurization.R &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;
                  data/train_post.csv &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;
                  data/test_post.csv &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;
                  data/matrix_train.txt &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;
                  data/matrix_test.txt
&lt;/span&gt;
&lt;span class=&quot;token comment&quot;&gt;# Train ML model out of the training dataset.&lt;/span&gt;
&lt;span class=&quot;token comment&quot;&gt;# 20170426 is another seed value.&lt;/span&gt;
&lt;span class=&quot;token line&quot;&gt;&lt;span class=&quot;token input&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;token dvc&quot;&gt;dvc run&lt;/span&gt; Rscript code/train_model.R &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;
                  data/matrix_train.txt &lt;span class=&quot;token number&quot;&gt;20170426&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;
                  data/glmnet.Rdata
&lt;/span&gt;
&lt;span class=&quot;token comment&quot;&gt;# Evaluate the model by the testing dataset.&lt;/span&gt;
&lt;span class=&quot;token line&quot;&gt;&lt;span class=&quot;token input&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;token dvc&quot;&gt;dvc run&lt;/span&gt; Rscript code/evaluate.R &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;
                  data/glmnet.Rdata &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;
                  data/matrix_test.txt &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;
                  data/evaluation.txt
&lt;/span&gt;
&lt;span class=&quot;token comment&quot;&gt;# The result.&lt;/span&gt;
&lt;span class=&quot;token line&quot;&gt;&lt;span class=&quot;token input&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;token command&quot;&gt;cat&lt;/span&gt; data/evaluation.txt&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/body&gt;&lt;/html&gt;
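&lt;p&gt;The commands above implicitly define a dependency graph: each stage reads some files and writes others. The Python sketch below is an illustration only, not DVC’s implementation. It writes those inputs and outputs down by hand (the stage names and the &lt;code class=&quot;language-text&quot;&gt;STAGES&lt;/code&gt; table are hypothetical, and the R scripts themselves are left out as dependencies) and derives a valid execution order with the standard library’s &lt;code class=&quot;language-text&quot;&gt;graphlib&lt;/code&gt; module, which requires Python 3.9 or later.&lt;/p&gt;

```python
from graphlib import TopologicalSorter

# Each stage maps to (input files, output files), mirroring the commands above.
STAGES = {
    "extract": (["data/Posts.xml.tgz"], ["data/Posts.xml"]),
    "parsingxml.R": (["data/Posts.xml"], ["data/Posts.csv"]),
    "train_test_spliting.R": (
        ["data/Posts.csv"],
        ["data/train_post.csv", "data/test_post.csv"],
    ),
    "featurization.R": (
        ["data/train_post.csv", "data/test_post.csv"],
        ["data/matrix_train.txt", "data/matrix_test.txt"],
    ),
    "train_model.R": (["data/matrix_train.txt"], ["data/glmnet.Rdata"]),
    "evaluate.R": (
        ["data/glmnet.Rdata", "data/matrix_test.txt"],
        ["data/evaluation.txt"],
    ),
}


def stage_order(stages):
    """Return the stages in an order where every input is produced before use."""
    # Map each output file to the stage that produces it.
    producer = {out: name for name, (_, outs) in stages.items() for out in outs}
    # A stage depends on the producers of its inputs; files that no stage
    # produces (such as the downloaded archive) are external and ignored.
    graph = {
        name: {producer[dep] for dep in deps if dep in producer}
        for name, (deps, _) in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())


order = stage_order(STAGES)
```

&lt;p&gt;For this pipeline the order is fully determined: extraction, parsing, splitting, featurization, training, evaluation. A pipeline with independent branches would admit several valid orders.&lt;/p&gt;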
&lt;h2&gt;Dependency flow graph on R example&lt;/h2&gt;
&lt;p&gt;The dependency graph is shown in the picture below:&lt;/p&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;span class=&quot;gatsby-resp-image-wrapper&quot; style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto;  max-width: 256.5px;&quot;&gt;
      &lt;a class=&quot;gatsby-resp-image-link&quot; href=&quot;/static/e9ba609b030acd01d27fcd1ff99a3f7f/4df79/dependency-graph.png&quot; style=&quot;display: block&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;
    &lt;span class=&quot;gatsby-resp-image-background-image&quot; style=&quot;padding-bottom: 190.05847953216372%; position: relative; bottom: 0; left: 0; background-image: url(&amp;#x27;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAmCAYAAADEO7urAAAACXBIWXMAAAsSAAALEgHS3X78AAAIEUlEQVRIx51Vd1CbRxbfD4FBYIoKSAhRQjNNBoyRQF0gEMUChd5sC5BkVSB04kIJptgGJCQkUQ04sccJccEODg4YHwYBJpfEnmTm5ibOH1dyNzdzN5n8f6P79AkwBJKZ3JvZ2be7b3/73r4GwA5lNN8HGvMmwgu0WmQOuVqIDl/IQ9N41BAWVRDGoGaHnRJnY2M+Kjqe1tnr0Gv+D6AaFIisxrQBjiTFiOXA2nXY0RPbDTw8ZMDbvQFgCDecMcdNwHX3PH/o1tFABZ3zQGFYR/gL+nVnlXHdv9a07deo/5Nv35IVzyxVJJ3tvxeoXbZiG4zf+cLnAbBW3jb5c9efH9aQX38PtM58BSFaGtajakc3KUrDGkltXgsoazRRROIrrCJFP7VA0U+rvr4QXDu2TVQbLWy1yeJiuwPzBwHTYMCWXUD9enSNeTNGabAQ1SPr/sW1utgS1TVq4YWepCK1Nv7C0Eqg2rSFAGpMFvSRgCWdC0Cu3zF5eM1FNWIJUBktAQr9auClxz8RMmXdFNmIhdx2/x9Em7nweaDauOFjk8/pXP4NpxgsR+7DdhHgyfGX+8WX/nw0UHbLQ6Dc8fD5kU8gq6MV5I50eArvXsHS0rlkAa8ghJv+bkDKlUov0XQ7UW54hFLpt6Gz45PIHeX+6AgODtnjxe2zBx6ijfBxUQ9DPchygm/wdV8MbZqJSx7n7YXNG3i812GP29bWFgCEsVxkQfAhAqvVivByw4ab0rAeqRrZCqsf+z6o7cMfg9KqL6ae7Z6jXLz9ExxOr4Ph8wj4e4J2gZv6JuwMGQZhCbKBokABmNlix3gAUIqBhUSVaTNG3nXHR3S2hZFzrjUxv6ozLrekkZ1bVsfNkV49eXHubx5y3Uqast3oo4gDDgDwgKyp8eA/euJxGHhyv/HKipHrX9AVw6ux1b2Pkqt65pPFXXOMyu77jKoPPmVKB5YS4ViNb7jzJjYyke6FwbjhbfeDov0gIHj4NdRIxANO2El88kmmDyuBRT5FDMEizln/J3SUA3WwVU8XrSDexc8jMSqJQI/jkJmhbAxymDO5iMoyzYXxFE0BNL6IyMwsILLFGpLQPBeZNzWPZIGo5iJI4aUBfo4IuZOn7UZljjWEsy8I/ampPL/kzFQftiSHkD1eF2HPEvMnjulfvPSg6sZ843QTxMyVL72Ed5eOHRlfO45L+7DVkbMg9Uy5U0FIny0jcR5UYHKn6p1B5tiD/eJQbnwc+gt4zodHIRmg8rwAqswboFhk4MzwB87lGIASkYFjNbA5AoAOX6LjRN4N1IFHz9x8DNgDo6Dy23+DfI0Ikg89xWsMz/00+hWSSrdEMr62YlLLG2O4RTWxNl6tWyap9c9IGsMyWTn81BvEwhhTciAcb95vid2UzJa7DnAu8zVGS0jN6HaI7PqTiCJlf0qR8hqvUNHPK1IPcqp756PqJr4K0Rg33oGLQgac1072ArEvn4PDY/d4pXGTUTP+x/DaiW/CVCPrEefaJunlTWMMuIwxy9tmaNLBZ5Gq0S/DFcbNMI35JbtudAsxWWPegjXS3waspm4ECIvFO8If45rAFrpmnm/DCioaMRkVzV45tTo3fokaK2m5jjM8euXWMnDbq6FvGtPUP4uV1l5yR76ttBlkzXx20IFeeDzOE4v13bdl+3jU7hDSw1Hi4jRUkzOAdmuMf6B/MAj3d9i7UfHDfx0EHVo8I44
Zwo2hk/lxLBKXkkwWNF0j9VmtDkdFTvrrn51YRRIcM4L6DjOKRkyOSAhi0TOIgpmF40A59Q2UafrYj9NnCGRcHSSzerVkbq8uIMvwUVDp1CJS/9wTooG3byAQ9RgQQOHMEzTvxmhQUkt7YNL7HQG0tvYAbo8uSHjzgc/BDhcV7OEWHuD5S43kw29rneDWwh4PFxaHwsuX3TmfLzqRdqLE1pC4qpEXcY3T31KKm8c5RY1mXq1xI14x/DxBbdqgyobXfOydcA1kjd2z9/Cb8yBf2YbwqooKF+tuPE/MASAdXssorR9zzcpX0XLKW6Mz85S0FKGMKdIYMSrTVphU9yLOJgzLQf/6698BY3TG/veaGmSukkjQ5+RypIhoSgqRbpVYa96Mqht7Ga3ULsdKrz2Or5/82tZGo2vNG7FwacccKvG2tUq1O6PhAe3f26Mgoid2J0QOEGz2IU9zOJzd2RUe0P69PcJ64/32r6XjW5BCZ/lV0B2y9eW3dTP7xZtj9AQ2nh6dHMqm0P1PRVBjGKe4uNzH20gtLOoaBBKJ5NCgUCjIfRQK5QrBhGSBAxy2uZOf4fja6ROshiskmliJo9dfIQgGJ0/kzS4Sfk2l93dDxJ5Mx0LjU6GH8B7sD7um/KUNB+7nq66crj7/dwd0hMr2zuOKXIGzRCZDw9q8HVIpWlIlRlfyQ53OlxWgh/svOxF8cP4HXhNNPjlclG39ubUTklZVAhgEgsEgqVwNFZ8+Dup6ZxxkulV65cCz05LB56fTK7tEpZc/ZopvrMTBbZUHBKNzIG3yU8AbnQW0W/eA33c/gt8itWnzmMq4kSrT/QFzprSJKiqtZ2aVNSefUQ1j3hvfTkGEysrK7MU1Kwv5cKlUesgJ1T32LPng5gqoGlqNEA88j67SrkeWdD6kVA6tRlYOrkZUa1djwe+hi6ZXR23jwf9Dtlym6m+BaFtM/sUKJQI3J87pdFxMVPwJeiLPh+Ue4nL5Z+vvA00x3bXH7vy2E7N7KDopu/hESgKPyKUL/Dit3THZtx55/w8YH8oPRLNy4wAAAABJRU5ErkJggg==&amp;#x27;); background-size: cover; display: block;&quot;&gt;&lt;/span&gt;
  &lt;picture&gt;
        &lt;source srcset=&quot;/static/e9ba609b030acd01d27fcd1ff99a3f7f/c54d4/dependency-graph.webp 175w, /static/e9ba609b030acd01d27fcd1ff99a3f7f/a3432/dependency-graph.webp 350w, /static/e9ba609b030acd01d27fcd1ff99a3f7f/3be34/dependency-graph.webp 513w&quot; sizes=&quot;(max-width: 513px) 100vw, 513px&quot; type=&quot;image/webp&quot;&gt;
        &lt;source srcset=&quot;/static/e9ba609b030acd01d27fcd1ff99a3f7f/17006/dependency-graph.png 175w, /static/e9ba609b030acd01d27fcd1ff99a3f7f/d6f3f/dependency-graph.png 350w, /static/e9ba609b030acd01d27fcd1ff99a3f7f/4df79/dependency-graph.png 513w&quot; sizes=&quot;(max-width: 513px) 100vw, 513px&quot; type=&quot;image/png&quot;&gt;
        &lt;img class=&quot;gatsby-resp-image-image&quot; src=&quot;/static/e9ba609b030acd01d27fcd1ff99a3f7f/4df79/dependency-graph.png&quot; alt=&quot;Dependency graph&quot; title=&quot;Dependency graph&quot; loading=&quot;lazy&quot; style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;&gt;
      &lt;/picture&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/body&gt;&lt;/html&gt;&lt;em&gt;Dependency
graph&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;DVC memorizes these dependencies and helps us reproduce results at any
moment.&lt;/p&gt;
&lt;p&gt;For example, let’s say that we change our training model to use a ridge
penalty instead of a lasso penalty (setting the alpha parameter to &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;0&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;). In that case
we modify the &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;train_model.R&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; job, and if we want to repeat model
development with this algorithm we don’t need to repeat all the steps above,
only the steps marked red in the picture below:&lt;/p&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;span class=&quot;gatsby-resp-image-wrapper&quot; style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto;  max-width: 256.5px;&quot;&gt;
      &lt;a class=&quot;gatsby-resp-image-link&quot; href=&quot;/static/da29b8bd00ccba3578fdfe91cd7f34bc/4df79/marked-steps.png&quot; style=&quot;display: block&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;
    &lt;span class=&quot;gatsby-resp-image-background-image&quot; style=&quot;padding-bottom: 190.05847953216372%; position: relative; bottom: 0; left: 0; background-image: url(&amp;#x27;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAmCAYAAADEO7urAAAACXBIWXMAAAsSAAALEgHS3X78AAAIFUlEQVRIx5VWCVCTWRJ+SUBNIHdCkj8/SSAEY7hBEkLIxRUOw4AKqIiKQCAhHBYgIDUKigc6KpEECKgjuroX47nquIu6ByAuzo5j7dZO7VZN1W7N7FG7U+rq1FjK5t+XP4BcU7N2VVf3e//r7/Xr7tfvB2CGsloug9qBSVw39ThwKT1YTA6/uZ6sMqqkOpVJplGuk8WV5bKiLhYFZnR2Ew+7vwZKpw1fW9v/ACxLNtfEgjHF4U9ndgEazQK4gY2AyTu2ghHQByiz3/OPX1geaH3HVWB1+sCqe8dX1vRNBNf3TwmbnZ8LjoxinJTN1qTSo1fEPfcwVmP/HwQ1feMi6BXXu7744F1QP3O6OUpv/Ai0DP+O4NWtznFF/eBklM01jtS6x0Qlzf1RBWX7tEW2buVGW7eq/IPboQ1Dj/hwU32Na2KV1wbqCwEzGi9DwE9wwGrnWETdwGQk9JRvd40FF9c7YoohWGHVwaRC+8k4y8n7YnvfJK/GNe4FJOOAi0IFNu2/5T0qrlscY6tgLEXQQFTd+ytx+42nvGxLV6TFNYG2Xv4b39Y3JoLfIeiDINyZtlug7ruSMgu6mKAbPCj8Fs+bm58sD5Tbem3O7e19PyVgAAP5rk66+cfvs1SZBjTLWCg1mApEqfvKGAXnOvjVvTdItt4pwrahs0urQyoNm9PLOs4v2CjRmcpWXJXRRFa+QHoMZarOpbCTh1LnyuYR5IaOHlxvadkNPYvU4QMBHwEYhgFflh8EwEyvqXE9lO0a/LOk9QdfSTLK29NKD4xE7rn4NLiu/7NQq3NMDtdIZoGbDp/2KXwIos1aB6rXVwHtujK/aABIthO3Eu39k5FV+y8F5Ze2aPK2tSZu2NkZ+97mJl3+ll2GvMqD0XtGvqRZHfczbPv6giyxgAhAGrA0N83AE3yCweGwoKCeeIIxoQfJtt7fxFQcvqHeefi6eseBEc3OriuQL6dYjo8m2pzjcY0//CJGkahhsFiBHK99SISQAEzXPyXsZlOBQRbNUUdrg7QJOjSeL2XhyRn7O2G5BLrgqW5/jIG4VUJaokLNS47Vo9owPQP/mHfmDimnf0RmtDaJVOn5/JScjXzdjlokzz2yZsPZG/gtKKhrB2mpmSDjvQLcZkPPQVL2YFO4rsocrEw3CtXZ6Tx9hZmXO9QgxxdkDoz4ZfxikqZyDAriTp3mZ99/xFj3o9EVy9bXTOIyLrT56W5V0I2XtvIzh7cguqslzPyzu1aC7MFr85cTChLiyL+GchvkzRJAKuYA0jYBIOlFYKVWDFZu5wLSRjHwg02L6DXoFCJ+7vyjpAWbrvvwZ0B73A12/P7foLC+gFB98g7H3nsftZ+6h9h6RoXOzzBmWklTpKGoLsb1GGPWOEYR+6m7iL33Lmo99XMuUMDyO1sFzEPNs0kmztVgTstPiJZTY+nwHktr3VNSy7Hb8kJrd2qR7aix0HrEWGg/oS8/dE1RP/SJ1N43EQI5y+6a8Pfawrs977asjnnbZPsmNdAgvP70YxkElm9rPaPe2jyo2dLYrylpG1ZWnri3xu5+JIPrwmoHfqtrcD/Ej1w78BCA7N6LQNfchQOx2Fw/GBhKvM5Myd7exjKVNDFNW3czzPWOgPRNdlZF6zF27/XHAS3HLzEaj5xjNnefZ1U2vE/Fw7a5GWQP31yYQAaHy6azWIJ5U97Ak2bZrFlN2lGcSWpeBQifzyw
IFotCgVxMnLMo/WKaaOro4WhiNVJ9pEaYHqtFDFFqNGv3UaQbw4jLVU7mk+f+2qJydopcGZKiUPHV8rUSbXIW3zR8MxD8yZBHyHNeFKZ1nRQbOrvR1APH0Myu46I8xzlJe4fD1//se4AsIBDkH3L6LsPwbbLxA7ckqWWvOLm9U6Rq2ycyHHJIzGevBi16B7JowJBGX1LL/m/7qunCrTldBE9QtHcv1fDxHX9kpkqAh8s1vELRWIzPi5pks/VjXK4REyBxr4RogkcgUGI0Or6rh84AOYNXfG/4h9fBRlubz/nS0lXYbD2fHoG7s1hZGCqkvF4TrcKiEyOxGKXqjSIhBROKmZiAL/NQqbE4YGAg4elfvwQa97Av9nX1uCyvrCRvt1rxJtKwqQgC8nhKD4JEYEJUMR3ES5jmcJMxoVDuESARcD7Gw+EwZ06yIAx2u31WkiET5s/N0TSdrnnFYi+JoYe+ZAro9fpZSYFMmD/3NviBFPmCcbiCgPn77r2HRgPfQeS3LRrS65oDKzTxWk5EgjEMaHKQNdGaqCRlGvufVe14L/yXMgVgYWEL2COVgqsIgttzSX4UQCDigAFEuPlo1V72LzfsXH3emIOcj9dw+1PNQXeKKmR3dx3hgv+PSK10KgErXAtehCM+T7G8YiJWUkF+o4iKw2LjIzCFQoShCOIJC0MxqRTFvBKyb+xlGf+1TCbEQkNRN5WqWQB/u3r/wjh6mexH+CYmnoRJQ4kQhAjBoJSSsBApcVoR7vcynJfyPJSZ+F8ZN2kUZW/5Ryg35VspK/YbGc8IprbUgKnSWvCVKQ98WlL9ved7GYH6Q059EYswvk2IV2JJavV0XJz6WUI485VclOrLoMT3XnuEQjzgeOAXJeKV1JeEr8154EUIQ/4fCS3yWQhN/pxHSX4mokY8l1DlL0NZMeBd6A2dvPTNIlOSMF97ezeCcQN/PLMLavCXxRoOVguiVoSrTOyLIbLI8ugkPkORQcGC3xH0L7Z0n1fd60kf5a8V5icZQlmaXL4yySRw5OrETy0a6v8AG2qzeCeNScoAAAAASUVORK5CYII=&amp;#x27;); background-size: cover; display: block;&quot;&gt;&lt;/span&gt;
  &lt;picture&gt;
        &lt;source srcset=&quot;/static/da29b8bd00ccba3578fdfe91cd7f34bc/c54d4/marked-steps.webp 175w, /static/da29b8bd00ccba3578fdfe91cd7f34bc/a3432/marked-steps.webp 350w, /static/da29b8bd00ccba3578fdfe91cd7f34bc/3be34/marked-steps.webp 513w&quot; sizes=&quot;(max-width: 513px) 100vw, 513px&quot; type=&quot;image/webp&quot;&gt;
        &lt;source srcset=&quot;/static/da29b8bd00ccba3578fdfe91cd7f34bc/17006/marked-steps.png 175w, /static/da29b8bd00ccba3578fdfe91cd7f34bc/d6f3f/marked-steps.png 350w, /static/da29b8bd00ccba3578fdfe91cd7f34bc/4df79/marked-steps.png 513w&quot; sizes=&quot;(max-width: 513px) 100vw, 513px&quot; type=&quot;image/png&quot;&gt;
        &lt;img class=&quot;gatsby-resp-image-image&quot; src=&quot;/static/da29b8bd00ccba3578fdfe91cd7f34bc/4df79/marked-steps.png&quot; alt=&quot;marked steps&quot; title=&quot;marked steps&quot; loading=&quot;lazy&quot; style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;&gt;
      &lt;/picture&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/body&gt;&lt;/html&gt;&lt;/p&gt;
&lt;p&gt;Based on the DAG, DVC knows that the changed &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;train_model.R&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; file will only change
the following files: &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;glmnet.Rdata&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; and &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;evaluation.txt&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;. If we want to see our new
results we need to execute only the &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;train_model.R&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; and &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;evaluate.R&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; jobs. Conveniently,
we don’t have to keep track of which steps need to be repeated: the
&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc repro&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; command does that for us. Here is a code example:&lt;/p&gt;
&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;dvc&quot;&gt;&lt;pre class=&quot;language-dvc&quot;&gt;&lt;code class=&quot;language-dvc&quot;&gt;&lt;span class=&quot;token line&quot;&gt;&lt;span class=&quot;token input&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;token command&quot;&gt;vi&lt;/span&gt; code/train_model.R
&lt;/span&gt;&lt;span class=&quot;token line&quot;&gt;&lt;span class=&quot;token input&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;token git&quot;&gt;git commit&lt;/span&gt; -am &lt;span class=&quot;token string&quot;&gt;&quot;Ridge penalty instead of lasso&quot;&lt;/span&gt;
&lt;/span&gt;&lt;span class=&quot;token line&quot;&gt;&lt;span class=&quot;token input&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;token dvc&quot;&gt;dvc repro&lt;/span&gt; data/evaluation.txt
&lt;/span&gt;
Reproducing run command for data item data/glmnet.Rdata. Args: Rscript code/train_model.R data/matrix_train.txt 20170426 data/glmnet.Rdata
Reproducing run command for data item data/evaluation.txt. Args: Rscript code/evaluate.R data/glmnet.Rdata data/matrix_test.txt data/evaluation.txt

&lt;span class=&quot;token line&quot;&gt;&lt;span class=&quot;token input&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;token command&quot;&gt;cat&lt;/span&gt; data/evaluation.txt
&lt;/span&gt;&quot;AUC for the test file is :  0.947697381983095&quot;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/body&gt;&lt;/html&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc repro&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; always re-executes the steps that are affected by the latest
changes. It knows what needs to be reproduced.&lt;/p&gt;
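&lt;p&gt;The idea of re-running only the affected steps can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not how DVC is actually implemented: the &lt;code class=&quot;language-text&quot;&gt;STAGES&lt;/code&gt; table, its stage names and the function are hypothetical, and each stage is assumed to list its script as one of its dependencies.&lt;/p&gt;

```python
# Hypothetical description of the last three stages of the pipeline.
STAGES = {
    "featurization": {
        "deps": ["code/featurization.R", "data/train_post.csv", "data/test_post.csv"],
        "outs": ["data/matrix_train.txt", "data/matrix_test.txt"],
    },
    "train_model": {
        "deps": ["code/train_model.R", "data/matrix_train.txt"],
        "outs": ["data/glmnet.Rdata"],
    },
    "evaluate": {
        "deps": ["code/evaluate.R", "data/glmnet.Rdata", "data/matrix_test.txt"],
        "outs": ["data/evaluation.txt"],
    },
}


def stages_to_rerun(stages, changed_files):
    """Return the set of stages whose results are invalidated by the changes."""
    dirty = set(changed_files)  # files whose content is no longer trusted
    rerun = set()
    progress = True
    while progress:  # repeat until no new stage becomes dirty
        progress = False
        for name, stage in stages.items():
            if name not in rerun and dirty.intersection(stage["deps"]):
                rerun.add(name)
                dirty.update(stage["outs"])  # its outputs are now stale too
                progress = True
    return rerun


affected = stages_to_rerun(STAGES, ["code/train_model.R"])
```

&lt;p&gt;Editing the training script marks the training and evaluation stages for re-execution and leaves featurization alone, which matches the two steps reproduced in the log above. The function returns a set; choosing the order in which to run the affected stages is a separate step.&lt;/p&gt;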
&lt;p&gt;DVC can also work in a &lt;em&gt;“multi-user environment”&lt;/em&gt;. Pipelines (dependency
graphs) are visible to other colleagues if we are working in a team and using
Git as our version control tool. Data files can be shared if we set up cloud
storage, and with &lt;em&gt;dvc sync&lt;/em&gt; we specify which data can be shared and used by other
users. In that case other users can see the shared data and reproduce results
with that data and their own code changes.&lt;/p&gt;
&lt;h2&gt;Summary&lt;/h2&gt;
&lt;p&gt;DVC improves and accelerates iterative development and helps keep track
of ML processes and file dependencies in a simple form. In the R example we
saw how DVC memorizes the dependency graph and, based on that graph, re-executes only
the jobs affected by the latest changes. It can also work in a multi-user
environment where dependency graphs, code and data can be shared among multiple
users. Because it is language agnostic, DVC allows us to work with multiple
programming languages within a single data science project.&lt;/p&gt;</content:encoded></item><item><title><![CDATA[How A Data Scientist Can Improve His Productivity]]></title><link>https://blog.dvc.org/how-a-data-scientist-can-improve-his-productivity</link><guid isPermaLink="false">https://blog.dvc.org/how-a-data-scientist-can-improve-his-productivity</guid><pubDate>Mon, 15 May 2017 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Data science and machine learning are iterative processes. It is never possible
to successfully complete a data science project in a single pass. A data
scientist constantly tries new ideas and changes steps of his pipeline:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;extract new features and accidentally find noise in the data;&lt;/li&gt;
&lt;li&gt;clean up the noise, find one more promising feature;&lt;/li&gt;
&lt;li&gt;extract the new feature;&lt;/li&gt;
&lt;li&gt;rebuild and validate the model, realize that the learning algorithm
parameters are not perfect for the new feature set;&lt;/li&gt;
&lt;li&gt;change machine learning algorithm parameters and retrain the model;&lt;/li&gt;
&lt;li&gt;find the ineffective feature subset and remove it from the feature set;&lt;/li&gt;
&lt;li&gt;try a few more new features;&lt;/li&gt;
&lt;li&gt;try another ML algorithm. And then a data format change is required.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This is only a small episode in a data scientist’s daily life and it is what
makes our job different from a regular engineering job.&lt;/p&gt;
&lt;p&gt;Business context, ML algorithm knowledge and intuition all help you to find a
good model faster. But you never know for sure what ideas will bring you the
best value.&lt;/p&gt;
&lt;p&gt;This is why iteration time is a critical parameter in the data science process.
The quicker you iterate, the more ideas you can check and the better the model you can build.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;“A well-engineered pipeline gets data scientists iterating much faster, which
can be a big competitive edge” From
&lt;a href=&quot;http://blog.untrod.com/2012/10/engineering-practices-in-data-science.html&quot;&gt;Engineering Practices in Data Science&lt;/a&gt;
By Chris Clark.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;A data science iteration tool&lt;/h2&gt;
&lt;p&gt;To speed up the iterations in data science projects we have created an open
source tool &lt;a href=&quot;http://dvc.org&quot;&gt;data version control&lt;/a&gt; or &lt;a href=&quot;http://dvc.org&quot;&gt;DVC.org&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;DVC takes care of the dependencies between the commands you run, the generated data
files, and the code files, and allows you to easily reproduce any step of your
research whenever files change.&lt;/p&gt;
&lt;p&gt;You can think about DVC as a Makefile for a data science project even though you
do not create a file explicitly. DVC tracks dependencies in your data science
projects when you run data processing or modeling code through a special
command:&lt;/p&gt;
&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;dvc&quot;&gt;&lt;pre class=&quot;language-dvc&quot;&gt;&lt;code class=&quot;language-dvc&quot;&gt;&lt;span class=&quot;token line&quot;&gt;&lt;span class=&quot;token input&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;token dvc&quot;&gt;dvc run&lt;/span&gt; python code/xml_to_tsv.py &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;
                 data/Posts.xml data/Posts.tsv&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/body&gt;&lt;/html&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;dvc run&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; works as a proxy for your commands. This allows DVC to track input and
output files, construct the dependency graph
(&lt;a href=&quot;https://en.wikipedia.org/wiki/Directed_acyclic_graph&quot;&gt;DAG&lt;/a&gt;), and store the
command and its parameters so that it can be reproduced later.&lt;/p&gt;
&lt;p&gt;The previous command is automatically chained to the next one because
the file &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;data/Posts.tsv&lt;/code&gt;&lt;/body&gt;&lt;/html&gt; is an output of the previous command and an input
of the next one:&lt;/p&gt;
&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;dvc&quot;&gt;&lt;pre class=&quot;language-dvc&quot;&gt;&lt;code class=&quot;language-dvc&quot;&gt;&lt;span class=&quot;token comment&quot;&gt;# Split training and testing dataset. Two output files.&lt;/span&gt;
&lt;span class=&quot;token comment&quot;&gt;# 0.33 is the test dataset splitting ratio.&lt;/span&gt;
&lt;span class=&quot;token comment&quot;&gt;# 20170426 is a seed for randomization.&lt;/span&gt;
&lt;span class=&quot;token line&quot;&gt;&lt;span class=&quot;token input&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;token dvc&quot;&gt;dvc run&lt;/span&gt; python code/split_train_test.py &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;
                 data/Posts.tsv &lt;span class=&quot;token number&quot;&gt;0.33&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;20170426&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;\&lt;/span&gt;
                 data/Posts-train.tsv data/Posts-test.tsv&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/body&gt;&lt;/html&gt;
&lt;p&gt;DVC derives the dependencies automatically by looking at the list of
parameters (even if your code ignores the parameters) and noting which files
change between the start and the end of the command.&lt;/p&gt;
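&lt;p&gt;The idea of deriving dependencies from a command line can be sketched in Python. This is a minimal illustration, not DVC’s actual implementation: the helper names &lt;code&gt;snapshot&lt;/code&gt; and &lt;code&gt;run_and_track&lt;/code&gt; are hypothetical, and the sketch assumes that outputs are the files whose content changed during the run and that inputs are the arguments naming files that already existed.&lt;/p&gt;

```python
import hashlib
import os
import subprocess
import sys
import tempfile


def snapshot(root):
    """Map every file path under root to a hash of its content."""
    state = {}
    for folder, _, names in os.walk(root):
        for name in names:
            path = os.path.join(folder, name)
            with open(path, "rb") as handle:
                state[path] = hashlib.sha256(handle.read()).hexdigest()
    return state


def run_and_track(command, root):
    """Run a command and guess its inputs and outputs (hypothetical helper)."""
    before = snapshot(root)
    subprocess.run(command, check=True)
    after = snapshot(root)
    # Outputs: files that are new or whose content changed during the run.
    outputs = sorted(p for p in after if before.get(p) != after[p])
    # Inputs: arguments naming files that existed before and were not rewritten.
    inputs = sorted(a for a in command if a in before and a not in outputs)
    return {"cmd": command, "inputs": inputs, "outputs": outputs}


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        src = os.path.join(root, "Posts.xml")
        dst = os.path.join(root, "Posts.tsv")
        with open(src, "w") as handle:
            handle.write("raw posts")
        # Stand-in for a real conversion script: copy src to dst in upper case.
        script = "import sys; open(sys.argv[2], 'w').write(open(sys.argv[1]).read().upper())"
        stage = run_and_track([sys.executable, "-c", script, src, dst], root)
        print("inputs:", [os.path.basename(p) for p in stage["inputs"]])
        print("outputs:", [os.path.basename(p) for p in stage["outputs"]])
```

&lt;p&gt;Hashing file content rather than comparing timestamps is a design choice made here for simplicity; it keeps the sketch independent of file system clock resolution.&lt;/p&gt;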
&lt;p&gt;If you change one of your dependencies (data or code) then all the affected
steps of the pipeline will be reproduced:&lt;/p&gt;
&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;dvc&quot;&gt;&lt;pre class=&quot;language-dvc&quot;&gt;&lt;code class=&quot;language-dvc&quot;&gt;&lt;span class=&quot;token comment&quot;&gt;# Change the data preparation code.&lt;/span&gt;
&lt;span class=&quot;token line&quot;&gt;&lt;span class=&quot;token input&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;token command&quot;&gt;vi&lt;/span&gt; code/xml_to_tsv.py
&lt;/span&gt;
&lt;span class=&quot;token comment&quot;&gt;# Reproduce.&lt;/span&gt;
&lt;span class=&quot;token line&quot;&gt;&lt;span class=&quot;token input&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;token dvc&quot;&gt;dvc repro&lt;/span&gt; data/Posts-train.tsv
&lt;/span&gt;Reproducing run command for data item data/Posts.tsv.
Reproducing run command for data item data/Posts-train.tsv.&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/body&gt;&lt;/html&gt;
&lt;p&gt;This is a powerful way of quickly iterating through your pipeline.&lt;/p&gt;
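&lt;p&gt;The reproduction step above can be sketched as a walk over the dependency graph. This is an illustration under stated assumptions, not DVC’s own algorithm: the function &lt;code&gt;stages_to_reproduce&lt;/code&gt; and the stage dictionaries are hypothetical, and the sketch assumes that every output of a rerun stage counts as changed.&lt;/p&gt;

```python
def stages_to_reproduce(stages, changed_files, target):
    """Return the commands to rerun, in order, so that target is up to date."""
    # Map each output file to the stage that produces it.
    producer = {out: stage for stage in stages for out in stage["outputs"]}
    rerun = []
    stale = {}

    def visit(path):
        # Returns True when path is out of date.
        if path in stale:
            return stale[path]
        stage = producer.get(path)
        if stage is None:
            # A source file (data or code): stale only if the user changed it.
            stale[path] = path in changed_files
            return stale[path]
        # Visit every input so that all upstream stages are scheduled first.
        dirty = [visit(dep) for dep in stage["inputs"]]
        is_stale = any(dirty)
        for out in stage["outputs"]:
            stale[out] = is_stale
        if is_stale:
            rerun.append(stage["cmd"])
        return is_stale

    visit(target)
    return rerun


if __name__ == "__main__":
    pipeline = [
        {
            "cmd": "python code/xml_to_tsv.py data/Posts.xml data/Posts.tsv",
            "inputs": ["code/xml_to_tsv.py", "data/Posts.xml"],
            "outputs": ["data/Posts.tsv"],
        },
        {
            "cmd": "python code/split_train_test.py data/Posts.tsv ...",
            "inputs": ["code/split_train_test.py", "data/Posts.tsv"],
            "outputs": ["data/Posts-train.tsv", "data/Posts-test.tsv"],
        },
    ]
    # Editing the preparation code makes both stages stale.
    for cmd in stages_to_reproduce(
        pipeline, {"code/xml_to_tsv.py"}, "data/Posts-train.tsv"
    ):
        print("Reproducing:", cmd)
```

&lt;p&gt;With the preparation code changed, the sketch schedules the conversion stage first and the split stage second, which mirrors the order of the two messages printed by the reproduction command above.&lt;/p&gt;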
&lt;p&gt;A pipeline might have many steps, with any form of acyclic dependencies between
them. Below is an example of a canonical machine learning pipeline (more
details in &lt;a href=&quot;https://dvc.org/doc/tutorials&quot;&gt;the DVC tutorials&lt;/a&gt;):&lt;/p&gt;
&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;dvc&quot;&gt;&lt;pre class=&quot;language-dvc&quot;&gt;&lt;code class=&quot;language-dvc&quot;&gt;&lt;span class=&quot;token comment&quot;&gt;# Install DVC&lt;/span&gt;
&lt;span class=&quot;token line&quot;&gt;&lt;span class=&quot;token input&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;token command&quot;&gt;pip&lt;/span&gt; install dvc&lt;/span&gt;

&lt;span class=&quot;token comment&quot;&gt;# Initialize DVC repository&lt;/span&gt;
&lt;span class=&quot;token line&quot;&gt;&lt;span class=&quot;token input&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;token dvc&quot;&gt;dvc init&lt;/span&gt;&lt;/span&gt;

&lt;span class=&quot;token comment&quot;&gt;# Download a file and put to data/ directory.&lt;/span&gt;
&lt;span class=&quot;token line&quot;&gt;&lt;span class=&quot;token input&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;token dvc&quot;&gt;dvc import&lt;/span&gt; https://s3-us-west-2.amazonaws.com/dvc-share/so/25K/Posts.xml.tgz data/&lt;/span&gt;

&lt;span class=&quot;token comment&quot;&gt;# Extract XML from the archive.&lt;/span&gt;
&lt;span class=&quot;token line&quot;&gt;&lt;span class=&quot;token input&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;token dvc&quot;&gt;dvc run&lt;/span&gt; tar zxf data/Posts.xml.tgz -C data/&lt;/span&gt;

&lt;span class=&quot;token comment&quot;&gt;# Prepare data.&lt;/span&gt;
&lt;span class=&quot;token line&quot;&gt;&lt;span class=&quot;token input&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;token dvc&quot;&gt;dvc run&lt;/span&gt; python code/xml_to_tsv.py data/Posts.xml data/Posts.tsv python&lt;/span&gt;

&lt;span class=&quot;token comment&quot;&gt;# Split training and testing dataset. Two output files.&lt;/span&gt;
&lt;span class=&quot;token comment&quot;&gt;# 0.33 is the test dataset splitting ratio. 20170426 is a seed for randomization.&lt;/span&gt;
&lt;span class=&quot;token line&quot;&gt;&lt;span class=&quot;token input&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;token dvc&quot;&gt;dvc run&lt;/span&gt; python code/split_train_test.py data/Posts.tsv 0.33 20170426 data/Posts-train.tsv data/Posts-test.tsv&lt;/span&gt;

&lt;span class=&quot;token comment&quot;&gt;# Extract features from text data. Two TSV inputs and two pickle matrixes outputs.&lt;/span&gt;
&lt;span class=&quot;token line&quot;&gt;&lt;span class=&quot;token input&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;token dvc&quot;&gt;dvc run&lt;/span&gt; python code/featurization.py data/Posts-train.tsv data/Posts-test.tsv data/matrix-train.p data/matrix-test.p&lt;/span&gt;

&lt;span class=&quot;token comment&quot;&gt;# Train ML model out of the training dataset. 20170426 is another seed value.&lt;/span&gt;
&lt;span class=&quot;token line&quot;&gt;&lt;span class=&quot;token input&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;token dvc&quot;&gt;dvc run&lt;/span&gt; python code/train_model.py data/matrix-train.p 20170426 data/model.p&lt;/span&gt;

&lt;span class=&quot;token comment&quot;&gt;# Evaluate the model by the testing dataset.&lt;/span&gt;
&lt;span class=&quot;token line&quot;&gt;&lt;span class=&quot;token input&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;token dvc&quot;&gt;dvc run&lt;/span&gt; python code/evaluate.py data/model.p data/matrix-test.p data/evaluation.txt&lt;/span&gt;

&lt;span class=&quot;token comment&quot;&gt;# The result.&lt;/span&gt;
&lt;span class=&quot;token line&quot;&gt;&lt;span class=&quot;token input&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;token command&quot;&gt;cat&lt;/span&gt; data/evaluation.txt&lt;/span&gt;
AUC: 0.596182&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/body&gt;&lt;/html&gt;
&lt;p&gt;&lt;a href=&quot;https://gist.github.com/dmpetrov/7704a5156bdc32c7379580a61e2fe3b6#file-dvc_pipeline-sh&quot;&gt;dvc_pipeline.sh&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;Why are regular pipeline tools not enough?&lt;/h2&gt;
&lt;blockquote&gt;
&lt;p&gt;“Workflows are expected to be mostly static or slowly changing.” (See
&lt;a href=&quot;https://airflow.incubator.apache.org/&quot;&gt;Airflow&lt;/a&gt;.)&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Regular pipeline tools like &lt;a href=&quot;http://airflow.incubator.apache.org&quot;&gt;Airflow&lt;/a&gt; and
&lt;a href=&quot;https://github.com/spotify/luigi&quot;&gt;Luigi&lt;/a&gt; are good for representing static and
fault-tolerant workflows. A huge portion of their functionality is built for
monitoring, optimization and fault tolerance. These are very important and
business-critical problems. However, these problems are irrelevant to data
scientists’ daily lives.&lt;/p&gt;
&lt;p&gt;Data scientists need a lightweight, dynamic workflow management system. In
contrast to traditional Airflow-like systems, DVC reflects the process of
researching and looking for a great model (and pipeline), not optimizing and
monitoring an existing one. This is why DVC is a good fit for iterative machine
learning processes. Once a good model has been found with DVC, the result can
be incorporated into a data engineering pipeline (Luigi or Airflow).&lt;/p&gt;
&lt;h2&gt;Pipelines and data sharing&lt;/h2&gt;
&lt;p&gt;In addition to pipeline description, data reproduction and its dynamic nature, DVC
has one more important feature: it was designed in accordance with the best
software engineering practices. DVC is based on Git. It keeps the code and stores the
DAG in the Git repository, which allows you to share your research results. However,
it moves the actual file content outside the Git repository (into the &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;.cache&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;
directory, which DVC adds to &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;code class=&quot;language-text&quot;&gt;.gitignore&lt;/code&gt;&lt;/body&gt;&lt;/html&gt;), since Git is not designed to
accommodate large data files.&lt;/p&gt;
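&lt;p&gt;The idea of keeping large content out of Git can be sketched as follows. This is not DVC’s actual implementation: the checksum-based cache layout and the helper names &lt;code&gt;move_to_cache&lt;/code&gt; and &lt;code&gt;checkout&lt;/code&gt; are assumptions made for illustration. Only the small record returned by &lt;code&gt;move_to_cache&lt;/code&gt; would need to be stored in Git.&lt;/p&gt;

```python
import hashlib
import os
import shutil
import tempfile


def move_to_cache(path, cache_dir):
    """Move a data file into cache_dir and return a small record describing it."""
    with open(path, "rb") as handle:
        digest = hashlib.sha256(handle.read()).hexdigest()
    os.makedirs(cache_dir, exist_ok=True)
    # The cached copy is named after its checksum (an assumption of this sketch).
    shutil.move(path, os.path.join(cache_dir, digest))
    return {"path": path, "checksum": digest}


def checkout(record, cache_dir):
    """Restore a data file from the cache using its record."""
    shutil.copyfile(os.path.join(cache_dir, record["checksum"]), record["path"])


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        data_file = os.path.join(root, "Posts.tsv")
        cache_dir = os.path.join(root, ".cache")
        with open(data_file, "w") as handle:
            handle.write("id\ttext\n")
        record = move_to_cache(data_file, cache_dir)
        print("in workspace after caching:", os.path.exists(data_file))
        checkout(record, cache_dir)
        print("in workspace after checkout:", os.path.exists(data_file))
```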
&lt;p&gt;The data files can be shared between data scientists through cloud storage
using a simple command:&lt;/p&gt;
&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;dvc&quot;&gt;&lt;pre class=&quot;language-dvc&quot;&gt;&lt;code class=&quot;language-dvc&quot;&gt;&lt;span class=&quot;token comment&quot;&gt;# Data scientist 1 syncs data to the cloud.&lt;/span&gt;
&lt;span class=&quot;token line&quot;&gt;&lt;span class=&quot;token input&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;token command&quot;&gt;dvc&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;sync&lt;/span&gt; data/&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/body&gt;&lt;/html&gt;
&lt;p&gt;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;span class=&quot;gatsby-resp-image-wrapper&quot; style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto;  max-width: 307px;&quot;&gt;
      &lt;a class=&quot;gatsby-resp-image-link&quot; href=&quot;/static/6890171452971f3e3cd847014a526e03/937a5/git-server-or-github.jpg&quot; style=&quot;display: block&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;
    &lt;span class=&quot;gatsby-resp-image-background-image&quot; style=&quot;padding-bottom: 58.63192182410424%; position: relative; bottom: 0; left: 0; background-size: cover; display: block;&quot;&gt;&lt;/span&gt;
  &lt;picture&gt;
        &lt;source srcset=&quot;/static/6890171452971f3e3cd847014a526e03/c54d4/git-server-or-github.webp 175w, /static/6890171452971f3e3cd847014a526e03/a3432/git-server-or-github.webp 350w, /static/6890171452971f3e3cd847014a526e03/5316f/git-server-or-github.webp 614w&quot; sizes=&quot;(max-width: 614px) 100vw, 614px&quot; type=&quot;image/webp&quot;&gt;
        &lt;source srcset=&quot;/static/6890171452971f3e3cd847014a526e03/8dc06/git-server-or-github.jpg 175w, /static/6890171452971f3e3cd847014a526e03/f4417/git-server-or-github.jpg 350w, /static/6890171452971f3e3cd847014a526e03/937a5/git-server-or-github.jpg 614w&quot; sizes=&quot;(max-width: 614px) 100vw, 614px&quot; type=&quot;image/jpeg&quot;&gt;
        &lt;img class=&quot;gatsby-resp-image-image&quot; src=&quot;/static/6890171452971f3e3cd847014a526e03/937a5/git-server-or-github.jpg&quot; alt=&quot;git server or github&quot; title=&quot;git server or github&quot; loading=&quot;lazy&quot; style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;&gt;
      &lt;/picture&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/body&gt;&lt;/html&gt;&lt;/p&gt;
&lt;p&gt;Currently, AWS S3 and GCP storage are supported by DVC.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;The productivity of data scientists can be improved by speeding up the iteration
process, and DVC is designed to do exactly that.&lt;/p&gt;
&lt;p&gt;We are very interested in your opinion and feedback. Please post your comments
here or contact us on Twitter — &lt;a href=&quot;https://twitter.com/FullStackML&quot;&gt;FullStackML&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;If you found this tool useful, &lt;strong&gt;please “star” the
&lt;a href=&quot;https://github.com/iterative/dvc&quot;&gt;DVC GitHub repository&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;</content:encoded></item></channel></rss>