{"componentChunkName":"component---src-templates-blog-post-tsx","path":"/r-code-and-reproducible-model-development-with-dvc","result":{"data":{"markdownRemark":{"id":"e52e4195-4241-514b-8e1e-fff9130a4038","excerpt":"<p><a href=\"https://dvc.org\">DVC</a>, the Data Version Control tool, tracks\nfile and data dependencies during model development in order to facilitate…</p>","html":"<p><a href=\"https://dvc.org\">DVC</a>, the Data Version Control tool, tracks\nfile and data dependencies during model development in order to facilitate\nreproducibility and to version data files. Most of the\n<a href=\"https://dvc.org/doc/tutorials\">DVC tutorials</a> provide good examples of using\nDVC with Python. However, I realized that DVC is a\n<a href=\"https://en.wikipedia.org/wiki/Language-agnostic\">language-agnostic</a> tool that\ncan be used with any programming language. In this blog post, we will see how to\nuse DVC in R projects.</p>\n<h2>R coding: keep it simple and readable</h2>\n<p>Model development is always a combination of the steps presented below:</p>\n<p><html><head></head><body><span class=\"gatsby-resp-image-wrapper\" style=\"position: relative; display: block; margin-left: auto; margin-right: auto;  max-width: 342px;\">\n      <a class=\"gatsby-resp-image-link\" href=\"/static/fdf37f71d0c9ecd4d9f1b7f0ec446abf/3dead/development-steps.png\" style=\"display: block\" target=\"_blank\" rel=\"noopener\">\n    <span class=\"gatsby-resp-image-background-image\" style=\"padding-bottom: 27.046783625730995%; position: relative; bottom: 0; left: 0; background-image: 
url(&#x27;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAFCAYAAABFA8wzAAAACXBIWXMAAAsSAAALEgHS3X78AAABgUlEQVQY00VPO0sjURgddAXRYlX0f1hoL1usICrYqfgqBNElsCsSxEYLQcFOxEbwFYiCMIKPJDObzNw7c19z585MoojPVBb+B4P5vFZ+cOA7cDgPw2eyruLkI6ncwn/EU0yEj0n5Bu7uH6GIxRgmYvvh6VnzB8A02LddPhonNxCqGJiQL7kSXfApB4+wOqG8bpQcVDs3zXfHcYEE8V/s+feWZYEMQ/B4NEl5sINcF1wNyuUhZmoid30Nx8cZwBg/69CUaZqQyWRqtm3XDBlGYGsDpSIgQi0xLl4rlTJUq1XwmJqmTOzFUQRSSt2CnxARTyVxAhhh4Fy8FUr+P6tQgGz2BBzHAYPwMK/issVlhCyXjWi+L/Svp1Mb8V+IBH+kSpiG79FgUWv6uFRIa1zd+KDoySHCghLymSUClTe+b7lxbJP8MJp2G7oXLpqH19yW1SPZmt6TLb3zZ609c6ed4+t22+xWseN3+rJjcOWqa2Al/3Nmw2rvX853dk3kmr9cPgEq+g6Kf1zjjgAAAABJRU5ErkJggg==&#x27;); background-size: cover; display: block;\"></span>\n  <picture>\n        <source srcset=\"/static/fdf37f71d0c9ecd4d9f1b7f0ec446abf/c54d4/development-steps.webp 175w, /static/fdf37f71d0c9ecd4d9f1b7f0ec446abf/a3432/development-steps.webp 350w, /static/fdf37f71d0c9ecd4d9f1b7f0ec446abf/e213b/development-steps.webp 684w\" sizes=\"(max-width: 684px) 100vw, 684px\" type=\"image/webp\">\n        <source srcset=\"/static/fdf37f71d0c9ecd4d9f1b7f0ec446abf/17006/development-steps.png 175w, /static/fdf37f71d0c9ecd4d9f1b7f0ec446abf/d6f3f/development-steps.png 350w, /static/fdf37f71d0c9ecd4d9f1b7f0ec446abf/3dead/development-steps.png 684w\" sizes=\"(max-width: 684px) 100vw, 684px\" type=\"image/png\">\n        <img class=\"gatsby-resp-image-image\" src=\"/static/fdf37f71d0c9ecd4d9f1b7f0ec446abf/3dead/development-steps.png\" alt=\"Model development process\" title=\"Model development process\" loading=\"lazy\" style=\"width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;\">\n      </picture>\n  </a>\n    </span></body></html>\n<em>Model development process</em></p>\n<p>Because of the specificity of the process — iterative development, it is very\nimportant to improve some coding and organizational skills. 
For example, instead\nof keeping all the code in one big R file, it is better to split it into several logical\nfiles, each responsible for one small piece of work. It is smart to track\nthe development history with\n<a href=\"https://git-scm.com/book/en/v2/Getting-Started-About-Version-Control\">git</a>.\nWriting “<em>reusable code</em>” is a nice skill to have, and putting comments in the code\ncan make our lives easier.</p>\n<p>Besides git, the next step in improving the workflow is to try out DVC.\nEvery time a change is committed to some of the scripts or data sets, DVC\ncan reproduce the results with just one bash command on Linux (or in a Windows\nenvironment). It records the dependencies among data files and scripts, so it can\nrerun all the necessary steps in the right order instead of us worrying about it.</p>\n<h2>R example: data and code clarification</h2>\n<p>We’ll take a Python example from the\n<a href=\"https://dvc.org/doc/tutorials/deep\">DVC tutorial</a> (written by Dmitry Petrov)\nand rewrite that code in R. With this example we’ll show how DVC can help during\ndevelopment and what it makes possible.</p>\n<p>First, let’s initialize git and DVC in the example project and run our scripts for\nthe first time. After that, we will simulate some changes in the code and see\nhow DVC handles reproducibility.</p>\n<p>The R scripts can be downloaded from the\n<a href=\"https://github.com/Zoldin/R_AND_DVC\">GitHub repository</a>. 
A brief explanation of\neach script is presented below:</p>\n<p><strong>parsingxml.R</strong>: takes the XML file that we downloaded from the web and creates\nthe corresponding CSV file.</p>\n<p><html><head></head><body><div id=\"gist71114089\" class=\"gist\">\n    <div class=\"gist-file\">\n      <div class=\"gist-data\">\n        <div class=\"js-gist-file-update-container js-task-list-container file-box\">\n  <div id=\"file-parsingxml-r\" class=\"file\">\n    \n\n  <div itemprop=\"text\" class=\"Box-body p-0 blob-wrapper data type-r\">\n      \n<table class=\"highlight tab-size js-file-line-container\" data-tab-size=\"8\">\n      <tbody><tr>\n        <td id=\"file-parsingxml-r-L1\" class=\"blob-num js-line-number\" data-line-number=\"1\"></td>\n        <td id=\"file-parsingxml-r-LC1\" class=\"blob-code blob-code-inner js-file-line\"><span class=\"pl-c\"><span class=\"pl-c\">#</span>!/usr/bin/Rscript</span></td>\n      </tr>\n      <tr>\n        <td id=\"file-parsingxml-r-L2\" class=\"blob-num js-line-number\" data-line-number=\"2\"></td>\n        <td id=\"file-parsingxml-r-LC2\" class=\"blob-code blob-code-inner js-file-line\">library(<span class=\"pl-smi\">XML</span>)</td>\n      </tr>\n      <tr>\n        <td id=\"file-parsingxml-r-L3\" class=\"blob-num js-line-number\" data-line-number=\"3\"></td>\n        <td id=\"file-parsingxml-r-LC3\" class=\"blob-code blob-code-inner js-file-line\">\n</td>\n      </tr>\n      <tr>\n        <td id=\"file-parsingxml-r-L4\" class=\"blob-num js-line-number\" data-line-number=\"4\"></td>\n        <td id=\"file-parsingxml-r-LC4\" class=\"blob-code blob-code-inner js-file-line\"><span class=\"pl-v\">args</span> <span class=\"pl-k\">=</span> commandArgs(<span class=\"pl-v\">trailingOnly</span><span class=\"pl-k\">=</span><span class=\"pl-c1\">TRUE</span>)</td>\n      </tr>\n      <tr>\n        <td id=\"file-parsingxml-r-L5\" class=\"blob-num js-line-number\" data-line-number=\"5\"></td>\n        <td id=\"file-parsingxml-r-LC5\" 
class=\"blob-code blob-code-inner js-file-line\"><span class=\"pl-k\">if</span> (<span class=\"pl-k\">!</span>length(<span class=\"pl-smi\">args</span>)<span class=\"pl-k\">==</span><span class=\"pl-c1\">2</span>) {</td>\n      </tr>\n      <tr>\n        <td id=\"file-parsingxml-r-L6\" class=\"blob-num js-line-number\" data-line-number=\"6\"></td>\n        <td id=\"file-parsingxml-r-LC6\" class=\"blob-code blob-code-inner js-file-line\">  stop(<span class=\"pl-s\"><span class=\"pl-pds\">\"</span>Two arguments must be supplied (input file name ,output file name - csv ext).n<span class=\"pl-pds\">\"</span></span>, <span class=\"pl-v\">call.</span><span class=\"pl-k\">=</span><span class=\"pl-c1\">FALSE</span>)</td>\n      </tr>\n      <tr>\n        <td id=\"file-parsingxml-r-L7\" class=\"blob-num js-line-number\" data-line-number=\"7\"></td>\n        <td id=\"file-parsingxml-r-LC7\" class=\"blob-code blob-code-inner js-file-line\">} </td>\n      </tr>\n      <tr>\n        <td id=\"file-parsingxml-r-L8\" class=\"blob-num js-line-number\" data-line-number=\"8\"></td>\n        <td id=\"file-parsingxml-r-LC8\" class=\"blob-code blob-code-inner js-file-line\">\n</td>\n      </tr>\n      <tr>\n        <td id=\"file-parsingxml-r-L9\" class=\"blob-num js-line-number\" data-line-number=\"9\"></td>\n        <td id=\"file-parsingxml-r-LC9\" class=\"blob-code blob-code-inner js-file-line\">\n</td>\n      </tr>\n      <tr>\n        <td id=\"file-parsingxml-r-L10\" class=\"blob-num js-line-number\" data-line-number=\"10\"></td>\n        <td id=\"file-parsingxml-r-LC10\" class=\"blob-code blob-code-inner js-file-line\"><span class=\"pl-c\"><span class=\"pl-c\">#</span>read XML line by line</span></td>\n      </tr>\n      <tr>\n        <td id=\"file-parsingxml-r-L11\" class=\"blob-num js-line-number\" data-line-number=\"11\"></td>\n        <td id=\"file-parsingxml-r-LC11\" class=\"blob-code blob-code-inner js-file-line\"><span class=\"pl-smi\">con</span> <span 
class=\"pl-k\">&#x3C;-</span> file(<span class=\"pl-smi\">args</span>[<span class=\"pl-c1\">1</span>], <span class=\"pl-s\"><span class=\"pl-pds\">\"</span>r<span class=\"pl-pds\">\"</span></span>)</td>\n      </tr>\n      <tr>\n        <td id=\"file-parsingxml-r-L12\" class=\"blob-num js-line-number\" data-line-number=\"12\"></td>\n        <td id=\"file-parsingxml-r-LC12\" class=\"blob-code blob-code-inner js-file-line\"><span class=\"pl-smi\">lines</span> <span class=\"pl-k\">&#x3C;-</span> readLines(<span class=\"pl-smi\">con</span>, <span class=\"pl-k\">-</span><span class=\"pl-c1\">1</span>)</td>\n      </tr>\n      <tr>\n        <td id=\"file-parsingxml-r-L13\" class=\"blob-num js-line-number\" data-line-number=\"13\"></td>\n        <td id=\"file-parsingxml-r-LC13\" class=\"blob-code blob-code-inner js-file-line\"><span class=\"pl-smi\">test</span> <span class=\"pl-k\">&#x3C;-</span> lapply(<span class=\"pl-smi\">lines</span>,<span class=\"pl-k\">function</span>(<span class=\"pl-smi\">x</span>){<span class=\"pl-k\">return</span>(xmlTreeParse(<span class=\"pl-smi\">x</span>,<span class=\"pl-v\">useInternalNodes</span> <span class=\"pl-k\">=</span> <span class=\"pl-c1\">TRUE</span>))})</td>\n      </tr>\n      <tr>\n        <td id=\"file-parsingxml-r-L14\" class=\"blob-num js-line-number\" data-line-number=\"14\"></td>\n        <td id=\"file-parsingxml-r-LC14\" class=\"blob-code blob-code-inner js-file-line\">\n</td>\n      </tr>\n      <tr>\n        <td id=\"file-parsingxml-r-L15\" class=\"blob-num js-line-number\" data-line-number=\"15\"></td>\n        <td id=\"file-parsingxml-r-LC15\" class=\"blob-code blob-code-inner js-file-line\"><span class=\"pl-c\"><span class=\"pl-c\">#</span>parsing XML to get variables</span></td>\n      </tr>\n      <tr>\n        <td id=\"file-parsingxml-r-L16\" class=\"blob-num js-line-number\" data-line-number=\"16\"></td>\n        <td id=\"file-parsingxml-r-LC16\" class=\"blob-code blob-code-inner js-file-line\"><span 
class=\"pl-smi\">ID</span> <span class=\"pl-k\">&#x3C;-</span> as.numeric(sapply(<span class=\"pl-smi\">test</span>,<span class=\"pl-k\">function</span>(<span class=\"pl-smi\">x</span>){<span class=\"pl-k\">return</span>(xpathSApply(<span class=\"pl-smi\">x</span>, <span class=\"pl-s\"><span class=\"pl-pds\">\"</span>//row<span class=\"pl-pds\">\"</span></span>,<span class=\"pl-smi\">xmlGetAttr</span>, <span class=\"pl-s\"><span class=\"pl-pds\">\"</span>Id<span class=\"pl-pds\">\"</span></span>))}))</td>\n      </tr>\n      <tr>\n        <td id=\"file-parsingxml-r-L17\" class=\"blob-num js-line-number\" data-line-number=\"17\"></td>\n        <td id=\"file-parsingxml-r-LC17\" class=\"blob-code blob-code-inner js-file-line\"><span class=\"pl-smi\">Tags</span> <span class=\"pl-k\">&#x3C;-</span> sapply(<span class=\"pl-smi\">test</span>,<span class=\"pl-k\">function</span>(<span class=\"pl-smi\">x</span>){<span class=\"pl-k\">return</span>(xpathSApply(<span class=\"pl-smi\">x</span>, <span class=\"pl-s\"><span class=\"pl-pds\">\"</span>//row<span class=\"pl-pds\">\"</span></span>,<span class=\"pl-smi\">xmlGetAttr</span>, <span class=\"pl-s\"><span class=\"pl-pds\">\"</span>Tags<span class=\"pl-pds\">\"</span></span>))})</td>\n      </tr>\n      <tr>\n        <td id=\"file-parsingxml-r-L18\" class=\"blob-num js-line-number\" data-line-number=\"18\"></td>\n        <td id=\"file-parsingxml-r-LC18\" class=\"blob-code blob-code-inner js-file-line\"><span class=\"pl-smi\">Title</span> <span class=\"pl-k\">&#x3C;-</span> as.character(sapply(<span class=\"pl-smi\">test</span>,<span class=\"pl-k\">function</span>(<span class=\"pl-smi\">x</span>){<span class=\"pl-k\">return</span>(xpathSApply(<span class=\"pl-smi\">x</span>, <span class=\"pl-s\"><span class=\"pl-pds\">\"</span>//row<span class=\"pl-pds\">\"</span></span>,<span class=\"pl-smi\">xmlGetAttr</span>, <span class=\"pl-s\"><span class=\"pl-pds\">\"</span>Title<span class=\"pl-pds\">\"</span></span>))}))</td>\n      
</tr>\n      <tr>\n        <td id=\"file-parsingxml-r-L19\" class=\"blob-num js-line-number\" data-line-number=\"19\"></td>\n        <td id=\"file-parsingxml-r-LC19\" class=\"blob-code blob-code-inner js-file-line\"><span class=\"pl-smi\">Body</span> <span class=\"pl-k\">&#x3C;-</span> as.character(sapply(<span class=\"pl-smi\">test</span>,<span class=\"pl-k\">function</span>(<span class=\"pl-smi\">x</span>){<span class=\"pl-k\">return</span>(xpathSApply(<span class=\"pl-smi\">x</span>, <span class=\"pl-s\"><span class=\"pl-pds\">\"</span>//row<span class=\"pl-pds\">\"</span></span>,<span class=\"pl-smi\">xmlGetAttr</span>, <span class=\"pl-s\"><span class=\"pl-pds\">\"</span>Body<span class=\"pl-pds\">\"</span></span>))}))</td>\n      </tr>\n      <tr>\n        <td id=\"file-parsingxml-r-L20\" class=\"blob-num js-line-number\" data-line-number=\"20\"></td>\n        <td id=\"file-parsingxml-r-LC20\" class=\"blob-code blob-code-inner js-file-line\"><span class=\"pl-v\">text</span> <span class=\"pl-k\">=</span> paste(<span class=\"pl-smi\">Title</span>,<span class=\"pl-smi\">Body</span>)</td>\n      </tr>\n      <tr>\n        <td id=\"file-parsingxml-r-L21\" class=\"blob-num js-line-number\" data-line-number=\"21\"></td>\n        <td id=\"file-parsingxml-r-LC21\" class=\"blob-code blob-code-inner js-file-line\">\n</td>\n      </tr>\n      <tr>\n        <td id=\"file-parsingxml-r-L22\" class=\"blob-num js-line-number\" data-line-number=\"22\"></td>\n        <td id=\"file-parsingxml-r-LC22\" class=\"blob-code blob-code-inner js-file-line\"><span class=\"pl-v\">label</span> <span class=\"pl-k\">=</span> as.numeric(sapply(<span class=\"pl-smi\">Tags</span>,<span class=\"pl-k\">function</span>(<span class=\"pl-smi\">x</span>){<span class=\"pl-k\">return</span>(grep(<span class=\"pl-s\"><span class=\"pl-pds\">\"</span>python<span class=\"pl-pds\">\"</span></span>,<span class=\"pl-smi\">x</span>))}))</td>\n      </tr>\n      <tr>\n        <td id=\"file-parsingxml-r-L23\" 
class=\"blob-num js-line-number\" data-line-number=\"23\"></td>\n        <td id=\"file-parsingxml-r-LC23\" class=\"blob-code blob-code-inner js-file-line\"><span class=\"pl-smi\">label</span>[is.na(<span class=\"pl-smi\">label</span>)]<span class=\"pl-k\">=</span><span class=\"pl-c1\">0</span></td>\n      </tr>\n      <tr>\n        <td id=\"file-parsingxml-r-L24\" class=\"blob-num js-line-number\" data-line-number=\"24\"></td>\n        <td id=\"file-parsingxml-r-LC24\" class=\"blob-code blob-code-inner js-file-line\">\n</td>\n      </tr>\n      <tr>\n        <td id=\"file-parsingxml-r-L25\" class=\"blob-num js-line-number\" data-line-number=\"25\"></td>\n        <td id=\"file-parsingxml-r-LC25\" class=\"blob-code blob-code-inner js-file-line\"><span class=\"pl-c\"><span class=\"pl-c\">#</span>final data frame for export</span></td>\n      </tr>\n      <tr>\n        <td id=\"file-parsingxml-r-L26\" class=\"blob-num js-line-number\" data-line-number=\"26\"></td>\n        <td id=\"file-parsingxml-r-LC26\" class=\"blob-code blob-code-inner js-file-line\"><span class=\"pl-smi\">df</span> <span class=\"pl-k\">&#x3C;-</span> as.data.frame(cbind(<span class=\"pl-smi\">ID</span>,<span class=\"pl-smi\">label</span>,<span class=\"pl-smi\">text</span>),<span class=\"pl-v\">stringsAsFactors</span> <span class=\"pl-k\">=</span> <span class=\"pl-c1\">FALSE</span>)</td>\n      </tr>\n      <tr>\n        <td id=\"file-parsingxml-r-L27\" class=\"blob-num js-line-number\" data-line-number=\"27\"></td>\n        <td id=\"file-parsingxml-r-LC27\" class=\"blob-code blob-code-inner js-file-line\"><span class=\"pl-smi\">df</span><span class=\"pl-k\">$</span><span class=\"pl-v\">ID</span><span class=\"pl-k\">=</span>as.numeric(<span class=\"pl-smi\">df</span><span class=\"pl-k\">$</span><span class=\"pl-smi\">ID</span>)</td>\n      </tr>\n      <tr>\n        <td id=\"file-parsingxml-r-L28\" class=\"blob-num js-line-number\" data-line-number=\"28\"></td>\n        <td 
id=\"file-parsingxml-r-LC28\" class=\"blob-code blob-code-inner js-file-line\"><span class=\"pl-smi\">df</span><span class=\"pl-k\">$</span><span class=\"pl-v\">label</span><span class=\"pl-k\">=</span>as.numeric(<span class=\"pl-smi\">df</span><span class=\"pl-k\">$</span><span class=\"pl-smi\">label</span>)</td>\n      </tr>\n      <tr>\n        <td id=\"file-parsingxml-r-L29\" class=\"blob-num js-line-number\" data-line-number=\"29\"></td>\n        <td id=\"file-parsingxml-r-LC29\" class=\"blob-code blob-code-inner js-file-line\"><span class=\"pl-c\"><span class=\"pl-c\">#</span>write to csv</span></td>\n      </tr>\n      <tr>\n        <td id=\"file-parsingxml-r-L30\" class=\"blob-num js-line-number\" data-line-number=\"30\"></td>\n        <td id=\"file-parsingxml-r-LC30\" class=\"blob-code blob-code-inner js-file-line\">write.csv(<span class=\"pl-smi\">df</span>, <span class=\"pl-v\">file</span><span class=\"pl-k\">=</span><span class=\"pl-smi\">args</span>[<span class=\"pl-c1\">2</span>],<span class=\"pl-v\">row.names</span><span class=\"pl-k\">=</span><span class=\"pl-c1\">FALSE</span>)</td>\n      </tr>\n      <tr>\n        <td id=\"file-parsingxml-r-L31\" class=\"blob-num js-line-number\" data-line-number=\"31\"></td>\n        <td id=\"file-parsingxml-r-LC31\" class=\"blob-code blob-code-inner js-file-line\">print(<span class=\"pl-s\"><span class=\"pl-pds\">\"</span>output file created....<span class=\"pl-pds\">\"</span></span>)</td>\n      </tr>\n</tbody></table>\n\n\n  </div>\n\n  </div>\n</div>\n\n      </div>\n      <div class=\"gist-meta\">\n        <a href=\"https://gist.github.com/Zoldin/47536af63182a0e8daf37a7b989e2e8d/raw/98b259ade11132ad87e9c4f476b7561b184cf041/parsingxml.R\" style=\"float:right\">view raw</a>\n        <a href=\"https://gist.github.com/Zoldin/47536af63182a0e8daf37a7b989e2e8d#file-parsingxml-r\">parsingxml.R</a>\n        hosted with ❤ by <a href=\"https://github.com\">GitHub</a>\n      </div>\n    
</div>\n</div></body></html></p>\n<p><strong>train_test_splitting.R</strong>: stratified sampling by the target variable (here we\ncreate the train and test data sets).</p>\n<p><html><head></head><body><div id=\"gist71114469\" class=\"gist\">\n    <div class=\"gist-file\">\n      <div class=\"gist-data\">\n        <div class=\"js-gist-file-update-container js-task-list-container file-box\">\n  <div id=\"file-train_test_splitting-r\" class=\"file\">\n    \n\n  <div itemprop=\"text\" class=\"Box-body p-0 blob-wrapper data type-r\">\n      \n<table class=\"highlight tab-size js-file-line-container\" data-tab-size=\"8\">\n      <tbody><tr>\n        <td id=\"file-train_test_splitting-r-L1\" class=\"blob-num js-line-number\" data-line-number=\"1\"></td>\n        <td id=\"file-train_test_splitting-r-LC1\" class=\"blob-code blob-code-inner js-file-line\"><span class=\"pl-c\"><span class=\"pl-c\">#</span>!/usr/bin/Rscript</span></td>\n      </tr>\n      <tr>\n        <td id=\"file-train_test_splitting-r-L2\" class=\"blob-num js-line-number\" data-line-number=\"2\"></td>\n        <td id=\"file-train_test_splitting-r-LC2\" class=\"blob-code blob-code-inner js-file-line\">library(<span class=\"pl-smi\">caret</span>)</td>\n      </tr>\n      <tr>\n        <td id=\"file-train_test_splitting-r-L3\" class=\"blob-num js-line-number\" data-line-number=\"3\"></td>\n        <td id=\"file-train_test_splitting-r-LC3\" class=\"blob-code blob-code-inner js-file-line\">\n</td>\n      </tr>\n      <tr>\n        <td id=\"file-train_test_splitting-r-L4\" class=\"blob-num js-line-number\" data-line-number=\"4\"></td>\n        <td id=\"file-train_test_splitting-r-LC4\" class=\"blob-code blob-code-inner js-file-line\"><span class=\"pl-v\">args</span> <span class=\"pl-k\">=</span> commandArgs(<span class=\"pl-v\">trailingOnly</span><span class=\"pl-k\">=</span><span class=\"pl-c1\">TRUE</span>)</td>\n      </tr>\n      <tr>\n        <td id=\"file-train_test_splitting-r-L5\" class=\"blob-num 
js-line-number\" data-line-number=\"5\"></td>\n        <td id=\"file-train_test_splitting-r-LC5\" class=\"blob-code blob-code-inner js-file-line\">\n</td>\n      </tr>\n      <tr>\n        <td id=\"file-train_test_splitting-r-L6\" class=\"blob-num js-line-number\" data-line-number=\"6\"></td>\n        <td id=\"file-train_test_splitting-r-LC6\" class=\"blob-code blob-code-inner js-file-line\"><span class=\"pl-k\">if</span> (<span class=\"pl-k\">!</span>length(<span class=\"pl-smi\">args</span>)<span class=\"pl-k\">==</span><span class=\"pl-c1\">5</span>) {</td>\n      </tr>\n      <tr>\n        <td id=\"file-train_test_splitting-r-L7\" class=\"blob-num js-line-number\" data-line-number=\"7\"></td>\n        <td id=\"file-train_test_splitting-r-LC7\" class=\"blob-code blob-code-inner js-file-line\">  stop(<span class=\"pl-s\"><span class=\"pl-pds\">\"</span>Five arguments must be supplied (input file name, splitting ratio related to test data set, seed, train output file name, test output file name).n<span class=\"pl-pds\">\"</span></span>, <span class=\"pl-v\">call.</span><span class=\"pl-k\">=</span><span class=\"pl-c1\">FALSE</span>)</td>\n      </tr>\n      <tr>\n        <td id=\"file-train_test_splitting-r-L8\" class=\"blob-num js-line-number\" data-line-number=\"8\"></td>\n        <td id=\"file-train_test_splitting-r-LC8\" class=\"blob-code blob-code-inner js-file-line\">} </td>\n      </tr>\n      <tr>\n        <td id=\"file-train_test_splitting-r-L9\" class=\"blob-num js-line-number\" data-line-number=\"9\"></td>\n        <td id=\"file-train_test_splitting-r-LC9\" class=\"blob-code blob-code-inner js-file-line\">\n</td>\n      </tr>\n      <tr>\n        <td id=\"file-train_test_splitting-r-L10\" class=\"blob-num js-line-number\" data-line-number=\"10\"></td>\n        <td id=\"file-train_test_splitting-r-LC10\" class=\"blob-code blob-code-inner js-file-line\">\n</td>\n      </tr>\n      <tr>\n        <td id=\"file-train_test_splitting-r-L11\" class=\"blob-num 
js-line-number\" data-line-number=\"11\"></td>\n        <td id=\"file-train_test_splitting-r-LC11\" class=\"blob-code blob-code-inner js-file-line\">set.seed(as.numeric(<span class=\"pl-smi\">args</span>[<span class=\"pl-c1\">3</span>]))</td>\n      </tr>\n      <tr>\n        <td id=\"file-train_test_splitting-r-L12\" class=\"blob-num js-line-number\" data-line-number=\"12\"></td>\n        <td id=\"file-train_test_splitting-r-LC12\" class=\"blob-code blob-code-inner js-file-line\">\n</td>\n      </tr>\n      <tr>\n        <td id=\"file-train_test_splitting-r-L13\" class=\"blob-num js-line-number\" data-line-number=\"13\"></td>\n        <td id=\"file-train_test_splitting-r-LC13\" class=\"blob-code blob-code-inner js-file-line\"><span class=\"pl-smi\">df</span> <span class=\"pl-k\">&#x3C;-</span> read.csv(<span class=\"pl-smi\">args</span>[<span class=\"pl-c1\">1</span>],<span class=\"pl-v\">stringsAsFactors</span> <span class=\"pl-k\">=</span> <span class=\"pl-c1\">FALSE</span>)</td>\n      </tr>\n      <tr>\n        <td id=\"file-train_test_splitting-r-L14\" class=\"blob-num js-line-number\" data-line-number=\"14\"></td>\n        <td id=\"file-train_test_splitting-r-LC14\" class=\"blob-code blob-code-inner js-file-line\">\n</td>\n      </tr>\n      <tr>\n        <td id=\"file-train_test_splitting-r-L15\" class=\"blob-num js-line-number\" data-line-number=\"15\"></td>\n        <td id=\"file-train_test_splitting-r-LC15\" class=\"blob-code blob-code-inner js-file-line\"><span class=\"pl-smi\">test.index</span> <span class=\"pl-k\">&#x3C;-</span> createDataPartition(<span class=\"pl-smi\">df</span><span class=\"pl-k\">$</span><span class=\"pl-smi\">label</span>, <span class=\"pl-v\">p</span> <span class=\"pl-k\">=</span> as.numeric(<span class=\"pl-smi\">args</span>[<span class=\"pl-c1\">2</span>]), <span class=\"pl-v\">list</span> <span class=\"pl-k\">=</span> <span class=\"pl-c1\">FALSE</span>)</td>\n      </tr>\n      <tr>\n        <td 
id=\"file-train_test_splitting-r-L16\" class=\"blob-num js-line-number\" data-line-number=\"16\"></td>\n        <td id=\"file-train_test_splitting-r-LC16\" class=\"blob-code blob-code-inner js-file-line\">\n</td>\n      </tr>\n      <tr>\n        <td id=\"file-train_test_splitting-r-L17\" class=\"blob-num js-line-number\" data-line-number=\"17\"></td>\n        <td id=\"file-train_test_splitting-r-LC17\" class=\"blob-code blob-code-inner js-file-line\">\n</td>\n      </tr>\n      <tr>\n        <td id=\"file-train_test_splitting-r-L18\" class=\"blob-num js-line-number\" data-line-number=\"18\"></td>\n        <td id=\"file-train_test_splitting-r-LC18\" class=\"blob-code blob-code-inner js-file-line\"><span class=\"pl-smi\">train</span> <span class=\"pl-k\">&#x3C;-</span> <span class=\"pl-smi\">df</span>[<span class=\"pl-k\">-</span><span class=\"pl-smi\">test.index</span>,]</td>\n      </tr>\n      <tr>\n        <td id=\"file-train_test_splitting-r-L19\" class=\"blob-num js-line-number\" data-line-number=\"19\"></td>\n        <td id=\"file-train_test_splitting-r-LC19\" class=\"blob-code blob-code-inner js-file-line\"><span class=\"pl-smi\">test</span>  <span class=\"pl-k\">&#x3C;-</span> <span class=\"pl-smi\">df</span>[<span class=\"pl-smi\">test.index</span>,]</td>\n      </tr>\n      <tr>\n        <td id=\"file-train_test_splitting-r-L20\" class=\"blob-num js-line-number\" data-line-number=\"20\"></td>\n        <td id=\"file-train_test_splitting-r-LC20\" class=\"blob-code blob-code-inner js-file-line\">\n</td>\n      </tr>\n      <tr>\n        <td id=\"file-train_test_splitting-r-L21\" class=\"blob-num js-line-number\" data-line-number=\"21\"></td>\n        <td id=\"file-train_test_splitting-r-LC21\" class=\"blob-code blob-code-inner js-file-line\">\n</td>\n      </tr>\n      <tr>\n        <td id=\"file-train_test_splitting-r-L22\" class=\"blob-num js-line-number\" data-line-number=\"22\"></td>\n        <td id=\"file-train_test_splitting-r-LC22\" class=\"blob-code 
blob-code-inner js-file-line\">write.csv(<span class=\"pl-smi\">train</span>, <span class=\"pl-v\">file</span><span class=\"pl-k\">=</span><span class=\"pl-smi\">args</span>[<span class=\"pl-c1\">4</span>],<span class=\"pl-v\">row.names</span><span class=\"pl-k\">=</span><span class=\"pl-c1\">FALSE</span>)</td>\n      </tr>\n      <tr>\n        <td id=\"file-train_test_splitting-r-L23\" class=\"blob-num js-line-number\" data-line-number=\"23\"></td>\n        <td id=\"file-train_test_splitting-r-LC23\" class=\"blob-code blob-code-inner js-file-line\">write.csv(<span class=\"pl-smi\">test</span>, <span class=\"pl-v\">file</span><span class=\"pl-k\">=</span><span class=\"pl-smi\">args</span>[<span class=\"pl-c1\">5</span>],<span class=\"pl-v\">row.names</span><span class=\"pl-k\">=</span><span class=\"pl-c1\">FALSE</span>)</td>\n      </tr>\n      <tr>\n        <td id=\"file-train_test_splitting-r-L24\" class=\"blob-num js-line-number\" data-line-number=\"24\"></td>\n        <td id=\"file-train_test_splitting-r-LC24\" class=\"blob-code blob-code-inner js-file-line\">print(<span class=\"pl-s\"><span class=\"pl-pds\">\"</span>train/test files created....<span class=\"pl-pds\">\"</span></span>)</td>\n      </tr>\n</tbody></table>\n\n\n  </div>\n\n  </div>\n</div>\n\n      </div>\n      <div class=\"gist-meta\">\n        <a href=\"https://gist.github.com/Zoldin/7591c47ce5988cbe087e0038c9a850b9/raw/e2106c39bad8a4ae04e41658bd287ea94ff7437a/train_test_splitting.R\" style=\"float:right\">view raw</a>\n        <a href=\"https://gist.github.com/Zoldin/7591c47ce5988cbe087e0038c9a850b9#file-train_test_splitting-r\">train_test_splitting.R</a>\n        hosted with ❤ by <a href=\"https://github.com\">GitHub</a>\n      </div>\n    </div>\n</div></body></html></p>\n<p><strong>featurization.R</strong> — text mining and tf-idf matrix creation. 
In this part we\nare creating predictive variables.</p>\n<p><html><head></head><body><div id=\"gist71113907\" class=\"gist\">\n    <div class=\"gist-file\">\n      <div class=\"gist-data\">\n        <div class=\"js-gist-file-update-container js-task-list-container file-box\">\n  <div id=\"file-featurization-r\" class=\"file\">\n    \n\n  <div itemprop=\"text\" class=\"Box-body p-0 blob-wrapper data type-r\">\n      \n<table class=\"highlight tab-size js-file-line-container\" data-tab-size=\"8\">\n      <tbody><tr>\n        <td id=\"file-featurization-r-L1\" class=\"blob-num js-line-number\" data-line-number=\"1\"></td>\n        <td id=\"file-featurization-r-LC1\" class=\"blob-code blob-code-inner js-file-line\"><span class=\"pl-c\"><span class=\"pl-c\">#</span>!/usr/bin/Rscript</span></td>\n      </tr>\n      <tr>\n        <td id=\"file-featurization-r-L2\" class=\"blob-num js-line-number\" data-line-number=\"2\"></td>\n        <td id=\"file-featurization-r-LC2\" class=\"blob-code blob-code-inner js-file-line\">library(<span class=\"pl-smi\">text2vec</span>)</td>\n      </tr>\n      <tr>\n        <td id=\"file-featurization-r-L3\" class=\"blob-num js-line-number\" data-line-number=\"3\"></td>\n        <td id=\"file-featurization-r-LC3\" class=\"blob-code blob-code-inner js-file-line\">library(<span class=\"pl-smi\">MASS</span>)</td>\n      </tr>\n      <tr>\n        <td id=\"file-featurization-r-L4\" class=\"blob-num js-line-number\" data-line-number=\"4\"></td>\n        <td id=\"file-featurization-r-LC4\" class=\"blob-code blob-code-inner js-file-line\">library(<span class=\"pl-smi\">Matrix</span>)</td>\n      </tr>\n      <tr>\n        <td id=\"file-featurization-r-L5\" class=\"blob-num js-line-number\" data-line-number=\"5\"></td>\n        <td id=\"file-featurization-r-LC5\" class=\"blob-code blob-code-inner js-file-line\">\n</td>\n      </tr>\n      <tr>\n        <td id=\"file-featurization-r-L6\" class=\"blob-num js-line-number\" data-line-number=\"6\"></td>\n 
       <td id=\"file-featurization-r-LC6\" class=\"blob-code blob-code-inner js-file-line\"><span class=\"pl-v\">args</span> <span class=\"pl-k\">=</span> commandArgs(<span class=\"pl-v\">trailingOnly</span><span class=\"pl-k\">=</span><span class=\"pl-c1\">TRUE</span>)</td>\n      </tr>\n      <tr>\n        <td id=\"file-featurization-r-L7\" class=\"blob-num js-line-number\" data-line-number=\"7\"></td>\n        <td id=\"file-featurization-r-LC7\" class=\"blob-code blob-code-inner js-file-line\">\n</td>\n      </tr>\n      <tr>\n        <td id=\"file-featurization-r-L8\" class=\"blob-num js-line-number\" data-line-number=\"8\"></td>\n        <td id=\"file-featurization-r-LC8\" class=\"blob-code blob-code-inner js-file-line\"><span class=\"pl-k\">if</span> (<span class=\"pl-k\">!</span>length(<span class=\"pl-smi\">args</span>)<span class=\"pl-k\">==</span><span class=\"pl-c1\">4</span>) {</td>\n      </tr>\n      <tr>\n        <td id=\"file-featurization-r-L9\" class=\"blob-num js-line-number\" data-line-number=\"9\"></td>\n        <td id=\"file-featurization-r-LC9\" class=\"blob-code blob-code-inner js-file-line\">  stop(<span class=\"pl-s\"><span class=\"pl-pds\">\"</span>Four arguments must be supplied ( train file (csv format) ,test data set (csv format), train output file name and test output file name - txt files ).n<span class=\"pl-pds\">\"</span></span>, <span class=\"pl-v\">call.</span><span class=\"pl-k\">=</span><span class=\"pl-c1\">FALSE</span>)</td>\n      </tr>\n      <tr>\n        <td id=\"file-featurization-r-L10\" class=\"blob-num js-line-number\" data-line-number=\"10\"></td>\n        <td id=\"file-featurization-r-LC10\" class=\"blob-code blob-code-inner js-file-line\">} </td>\n      </tr>\n      <tr>\n        <td id=\"file-featurization-r-L11\" class=\"blob-num js-line-number\" data-line-number=\"11\"></td>\n        <td id=\"file-featurization-r-LC11\" class=\"blob-code blob-code-inner js-file-line\">\n</td>\n      </tr>\n      <tr>\n        
<td id=\"file-featurization-r-L12\" class=\"blob-num js-line-number\" data-line-number=\"12\"></td>\n        <td id=\"file-featurization-r-LC12\" class=\"blob-code blob-code-inner js-file-line\"><span class=\"pl-c\"><span class=\"pl-c\">#</span>read input files</span></td>\n      </tr>\n      <tr>\n        <td id=\"file-featurization-r-L13\" class=\"blob-num js-line-number\" data-line-number=\"13\"></td>\n        <td id=\"file-featurization-r-LC13\" class=\"blob-code blob-code-inner js-file-line\"><span class=\"pl-v\">df_train</span> <span class=\"pl-k\">=</span> read.csv(<span class=\"pl-smi\">args</span>[<span class=\"pl-c1\">1</span>],<span class=\"pl-v\">stringsAsFactors</span> <span class=\"pl-k\">=</span> <span class=\"pl-c1\">FALSE</span>)</td>\n      </tr>\n      <tr>\n        <td id=\"file-featurization-r-L14\" class=\"blob-num js-line-number\" data-line-number=\"14\"></td>\n        <td id=\"file-featurization-r-LC14\" class=\"blob-code blob-code-inner js-file-line\"><span class=\"pl-v\">df_test</span> <span class=\"pl-k\">=</span> read.csv(<span class=\"pl-smi\">args</span>[<span class=\"pl-c1\">2</span>],<span class=\"pl-v\">stringsAsFactors</span> <span class=\"pl-k\">=</span> <span class=\"pl-c1\">FALSE</span>)</td>\n      </tr>\n      <tr>\n        <td id=\"file-featurization-r-L15\" class=\"blob-num js-line-number\" data-line-number=\"15\"></td>\n        <td id=\"file-featurization-r-LC15\" class=\"blob-code blob-code-inner js-file-line\">\n</td>\n      </tr>\n      <tr>\n        <td id=\"file-featurization-r-L16\" class=\"blob-num js-line-number\" data-line-number=\"16\"></td>\n        <td id=\"file-featurization-r-LC16\" class=\"blob-code blob-code-inner js-file-line\"><span class=\"pl-c\"><span class=\"pl-c\">#</span>create vocabulary - words</span></td>\n      </tr>\n      <tr>\n        <td id=\"file-featurization-r-L17\" class=\"blob-num js-line-number\" data-line-number=\"17\"></td>\n        <td id=\"file-featurization-r-LC17\" 
class=\"blob-code blob-code-inner js-file-line\"><span class=\"pl-v\">prep_fun</span> <span class=\"pl-k\">=</span> <span class=\"pl-smi\">tolower</span></td>\n      </tr>\n      <tr>\n        <td id=\"file-featurization-r-L18\" class=\"blob-num js-line-number\" data-line-number=\"18\"></td>\n        <td id=\"file-featurization-r-LC18\" class=\"blob-code blob-code-inner js-file-line\"><span class=\"pl-v\">tok_fun</span> <span class=\"pl-k\">=</span> <span class=\"pl-smi\">word_tokenizer</span></td>\n      </tr>\n      <tr>\n        <td id=\"file-featurization-r-L19\" class=\"blob-num js-line-number\" data-line-number=\"19\"></td>\n        <td id=\"file-featurization-r-LC19\" class=\"blob-code blob-code-inner js-file-line\">\n</td>\n      </tr>\n      <tr>\n        <td id=\"file-featurization-r-L20\" class=\"blob-num js-line-number\" data-line-number=\"20\"></td>\n        <td id=\"file-featurization-r-LC20\" class=\"blob-code blob-code-inner js-file-line\"><span class=\"pl-v\">it_train</span> <span class=\"pl-k\">=</span> itoken(<span class=\"pl-smi\">df_train</span><span class=\"pl-k\">$</span><span class=\"pl-smi\">text</span>,  <span class=\"pl-v\">preprocessor</span> <span class=\"pl-k\">=</span> <span class=\"pl-smi\">prep_fun</span>,  <span class=\"pl-v\">tokenizer</span> <span class=\"pl-k\">=</span> <span class=\"pl-smi\">tok_fun</span>,  <span class=\"pl-v\">ids</span> <span class=\"pl-k\">=</span> <span class=\"pl-smi\">df_train</span><span class=\"pl-k\">$</span><span class=\"pl-smi\">ID</span>, <span class=\"pl-v\">progressbar</span> <span class=\"pl-k\">=</span> <span class=\"pl-c1\">FALSE</span>)</td>\n      </tr>\n      <tr>\n        <td id=\"file-featurization-r-L21\" class=\"blob-num js-line-number\" data-line-number=\"21\"></td>\n        <td id=\"file-featurization-r-LC21\" class=\"blob-code blob-code-inner js-file-line\"><span class=\"pl-v\">vocab</span> <span class=\"pl-k\">=</span> create_vocabulary(<span class=\"pl-smi\">it_train</span>,<span 
class=\"pl-v\">stopwords</span> <span class=\"pl-k\">=</span> <span class=\"pl-smi\">stop_words</span>)</td>\n      </tr>\n      <tr>\n        <td id=\"file-featurization-r-L22\" class=\"blob-num js-line-number\" data-line-number=\"22\"></td>\n        <td id=\"file-featurization-r-LC22\" class=\"blob-code blob-code-inner js-file-line\">\n</td>\n      </tr>\n      <tr>\n        <td id=\"file-featurization-r-L23\" class=\"blob-num js-line-number\" data-line-number=\"23\"></td>\n        <td id=\"file-featurization-r-LC23\" class=\"blob-code blob-code-inner js-file-line\"><span class=\"pl-c\"><span class=\"pl-c\">#</span>prune vocabulary - keep at most 5000 terms</span></td>\n      </tr>\n      <tr>\n        <td id=\"file-featurization-r-L24\" class=\"blob-num js-line-number\" data-line-number=\"24\"></td>\n        <td id=\"file-featurization-r-LC24\" class=\"blob-code blob-code-inner js-file-line\"><span class=\"pl-smi\">pruned_vocab</span> <span class=\"pl-k\">&#x3C;-</span> prune_vocabulary(<span class=\"pl-smi\">vocab</span>, <span class=\"pl-v\">max_number_of_terms</span><span class=\"pl-k\">=</span><span class=\"pl-c1\">5000</span>)</td>\n      </tr>\n      <tr>\n        <td id=\"file-featurization-r-L25\" class=\"blob-num js-line-number\" data-line-number=\"25\"></td>\n        <td id=\"file-featurization-r-LC25\" class=\"blob-code blob-code-inner js-file-line\">\n</td>\n      </tr>\n      <tr>\n        <td id=\"file-featurization-r-L26\" class=\"blob-num js-line-number\" data-line-number=\"26\"></td>\n        <td id=\"file-featurization-r-LC26\" class=\"blob-code blob-code-inner js-file-line\"><span class=\"pl-v\">vectorizer</span> <span class=\"pl-k\">=</span> vocab_vectorizer(<span class=\"pl-smi\">pruned_vocab</span>)</td>\n      </tr>\n      <tr>\n        <td id=\"file-featurization-r-L27\" class=\"blob-num js-line-number\" data-line-number=\"27\"></td>\n        <td id=\"file-featurization-r-LC27\" class=\"blob-code blob-code-inner js-file-line\"><span 
class=\"pl-v\">dtm_train</span> <span class=\"pl-k\">=</span> create_dtm(<span class=\"pl-smi\">it_train</span>, <span class=\"pl-smi\">vectorizer</span>)</td>\n      </tr>\n      <tr>\n        <td id=\"file-featurization-r-L28\" class=\"blob-num js-line-number\" data-line-number=\"28\"></td>\n        <td id=\"file-featurization-r-LC28\" class=\"blob-code blob-code-inner js-file-line\">\n</td>\n      </tr>\n      <tr>\n        <td id=\"file-featurization-r-L29\" class=\"blob-num js-line-number\" data-line-number=\"29\"></td>\n        <td id=\"file-featurization-r-LC29\" class=\"blob-code blob-code-inner js-file-line\"><span class=\"pl-c\"><span class=\"pl-c\">#</span>create tf-idf for train data set</span></td>\n      </tr>\n      <tr>\n        <td id=\"file-featurization-r-L30\" class=\"blob-num js-line-number\" data-line-number=\"30\"></td>\n        <td id=\"file-featurization-r-LC30\" class=\"blob-code blob-code-inner js-file-line\"><span class=\"pl-v\">tfidf</span> <span class=\"pl-k\">=</span> <span class=\"pl-smi\">TfIdf</span><span class=\"pl-k\">$</span>new()</td>\n      </tr>\n      <tr>\n        <td id=\"file-featurization-r-L31\" class=\"blob-num js-line-number\" data-line-number=\"31\"></td>\n        <td id=\"file-featurization-r-LC31\" class=\"blob-code blob-code-inner js-file-line\"><span class=\"pl-v\">dtm_train_tfidf</span> <span class=\"pl-k\">=</span> fit_transform(<span class=\"pl-smi\">dtm_train</span>, <span class=\"pl-smi\">tfidf</span>)</td>\n      </tr>\n      <tr>\n        <td id=\"file-featurization-r-L32\" class=\"blob-num js-line-number\" data-line-number=\"32\"></td>\n        <td id=\"file-featurization-r-LC32\" class=\"blob-code blob-code-inner js-file-line\">\n</td>\n      </tr>\n      <tr>\n        <td id=\"file-featurization-r-L33\" class=\"blob-num js-line-number\" data-line-number=\"33\"></td>\n        <td id=\"file-featurization-r-LC33\" class=\"blob-code blob-code-inner js-file-line\"><span class=\"pl-c\"><span 
class=\"pl-c\">#</span>create test tf-idf - use the vocabulary that was built on the train set</span></td>\n      </tr>\n      <tr>\n        <td id=\"file-featurization-r-L34\" class=\"blob-num js-line-number\" data-line-number=\"34\"></td>\n        <td id=\"file-featurization-r-LC34\" class=\"blob-code blob-code-inner js-file-line\"><span class=\"pl-v\">it_test</span> <span class=\"pl-k\">=</span> itoken(<span class=\"pl-smi\">df_test</span><span class=\"pl-k\">$</span><span class=\"pl-smi\">text</span>,  <span class=\"pl-v\">preprocessor</span> <span class=\"pl-k\">=</span> <span class=\"pl-smi\">prep_fun</span>,  <span class=\"pl-v\">tokenizer</span> <span class=\"pl-k\">=</span> <span class=\"pl-smi\">tok_fun</span>,  <span class=\"pl-v\">ids</span> <span class=\"pl-k\">=</span> <span class=\"pl-smi\">df_test</span><span class=\"pl-k\">$</span><span class=\"pl-smi\">ID</span>, <span class=\"pl-v\">progressbar</span> <span class=\"pl-k\">=</span> <span class=\"pl-c1\">FALSE</span>)</td>\n      </tr>\n      <tr>\n        <td id=\"file-featurization-r-L35\" class=\"blob-num js-line-number\" data-line-number=\"35\"></td>\n        <td id=\"file-featurization-r-LC35\" class=\"blob-code blob-code-inner js-file-line\"><span class=\"pl-v\">dtm_test_tfidf</span>  <span class=\"pl-k\">=</span> create_dtm(<span class=\"pl-smi\">it_test</span>, <span class=\"pl-smi\">vectorizer</span>) %<span class=\"pl-k\">></span>% </td>\n      </tr>\n      <tr>\n        <td id=\"file-featurization-r-L36\" class=\"blob-num js-line-number\" data-line-number=\"36\"></td>\n        <td id=\"file-featurization-r-LC36\" class=\"blob-code blob-code-inner js-file-line\">  transform(<span class=\"pl-smi\">tfidf</span>)</td>\n      </tr>\n      <tr>\n        <td id=\"file-featurization-r-L37\" class=\"blob-num js-line-number\" data-line-number=\"37\"></td>\n        <td id=\"file-featurization-r-LC37\" class=\"blob-code blob-code-inner js-file-line\"><span class=\"pl-c\"><span class=\"pl-c\">#</span>add the label as 
the first column of both matrices</span></td>\n      </tr>\n      <tr>\n        <td id=\"file-featurization-r-L38\" class=\"blob-num js-line-number\" data-line-number=\"38\"></td>\n        <td id=\"file-featurization-r-LC38\" class=\"blob-code blob-code-inner js-file-line\"><span class=\"pl-smi\">dtm_train_tfidf</span><span class=\"pl-k\">&#x3C;-</span> Matrix(cbind(<span class=\"pl-v\">label</span><span class=\"pl-k\">=</span><span class=\"pl-smi\">df_train</span><span class=\"pl-k\">$</span><span class=\"pl-smi\">label</span>,<span class=\"pl-smi\">dtm_train_tfidf</span>),<span class=\"pl-v\">sparse</span> <span class=\"pl-k\">=</span> <span class=\"pl-c1\">TRUE</span>)</td>\n      </tr>\n      <tr>\n        <td id=\"file-featurization-r-L39\" class=\"blob-num js-line-number\" data-line-number=\"39\"></td>\n        <td id=\"file-featurization-r-LC39\" class=\"blob-code blob-code-inner js-file-line\"><span class=\"pl-smi\">dtm_test_tfidf</span><span class=\"pl-k\">&#x3C;-</span> Matrix(cbind(<span class=\"pl-v\">label</span><span class=\"pl-k\">=</span><span class=\"pl-smi\">df_test</span><span class=\"pl-k\">$</span><span class=\"pl-smi\">label</span>,<span class=\"pl-smi\">dtm_test_tfidf</span>),<span class=\"pl-v\">sparse</span> <span class=\"pl-k\">=</span> <span class=\"pl-c1\">TRUE</span>)</td>\n      </tr>\n      <tr>\n        <td id=\"file-featurization-r-L40\" class=\"blob-num js-line-number\" data-line-number=\"40\"></td>\n        <td id=\"file-featurization-r-LC40\" class=\"blob-code blob-code-inner js-file-line\">\n</td>\n      </tr>\n      <tr>\n        <td id=\"file-featurization-r-L41\" class=\"blob-num js-line-number\" data-line-number=\"41\"></td>\n        <td id=\"file-featurization-r-LC41\" class=\"blob-code blob-code-inner js-file-line\"><span class=\"pl-c\"><span class=\"pl-c\">#</span> write output -  tf-idf matrices</span></td>\n      </tr>\n      <tr>\n        <td id=\"file-featurization-r-L42\" class=\"blob-num js-line-number\" 
data-line-number=\"42\"></td>\n        <td id=\"file-featurization-r-LC42\" class=\"blob-code blob-code-inner js-file-line\">writeMM(<span class=\"pl-smi\">dtm_train_tfidf</span>,<span class=\"pl-smi\">args</span>[<span class=\"pl-c1\">3</span>])</td>\n      </tr>\n      <tr>\n        <td id=\"file-featurization-r-L43\" class=\"blob-num js-line-number\" data-line-number=\"43\"></td>\n        <td id=\"file-featurization-r-LC43\" class=\"blob-code blob-code-inner js-file-line\">writeMM(<span class=\"pl-smi\">dtm_test_tfidf</span>,<span class=\"pl-smi\">args</span>[<span class=\"pl-c1\">4</span>])</td>\n      </tr>\n      <tr>\n        <td id=\"file-featurization-r-L44\" class=\"blob-num js-line-number\" data-line-number=\"44\"></td>\n        <td id=\"file-featurization-r-LC44\" class=\"blob-code blob-code-inner js-file-line\">\n</td>\n      </tr>\n      <tr>\n        <td id=\"file-featurization-r-L45\" class=\"blob-num js-line-number\" data-line-number=\"45\"></td>\n        <td id=\"file-featurization-r-LC45\" class=\"blob-code blob-code-inner js-file-line\">print(<span class=\"pl-s\"><span class=\"pl-pds\">\"</span>Two matrices were created - one for train and one for test data set<span class=\"pl-pds\">\"</span></span>)</td>\n      </tr>\n</tbody></table>\n\n\n  </div>\n\n  </div>\n</div>\n\n      </div>\n      <div class=\"gist-meta\">\n        <a href=\"https://gist.github.com/Zoldin/9e79c047fd8ad7aa6596b0682aca83c6/raw/2787bc21fa8b2591ca09102f38f544eb5d6cf032/featurization.R\" style=\"float:right\">view raw</a>\n        <a href=\"https://gist.github.com/Zoldin/9e79c047fd8ad7aa6596b0682aca83c6#file-featurization-r\">featurization.R</a>\n        hosted with ❤ by <a href=\"https://github.com\">GitHub</a>\n      </div>\n    </div>\n</div></body></html></p>\n<p><strong>train_model.R</strong> — using the created variables, we fit a logistic regression with an L1\n(LASSO) penalty.</p>\n<p><html><head></head><body><div id=\"gist71114340\" class=\"gist\">\n    <div class=\"gist-file\">\n  
    <div class=\"gist-data\">\n        <div class=\"js-gist-file-update-container js-task-list-container file-box\">\n  <div id=\"file-train_model-r\" class=\"file\">\n    \n\n  <div itemprop=\"text\" class=\"Box-body p-0 blob-wrapper data type-r\">\n      \n<table class=\"highlight tab-size js-file-line-container\" data-tab-size=\"8\">\n      <tbody><tr>\n        <td id=\"file-train_model-r-L1\" class=\"blob-num js-line-number\" data-line-number=\"1\"></td>\n        <td id=\"file-train_model-r-LC1\" class=\"blob-code blob-code-inner js-file-line\"><span class=\"pl-c\"><span class=\"pl-c\">#</span>!/usr/bin/Rscript</span></td>\n      </tr>\n      <tr>\n        <td id=\"file-train_model-r-L2\" class=\"blob-num js-line-number\" data-line-number=\"2\"></td>\n        <td id=\"file-train_model-r-LC2\" class=\"blob-code blob-code-inner js-file-line\">library(<span class=\"pl-smi\">Matrix</span>)</td>\n      </tr>\n      <tr>\n        <td id=\"file-train_model-r-L3\" class=\"blob-num js-line-number\" data-line-number=\"3\"></td>\n        <td id=\"file-train_model-r-LC3\" class=\"blob-code blob-code-inner js-file-line\">library(<span class=\"pl-smi\">glmnet</span>)</td>\n      </tr>\n      <tr>\n        <td id=\"file-train_model-r-L4\" class=\"blob-num js-line-number\" data-line-number=\"4\"></td>\n        <td id=\"file-train_model-r-LC4\" class=\"blob-code blob-code-inner js-file-line\">\n</td>\n      </tr>\n      <tr>\n        <td id=\"file-train_model-r-L5\" class=\"blob-num js-line-number\" data-line-number=\"5\"></td>\n        <td id=\"file-train_model-r-LC5\" class=\"blob-code blob-code-inner js-file-line\">    <span class=\"pl-c\"><span class=\"pl-c\">#</span> three arguments need to be provided - train file (.txt, matrix), seed and output name for the RData file</span></td>\n      </tr>\n      <tr>\n        <td id=\"file-train_model-r-L6\" class=\"blob-num js-line-number\" data-line-number=\"6\"></td>\n        <td id=\"file-train_model-r-LC6\" class=\"blob-code 
blob-code-inner js-file-line\">\n</td>\n      </tr>\n      <tr>\n        <td id=\"file-train_model-r-L7\" class=\"blob-num js-line-number\" data-line-number=\"7\"></td>\n        <td id=\"file-train_model-r-LC7\" class=\"blob-code blob-code-inner js-file-line\"><span class=\"pl-v\">args</span> <span class=\"pl-k\">=</span> commandArgs(<span class=\"pl-v\">trailingOnly</span><span class=\"pl-k\">=</span><span class=\"pl-c1\">TRUE</span>)</td>\n      </tr>\n      <tr>\n        <td id=\"file-train_model-r-L8\" class=\"blob-num js-line-number\" data-line-number=\"8\"></td>\n        <td id=\"file-train_model-r-LC8\" class=\"blob-code blob-code-inner js-file-line\">\n</td>\n      </tr>\n      <tr>\n        <td id=\"file-train_model-r-L9\" class=\"blob-num js-line-number\" data-line-number=\"9\"></td>\n        <td id=\"file-train_model-r-LC9\" class=\"blob-code blob-code-inner js-file-line\"><span class=\"pl-k\">if</span> (<span class=\"pl-k\">!</span>length(<span class=\"pl-smi\">args</span>)<span class=\"pl-k\">==</span><span class=\"pl-c1\">3</span>) {</td>\n      </tr>\n      <tr>\n        <td id=\"file-train_model-r-L10\" class=\"blob-num js-line-number\" data-line-number=\"10\"></td>\n        <td id=\"file-train_model-r-LC10\" class=\"blob-code blob-code-inner js-file-line\">  stop(<span class=\"pl-s\"><span class=\"pl-pds\">\"</span>Three arguments must be supplied: train file (.txt, matrix), seed and output file name for the RData model.<span class=\"pl-pds\">\"</span></span>, <span class=\"pl-v\">call.</span><span class=\"pl-k\">=</span><span class=\"pl-c1\">FALSE</span>)</td>\n      </tr>\n      <tr>\n        <td id=\"file-train_model-r-L11\" class=\"blob-num js-line-number\" data-line-number=\"11\"></td>\n        <td id=\"file-train_model-r-LC11\" class=\"blob-code blob-code-inner js-file-line\">} </td>\n      </tr>\n      <tr>\n        <td id=\"file-train_model-r-L12\" class=\"blob-num js-line-number\" data-line-number=\"12\"></td>\n        <td 
id=\"file-train_model-r-LC12\" class=\"blob-code blob-code-inner js-file-line\">\n</td>\n      </tr>\n      <tr>\n        <td id=\"file-train_model-r-L13\" class=\"blob-num js-line-number\" data-line-number=\"13\"></td>\n        <td id=\"file-train_model-r-LC13\" class=\"blob-code blob-code-inner js-file-line\"><span class=\"pl-c\"><span class=\"pl-c\">#</span>read train data set </span></td>\n      </tr>\n      <tr>\n        <td id=\"file-train_model-r-L14\" class=\"blob-num js-line-number\" data-line-number=\"14\"></td>\n        <td id=\"file-train_model-r-LC14\" class=\"blob-code blob-code-inner js-file-line\"><span class=\"pl-v\">trainMM</span> <span class=\"pl-k\">=</span> readMM(<span class=\"pl-smi\">args</span>[<span class=\"pl-c1\">1</span>])</td>\n      </tr>\n      <tr>\n        <td id=\"file-train_model-r-L15\" class=\"blob-num js-line-number\" data-line-number=\"15\"></td>\n        <td id=\"file-train_model-r-LC15\" class=\"blob-code blob-code-inner js-file-line\">set.seed(as.numeric(<span class=\"pl-smi\">args</span>[<span class=\"pl-c1\">2</span>]))</td>\n      </tr>\n      <tr>\n        <td id=\"file-train_model-r-L16\" class=\"blob-num js-line-number\" data-line-number=\"16\"></td>\n        <td id=\"file-train_model-r-LC16\" class=\"blob-code blob-code-inner js-file-line\">\n</td>\n      </tr>\n      <tr>\n        <td id=\"file-train_model-r-L17\" class=\"blob-num js-line-number\" data-line-number=\"17\"></td>\n        <td id=\"file-train_model-r-LC17\" class=\"blob-code blob-code-inner js-file-line\"><span class=\"pl-c\"><span class=\"pl-c\">#</span>use regular matrix, not sparse</span></td>\n      </tr>\n      <tr>\n        <td id=\"file-train_model-r-L18\" class=\"blob-num js-line-number\" data-line-number=\"18\"></td>\n        <td id=\"file-train_model-r-LC18\" class=\"blob-code blob-code-inner js-file-line\"><span class=\"pl-smi\">trainMM_reg</span> <span class=\"pl-k\">&#x3C;-</span> as.matrix(<span class=\"pl-smi\">trainMM</span>)</td>\n     
 </tr>\n      <tr>\n        <td id=\"file-train_model-r-L19\" class=\"blob-num js-line-number\" data-line-number=\"19\"></td>\n        <td id=\"file-train_model-r-LC19\" class=\"blob-code blob-code-inner js-file-line\">\n</td>\n      </tr>\n      <tr>\n        <td id=\"file-train_model-r-L20\" class=\"blob-num js-line-number\" data-line-number=\"20\"></td>\n        <td id=\"file-train_model-r-LC20\" class=\"blob-code blob-code-inner js-file-line\"><span class=\"pl-v\">t1</span> <span class=\"pl-k\">=</span> Sys.time()</td>\n      </tr>\n      <tr>\n        <td id=\"file-train_model-r-L21\" class=\"blob-num js-line-number\" data-line-number=\"21\"></td>\n        <td id=\"file-train_model-r-LC21\" class=\"blob-code blob-code-inner js-file-line\">print(<span class=\"pl-s\"><span class=\"pl-pds\">\"</span>Started to train the model... <span class=\"pl-pds\">\"</span></span>)</td>\n      </tr>\n      <tr>\n        <td id=\"file-train_model-r-L22\" class=\"blob-num js-line-number\" data-line-number=\"22\"></td>\n        <td id=\"file-train_model-r-LC22\" class=\"blob-code blob-code-inner js-file-line\"><span class=\"pl-v\">glmnet_classifier</span> <span class=\"pl-k\">=</span> cv.glmnet(<span class=\"pl-v\">x</span> <span class=\"pl-k\">=</span> <span class=\"pl-smi\">trainMM_reg</span>[,<span class=\"pl-c1\">2</span><span class=\"pl-k\">:</span><span class=\"pl-c1\">500</span>], <span class=\"pl-v\">y</span> <span class=\"pl-k\">=</span> <span class=\"pl-smi\">trainMM_reg</span>[,<span class=\"pl-c1\">1</span>], </td>\n      </tr>\n      <tr>\n        <td id=\"file-train_model-r-L23\" class=\"blob-num js-line-number\" data-line-number=\"23\"></td>\n        <td id=\"file-train_model-r-LC23\" class=\"blob-code blob-code-inner js-file-line\">                              <span class=\"pl-v\">family</span> <span class=\"pl-k\">=</span> <span class=\"pl-s\"><span class=\"pl-pds\">'</span>binomial<span class=\"pl-pds\">'</span></span>, </td>\n      </tr>\n      <tr>\n        
<td id=\"file-train_model-r-L24\" class=\"blob-num js-line-number\" data-line-number=\"24\"></td>\n        <td id=\"file-train_model-r-LC24\" class=\"blob-code blob-code-inner js-file-line\">                              <span class=\"pl-c\"><span class=\"pl-c\">#</span> L1 penalty</span></td>\n      </tr>\n      <tr>\n        <td id=\"file-train_model-r-L25\" class=\"blob-num js-line-number\" data-line-number=\"25\"></td>\n        <td id=\"file-train_model-r-LC25\" class=\"blob-code blob-code-inner js-file-line\">                              <span class=\"pl-v\">alpha</span> <span class=\"pl-k\">=</span> <span class=\"pl-c1\">1</span>,</td>\n      </tr>\n      <tr>\n        <td id=\"file-train_model-r-L26\" class=\"blob-num js-line-number\" data-line-number=\"26\"></td>\n        <td id=\"file-train_model-r-LC26\" class=\"blob-code blob-code-inner js-file-line\">                              <span class=\"pl-c\"><span class=\"pl-c\">#</span> interested in the area under ROC curve</span></td>\n      </tr>\n      <tr>\n        <td id=\"file-train_model-r-L27\" class=\"blob-num js-line-number\" data-line-number=\"27\"></td>\n        <td id=\"file-train_model-r-LC27\" class=\"blob-code blob-code-inner js-file-line\">                              <span class=\"pl-v\">type.measure</span> <span class=\"pl-k\">=</span> <span class=\"pl-s\"><span class=\"pl-pds\">\"</span>auc<span class=\"pl-pds\">\"</span></span>,</td>\n      </tr>\n      <tr>\n        <td id=\"file-train_model-r-L28\" class=\"blob-num js-line-number\" data-line-number=\"28\"></td>\n        <td id=\"file-train_model-r-LC28\" class=\"blob-code blob-code-inner js-file-line\">                              <span class=\"pl-c\"><span class=\"pl-c\">#</span> 5-fold cross-validation</span></td>\n      </tr>\n      <tr>\n        <td id=\"file-train_model-r-L29\" class=\"blob-num js-line-number\" data-line-number=\"29\"></td>\n        <td id=\"file-train_model-r-LC29\" class=\"blob-code blob-code-inner 
js-file-line\">                              <span class=\"pl-v\">nfolds</span> <span class=\"pl-k\">=</span> <span class=\"pl-c1\">5</span>,</td>\n      </tr>\n      <tr>\n        <td id=\"file-train_model-r-L30\" class=\"blob-num js-line-number\" data-line-number=\"30\"></td>\n        <td id=\"file-train_model-r-LC30\" class=\"blob-code blob-code-inner js-file-line\">                              <span class=\"pl-c\"><span class=\"pl-c\">#</span> high value is less accurate, but has faster training</span></td>\n      </tr>\n      <tr>\n        <td id=\"file-train_model-r-L31\" class=\"blob-num js-line-number\" data-line-number=\"31\"></td>\n        <td id=\"file-train_model-r-LC31\" class=\"blob-code blob-code-inner js-file-line\">                              <span class=\"pl-v\">thresh</span> <span class=\"pl-k\">=</span> <span class=\"pl-c1\">1e-3</span>,</td>\n      </tr>\n      <tr>\n        <td id=\"file-train_model-r-L32\" class=\"blob-num js-line-number\" data-line-number=\"32\"></td>\n        <td id=\"file-train_model-r-LC32\" class=\"blob-code blob-code-inner js-file-line\">                              <span class=\"pl-c\"><span class=\"pl-c\">#</span> again lower number of iterations for faster training</span></td>\n      </tr>\n      <tr>\n        <td id=\"file-train_model-r-L33\" class=\"blob-num js-line-number\" data-line-number=\"33\"></td>\n        <td id=\"file-train_model-r-LC33\" class=\"blob-code blob-code-inner js-file-line\">                              <span class=\"pl-v\">maxit</span> <span class=\"pl-k\">=</span> <span class=\"pl-c1\">1e3</span>)</td>\n      </tr>\n      <tr>\n        <td id=\"file-train_model-r-L34\" class=\"blob-num js-line-number\" data-line-number=\"34\"></td>\n        <td id=\"file-train_model-r-LC34\" class=\"blob-code blob-code-inner js-file-line\">print(<span class=\"pl-s\"><span class=\"pl-pds\">\"</span>Model generated...<span class=\"pl-pds\">\"</span></span>)</td>\n      </tr>\n      <tr>\n        <td 
id=\"file-train_model-r-L35\" class=\"blob-num js-line-number\" data-line-number=\"35\"></td>\n        <td id=\"file-train_model-r-LC35\" class=\"blob-code blob-code-inner js-file-line\">print(difftime(Sys.time(), <span class=\"pl-smi\">t1</span>, <span class=\"pl-v\">units</span> <span class=\"pl-k\">=</span> <span class=\"pl-s\"><span class=\"pl-pds\">'</span>sec<span class=\"pl-pds\">'</span></span>))</td>\n      </tr>\n      <tr>\n        <td id=\"file-train_model-r-L36\" class=\"blob-num js-line-number\" data-line-number=\"36\"></td>\n        <td id=\"file-train_model-r-LC36\" class=\"blob-code blob-code-inner js-file-line\">\n</td>\n      </tr>\n      <tr>\n        <td id=\"file-train_model-r-L37\" class=\"blob-num js-line-number\" data-line-number=\"37\"></td>\n        <td id=\"file-train_model-r-LC37\" class=\"blob-code blob-code-inner js-file-line\"><span class=\"pl-v\">preds</span> <span class=\"pl-k\">=</span> predict(<span class=\"pl-smi\">glmnet_classifier</span>, <span class=\"pl-smi\">trainMM_reg</span>[,<span class=\"pl-c1\">2</span><span class=\"pl-k\">:</span><span class=\"pl-c1\">500</span>], <span class=\"pl-v\">type</span> <span class=\"pl-k\">=</span> <span class=\"pl-s\"><span class=\"pl-pds\">'</span>response<span class=\"pl-pds\">'</span></span>)[,<span class=\"pl-c1\">1</span>]</td>\n      </tr>\n      <tr>\n        <td id=\"file-train_model-r-L38\" class=\"blob-num js-line-number\" data-line-number=\"38\"></td>\n        <td id=\"file-train_model-r-LC38\" class=\"blob-code blob-code-inner js-file-line\">\n</td>\n      </tr>\n      <tr>\n        <td id=\"file-train_model-r-L39\" class=\"blob-num js-line-number\" data-line-number=\"39\"></td>\n        <td id=\"file-train_model-r-LC39\" class=\"blob-code blob-code-inner js-file-line\">print(<span class=\"pl-s\"><span class=\"pl-pds\">\"</span>AUC for the train... 
<span class=\"pl-pds\">\"</span></span>)</td>\n      </tr>\n      <tr>\n        <td id=\"file-train_model-r-L40\" class=\"blob-num js-line-number\" data-line-number=\"40\"></td>\n        <td id=\"file-train_model-r-LC40\" class=\"blob-code blob-code-inner js-file-line\"><span class=\"pl-e\">glmnet</span><span class=\"pl-k\">:::</span>auc(<span class=\"pl-smi\">trainMM_reg</span>[,<span class=\"pl-c1\">1</span>], <span class=\"pl-smi\">preds</span>)</td>\n      </tr>\n      <tr>\n        <td id=\"file-train_model-r-L41\" class=\"blob-num js-line-number\" data-line-number=\"41\"></td>\n        <td id=\"file-train_model-r-LC41\" class=\"blob-code blob-code-inner js-file-line\">\n</td>\n      </tr>\n      <tr>\n        <td id=\"file-train_model-r-L42\" class=\"blob-num js-line-number\" data-line-number=\"42\"></td>\n        <td id=\"file-train_model-r-LC42\" class=\"blob-code blob-code-inner js-file-line\">save(<span class=\"pl-smi\">glmnet_classifier</span>,<span class=\"pl-v\">file</span><span class=\"pl-k\">=</span><span class=\"pl-smi\">args</span>[<span class=\"pl-c1\">3</span>])</td>\n      </tr>\n</tbody></table>\n\n\n  </div>\n\n  </div>\n</div>\n\n      </div>\n      <div class=\"gist-meta\">\n        <a href=\"https://gist.github.com/Zoldin/1617b39f2acbde3cd486616ac442e7cf/raw/5f12bfcec59aeddd8428f9d9c571a243c2302ae6/train_model.R\" style=\"float:right\">view raw</a>\n        <a href=\"https://gist.github.com/Zoldin/1617b39f2acbde3cd486616ac442e7cf#file-train_model-r\">train_model.R</a>\n        hosted with ❤ by <a href=\"https://github.com\">GitHub</a>\n      </div>\n    </div>\n</div></body></html></p>\n<p><strong>evaluate.R</strong> — using the trained model, we predict the target on the test data set.\nThe resulting AUC is the final output and serves as the evaluation metric.</p>\n<p><html><head></head><body><div id=\"gist71113477\" class=\"gist\">\n    <div class=\"gist-file\">\n      <div class=\"gist-data\">\n        <div class=\"js-gist-file-update-container 
js-task-list-container file-box\">\n  <div id=\"file-evaluate-r\" class=\"file\">\n    \n\n  <div itemprop=\"text\" class=\"Box-body p-0 blob-wrapper data type-r\">\n      \n<table class=\"highlight tab-size js-file-line-container\" data-tab-size=\"8\">\n      <tbody><tr>\n        <td id=\"file-evaluate-r-L1\" class=\"blob-num js-line-number\" data-line-number=\"1\"></td>\n        <td id=\"file-evaluate-r-LC1\" class=\"blob-code blob-code-inner js-file-line\"><span class=\"pl-c\"><span class=\"pl-c\">#</span>!/usr/bin/Rscript</span></td>\n      </tr>\n      <tr>\n        <td id=\"file-evaluate-r-L2\" class=\"blob-num js-line-number\" data-line-number=\"2\"></td>\n        <td id=\"file-evaluate-r-LC2\" class=\"blob-code blob-code-inner js-file-line\">library(<span class=\"pl-smi\">Matrix</span>)</td>\n      </tr>\n      <tr>\n        <td id=\"file-evaluate-r-L3\" class=\"blob-num js-line-number\" data-line-number=\"3\"></td>\n        <td id=\"file-evaluate-r-LC3\" class=\"blob-code blob-code-inner js-file-line\">library(<span class=\"pl-smi\">glmnet</span>)</td>\n      </tr>\n      <tr>\n        <td id=\"file-evaluate-r-L4\" class=\"blob-num js-line-number\" data-line-number=\"4\"></td>\n        <td id=\"file-evaluate-r-LC4\" class=\"blob-code blob-code-inner js-file-line\">\n</td>\n      </tr>\n      <tr>\n        <td id=\"file-evaluate-r-L5\" class=\"blob-num js-line-number\" data-line-number=\"5\"></td>\n        <td id=\"file-evaluate-r-LC5\" class=\"blob-code blob-code-inner js-file-line\"><span class=\"pl-v\">args</span> <span class=\"pl-k\">=</span> commandArgs(<span class=\"pl-v\">trailingOnly</span><span class=\"pl-k\">=</span><span class=\"pl-c1\">TRUE</span>)</td>\n      </tr>\n      <tr>\n        <td id=\"file-evaluate-r-L6\" class=\"blob-num js-line-number\" data-line-number=\"6\"></td>\n        <td id=\"file-evaluate-r-LC6\" class=\"blob-code blob-code-inner js-file-line\">\n</td>\n      </tr>\n      <tr>\n        <td id=\"file-evaluate-r-L7\" 
class=\"blob-num js-line-number\" data-line-number=\"7\"></td>\n        <td id=\"file-evaluate-r-LC7\" class=\"blob-code blob-code-inner js-file-line\"><span class=\"pl-k\">if</span> (<span class=\"pl-k\">!</span>length(<span class=\"pl-smi\">args</span>)<span class=\"pl-k\">==</span><span class=\"pl-c1\">3</span>) {</td>\n      </tr>\n      <tr>\n        <td id=\"file-evaluate-r-L8\" class=\"blob-num js-line-number\" data-line-number=\"8\"></td>\n        <td id=\"file-evaluate-r-LC8\" class=\"blob-code blob-code-inner js-file-line\">  stop(<span class=\"pl-s\"><span class=\"pl-pds\">\"</span>Three arguments must be supplied: file name where the model is stored (RData), test file (.txt, matrix) and file name for the AUC output.<span class=\"pl-pds\">\"</span></span>, <span class=\"pl-v\">call.</span><span class=\"pl-k\">=</span><span class=\"pl-c1\">FALSE</span>)</td>\n      </tr>\n      <tr>\n        <td id=\"file-evaluate-r-L9\" class=\"blob-num js-line-number\" data-line-number=\"9\"></td>\n        <td id=\"file-evaluate-r-LC9\" class=\"blob-code blob-code-inner js-file-line\">} </td>\n      </tr>\n      <tr>\n        <td id=\"file-evaluate-r-L10\" class=\"blob-num js-line-number\" data-line-number=\"10\"></td>\n        <td id=\"file-evaluate-r-LC10\" class=\"blob-code blob-code-inner js-file-line\">\n</td>\n      </tr>\n      <tr>\n        <td id=\"file-evaluate-r-L11\" class=\"blob-num js-line-number\" data-line-number=\"11\"></td>\n        <td id=\"file-evaluate-r-LC11\" class=\"blob-code blob-code-inner js-file-line\"><span class=\"pl-c\"><span class=\"pl-c\">#</span>read test data set and model </span></td>\n      </tr>\n      <tr>\n        <td id=\"file-evaluate-r-L12\" class=\"blob-num js-line-number\" data-line-number=\"12\"></td>\n        <td id=\"file-evaluate-r-LC12\" class=\"blob-code blob-code-inner js-file-line\">load(<span class=\"pl-smi\">args</span>[<span class=\"pl-c1\">1</span>])</td>\n      </tr>\n      <tr>\n        <td 
id=\"file-evaluate-r-L13\" class=\"blob-num js-line-number\" data-line-number=\"13\"></td>\n        <td id=\"file-evaluate-r-LC13\" class=\"blob-code blob-code-inner js-file-line\"><span class=\"pl-v\">testMM</span> <span class=\"pl-k\">=</span> readMM(<span class=\"pl-smi\">args</span>[<span class=\"pl-c1\">2</span>])</td>\n      </tr>\n      <tr>\n        <td id=\"file-evaluate-r-L14\" class=\"blob-num js-line-number\" data-line-number=\"14\"></td>\n        <td id=\"file-evaluate-r-LC14\" class=\"blob-code blob-code-inner js-file-line\"><span class=\"pl-smi\">testMM_reg</span> <span class=\"pl-k\">&#x3C;-</span> as.matrix(<span class=\"pl-smi\">testMM</span>)</td>\n      </tr>\n      <tr>\n        <td id=\"file-evaluate-r-L15\" class=\"blob-num js-line-number\" data-line-number=\"15\"></td>\n        <td id=\"file-evaluate-r-LC15\" class=\"blob-code blob-code-inner js-file-line\">\n</td>\n      </tr>\n      <tr>\n        <td id=\"file-evaluate-r-L16\" class=\"blob-num js-line-number\" data-line-number=\"16\"></td>\n        <td id=\"file-evaluate-r-LC16\" class=\"blob-code blob-code-inner js-file-line\"><span class=\"pl-c\"><span class=\"pl-c\">#</span>predict test data</span></td>\n      </tr>\n      <tr>\n        <td id=\"file-evaluate-r-L17\" class=\"blob-num js-line-number\" data-line-number=\"17\"></td>\n        <td id=\"file-evaluate-r-LC17\" class=\"blob-code blob-code-inner js-file-line\"><span class=\"pl-v\">preds</span> <span class=\"pl-k\">=</span> predict(<span class=\"pl-smi\">glmnet_classifier</span>, <span class=\"pl-smi\">testMM_reg</span>[,<span class=\"pl-c1\">2</span><span class=\"pl-k\">:</span><span class=\"pl-c1\">500</span>] , <span class=\"pl-v\">type</span> <span class=\"pl-k\">=</span> <span class=\"pl-s\"><span class=\"pl-pds\">'</span>response<span class=\"pl-pds\">'</span></span>)[, <span class=\"pl-c1\">1</span>]</td>\n      </tr>\n      <tr>\n        <td id=\"file-evaluate-r-L18\" class=\"blob-num js-line-number\" 
data-line-number=\"18\"></td>\n        <td id=\"file-evaluate-r-LC18\" class=\"blob-code blob-code-inner js-file-line\"> <span class=\"pl-e\">glmnet</span><span class=\"pl-k\">:::</span>auc(<span class=\"pl-smi\">testMM_reg</span>[,<span class=\"pl-c1\">1</span>], <span class=\"pl-smi\">preds</span>)</td>\n      </tr>\n      <tr>\n        <td id=\"file-evaluate-r-L19\" class=\"blob-num js-line-number\" data-line-number=\"19\"></td>\n        <td id=\"file-evaluate-r-LC19\" class=\"blob-code blob-code-inner js-file-line\">\n</td>\n      </tr>\n      <tr>\n        <td id=\"file-evaluate-r-L20\" class=\"blob-num js-line-number\" data-line-number=\"20\"></td>\n        <td id=\"file-evaluate-r-LC20\" class=\"blob-code blob-code-inner js-file-line\"><span class=\"pl-c\"><span class=\"pl-c\">#</span>write AUC into txt file</span></td>\n      </tr>\n      <tr>\n        <td id=\"file-evaluate-r-L21\" class=\"blob-num js-line-number\" data-line-number=\"21\"></td>\n        <td id=\"file-evaluate-r-LC21\" class=\"blob-code blob-code-inner js-file-line\">write.table(<span class=\"pl-v\">file</span><span class=\"pl-k\">=</span><span class=\"pl-smi\">args</span>[<span class=\"pl-c1\">3</span>],paste(<span class=\"pl-s\"><span class=\"pl-pds\">'</span>AUC for the test file is : <span class=\"pl-pds\">'</span></span>,<span class=\"pl-e\">glmnet</span><span class=\"pl-k\">:::</span>auc(<span class=\"pl-smi\">testMM_reg</span>[,<span class=\"pl-c1\">1</span>], <span class=\"pl-smi\">preds</span>)),<span class=\"pl-v\">row.names</span> <span class=\"pl-k\">=</span> <span class=\"pl-c1\">FALSE</span>,<span class=\"pl-v\">col.names</span> <span class=\"pl-k\">=</span> <span class=\"pl-c1\">FALSE</span>)</td>\n      </tr>\n      <tr>\n        <td id=\"file-evaluate-r-L22\" class=\"blob-num js-line-number\" data-line-number=\"22\"></td>\n        <td id=\"file-evaluate-r-LC22\" class=\"blob-code blob-code-inner js-file-line\">\n</td>\n      </tr>\n      <tr>\n        <td 
id=\"file-evaluate-r-L23\" class=\"blob-num js-line-number\" data-line-number=\"23\"></td>\n        <td id=\"file-evaluate-r-LC23\" class=\"blob-code blob-code-inner js-file-line\">\n</td>\n      </tr>\n</tbody></table>\n\n\n  </div>\n\n  </div>\n</div>\n\n      </div>\n      <div class=\"gist-meta\">\n        <a href=\"https://gist.github.com/Zoldin/bfc2d4ee449098a9ff64b99c3326e61d/raw/8044bf4a8bf9301113705332f6a26936bd89445b/evaluate.r\" style=\"float:right\">view raw</a>\n        <a href=\"https://gist.github.com/Zoldin/bfc2d4ee449098a9ff64b99c3326e61d#file-evaluate-r\">evaluate.r</a>\n        hosted with ❤ by <a href=\"https://github.com\">GitHub</a>\n      </div>\n    </div>\n</div></body></html></p>\n<p>First, we download the code above into a new folder by cloning the\nrepository:</p>\n<html><head></head><body><div class=\"gatsby-highlight\" data-language=\"dvc\"><pre class=\"language-dvc\"><code class=\"language-dvc\"><span class=\"token line\"><span class=\"token input\">$ </span><span class=\"token command\">mkdir</span> R_DVC_GITHUB_CODE\n</span><span class=\"token line\"><span class=\"token input\">$ </span><span class=\"token command\">cd</span> R_DVC_GITHUB_CODE\n</span>\n<span class=\"token line\"><span class=\"token input\">$ </span><span class=\"token git\">git clone</span> https://github.com/Zoldin/R_AND_DVC</span></code></pre></div></body></html>\n<h2>DVC installation and initialization</h2>\n<p>At first sight, it seemed that DVC would not be compatible with R,\nbecause DVC is written in Python and therefore requires\nPython packages and the pip package manager. 
Nevertheless, the tool can be used with\nany programming language: it is language agnostic and therefore works well\nwith R.</p>\n<p>DVC installation:</p>\n<html><head></head><body><div class=\"gatsby-highlight\" data-language=\"dvc\"><pre class=\"language-dvc\"><code class=\"language-dvc\"><span class=\"token line\"><span class=\"token input\">$ </span><span class=\"token command\">pip3</span> <span class=\"token function\">install</span> dvc\n</span><span class=\"token line\"><span class=\"token input\">$ </span><span class=\"token dvc\">dvc init</span></span></code></pre></div></body></html>\n<p>The code below executes five R scripts with <html><head></head><body><code class=\"language-text\">dvc run</code></body></html>. Each script is started\nwith some arguments: input and output file names and other parameters (seed,\nsplitting ratio, etc.). It is important to use <html><head></head><body><code class=\"language-text\">dvc run</code></body></html>, because with this command the R\nscripts enter the pipeline (a DAG).</p>\n<html><head></head><body><div class=\"gatsby-highlight\" data-language=\"dvc\"><pre class=\"language-dvc\"><code class=\"language-dvc\"><span class=\"token line\"><span class=\"token input\">$ </span><span class=\"token dvc\">dvc import</span> https://s3-us-west-2.amazonaws.com/dvc-share/so/25K/Posts.xml.tgz <span class=\"token punctuation\">\\</span>\n             data/\n</span>\n<span class=\"token comment\"># Extract XML from the archive.</span>\n<span class=\"token line\"><span class=\"token input\">$ </span><span class=\"token dvc\">dvc run</span> <span class=\"token function\">tar</span> zxf data/Posts.xml.tgz -C data/\n</span>\n<span class=\"token comment\"># Prepare data.</span>\n<span class=\"token line\"><span class=\"token input\">$ </span><span class=\"token dvc\">dvc run</span> Rscript code/parsingxml.R <span class=\"token punctuation\">\\</span>\n                  data/Posts.xml <span class=\"token 
punctuation\">\\</span>\n                  data/Posts.csv\n</span>\n<span class=\"token comment\"># Split into training and testing datasets. Two output files.</span>\n<span class=\"token comment\"># 0.33 is the test dataset splitting ratio.</span>\n<span class=\"token comment\"># 20170426 is a seed for randomization.</span>\n<span class=\"token line\"><span class=\"token input\">$ </span><span class=\"token dvc\">dvc run</span> Rscript code/train_test_spliting.R <span class=\"token punctuation\">\\</span>\n                  data/Posts.csv <span class=\"token number\">0.33</span> <span class=\"token number\">20170426</span> <span class=\"token punctuation\">\\</span>\n                  data/train_post.csv <span class=\"token punctuation\">\\</span>\n                  data/test_post.csv\n</span>\n<span class=\"token comment\"># Extract features from text data.</span>\n<span class=\"token comment\"># Two CSV inputs and two matrix outputs (.txt files).</span>\n<span class=\"token line\"><span class=\"token input\">$ </span><span class=\"token dvc\">dvc run</span> Rscript code/featurization.R <span class=\"token punctuation\">\\</span>\n                  data/train_post.csv <span class=\"token punctuation\">\\</span>\n                  data/test_post.csv <span class=\"token punctuation\">\\</span>\n                  data/matrix_train.txt <span class=\"token punctuation\">\\</span>\n                  data/matrix_test.txt\n</span>\n<span class=\"token comment\"># Train the ML model on the training dataset.</span>\n<span class=\"token comment\"># 20170426 is another seed value.</span>\n<span class=\"token line\"><span class=\"token input\">$ </span><span class=\"token dvc\">dvc run</span> Rscript code/train_model.R <span class=\"token punctuation\">\\</span>\n                  data/matrix_train.txt <span class=\"token number\">20170426</span> <span class=\"token punctuation\">\\</span>\n                  data/glmnet.Rdata\n</span>\n<span class=\"token comment\"># Evaluate the model 
on the testing dataset.</span>\n<span class=\"token line\"><span class=\"token input\">$ </span><span class=\"token dvc\">dvc run</span> Rscript code/evaluate.R <span class=\"token punctuation\">\\</span>\n                  data/glmnet.Rdata <span class=\"token punctuation\">\\</span>\n                  data/matrix_test.txt <span class=\"token punctuation\">\\</span>\n                  data/evaluation.txt\n</span>\n<span class=\"token comment\"># The result.</span>\n<span class=\"token line\"><span class=\"token input\">$ </span><span class=\"token command\">cat</span> data/evaluation.txt</span></code></pre></div></body></html>\n<h2>Dependency flow graph for the R example</h2>\n<p>The dependency graph is shown in the picture below:</p>\n<p><html><head></head><body><span class=\"gatsby-resp-image-wrapper\" style=\"position: relative; display: block; margin-left: auto; margin-right: auto;  max-width: 256.5px;\">\n      <a class=\"gatsby-resp-image-link\" href=\"/static/e9ba609b030acd01d27fcd1ff99a3f7f/4df79/dependency-graph.png\" style=\"display: block\" target=\"_blank\" rel=\"noopener\">\n    <span class=\"gatsby-resp-image-background-image\" style=\"padding-bottom: 190.05847953216372%; position: relative; bottom: 0; left: 0; background-image: 
url(&#x27;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAmCAYAAADEO7urAAAACXBIWXMAAAsSAAALEgHS3X78AAAIEUlEQVRIx51Vd1CbRxbfD4FBYIoKSAhRQjNNBoyRQF0gEMUChd5sC5BkVSB04kIJptgGJCQkUQ04sccJccEODg4YHwYBJpfEnmTm5ibOH1dyNzdzN5n8f6P79AkwBJKZ3JvZ2be7b3/73r4GwA5lNN8HGvMmwgu0WmQOuVqIDl/IQ9N41BAWVRDGoGaHnRJnY2M+Kjqe1tnr0Gv+D6AaFIisxrQBjiTFiOXA2nXY0RPbDTw8ZMDbvQFgCDecMcdNwHX3PH/o1tFABZ3zQGFYR/gL+nVnlXHdv9a07deo/5Nv35IVzyxVJJ3tvxeoXbZiG4zf+cLnAbBW3jb5c9efH9aQX38PtM58BSFaGtajakc3KUrDGkltXgsoazRRROIrrCJFP7VA0U+rvr4QXDu2TVQbLWy1yeJiuwPzBwHTYMCWXUD9enSNeTNGabAQ1SPr/sW1utgS1TVq4YWepCK1Nv7C0Eqg2rSFAGpMFvSRgCWdC0Cu3zF5eM1FNWIJUBktAQr9auClxz8RMmXdFNmIhdx2/x9Em7nweaDauOFjk8/pXP4NpxgsR+7DdhHgyfGX+8WX/nw0UHbLQ6Dc8fD5kU8gq6MV5I50eArvXsHS0rlkAa8ghJv+bkDKlUov0XQ7UW54hFLpt6Gz45PIHeX+6AgODtnjxe2zBx6ijfBxUQ9DPchygm/wdV8MbZqJSx7n7YXNG3i812GP29bWFgCEsVxkQfAhAqvVivByw4ab0rAeqRrZCqsf+z6o7cMfg9KqL6ae7Z6jXLz9ExxOr4Ph8wj4e4J2gZv6JuwMGQZhCbKBokABmNlix3gAUIqBhUSVaTNG3nXHR3S2hZFzrjUxv6ozLrekkZ1bVsfNkV49eXHubx5y3Uqast3oo4gDDgDwgKyp8eA/euJxGHhyv/HKipHrX9AVw6ux1b2Pkqt65pPFXXOMyu77jKoPPmVKB5YS4ViNb7jzJjYyke6FwbjhbfeDov0gIHj4NdRIxANO2El88kmmDyuBRT5FDMEizln/J3SUA3WwVU8XrSDexc8jMSqJQI/jkJmhbAxymDO5iMoyzYXxFE0BNL6IyMwsILLFGpLQPBeZNzWPZIGo5iJI4aUBfo4IuZOn7UZljjWEsy8I/ampPL/kzFQftiSHkD1eF2HPEvMnjulfvPSg6sZ843QTxMyVL72Ed5eOHRlfO45L+7DVkbMg9Uy5U0FIny0jcR5UYHKn6p1B5tiD/eJQbnwc+gt4zodHIRmg8rwAqswboFhk4MzwB87lGIASkYFjNbA5AoAOX6LjRN4N1IFHz9x8DNgDo6Dy23+DfI0Ikg89xWsMz/00+hWSSrdEMr62YlLLG2O4RTWxNl6tWyap9c9IGsMyWTn81BvEwhhTciAcb95vid2UzJa7DnAu8zVGS0jN6HaI7PqTiCJlf0qR8hqvUNHPK1IPcqp756PqJr4K0Rg33oGLQgac1072ArEvn4PDY/d4pXGTUTP+x/DaiW/CVCPrEefaJunlTWMMuIwxy9tmaNLBZ5Gq0S/DFcbNMI35JbtudAsxWWPegjXS3waspm4ECIvFO8If45rAFrpmnm/DCioaMRkVzV45tTo3fokaK2m5jjM8euXWMnDbq6FvGtPUP4uV1l5yR76ttBlkzXx20IFeeDzOE4v13bdl+3jU7hDSw1Hi4jRUkzOAdmuMf6B/MAj3d9i7UfHDfx0EHVo8I44Zwo2hk/lxLBKXkkwWNF0j9VmtDkdFTvrrn51YRRIcM4L6DjOKRkyOSAhi0TOIgpmF40A59Q2UafrYj9NnCGRcHSSzerVkbq8uIMvwUVDp1CJS/9wTooG3byAQ9RgQQOHMEzTvxmhQUkt7YNL7HQG0tvYAbo8uSHjzgc/BDhcV7OEWHuD5
S43kw29rneDWwh4PFxaHwsuX3TmfLzqRdqLE1pC4qpEXcY3T31KKm8c5RY1mXq1xI14x/DxBbdqgyobXfOydcA1kjd2z9/Cb8yBf2YbwqooKF+tuPE/MASAdXssorR9zzcpX0XLKW6Mz85S0FKGMKdIYMSrTVphU9yLOJgzLQf/6698BY3TG/veaGmSukkjQ5+RypIhoSgqRbpVYa96Mqht7Ga3ULsdKrz2Or5/82tZGo2vNG7FwacccKvG2tUq1O6PhAe3f26Mgoid2J0QOEGz2IU9zOJzd2RUe0P69PcJ64/32r6XjW5BCZ/lV0B2y9eW3dTP7xZtj9AQ2nh6dHMqm0P1PRVBjGKe4uNzH20gtLOoaBBKJ5NCgUCjIfRQK5QrBhGSBAxy2uZOf4fja6ROshiskmliJo9dfIQgGJ0/kzS4Sfk2l93dDxJ5Mx0LjU6GH8B7sD7um/KUNB+7nq66crj7/dwd0hMr2zuOKXIGzRCZDw9q8HVIpWlIlRlfyQ53OlxWgh/svOxF8cP4HXhNNPjlclG39ubUTklZVAhgEgsEgqVwNFZ8+Dup6ZxxkulV65cCz05LB56fTK7tEpZc/ZopvrMTBbZUHBKNzIG3yU8AbnQW0W/eA33c/gt8itWnzmMq4kSrT/QFzprSJKiqtZ2aVNSefUQ1j3hvfTkGEysrK7MU1Kwv5cKlUesgJ1T32LPng5gqoGlqNEA88j67SrkeWdD6kVA6tRlYOrkZUa1djwe+hi6ZXR23jwf9Dtlym6m+BaFtM/sUKJQI3J87pdFxMVPwJeiLPh+Ue4nL5Z+vvA00x3bXH7vy2E7N7KDopu/hESgKPyKUL/Dit3THZtx55/w8YH8oPRLNy4wAAAABJRU5ErkJggg==&#x27;); background-size: cover; display: block;\"></span>\n  <picture>\n        <source srcset=\"/static/e9ba609b030acd01d27fcd1ff99a3f7f/c54d4/dependency-graph.webp 175w, /static/e9ba609b030acd01d27fcd1ff99a3f7f/a3432/dependency-graph.webp 350w, /static/e9ba609b030acd01d27fcd1ff99a3f7f/3be34/dependency-graph.webp 513w\" sizes=\"(max-width: 513px) 100vw, 513px\" type=\"image/webp\">\n        <source srcset=\"/static/e9ba609b030acd01d27fcd1ff99a3f7f/17006/dependency-graph.png 175w, /static/e9ba609b030acd01d27fcd1ff99a3f7f/d6f3f/dependency-graph.png 350w, /static/e9ba609b030acd01d27fcd1ff99a3f7f/4df79/dependency-graph.png 513w\" sizes=\"(max-width: 513px) 100vw, 513px\" type=\"image/png\">\n        <img class=\"gatsby-resp-image-image\" src=\"/static/e9ba609b030acd01d27fcd1ff99a3f7f/4df79/dependency-graph.png\" alt=\"Dependency graph\" title=\"Dependency graph\" loading=\"lazy\" style=\"width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;\">\n      </picture>\n  </a>\n    </span></body></html><em>Dependency\ngraph</em></p>\n<p>DVC memorizes this dependencies 
and helps us reproduce results at any\nmoment.</p>\n<p>For example, let’s say that we change our training model to use a ridge\npenalty instead of a lasso penalty (setting the alpha parameter to <html><head></head><body><code class=\"language-text\">0</code></body></html>). In that case\nwe modify the <html><head></head><body><code class=\"language-text\">train_model.R</code></body></html> job, and to repeat model\ndevelopment with this algorithm we don’t need to repeat all the steps above,\nonly the steps marked in red in the picture below:</p>\n<p><html><head></head><body><span class=\"gatsby-resp-image-wrapper\" style=\"position: relative; display: block; margin-left: auto; margin-right: auto;  max-width: 256.5px;\">\n      <a class=\"gatsby-resp-image-link\" href=\"/static/da29b8bd00ccba3578fdfe91cd7f34bc/4df79/marked-steps.png\" style=\"display: block\" target=\"_blank\" rel=\"noopener\">\n    <span class=\"gatsby-resp-image-background-image\" style=\"padding-bottom: 190.05847953216372%; position: relative; bottom: 0; left: 0; background-image: 
url(&#x27;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAmCAYAAADEO7urAAAACXBIWXMAAAsSAAALEgHS3X78AAAIFUlEQVRIx5VWCVCTWRJ+SUBNIHdCkj8/SSAEY7hBEkLIxRUOw4AKqIiKQCAhHBYgIDUKigc6KpEECKgjuroX47nquIu6ByAuzo5j7dZO7VZN1W7N7FG7U+rq1FjK5t+XP4BcU7N2VVf3e//r7/Xr7tfvB2CGsloug9qBSVw39ThwKT1YTA6/uZ6sMqqkOpVJplGuk8WV5bKiLhYFZnR2Ew+7vwZKpw1fW9v/ACxLNtfEgjHF4U9ndgEazQK4gY2AyTu2ghHQByiz3/OPX1geaH3HVWB1+sCqe8dX1vRNBNf3TwmbnZ8LjoxinJTN1qTSo1fEPfcwVmP/HwQ1feMi6BXXu7744F1QP3O6OUpv/Ai0DP+O4NWtznFF/eBklM01jtS6x0Qlzf1RBWX7tEW2buVGW7eq/IPboQ1Dj/hwU32Na2KV1wbqCwEzGi9DwE9wwGrnWETdwGQk9JRvd40FF9c7YoohWGHVwaRC+8k4y8n7YnvfJK/GNe4FJOOAi0IFNu2/5T0qrlscY6tgLEXQQFTd+ytx+42nvGxLV6TFNYG2Xv4b39Y3JoLfIeiDINyZtlug7ruSMgu6mKAbPCj8Fs+bm58sD5Tbem3O7e19PyVgAAP5rk66+cfvs1SZBjTLWCg1mApEqfvKGAXnOvjVvTdItt4pwrahs0urQyoNm9PLOs4v2CjRmcpWXJXRRFa+QHoMZarOpbCTh1LnyuYR5IaOHlxvadkNPYvU4QMBHwEYhgFflh8EwEyvqXE9lO0a/LOk9QdfSTLK29NKD4xE7rn4NLiu/7NQq3NMDtdIZoGbDp/2KXwIos1aB6rXVwHtujK/aABIthO3Eu39k5FV+y8F5Ze2aPK2tSZu2NkZ+97mJl3+ll2GvMqD0XtGvqRZHfczbPv6giyxgAhAGrA0N83AE3yCweGwoKCeeIIxoQfJtt7fxFQcvqHeefi6eseBEc3OriuQL6dYjo8m2pzjcY0//CJGkahhsFiBHK99SISQAEzXPyXsZlOBQRbNUUdrg7QJOjSeL2XhyRn7O2G5BLrgqW5/jIG4VUJaokLNS47Vo9owPQP/mHfmDimnf0RmtDaJVOn5/JScjXzdjlokzz2yZsPZG/gtKKhrB2mpmSDjvQLcZkPPQVL2YFO4rsocrEw3CtXZ6Tx9hZmXO9QgxxdkDoz4ZfxikqZyDAriTp3mZ99/xFj3o9EVy9bXTOIyLrT56W5V0I2XtvIzh7cguqslzPyzu1aC7MFr85cTChLiyL+GchvkzRJAKuYA0jYBIOlFYKVWDFZu5wLSRjHwg02L6DXoFCJ+7vyjpAWbrvvwZ0B73A12/P7foLC+gFB98g7H3nsftZ+6h9h6RoXOzzBmWklTpKGoLsb1GGPWOEYR+6m7iL33Lmo99XMuUMDyO1sFzEPNs0kmztVgTstPiJZTY+nwHktr3VNSy7Hb8kJrd2qR7aix0HrEWGg/oS8/dE1RP/SJ1N43EQI5y+6a8Pfawrs977asjnnbZPsmNdAgvP70YxkElm9rPaPe2jyo2dLYrylpG1ZWnri3xu5+JIPrwmoHfqtrcD/Ej1w78BCA7N6LQNfchQOx2Fw/GBhKvM5Myd7exjKVNDFNW3czzPWOgPRNdlZF6zF27/XHAS3HLzEaj5xjNnefZ1U2vE/Fw7a5GWQP31yYQAaHy6azWIJ5U97Ak2bZrFlN2lGcSWpeBQifzywIFotCgVxMnLMo/WKaaOro4WhiNVJ9pEaYHqtFDFFqNGv3UaQbw4jLVU7mk+f+2qJydopcGZKiUPHV8rUSbXIW3zR8MxD8yZBHyHNeFKZ1nRQbOrvR1APH0Myu46I8xzlJe4fD1//se4AsIBDkH3L6LsPwbbLxA7ckqWWvOLm9U6Rq2ycy
HHJIzGevBi16B7JowJBGX1LL/m/7qunCrTldBE9QtHcv1fDxHX9kpkqAh8s1vELRWIzPi5pks/VjXK4REyBxr4RogkcgUGI0Or6rh84AOYNXfG/4h9fBRlubz/nS0lXYbD2fHoG7s1hZGCqkvF4TrcKiEyOxGKXqjSIhBROKmZiAL/NQqbE4YGAg4elfvwQa97Av9nX1uCyvrCRvt1rxJtKwqQgC8nhKD4JEYEJUMR3ES5jmcJMxoVDuESARcD7Gw+EwZ06yIAx2u31WkiET5s/N0TSdrnnFYi+JoYe+ZAro9fpZSYFMmD/3NviBFPmCcbiCgPn77r2HRgPfQeS3LRrS65oDKzTxWk5EgjEMaHKQNdGaqCRlGvufVe14L/yXMgVgYWEL2COVgqsIgttzSX4UQCDigAFEuPlo1V72LzfsXH3emIOcj9dw+1PNQXeKKmR3dx3hgv+PSK10KgErXAtehCM+T7G8YiJWUkF+o4iKw2LjIzCFQoShCOIJC0MxqRTFvBKyb+xlGf+1TCbEQkNRN5WqWQB/u3r/wjh6mexH+CYmnoRJQ4kQhAjBoJSSsBApcVoR7vcynJfyPJSZ+F8ZN2kUZW/5Ryg35VspK/YbGc8IprbUgKnSWvCVKQ98WlL9ved7GYH6Q059EYswvk2IV2JJavV0XJz6WUI485VclOrLoMT3XnuEQjzgeOAXJeKV1JeEr8154EUIQ/4fCS3yWQhN/pxHSX4mokY8l1DlL0NZMeBd6A2dvPTNIlOSMF97ezeCcQN/PLMLavCXxRoOVguiVoSrTOyLIbLI8ugkPkORQcGC3xH0L7Z0n1fd60kf5a8V5icZQlmaXL4yySRw5OrETy0a6v8AG2qzeCeNScoAAAAASUVORK5CYII=&#x27;); background-size: cover; display: block;\"></span>\n  <picture>\n        <source srcset=\"/static/da29b8bd00ccba3578fdfe91cd7f34bc/c54d4/marked-steps.webp 175w, /static/da29b8bd00ccba3578fdfe91cd7f34bc/a3432/marked-steps.webp 350w, /static/da29b8bd00ccba3578fdfe91cd7f34bc/3be34/marked-steps.webp 513w\" sizes=\"(max-width: 513px) 100vw, 513px\" type=\"image/webp\">\n        <source srcset=\"/static/da29b8bd00ccba3578fdfe91cd7f34bc/17006/marked-steps.png 175w, /static/da29b8bd00ccba3578fdfe91cd7f34bc/d6f3f/marked-steps.png 350w, /static/da29b8bd00ccba3578fdfe91cd7f34bc/4df79/marked-steps.png 513w\" sizes=\"(max-width: 513px) 100vw, 513px\" type=\"image/png\">\n        <img class=\"gatsby-resp-image-image\" src=\"/static/da29b8bd00ccba3578fdfe91cd7f34bc/4df79/marked-steps.png\" alt=\"marked steps\" title=\"marked steps\" loading=\"lazy\" style=\"width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;\">\n      </picture>\n  </a>\n    </span></body></html></p>\n<p>DVC knows based on DAG graph that changed <html><head></head><body><code 
class=\"language-text\">train_model.R</code></body></html> file will only change\nthe following files: <html><head></head><body><code class=\"language-text\">glmnet.Rdata</code></body></html> and <html><head></head><body><code class=\"language-text\">evaluation.txt</code></body></html>. To see our new\nresults we need to execute only the <html><head></head><body><code class=\"language-text\">train_model.R</code></body></html> and <html><head></head><body><code class=\"language-text\">evaluate.R</code></body></html> jobs. Conveniently,\nwe don’t have to keep track of which steps need to be repeated.\nThe <html><head></head><body><code class=\"language-text\">dvc repro</code></body></html> command does that for us. Here is a code example:</p>\n<html><head></head><body><div class=\"gatsby-highlight\" data-language=\"dvc\"><pre class=\"language-dvc\"><code class=\"language-dvc\"><span class=\"token line\"><span class=\"token input\">$ </span><span class=\"token command\">vi</span> train_model.R\n</span><span class=\"token line\"><span class=\"token input\">$ </span><span class=\"token git\">git commit</span> -am <span class=\"token string\">\"Ridge penalty instead of lasso\"</span>\n</span><span class=\"token line\"><span class=\"token input\">$ </span><span class=\"token dvc\">dvc repro</span> data/evaluation.txt\n</span>\nReproducing run command for data item data/glmnet.Rdata. Args: Rscript code/train_model.R data/matrix_train.txt 20170426 data/glmnet.Rdata\nReproducing run command for data item data/evaluation.txt. 
Args: Rscript code/evaluate.R data/glmnet.Rdata data/matrix_test.txt data/evaluation.txt\n\n<span class=\"token line\"><span class=\"token input\">$ </span><span class=\"token command\">cat</span> data/evaluation.txt\n</span>\"AUC for the test file is :  0.947697381983095\"</code></pre></div></body></html>\n<p><html><head></head><body><code class=\"language-text\">dvc repro</code></body></html> always re-executes the steps that are affected by the latest\ndeveloper changes. It knows what needs to be reproduced.</p>\n<p>DVC can also work in a <em>“multi-user environment”</em>. Pipelines (dependency\ngraphs) are visible to other colleagues if we are working in a team and using\nGit as our version control tool. Data files can be shared if we set up cloud\nstorage, and with <em>dvc sync</em> we specify which data can be shared and used by other\nusers. In that case other users can see the shared data and reproduce results\nwith that data and their own code changes.</p>\n<h2>Summary</h2>\n<p>DVC improves and accelerates iterative development and helps to keep track\nof ML processes and file dependencies in a simple form. In the R example we\nsaw how DVC memorizes the dependency graph and, based on that graph, re-executes only\nthe jobs that are related to the latest changes. It can also work in a multi-user\nenvironment where dependency graphs, code and data can be shared among multiple\nusers. 
Because it is language agnostic, DVC allows us to work with multiple\nprogramming languages within a single data science project.</p>","timeToRead":15,"fields":{"slug":"/r-code-and-reproducible-model-development-with-dvc"},"frontmatter":{"title":"R code and reproducible model development with DVC","date":"July 24, 2017","description":"In this document we will briefly explore possibilities of a new open source\ntool that could help with achieving code simplicity, readability and faster\nmodel development.\n","descriptionLong":"There are a lot of example on how to use Data Version Control (DVC) with a\nPython project. In this document I would like to see how it can be used with a\nproject in R.\n","tags":["RStats","R","DVC"],"commentsUrl":"https://discuss.dvc.org/t/r-code-and-reproducible-model-development-with-dvc/298","author":{"childMarkdownRemark":{"frontmatter":{"name":"Marija Ilić","avatar":{"childImageSharp":{"fixed":{"base64":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAUCAIAAAAC64paAAAACXBIWXMAAAsSAAALEgHS3X78AAAEkElEQVQ4yy3Q+VPSeRzHcXZnZzs0w8IjQYNUEhRF5VYQEAUEb0XFXEtyK7LD1U6PstTWXPPKQctUvgffk+/3C4jpmLaNre1O287sVLM/9M/sl21nnvP57fF+fefLYej1SPglG8OsU2yhDSYUfb64dG/g9tPpaRhESJwKMVGCCJGxaByjMDSIBggExjkoQhI4zQZC2OoauLS41NXRUSDOykw5np8p1BUVOcrNY0ODCBgA12DAH1hbhVZXwJUXwMqyn8NeCpIhFiOsB+G2xrrkw9+nHzkkST4q5sULEg5mcA/li/gjt/oxGPEDCASi7Ak/e2gN5mAETZAMQTA4Ti/6FjWy01pRSnOxuFMnaSgU2XMFFVKhnM8rleYszi/AcBD0oxCAQn4EXAtw8P8w+5IhxnuhWy8WXKkoGnQUPWzQjLpKx5t0DxpKrlQoNIIkT1sbFqAgPwoDGPuCawgHwykWk8EQAIFGrcIpP3nHUTjRpF7wVPku1fu67LNu46TbdEYpYceXZp8hEAkB2FfPLlPsLPuHJ8bGcwS8OoVwsFY15dYv9zSu9nUt97jmz9nnOp19ZpWcnzTcd4dCmQCAxfLjMcyOhyMb173eDO6B+iLRnSr1zy2mF94WuK975dqZha66uWZHb5lSwou/2N5BozQC4hgbQMU+m10OhaPtjXWC+G/blJKrZtWPZlWvreS5p3nSXdNTaRh2Vl4tVUq4h886q0mYxGACZ4NimGYxRa83VFVmxH/XbVK4i7Jq5DkdOvlDh/6GWeeQ5Xo0xf2mUtUJbndtLQEFiQBNwDQRCH3FDEWtn3W1SHiJ16zaQZtqqMZ006IdNhTer9DesOhHK00jVZVGwbHrrW00thFEIhS6TiPrMcxGkqGBvn7hce45tWym2rzQartrko8aVU9s+vma8uWm6vHaujJB0oi3J0q/YvBomNwMYy85KEaxsX5
2Zp6feswmTp+0mx5Xlw2aiydtloVq67zTDJ1p7TWbSkRpvvHHL5ndCLkZpbY3g9v/4wBKrSxDytzMYn7iaHXJA6vqlqZgwlA2bTXN1ZcDHnetJLuyMC8CEFvM663Q7nb49U741xhG0CBGME9nZnTSdDEvrkuTNWJVDGgL76vVTyzGFbfzXo1Zyk3otJS+IiO7kb030b29zd/ebu5zAmgQRggMZ/oun5fz4zJ58caspEsl0keO0plGy7MW+y/1FpMoNT+F222QUz7fu53377Z+f7/z5/vdDxwQIhCUXpietioluamHRYkHzDknvOUFvWV5Y3bdmK2kQ56dz0swZqa68oVD7c37oa2P+58/7n369PYzBwDxAIR21lcpTibK+Amnjh00ZiXfcmpv2JUDTu1dm7JZLixMPWIVn7BlJrcUZE+dv7hPbP6z9/eXv75w/H5kdGhYdyo5P+WgmHdAEPeNLO3QRVPeSINhxGV82KT/QZWtSztiPXXckHbUlZvuUcsnr/xE+YBPbz5wnk7N9V+6YFVIDbkitThDJuSrT6d7q3QTZ2unPI3TnqbrDn27VtauyXMppb12w2Vr2ezN2+Qi/MfG/r98UBJBtoZjogAAAABJRU5ErkJggg==","width":40,"height":40,"src":"/static/9add844328fd47c78f5df2bdfe1c56a3/4d3a4/marija_ilic.png","srcSet":"/static/9add844328fd47c78f5df2bdfe1c56a3/4d3a4/marija_ilic.png 1x,\n/static/9add844328fd47c78f5df2bdfe1c56a3/4c8bc/marija_ilic.png 1.5x,\n/static/9add844328fd47c78f5df2bdfe1c56a3/c0e17/marija_ilic.png 2x","srcWebp":"/static/9add844328fd47c78f5df2bdfe1c56a3/e145b/marija_ilic.webp","srcSetWebp":"/static/9add844328fd47c78f5df2bdfe1c56a3/e145b/marija_ilic.webp 1x,\n/static/9add844328fd47c78f5df2bdfe1c56a3/0d42c/marija_ilic.webp 1.5x,\n/static/9add844328fd47c78f5df2bdfe1c56a3/f46db/marija_ilic.webp 
2x"}}}}}},"picture":{"childImageSharp":{"fluid":{"base64":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAOCAYAAAAvxDzwAAAACXBIWXMAAAsTAAALEwEAmpwYAAAC00lEQVQ4y2NgwAFcrd3BtASvAMPUOTvAbE0NYzDt5ZfBQBIIm9MKpoMn1DPE7pgEF895NoUhbGo9mB25oJXBpl+PsGG504+D6ZTpaxm0GUrBbFVJbU5laVV+mJqUmSvBdOGs00D1J7AbZGpmBmfH1LZAvLuGgdF7Vhq3g6+1uIu/g5RXezhPwKoUNpBcfGcLInicndFcFRkFptXV1Bj+//8PZmfOucyUO+0EX/Lq1eLebWVKzqXp6skblkhmzt4llDvlHDtIzfI//xlcvQMhQZCWCTEsZNYWMO3fMZ8hc+s5Bu/UBiZgdDDlTd6nnT35kFpq5Xzt2KJpBlHZPRZJJTOMUto2KOdMO2qcWVDFBXbpwm5gmHeAzYiY1wYxNHLbXbhrxcVlWHiYGbjLl9yWypp0RDmpcZlWYsNSncT6RXqJtfMN0nv3K+bPvqJuZGYmhOzLuG1zUF0YduU/Y0jHDGE//3AZdws7Sd+UQomEBQcEG/b9Z67c8YulZMd3lr6b/5kSlq4Q8EtLlHI2c5by9vGXiJhUKRi5fioT3IURsyEG9h7/zxA6c52AV/sUCcfaDqng6cvFo5fu5kCPvMAlTWw+U3LEfDvjpb1648Qj59UK9d2FhHvMwiaIoowtzxgkdRnBbP6YVAaf9DSuaSBvSDEwp0fYswTZSLEYaSvy1CXKM6WqCDEXMDMwJRfZoViWd2AzNPygLgSB8ElrwabmTjzKmj3lsGbBjOOKBVMPqcRXzDQNyWyxS29ZrZs//ahS/vQjKnnTTiqD1EbNr2E4vQPiwvjFkETP4D9rJYNGQChD9vSTYH76lKOc2dOOm+XPPquePWm/XmL1XOfUuoUOSXWL7fJmntLMm3FSK2/GCWO462YcZwhYGIIaNsFBQRDvTz4KpssWXGLMmXKYuWLTFyZkdfnTjzPnzTjFfP8xxFU5uHIKMsiacoyoPA8zzCRvGVwMAD/A8rORmcCjAAAAAElFTkSuQmCC","aspectRatio":1.4166666666666667,"src":"/static/18df1a636cbcc0b9b44625158d4de277/286b3/post-image.png","srcSet":"/static/18df1a636cbcc0b9b44625158d4de277/1f44b/post-image.png 213w,\n/static/18df1a636cbcc0b9b44625158d4de277/3e433/post-image.png 425w,\n/static/18df1a636cbcc0b9b44625158d4de277/286b3/post-image.png 850w,\n/static/18df1a636cbcc0b9b44625158d4de277/9a739/post-image.png 1275w,\n/static/18df1a636cbcc0b9b44625158d4de277/c47cc/post-image.png 1700w","srcWebp":"/static/18df1a636cbcc0b9b44625158d4de277/5c1d9/post-image.webp","srcSetWebp":"/static/18df1a636cbcc0b9b44625158d4de277/99b2d/post-image.webp 213w,\n/static/18df1a636cbcc0b9b44625158d4de277/23220/post-image.webp 425w,\n/static/18df1a636cbcc0b9b44625158d4de277/5c1d9/post-image.webp 850w,\n/static/18df1a636cbcc0b9b44625158d4de277/5e720/post-image.webp 
1275w,\n/static/18df1a636cbcc0b9b44625158d4de277/35cfd/post-image.webp 1700w","sizes":"(max-width: 850px) 100vw, 850px","presentationWidth":850}}},"pictureComment":"DAG on R example"}}},"pageContext":{"next":{"fields":{"slug":"/data-version-control-in-analytics-devops-paradigm"},"frontmatter":{"title":"Data Version Control in Analytics DevOps Paradigm"}},"previous":{"fields":{"slug":"/how-a-data-scientist-can-improve-his-productivity"},"frontmatter":{"title":"How A Data Scientist Can Improve His Productivity"}},"currentPage":19,"slug":"/r-code-and-reproducible-model-development-with-dvc"}}}