{"componentChunkName":"component---src-templates-blog-post-tsx","path":"/data-version-control-in-analytics-devops-paradigm","result":{"data":{"markdownRemark":{"id":"9798ae3d-70df-5b5d-9c91-9cad33201105","excerpt":"<h2>Data Science and DevOps Convergence</h2>\n<p>The primary mission of DevOps is to help the teams to resolve various Tech Ops\ninfrastructure, tools and…</p>","html":"<h2>Data Science and DevOps Convergence</h2>\n<p>The primary mission of DevOps is to help the teams to resolve various Tech Ops\ninfrastructure, tools and pipeline issues.</p>\n<p>On the other hand, as mentioned in the conceptual review by\n<a href=\"https://www.forbes.com/sites/teradata/2016/11/14/devops-for-data-science-why-analytics-ops-is-key-to-value/\">Forbes</a>\nin November 2016, industrial analytics is no longer going to be driven by data\nscientists alone. It requires an investment in DevOps skills, practices and\nsupporting technology to move analytics out of the lab and into the business.\nThere are even\n<a href=\"https://www.computing.co.uk/ctg/news/2433095/a-lot-of-companies-will-stop-hiring-data-scientists-when-they-realise-that-the-majority-bring-no-value-says-data-scientist\">voices</a>\ncalling on Data Scientists to concentrate on agile methodology and DevOps if they\nwant to retain their jobs in business in the long run.</p>\n<h2>Why DevOps Matters</h2>\n<p>The eternal dream of almost every Data Scientist today is to spend all (well,\nalmost all) the time in the office exploring new datasets, engineering decisive\nnew features, inventing and validating cool new algorithms and strategies.\nHowever, reality is often different. One of the unfortunate daily routines in a\nData Scientist’s work is raw data pre-processing. 
It usually translates into\nthe need to</p>\n<ol>\n<li>\n<p><strong>Pull all kinds of necessary data from a variety of sources</strong></p>\n<ul>\n<li>Internal data sources like ERP, CRM, POS systems, or data from online\ne-commerce platforms</li>\n<li>External data such as weather, public holidays, Google Trends, etc.</li>\n</ul>\n</li>\n<li>\n<p><strong>Extract, transform, and load the data</strong></p>\n<ul>\n<li>Relate and join the data sources</li>\n<li>Aggregate and transform the data</li>\n</ul>\n</li>\n<li><strong>Avoid technical and performance drawbacks</strong> when everything ends up in\n“one big table”</li>\n<li>\n<p><strong>Facilitate continuous machine learning and decision-making in a\nbusiness-ready framework</strong></p>\n<ul>\n<li>Utilize historical data to train the machine learning models and algorithms</li>\n<li>Use the current, up-to-date data for decision-making</li>\n<li>Export the resulting decisions/recommendations for review by business\nstakeholders, either back into the ERP system or into some other data warehouse</li>\n</ul>\n</li>\n</ol>\n<p>Another big challenge is to organize <strong>collaboration and data/model sharing</strong>\ninside and across the boundaries of teams of Data Scientists and Software\nEngineers.</p>\n<p>DevOps skills and effective instruments will certainly benefit\nindustrial Data Scientists, who can then address the above-mentioned challenges in\na self-service manner.</p>\n<h2>Can DVC Be a Solution?</h2>\n<p><a href=\"https://dvc.org\">Data Version Control</a>, or simply DVC, comes onto the scene\nwhenever you start looking for effective DevOps-for-Analytics instruments.</p>\n<p>DVC is an open source tool for data science projects. 
It makes your data science\nprojects reproducible by automatically building a data dependency graph (DAG).\nYour code and its dependencies can be easily shared through Git, and your data through\ncloud storage (AWS S3, GCP), all in a single DVC environment.</p>\n<blockquote>\n<p>Although DVC was\n<a href=\"https://dvc.org/doc/understanding-dvc/what-is-dvc\">originally</a> created for machine learning developers and data scientists, it has turned out\nto be useful beyond that audience. Since it brings proven engineering practices to the not\nwell-defined ML process, I found it to have enormous potential as an\nAnalytical DevOps instrument.</p>\n</blockquote>\n<p>It clearly helps to manage a large share of the DevOps issues in a Data\nScientist’s daily routine:</p>\n<ol>\n<li><strong>Pull all kinds of necessary data from a variety of sources</strong>. Once you\nconfigure and script your data extraction jobs with DVC, they will be\npersistent and operable across your data and service infrastructure.</li>\n<li><strong>Extract, transform, and load the data</strong>. ETL is going to be easy and\nrepeatable once you configure it with DVC scripting. It will become a solid\npipeline that operates without major support effort. Moreover, DVC will track\nall changes and, via the DAG, flag which pipeline steps need updating.</li>\n<li><strong>Facilitate continuous machine learning and decision-making.</strong> Part of\nthe pipeline scripted with DVC can be jobs that upload data\nback to any transactional system (such as ERP, ERM, CRM, etc.), warehouse, or data\nmart. The data will then be exposed to business stakeholders so they can make intelligent\ndata-driven decisions.</li>\n<li><strong>Share your algorithms and data</strong>. Machine Learning modeling is an iterative\nprocess, and it is extremely important to keep track of your steps,\nthe dependencies between them, the dependencies between your code and data files,\nand all the arguments your code is run with. 
This becomes even more important and\ncomplicated in a team environment where data scientists’ collaboration takes\na serious amount of the team’s effort. DVC is the tool to help you with\nit.</li>\n</ol>\n<p>One of the ‘juicy’ features of DVC is its ability to support multiple technology\nstacks. Whether you prefer R or use promising Python-based implementations for\nyour industrial data products, DVC will be able to support your pipeline\nproperly. You can see it in action for both\n<a href=\"https://blog.dvc.org/how-a-data-scientist-can-improve-his-productivity\">Python-based</a>\nand\n<a href=\"https://blog.dvc.org/r-code-and-reproducible-model-development-with-dvc\">R-based</a>\ntechnology stacks.</p>\n<p>As such, DVC is going to be one of the tools you will enjoy using if/when you\nembark on building a continual analytical environment for your system or across\nyour organization.</p>\n<h2>Continual Analytical Environment and DevOps</h2>\n<p>Building a production pipeline is quite different from building a\nmachine-learning prototype on a local laptop. Many teams and companies face\nchallenges there.</p>\n<p>At a bare minimum, the following requirements must be met when you move your\nsolution into production:</p>\n<ol>\n<li>Periodic re-training of the models/algorithms</li>\n<li>Ease of re-deployment and configuration changes in the system</li>\n<li>Efficient, high-performance real-time scoring of new out-of-sample\nobservations</li>\n<li>Ability to monitor model performance over time</li>\n<li>Adaptive ETL and ability to manage new data feeds and transactional systems\nas data sources for AI and machine learning tools</li>\n<li>Scaling to really big data operations</li>\n<li>Security and authorized access levels to different areas of the analytical\nsystems</li>\n<li>Solid backup and recovery processes/tools</li>\n</ol>\n<p>This goes into the territory traditionally inhabited by DevOps. 
Data Scientists\nshould ideally learn to handle part of those requirements themselves or at\nleast be informed consultants to classical DevOps gurus.</p>\n<p>DVC can help in many aspects of the production scenario above as it can\norchestrate relevant tools and instruments through its scripting. In such a\nsetup, DVC scripts will be a sharable manifestation (and implementation) of your\nproduction pipeline where each step can be transparently reviewed, easily\nmaintained, and changed as needed over time.</p>\n<h2>Will DevOps Be Captivating?</h2>\n<p>If you are interested in further understanding the ever-growing role of\nDevOps in modern Data Science and predictive analytics in business, there\nare good resources for your review below:</p>\n<ol>\n<li><a href=\"https://www.forbes.com/sites/teradata/2016/11/14/devops-for-data-science-why-analytics-ops-is-key-to-value/\">DevOps For Data Science: Why Analytics Ops Is Key To Value</a>\n(Forbes, Nov 14, 2016)</li>\n<li><a href=\"https://www.packtpub.com/books/content/bridging-gap-between-data-science-and-devops\">Bridging the Gap Between Data Science and DevOps</a></li>\n<li><a href=\"https://devops.com/devops-life-better-data-scientists/\">Is DevOps Making Life Better for Data Scientists?</a></li>\n</ol>\n<p>By any means, DVC is going to be a useful instrument to fill the multiple gaps\nbetween classical, old-school in-lab data science practices and the growing\ndemands of business to build solid DevOps processes and workflows that streamline\nmature and persistent data analytics.</p>","timeToRead":5,"fields":{"slug":"/data-version-control-in-analytics-devops-paradigm"},"frontmatter":{"title":"Data Version Control in Analytics DevOps Paradigm","date":"July 27, 2017","description":"Why DevOps matters in data science, what specific challenges data scientists\nface in their day-to-day work, and how we can set up a better environment for the\nteam.\n","descriptionLong":"The eternal dream of almost every Data Scientist 
today is to spend all the\ntime exploring new datasets, engineering new features, inventing and\nvalidating cool new algorithms and strategies. However, daily routines of a\nData Scientist include raw data pre-processing, dealing with infrastructure,\nbringing models to production. That's where good DevOps practices and skills\nare essential and will certainly be beneficial for industrial Data Scientists\nas they can address the above-mentioned challenges in a self-service manner.\n","tags":["DevOps","DVC"],"commentsUrl":"https://discuss.dvc.org/t/data-version-control-in-analytics-devops-paradigm/297","author":{"childMarkdownRemark":{"frontmatter":{"name":"George Vyshnya","avatar":{"childImageSharp":{"fixed":{"base64":"data:image/jpeg;base64,/9j/2wBDABALDA4MChAODQ4SERATGCgaGBYWGDEjJR0oOjM9PDkzODdASFxOQERXRTc4UG1RV19iZ2hnPk1xeXBkeFxlZ2P/2wBDARESEhgVGC8aGi9jQjhCY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2P/wgARCAAUABQDASIAAhEBAxEB/8QAGQABAAMBAQAAAAAAAAAAAAAAAAIEBQMG/8QAFwEAAwEAAAAAAAAAAAAAAAAAAQIDAP/aAAwDAQACEAMQAAABjC9VJ6sc87GoTp50Zf/EABoQAAIDAQEAAAAAAAAAAAAAAAECAAMREhP/2gAIAQEAAQUCYPpFlZCkznFsI8+zO9tcZTs//8QAGREAAgMBAAAAAAAAAAAAAAAAAAECEBJB/9oACAEDAQE/AVBZMnK//8QAGhEAAgIDAAAAAAAAAAAAAAAAAAECEBESQf/aAAgBAgEBPwFzeTdna//EAB0QAAICAgMBAAAAAAAAAAAAAAABAhESIRAxQXH/2gAIAQEABj8CUYd1Ys3aZpWZdfCT2abIQpJZeE37XH//xAAaEAEBAQEBAQEAAAAAAAAAAAABEQBhQSEx/9oACAEBAAE/IWkCqc0EET55i645mSq94yEldwZBnHXOzwji/RDlw5v/2gAMAwEAAgADAAAAEKAoQ//EABcRAQEBAQAAAAAAAAAAAAAAAAABESH/2gAIAQMBAT8QuMpemP/EABcRAAMBAAAAAAAAAAAAAAAAAAABESH/2gAIAQIBAT8Q16O2DFZ//8QAHBABAQADAAMBAAAAAAAAAAAAAREAITFBUXGx/9oACAEBAAE/ELAwC8WBvziOXvYnEgcxZl9ZsjGNVeg/cVTAJDHyT5heJcYLgaB20Sy1uXG0xWawIKa1wz//2Q==","width":40,"height":40,"src":"/static/226395429650b032ac92f5ecf1410e9b/d83e5/george_vyshnya.jpg","srcSet":"/static/226395429650b032ac92f5ecf1410e9b/d83e5/george_vyshnya.jpg 1x,\n/static/226395429650b032ac92f5ecf1410e9b/58860/george_vyshnya.jpg 
1.5x,\n/static/226395429650b032ac92f5ecf1410e9b/90ac5/george_vyshnya.jpg 2x","srcWebp":"/static/226395429650b032ac92f5ecf1410e9b/e145b/george_vyshnya.webp","srcSetWebp":"/static/226395429650b032ac92f5ecf1410e9b/e145b/george_vyshnya.webp 1x,\n/static/226395429650b032ac92f5ecf1410e9b/0d42c/george_vyshnya.webp 1.5x,\n/static/226395429650b032ac92f5ecf1410e9b/f46db/george_vyshnya.webp 2x"}}}}}},"picture":{"childImageSharp":{"fluid":{"base64":"data:image/jpeg;base64,/9j/2wBDABALDA4MChAODQ4SERATGCgaGBYWGDEjJR0oOjM9PDkzODdASFxOQERXRTc4UG1RV19iZ2hnPk1xeXBkeFxlZ2P/2wBDARESEhgVGC8aGi9jQjhCY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2P/wgARCAANABQDASIAAhEBAxEB/8QAGQAAAgMBAAAAAAAAAAAAAAAAAAIBAwQF/8QAFQEBAQAAAAAAAAAAAAAAAAAABAP/2gAMAwEAAhADEAAAAeXteGDoGIo//8QAHBAAAQQDAQAAAAAAAAAAAAAAAQACERIDExQh/9oACAEBAAEFAmtlRRHN6RURsZzhf//EABgRAQEAAwAAAAAAAAAAAAAAAAEAAxEU/9oACAEDAQE/ARcrq5r/xAAZEQACAwEAAAAAAAAAAAAAAAAAAQIRFCL/2gAIAQIBAT8BfKs0xP/EABgQAAMBAQAAAAAAAAAAAAAAAAABMSAh/9oACAEBAAY/AsLsRWf/xAAcEAEAAgEFAAAAAAAAAAAAAAABABEhEDFBcZH/2gAIAQEAAT8hfieQIFKuVYQMBL2UbFd6d//aAAwDAQACAAMAAAAQhM//xAAYEQACAwAAAAAAAAAAAAAAAAAAATFBYf/aAAgBAwEBPxDQYEqo/8QAGBEAAgMAAAAAAAAAAAAAAAAAAAERMWH/2gAIAQIBAT8QdTKRqf/EAB0QAQACAQUBAAAAAAAAAAAAAAEAESFBUXGRodH/2gAIAQEAAT8QOFxdpW3Cxqh4goLE2KRBHs4NRLS3z+T/2Q==","aspectRatio":1.5005861664712778,"src":"/static/9def39df2ac1ace1d24a5a1a08bc9462/6fdf8/post-image.jpg","srcSet":"/static/9def39df2ac1ace1d24a5a1a08bc9462/9fc73/post-image.jpg 213w,\n/static/9def39df2ac1ace1d24a5a1a08bc9462/ee221/post-image.jpg 425w,\n/static/9def39df2ac1ace1d24a5a1a08bc9462/6fdf8/post-image.jpg 850w,\n/static/9def39df2ac1ace1d24a5a1a08bc9462/88a70/post-image.jpg 1275w,\n/static/9def39df2ac1ace1d24a5a1a08bc9462/15ae8/post-image.jpg 1700w,\n/static/9def39df2ac1ace1d24a5a1a08bc9462/770cf/post-image.jpg 2560w","srcWebp":"/static/9def39df2ac1ace1d24a5a1a08bc9462/5c1d9/post-image.webp","srcSetWebp":"/static/9def39df2ac1ace1d24a5a1a08bc9462/99b2d/post-image.webp 
213w,\n/static/9def39df2ac1ace1d24a5a1a08bc9462/23220/post-image.webp 425w,\n/static/9def39df2ac1ace1d24a5a1a08bc9462/5c1d9/post-image.webp 850w,\n/static/9def39df2ac1ace1d24a5a1a08bc9462/5e720/post-image.webp 1275w,\n/static/9def39df2ac1ace1d24a5a1a08bc9462/35cfd/post-image.webp 1700w,\n/static/9def39df2ac1ace1d24a5a1a08bc9462/ab8d3/post-image.webp 2560w","sizes":"(max-width: 850px) 100vw, 850px","presentationWidth":850}}},"pictureComment":null}}},"pageContext":{"next":{"fields":{"slug":"/ml-model-ensembling-with-fast-iterations"},"frontmatter":{"title":"ML Model Ensembling with Fast Iterations"}},"previous":{"fields":{"slug":"/r-code-and-reproducible-model-development-with-dvc"},"frontmatter":{"title":"R code and reproducible model development with DVC"}},"currentPage":18,"slug":"/data-version-control-in-analytics-devops-paradigm"}}}