Automating machine learning workflow with DVC

What data scientists and ML engineers want to do while software engineers are busy with CI/CD.

Hongjoo Lee

Big Data, Data, Data Science, Deployment/Continuous Integration and Delivery

Just as software engineers set up a CI/CD process as soon as they start a new project, data scientists and ML engineers define a pipeline for data as it flows through a typical workflow. Each step of the pipeline is fed data processed by the preceding step, much as a CI/CD process is triggered by code changes.

The phrase "pipelining an ML project" can be misleading: it suggests a large project with a team of engineers working on large systems, so it is often considered too hard for an individual and unnecessary for a small project. Regardless of project size, a well-organized pipeline is essential to the success of any ML project, and with the proper tool it is easy to build.

In this talk, we will go through a machine learning workflow divided into a few steps that compose an ML pipeline, from data ingestion to model deployment. Each step depends on data produced by the previous step, all of which is controlled by DVC. DVC is an open-source version control system for data scientists and ML engineers that helps them organize data, models, and experiments in ML projects. The presentation will not only introduce how to use the tool but also show, with examples, how to organize an ML pipeline.
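
To make the idea of steps that depend on the previous step's data concrete, here is a minimal Python sketch. It is not DVC's implementation or API; it only illustrates the general technique of recording a content hash for each step's inputs and rerunning a step only when those inputs change. The function names (`file_hash`, `run_stage`, `run_pipeline`), the file names, and the toy "model" (a mean of numbers) are all hypothetical and chosen for illustration.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_hash(path):
    """Return the MD5 hex digest of a file's contents."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()


def run_stage(name, deps, outs, func, lock_path):
    """Run func only if a dependency changed or an output is missing.

    Returns True if the stage was executed, False if it was skipped.
    """
    lock = json.loads(lock_path.read_text()) if lock_path.exists() else {}
    current = {str(dep): file_hash(dep) for dep in deps}
    outputs_exist = all(out.exists() for out in outs)
    if lock.get(name) == current and outputs_exist:
        return False  # inputs unchanged and outputs present: skip
    func()
    lock[name] = current  # remember the input hashes of this run
    lock_path.write_text(json.dumps(lock, indent=2))
    return True


def run_pipeline(workdir):
    """Run a hypothetical two-stage pipeline; return the stages that executed."""
    raw = workdir / "raw.csv"
    clean = workdir / "clean.csv"
    model = workdir / "model.json"
    lock = workdir / "pipeline.lock"

    def prepare():
        # Data preparation: drop blank lines from the raw data.
        rows = [row for row in raw.read_text().splitlines() if row.strip()]
        clean.write_text("\n".join(rows))

    def train():
        # Toy "training": the model is just the mean of the cleaned values.
        values = [float(v) for v in clean.read_text().splitlines()]
        model.write_text(json.dumps({"mean": sum(values) / len(values)}))

    # Each stage consumes the output of the stage before it.
    stages = [
        ("prepare", [raw], [clean], prepare),
        ("train", [clean], [model], train),
    ]
    return [name for name, deps, outs, func in stages
            if run_stage(name, deps, outs, func, lock)]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        (workdir / "raw.csv").write_text("1\n2\n\n3\n")
        print(run_pipeline(workdir))  # first run: both stages execute
        print(run_pipeline(workdir))  # nothing changed: both stages skipped
        (workdir / "raw.csv").write_text("1\n2\n\n6\n")
        print(run_pipeline(workdir))  # raw data changed: both stages rerun
```

Because the check is on content rather than timestamps, a change to the raw data that leaves the cleaned data identical reruns only the first stage. A real tool additionally versions the data and models themselves, which this sketch does not attempt.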

The goal of this talk is to motivate data scientists and ML engineers to start building machine learning pipelines with DVC. The audience can expect a guide to using DVC to automate a pipeline. I will also explain the machine learning concepts and techniques needed to understand the pipeline.

This session is designed to be accessible to everyone, including beginners. An understanding of the basic concepts of machine learning and of version control systems (preferably Git) is helpful but not required.

Type: Talk (30 mins); Python level: Beginner; Domain level: Intermediate


Hongjoo Lee

SK Hynix

Hongjoo is a software engineer and machine learning engineer from Seoul, Korea. He recently developed a fraud detection system based on machine learning techniques for a fintech startup. He began his career in industry as a software development engineer in 2000. He holds an MPhil degree in computer science and engineering from HKUST.