Docker and Python: making them play nicely and securely for Data Science and ML

Tania Allard

Conda / conda forge, Data Science, Deep Learning, Machine-Learning, Scientific Libraries (Numpy/Pandas/SciKit/...)


Docker has become a standard tool for developers around the world to deploy applications in a reproducible and robust manner. Docker and Docker Compose have reduced the time needed to set up new software and implement complex technology stacks for our applications. Now, six years after the initial release of Docker, we can say with confidence that containers and container orchestration have become defaults in current technology stacks.

There are thousands of tutorials and getting-started documents for those wanting to adopt Docker for app deployment. However, if you are a Data Scientist, a researcher or someone working on scientific computing wanting to adopt Docker, the story is quite different. There are very few tutorials and documents (in comparison to app/web) focused on Docker best practices for DS and scientific computing. If you are working on DS, ML or scientific computing, this talk is for you. We'll cover best practices for building Docker containers for data-intensive applications, from optimising your image build to securing your containers and setting up efficient deployment workflows. We will talk about the most common problems faced while using Docker with data-intensive applications and how you can overcome most of them. Finally, I'll give some practical tips to help you improve your Docker workflows and practices.

Attendees will leave the talk feeling confident about adopting Docker across a range of DS, ML and research projects.

Who and Why (audience)
This talk is designed for folks working in data-intensive environments (e.g. Machine Learning, Data Science, research and scientific computing) who are either using Docker or want to learn more about how to use Docker in these environments. Attendees will leave the talk feeling confident about adopting Docker in their workflows, having acquired several best practices and guidelines to do this robustly.

Introduction (5 minutes)
  About me
  When is Docker the right choice?
  Docker for all Python users: introduction to Docker in Machine Learning (ML), Data Science (DS) and research contexts
  The usual culprits

Optimising for data-oriented applications (10 minutes)
  Creating a data-oriented Docker image - how is this different from an app/web image?
  Choosing the right base image - set yourself up for success
  Dependencies, volumes and code best practices

Security and performance (10 minutes)
  Finding vulnerabilities in your images
  Image consistency and reproducibility
  Optimising image building - cache and image size considerations

Do not reinvent the wheel - automate! (10 minutes)
  Consider tools to assist with Dockerfile generation - e.g. repo2docker, dockta
  Creating templates for projects
  Automating image build and publishing - e.g. GitHub Actions
  Automated deployment strategies - going from local to deploying your containerised application

Conclusions (5 minutes)
  Top 10 best practices when working with Docker and Python for DS/ML and research
  Additional resources
  Thanks and getting in touch

Type: Talk (45 mins); Python level: Intermediate; Domain level: Beginner


Tania Allard

Microsoft

Tania is a Research Engineer and developer advocate with vast experience in academic research and industrial environments. Her main areas of expertise are data-intensive applications, scientific computing, and machine learning. She also specialises in improving processes, reproducibility and transparency in research, data science and artificial intelligence.
She is passionate about mentoring, open source, and its community and is involved in a number of initiatives aimed at building more diverse and inclusive communities. She is also a contributor, maintainer, and developer of a number of open source projects and the Founder of PyLadies NorthWest UK.