Writing and Scaling Collaborative Data Pipelines with Kedro

How to get your Data Scientists and Data Engineers to play nice, both now and in the future.

Tam-Sanh Nguyen

Best Practice Data Development Open-Source python

The goal of this talk is to introduce data pipeline developers to QuantumBlack's approach to keeping data pipelines healthy and sustainable, and to facilitating collaboration between data scientists and data engineers, using our open-source framework, Kedro. To appreciate this talk, attendees need novice-to-intermediate knowledge of Python (enough to understand syntactic sugar and funargs).

As data continues to inform more and more business strategy, high-quality, fully featured data pipelines have never been more critical. Small data scripts and single-coder science projects are not enough to keep up with the pace of day-to-day business and its ever-growing list of requirements. Now, more than ever, we need data engineers and data scientists to collaborate effectively. Yet these two parties come with inherently competing needs.

Data scientists need high data volatility and parameterization for experimentation, while data engineers need stability and performance to deliver data. Furthermore, as pipelines grow, the cost of knowledge transfer and of training new team members also increases. How can we get scientists and engineers to work well together, and sustain pipeline growth as the team also grows?

For this, QuantumBlack created Kedro, a framework for writing data pipelines whose features and patterns of use address the need for both flexibility and stability. By using Kedro’s tools and operating model, we have enabled our teams to scale from single-developer micro-pipes to industrial-sized data processors with dozens of developers, all without sacrificing readability, quality, or stability. This talk will show you how.

Type: Talk (45 mins); Python level: Intermediate; Domain level: Intermediate

Tam-Sanh Nguyen

McKinsey / QuantumBlack

Tam-Sanh Nguyen has worked in data engineering for the majority of his career, across a variety of industries, all around the world.

Data pipelining is one of his passions in life, and the only thing that pleases him more than writing good data pipelines is helping others write them. He has been working on data pipelines since Spark 0.6.2, while at Palantir, and has since enjoyed wearing a few different hats at startups all over the world.

He is currently hired out of Shanghai, as part of the McKinsey Digital team for Digital Transformations, and is in the midst of moving to Singapore to join their growing QuantumBlack team.

In his downtime, you can find him practicing and teaching yoga, enjoying foreign language exchanges, and hosting a YouTube channel dedicated to writing data pipelines: http://youtube.com/c/DataEngineerOne