Mastering a data pipeline with Python: 6 years of learned lessons from mistakes

Robson Junior

Beginners, Big Data, Case Study, Data Science, Open-Source


Building data pipelines is a well-established task: a vast number of tools automate the work and help developers create data pipelines with a few clicks in the cloud. Those tools may solve simple or well-defined standard problems. This presentation distills years of experience and painful mistakes using Python as the core for building reliable data pipelines and managing an insane amount of valuable data. We'll cover how each piece fits into this puzzle: data acquisition, ingestion, transformation, storage, workflow management and serving. We'll also walk through best practices and possible issues, and compare PySpark with Dask and Pandas, look at Airflow, and introduce Apache Arrow as a new approach.

Type: Talk (45 mins); Python level: Beginner; Domain level: Beginner


Robson Junior

Microsoft

Robson is a developer deeply involved with software communities, especially the Python community. He has been organizing conferences and meetups since 2011, speaking at conferences about Python and cloud technologies since 2012, and about data-related technologies since 2016. As an independent consultant, he also conducts on-demand architecture consultancy and training sessions on data-related technologies.