ASYNC / Concurrency, Best Practice, Big Data, Distributed Systems, Scientific Libraries (Numpy/Pandas/SciKit/...)
This talk will provide practical insight into how to build scalable data streaming machine learning pipelines that process large datasets in real time using Python Asyncio, Kafka, Faust, SpaCy and Seldon.
We will cover a case study on automated content moderation of Reddit comments in real time. Our dataset consists of 200,000 Reddit comments from /r/science, 50,000 of which have been removed by moderators. The stream data will be handled in a Kafka cluster, and the stream processing will be done with the Faust library. We will run the end-to-end pipeline in Kubernetes, with various components leveraging SKLearn, SpaCy and Seldon.
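To make the shape of such a pipeline concrete, here is a minimal sketch using only the Python standard library. It is not the talk's implementation: an `asyncio.Queue` stands in for a Kafka topic, the `moderate` coroutine plays the role of a stream processing agent, and `should_remove` with its `FLAGGED_WORDS` set is a hypothetical placeholder for a trained model.

```python
import asyncio

# Hypothetical placeholder for a trained model: a tiny set of trigger words.
FLAGGED_WORDS = {"spam", "scam"}


def should_remove(comment: str) -> bool:
    """Stand-in classifier; a real pipeline would call a trained model here."""
    return any(word in FLAGGED_WORDS for word in comment.lower().split())


async def produce(queue: asyncio.Queue, comments: list) -> None:
    """Push comments onto the queue (the queue stands in for a Kafka topic)."""
    for comment in comments:
        await queue.put(comment)
    await queue.put(None)  # sentinel: no more comments


async def moderate(queue: asyncio.Queue, results: list) -> None:
    """Consume comments one by one and record a moderation decision for each."""
    while True:
        comment = await queue.get()
        if comment is None:
            break
        results.append((comment, should_remove(comment)))


async def run_pipeline(comments: list) -> list:
    queue = asyncio.Queue(maxsize=100)  # bounded queue gives simple backpressure
    results = []
    await asyncio.gather(produce(queue, comments), moderate(queue, results))
    return results


results = asyncio.run(run_pipeline(["Great study, thanks", "Buy now, total scam"]))
```

In a deployment along the lines described above, the queue would be replaced by a Kafka topic and the placeholder classifier by a model served separately.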
We will then dive into fundamental stream processing concepts such as windows, watermarking and checkpointing, and show how to use each of these frameworks to build complex data streaming pipelines that perform real-time processing at scale.
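As a rough illustration of windows and watermarking, the following library-free sketch counts events in tumbling windows and closes a window once the watermark passes its end. The names `WINDOW_SIZE`, `ALLOWED_LATENESS` and `process` are assumptions made for this example and do not belong to any of the frameworks mentioned. The dictionary of still-open windows is the kind of state a checkpoint would persist so that processing can resume after a restart.

```python
from collections import defaultdict

WINDOW_SIZE = 10      # seconds per tumbling window (illustrative value)
ALLOWED_LATENESS = 5  # the watermark trails the newest event time by this much


def window_start(timestamp: int) -> int:
    """Map an event timestamp to the start of its tumbling window."""
    return timestamp - timestamp % WINDOW_SIZE


def process(event_times):
    """Count events per window; emit a window once the watermark passes its end."""
    open_windows = defaultdict(int)  # window start -> event count so far
    emitted = []                     # (window start, final count), in emit order
    dropped = 0                      # events that arrived after their window closed
    watermark = float("-inf")

    for ts in event_times:
        # The watermark only moves forward, even when events arrive out of order.
        watermark = max(watermark, ts - ALLOWED_LATENESS)
        start = window_start(ts)
        if start + WINDOW_SIZE <= watermark:
            dropped += 1  # the window is already closed: the event is too late
        else:
            open_windows[start] += 1
        # Close every window whose end is at or before the watermark.
        for s in sorted(open_windows):
            if s + WINDOW_SIZE <= watermark:
                emitted.append((s, open_windows.pop(s)))

    return emitted, dict(open_windows), dropped


emitted, still_open, dropped = process([1, 4, 12, 8, 17, 3, 26])
```

With this input, the event at time 8 arrives out of order but is still counted because its window has not yet closed, while the event at time 3 arrives after the watermark has passed the end of the first window and is dropped.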
Finally, we will show best practices when using these frameworks, as well as a high-level overview of tools that can be used for monitoring, including Grafana and Kafka Manager.
Type: Talk (30 mins); Python level: Intermediate; Domain level: Intermediate
Alejandro is the Chief Scientist at the Institute for Ethical AI & Machine Learning, where he leads the development of industry standards on machine learning bias, adversarial attacks and differential privacy. Alejandro is also the Director of Machine Learning Engineering at Seldon Technologies, where he leads large-scale projects implementing open source and enterprise infrastructure for Machine Learning Orchestration and Explainability. With over 10 years of software development experience, Alejandro has held technical leadership positions across hyper-growth scale-ups and has delivered multi-national projects with top-tier investment banks, magic circle law firms and global insurance companies. He has a strong track record of building cross-functional departments of software engineers from scratch, and of leading the delivery of large-scale machine learning systems across the financial, insurance, legal, transport, manufacturing and construction sectors (in Europe, the US and Latin America).