Top 15 Python Tips for Data Cleaning/ Understanding

Hui Xiang Chua

Beginners, Data, Data Science, Scientific Libraries (Numpy/Pandas/SciKit/...), Use Case


Data cleaning is one of the most important tasks in data science, but it is unglamorous, underappreciated and under-discussed. Common data cleaning tasks include, but are not limited to:
- Merging/appending
- Checking completeness of data
- Checking of valid values
- De-duplication
- Handling of missing values
- Recoding
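
As a rough illustration of what the tasks listed above can look like in pandas, here is a minimal sketch. The tables, column names, the "no negative amounts" rule and the choice of median imputation are all hypothetical, made up for this example; they are not taken from the talk itself.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; every name and value here is invented for illustration.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "gender": ["M", "f", "F"],
})
orders_jan = pd.DataFrame({"customer_id": [1, 2], "amount": [10.0, np.nan]})
orders_feb = pd.DataFrame({"customer_id": [2, 2, 4], "amount": [25.0, 25.0, -5.0]})

# Appending: stack the two monthly tables on top of each other.
orders = pd.concat([orders_jan, orders_feb], ignore_index=True)

# De-duplication: drop rows that are exact repeats of an earlier row.
orders = orders.drop_duplicates().reset_index(drop=True)

# Merging: a left join keeps every order; indicator=True adds a "_merge"
# column that flags orders with no matching customer record.
merged = orders.merge(customers, on="customer_id", how="left", indicator=True)
unmatched = merged[merged["_merge"] == "left_only"]

# Checking completeness: count missing values per column.
missing_counts = merged.isna().sum()

# Checking valid values: assume (for this example) amounts must not be negative.
invalid_amounts = merged[merged["amount"] < 0]

# Handling missing values: fill with the median, one option among several.
merged["amount_filled"] = merged["amount"].fillna(merged["amount"].median())

# Recoding: normalise inconsistent gender codes into readable labels.
merged["gender_label"] = merged["gender"].str.upper().map({"M": "Male", "F": "Female"})
```

Whether to drop, flag or impute problematic rows depends on the dataset and the analysis, so the choices above should be read as placeholders rather than recommendations.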

Most, if not all, of the time, the datasets that we have to analyze are unclean, i.e. they are not necessarily complete, accurate or valid. If we do not clean them properly, the accuracy of our analysis will suffer.
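
To make the effect on accuracy concrete, the sketch below compares an average computed on raw data with the same average after cleaning. The survey data is hypothetical, and treating 999 as a placeholder code for "no answer" is an assumption made for this example only.

```python
import numpy as np
import pandas as pd

# Hypothetical survey responses: respondent 2 appears twice, and 999 is
# assumed to be a placeholder for "no answer" rather than a real age.
raw = pd.DataFrame({
    "respondent_id": [1, 2, 2, 3, 4],
    "age": [30, 40, 40, 999, 50],
})

# Mean age computed directly on the unclean data.
naive_mean = raw["age"].mean()

# Keep one row per respondent, then turn the placeholder into a missing value.
clean = raw.drop_duplicates(subset="respondent_id").copy()
clean["age"] = clean["age"].replace(999, np.nan)

# pandas skips missing values by default when computing the mean.
clean_mean = clean["age"].mean()

print(naive_mean, clean_mean)
```

In this toy case the raw mean is 231.8, while the cleaned mean is 40.0.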

This talk covers how to perform data cleaning and understanding, primarily using Pandas and NumPy. If you're new to data analytics/data science and are interested in how to use Python to perform analysis, or if you're an Excel user hoping to move to Python, this talk might be for you.

Participants should be at least familiar with the basics of Python programming.

Type: Talk (30 mins); Python level: Beginner; Domain level: Beginner


Hui Xiang Chua

Essence

Hui Xiang Chua is a Data Science for Social Good fellow and was an instructor with General Assembly prior to joining Essence as a Senior Analytics Manager. She has over six years of experience solving problems using data in the public service. Combining her passion for education, data, and tech, she was a recipient of the KDD Impact Program award for bringing data science into a high school curriculum. She is also the #VizforSocialGood local chapter leader for Singapore and runs a data science blog called Data Double Confirm, which was recognised among the 2018/2019 Top 100 Data Science Resources on MastersInDataScience.com. She holds a B.Sc. (Hons) in Statistics and an M.Sc. in Business Analytics from the National University of Singapore.