Practical Optimisations for Pandas¶

Eyal Trabelsi¶

About Me 🙈¶

Software Engineer at Salesforce 👷

Big passion for python, data and performance optimisations 🐍🤖

Online at medium | twitter 🌐

What's Pandas?🐼¶

Library for data manipulation

Dataset on Memory

Widely used

Performance?! Why ?🤨¶

Fast is better than slow 🐇

Memory efficiency is good 💾

Saving money is awesome 💸

Hardware will only take you so far 💻

When ⏰¶

Program doesn't meet requirements 🚔

Program execution affects development pace 👷

Only bottlenecks

Well Tested

Dataset 📉¶

In [9]:

df = load_dataset()
df.head()

Out[9]:

	order_id	date	date_of_meal	participants	meal_price	type_of_meal	heroes_adjustment	meal_tip
0	839FKFW2LLX4LMBB	2016-05-27	2016-05-31 07:00:00+02:00	['David Bishop']	469	Breakfast	False	93.8
1	97OX39BGVMHODLJM	2018-09-27	2018-10-01 20:00:00+02:00	['David Bishop']	22	Dinner	False	4.4
2	041ORQM5OIHTIU6L	2014-08-24	2014-08-23 14:00:00+02:00	['Karen Stansell']	314	Lunch	False	62.8
3	YT796QI18WNGZ7ZJ	2014-04-12	2014-04-07 21:00:00+02:00	['Addie Patino']	438	Dinner	False	87.6
4	6YLROQT27B6HRF4E	2015-07-28	2015-07-27 14:00:00+02:00	['Addie Patino' 'Susan Guerrero']	690	Lunch	False	138.0

How 👀¶

Use What You Need 🧑¶

Keep needed columns only

Keep needed rows only

Dont Reinvent the Wheel 🎡¶

Vast ecosystem

Use existing solutions

Less bugs

Highly optimized

Avoid Loops ♾¶

Bad Option 😈¶

In [1]:

def iterrows_original_meal_price(df):
    for i, row in df.iterrows():
        df.loc[i]["original_meal_price"] = row["meal_price_with_tip"] - row["meal_tip"]
    return df

In [5]:

%%timeit -r 1 -n 1
iterrows_original_meal_price(df)

35min 13s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

Better Option 🤵¶

In [4]:

def apply_original_meal_price(df):
    df["original_meal_price"] = df.apply(lambda x: x['meal_price_with_tip'] - x['meal_tip'], axis=1)
    return df

In [5]:

%%timeit 
apply_original_meal_price(df)

22.5 s ± 170 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

100x Improvement In Execution Time ⌛¶

Iterrow is evil 😈¶

Best Option 👼¶

In [65]:

def vectorized_original_meal_price(df):
    df["original_meal_price"] = df["meal_price_with_tip"] - df["meal_tip"] 
    return df

In [66]:

%%timeit 
vectorized_original_meal_price(df)

2.46 ms ± 18.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Another 8000x Improvement In Execution Time ⌛¶

Use Vectorized Operations 😇¶

Picking the Right Type 🌈¶

Motivation 🏆¶

In [67]:

ones = np.ones(shape=5000)
ones

Out[67]:

array([1., 1., 1., ..., 1., 1., 1.])

In [84]:

types = ['object', 'complex128', 'float64', 'int64', 'int32', 'int16', 'int8', 'bool']
df = pd.DataFrame(dict([(t, ones.astype(t)) for t in types]))
df.memory_usage(index=False, deep=True)

Out[84]:

object        160000
complex128     80000
float64        40000
int64          40000
int32          20000
int16          10000
int8            5000
bool            5000
dtype: int64

Where We Stand 🌈¶

Dataframe Size 🏋️¶

In [70]:

df.memory_usage(deep=True).sum()

Out[70]:

478844140

Columns Sizes 🏋️¶

In [71]:

df.memory_usage(deep=True)

Out[71]:

Index                 8002720
order_id             73024820
date                 67022780
date_of_meal         82027880
participants         84977580
meal_price           36012240
type_of_meal         63688760
heroes_adjustment    32076480
meal_tip             32010880
dtype: int64

Supported Types 🌈¶

int64

float64

bool

objects

datetime64

timedelta

Category

Sparse Types

Nullable Integer/Nullable Bolean

Optimizing Types 🌈¶

Loading dataframes with specific types

Use to_numeric/to_datetime/to_timedelta functions with downcast parameter

In [11]:

df = df.astype({'order_id': 'category',
                'date': 'category',
                'date_of_meal': 'category',
                'participants': 'category',
                'meal_price': 'int16',
                'type_of_meal': 'category',
                'heroes_adjustment': 'bool',
                'meal_tip': 'float32'})

Optimized Types 🌈¶

Dataframe Size 🏋️¶

In [75]:

df.memory_usage(deep=True).sum()

Out[75]:

36999962

Columns Sizes 🏋️¶

In [76]:

df.memory_usage(deep=True)

Out[76]:

Index                8002720
order_id             8963321
date                 2204942
date_of_meal         3942538
participants         5883450
meal_price           2000680
type_of_meal         1000611
heroes_adjustment    1000340
meal_tip             4001360
dtype: int64

12x Improvement In Memory ⌛¶

Custom Types 🦸🏼🦸‍♀️¶

Your Own Types

Open Sourced Types like cyberpandas and geopandas

Pandas Usage 🐼¶

Chunks 🍰¶

Splitting large data to smaller parts

Work with large datasets

In [53]:

def proccess_file(huge_file_path, chunksize = 10 ** 6):
    for chunk in pd.read_csv(path, chunksize=chunksize):
        process(chunk)

Mean/Sum/Mode/Min/etc optimization 🧮¶

Types matter

In [32]:

%%timeit
df["meal_price_with_tip"].astype(object).mean()

96 ms ± 499 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [33]:

%%timeit
df["meal_price_with_tip"].astype(float).mean()

4.27 ms ± 34.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

20x Performance Improvement⌛¶

DataFrame Serialization 🏋¶

Various file formats

Loading time

Saving time

Disk space

Query/Eval 🧬¶

Use numexpr if installed

Improve Execution Time 👍

Improve Memory 👍

Not all Operations are supported 👎

Example¶

In [28]:

%%timeit
df[df.type_of_meal=="Breakfast"]

103 ms ± 348 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [29]:

%%timeit
df.query("type_of_meal=='Breakfast'")

82.4 ms ± 223 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

% 20 Performance Improvement⌛¶

Use on big datasets 🧞¶

Concat vs Append ➕¶

Every append creates a new dataframe object

Multiple appends are inefficient

Use concat

Sorting Optimization 📟¶

Vanilla python

Pandas

Numpy

Pytorch/Tensorflow

GroupBy Optimizations 👩‍👩‍👧¶

Filter early

Custom functions are slow

Extract logic for custom functions when possible

Merge Optimization 🔍¶

Filter/Aggregate early

Semi join

Join on index

Compiled Code ?! Why 🤯¶

Python dynamic nature

No compilation optimization

Pure Python can be slow

In [17]:

def foo(N):
    accumulator = 0
    for i in range(N):
        accumulator = accumulator + i
    return accumulator

In [18]:

%%timeit
df.meal_price_with_tip.map(foo)

17.9 s ± 25.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Cython and Numba for the rescue 👨‍🚒¶

Cython 🤯¶

Up to 50x speedup from pure python 👍

Learning Curve 👎

Separated Compilation Step 👎 👍

Example¶

In [20]:

%%cython
def cython_foo(long N):
    cdef long accumulator
    accumulator = 0

    cdef long i
    for i in range(N):
        accumulator += i

    return accumulator

In [21]:

%%timeit
df.meal_price_with_tip.map(cython_foo)

365 ms ± 2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

49x Performance Improvement⌛¶

Numba 🤯¶

Up to 200x speedup from pure python 👍

Easy 👍

Highly Configurable 👍

Debugging 👎 👍

Mostly Numeric 👎

Example¶

In [22]:

@jit(nopython=True)
def numba_foo(N):
    accumulator = 0
    for i in range(N):
        accumulator = accumulator + i
    return accumulator

In [23]:

%%timeit
df.meal_price_with_tip.map(numba_foo)

414 ms ± 596 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

43x Performance Improvement⌛¶

1️⃣ Vectorized methods¶

2️⃣ Numba¶

3️⃣ Cython¶

General Python Optimizations 🐍¶

Caching 🏎¶

Avoid unnecessary work/computation.

Faster code

Intermediate Variables👩‍👩‍👧‍👧¶

Intermediate calculations

Memory foot print of both objects

Smarter variables allocation

Example¶

In [37]:

def load_data():
    return np.ones((2 ** 30), dtype=np.uint8)

In [50]:

%%memit
def proccess():
    data = load_data()
    return another_foo(foo(data))

proccess()

peak memory: 8106.62 MiB, increment: 3042.64 MiB

In [51]:

%%memit
def proccess():
    data = load_data()
    data = foo(data)
    data = another_foo(data)
    return data

proccess()

peak memory: 7102.64 MiB, increment: 2038.66 MiB

Concurrency And Parallelism 🎸🎺🎻🎷¶

pandas methods use single process

CPU bound can benefit parallelism

IO bound is very bad without parallesim/concurrency

High Performance Python Book 📖¶

Pandas Alternatives 🐨🐻¶

Libraries that provide dataframe apis

modin

cudf

pyspark

No free lunch 🥢

How 👀¶

Use What You Need 💾⌛

Dont Reinvent the Wheel ⌛💾

Avoid Loops ⌛

Picking the Right Types 💾⌛

Pandas Usage ⌛💾

Compiled Code ⌛

General Python Optimizations ⌛💾

Pandas Alternatives ⌛💾

Additional Resources 📚¶

Vectorization Mindset 📖
Numpy/Pandas/Scipy vectorizations methods 📖
Serialization Time 📖
Chunks 📖
Sorting📖
Caching 📖
Numba Workshop 🛠
Cython Workshop🛠
Numba VS C++ ⚖
Cython VS C ⚖
Intermediate Variables 📖
High Performance Python Book 📖
Pandas Alternatives 📖