The Top 5 Pandas Optimization Methods You Should Know!

Akash Dugam
6 min read · Sep 11, 2022


Things get messy if we don’t employ memory optimization techniques when dealing with high-dimensional data. You don’t want a ‘Memory Error’ popping up on your screen, do you? Therefore, we need to be careful about how we use memory.

I have used the techniques below to optimize memory use and speed up the computations.

But wait, how do we find out the memory usage in the first place? This is simpler than you think: we can use info() or memory_usage(). The main difference between the two is that info() only shows a summary of memory usage, whereas memory_usage() gives a detailed, per-column view.
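Here is a quick sketch of both calls (the file name is just a placeholder):

import pandas as pd

df = pd.read_csv('data.csv')            # placeholder file name

df.info(memory_usage='deep')            # summary view with an overall memory estimate
print(df.memory_usage(deep=True))       # detailed view: bytes used by each column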

Now let’s talk about the methods that will help in optimizing memory usage.

  1. Use of inplace assignment
  2. Read what you require
  3. Change the data type of the column
  4. Parallelize your operations
  5. Read in chunks!

Let’s talk about each of these techniques in brief.

Use of inplace assignment

Most of the time we tend to use standard assignment in data manipulation: we manipulate the data and save the result in a separate copy of the dataframe, which is exactly what standard assignment means. Take a look at the example below.
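For instance, a sketch like this (dropna() here is just a stand-in for whatever manipulation you do):

import pandas as pd

df = pd.read_csv('data.csv')     # placeholder file name
df2 = df.dropna()                # standard assignment: df and df2 both sit in memory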

With standard assignment, two distinct dataframes live in the same environment, which increases memory use. Intuitively, df (the original dataframe) is of no use once we have decided to work with df2 (the manipulated dataframe). This problem is solved when you use the ‘inplace assignment’ technique instead of the standard one.

With inplace assignment, no extra copy of the dataframe is created, so memory is not wasted on a dataframe you no longer need.
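The same manipulation, sketched with inplace assignment:

df.dropna(inplace=True)          # modifies df directly, so no second dataframe is created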

Read what you require

As the name suggests, this means reading only the columns you need and abandoning the rest. Sometimes you have high-dimensional data but your interest lies in just a few (say 2 or 3) columns. In such cases, reading every irrelevant column costs far more memory than reading only the interesting ones.

Let me elaborate with an example. Say you have medical data for many patients and your job is to calculate BMI. The dataset is high-dimensional (think 3M rows and 300 columns), but you only need the patient’s name, height, and weight. That’s it. So instead of reading the entire dataset, it’s advisable to read only what you require.
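A sketch of that idea, assuming made-up file and column names:

import pandas as pd

cols = ['name', 'height_cm', 'weight_kg']                      # hypothetical column names
patients = pd.read_csv('medical_records.csv', usecols=cols)    # read only these 3 columns

# BMI = weight (kg) / height (m) squared
patients['bmi'] = patients['weight_kg'] / (patients['height_cm'] / 100) ** 2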

Change the data type of the column

We all know how pandas assigns data types: by default, it gives columns the largest-memory data type. But do we really need the highest-memory type every time? Let’s find out.

Inspecting numerical columns:

Let me be honest with you: you don’t need int64 or float64 every time. As a quick reference, int8 holds values from -128 to 127, int16 covers roughly ±32,000, and int32 covers roughly ±2.1 billion; only values beyond that actually need int64. The same idea applies to float16, float32, and float64, and this range logic covers all numerical columns.

As discussed, when you have an int or float column, pandas assigns int64 or float64 by default. We need to inspect the column, find its minimum and maximum values, and decide which range it actually needs. Once we know the range, we can specify the smallest data type that fits.
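One way to do this automatically is pandas’ downcast option; here is a sketch for a generic dataframe df:

import pandas as pd

# Downcast int64 / float64 columns to the smallest type that still fits their values
for col in df.select_dtypes(include='int').columns:
    df[col] = pd.to_numeric(df[col], downcast='integer')
for col in df.select_dtypes(include='float').columns:
    df[col] = pd.to_numeric(df[col], downcast='float')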

Applying the optimized data types brings a noticeable percentage reduction in memory for each column.

Inspecting Object columns:

Apart from numeric (int or float) columns, you will also see another important data type: ‘object’. It often represents categorical data, where a handful of unique values are repeated over and over again. We can convert the ‘object’ data type to ‘category’ with the astype() method.

I generally use the logic below to optimize categorical columns.

# Convert an object column to 'category' when fewer than 50% of its values are unique
for col in data.select_dtypes(include='object').columns:
    if data[col].nunique() / len(data[col]) < 0.5:
        data[col] = data[col].astype('category')

In the example above, the Name column has far more unique values than Cabin or Embarked. Per our criterion, a column is converted only if the ratio of unique values to total values is below 50%. The Name column does not meet that criterion (which is obvious, since names are mostly unique), so it is left untouched and no optimization happens for it.

Parallelize your operations

Have you ever thought of parallelizing program execution? Your laptop or system has multiple cores, so let’s not keep them idle. Let’s put them to work by using multicore functionality with the help of pandarallel.

Note: Pandarallel is an alternative to Dask.

First things first, you need to install the library, pandarallel:

pip install pandarallel

Let me show you the difference in execution, with and without parallelization (see the sketch below).
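Here is a minimal sketch of both versions; the dataframe and the slow_transform function are invented for illustration, while initialize(), apply() and parallel_apply() are the actual calls:

import pandas as pd
from pandarallel import pandarallel

pandarallel.initialize(progress_bar=True)        # patch pandas with the parallel_* methods

df = pd.DataFrame({'value': range(1_000_000)})   # hypothetical data

def slow_transform(x):
    return x ** 2                                # stand-in for an expensive function

# Without parallelization: runs on a single core
result = df['value'].apply(slow_transform)

# With parallelization: the same call, spread across all cores
result = df['value'].parallel_apply(slow_transform)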

Note: by modifying only one line of code, Pandarallel gives you a quick way to parallelize your pandas operations over all of your CPU cores. It can also show progress bars.

That one-line change (apply to parallel_apply) is all you need to make to utilize the capabilities of Pandarallel. For more details, see the project repository:

https://github.com/nalepae/pandarallel

Read in chunks!

There are times when you face data so large that all the techniques mentioned above are useless, because you cannot read the data into memory in the first place.

One interesting thing about pandas is that it reads files sequentially, row by row. We can leverage this and tell pandas to read the data in chunks of a given size.

We can easily achieve this with the following Python code:
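A minimal sketch, assuming a made-up file name, column, and chunk size:

import pandas as pd

results = []
# Read the file 100,000 rows at a time instead of loading everything at once
for chunk_data in pd.read_csv('big_dataset.csv', chunksize=100_000):
    # each chunk is processed on its own, e.g. filtered or aggregated
    results.append(chunk_data['value'].sum())    # hypothetical column

total = sum(results)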

Note that every chunk is itself a dataframe; you can print type(chunk_data) to confirm it.

Footnote

These are the ways I generally employ in projects to optimize memory. They make a big difference when you are dealing with high-dimensional data.

The techniques mentioned above are not exhaustive; there are many other ways to optimize memory in pandas, and I will add them in future updates.

Stay updated with all the content I post -

More content is in draft and will be posted soon. Until then, stay tuned. :)

