Using OpenAI and PandasAI for Series Operations

Incorporate natural language queries and operations into your Python data cleaning workflow.

Michael B Walker

Published in

Towards Data Science

6 min read

4 days ago

—

Red panda drawing donated by Karen Walker, the artist.

Many of the series operations we need to do in our pandas data cleaning projects can be assisted by AI tools, including by PandasAI. PandasAI takes advantage of large language models, such as that from OpenAI, to enable natural language queries and operations on data columns. In this post, we examine how to use PandasAI to query Series values, create new Series, set Series values conditionally, and reshape our data.

You can install PandasAI by entering pip install pandasai into a terminal or into Windows Powershell. You will also need to get a token from openai.com to send a request to the OpenAI API.

As the PandasAI library is developing rapidly, you can anticipate different results depending on the versions of PandasAI and pandas you are using. In this article, I use version 1.4.8 of PandasAI and version 1.5.3 of pandas.

We will work with data from the National Longitudinal Study of Youth (NLS) conducted by the United States Bureau of Labor Statistics. The NLS has surveyed the same cohort of high school students for over 25 years, and has useful data items on educational outcomes and weeks worked for each of those years, among many other variables. It is available for public use at nlsinfo.org. (The NLS public releases are covered by the United States government Open Data Policy, which permits both non-commercial and commercial use.)

We will also work with COVID-19 data provided by Our World in Data. That dataset has one row per country per day with number of new cases and new deaths. This dataset is available for download at ourworldindata.org/covid-cases, with a Creative Commons CC BY 4.0 license. You can also download all code and data used in this post from GitHub.

We start by importing the OpenAI and SmartDataframe modules from PandasAI. We also have to instantiate an llm object:

import pandas as pd
from pandasai.llm.openai import OpenAI
from pandasai import SmartDataframe
llm = OpenAI(api_token="Your API Token")

Next, we load the DataFrames we will be using and create a SmartDataframe object from the NLS pandas DataFrame:

covidcases = pd.read_csv("data/covidcases.csv")
nls97 = pd.read_csv("data/nls97f.csv")
nls97.set_index("personid", inplace=True)
nls97sdf = SmartDataframe(nls97, config={"llm": llm})

Now we are ready to generate summary statistics on Series from our SmartDataframe. We can ask for the average for a single Series, or for multiple Series:

nls97sdf.chat("Show average of gpaoverall")2.8184077281812128
nls97sdf.chat("Show average for each weeks worked column")
Average Weeks Worked
                  0
weeksworked00 26.42
weeksworked01 29.78
weeksworked02 31.83
weeksworked03 33.51
weeksworked04 35.10
weeksworked05 37.34
weeksworked06 38.44
weeksworked07 39.29
weeksworked08 39.33
weeksworked09 37.51
weeksworked10 37.12
weeksworked11 38.06
weeksworked12 38.15
weeksworked13 38.79
weeksworked14 38.73
weeksworked15 39.67
weeksworked16 40.19
weeksworked17 40.37
weeksworked18 40.01
weeksworked19 41.22
weeksworked20 38.35
weeksworked21 36.17
weeksworked22 11.43

We can also summarize Series values by another Series, usually one that is categorical:

nls97sdf.chat("Show satmath average by gender")Female   Male
0  486.65 516.88

We can also create a new Series with the chat method of SmartDataframe. We do not need to use the actual column names. For example, PandasAI will figure out that we want the childathome Series when we write child at home:

nls97sdf = nls97sdf.chat("Set childnum to child at home plus child not at
  home")
nls97sdf[['childnum','childathome','childnotathome']].
  sample(5, random_state=1)childnum  childathome  childnotathome
personid                                       
211230        2.00         2.00            0.00
990746        3.00         3.00            0.00
308169        3.00         1.00            2.00
798458         NaN          NaN             NaN
312009         NaN          NaN             NaN

We can use the chat method to create Series values conditionally:

nls97sdf = nls97sdf.chat("evermarried is 'No' when maritalstatus is
  'Never-married', else 'Yes'")
nls97sdf.groupby(['evermarried','maritalstatus']).size()evermarried  maritalstatus
No           Never-married    2767
Yes          Divorced          669
             Married          3068
             Separated         148
             Widowed            23
dtype: int64

PandasAI is quite flexible regarding the language you might use here. For example, the following provides the same results:

nls97sdf = nls97sdf.chat("if maritalstatus is 'Never-married' set
  evermarried2 to 'No', otherwise 'Yes'")
nls97sdf.groupby(['evermarried2','maritalstatus']).size()evermarried2  maritalstatus
No            Never-married    2767
Yes           Divorced          669
              Married          3068
              Separated         148
              Widowed            23
dtype: int64

We can do calculations across a number of similarly named columns:

nls97sdf = nls97sdf.chat("Set weeksworked for each row to the average of
  all weeksworked columns for that row")

This will calculate the average of all weeksworked00-weeksworked22 columns and assign that to a new column called weeksworked.

We can easily impute values where they are missing based on summary statistics:

nls97sdf.gpaenglish.describe()count   5,798
mean      273
std        74
min         0
25%       227
50%       284
75%       323
max       418
Name: gpaenglish, dtype: float64
nls97sdf = nls97sdf.chat("set missing gpaenglish to the average")
nls97sdf.gpaenglish.describe()
count   8,984
mean      273
std        59
min         0
25%       264
50%       273
75%       298
max       418
Name: gpaenglish, dtype: float64

We can also use PandasAI to do some reshaping. Recall that the COVID-19 case data has new cases for each day for each country. Let’s say we only want the first row of data for each country. We can do that the traditional way with drop_duplicates:

firstcase = covidcases.
  sort_values(['location','casedate']).
  drop_duplicates(['location'], keep='first')firstcase.set_index('location', inplace=True)
firstcase.shape
(231, 67)
firstcase[['iso_code','continent','casedate',
  'total_cases','new_cases']].head(2).T
location    Afghanistan     Albania
iso_code            AFG         ALB
continent          Asia      Europe
casedate     2020-03-01  2020-03-15
total_cases        1.00       33.00
new_cases          1.00       33.00

We can get the same results by creating a SmartDataframe and using the chat method. The natural language I use here is remarkably straightforward, Show first casedate and location and other values for each country:

covidcasessdf = SmartDataframe(covidcases, config={"llm": llm})
firstcasesdf = covidcasessdf.chat("Show first casedate and location and
other values for each country.")firstcasesdf.shape
(231, 7)
firstcasesdf[['location','continent','casedate',
  'total_cases','new_cases']].head(2).T
iso_code               ABW          AFG
location             Aruba  Afghanistan
continent    North America         Asia
casedate        2020-03-22   2020-03-01
total_cases           5.00         1.00
new_cases             5.00         1.00

Notice that PandasAI makes smart choices about the columns to get. We get the columns we need rather than all of them. We could have also just passed the names of the columns we wanted to chat. (PandasAI sorted the rows by iso_code, rather than by location, which is why the first row is different.)

Much of the work when using PandasAI is really just importing the relevant libraries and instantiating large language model and SmartDataframe objects. Once that’s done, simple sentences sent to the chat method of the SmartDataframe are sufficient to summarize Series values and create new Series.

PandasAI excels at generating simple statistics from Series. We don’t even need to remember the Series name exactly. Often the natural language we might use can be more intuitive than traditional pandas methods like groupby. The Show satmath average by gender value passed to chat is a good example of that.

Operations on Series, including the creation of a new Series, is also quite straightforward. We created a total number of children Series (childnum) by instructing the SmartDataframe to add the number of children living at home to the number of children not living at home. We didn’t even provide the literal Series names, childathome and childnotathome respectively. PandasAI figured out what we meant.

Since we are passing natural language instructions to chat for our Series operations, there is no one right way to get what we want. For example, we get the same result when we passed evermarried is ‘No’ when maritalstatus is ‘Never-married’, else ‘Yes’ to chat as we did with if maritalstatus is ‘Never-married’ set evermarried2 to ‘No’, otherwise ‘Yes’.

We can also do fairly extensive DataFrame reshaping with simple natural language instructions, as in the last command we provided. We add and other values to the instructions to get columns other than casedate. PandasAI also figures out that location makes sense as the index.

You can read more about how to use PandasAI and SmartDataframes here:

Large Language Models – PandasAI

PandasAI supports several large language models (LLMs) that are used to generate code from natural language queries.

docs.pandas-ai.com

Or in the second edition of my book, Python Data Cleaning Cookbook:

Python Data Cleaning Cookbook – Second Edition: Prepare your data for analysis with pandas, NumPy…

Python Data Cleaning Cookbook – Second Edition: Prepare your data for analysis with pandas, NumPy, Matplotlib…

packt.link

Good luck with your data cleaning and I would love to hear how things are going!

Using OpenAI and PandasAI for Series Operations

Incorporate natural language queries and operations into your Python data cleaning workflow.

Large Language Models – PandasAI

PandasAI supports several large language models (LLMs) that are used to generate code from natural language queries.

Python Data Cleaning Cookbook – Second Edition: Prepare your data for analysis with pandas, NumPy…

Python Data Cleaning Cookbook – Second Edition: Prepare your data for analysis with pandas, NumPy, Matplotlib…

Bitcoin explodes to a new all-time high of over $88,000 after President Trump wins US elections

Robotics Factory opens applications for Scale Residency Program – The Robot Report

Introspect ships the world’s first GDDR7 memory test system, with up to 40Gbps GDDR7

Supercharge Your Portfolio with Future Tech Stocks!

Join us for Profitable Insights & Expert Tips!

Using OpenAI and PandasAI for Series Operations

Incorporate natural language queries and operations into your Python data cleaning workflow.

Large Language Models – PandasAI

PandasAI supports several large language models (LLMs) that are used to generate code from natural language queries.

Python Data Cleaning Cookbook – Second Edition: Prepare your data for analysis with pandas, NumPy…

Python Data Cleaning Cookbook – Second Edition: Prepare your data for analysis with pandas, NumPy, Matplotlib…

Bitcoin explodes to a new all-time high of over $88,000 after President Trump wins US elections

Robotics Factory opens applications for Scale Residency Program – The Robot Report

Introspect ships the world’s first GDDR7 memory test system, with up to 40Gbps GDDR7

Supercharge Your Portfolio with Future Tech Stocks!

Join us for Profitable Insights & Expert Tips!

Subscribe