Incorporate natural language queries and operations into your Python data cleaning workflow.
Many of the series operations we need to do in our pandas data cleaning projects can be assisted by AI tools, including by PandasAI. PandasAI takes advantage of large language models, such as that from OpenAI, to enable natural language queries and operations on data columns. In this post, we examine how to use PandasAI to query Series values, create new Series, set Series values conditionally, and reshape our data.
You can install PandasAI by entering pip install pandasai into a terminal or into Windows Powershell. You will also need to get a token from openai.com to send a request to the OpenAI API.
As the PandasAI library is developing rapidly, you can anticipate different results depending on the versions of PandasAI and pandas you are using. In this article, I use version 1.4.8 of PandasAI and version 1.5.3 of pandas.
We will work with data from the National Longitudinal Study of Youth (NLS) conducted by the United States Bureau of Labor Statistics. The NLS has surveyed the same cohort of high school students for over 25 years, and has useful data items on educational outcomes and weeks worked for each of those years, among many other variables. It is available for public use at nlsinfo.org. (The NLS public releases are covered by the United States government Open Data Policy, which permits both non-commercial and commercial use.)
We will also work with COVID-19 data provided by Our World in Data. That dataset has one row per country per day with number of new cases and new deaths. This dataset is available for download at ourworldindata.org/covid-cases, with a Creative Commons CC BY 4.0 license. You can also download all code and data used in this post from GitHub.
We start by importing the OpenAI and SmartDataframe modules from PandasAI. We also have to instantiate an llm object:
import pandas as pd
from pandasai.llm.openai import OpenAI
from pandasai import SmartDataframe
llm = OpenAI(api_token="Your API Token")
Next, we load the DataFrames we will be using and create a SmartDataframe object from the NLS pandas DataFrame:
covidcases = pd.read_csv("data/covidcases.csv")
nls97 = pd.read_csv("data/nls97f.csv")
nls97.set_index("personid", inplace=True)
nls97sdf = SmartDataframe(nls97, config={"llm": llm})
Now we are ready to generate summary statistics on Series from our SmartDataframe. We can ask for the average for a single Series, or for multiple Series:
nls97sdf.chat("Show average of gpaoverall")2.8184077281812128
nls97sdf.chat("Show average for each weeks worked column")
Average Weeks Worked
0
weeksworked00 26.42
weeksworked01 29.78
weeksworked02 31.83
weeksworked03 33.51
weeksworked04 35.10
weeksworked05 37.34
weeksworked06 38.44
weeksworked07 39.29
weeksworked08 39.33
weeksworked09 37.51
weeksworked10 37.12
weeksworked11 38.06
weeksworked12 38.15
weeksworked13 38.79
weeksworked14 38.73
weeksworked15 39.67
weeksworked16 40.19
weeksworked17 40.37
weeksworked18 40.01
weeksworked19 41.22
weeksworked20 38.35
weeksworked21 36.17
weeksworked22 11.43
We can also summarize Series values by another Series, usually one that is categorical:
nls97sdf.chat("Show satmath average by gender")Female Male
0 486.65 516.88
We can also create a new Series with the chat
method of SmartDataframe. We do not need to use the actual column names. For example, PandasAI will figure out that we want the childathome Series when we write child at home:
nls97sdf = nls97sdf.chat("Set childnum to child at home plus child not at
home")
nls97sdf[['childnum','childathome','childnotathome']].
sample(5, random_state=1)childnum childathome childnotathome
personid
211230 2.00 2.00 0.00
990746 3.00 3.00 0.00
308169 3.00 1.00 2.00
798458 NaN NaN NaN
312009 NaN NaN NaN
We can use the chat
method to create Series values conditionally:
nls97sdf = nls97sdf.chat("evermarried is 'No' when maritalstatus is
'Never-married', else 'Yes'")
nls97sdf.groupby(['evermarried','maritalstatus']).size()evermarried maritalstatus
No Never-married 2767
Yes Divorced 669
Married 3068
Separated 148
Widowed 23
dtype: int64
PandasAI is quite flexible regarding the language you might use here. For example, the following provides the same results:
nls97sdf = nls97sdf.chat("if maritalstatus is 'Never-married' set
evermarried2 to 'No', otherwise 'Yes'")
nls97sdf.groupby(['evermarried2','maritalstatus']).size()evermarried2 maritalstatus
No Never-married 2767
Yes Divorced 669
Married 3068
Separated 148
Widowed 23
dtype: int64
We can do calculations across a number of similarly named columns:
nls97sdf = nls97sdf.chat("Set weeksworked for each row to the average of
all weeksworked columns for that row")
This will calculate the average of all weeksworked00-weeksworked22 columns and assign that to a new column called weeksworked.
We can easily impute values where they are missing based on summary statistics:
nls97sdf.gpaenglish.describe()count 5,798
mean 273
std 74
min 0
25% 227
50% 284
75% 323
max 418
Name: gpaenglish, dtype: float64
nls97sdf = nls97sdf.chat("set missing gpaenglish to the average")
nls97sdf.gpaenglish.describe()
count 8,984
mean 273
std 59
min 0
25% 264
50% 273
75% 298
max 418
Name: gpaenglish, dtype: float64
We can also use PandasAI to do some reshaping. Recall that the COVID-19 case data has new cases for each day for each country. Let’s say we only want the first row of data for each country. We can do that the traditional way with drop_duplicates:
firstcase = covidcases.
sort_values(['location','casedate']).
drop_duplicates(['location'], keep='first')firstcase.set_index('location', inplace=True)
firstcase.shape
(231, 67)
firstcase[['iso_code','continent','casedate',
'total_cases','new_cases']].head(2).T
location Afghanistan Albania
iso_code AFG ALB
continent Asia Europe
casedate 2020-03-01 2020-03-15
total_cases 1.00 33.00
new_cases 1.00 33.00
We can get the same results by creating a SmartDataframe and using the chat
method. The natural language I use here is remarkably straightforward, Show first casedate and location and other values for each country:
covidcasessdf = SmartDataframe(covidcases, config={"llm": llm})
firstcasesdf = covidcasessdf.chat("Show first casedate and location and
other values for each country.")firstcasesdf.shape
(231, 7)
firstcasesdf[['location','continent','casedate',
'total_cases','new_cases']].head(2).T
iso_code ABW AFG
location Aruba Afghanistan
continent North America Asia
casedate 2020-03-22 2020-03-01
total_cases 5.00 1.00
new_cases 5.00 1.00
Notice that PandasAI makes smart choices about the columns to get. We get the columns we need rather than all of them. We could have also just passed the names of the columns we wanted to chat
. (PandasAI sorted the rows by iso_code, rather than by location, which is why the first row is different.)
Much of the work when using PandasAI is really just importing the relevant libraries and instantiating large language model and SmartDataframe objects. Once that’s done, simple sentences sent to the chat
method of the SmartDataframe are sufficient to summarize Series values and create new Series.
PandasAI excels at generating simple statistics from Series. We don’t even need to remember the Series name exactly. Often the natural language we might use can be more intuitive than traditional pandas methods like groupby. The Show satmath average by gender value passed to chat
is a good example of that.
Operations on Series, including the creation of a new Series, is also quite straightforward. We created a total number of children Series (childnum) by instructing the SmartDataframe to add the number of children living at home to the number of children not living at home. We didn’t even provide the literal Series names, childathome and childnotathome respectively. PandasAI figured out what we meant.
Since we are passing natural language instructions to chat
for our Series operations, there is no one right way to get what we want. For example, we get the same result when we passed evermarried is ‘No’ when maritalstatus is ‘Never-married’, else ‘Yes’
to chat
as we did with if maritalstatus is ‘Never-married’ set evermarried2 to ‘No’, otherwise ‘Yes’
.
We can also do fairly extensive DataFrame reshaping with simple natural language instructions, as in the last command we provided. We add and other values
to the instructions to get columns other than casedate. PandasAI also figures out that location makes sense as the index.
You can read more about how to use PandasAI and SmartDataframes here:
Or in the second edition of my book, Python Data Cleaning Cookbook:
Good luck with your data cleaning and I would love to hear how things are going!