TL;DR
In the humanitarian response world there can be tens of thousands of tabular (CSV and Excel) datasets, many of which contain critical information for helping save lives. Data can be provided by hundreds of different organizations with different naming conventions, languages and data standards, so having information (metadata) about what each column in a table represents is important for finding the right data and understanding how it fits together. Much of this metadata is set manually, which is time-consuming and error-prone, so any automatic method can have a real impact on helping people. In this article we revisit a previous analysis, “Predicting Metadata of Humanitarian Datasets with GPT 3”, to see how advances in the last 18 months open the way for more efficient and less time-consuming methods for setting metadata on tabular data.
Using metadata-tagged CSV and Excel datasets from the Humanitarian Data Exchange (HDX), we show that fine-tuning GPT-4o-mini works well for predicting Humanitarian Exchange Language (HXL) tags and attributes for the most common tags relating to location and dates. However, for less well-represented tags and attributes the technique can be limited by poor-quality training data, where humans have made mistakes in manually labelling data or simply aren’t using all possible HXL metadata combinations. It also has the limitation of not being able to adjust when the metadata standard changes, since the training data would not reflect those changes.
Given more powerful LLMs are now available, we tested a technique of directly prompting GPT-4o or GPT-4o-mini rather than fine-tuning, providing the full HXL core schema definition in the system prompt now that larger context windows are available. With GPT-4o this approach proved more accurate than fine-tuning, is able to support rarer HXL tags and attributes, and requires no custom training data, making it easier to manage and deploy. It is, however, more expensive, although costs drop substantially when using GPT-4o-mini, albeit with a slight decrease in performance. Using this approach we provide a simple Python class in a GitHub Gist that can be used in data processing pipelines to automatically add HXL metadata tags and attributes to tabular datasets.
Generative AI moves fast!
About 18 months ago I wrote a blog post Predicting Metadata of Humanitarian Datasets with GPT 3.
That’s right, with GPT 3, not even 3.5! 🙂
Even so, back then Large Language Model (LLM) fine-tuning produced great performance for predicting Humanitarian Exchange Language (HXL) metadata fields for tabular datasets on the amazing Humanitarian Data Exchange (HDX). In that study, the training data represented the distribution of HXL data on HDX and so was comprised of the most common tags relating to location and dates. These are very important for linking different datasets together in location and time, a crucial factor in using data to optimize humanitarian response.
The LLM field has since advanced … a LOT.
So in this article we will revisit the technique, expand it to cover less frequent HXL tags and attributes, and explore other options now available for situations where a complex, high-cardinality taxonomy needs to be applied to data, including the ability to predict tags and attributes not currently represented in the human-labeled training data.
Setup
You can follow along with this analysis by opening these notebooks in Google Colab or running them locally:
- generate-test-train-data.ipynb — A notebook for creating test and training datasets
- openai-hxl-prediction.ipynb — Notebook exploring fine-tuning and prompting for predicting HXL datasets
Please refer to the README in the repo for installation instructions.
HXL Data from the Humanitarian Data Exchange
For this study, and with help from the HDX team, we will use data extracted from the HDX platform using a crawler process they run to track the use of HXL metadata tags and attributes on the platform. You can find great HXL resources on GitHub, but if you want to follow along with this analysis I have also saved the source data on Google Drive as the crawler will take days to process the hundreds of thousands of tabular datasets on HDX.
The data looks like this, with one row per HXL-tagged table column …
The core HXL Schema
The HXL postcard is a really great overview of the most common HXL tags and attributes in the core schema. For our analysis, we will apply the full standard as found on HDX which provides a spreadsheet of supported tags and attributes …
Data Processing
The generate-test-train-data.ipynb notebook provides all the steps taken to create test and training datasets, but here are some key points to note:
1. Removal of automatic pipeline repeat HXL data
In this study, I removed duplicate data created by automated pipelines that upload data to HDX, by using an MD5 hash of the column names in each tabular dataset (CSV and Excel files). For example, a CSV file of population statistics created by an organization is often very similar for each country-specific CSV or Excel file, so we only take one example. This has a balancing effect on the data, providing more variation of HXL tags and attributes by removing very similar repeat data.
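Purely to illustrate the idea (the actual implementation is in generate-test-train-data.ipynb, and the column names used here are assumptions), de-duplicating by a hash of each table’s column names might look something like this …

import hashlib

import pandas as pd

def dedupe_by_column_signature(df):
    """
    Minimal sketch: keep one example table per unique set of column names.
    Assumes one row per HXL-tagged column, with hypothetical fields
    'resource_id' (source table) and 'column_name' (column header).
    """
    signatures = (
        df.groupby("resource_id")["column_name"]
        .apply(lambda cols: hashlib.md5("|".join(sorted(cols)).encode("utf-8")).hexdigest())
        .rename("column_signature")
        .reset_index()
    )
    df = df.merge(signatures, on="resource_id")
    # Keep the first table seen for each unique column signature
    keep_ids = df.drop_duplicates("column_signature")["resource_id"]
    return df[df["resource_id"].isin(keep_ids)].drop(columns="column_signature")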
2. Constraining data to valid HXL
About 50% of the HDX data with HXL tags uses a tag or attribute that is not specified in the HXL Core Schema, so this data was removed from the training and test sets.
3. Data enrichment
As a (mostly!) human being, when deciding what HXL tags and attributes to use on a column, I take a peek at the data for that column and also the data as a whole in the table. For this analysis we do the same for the LLM fine-tuning and prompt data, adding in data excerpts for each column. A table description is also added using an LLM (GPT-3.5-Turbo) summary of the data to make them consistent, as summaries on HDX can vary in form, ranging from pages to a few words.
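To illustrate the table-description step (the notebook contains the actual implementation; the helper name and prompt wording below are mine, and client is assumed to be an already-configured OpenAI client), a minimal sketch might look like this …

def summarize_table(client, resource_name, df_sample, model="gpt-3.5-turbo"):
    """
    Minimal sketch: generate a short, consistent table description with an LLM.
    The prompt wording is illustrative, not the exact one used in the notebook.
    """
    prompt = (
        f"Summarize in 2-3 sentences what the table '{resource_name}' contains, "
        f"given these column names and sample rows:\n\n{df_sample.head(10).to_csv(index=False)}"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.1,
        max_tokens=200,
    )
    return response.choices[0].message.content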
4. Carefully splitting data to create train/test sets
Many machine learning pipelines split data randomly to create training and test sets. However, for HDX data this would result in columns and files from the same organization being in train and test. I felt this was a bit too easy for testing predictions and so instead split the data by organizations to ensure organizations in the test set were not in the training data. Additionally, subsidiaries of the same parent organization — eg “ocha-iraq” and “ocha-libya” — were not allowed to be in both the training and test sets, again to make the predictions more realistic. My aim was to test prediction with organizations as if their data had never been seen before.
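As a rough sketch of that idea (the notebook has the actual logic; the 'organization' column name and parent-org heuristic below are assumptions), an organization-grouped split can be made with scikit-learn’s GroupShuffleSplit …

from sklearn.model_selection import GroupShuffleSplit

def split_by_organization(df, test_size=0.15, seed=42):
    """
    Minimal sketch: keep all rows from the same (parent) organization on one
    side of the split. Deriving the parent org by dropping the trailing
    '-country' segment (e.g. 'ocha-iraq' -> 'ocha') is an illustrative heuristic.
    """
    parent_org = df["organization"].str.rsplit("-", n=1).str[0]
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_idx, test_idx = next(splitter.split(df, groups=parent_org))
    return df.iloc[train_idx], df.iloc[test_idx]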
After all of the above and down-sampling to save costs, we are left with 2,883 rows in the training set and 485 rows in the test set.
Creating JSONL fine-tuning prompt files
In my original article I opted for using a completion model, but with the release of GPT-4o-mini I instead generated prompts appropriate for fine-tuning a chat model (see here for more information about the available models).
Each prompt has the form …
{
"messages": [
{
"role": "system",
"content": "<SYSTEM PROMPT>"
},
{
"role": "user",
"content": "<INPUT PROMPT>"
},
{
"role": "assistant",
"content": "<EXPECTED OUTPUT>"
}
]
}
Note: The above has been formatted for clarity, but JSONL will have everything in one line per record.
Using the data excerpts, LLM-generated table descriptions and column names we collated, we can now generate prompts which look like this …
{
"messages": [
{
"role": "system",
"content": "You are an assistant that replies with HXL tags and attributes"
},
{
"role": "user",
"content": "What are the HXL tags and attributes for a column with these details?
resource_name='admin1-summaries-earthquake.csv';
dataset_description='The dataset contains earthquake data for various
administrative regions in Afghanistan,
including country name, admin1 name, latitude,
longitude, aggregation type, indicator name,
and indicator value. The data includes maximum
earthquake values recorded in different regions,
with corresponding latitude and longitude coordinates.
The dataset provides insights into the seismic
activity in different administrative areas of
Afghanistan.';
column_name:'indicator';
examples: ['earthquake', 'earthquake', 'earthquake', 'earthquake', 'earthquake', 'earthquake', 'earthquake', 'earthquake', 'earthquake', 'earthquake', 'earthquake']"
},
{
"role": "assistant",
"content": "#indicator+name"
}
]
}
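As an illustration of how such records can be written out (the dataframe column names here are assumptions; the notebook has the actual implementation), building the JSONL file might look like this …

import json

SYSTEM_PROMPT = "You are an assistant that replies with HXL tags and attributes"

def write_finetune_jsonl(rows, out_file):
    """
    Minimal sketch: write chat-format fine-tuning records, one JSON object per line.
    Assumes a collated dataframe with illustrative columns 'resource_name',
    'dataset_description', 'column_name', 'examples' and 'hxl'.
    """
    with open(out_file, "w") as f:
        for _, r in rows.iterrows():
            user_content = (
                "What are the HXL tags and attributes for a column with these details? "
                f"resource_name='{r['resource_name']}'; "
                f"dataset_description='{r['dataset_description']}'; "
                f"column_name:'{r['column_name']}'; "
                f"examples: {r['examples']}"
            )
            record = {
                "messages": [
                    {"role": "system", "content": SYSTEM_PROMPT},
                    {"role": "user", "content": user_content},
                    {"role": "assistant", "content": r["hxl"]},
                ]
            }
            f.write(json.dumps(record) + "\n")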
Fine-tuning GPT-4o-mini
We now have test and training files in the right format for fine-tuning an OpenAI chat model, so let’s tune our model …
import sys
import time

def fine_tune_model(train_file, model_name="gpt-4o-mini"):
    """
    Fine-tune an OpenAI model using training data.

    Args:
        train_file (str): The JSONL file containing the prompts to use for fine-tuning.
        model_name (str): The name of the model to fine-tune. Default is "gpt-4o-mini".

    Returns:
        str: The ID of the fine-tuned model.
    """
    # Upload file to OpenAI for fine-tuning
    file = client.files.create(
        file=open(train_file, "rb"),
        purpose="fine-tune"
    )
    file_id = file.id
    print(f"Uploaded training file with ID: {file_id}")

    # Start the fine-tuning job
    ft = client.fine_tuning.jobs.create(
        training_file=file_id,
        model=model_name
    )
    ft_id = ft.id
    print(f"Fine-tuning job started with ID: {ft_id}")

    # Monitor the status of the fine-tuning job
    ft_result = client.fine_tuning.jobs.retrieve(ft_id)
    while ft_result.status != 'succeeded':
        print(f"Current status: {ft_result.status}")
        time.sleep(120)  # Wait for 120 seconds before checking again
        ft_result = client.fine_tuning.jobs.retrieve(ft_id)
        if 'failed' in ft_result.status.lower():
            sys.exit()
    print(f"Fine-tuning job {ft_id} succeeded!")

    # Retrieve the fine-tuned model
    fine_tuned_model = ft_result.fine_tuned_model
    print(f"Fine-tuned model: {fine_tuned_model}")
    return fine_tuned_model
model = fine_tune_model("hxl_chat_prompts_train.jsonl", model_name="gpt-4o-mini-2024-07-18")
In the above we are using the new GPT-4o-mini model, which OpenAI currently makes free to fine-tune …
“Now through September 23, GPT-4o mini is free to fine-tune up to a daily limit of 2M training tokens. Overages over 2M training tokens will be charged at $3.00/1M tokens. Starting September 24, fine-tuning training will cost $3.00/1M tokens. Check out the fine-tuning docs for more details on free access.”
Even at $3.00 per 1 million tokens, the costs are quite low for this task, coming out at about $7 per fine-tuning run for just over 2 million tokens in the training file. Bear in mind that fine-tuning should be a rare event for this particular task; once we have such a model it can be reused.
The fine-tuning produces the following output …
Uploaded training file with ID: file-XXXXXXXXXXXXXXX
Fine-tuning job started with ID: ftjob-XXXXXXXXXXXXXXX
Current status: validating_files
Current status: validating_files
Current status: running
Current status: running
Current status: running
Current status: running
Current status: running
Current status: running
Current status: running
Current status: running
Current status: running
Current status: running
Current status: running
Current status: running
Current status: running
Current status: running
Current status: running
Current status: running
Current status: running
Current status: running
Current status: running
Fine-tuning job ftjob-XXXXXXXXXXXXXXX succeeded!
Fine-tuned model: ft:gpt-4o-mini-2024-07-18::XXXXXXX
It took about 45 minutes.
Testing our fine-tuned model to predict HXL
Now that we have a nice new shiny fine-tuned model for predicting HXL tags and attributes, we can use the test file to take it for a spin …
def make_chat_predictions(prompts, model, temperature=0.1, max_tokens=13):
    """
    Generate chat predictions based on given prompts using the OpenAI chat model.

    Args:
        prompts (list): A list of prompts, where each prompt is a dictionary containing a list of messages.
                        Each message in the list has a 'role' (either 'system', 'user', or 'assistant') and 'content'.
        model (str): The name or ID of the OpenAI chat model to use for predictions.
        temperature (float, optional): Controls the randomness of the predictions. Higher values (e.g., 0.5) make the
                                       output more random, while lower values (e.g., 0.1) make it more deterministic.
                                       Defaults to 0.1.
        max_tokens (int, optional): The maximum number of tokens in the predicted response. Defaults to 13.

    Returns:
        pandas.DataFrame: A DataFrame containing the results of the chat predictions. Each row in the DataFrame
                          corresponds to a prompt and includes the prompt messages, the actual message, and the
                          predicted message.
    """
    results = []
    for p in prompts:
        actual = p["messages"][-1]["content"]
        p["messages"] = p["messages"][0:2]
        completion = client.chat.completions.create(
            model=model,
            messages=p["messages"],
            temperature=temperature,
            max_tokens=max_tokens
        )
        predicted = completion.choices[0].message.content
        predicted = filter_for_schema(predicted)
        res = {
            "prompt": p["messages"],
            "actual": actual,
            "predicted": predicted
        }
        print(f"Predicted: {predicted}; Actual: {actual}")
        results.append(res)

    results = pd.DataFrame(results)
    return results
def filter_for_schema(text):
    """
    Filters the input text to extract approved HXL schema tokens.

    Args:
        text (str): The input text to be filtered.

    Returns:
        str: The filtered text containing only approved HXL schema tokens.
    """
    if " " in text:
        text = text.replace(" ", "")
    tokens_raw = text.split("+")
    tokens = [tokens_raw[0]]
    for t in tokens_raw[1:]:
        tokens.append(f"+{t}")
    filtered = []
    for t in tokens:
        if t in APPROVED_HXL_SCHEMA:
            if t not in filtered:
                filtered.append(t)
    filtered = "".join(filtered)
    if len(filtered) > 0 and filtered[0] != '#':
        filtered = ""
    return filtered
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def output_prediction_metrics(results, prediction_field="predicted", actual_field="actual"):
    """
    Prints out model performance report for HXL tag prediction. Metrics are for
    just predicting tags, as well as predicting tags and attributes.

    Parameters
    ----------
    results : dataframe
        Dataframe of results
    prediction_field : str
        Field name of element with prediction. Handy for comparing raw and post-processed predictions.
    actual_field: str
        Field name of the actual result for comparison with prediction
    """
    y_test = []
    y_pred = []
    y_justtag_test = []
    y_justtag_pred = []
    for index, r in results.iterrows():
        if actual_field not in r or prediction_field not in r:
            print("Provided results do not contain expected values.")
            sys.exit()
        y_pred.append(r[prediction_field])
        y_test.append(r[actual_field])
        actual_tag = r[actual_field].split("+")[0]
        predicted_tag = r[prediction_field].split("+")[0]
        y_justtag_test.append(actual_tag)
        y_justtag_pred.append(predicted_tag)

    print(f"LLM results for {prediction_field}, {len(results)} predictions ...")

    print("\nJust HXL tags ...\n")
    print(f"Accuracy: {round(accuracy_score(y_justtag_test, y_justtag_pred),2)}")
    print(
        f"Precision: {round(precision_score(y_justtag_test, y_justtag_pred, average='weighted', zero_division=0),2)}"
    )
    print(
        f"Recall: {round(recall_score(y_justtag_test, y_justtag_pred, average='weighted', zero_division=0),2)}"
    )
    print(
        f"F1: {round(f1_score(y_justtag_test, y_justtag_pred, average='weighted', zero_division=0),2)}"
    )

    print(f"\nTags and attributes with {prediction_field} ...\n")
    print(f"Accuracy: {round(accuracy_score(y_test, y_pred),2)}")
    print(
        f"Precision: {round(precision_score(y_test, y_pred, average='weighted', zero_division=0),2)}"
    )
    print(
        f"Recall: {round(recall_score(y_test, y_pred, average='weighted', zero_division=0),2)}"
    )
    print(
        f"F1: {round(f1_score(y_test, y_pred, average='weighted', zero_division=0),2)}"
    )
    return
with open(TEST_FILE) as f:
    X_test = [json.loads(line) for line in f]

results = make_chat_predictions(X_test, model)
output_prediction_metrics(results)
print("Done")
Note that in the above, all predictions are filtered to only include tags and attributes defined in the HXL standard.
This gives the following results …
LLM results for predicted, 458 predictions ...

Just HXL tags ...
Accuracy: 0.83
Precision: 0.85
Recall: 0.83
F1: 0.82
Tags and attributes with predicted ...
Accuracy: 0.61
Precision: 0.6
Recall: 0.61
F1: 0.57
‘Just HXL Tags’ means predicting the first part of the HXL, for example if the full HXL is #affected+infected+f, the model got the #affected part correct. ‘Tags and attributes’ means predicting the full HXL string, ie ‘#affected+infected+f’, a much harder challenge due to all the combinations possible.
The performance isn’t perfect, but not that bad, especially as we have balanced the dataset to reduce the number of location and date tags and attributes (ie made this study a bit more challenging). With tens of thousands of humanitarian response tables lacking HXL tags, even the above performance would likely add value.
Let’s look into cases where predictions didn’t agree with human-labeled data …
Reviewing the Human Labeled HXL Data
The predictions were saved to a spreadsheet, and I manually went through most of the predictions that didn’t agree with the labels. You can find this analysis here and summarized below …
What’s interesting is that in some cases the LLM is actually correct, for example in adding additional HXL attributes which the human-labeled data doesn’t include. There are also cases where the human-labeled HXL was perfectly reasonable, but the LLM predicted another tag or attribute that could also be interpreted as correct. For example a #region can also be an #adm1 in some countries, and whether something is an +id or +code is sometimes difficult to decide; both are appropriate.
Using the above categories, I created a new test set where the expected HXL tags were corrected. On re-running the prediction we get improved results …
Just HXL tags ...

Accuracy: 0.88
Precision: 0.88
Recall: 0.88
F1: 0.88
Tags and attributes with predicted ...
Accuracy: 0.66
Precision: 0.71
Recall: 0.66
F1: 0.66
Predicting HXL without Fine-tuning, instead only prompting GPT-4o
The above shows that the human-labeled data itself can be incorrect. The HXL standard is excellently designed, but it can be a challenge for developers and data scientists to memorize when setting HXL tags and attributes on data. There are some amazing tools already provided by the HXL team, but sometimes the HXL is still incorrect. This introduces a problem for the fine-tuning approach, which relies on this human-labeled data for training, especially for less well-represented tags and attributes that humans are not using very often. It also has the limitation of not being able to adjust when the metadata standard changes, since the training data would not reflect those changes.
Since the initial analysis 18 months ago various LLM providers have advanced their models significantly. OpenAI of course released GPT-4o as their flagship product, which importantly has a context window of 128k tokens and is another data point suggesting costs of foundational models are decreasing (see for example GPT-4-Turbo compared to GPT-4o here). Given these factors, I wondered …
If models are becoming more powerful and less expensive to use, could we avoid fine-tuning altogether and use them to predict HXL tags and attributes by prompting alone?
Not only could this mean less engineering work to clean data and fine-tune models, it may have a big advantage in being able to include HXL tags and attributes which are not included in the human-labeled training data but are part of the HXL standard. This is one potentially huge advantage of powerful LLMs, being able to classify with zero- and few-shot prompting.
Creating a prompt for predicting HXL
Models like GPT-4o are trained on web data, so I thought I’d first do a test using one of our prompts to see if it already knew everything there was to know about HXL tags …
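That first check was just a plain chat completion call with no HXL schema provided; a minimal sketch of it looks like the following, where the column details are illustrative and client is an already-configured OpenAI client …

# Out-of-the-box check: no HXL schema in the system prompt, just a user prompt
# (column details here are illustrative only)
completion = client.chat.completions.create(
    model="gpt-4o",
    temperature=0.1,
    messages=[
        {
            "role": "user",
            "content": (
                "What are the HXL tags and attributes for a column with these details? "
                "column_name:'no_affected_infected'; examples: ['1000', '2000', '3000']"
            ),
        }
    ],
)
print(completion.choices[0].message.content)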
What we see is that it seems to know about HXL syntax, but the answer is incorrect (the correct answer is ‘#affected+infected’), and it has chosen tags and attributes that are not in the HXL standard. It’s actually similar to what we see with human-tagged HXL.
How about we provide the most important parts of the HXL standard in the system prompt?
def generate_hxl_standard_prompt(local_data_file):
    """
    Generate a standard prompt for predicting Humanitarian Exchange Language (HXL) tags and attributes.

    Args:
        local_data_file (str): The path to the local data file containing core hashtags and attributes.

    Returns:
        str: The generated HXL standard prompt.
    """
    core_hashtags = pd.read_excel(local_data_file, sheet_name='Core hashtags')
    core_hashtags = core_hashtags.loc[core_hashtags["Release status"] == "Released"]
    core_hashtags = core_hashtags[["Hashtag", "Hashtag long description", "Sample HXL"]]

    core_attributes = pd.read_excel(local_data_file, sheet_name='Core attributes')
    core_attributes = core_attributes.loc[core_attributes["Status"] == "Released"]
    core_attributes = core_attributes[["Attribute", "Attribute long description", "Suggested hashtags (selected)"]]

    print(core_hashtags.shape)
    print(core_attributes.shape)

    core_hashtags = core_hashtags.to_dict(orient='records')
    core_attributes = core_attributes.to_dict(orient='records')

    hxl_prompt = f"""
You are an AI assistant that predicts Humanitarian Exchange Language (HXL) tags and attributes for columns of data where the HXL standard is defined as follows:

CORE HASHTAGS:
{json.dumps(core_hashtags, indent=4)}

CORE ATTRIBUTES:
{json.dumps(core_attributes, indent=4)}

Key points:
- ALWAYS predict hash tags
- NEVER predict a tag which is not a valid core hashtag
- NEVER start with an attribute, you must always start with a core hashtag
- Always try and predict an attribute if possible
- Do not use attribute +code if the data examples are human readable names

You must return your result as a JSON record with the fields 'predicted' and 'reasoning', each is of type string.
"""

    print(len(hxl_prompt.split(" ")))
    print(hxl_prompt)

    return hxl_prompt
This gives us a prompt like this …
You are an AI assistant that predicts Humanitarian Exchange Language (HXL) tags and attributes for columns of data where the HXL standard is defined as follows:

CORE HASHTAGS:
[
{
"Hashtag": "#access",
"Hashtag long description": "Accessiblity and constraints on access to a market, distribution point, facility, etc.",
"Sample HXL": "#access +type"
},
{
"Hashtag": "#activity",
"Hashtag long description": "A programme, project, or other activity. This hashtag applies to all levels; use the attributes +activity, +project, or +programme to distinguish different hierarchical levels.",
"Sample HXL": "#activity +project"
},
{
"Hashtag": "#adm1",
"Hashtag long description": "Top-level subnational administrative area (e.g. a governorate in Syria).",
"Sample HXL": "#adm1 +code"
},
{
"Hashtag": "#adm2",
"Hashtag long description": "Second-level subnational administrative area (e.g. a subdivision in Bangladesh).",
"Sample HXL": "#adm2 +name"
},
{
"Hashtag": "#adm3",
"Hashtag long description": "Third-level subnational administrative area (e.g. a subdistrict in Afghanistan).",
"Sample HXL": "#adm3 +code"
},
{
"Hashtag": "#adm4",
"Hashtag long description": "Fourth-level subnational administrative area (e.g. a barangay in the Philippines).",
"Sample HXL": "#adm4 +name"
},
{
"Hashtag": "#adm5",
"Hashtag long description": "Fifth-level subnational administrative area (e.g. a ward of a city).",
"Sample HXL": "#adm5 +code"
},
{
"Hashtag": "#affected",
"Hashtag long description": "Number of people or households affected by an emergency. Subset of #population; superset of #inneed.",
"Sample HXL": "#affected +f +children"
},
{
"Hashtag": "#beneficiary",
"Hashtag long description": "General (non-numeric) information about a person or group meant to benefit from aid activities, e.g. "lactating women".",
"Sample HXL": "#beneficiary +name"
},
{
"Hashtag": "#capacity",
"Hashtag long description": "The response capacity of the entity being described (e.g. "25 beds").",
"Sample HXL": "#capacity +num"
},
... Truncated for brevity
},
{
"Hashtag": "#targeted",
"Hashtag long description": "Number of people or households targeted for humanitarian assistance. Subset of #inneed; superset of #reached.",
"Sample HXL": "#targeted +f +adult"
},
{
"Hashtag": "#value",
"Hashtag long description": "A monetary value, such as the price of goods in a market, a project budget, or the amount of cash transferred to beneficiaries. May be used together with #currency in financial or cash data.",
"Sample HXL": "#value +transfer"
}
]
CORE ATTRIBUTES:
[
{
"Attribute": "+abducted",
"Attribute long description": "Hashtag refers to people who have been abducted.",
"Suggested hashtags (selected)": "#affected, #inneed, #targeted, #reached"
},
{
"Attribute": "+activity",
"Attribute long description": "The implementers classify this activity as an "activity" proper (may imply different hierarchical levels in different contexts).",
"Suggested hashtags (selected)": "#activity"
},
{
"Attribute": "+adolescents",
"Attribute long description": "Adolescents, loosely defined (precise age range varies); may overlap +children and +adult. You can optionally create custom attributes in addition to this to add precise age ranges, e.g. "+adolescents +age12_17".",
"Suggested hashtags (selected)": "#affected, #inneed, #targeted, #reached, #population"
},
{
"Attribute": "+adults",
"Attribute long description": "Adults, loosely defined (precise age range varies); may overlap +adolescents and +elderly. You can optionally create custom attributes in addition to this to add precise age ranges, e.g. "+adults +age18_64".",
"Suggested hashtags (selected)": "#affected, #inneed, #targeted, #reached, #population"
},
{
"Attribute": "+approved",
"Attribute long description": "Date or time when something was approved.",
"Suggested hashtags (selected)": "#date"
},
{
"Attribute": "+bounds",
"Attribute long description": "Boundary data (e.g. inline GeoJSON).",
"Suggested hashtags (selected)": "#geo"
},
{
"Attribute": "+budget",
"Attribute long description": "Used with #value to indicate that the amount is planned/approved/budgeted rather than actually spent.",
"Suggested hashtags (selected)": "#value"
},
{
"Attribute": "+canceled",
"Attribute long description": "Date or time when something (e.g. an #activity) was canceled.",
"Suggested hashtags (selected)": "#date"
},
{
"Attribute": "+children",
"Attribute long description": "The associated hashtag applies to non-adults, loosely defined (precise age range varies; may overlap +infants and +adolescents). You can optionally create custom attributes in addition to this to add precise age ranges, e.g. "+children +age3_11".",
"Suggested hashtags (selected)": "#affected, #inneed, #targeted, #reached, #population"
},
{
"Attribute": "+cluster",
"Attribute long description": "Identifies a sector as a formal IASC humanitarian cluster.",
"Suggested hashtags (selected)": "#sector"
},
{
"Attribute": "+code",
"Attribute long description": "A unique, machine-readable code.",
"Suggested hashtags (selected)": "#region, #country, #adm1, #adm2, #adm3, #adm4, #adm5, #loc, #beneficiary, #activity, #org, #sector, #subsector, #indicator, #output, #crisis, #cause, #impact, #severity, #service, #need, #currency, #item, #need, #service, #channel, #modality, #event, #group, #status"
},
{
"Attribute": "+converted",
"Attribute long description": "Date or time used for converting a monetary value to another currency.",
"Suggested hashtags (selected)": "#date"
},
{
"Attribute": "+coord",
"Attribute long description": "Geodetic coordinates (lat+lon together).",
"Suggested hashtags (selected)": "#geo"
},
{
"Attribute": "+dest",
"Attribute long description": "Place of destination (intended or actual).",
"Suggested hashtags (selected)": "#region, #country, #adm1, #adm2, #adm3, #adm4, #adm5, #loc"
},
{
"Attribute": "+displaced",
"Attribute long description": "Displaced people or households. Refers to all types of displacement: use +idps or +refugees to be more specific.",
"Suggested hashtags (selected)": "#affected, #inneed, #targeted, #reached, #population"
},
{
"Attribute": "+elderly",
"Attribute long description": "Elderly people, loosely defined (precise age range varies). May overlap +adults. You can optionally create custom attributes in addition to this to add precise age ranges, e.g. "+elderly +age65plus".",
"Suggested hashtags (selected)": "#affected, #inneed, #targeted, #reached, #population"
},
... Truncated for brevity
{
"Attribute": "+url",
"Attribute long description": "The data consists of web links related to the main hashtag (e.g. for an #org, #service, #activity, #loc, etc).",
"Suggested hashtags (selected)": "#contact, #org, #activity, #service, #meta"
},
{
"Attribute": "+used",
"Attribute long description": "Refers to a #service, #item, etc. that affected people have actually consumed or otherwise taken advantage of.",
"Suggested hashtags (selected)": "#service, #item"
}
]
Key points:
- ALWAYS predict hash tags
- NEVER predict a tag which is not a valid core hashtag
- NEVER start with an attribute, you must always start with a core hashtag
- Always try and predict an attribute if possible
You must return your result as a JSON record with the fields 'predicted' and 'reasoning', each is of type string.
It’s pretty long (the above has been truncated), but encapsulates the HXL standard.
Another advantage of the direct prompt method is that we can also ask for the LLM to provide its reasoning when predicting HXL. This can of course include hallucination, but I’ve always found it useful for refining prompts.
For the user prompt, we will use the same information that we used for fine-tuning, including the data excerpt and LLM-generated table summary …
What are the HXL tags and attributes for a column with these details? resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/IFRC Appeals Data for South Sudan8.csv';
dataset_description='The dataset contains information on various
appeals and events related to South Sudan,
including details such as the type of appeal,
status, sector, amount requested and funded,
start and end dates, as well as country-specific
information like country code, region, and average
household size. The data includes appeals for
different crises such as floods, population
movements, cholera outbreaks, and Ebola preparedness,
with details on beneficiaries and confirmation needs.
The dataset also includes metadata such as IDs,
names, and translation modules for countries and regions.';
column_name:'aid';
examples: ['18401', '17770', '17721', '16858', '15268', '15113', '14826', '14230', '12788', '9286', '8561']
Putting it all together, and prompting both GPT-4o-mini and GPT-4o for comparison …
def call_gpt(prompt, system_prompt, model, temperature, top_p, max_tokens):
    """
    Calls the GPT model to generate a response based on the given prompt and system prompt.

    Args:
        prompt (str): The user's input prompt.
        system_prompt (str): The system's input prompt.
        model (str): The name or ID of the GPT model to use.
        temperature (float): Controls the randomness of the generated output. Higher values (e.g., 0.8) make the output more random, while lower values (e.g., 0.2) make it more deterministic.
        top_p (float): Controls the diversity of the generated output. Higher values (e.g., 0.8) make the output more diverse, while lower values (e.g., 0.2) make it more focused.
        max_tokens (int): The maximum number of tokens to generate in the response.

    Returns:
        dict or None: The generated response as a dictionary object, or None if an error occurred during generation.
    """
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt}
        ],
        max_tokens=max_tokens,
        temperature=temperature,
        top_p=top_p,
        frequency_penalty=0,
        presence_penalty=0,
        stop=None,
        stream=False,
        response_format={"type": "json_object"}
    )
    result = response.choices[0].message.content
    result = result.replace("```json", "").replace("```", "")
    try:
        result = json.loads(result)
        result["predicted"] = result["predicted"].replace(" ", "")
    except Exception:
        print(result)
        result = None
    return result
def make_prompt_predictions(prompts, model, temperature=0.1, top_p=0.1,
                            max_tokens=2000, debug=False, actual_field="actual"):
    """
    Generate predictions for a given set of prompts using the specified model.

    Args:
        prompts (pandas.DataFrame): A DataFrame containing the prompts to generate predictions for.
        model (str): The name of the model to use for prediction.
        temperature (float, optional): The temperature parameter for the model's sampling. Defaults to 0.1.
        top_p (float, optional): The top-p parameter for the model's sampling. Defaults to 0.1.
        max_tokens (int, optional): The maximum number of tokens to generate for each prompt. Defaults to 2000.
        debug (bool, optional): Whether to print debug information during prediction. Defaults to False.
        actual_field (str, optional): The name of the column in the prompts DataFrame that contains the actual values. Defaults to "actual".

    Returns:
        pandas.DataFrame: A DataFrame containing the results of the predictions, including the prompt, actual value, predicted value, and reasoning.
    """
    num_prompts = len(prompts)
    print(f"Number of prompts: {num_prompts}")

    results = []
    for index, p in prompts.iterrows():
        if index % 50 == 0:
            print(f"{index/num_prompts*100:.2f}% complete")

        prompt = p["prompt"]
        prompt = ast.literal_eval(prompt)
        prompt = prompt[1]["content"]

        actual = p[actual_field]

        result = call_gpt(prompt, hxl_prompt, model, temperature, top_p, max_tokens)
        if result is None:
            print(" !!!!! No LLM result")
            predicted = ""
            reasoning = ""
        else:
            predicted = result["predicted"]
            reasoning = result["reasoning"]

        if debug is True:
            print(f"Actual: {actual}; Predicted: {predicted}; Reasoning: {reasoning}")

        results.append({
            "prompt": prompt,
            "actual": actual,
            "predicted": predicted,
            "reasoning": reasoning
        })

    results = pd.DataFrame(results)

    print(f"\n\n===================== {model} Results =========================\n\n")
    output_prediction_metrics(results)
    print(f"\n\n=================================================================")

    results["match"] = results['predicted'] == results['actual']
    results.to_excel(f"{LOCAL_DATA_DIR}/hxl-metadata-prompting-only-prediction-{model}-results.xlsx", index=False)

    return results

for model in ["gpt-4o-mini", "gpt-4o"]:
    print(f"Model: {model}")
    results = make_prompt_predictions(X_test, model, temperature=0.1, top_p=0.1, max_tokens=2000)
We get …
===================== gpt-4o-mini Results =========================

LLM results for predicted, 458 predictions ...
Just HXL tags ...
Accuracy: 0.77
Precision: 0.83
Recall: 0.77
F1: 0.77
Tags and attributes with predicted ...
Accuracy: 0.53
Precision: 0.54
Recall: 0.53
F1: 0.5
===================== gpt-4o Results =========================
LLM results for predicted, 458 predictions ...
Just HXL tags ...
Accuracy: 0.86
Precision: 0.86
Recall: 0.86
F1: 0.85
Tags and attributes with predicted ...
Accuracy: 0.71
Precision: 0.7
Recall: 0.71
F1: 0.69
=================================================================
As a reminder, the fine-tuned model produced the following results …
Just HXL tags ...

Accuracy: 0.83
Precision: 0.85
Recall: 0.83
F1: 0.82
Tags and attributes with predicted ...
Accuracy: 0.61
Precision: 0.6
Recall: 0.61
F1: 0.57
How does prompting-only GPT-4o compare with GPT-4o-mini?
Looking at the above, we see that GPT-4o-mini prompting-only predicts just tags with 77% accuracy, which is less than GPT-4o-mini fine-tuning (83%) and GPT-4o prompting-only (86%). That said, the performance is still good and would improve HXL coverage even if used as-is.
How does prompting-only compare with the fine-tuned model?
GPT-4o prompting-only gave the best results of all models, with 86% accuracy on tags and 71% on tags and attributes. In fact, the performance could well be better still after a bit more analysis of the test data to correct incorrect human-labeled tags.
Let’s take a closer look at the times GPT-4o got it wrong …
df = pd.read_excel(f"{LOCAL_DATA_DIR}/hxl-metadata-prompting-only-prediction-gpt-4o-results.xlsx")
breaks = df[df["match"] == False]
print(breaks.shape)

for index, row in breaks.iterrows():
    print("\n======================================== ")
    pprint.pp(f"\nPrompt: {row['prompt']}")
    print()
    print(f"Actual", row["actual"])
    print(f"Predicted", row["predicted"])
    print()
    pprint.pp(f'Reasoning: \n{row["reasoning"]}')
'\n'
'Prompt: What are the HXL tags and attributes for a column with these '
'details? '
"resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/IFRC "
"Appeals Data for South Sudan8.csv'; dataset_description='The dataset "
'contains information on various appeals and events related to South Sudan, '
'including details such as the type of appeal, status, sector, amount '
'requested and funded, start and end dates, as well as country-specific '
'information like country code, region, and average household size. The data '
'includes appeals for different crises such as floods, population movements, '
'cholera outbreaks, and Ebola preparedness, with details on beneficiaries and '
'confirmation needs. The dataset also includes metadata such as IDs, names, '
"and translation modules for countries and regions.'; column_name:'dtype.id'; "
"examples: ['12', '5', '1', '1', '12', '12', '1', '6', '1', '1', '7']")Actual #cause+id
Predicted #meta+id
('Reasoning: \n'
"The column 'dtype.id' contains numeric identifiers (e.g., '12', '5', '1') "
'which are likely to be internal identifiers for data records. According to '
'the HXL standard, the appropriate hashtag for internal identifiers is '
"'#meta' with the attribute '+id'.")
========================================
('\n'
'Prompt: What are the HXL tags and attributes for a column with these '
'details? '
"resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/IFRC "
"Appeals Data for South Sudan8.csv'; dataset_description='The dataset "
'contains information on various appeals and events related to South Sudan, '
'including details such as the type of appeal, status, sector, amount '
'requested and funded, start and end dates, as well as country-specific '
'information like country code, region, and average household size. The data '
'includes appeals for different crises such as floods, population movements, '
'cholera outbreaks, and Ebola preparedness, with details on beneficiaries and '
'confirmation needs. The dataset also includes metadata such as IDs, names, '
"and translation modules for countries and regions.'; "
"column_name:'dtype.name'; examples: ['Flood', 'Population Movement', "
"'Epidemic', 'Epidemic', 'Flood', 'Flood', 'Epidemic', 'Complex Emergency', "
"'Epidemic', 'Epidemic', 'Civil Unrest']")
Actual #cause+name
Predicted #event+type
('Reasoning: \n'
"The examples provided in the column ('Flood', 'Population Movement', "
"'Epidemic', 'Complex Emergency', 'Civil Unrest') describe different types of "
'events or incidents within a crisis or emergency. According to the HXL '
'standard, the appropriate hashtag for this type of data is #event, and the '
'attribute +type is used to specify the type or category of the event.')
========================================
('\n'
'Prompt: What are the HXL tags and attributes for a column with these '
'details? '
"resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/IFRC "
"Appeals Data for South Sudan8.csv'; dataset_description='The dataset "
'contains information on various appeals and events related to South Sudan, '
'including details such as the type of appeal, status, sector, amount '
'requested and funded, start and end dates, as well as country-specific '
'information like country code, region, and average household size. The data '
'includes appeals for different crises such as floods, population movements, '
'cholera outbreaks, and Ebola preparedness, with details on beneficiaries and '
'confirmation needs. The dataset also includes metadata such as IDs, names, '
"and translation modules for countries and regions.'; "
"column_name:'status_display'; examples: ['Active', 'Active', 'Closed', "
"'Closed', 'Closed', 'Closed', 'Closed', 'Closed', 'Closed', 'Closed', "
"'Closed']")
Actual #status+name
Predicted #status+code
('Reasoning: \n'
"The column 'status_display' contains values such as 'Active' and 'Closed', "
'which describe the status of appeals or events. The appropriate HXL hashtag '
'for project or activity status is #status. Since the values are categorical '
'and represent different statuses, the attribute +code is suitable to '
'indicate these status codes.')
========================================
('\n'
'Prompt: What are the HXL tags and attributes for a column with these '
'details? '
"resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/IFRC "
"Appeals Data for South Sudan8.csv'; dataset_description='The dataset "
'contains information on various appeals and events related to South Sudan, '
'including details such as the type of appeal, status, sector, amount '
'requested and funded, start and end dates, as well as country-specific '
'information like country code, region, and average household size. The data '
'includes appeals for different crises such as floods, population movements, '
'cholera outbreaks, and Ebola preparedness, with details on beneficiaries and '
'confirmation needs. The dataset also includes metadata such as IDs, names, '
"and translation modules for countries and regions.'; "
"column_name:'region.id'; examples: ['0', '0', '0', '0', '0', '0', '0', '0', "
"'0', '0', '0']")
Actual #adm1+code
Predicted #region+id
('Reasoning: \n'
"The column 'region.id' contains numeric identifiers for regions, which "
'aligns with the HXL tag #region and the attribute +id. The examples provided '
'are all numeric, indicating that these are likely unique identifiers for '
'regions.')
========================================
Notice how we now have a ‘Reasoning’ field to indicate why the tags were chosen. This is useful, and would be an important input for refining the prompt to improve performance.
Looking at the sample above, we see some familiar scenarios that were found when analyzing the fine-tuned model’s failed predictions …
- +id and +code ambiguity
- #region and #adm1 used interchangeably
- #event versus more detailed tags like #cause
These seem to fall into the category where two tags are possible for a given column given their HXL definition. But there are some real discrepancies which would need more investigation.
That said, using GPT-4o to predict HXL tags and attributes yields the best results, and I believe at an acceptable level given a lot of data is missing HXL metadata altogether and many of the datasets which have it have incorrect tags and attributes.
Cost Comparison
Let’s see how costs compare with each technique and model …
import tiktoken

def num_tokens_from_string(string: str, encoding_name: str) -> int:
    """
    Returns the number of tokens in a text string using tiktoken.
    See: https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb

    Args:
        string (str): The text string to count the tokens for.
        encoding_name (str): The name of the encoding to use.

    Returns:
        num_tokens: The number of tokens in the text string.
    """
    encoding = tiktoken.get_encoding(encoding_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens

def calc_costs(data, model, method="prompting"):
    """
    Calculate token costs for a given dataset, model and method.
    Note: Only for inference costs, not fine-tuning

    Args:
        data (pandas.DataFrame): The data to get the tokens for.
        model (str): The model to use, eg "gpt-4o-mini"
        method (str, optional): The method to use. Defaults to "prompting".

    Returns:
        None: Prints input/output token counts and their costs.
    """
    # See https://openai.com/api/pricing/ (prices in $ per 1M tokens)
    price = {
        "gpt-4o-mini": {
            "input": 0.150,
            "output": 0.600
        },
        "gpt-4o": {
            "input": 5.00,
            "output": 15.00
        }
    }

    input_tokens = 0
    output_tokens = 0
    for index, p in data.iterrows():
        prompt = p["prompt"]
        prompt = ast.literal_eval(prompt)
        input = prompt[1]["content"]

        # If prompting, we must include the system prompt
        if method == "prompting":
            input += " " + hxl_prompt

        output = p["Corrected actual"]

        input_tokens += num_tokens_from_string(str(input), "cl100k_base")
        output_tokens += num_tokens_from_string(str(output), "cl100k_base")

    input_cost = input_tokens / 1000000 * price[model]["input"]
    output_cost = output_tokens / 1000000 * price[model]["output"]

    print(f"\nFor {data.shape[0]} table columns where we predicted HXL tags ...")
    print(f"{method} prediction with model {model}, {input_tokens} input tokens = ${input_cost}")
    print(f"{method} prediction with model {model}, {output_tokens} output tokens = ${output_cost}\n")

hxl_prompt = generate_hxl_standard_prompt(HXL_SCHEMA_LOCAL_FILE)

X_test2 = pd.read_excel(f"{LOCAL_DATA_DIR}/hxl-metadata-fine-tune-prediction-results-review.xlsx", sheet_name=0)

calc_costs(X_test2, method="fine-tuning", model="gpt-4o-mini")
calc_costs(X_test2, method="prompting", model="gpt-4o-mini")
calc_costs(X_test2, method="prompting", model="gpt-4o")
Which gives …
For 458 table columns where we predicted HXL tags ...
fine-tuning prediction with model gpt-4o-mini, 99738 input tokens = $0.014960699999999999
fine-tuning prediction with model gpt-4o-mini, 2001 output tokens = $0.0012006

For 458 table columns where we predicted HXL tags ...
prompting prediction with model gpt-4o-mini, 2688812 input tokens = $0.4033218
prompting prediction with model gpt-4o-mini, 2001 output tokens = $0.0012006

For 458 table columns where we predicted HXL tags ...
prompting prediction with model gpt-4o, 2688812 input tokens = $13.44406
prompting prediction with model gpt-4o, 2001 output tokens = $0.030015000000000003
Note: the above is only the inference cost; there will be a very small additional cost in generating table data summaries with GPT-3.5-Turbo.
Given the test set, predicting HXL for 458 columns …
Fine-tuning:
As expected, inference costs for the fine-tuned GPT-4o-mini model (which cost about $7 to fine-tune) are very low, at about $0.02.
Prompting-only:
- GPT-4o prompting-only is expensive, because the HXL standard is passed into the system prompt every time, and comes out at $13.44.
- GPT-4o-mini, albeit with reduced performance, is a more reasonable $0.40.
So ease of use comes with a cost if using GPT-4o, but GPT-4o-mini is an attractive alternative.
Finally, it’s worth noting that in many cases, setting HXL tags might not need to happen in real time, for example for a crawler process that corrects already-uploaded datasets. This would mean that the new OpenAI batch API could be used, reducing costs by 50%.
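As a rough sketch of what that might look like (not something done in this analysis; the file name and custom_id scheme are illustrative), the same prompts can be written to a JSONL request file and submitted via the Batch API, with results returned within 24 hours at half the price …

import json

# 'prompts' is assumed to be a list of the user prompts built earlier,
# and 'hxl_prompt' the HXL standard system prompt
with open("hxl_batch_requests.jsonl", "w") as f:
    for i, user_prompt in enumerate(prompts):
        request = {
            "custom_id": f"hxl-col-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4o-mini",
                "messages": [
                    {"role": "system", "content": hxl_prompt},
                    {"role": "user", "content": user_prompt},
                ],
                "temperature": 0.1,
                "response_format": {"type": "json_object"},
            },
        }
        f.write(json.dumps(request) + "\n")

# Upload the request file and start the batch job
batch_file = client.files.create(file=open("hxl_batch_requests.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(batch.id, batch.status)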
A Python class for predicting HXL Tags
Putting this all together, I created a GitHub Gist hxl_utils.py. Download this from GitHub and place the file in your current working directory.
Let’s download a file to test it with …
# See HDX for this file: https://data.humdata.org/dataset/sudan-acled-conflict-data
DATAFILE_URL="https://data.humdata.org/dataset/5efad450-8b15-4867-b7b3-8a25b455eed8/resource/3352a0d8-2996-4e70-b618-3be58699be7f/download/sudan_hrp_civilian_targeting_events_and_fatalities_by_month-year_as-of-25jul2024.xlsx"
local_data_file = f"{LOCAL_DATA_DIR}/{DATAFILE_URL.split('/')[-1]}"

# Save data file locally
urllib.request.urlretrieve(DATAFILE_URL, local_data_file)
# Read it to get a dataframe
df = pd.read_excel(local_data_file, sheet_name=1)
And using this dataframe, let’s predict HXL tags …
from hxl_utils import HXLUtils

hxl_utils = HXLUtils(LOCAL_DATA_DIR, model="gpt-4o")
data = hxl_utils.add_hxl(df,"sudan_hrp_civilian_targeting_events_and_fatalities_by_month-year_as-of-25jul2024.xlsx")
print("nnAFTER: nn")
display(data)
And there we have it, some lovely HXL tags!
Let’s see how well GPT-4o-mini does …
hxl_utils = HXLUtils(LOCAL_DATA_DIR, model="gpt-4o-mini")
data = hxl_utils.add_hxl(df,"sudan_hrp_civilian_targeting_events_and_fatalities_by_month-year_as-of-25jul2024.xlsx")
Which gives …
Pretty good! gpt-4o gave “#affected+killed+num” for the last column, whereas “gpt-4o-mini” gave “#affected+num”, but this could likely be resolved with some deft prompt engineering.
Admittedly this wasn’t a terribly challenging dataset, but it was able to correctly predict tags for events and fatalities, which are less frequent than location and dates.
Future Work
I think a big takeaway here is that the direct-prompting technique produces good results without the need for training. Yes, it is more expensive at inference time, but perhaps not overall if a data scientist is required to curate incorrectly human-labeled fine-tuning data. It would depend on the organization and the metadata use-case.
Here are some areas that might be considered in future work …
Improved test data
This analysis did a quick review of the test set to correct HXL tags which were incorrect in the data or had multiple possible values. More time could be spent on this; as always in machine learning, ground truth is key.
Prompt engineering and hyperparameter tuning
The above analysis uses very basic prompts with no real engineering or strategies applied; these could definitely be improved for better performance. With an evaluation set and a framework such as Promptflow, prompt variants could be tested. Additionally, we might add more context data, for example when deciding administrative levels, which can vary per country. Finally, we have used fixed hyperparameters for temperature and top_p, as well as completion token length; all of these could be tuned, likely leading to better performance.
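For example (purely illustrative, not something done in this analysis), a simple sweep over temperature and top_p could reuse the make_prompt_predictions helper above and compare metrics on the corrected test set …

# Illustrative hyperparameter sweep; values chosen arbitrarily
for temperature in [0.0, 0.1, 0.5]:
    for top_p in [0.1, 0.5, 1.0]:
        print(f"temperature={temperature}, top_p={top_p}")
        make_prompt_predictions(
            X_test, "gpt-4o-mini",
            temperature=temperature, top_p=top_p, max_tokens=2000
        )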
Cost optimization
The prompting-only approach definitely appears to be a strong option and simplifies how an organization can automatically set HXL tags on their data using GPT-4o. There are of course cost implications, this model being more expensive, but predictions occur only on low-volume schema changes, not when the underlying data itself changes, and with new options for batch submission on OpenAI and ever-decreasing LLM costs, this technique appears viable for many organizations. GPT-4o-mini also performs well and is a fraction of the cost.
Application to other metadata standards
It would be interesting to apply this technique to other metadata and labeling standards; I’m sure many organizations are already using LLMs for this.
Please like this article if inclined and I’d be delighted if you followed me! You can find more articles here.