Exploring possible use cases of Phi-3-Vision, a small yet powerful MLLM that can be run locally (with code examples)
Microsoft recently released Phi-3, a family of powerful small language models, along with a vision-language variant called Phi-3-vision-128k-instruct. This 4B-parameter model achieves impressive results on public benchmarks, surpassing GPT-4V in some cases and outperforming Gemini 1.0 Pro V on all but MMMU.
This blog post will explore how you can utilize Phi-3-vision-128k-instruct as a robust vision and text model in your data science toolkit. We’ll demonstrate its capabilities through various use cases, including:
- Optical Character Recognition (OCR)
- Image Captioning
- Table Parsing
- Figure Understanding
- Reading Comprehension on Scanned Documents
- Set-of-Mark Prompting
We’ll begin by providing a simple code snippet to run this model locally using transformers and bitsandbytes. Then, we’ll showcase an example for each of the use cases listed above.
Running the model locally:
Create a Conda Python environment and install PyTorch and the other Python dependencies:
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
pip install git+https://github.com/huggingface/transformers.git@60bb571e993b7d73257fb64044726b569fef9403 pillow==10.3.0 chardet==5.2.0 flash_attn==2.5.8 accelerate==0.30.1 bitsandbytes==0.43.1
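Before loading the model, it is worth a quick sanity check that PyTorch can actually see the GPU; the version and device name printed will of course depend on your machine:

import torch
# Confirm that a CUDA build of PyTorch is installed and a GPU is visible
print(torch.__version__)              # should report a CUDA build, e.g. 2.x+cu121
print(torch.cuda.is_available())      # should print True
print(torch.cuda.get_device_name(0))  # e.g. "NVIDIA GeForce RTX 3080"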
Then, we can run this script:
# Example inspired by https://huggingface.co/microsoft/Phi-3-vision-128k-instruct
# Import necessary libraries
from PIL import Image
import requests
from transformers import AutoModelForCausalLM
from transformers import AutoProcessor
from transformers import BitsAndBytesConfig
import torch
# Define model ID
model_id = "microsoft/Phi-3-vision-128k-instruct"
# Load processor
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
# Define BitsAndBytes configuration for 4-bit quantization
nf4_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_use_double_quant=True,
bnb_4bit_compute_dtype=torch.bfloat16,
)
# Load model with 4-bit quantization and map to CUDA
model = AutoModelForCausalLM.from_pretrained(
model_id,
device_map="cuda",
trust_remote_code=True,
torch_dtype="auto",
quantization_config=nf4_config,
)
# Define initial chat message with image placeholder
messages = [{"role": "user", "content": "<|image_1|>nWhat is shown in this image?"}]
# Download image from URL
url = "https://images.unsplash.com/photo-1528834342297-fdefb9a5a92b?ixlib=rb-4.0.3&q=85&fm=jpg&crop=entropy&cs=srgb&dl=roonz-nl-vjDbHCjHlEY-unsplash.jpg&w=640"
image = Image.open(requests.get(url, stream=True).raw)
# Prepare prompt with image token
prompt = processor.tokenizer.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
# Process prompt and image for model input
inputs = processor(prompt, [image], return_tensors="pt").to("cuda:0")
# Generate text response using model
generate_ids = model.generate(
**inputs,
eos_token_id=processor.tokenizer.eos_token_id,
max_new_tokens=500,
do_sample=False,
)
# Remove input tokens from generated response
generate_ids = generate_ids[:, inputs["input_ids"].shape[1] :]
# Decode generated IDs to text
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
# Print the generated response
print(response)
This code loads the Phi-3 model just like any other transformers model. We add bitsandbytes 4-bit quantization so that it fits in consumer-grade GPU memory.
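If you are curious how much the 4-bit quantization saves, transformers can report the model's memory footprint directly (the exact figure may vary slightly across library versions):

# Optional: check how much memory the quantized weights occupy
print(f"Model memory footprint: {model.get_memory_footprint() / 1024**3:.2f} GB")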
We use a simple prompt, `<|image_1|>\nWhat is shown in this image?`, where we reference the image and ask for a description of its contents. The prompt is processed together with the image (the same image as this blog's thumbnail) and fed through the model, which produces the following output:
The image shows a single yellow flower with a green stem against a blue background.
Once the model is loaded, processing and prediction take about 2 seconds on an RTX 3080.
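Since each use case below only changes the prompt and the image, it is convenient to wrap the inference steps above in a small helper. This is just a sketch that assumes the model and processor loaded earlier; the function name is my own:

# Convenience wrapper around the inference steps above.
# Assumes `model` and `processor` are already loaded as in the previous script.
def ask_phi3_vision(image, question, max_new_tokens=500):
    messages = [{"role": "user", "content": f"<|image_1|>\n{question}"}]
    prompt = processor.tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = processor(prompt, [image], return_tensors="pt").to("cuda:0")
    generate_ids = model.generate(
        **inputs,
        eos_token_id=processor.tokenizer.eos_token_id,
        max_new_tokens=max_new_tokens,
        do_sample=False,
    )
    # Keep only the newly generated tokens
    generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]
    return processor.batch_decode(
        generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )[0]

# Same question as above, now in one line
print(ask_phi3_vision(image, "What is shown in this image?"))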
Now that we are all set up, let's explore some use cases:
Optical Character Recognition (OCR)
We want to transcribe the text in an image. To do so, change the message and image URL lines in the code snippet above to:
messages = [{"role": "user", "content": "<|image_1|>nOCR the text of the image as is. OCR:"}]
url = "https://i.postimg.cc/fRFvFVyv/Screenshot-from-2024-05-22-20-55-43.png"
Input:
Output:
3 EXPERIMENTS
We show Position Interpolation can effectively extend context window up to 32 times of the original size, and such extension can be done with only several hundreds of training steps. We show the resulting models are strong LLMs with fully effective long context windows. We demonstrate its performance in a number of tasks including language modeling, passkey retrieval, and long doc-
ument summarization. We also present benchmark results of the extended models on the original LLaMA evaluation benchmarks.
The result is as expected! The full transcription took 6 seconds, which is a bit slow compared to a conventional OCR engine.
Image Captioning
The objective here is to caption a natural image:
messages = [{"role": "user", "content": "<|image_1|>nWhat is this image of?"}]
url = "https://i.postimg.cc/q7Wjtj3n/Screenshot-from-2024-05-22-20-56-56.png"
Input:
Output:
This image shows a plush toy resembling a character from a popular animated television series. The toy has a round body with a hood, and it is wearing a pair of glasses. The character’s face is not visible in the image.
Mostly right (no glasses 😞). Still, it is impressive that it recognized Kenny from an animated series.
Table Parsing
We want to see whether the model can parse the table into a machine-readable format:
messages = [{"role": "user", "content": "<|image_1|>nExtract the table content of the table in JSON format: "}]
url = "https://i.postimg.cc/W1YwqXk8/Screenshot-from-2024-05-22-21-41-03.png"
Input:
Output:
```json
[
{
"Benchmark": "Phi-3 Vision-128K-In1",
"MMMU": 40.2,
"MMBench": 80.5,
"ScienceQA": 90.8
},
{
"Benchmark": "LlaVA-1.6 Vicuna-7B",
"MMMU": 34.2,
"MMBench": 76.3,
"ScienceQA": 70.6
},
{
"Benchmark": "QWEN-VI Chat",
"MMMU": 39.0,
"MMBench": 75.8,
"ScienceQA": 67.2
}
]
```
It is indeed able to extract the content of the table, in a structured format, directly from the screenshot.
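Because the model wraps its answer in a Markdown code fence, a few extra lines are enough to turn the response into Python objects. This is a minimal sketch assuming the `response` string returned by the script above:

import json
import re

# Strip the ```json fence (if present) and parse the table rows
match = re.search(r"```json\s*(.*?)\s*```", response, re.DOTALL)
table_rows = json.loads(match.group(1) if match else response)

for row in table_rows:
    print(row["Benchmark"], row["MMMU"], row["MMBench"], row["ScienceQA"])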
Figure Understanding
We prompt the model to describe a figure:
messages = [{"role": "user", "content": "<|image_1|>nWhat is this figure about? "}]
url = "https://i.postimg.cc/Y0sRQR4Q/Screenshot-from-2024-05-22-21-47-39.png"
Input:
Output:
The figure is a bar chart comparing the performance of various machine learning models, with each model represented by a pair of stacked bars indicating the performance of the base model and the fine-tuned model.
Brief but correct!
Reading Comprehension on Scanned Documents
The goal is to ask questions about scanned or photographed text. This differs from Optical Character Recognition (OCR) in that we are not interested in simply transcribing the text. Instead, we want to assess whether the model can understand the written content and answer questions about it directly.
messages = [{"role": "user", "content": "<|image_1|>nWhat is this book about? "}]
url = "https://i.postimg.cc/rmHxPbQY/Screenshot-from-2024-05-22-20-56-21.png"
Input:
Output:
The image shows the cover of a book titled ‘SAM TALLENT RUNNING THE LIGHT’ with a synopsis describing the story of a comedian named Billy Ray Schafer who is dealing with personal and professional challenges. The book appears to be a work of fiction, possibly a novel, as it includes a quote from DOUG STANHOPE, ICONOCLAST, praising the book as brilliant and the best fictional representation of comedy.
The book title is wrong, but the MLLM was able to understand what the text in the picture is about and summarize it in one shot.
Set-of-Mark Prompting
Set-of-Mark (SoM) prompting uses interactive segmentation models to divide an image into regions and mark them with symbols, enabling large multimodal models to better understand and answer visually grounded questions.
To simplify things in this example, I marked the objects manually instead of using a segmentation model, and then referenced mark number 4 in my prompt:
messages = [{"role": "user", "content": "<|image_1|>nWhat is object number 4? "}]
url = "https://i.postimg.cc/fy0Lz798/scott-webb-p-0l-WFknspg-unsplash-2.jpg"
Input:
Output:
Object number 4 is a cactus with orange flowers in a pot.
The MLLM was able to understand my reference and answer my question accordingly.
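If you would rather not annotate images by hand, the marks can also be stamped on programmatically. Here is a minimal sketch using PIL's ImageDraw; the coordinates are placeholders for illustration, and a real Set-of-Mark pipeline would obtain them from a segmentation model as described in the paper:

from PIL import ImageDraw, ImageFont

def draw_marks(image, points):
    # Stamp numbered labels onto a copy of the image.
    # `points` is a list of (x, y) pixel coordinates; in a real SoM pipeline
    # these would come from a segmentation model, not be hard-coded.
    marked = image.copy()
    draw = ImageDraw.Draw(marked)
    font = ImageFont.load_default()
    for i, (x, y) in enumerate(points, start=1):
        draw.ellipse((x - 12, y - 12, x + 12, y + 12), fill="white", outline="black")
        draw.text((x - 4, y - 6), str(i), fill="black", font=font)
    return marked

# Placeholder coordinates, for illustration only
marked_image = draw_marks(image, [(100, 150), (300, 200), (450, 320), (600, 400)])
print(ask_phi3_vision(marked_image, "What is object number 4?"))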
Conclusion
So, there you have it! Phi-3-Vision is a powerful model for working with images and text: it can describe image content, extract text from images, and even answer questions about what it sees. Its small size of 4 billion parameters may limit its suitability for tasks that demand strong language skills, but most models in its class weigh in at 8B parameters or more, which makes its efficiency stand out. It shines in applications like document parsing, table structure understanding, and OCR in the wild, and its compact footprint makes it a good fit for edge devices or local consumer-grade GPUs, especially after quantization. It will be my go-to model in document parsing and understanding pipelines, as its zero-shot capabilities make it a capable tool for its modest size. Next, I will work on some LoRA fine-tuning scripts for this model to see how far I can push it on more specialized tasks.
References:
- Phi-3-Vision Online Demo: https://ai.azure.com/explore/models/Phi-3-vision-128k-instruct/version/1/registry/azureml
- Set-of-Mark Prompting: https://arxiv.org/abs/2310.11441