Yesterday, I did two outdoor activities and took some pictures. When I create a picture gallery, I like to add a few bits of annotation and commentary to remember the event and the place.

Vision models can help by automatically captioning the pictures, e.g. “this picture shows a motorcycle and an SUV in a gravel pit …”. Having the computer generate this caption is helpful to:

  1. have the picture’s alt text pre-populated; accessibility matters.
  2. use the automatic captions as a starting point for adding more color commentary.

I wanted to see what we can do with the new vision models. The first stop was Claude.

With Claude

[Image: describing the picture with Claude]

This is an excellent description. I can go two directions from here:

  1. use the Anthropic/Claude API to generate image captions and integrate that into the gallery generation workflow (a rough sketch of such a call follows this list)
  2. run a local model on my laptop and see if it can do an equivalent or better job.
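
For direction 1, the call could look roughly like the sketch below. This is a sketch, not something wired into the gallery yet: it assumes the anthropic Python SDK with an ANTHROPIC_API_KEY in the environment, and the model name and prompt are placeholders.

# Sketch: caption a local image with the Anthropic Messages API.
# Assumes the `anthropic` SDK is installed and ANTHROPIC_API_KEY is set;
# the model name and prompt are placeholders.
import base64
import sys

import anthropic

image_path = sys.argv[1]
with open(image_path, "rb") as f:
    image_data = base64.standard_b64encode(f.read()).decode("utf-8")

client = anthropic.Anthropic()
message = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=200,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image",
             "source": {"type": "base64",
                        "media_type": "image/webp",  # must match the actual file type
                        "data": image_data}},
            {"type": "text",
             "text": "Describe this photograph in one or two sentences, suitable for an alt attribute."},
        ],
    }],
)
print(message.content[0].text)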

So, I went to Huggingface and looked for some “Image captioning” models.

With Salesforce/blip-image-captioning-large model

Huggingface page.

Program (from the hf page):

#!/usr/bin/env uv run
 
# /// script
# requires-python = ">=3.12"
# dependencies = [
#     "pillow",
#     "requests",
#     "transformers",
#     "torch",
# ]
# ///
import sys
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration
 
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large")
 
if len(sys.argv) < 2:
    print("Usage: describe-picture.py <image-url>")
    sys.exit(1)
img_url = sys.argv[1]
# fetch the image over HTTP; for a local file, Image.open(img_url) works too
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')
 
# conditional image captioning
text = "a photograph of"
inputs = processor(raw_image, text, return_tensors="pt")
 
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
 
# unconditional image captioning
inputs = processor(raw_image, return_tensors="pt")
 
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
 

Script execution:

./describe-picture.py https://files.btbytes.com/pictures/2024/08/._IMG_4019.HEIC.webp

output from Salesforce/blip-image-captioning-large:

a photograph of a silver car and a black motorcycle parked in a gravel lot
arafed view of a car and a motorcycle parked in a gravel lot

This is still quite decent, though not as good as Claude.

TODO

  • accomplish the same by calling an API instead of uploading the picture to Claude by hand (the sketch after the “two directions” list above is a start)
  • explore better local models
  • if the threshold for acceptance is lower (e.g. I only want a broad description of the scene: correct, but not too specific), what’s the smallest model I can run locally? A sketch of trying a smaller checkpoint follows this list.
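
For that last point, one way to experiment is to swap in a smaller checkpoint and let the transformers image-to-text pipeline do the plumbing. A rough sketch; Salesforce/blip-image-captioning-base is just one candidate, and any image-to-text checkpoint from the hub could go in its place:

#!/usr/bin/env uv run
 
# /// script
# requires-python = ">=3.12"
# dependencies = [
#     "pillow",
#     "transformers",
#     "torch",
# ]
# ///
# Sketch: caption with a smaller checkpoint via the image-to-text pipeline.
# The pipeline accepts a local path or an http(s) URL directly.
import sys
from transformers import pipeline

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
print(captioner(sys.argv[1])[0]["generated_text"])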