My motivation for this is to build a local system that can interpret both images and text from my clipboard, let me find stuff faster, and organize my clipboard automatically. AFAIK the only way to do this is with a vision-enabled model, and I want something very small that I can easily run on my tiny GPU; besides that, I'm not doing anything ultra complicated with the model, just general search for the moment.

For this I chose to go with CLIP from OpenAI, because it seems widely used. To begin with, it's nice to go to OpenAI's repository, where they explain a little bit about the model. To use it, I went with Hugging Face transformers instead of PyTorch directly; we can load the model and test it on an image-text pair this way:

import os

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("./test_image.jpg")
texts = ["mountain", "trees", "a cat"]

# Padding is just a way to make sure that all the inputs have the same length.
inputs = clip_processor(images=image, text=texts, return_tensors="pt", padding=True)
outputs = clip_model(**inputs)

I used this image:

On the outputs object produced by the model we have the text_embeds, the image_embeds and the logits_per_image, which is the (scaled) dot product of the text and image embeddings. We can see a visual representation of what's happening in the CLIP paper (although that figure is for training, it shows the dot product of both tensors):

The way to interpret the logit, or dot product, is that the bigger it is, the "closer" the image is to the corresponding text. The dot product represents the similarity between the vectors (it is also what the retrieval part of RAG applications uses, when you want to find similar texts from their embeddings).
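
To make this concrete we can poke at those tensors directly. As far as I can tell from the transformers implementation, the embeddings come back already L2-normalised, so their dot product is just the cosine similarity (the 512 below is the projection dimension of this particular checkpoint):

# One vector per text and one per image, projected to 512 dimensions
print(outputs.text_embeds.shape)   # torch.Size([3, 512])
print(outputs.image_embeds.shape)  # torch.Size([1, 512])

# Since the embeddings are already normalised, this is the cosine similarity
cosine_sim = outputs.image_embeds @ outputs.text_embeds.T
print(cosine_sim)  # one value per (image, text) pair, between -1 and 1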


There are some other things happening as well; for example, in the paper they clip the learnable temperature so that the logits are never scaled by more than 100, because of stability issues during training.
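
We can check that scale on the loaded model; it is stored as a log value, and for these weights it should come out at roughly 100. Multiplying the cosine similarities from above by it should reproduce logits_per_image (just a sanity check, not something you need in practice):

scale = clip_model.logit_scale.exp()  # learned temperature, roughly 100 for this checkpoint
print(torch.allclose(cosine_sim * scale, outputs.logits_per_image))  # True (up to float tolerance)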

If we check the logits for this image and text:

logits_per_image = outputs.logits_per_image


tensor([[24.5542, 20.8046, 16.2604]], grad_fn=<TBackward0>)

So it considers mountain > trees > cat (thank god it does) for this image. We can make it more meaningful by applying a softmax function to it, which basically normalizes this vector so it can be read as probabilities (between 0 and 1, summing to 1):

probs = logits_per_image.softmax(dim=1).detach().numpy()

probs.tolist()


[[0.9767741560935974, 0.022981563583016396, 0.00024426792515441775]]

# So about 98%, 2% and 0.02% for mountain, trees and a cat

I think it's kind of bad that trees scores so low compared to mountain, but the probabilities are relative: softmax makes the classes compete with each other, so a confident "mountain" pushes everything else down. It also depends on the image itself and on which classes it has to decide between.
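
You can see how relative it is by just dropping "mountain" from the candidates: reusing the raw logits from above, the softmax over the remaining two gives trees roughly 99%:

# Softmax over only "trees" and "a cat" (columns 1 and 2 of the logits)
print(logits_per_image[:, 1:].softmax(dim=1))
# roughly tensor([[0.99, 0.01]]) -> trees wins easily once mountain is out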

Ranking

We can also pass multiple images as input and rank them by how close they are to a given query:

image_names = os.listdir("./images")
images = [Image.open(f"./images/{i}") for i in image_names]
query = "robot"
inputs = clip_processor(images=images, text=query, return_tensors="pt", padding=True)
outputs = clip_model(**inputs)
logits_per_image = outputs.logits_per_image

This time we don't want softmax: since there is only one text query per image, it would always come out as 100%, there are no other options! Instead we take the raw logits and get the index of the highest one:

index = logits_per_image.argmax().item()
images[index]
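
If you want the full ranking rather than just the best match (logits_per_image has shape (num_images, 1) here), the same tensor can simply be sorted:

# Full ranking, best match first
order = logits_per_image.squeeze(1).argsort(descending=True)
ranked = [image_names[i] for i in order.tolist()]
print(ranked)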

Great! For the next steps, I need to merge it with a text model at the embedding level, so I can search over both text and images. Also, this is probably not optimal at all: I'm recomputing the image embeddings every time I run a query! It would be better to save these embeddings in a vector database and do retrieval by reusing the image embeddings, only calculating the text embedding for each query and the logits to rank them.
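
A rough sketch of what that could look like, reusing clip_model, clip_processor and the images list from above and keeping the cached vectors in a plain tensor rather than an actual vector database (get_image_features / get_text_features return the projected embeddings, so I normalise them by hand and rank by cosine similarity):

# One-off: embed every image once and keep the normalised vectors around
with torch.no_grad():
    image_inputs = clip_processor(images=images, return_tensors="pt")
    image_embeds = clip_model.get_image_features(**image_inputs)
image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)

# Per query: only the text goes through the model
def search(query, top_k=3):
    with torch.no_grad():
        text_inputs = clip_processor(text=[query], return_tensors="pt", padding=True)
        text_embeds = clip_model.get_text_features(**text_inputs)
    text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
    scores = (image_embeds @ text_embeds.T).squeeze(1)  # cosine similarity per image
    order = scores.argsort(descending=True)[:top_k]
    return [(image_names[i], scores[i].item()) for i in order.tolist()]

print(search("robot"))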