![](https://crypto4nerd.com/wp-content/uploads/2023/11/1IeAnahduYBx1R7lKxSMKMQ-1024x585.png)
To test the capabilities of a VLM, we first used two small subsets of product data: one consisting of 2,000 products from the posters category, and another containing 3,000 products from the women's dresses category. We used Google's Contrastive Captioner (CoCa) VLM to generate multimodal embeddings from the product images.
The Contrastive Captioner (CoCa) model employs an encoder-decoder architecture trained with both a contrastive loss and a captioning loss, producing aligned image and text embeddings. It performs well in zero-shot scenarios, particularly image classification and cross-modal retrieval.
The model generated a 1,408-dimensional vector for each product, which we indexed using Google's Vertex AI Vector Search.
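As an illustration, here is a minimal sketch of the embedding step, assuming the CoCa-based `multimodalembedding@001` model exposed through the Vertex AI SDK; the project ID and image path are placeholders, not values from our setup:

```python
import vertexai
from vertexai.vision_models import Image, MultiModalEmbeddingModel

# Placeholder project/location; replace with your own GCP settings.
vertexai.init(project="my-project", location="us-central1")

# Google's CoCa-based multimodal embedding model on Vertex AI.
model = MultiModalEmbeddingModel.from_pretrained("multimodalembedding@001")

def embed_product_image(path: str) -> list[float]:
    """Return a 1,408-dimensional embedding for a product image."""
    image = Image.load_from_file(path)
    result = model.get_embeddings(image=image, dimension=1408)
    return result.image_embedding

# Example: embed one product image from the posters subset (placeholder path).
vector = embed_product_image("data/posters/product_001.jpg")
print(len(vector))  # 1408
```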
Indexing makes fast retrieval possible at query time. Then it was time to try out the new search! The following steps happened under the hood after we typed in a search query and hit the search button (a code sketch follows the list):
- The search query was converted into a multimodal embedding vector.
- This vector was compared against the already indexed product embeddings using approximate nearest neighbors (ANN).
- The results were displayed, sorted from most to least similar.
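A minimal sketch of that query-time flow, assuming the product embeddings sit in a Vertex AI Vector Search (Matching Engine) index and reusing the `model` object from the earlier sketch; the endpoint resource name and deployed index ID are placeholders:

```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")  # placeholders

# Hypothetical index endpoint holding the 1,408-dim product embeddings.
endpoint = aiplatform.MatchingEngineIndexEndpoint(
    index_endpoint_name="projects/123/locations/us-central1/indexEndpoints/456"
)

# 1. Embed the text query with the same multimodal model used for the images.
query_vector = model.get_embeddings(
    contextual_text="red summer dress", dimension=1408
).text_embedding

# 2. Approximate nearest neighbour (ANN) lookup against the indexed products.
matches = endpoint.find_neighbors(
    deployed_index_id="products_deployed_index",  # placeholder
    queries=[query_vector],
    num_neighbors=20,
)

# 3. Results come back ordered by similarity (closest first).
for neighbor in matches[0]:
    print(neighbor.id, neighbor.distance)
```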
After thorough testing, and keeping in mind that only the product images (no product metadata) were used to produce these results, we were quite happy that we could, for example:
Search for attributes effectively:
Search for styles pretty accurately:
Search for abstract meanings:
Search for text, and the model actually understands the text shown on the image:
Although we were quite amazed by the capabilities of such a model, we also found the weak sides of relying solely on multimodal embeddings derived from the product images:
Searching for specific brands, product names or categories did not yield the best results:
Searching in different languages sometimes returned unwanted products: