![](https://crypto4nerd.com/wp-content/uploads/2024/03/0vQ3vCzDhLIfw1lWS-1024x576.jpeg)
In this post, we will explore the capabilities of Vertex AI Search, specifically website search, a standout feature within Google Cloud’s Vertex AI platform. This unique feature taps into Google’s vast indices, enabling intelligent document search across thousands of websites and millions of web pages. We will demonstrate how this feature works and what you can do with it. The guide will walk you through a simple use case of finding documents pertaining to graduate programs from websites of global universities, with a focus on uncovering educational PDF documents that can provide valuable insights for prospective students.

The solution template covered in this article can be applied across any domain. In finance, it can find specific regulations, risk reports, or financial market analyses. In healthcare, it can locate medical research papers, patient case studies, or new treatment procedure documents. For the legal domain, it can help discover court rulings, law articles, or legal precedents.

Traditionally, acquiring such documents involves manual downloading or tedious scraping methods when scaled up. However, with the advent of Vertex AI Search, leveraging Google’s pre-existing search indices allows for a streamlined, efficient discovery process. If a document is indexed by Google and publicly available, it becomes readily accessible through this tool. Let’s dive into how you can leverage Vertex AI Search to unlock a world of information. This post includes all the source code needed to replicate the demo discussed here.
Discovering documents on the web is a challenging problem due to the vast and constantly evolving nature of the internet. The web comprises billions of pages, each with its unique content, structure, and metadata. This diversity and scale make it difficult to identify, access, and extract information efficiently. Scraping and crawling are traditional methods used to gather data from websites. However, these approaches can be exhaustive and complex due to several reasons: websites have different structures, which means a scraper built for one website might not work for another; web pages often change their layout or content, requiring constant updates to the scraping code; and webmasters might implement measures to block or limit scraping activities, making data extraction a moving target.
Moreover, the legal and ethical considerations surrounding web scraping add another layer of complexity. Websites have terms of service that may restrict automated access, and not respecting these can lead to legal consequences or being blocked from the site.

Instead of relying on custom scraping and crawling solutions, leveraging a tool like Vertex AI Search, which draws on Google’s indices, presents a more efficient alternative. Google has already indexed the web comprehensively, and Vertex AI Search can tap into this vast reservoir of indexed information, allowing users to perform targeted searches and extract data more effectively and accurately. Users can thereby bypass the challenges of manual scraping and crawling and access the needed information more efficiently and reliably, all while adhering to legal and ethical standards.
Google Cloud Platform’s Vertex AI is a comprehensive ML platform designed to simplify the deployment and scaling of ML models. It streamlines the entire ML workflow, from data preparation and model training to evaluation, deployment, and prediction. Vertex AI offers a variety of tools and services for both custom model development and AutoML. This allows users of all skill levels to build, experiment with, and deploy models more efficiently. With pre-trained models, a managed infrastructure for large-scale training, and seamless model serving capabilities, Vertex AI allows developers and scientists to focus on their core competencies while reducing the operational overhead of ML projects. Vertex AI also includes a suite of tools to support generative AI workloads, such as foundation model APIs, Vertex AI Search, Vertex AI Conversation, Model Garden, and more.
Vertex AI Search is a fully managed platform for developers to build Google-quality search experiences over websites, structured data, and unstructured data. It is a component within Vertex AI’s broad suite of generative AI tools that specializes in information retrieval and question answering, and it can be integrated into any generative AI application that uses your enterprise data. Currently, retrieval-augmented generation (RAG) is a popular architecture that combines large language models (LLMs) with a data retrieval system. By grounding LLM responses in your company’s data, Vertex AI Search improves accuracy, reliability, and relevance, all of which are crucial for real-world business applications. Although you have the option to build your own RAG-based search, architecting an end-to-end RAG pipeline is quite a complex process. Here’s where Vertex AI Search comes in as a ready-to-use RAG system for information retrieval. Vertex AI Search reduces the entire search and discovery process (ETL, OCR, chunking, embedding, indexing, storing, input cleaning, schema adjustments, information retrieval, and summarization) to a few simple clicks. This facilitates the building of RAG-powered apps using Vertex AI Search as your retrieval engine.
In terms of data security, when you use Vertex AI Search from Google Cloud, your data remains secure in your cloud instance. Google does not use or access your data to train models or for any other unauthorized purposes. Furthermore, Vertex AI Search complies with specific industry standards like HIPAA, the ISO 27000 series, and SOC 1/2/3. VPC Service Controls are in place to prevent the infiltration or exfiltration of data by customer employees. Vertex AI Search also provides Customer-Managed Encryption Keys (CMEK), which allow customers to encrypt their primary content with their own encryption keys.
The power of Vertex AI Search lies in its versatility, allowing it to be tailored to a broad range of domains.
Vertex AI Search lets you create search, recommendation, and question-answering applications linked to different types of data stores. These data stores include:
- Website Data: This feature allows indexing of website data, whether public or private. For instance, you can target specific domains such as `*.stanford.edu/*` or `*.columbia.edu/*`. Utilizing this capability provides two main advantages:
  - a) Leveraging Google Indices: Users can harness the power of Google’s indices to search for specific content within a subset of targeted websites (publicly available content), enabling efficient content mining from web pages or documents tailored to their search needs.
  - b) Domain Verification for Private Content: If dealing with private web content, domain verification enables more advanced functionalities. Beyond locating relevant webpages, Vertex AI Search can perform question answering directly on the HTML content of the webpage.
- Structured Data: Vertex AI Search supports search, recommendations, and question answering on structured data, such as tables in BigQuery or NDJSON files. This feature is suitable for various applications like e-commerce catalogs, movie directories, doctor listings, or private property information catalogs.
- Unstructured Data: This type supports search, recommendations, and question answering on documents and images, catering to scenarios like private research publications, medical research repositories, or domain-specific proprietary documents. Vertex AI Search supports search over documents in HTML, PDF with embedded text, and TXT formats. Additionally, PPTX and DOCX formats are available in Preview. Documents can be imported from Cloud Storage buckets or through streaming ingestion via RESTful CRUD APIs.
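To make the website data store patterns concrete, here is a minimal sketch of how a university homepage URL could be turned into a wildcard pattern of the form `*.stanford.edu/*`. The helper name `to_site_pattern` is hypothetical and not part of any Google library, and the pattern syntax is simply copied from the two examples in the list above; check the Vertex AI Search documentation for the exact forms it accepts.

```python
from urllib.parse import urlparse


def to_site_pattern(url: str) -> str:
    """Turn a homepage URL into a wildcard pattern such as *.stanford.edu/*.

    Hypothetical helper: the pattern shape mirrors the examples in this post.
    """
    # urlparse only fills in the host when a scheme is present, so add one if missing.
    if "://" not in url:
        url = "https://" + url
    host = urlparse(url).hostname or ""
    # Drop a leading "www." so the pattern is built on the bare domain.
    if host.startswith("www."):
        host = host[4:]
    if not host:
        raise ValueError(f"no host found in {url!r}")
    return f"*.{host}/*"
```

For example, `to_site_pattern("https://www.stanford.edu/")` returns `*.stanford.edu/*`, and a bare `columbia.edu` yields `*.columbia.edu/*`.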
In this blog post, we will primarily explore option 1.a, using publicly accessible website data to mine for graduate school handbooks for prospective students. First, we’ll understand the use case, then learn how to set up search indices for our list of university websites. Lastly, we’ll learn how to query the site index to find relevant documents.
Problem Statement:
The goal is to locate graduate school handbooks for specific programs from approximately 7,000 universities worldwide, specifically in PDF format. This involves collecting comprehensive information on various university programs to assist prospective students in their decision-making process. Consider similar use cases where a cluster of websites is connected to a use case or a domain, and you want to tailor your search for specific types of documents. This guide will help you accomplish this using Vertex AI Search. The source code and template covered in this post can be reused for your own use cases and domains.
Dataset
For this exercise, we start with a list of triples in CSV format. This file, named `entities.csv`, can be found in the `/data` folder inside the repository. The dataset consists of three columns: entity, URL, and country. It includes URLs from 7,000 universities.
Fundamentally, our use case requires the design of two workflows. The first involves segmenting the provided list and creating indices. The second workflow intelligently routes incoming user queries to the correct index containing the target university.
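As a rough sketch of the second workflow, routing comes down to looking up which batch a university belongs to and deriving the data store to query from that batch. The example below uses an in-memory SQLite database purely as a local stand-in for the mapping table; the table name `entity_batches`, the data store naming scheme `universities-batch-<id>`, and the two sample rows are all assumptions made for illustration, not values from the repository.

```python
import sqlite3
from typing import Optional


def build_demo_mapping() -> sqlite3.Connection:
    """Create an in-memory table mapping each entity to its batch (illustrative rows only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE entity_batches ("
        "entity TEXT PRIMARY KEY, url TEXT, country TEXT, batch_id INTEGER)"
    )
    conn.executemany(
        "INSERT INTO entity_batches VALUES (?, ?, ?, ?)",
        [
            # Made-up sample rows; the batch numbers are arbitrary.
            ("Stanford University", "https://www.stanford.edu", "United States", 0),
            ("Columbia University", "https://www.columbia.edu", "United States", 1),
        ],
    )
    return conn


def route(conn: sqlite3.Connection, entity: str) -> Optional[str]:
    """Return the (hypothetical) data store ID that holds the given entity, or None."""
    row = conn.execute(
        "SELECT batch_id FROM entity_batches WHERE lower(entity) = lower(?)",
        (entity,),
    ).fetchone()
    if row is None:
        return None
    # Assumed naming scheme: one data store per batch, identified by its batch ID.
    return f"universities-batch-{row[0]}"
```

A query mentioning Stanford would then be sent only to the data store returned by `route(conn, "Stanford University")`, instead of fanning out across every index.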
Let’s outline the initial steps for preparing the dataset for Vertex AI Search, including indexing and organizing the data for efficient searching. The overall process is illustrated in the figure above and can be broken down into four steps:
- Divide the input file into multiple files, each containing 50 URLs. Given that we have approximately 7,000 site URLs in our original list, this equates to around 140 search apps within Vertex AI Search, each mapped to its respective datastore.
- The partitioned files are then copied from local storage to Google Cloud Storage.
- We iterate through all the entities (universities) in each partition file, inputting the associated entity, site URL, and other details like country into a Cloud SQL table. Each entity is associated with its batch number.
- Utilize the Vertex AI Search API to establish a datastore and a search application for each partition or batch. The key here is creating an entry in the Cloud SQL table that captures the entity info alongside the batch ID, distinguishing each datastore by its batch ID. Sample rows from the Cloud SQL table are shown below.
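The partitioning and batch-tagging steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the CSV header names (`entity`, `url`, `country`) are guessed from the column description, the function names are hypothetical, and the sketch neither uploads files to Cloud Storage nor calls the Vertex AI Search API.

```python
import csv
import io
from itertools import islice

BATCH_SIZE = 50  # URLs per partition, as described in the steps above


def partition(rows, size=BATCH_SIZE):
    """Yield (batch_id, rows) pairs with at most `size` rows per batch."""
    it = iter(rows)
    batch_id = 0
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch_id, batch
        batch_id += 1


def mapping_records(rows, size=BATCH_SIZE):
    """Tag every row with its batch ID, ready to insert into the mapping table."""
    for batch_id, batch in partition(rows, size):
        for row in batch:
            yield {**row, "batch_id": batch_id}


if __name__ == "__main__":
    # Synthetic stand-in for entities.csv; the header names are an assumption.
    sample = io.StringIO(
        "entity,url,country\n"
        + "".join(f"University {i},https://example{i}.edu,Country\n" for i in range(120))
    )
    records = list(mapping_records(csv.DictReader(sample)))
    print(len(records), records[-1]["batch_id"])
```

With 7,000 rows and a batch size of 50, `partition` produces 7,000 / 50 = 140 batches, which matches the roughly 140 search apps mentioned in the first step.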