May 30th, 2025

Efficient Ground Truth Generation for Search Evaluation

Introduction

Cold start problems are a common challenge in AI-driven search and retrieval systems, particularly when there is little to no historical data to guide relevance assessments. In our recent project, we faced this issue while developing an information retrieval system for a customer with a vast collection of technical documents. To address this, we devised a structured approach to build a reliable ground truth dataset, leveraging a combination of Text REtrieval Conference (TREC) pooling and GPT-4o assisted ranking.

This article outlines our methodology, detailing how we streamlined the labor-intensive process of manual labeling while ensuring high-quality search evaluation.

Challenges in generating ground truth

Ground truth datasets are often created by experts who manually label the data, ensuring that the results are accurate and reliable. This process is labor- and time-intensive, and it is difficult to secure dedicated expert time for providing the required judgments.

In this engagement:

  • Domain experts were on the production floor, and we had limited time with them.
  • To provide accurate answers to the queries, domain experts would have to scan numerous files (some significantly large, spanning hundreds of pages). For instance, with 1,839 documents indexed with a page-chunking strategy and 50 queries, a domain expert would need to make 91,950 page-level relevance judgments, which is not feasible.

These challenges make traditional approaches to collecting ground truth impractical. We therefore developed a strategy to build a reliable and robust ground truth dataset while making the most of the limited availability of domain experts. The solution combines the TREC pooling approach, GPT-4o assisted ranking, and a custom labeling tool, as presented in the following sections.

Process

The process of building a ground truth dataset starts with collecting the user queries. Once the queries are gathered, we apply the TREC pooling approach, which manages the high cost of manual document labeling by focusing on the subset of documents most likely to be relevant. The resulting question-and-answer (Q&A) pairs are then validated by multiple subject matter experts (SMEs) using the labeling tool developed by our team, as illustrated in Figure 1 below.

Figure 1: Process overview

TREC Pooling Approach

Pooling [4, 5, 6] is a well-known method used in the TREC evaluations [2, 7] to address the huge cost of manually labeling every document in a large collection. The core idea behind pooling is to focus human assessors’ attention on a subset (or “pool”) of documents that are most likely to be relevant as demonstrated in [1].

Why is TREC Pooling Effective?

In typical TREC evaluations [1, 3], multiple search systems (or different configurations of a single system) retrieve documents for the same queries. Each system outputs its top $k$ ranked documents, where $k$ is often 10, 50, or 100. These results are then combined into a single pool of documents for each query.

By focusing on the top results, we significantly reduce the labeling workload while maintaining a high likelihood of capturing the most relevant documents.

However, this approach involves a trade-off; while some relevant documents may occasionally fall outside the top k, the overall efficiency and cost savings of labeling only the top-ranked results generally outweigh this limitation.

Ground Truth Collection

Steps 1 and 2 summarize how we collect search results from various methods and create a unified set of documents.

  1. We utilized hybrid, text-based, and vector searches on our source documents, retrieving the top 100 results for each query. This selection assumes that documents ranked beyond 100 are likely irrelevant, given the effectiveness of these methods.
  2. Upon gathering the top 100 results from each method, duplicates are eliminated to establish a distinct set of candidate documents for each query (see the sketch after this list). Because the top lists of the various systems overlap, this merged, deduplicated list typically contains fewer than $100 \times \text{number of methods}$ documents.
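
To make the pooling step concrete, here is a minimal sketch of how a per-query pool can be assembled, assuming each retrieval method exposes a callable that returns a ranked list of document (page) IDs. The function names and data shapes are hypothetical and stand in for the actual search calls.

```python
from typing import Callable

# Hypothetical retrieval callable: takes a query and a cut-off k,
# returns a ranked list of document/page IDs (most relevant first).
RetrievalMethod = Callable[[str, int], list[str]]

def build_pool(query: str, methods: dict[str, RetrievalMethod], k: int = 100) -> list[str]:
    """Merge the top-k results of every retrieval method into one
    deduplicated candidate pool for a single query."""
    pool: list[str] = []
    seen: set[str] = set()
    for retrieve in methods.values():
        for doc_id in retrieve(query, k):
            if doc_id not in seen:  # keep the first occurrence only
                seen.add(doc_id)
                pool.append(doc_id)
    # Because the top lists overlap, len(pool) <= k * len(methods).
    return pool

# Example wiring (hybrid_search, text_search, and vector_search are placeholders
# for the project's actual search calls):
# pools = {q: build_pool(q, {"hybrid": hybrid_search,
#                            "text": text_search,
#                            "vector": vector_search}) for q in queries}
```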

Assumptions and Practical Considerations

  • We assume that the set of queries provided is representative and diverse, reflecting the actual questions technicians would ask on the production floor.
  • We assume our search systems are robust, so documents beyond the top 100 are unlikely to be relevant.
  • Missing some lower-ranked relevant documents is acceptable; TREC pooling balances coverage and cost effectively.
  • We continue to depend on domain experts to verify the actual relevance within the dataset. This procedure guarantees high-quality relevance assessments for the documents that are most likely to be significant.

By using TREC’s pooling strategy, we maintain the feasibility of our labeling tasks while retaining a high level of completeness. After defining the pool, these documents are input into our labeling tool for domain experts to evaluate.

Reducing manual labeling effort

To minimize the burden of manual labeling, we implemented a GPT-4o-based solution. It uses GPT-4o's multi-modal capabilities to analyze the full content of each page, including images, and assigns a relevancy score to each document.

Relevancy scoring process

Multi-Modal Document Review: GPT-4o can analyze both text and accompanying images or diagrams. By using an actual image (e.g., page 10 from a user manual) instead of only OCR-extracted text, it provides more context for ranking relevancy.

Relevancy scores (0 – 5): GPT-4o evaluates the relevance of each document/page on a scale from 0 to 5, with 5 being the highest relevance. This enables us to translate GPT-4o’s analysis into a numerical score.
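
As a rough illustration of this scoring step, the sketch below uses the OpenAI Python SDK to send a page image together with the query and asks for a rating from 0 to 5. The prompt wording, the single-digit reply parsing, and the plain OpenAI client setup are assumptions made for the example; the actual deployment (e.g., Azure OpenAI), prompt, and output handling may differ.

```python
import base64

from openai import OpenAI  # assumes the openai Python SDK is installed

client = OpenAI()  # illustrative client setup; credentials/endpoint will differ

def score_page(query: str, page_image_path: str) -> int:
    """Ask GPT-4o to rate how relevant a page image is to a query (0-5)."""
    with open(page_image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()

    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": (f"Query: {query}\n"
                          "On a scale of 0 (irrelevant) to 5 (highly relevant), "
                          "how relevant is this page to the query? "
                          "Answer with a single digit.")},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return int(response.choices[0].message.content.strip())
```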

Calibration Dataset and score threshold

To align GPT-4o’s scores with expert expectations, a subset (approximately 30% of the total set) is selected, and domain experts provide the correct answers (i.e., which documents/pages are relevant to each query). These expert judgments then serve as a reference against which GPT-4o’s predictions are compared.

Finding the right score threshold

By comparing the domain experts’ judgments from the calibration dataset with GPT-4o’s scores, we can set a threshold that captures most of the relevant documents. In this project, a score of 4 was chosen because it covers approximately 90% of the documents deemed relevant by the domain experts in the calibration set. Specifically, documents with scores of 4 or 5 are considered relevant, while those with scores below 4 (3, 2, 1, and 0) are not.
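
The threshold selection itself can be expressed as a small calibration routine. The sketch below uses hypothetical data structures and returns the highest score cut-off that still captures a target share (here 90%) of the pairs the experts judged relevant.

```python
def pick_threshold(
    gpt_scores: dict[tuple[str, str], int],  # (query_id, doc_id) -> GPT-4o score 0-5
    expert_relevant: set[tuple[str, str]],   # pairs experts judged relevant (non-empty)
    target_coverage: float = 0.9,
) -> int:
    """Return the highest threshold whose coverage of expert-judged
    relevant pairs meets the target."""
    for threshold in range(5, -1, -1):  # try 5, then 4, ... down to 0
        captured = sum(1 for pair in expert_relevant
                       if gpt_scores.get(pair, 0) >= threshold)
        if captured / len(expert_relevant) >= target_coverage:
            return threshold
    return 0

# Applying the calibrated threshold to the full pool:
# likely_relevant = {pair for pair, s in gpt_scores.items() if s >= threshold}
```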

Documents at or above the threshold are marked as “likely relevant” and queued for further review. Instead of 10,500 documents, experts might only review a few hundred, saving substantial time and effort. This GPT-4o-based ranking step refined our TREC pooling and labeling workflow, reducing manual work while maintaining the accuracy of our dataset.

Labeling tool for annotation

To improve efficiency and reduce errors in labeling for domain experts, we developed a labeling tool as a static website:

Inline Content Display: The tool presents the exact page content in the browser instead of just showing a filename and page number as shown in Figure 2. This feature helps save time and minimize context switching.

Relevance Scoring: Domain experts can assign a relevance score (e.g., 0, 1, etc.) directly on the displayed content, eliminating the need to manually open files or switch between Excel and a PDF viewer.

Export to QREL: Upon completion of labeling, the tool exports the judgments in the QREL format. The resulting file, which includes our queries and document IDs, is compatible with search evaluation tools like trec_eval. An example of this file is shown in Figure 2 below.
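
For reference, a QREL file is plain text with one judgment per line: query ID, an iteration column (conventionally 0 and ignored by trec_eval), document ID, and relevance label. A minimal export helper might look like the sketch below; the data structure and file names are illustrative.

```python
def export_qrels(judgments: dict[tuple[str, str], int], path: str) -> None:
    """Write expert judgments to a TREC qrels file.

    Each line has the form: <query_id> 0 <doc_id> <relevance>
    """
    with open(path, "w") as f:
        for (query_id, doc_id), relevance in sorted(judgments.items()):
            f.write(f"{query_id} 0 {doc_id} {relevance}\n")

# The resulting file can then be passed to trec_eval alongside a run file:
#   trec_eval qrels.txt system_run.txt
```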

Figure 2: Labeling tool

Results and Impact

The combination of TREC pooling, GPT-4o assisted ranking, and the labeling tool has yielded significant results, highlighting both time savings and efficiency gains in the document labeling process.

Time savings

The new methodology significantly cut down the number of documents needing manual review. Initially, 23 queries over 1,839 documents would have required 42,297 (23 × 1,839) labeling instances. Using the TREC pooling approach with GPT-4o relevance scoring reduced this to just 246 query-document pairs needing expert review, saving hours of labor and allowing experts to focus on likely relevant documents. Our team’s labeling tool further minimized the workload by displaying the exact page for review, enabling experts to assign relevance scores directly on the content without switching between files.

Accuracy and efficiency trade-off

Setting the relevance threshold to a score of 4 ensured that approximately 90% of the documents deemed relevant by domain experts were captured. While this meant that up to 10% of relevant documents might be missed, the trade-off proved acceptable given the substantial workload reduction. The calibration closely aligned GPT-4o’s predictions with expert judgments, preserving the integrity of the dataset.

References

  1. TREC: Continuing Information Retrieval’s Tradition of Experimentation
  2. Text Retrieval Conference (TREC) – the official website
  3. trec_eval GitHub repository — a popular open-source library for evaluating retrieval systems
  4. Multi-Drop Polling – RAD Data Communications/Pulse Supply. 2007.
  5. Performance bounds for the effectiveness of pooling in multi-processing systems
  6. Effectiveness of sample pooling strategies for diagnosis of SARS-CoV-2: Specimen pooling vs. RNA elutes pooling
  7. Chowdhury, G. (2007), “TREC: Experiment and Evaluation in Information Retrieval”, Online Information Review, Vol. 31 No. 5, pp. 717-718.