Brandon Smith

Research Publications

Evaluating Chunking Strategies for Retrieval

Technical Report · July 2024 · RAG / Chunking

Smith, B., & Troynikov, A. (2024). Evaluating Chunking Strategies for Retrieval. Chroma Technical Report.

Abstract

This technical report introduces a token‑level evaluation framework that quantifies how different document chunking strategies affect retrieval performance in AI applications. Leveraging LLM‑generated query–excerpt pairs across five diverse corpora, the study measures precision, recall, and a proposed Intersection‑over‑Union (IoU) metric. Results show that the choice of chunking strategy can shift recall by up to 9 percentage points, with a 200‑token RecursiveCharacterTextSplitter (no overlap) offering a strong recall‑efficiency trade‑off. Two novel algorithms, ClusterSemanticChunker and LLMChunker, achieve the best IoU and the highest recall, respectively.
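
To illustrate the metrics above, here is a minimal sketch of token‑level precision, recall, and IoU computed over sets of token indices; the function shape and names are illustrative assumptions, not the report's reference implementation.

```python
# Hedged sketch: token-level retrieval metrics over token-index sets.
# "retrieved" holds the indices of tokens returned by the pipeline;
# "relevant" holds the indices of tokens in the ground-truth excerpt.
def token_level_metrics(retrieved: set[int], relevant: set[int]) -> dict[str, float]:
    intersection = retrieved & relevant
    union = retrieved | relevant
    return {
        "precision": len(intersection) / len(retrieved) if retrieved else 0.0,
        "recall": len(intersection) / len(relevant) if relevant else 0.0,
        "iou": len(intersection) / len(union) if union else 0.0,
    }

# A retrieved chunk covering tokens 100-299 against an excerpt spanning 150-349:
print(token_level_metrics(set(range(100, 300)), set(range(150, 350))))
# {'precision': 0.75, 'recall': 0.75, 'iou': 0.6}
```

Unlike recall alone, IoU also penalizes superfluous retrieved tokens, which is how it captures retrieval efficiency as well as accuracy.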

Key Contributions

  • A reproducible pipeline that synthesizes 472 token‑level query–excerpt pairs spanning 328k tokens across five public corpora.
  • Token‑level evaluation metrics—IoU, precision, and recall—that capture both accuracy and retrieval efficiency.
  • Benchmarking of nine chunking methods, including the new ClusterSemanticChunker and LLMChunker.
  • Open‑source codebase enabling community replication and extension of all experiments.
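
The recall‑efficient baseline highlighted in the abstract can be reproduced roughly as follows using LangChain's splitter; the tokenizer choice is an assumption, and this is a sketch of the configuration rather than the report's exact setup.

```python
# Hedged sketch of the 200-token, no-overlap RecursiveCharacterTextSplitter
# baseline; encoding_name is an assumed tokenizer, not specified by the report.
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",  # assumed tokenizer
    chunk_size=200,               # 200 tokens per chunk
    chunk_overlap=0,              # no overlap between chunks
)

with open("corpus.txt") as f:  # hypothetical corpus file
    chunks = splitter.split_text(f.read())
```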

Balancing the Picture: Debiasing Vision-Language Datasets with Synthetic Contrast Sets

arXiv Preprint · 2023 · NeurIPS Workshop

Smith, B., Farinha, M., Mackenzie Hall, S., Kirk, H. R., Shtedritski, A., & Bain, M. (2023). Balancing the Picture: Debiasing Vision‑Language Datasets with Synthetic Contrast Sets. arXiv preprint arXiv:2305.15407.

Abstract

Vision-language models are growing in popularity and public visibility as tools to generate, edit, and caption images at scale, but their outputs can perpetuate and amplify societal biases learned during pre-training on uncurated image-text pairs from the internet. Although debiasing methods have been proposed, we argue that these measurements of model bias lack validity due to dataset bias. We demonstrate that there are spurious correlations in COCO Captions, the most commonly used dataset for evaluating bias, between background context and the gender of people in situ. This is problematic because commonly used bias metrics (such as Bias@K) rely on per-gender base rates. To address this issue, we propose a novel dataset debiasing pipeline that augments the COCO dataset with synthetic, gender-balanced contrast sets, where only the gender of the subject is edited and the background is fixed. However, existing image editing methods have limitations and sometimes produce low-quality images, so we introduce a method to automatically filter the generated images based on their similarity to real images. Using our balanced synthetic contrast sets, we benchmark bias in multiple CLIP-based models, demonstrating how metrics are skewed by imbalance in the original COCO images. Our results indicate that the proposed approach improves the validity of the evaluation, ultimately contributing to a more realistic understanding of bias in vision-language models.
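
For context on the per-gender base-rate issue, a common formulation of Bias@K in the retrieval-bias literature is the signed gender skew of the top-K retrieved images; the sketch below uses that formulation, which may differ in detail from the paper's exact definition.

```python
# Hedged sketch of a Bias@K-style metric for text-to-image retrieval.
# ranked_genders: gender label of the person in each retrieved image,
# ordered by retrieval score. Returns a value in [-1, 1]; 0 is balanced.
def bias_at_k(ranked_genders: list[str], k: int) -> float:
    top_k = ranked_genders[:k]
    n_male = sum(1 for g in top_k if g == "male")
    n_female = sum(1 for g in top_k if g == "female")
    total = n_male + n_female
    return (n_male - n_female) / total if total else 0.0

print(bias_at_k(["male", "male", "female", "male", "female"], k=5))  # 0.2
```

Because the score depends on how many images of each gender are available to retrieve, an imbalanced evaluation set can make even a random ranker look biased, which is the validity problem the gender-balanced contrast sets address.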

Key Contributions

  • Novel methodology for generating synthetic contrast datasets using generative AI
  • Quantitative framework for measuring bias in Vision Language models
  • Identification of specific bias patterns in CLIP and similar models
  • Proposed mitigation strategies with empirical validation
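
The automatic quality filter described in the abstract can be pictured as an embedding-similarity gate over the synthetic edits; the model choice and threshold below are assumptions for illustration, not the paper's exact procedure.

```python
# Hedged sketch: keep a synthetic edit only if its CLIP image embedding
# stays close to the real source image. Model and threshold are assumed.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def keep_edit(real_path: str, edited_path: str, threshold: float = 0.85) -> bool:
    images = [Image.open(real_path), Image.open(edited_path)]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)  # unit-normalize
    return (feats[0] @ feats[1]).item() >= threshold  # cosine similarity
```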