CoRECT: A Framework for Evaluating Embedding Compression Techniques

This project accompanies our paper on evaluating embedding compression techniques for dense retrieval at scale. We present CoRECT, a framework designed to systematically measure the impact of compression strategies on retrieval performance.

Our work addresses two main objectives:

  • Comparing embedding compression methods: We benchmark quantization, binarization, vector truncation, Principal Component Analysis (PCA), Locality-Sensitive Hashing (LSH), and Product Quantization (PQ).
  • Analyzing scalability: We evaluate how retrieval quality degrades or holds up when scaling the corpus from 10K to 100M passages and from 10K to 10M documents.

CoRE: Controlled Retrieval Evaluation

The CoRE benchmark is a key component of our framework. It enables controlled experiments by varying corpus size and document length independently. Built upon MS MARCO v2 and human relevance judgments (TREC DL 2023), CoRE provides:

  • Passage retrieval: 65 queries over 5 corpus sizes (10K to 100M)
  • Document retrieval: 55 queries over 4 corpus sizes (10K to 10M)

To ensure realistic evaluation:

  • Each query includes 10 high-quality relevant documents
  • We use a subsampling technique that retains hard distractors, drawn from top-ranked TREC DL system runs
  • Each query is paired with 100 mined distractors and additional random negatives

This design makes CoRE robust for analyzing how retrieval performance is affected by corpus complexity and size.

Quantization Methods

We implement and benchmark a range of compression techniques, including:

  • Scalar quantization: Values are mapped into fixed bit-width ranges (e.g., 8-bit → 256 bins), computed on a per-batch and per-dimension basis using percentile binning and equal-distance binning (a minimal sketch follows this list).
  • Binarization: Each embedding value is converted to 0 or 1, either via simple zero-thresholding or by using the median of each dimension as the threshold.
  • Casting: We type-cast the original embedding to other floating-point types, i.e., FP16, BF16, or FP8, depending on the tensor type used during model training.
  • Vector Truncation: We truncate embedding vectors by keeping only the first x dimensions, with x chosen according to the cutoff points of our models trained with Matryoshka Representation Learning. We also combine this approach with the three methods above.
  • Principal Component Analysis: We use PCA to reduce the number of dimensions per embedding vector, choosing the same numbers of dimensions as for vector truncation.
  • Locality-Sensitive Hashing: We use LSH to map the embeddings to a binary vector, choosing the output length of the binary embedding so that the original vector is compressed by a factor of 4, 8, 16, and 32, respectively.
  • Product Quantization: Lastly, we apply PQ, choosing six combinations of the number of subvectors and code size.
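
To make two of these variants concrete, the following is a minimal PyTorch sketch (illustrative only, not the framework's implementation) of equal-distance scalar quantization and zero-threshold binarization applied to a batch of embeddings:

import torch

def equal_distance_quantize(embeddings: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    # Map each dimension to 2**num_bits equally spaced bins between its per-batch min and max.
    num_bins = 2 ** num_bits
    lo = embeddings.min(dim=0, keepdim=True).values
    hi = embeddings.max(dim=0, keepdim=True).values
    scale = ((hi - lo) / (num_bins - 1)).clamp_min(1e-12)
    return torch.round((embeddings - lo) / scale).clamp(0, num_bins - 1).to(torch.uint8)

def zero_threshold_binarize(embeddings: torch.Tensor) -> torch.Tensor:
    # Positive values become 1, all others 0.
    return (embeddings > 0).to(torch.uint8)

codes = equal_distance_quantize(torch.randn(1_000, 1024))   # uint8 codes, 4x smaller than FP32
bits = zero_threshold_binarize(torch.randn(1_000, 1024))    # one bit of information per dimension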

Implementation Details

  • Compression is applied to batches of up to 50,000 vectors at once.
  • Quantization thresholds and similar embedding-dependent parameters are computed and applied per batch (see the sketch after this list).
  • Results are shown in different plots, depending on the dataset:
    • Heatmaps visualize the results that combine vector truncation with FP casting, equal-distance scalar quantization, and zero-threshold binarization.
    • Line charts show results on the CoRE dataset, giving insight into how compression methods perform as the difficulty of the retrieval task increases. These plots isolate vector truncation and percentile binning.
    • Pareto plots compare all evaluated methods at different compression ratios.
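
The batch-wise processing can be pictured roughly as follows (a sketch, not the framework's actual loop); compress() corresponds to the AbstractCompression interface introduced in the extension section below:

import torch

BATCH_SIZE = 50_000

def compress_corpus(embeddings: torch.Tensor, compression) -> torch.Tensor:
    # Embedding-dependent parameters (e.g., quantization thresholds) are
    # derived from each batch and applied only within that batch.
    compressed_batches = []
    for start in range(0, embeddings.shape[0], BATCH_SIZE):
        batch = embeddings[start:start + BATCH_SIZE]
        compressed_batches.append(compression.compress(batch))
    return torch.cat(compressed_batches, dim=0)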

These methods are integrated into our open-source evaluation framework, making it easy to reproduce results and test new methods.

Getting Started

To run the evaluation framework, follow these steps:

1. Clone the repository

git clone https://github.com/padas-lab-de/CoRECT.git

2. Install dependencies

There are two ways you can install the dependencies to run the code.

Using Poetry (recommended)

If you have the Poetry package manager for Python installed already, you can simply set up everything with:

poetry install
source $(poetry env info --path)/bin/activate

Once all dependencies are installed, the virtual environment is activated in your current shell and you can run the main corect command. You can leave the environment at any time with deactivate.

corect --help

To install new dependencies into an existing Poetry environment, run the following commands while the environment is activated:

poetry lock
poetry install

Using Pip (alternative)

You can also create a venv yourself and use pip to install dependencies:

python3 -m venv venv
source venv/bin/activate
pip install .

3. Run Evaluation Code

The evaluation code currently supports two dataset options: CoRE, a transformed version of the MS MARCO v2 dataset, and the public BeIR datasets. In addition to the dataset, the code also loads an embedding model with which the defined compression techniques are evaluated. The currently supported models are Jina V3 (jinav3), Multilingual-E5-Large-Instruct (e5), Snowflake-Arctic-Embed-m (snowflake), and Snowflake-Arctic-Embed-m-v2.0 (snowflakev2).

corect evaluate jinav3 core     # Evaluates Jina V3 on CoRE
corect evaluate e5 beir         # Evaluates E5-Multilingual on BeIR

The code downloads the respective datasets from Hugging Face and uses the chosen model to generate the embeddings. By default, the embeddings are stored locally for later re-evaluation; to avoid this, change line 177 in the evaluation.py script to None. The embeddings are then compressed using the compression methods specified in compression_registry.py.

After running the evaluation code, you will find the results in the results folder. The results are stored in a JSON file in a folder structure organized by model name and dataset. To share the results, copy the respective JSON file to the share_results folder. Default folders for storing results and embeddings can be changed in config.py. Results are stored in the following format:

{
    "ndcg_at_1": 0.38462,
    "ndcg_at_3": 0.33752,
    ...
    "rc_at_1000": {
        "relevant": 10.0,
        "distractor": 99.63077
    },
    "rankings": {
        "qid1": {
            "relevant": {"cid1": 0, "cid9": 5, ...},
            "distractor": {"cid3": 2, "cid5": 3, ...},
            "random": {"cid17": 1, "cid15": 11, ...}
        },
        ...
    }
}
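
To inspect a result file programmatically, something like the following works (the file path below is only an example; the actual name depends on the model, dataset, and corpus size):

import json
from pathlib import Path

# Hypothetical path; adjust to the model/dataset folder structure under results/.
result_file = Path("results") / "jinav3" / "core" / "pass_10k.json"

with result_file.open() as f:
    results = json.load(f)

print("nDCG@1:", results.get("ndcg_at_1"))
print("RC@1000:", results.get("rc_at_1000"))   # relevant vs. distractor counts in the top 1000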

4. Extend CoRECT

Add New Compression Technique

The currently implemented compression techniques can be found in the quantization folder. To add a new method, implement a class that extends AbstractCompression and add your custom compression technique via the compress() method:

import torch
from corect.quantization.AbstractCompression import AbstractCompression

PRECISION_TYPE = {
    "float16": 16,
    "bfloat16": 16,
}

class FloatingCompression(AbstractCompression):
    def __init__(self, precision_type: str = "float16"):
        assert precision_type in PRECISION_TYPE
        self.precision_type = precision_type

    def compress(self, embeddings: torch.Tensor) -> torch.Tensor:
        # Cast the embeddings to the requested lower-precision floating-point type.
        if self.precision_type == "float16":
            return embeddings.type(torch.float16)
        elif self.precision_type == "bfloat16":
            return embeddings.type(torch.bfloat16)
        else:
            raise NotImplementedError(
                f"Cannot convert embedding to invalid precision type {self.precision_type}!"
            )

To include your class in the evaluation, modify the add_compressions method in the compression registry to register your class with the compression methods dictionary:

from typing import Dict
from corect.quantization.AbstractCompression import AbstractCompression
from corect.quantization.FloatCompression import FloatCompression
# Assuming the new class from above was saved as corect/quantization/FloatingCompression.py:
from corect.quantization.FloatingCompression import PRECISION_TYPE, FloatingCompression

class CompressionRegistry:
    _compression_methods: Dict[str, AbstractCompression] = {}

    @classmethod
    def get_compression_methods(cls) -> Dict[str, AbstractCompression]:
        return cls._compression_methods

    @classmethod
    def clear(cls):
        cls._compression_methods.clear()

    @classmethod
    def add_baseline(cls):
        cls._compression_methods["32_full"] = FloatCompression("full")

    @classmethod
    def add_compressions(cls):
        # Add your compression method here to use it for evaluation.
        for precision, num_bits in PRECISION_TYPE.items():
            cls._compression_methods[f"{num_bits}_{precision}"] = FloatingCompression(precision)

You should now be able to evaluate your compression technique by running the evaluation script as described above.
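
Before launching a full run, you can sanity-check the new class on a random batch (assuming it was saved as corect/quantization/FloatingCompression.py, as in the import above):

import torch
from corect.quantization.FloatingCompression import FloatingCompression

compression = FloatingCompression("bfloat16")
compressed = compression.compress(torch.randn(8, 1024))
print(compressed.dtype, compressed.shape)   # expected: torch.bfloat16 torch.Size([8, 1024])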

Add New Model

New embedding models can be added by implementing the AbstractModelWrapper class, which requires implementing encoding functions for queries and documents. Any model available via transformers can be added easily. For reference, consider the example below:

from typing import List, Union
import torch
from transformers import AutoModel, AutoTokenizer
from corect.model_wrappers import AbstractModelWrapper
from corect.utils import cos_sim

def _last_token_pool(last_hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # Pool the hidden state of the last non-padding token of each sequence.
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        # With left padding, the last position always holds a real token.
        return last_hidden_states[:, -1]
    else:
        # With right padding, index the last non-padding token of each sequence.
        sequence_lengths = attention_mask.sum(dim=1) - 1
        batch_size = last_hidden_states.shape[0]
        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]

class Qwen3Wrapper(AbstractModelWrapper):
    def __init__(self, pretrained_model_name="Qwen/Qwen3-Embedding-0.6B"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(pretrained_model_name, trust_remote_code=True, torch_dtype=torch.float16)
        self.encoder.cuda()
        self.encoder.eval()
        self.tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name, trust_remote_code=True, padding_side='left')

    def _encode_input(self, sentences: List[str]) -> torch.Tensor:
        inputs = self.tokenizer(sentences, padding=True, truncation=True, return_tensors='pt', max_length=8192)
        inputs.to('cuda')
        # Disable gradient tracking during inference to save memory.
        with torch.no_grad():
            model_outputs = self.encoder(**inputs)
        outputs = _last_token_pool(model_outputs.last_hidden_state, inputs['attention_mask'])
        outputs = torch.nn.functional.normalize(outputs, p=2, dim=1)
        return outputs

    def encode_queries(self, queries: List[str], **kwargs) -> torch.Tensor:
        return self._encode_input(queries)

    def encode_corpus(self, corpus: Union[str, List[str]], **kwargs) -> torch.Tensor:
        if isinstance(corpus, str):
            corpus = [corpus]
        return self._encode_input(corpus)

    def similarity(self, embeddings_1: torch.Tensor, embeddings_2: torch.Tensor) -> torch.Tensor:
        return cos_sim(embeddings_1, embeddings_2)

    @property
    def name(self) -> str:
        return "Qwen3Wrapper"

The wrapper then needs to be registered in the get_model_wrapper() method of the evaluation script:

from typing import Tuple
from corect.model_wrappers import AbstractModelWrapper, JinaV3Wrapper, Qwen3Wrapper

def get_model_wrapper(model_name: str) -> Tuple[AbstractModelWrapper, int]:
    if model_name == "jinav3":
        return JinaV3Wrapper(), 1024
    elif model_name == "qwen3":
        return Qwen3Wrapper(), 1024     # 1024 is the embedding dimension of the Qwen model.
    else:
        raise NotImplementedError(f"Model {model_name} not supported!")

The model can then be evaluated as follows:

corect evaluate qwen3 core

Add New Dataset

Our framework supports the addition of any Hugging Face retrieval dataset with corpus, queries, and qrels splits. To add a custom dataset, navigate to the dataset utils script, add a load function for your new dataset, and register it in the load_data() function. You also need to add information on the new dataset to the datasets dictionary in this script, which for a single-corpus dataset takes the form datasets[<dataset_name>] = [<dataset_name>]. The example below adds a new dataset called my_ir_dataset:

from collections import defaultdict
from typing import Dict, Tuple
from datasets import load_dataset

CoRE = {"passage": {"pass_core": 10_000, "pass_10k": 10_000}}
CoRE_NAME = "core"
DATASET_NAME = "my_ir_dataset"
# A single-corpus dataset maps to a list containing just its own name.
DATASET = [DATASET_NAME]
DATASETS = {CoRE_NAME: CoRE, DATASET_NAME: DATASET}

def _load_core_data(dataset_sub_corpus: str):
    # Code for loading CoRE
    ...

def _load_my_dataset(dataset_name: str) -> Tuple[defaultdict, Dict[str, str], defaultdict, defaultdict]:
    dataset_queries = load_dataset("hf_repo/my_dataset", "queries")
    dataset_qrels = load_dataset("hf_repo/my_dataset", "default")
    dataset_corpus = load_dataset("hf_repo/my_dataset", "corpus")

    qrels = defaultdict(dict)
    for q in dataset_qrels["test"]:  # Adjust the split name to the one used by your qrels.
        query_id = q["query-id"]
        corpus_id = q["corpus-id"]
        qrels[query_id][corpus_id] = int(q["score"])

    queries = {q["_id"]: q["text"] for q in dataset_queries["queries"] if q["_id"] in qrels}

    corpora = defaultdict(dict)
    for d in dataset_corpus["corpus"]:
        corpora[dataset_name][d["_id"]] = {"title": d["title"], "text": d["text"]}

    return corpora, queries, qrels, qrels

def load_data(dataset_name: str, dataset_sub_corpus: str):
    if dataset_name == CoRE_NAME:
        return _load_core_data(dataset_sub_corpus)
    elif dataset_name == DATASET_NAME:
        # For a single-corpus dataset, the sub-corpus name equals the dataset name.
        return _load_my_dataset(dataset_sub_corpus)
    else:
        raise NotImplementedError(f"Cannot load data for unsupported dataset {dataset_name}!")

Running the evaluation script on the new dataset can then be achieved by executing the following command:

corect evaluate jinav3 my_ir_dataset
