This project accompanies our paper on evaluating embedding compression techniques for dense retrieval at scale. We present CoRECT, a framework designed to systematically measure the impact of compression strategies on retrieval performance.
Our work addresses two main objectives: constructing a controlled retrieval benchmark (CoRE) and systematically evaluating embedding compression techniques on it.
The CoRE benchmark is a key component of our framework. It enables controlled experiments by varying corpus size and document length independently. Built upon MS MARCO v2 and human relevance judgments (TREC DL 2023), CoRE provides retrieval tasks over sub-corpora of controlled size for both passages and documents.
The corpus construction is designed to keep evaluation realistic, making CoRE a robust testbed for analyzing how retrieval performance is affected by corpus size and complexity.
We implement and benchmark a range of compression techniques within our open-source evaluation framework, making it easy to reproduce our results and to test new methods.
To run the evaluation framework, follow these steps:
git clone https://github.com/padas-lab-de/CoRECT.git

There are two ways you can install the dependencies to run the code.
If you have the Poetry package manager for Python installed already, you can simply set up everything with:
poetry install
source $(poetry env info --path)/bin/activate

After the installation of all dependencies, you will end up in a new shell with a loaded venv. In this shell, you can run the main corect command. You can exit the shell at any time with exit.
corect --help

To install new dependencies in an existing Poetry environment, run the following commands with the shell environment activated:
poetry lock
poetry install

You can also create a venv yourself and use pip to install the dependencies:
python3 -m venv venv
source venv/bin/activate
pip install .

The evaluation code currently supports two datasets: a transformed version of the MS MARCO v2 dataset, called CoRE, and public BeIR datasets. In addition to the dataset, the code also loads an embedding model to evaluate the defined compression techniques. The currently supported models are Jina V3 (jinav3), Multilingual-E5-Large-Instruct (e5), Snowflake-Arctic-Embed-m (snowflake), and Snowflake-Arctic-Embed-m-v2.0 (snowflakev2).
corect evaluate jinav3 core     # Evaluates Jina V3 on CoRE
corect evaluate e5 beir         # Evaluates E5-Multilingual on BeIR

The code downloads the respective datasets from Hugging Face and uses the chosen model to generate the embeddings.
By default, the embeddings are stored locally for later re-evaluation. To avoid this, change line 177 in the evaluation.py script to None. The embeddings are then compressed using the compression methods specified in compression_registry.py.
After running the evaluation code, you will find the results in the results folder. The results are stored in a JSON file within a folder structure organized by model name and dataset. To share the results, copy the respective JSON file to the share_results folder. Default folders for storing results and embeddings can be changed in config.py. Results are stored in the following format:
{
    "ndcg_at_1": 0.38462,
    "ndcg_at_3": 0.33752,
    ...
    "rc_at_1000": {
        "relevant": 10.0,
        "distractor": 99.63077
    },
    "rankings": {
        "qid1": {
            "relevant": {"cid1": 0, "cid9": 5, ...},
            "distractor": {"cid3": 2, "cid5": 3, ...},
            "random": {"cid17": 1, "cid15": 11, ...}
        },
        ...
    }
}
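To work with these results programmatically (a minimal sketch; the file path below is a made-up example and depends on your model and dataset folder structure), you can load the JSON and read off the metrics:

import json

# Hypothetical path; the actual layout is results/<model_name>/<dataset>/...
with open("results/jinav3/core/results.json") as f:
    results = json.load(f)

print(results["ndcg_at_1"])
# Retrieved counts at cutoff 1000, split into relevant and distractor documents:
print(results["rc_at_1000"]["relevant"], results["rc_at_1000"]["distractor"])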
The currently implemented compression techniques can be found in the quantization folder. To add a new method, implement a class that extends AbstractCompression and add your custom compression technique via the compress() method:
import torch

from corect.quantization.AbstractCompression import AbstractCompression

# Bit widths of the supported reduced-precision types.
PRECISION_TYPE = {
    "float16": 16,
    "bfloat16": 16,
}


class FloatCompression(AbstractCompression):
    def __init__(self, precision_type: str = "float16"):
        # "full" keeps the uncompressed 32-bit embeddings and is used as the baseline.
        assert precision_type == "full" or precision_type in PRECISION_TYPE
        self.precision_type = precision_type

    def compress(self, embeddings: torch.Tensor) -> torch.Tensor:
        if self.precision_type == "full":
            return embeddings
        elif self.precision_type == "float16":
            return embeddings.type(torch.float16)
        elif self.precision_type == "bfloat16":
            return embeddings.type(torch.bfloat16)
        else:
            raise NotImplementedError(
                f"Cannot convert embedding to invalid precision type {self.precision_type}!"
            )
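To try the class in isolation (a small sketch; the random tensor stands in for real embeddings), compress a batch of vectors and inspect the resulting dtype:

import torch

compression = FloatCompression("bfloat16")
embeddings = torch.randn(8, 1024)           # eight dummy 1024-dimensional embeddings
compressed = compression.compress(embeddings)
print(compressed.dtype)                     # torch.bfloat16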
To include your class in the evaluation, modify the add_compressions() method in the compression registry to register your class in the compression methods dictionary:
from typing import Dict

from corect.quantization.AbstractCompression import AbstractCompression
from corect.quantization.FloatCompression import PRECISION_TYPE, FloatCompression


class CompressionRegistry:
    _compression_methods: Dict[str, AbstractCompression] = {}

    @classmethod
    def get_compression_methods(cls) -> Dict[str, AbstractCompression]:
        return cls._compression_methods

    @classmethod
    def clear(cls):
        cls._compression_methods.clear()

    @classmethod
    def add_baseline(cls):
        # Uncompressed 32-bit float embeddings serve as the baseline.
        cls._compression_methods["32_full"] = FloatCompression("full")

    @classmethod
    def add_compressions(cls):
        # Add your compression method here to use it for evaluation.
        for precision, num_bits in PRECISION_TYPE.items():
            cls._compression_methods[f"{num_bits}_{precision}"] = FloatCompression(precision)

You should now be able to evaluate your compression technique by running the evaluation script as described above.
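As a quick sanity check (a minimal sketch; the import path for CompressionRegistry is assumed from the repository layout and may differ), you can populate the registry and compress a dummy batch of embeddings:

import torch

from corect.quantization.CompressionRegistry import CompressionRegistry  # assumed path

CompressionRegistry.clear()
CompressionRegistry.add_baseline()
CompressionRegistry.add_compressions()

embeddings = torch.randn(4, 1024)  # four dummy 1024-dimensional embeddings
for name, method in CompressionRegistry.get_compression_methods().items():
    compressed = method.compress(embeddings)
    print(name, compressed.dtype)
# Expected keys: 32_full, 16_float16, 16_bfloat16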
New embedding models can be added by implementing the AbstractModelWrapper class, which requires implementing encoding functions for queries and documents. Any model available via transformers can be added easily. For reference, consider the example below:
from typing import List, Union

import torch
from transformers import AutoModel, AutoTokenizer

from corect.model_wrappers import AbstractModelWrapper
from corect.utils import cos_sim


def _last_token_pool(last_hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # With left padding, the last position holds the final token of every sequence.
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    else:
        # Otherwise, select each sequence's last non-padding token.
        sequence_lengths = attention_mask.sum(dim=1) - 1
        batch_size = last_hidden_states.shape[0]
        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]


class Qwen3Wrapper(AbstractModelWrapper):
    def __init__(self, pretrained_model_name="Qwen/Qwen3-Embedding-0.6B"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(pretrained_model_name, trust_remote_code=True, torch_dtype=torch.float16)
        self.encoder.cuda()
        self.encoder.eval()
        self.tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name, trust_remote_code=True, padding_side='left')

    @torch.no_grad()
    def _encode_input(self, sentences: List[str]) -> torch.Tensor:
        inputs = self.tokenizer(sentences, padding=True, truncation=True, return_tensors='pt', max_length=8192)
        inputs = inputs.to('cuda')
        model_outputs = self.encoder(**inputs)
        outputs = _last_token_pool(model_outputs.last_hidden_state, inputs['attention_mask'])
        # L2-normalize so that cosine similarity reduces to a dot product.
        outputs = torch.nn.functional.normalize(outputs, p=2, dim=1)
        return outputs

    def encode_queries(self, queries: List[str], **kwargs) -> torch.Tensor:
        return self._encode_input(queries)

    def encode_corpus(self, corpus: Union[str, List[str]], **kwargs) -> torch.Tensor:
        if isinstance(corpus, str):
            corpus = [corpus]
        return self._encode_input(corpus)

    def similarity(self, embeddings_1: torch.Tensor, embeddings_2: torch.Tensor) -> torch.Tensor:
        return cos_sim(embeddings_1, embeddings_2)

    @property
    def name(self) -> str:
        return "Qwen3Wrapper"
The wrapper then needs to be registered in the get_model_wrapper() method of the evaluation script:
from typing import Tuple

from corect.model_wrappers import AbstractModelWrapper, JinaV3Wrapper, Qwen3Wrapper


def get_model_wrapper(model_name: str) -> Tuple[AbstractModelWrapper, int]:
    if model_name == "jinav3":
        return JinaV3Wrapper(), 1024
    elif model_name == "qwen3":
        return Qwen3Wrapper(), 1024  # 1024 is the embedding dimension of the Qwen model.
    else:
        raise NotImplementedError(f"Model {model_name} not supported!")

The model can then be evaluated as follows:
corect evaluate qwen3 core
Our framework supports the addition of any Hugging Face retrieval dataset with corpus, queries, and qrels splits. To add a custom dataset, navigate to the dataset utils script, add a load function for your new dataset, and register it in the load_data() function. You also need to add the new dataset to the datasets dictionary in this script, mapping the dataset name to its sub-corpus names, i.e., datasets[<dataset_name>] = [<sub_corpus_name>, ...]. The example below adds a new dataset called my_ir_dataset:
from collections import defaultdict
from typing import Dict, Tuple

from datasets import load_dataset

CoRE = {"passage": {"pass_core": 10_000, "pass_10k": 10_000}}
DATASET = ["my_dataset_name"]

CoRE_NAME = "core"
DATASET_NAME = "my_ir_dataset"

DATASETS = {CoRE_NAME: CoRE, DATASET_NAME: DATASET}


def _load_core_data(dataset_sub_corpus: str):
    # Code for loading CoRE
    ...


def _load_my_dataset(dataset_name: str) -> Tuple[defaultdict, Dict[str, str], defaultdict, defaultdict]:
    dataset_queries = load_dataset("hf_repo/my_dataset", "queries")
    dataset_qrels = load_dataset("hf_repo/my_dataset", "default")
    dataset_corpus = load_dataset("hf_repo/my_dataset", "corpus")

    qrels = defaultdict(dict)
    for q in dataset_qrels["test"]:  # use the split name of your qrels, e.g. "test"
        query_id = q["query-id"]
        corpus_id = q["corpus-id"]
        qrels[query_id][corpus_id] = int(q["score"])

    # Keep only queries that have at least one relevance judgment.
    queries = {q["_id"]: q["text"] for q in dataset_queries["queries"] if q["_id"] in qrels.keys()}

    corpora = defaultdict(dict)
    for d in dataset_corpus["corpus"]:
        corpora[dataset_name][d["_id"]] = {"title": d["title"], "text": d["text"]}

    # The qrels are returned twice to match the CoRE loader's signature
    # (relevant and distractor judgments).
    return corpora, queries, qrels, qrels


def load_data(dataset_name: str, dataset_sub_corpus: str):
    if dataset_name == CoRE_NAME:
        return _load_core_data(dataset_sub_corpus)
    elif dataset_name == DATASET_NAME:
        return _load_my_dataset(dataset_sub_corpus)
    else:
        raise NotImplementedError(f"Cannot load data for unsupported dataset {dataset_name}!")

Running the evaluation script on the new dataset can then be achieved by executing the following command:
corect evaluate jinav3 my_ir_dataset