🧑‍⚖️ "Replacing Judges with Juries" using distilabel
TL;DR
distilabel is a framework to build pipelines for synthetic data generation and AI Feedback (AIF) as a Directed Acyclic Graph (DAG) using LLMs, and it comes with a growing collection of pre-defined tasks. "Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models" is a recent publication from Cohere that explores the problems of using a single large LLM as a judge for the generations, and proposes instead a Panel of LLM evaluators (PoLL), the so-called juries, composed of more and smaller LLMs, leading to more diverse, less intra-model-biased, and less expensive judgement of the generations.
Introduction
"Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models" is a paper published by Cohere (Path Verga et al.) that explores the problematic around using a single large model like GPT-4 from OpenAI to judge / score either a single LLM generation or a comparison between multiple LLM generations, since they claim it introduces intra-model bias and most of the times using models that large is often unnecessary. So on, they propose what they call a Panel of LLm evaluators (PoLL), the so called "juries", which is a pool of more and smaller LLMs to judge / score the LLM outputs and then use an aggregation or average pooling of those scores instead of the single score provided by the larger LLM, the so called "judge".
Using the proposed PoLL is not only cheaper, but also mitigates intra-model bias thanks to its composition of disjoint model families; in the paper these are Claude Haiku from Anthropic, GPT-3.5 from OpenAI, and Command R from Cohere, the latter being open-weight, while the rest are commercial / proprietary models.
The idea of this post is to reproduce a similar pipeline where some LLMs (Gemma 1.1 7B Instruct, Llama 3 8B Instruct, Phi 3 Mini (4K) Instruct, and Mistral 7B v0.2 Instruct; all open and available on the Hugging Face Hub) are used to generate completions for a given collection of instructions / prompts, and then other LLMs (Claude Haiku, GPT-3.5, and Command R Plus) judge those using the UltraFeedback prompt, to finally aggregate the scores so as to calculate the average score for each generation and use it to binarize the dataset into a preference dataset based on the PoLL scores, instead of solely on the GPT-4 score as formerly done in UltraFeedback.
What's distilabel?
distilabel is a framework to build pipelines for synthetic data generation and AI Feedback (AIF), defining a series of steps and connecting them as a Directed Acyclic Graph (DAG), so as to easily combine data processing steps with steps running LLMs for diverse tasks such as text generation, preference rating, etc.
This post covers the implementation assuming distilabel v1.0.0 is used, since the previous versions were still experimental.
The basic concepts of distilabel are the following:
- Step: a process that receives data in batches as input and produces or alters the received data as output; it is the most basic building block.
- GeneratorStep: a Step that only generates data, i.e. it doesn't receive any input.
- GlobalStep: a Step that receives inputs and produces outputs as the default Step does, but it's global, meaning that it's blocking and won't be executed until all the batches from the previous steps have been processed.
- Task: a special type of Step that contains a mandatory arg, the LLM, and handles the processing so that when called, the input data is prepared and streamed to the LLM as inputs, and then the outputs generated by the LLM are handled and formatted according to the task.
- Pipeline: the main class that orchestrates the execution of all the steps defined as part of the Pipeline, handling the batching of the data as well as the validation, logging, and any other related logic.
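To make these concepts more concrete, here's a minimal, self-contained sketch (separate from the pipeline built later in this post) showing a GeneratorStep, a custom Step defined via the step decorator, and how steps are connected with the rshift (>>) operator; the step name InstructionLength and the column names used here are purely illustrative.

from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts, StepInput, step
from distilabel.steps.typing import StepOutput


# Custom `Step` defined via the `step` decorator: receives batches containing an
# `instruction` column and adds an `instruction_length` column to each row
@step(inputs=["instruction"], outputs=["instruction_length"])
def InstructionLength(*inputs: StepInput) -> StepOutput:
    for batch in inputs:
        for item in batch:
            item["instruction_length"] = len(item["instruction"])
        yield batch


if __name__ == "__main__":
    with Pipeline(name="minimal-example") as pipeline:
        # `GeneratorStep` that produces data from an in-memory list of dicts
        load_data = LoadDataFromDicts(
            name="load_data",
            data=[{"instruction": "What is synthetic data?"}],
        )
        instruction_length = InstructionLength(name="instruction_length")
        # Connect the steps as a DAG using the `rshift` operator
        load_data >> instruction_length

    distiset = pipeline.run()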
For more details about distilabel, I'd recommend checking the distilabel - Documentation, specifically the section dedicated to "Learn".
Installation
To install it you can use pip as follows, which will also install the anthropic, hf-inference-endpoints, and openai extras, required for the Anthropic, Inference Endpoints, and OpenAI integrations, respectively.
distilabel will be installed from the develop branch since it has some features used within this post, but feel free to pin it to v1.1.0 once it's released. See the GitHub Milestone at https://github.com/argilla-io/distilabel/milestone/8
pip install "distilabel[anthropic,hf-inference-endpoints,openai] @ git+https://github.com/argilla-io/distilabel.git@develop"
Additionally, you will need to set the following environment variables to run the Pipeline below:
- ANTHROPIC_API_KEY: the Anthropic API key, required to send requests to the Anthropic models via their API.
- HF_TOKEN: the Hugging Face authentication token, required to use the Inference Endpoints and to finally push the generated distilabel.Distiset to the Hugging Face Hub.
- OPENAI_API_KEY: the OpenAI API key, required to send requests to the OpenAI models via their API.
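These are typically exported in your shell before launching the script; alternatively, here's a minimal sketch of setting them programmatically with placeholder values (replace them with your own credentials):

import os

# Placeholder values for illustration only; never hard-code real credentials
os.environ["ANTHROPIC_API_KEY"] = "<your-anthropic-api-key>"
os.environ["HF_TOKEN"] = "<your-hugging-face-token>"
os.environ["OPENAI_API_KEY"] = "<your-openai-api-key>"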
Code
Building blocks
- LoadHubDataset: a GeneratorStep that loads a dataset from the Hugging Face Hub and streams it in batches to the follow-up steps. In this case, since the dataset we're using is HuggingFaceH4/instruction-dataset, we need to rename the column prompt to instruction, as that's what the TextGeneration task expects as input.
- TextGeneration: a Task that generates an assistant response for a given instruction provided as input, producing the generation column in the output. The TextGeneration task expects an LLM as an arg, and in this case we'll be using:
  - InferenceEndpointsLLM: an LLM implementation for Hugging Face Inference Endpoints that supports serverless, dedicated, and TGI endpoints. In this case we'll be using the following models for the generations: google/gemma-1.1-7b-it, meta-llama/Meta-Llama-3-8B-Instruct, mistralai/Mistral-7B-Instruct-v0.2, and microsoft/Phi-3-mini-4k-instruct.
- CombineColumns: since the TextGeneration tasks connected to this step run in parallel, their outputs are not combined, so this step receives all the outputs from all the previous steps as input and merges them into a list. So for each generation we'll have a list named generations that contains each generation value from the received inputs. This is also useful to prepare the data for the next step, UltraFeedback, as it expects an instruction and a list of generations as input.
- UltraFeedback: a Task that implements the UltraFeedback prompts and post-processing, so as to judge a list of generations for a given instruction; originally that was done with GPT-4, but in this case we'll use smaller LLMs as mentioned in the introduction, since that's what the paper is about. We'll use the following LLM implementations:
  - InferenceEndpointsLLM: already mentioned above; in this case it will run CohereForAI/c4ai-command-r-plus as opposed to CohereForAI/c4ai-command-r as mentioned in the paper, only because Command R+ has a serverless endpoint available in the Hugging Face Hub.
  - AnthropicLLM: an LLM implementation for Anthropic's models; in this case we'll be using Claude 3 Haiku, their fastest and most compact model, designed for near-instant responsiveness and seamless AI experiences that mimic human interactions. It hasn't proven to be so strong with the current UltraFeedback prompts though, as will be detailed further on.
  - OpenAILLM: an LLM implementation for OpenAI's models via their client, which can be extended to other APIs that are compliant with OpenAI's client; in this case we'll use it for GPT-3.5 (gpt-3.5-turbo-0125).
- CombineColumns: as mentioned above, a step that expects inputs from more than one step and combines the provided columns into a list under another column name; in this case we'll group the ratings, rationales, and model_name columns, so as to calculate the average of the ratings, while leaving the rest for reference.
- AvgPooling: a custom Step defined via the step decorator, that expects poll_ratings as input and calculates the average of those ratings, putting the average for each generation in a list that matches the length of the generations, in this case four. It also showcases how easy it is to create custom Step implementations with distilabel via the step decorator.
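To illustrate the pooling logic before jumping into the full implementation, here's a tiny standalone sketch with made-up ratings from two judges over four generations (the values are purely illustrative):

# Hypothetical `poll_ratings`: one list of ratings per judge (e.g. Command R+
# and GPT-3.5), each judge rating the same four generations
poll_ratings = [
    [4, 3, 5, 4],
    [5, 3, 4, 4],
]

# Average column-wise, i.e. per generation, across all the judges in the panel
avg_poll_ratings = [sum(col) / len(col) for col in zip(*poll_ratings)]
print(avg_poll_ratings)  # [4.5, 3.0, 4.5, 4.0]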
Implementation
from distilabel.llms import InferenceEndpointsLLM, OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.steps import (
    CombineColumns,
    KeepColumns,
    LoadHubDataset,
    StepInput,
    step,
)
from distilabel.steps.formatting import FormatTextGenerationDPO
from distilabel.steps.tasks import TextGeneration, UltraFeedback
from distilabel.steps.typing import StepOutput


@step(inputs=["poll_ratings"], outputs=["avg_poll_ratings"])
def AveragePooling(*inputs: StepInput) -> StepOutput:
    """Custom `Step` that calculates the average of the ratings for each generation."""
    for input in inputs:
        for item in input:
            item["avg_poll_ratings"] = [
                sum(col) / len(col) for col in zip(*item["poll_ratings"])
            ]
        yield input
if __name__ == "__main__":
    # We use the `Pipeline` context manager to ensure all the steps defined inside
    # are included as part of the `pipeline`
    with Pipeline(name="replacing-judges-with-juries") as pipeline:
        # First we load the dataset from the Hugging Face Hub, but for local testing
        # one could just define a dataset as a list of dicts and provide that to `LoadDataFromDicts`
        load_dataset = LoadHubDataset(
            name="load_dataset",
            repo_id="HuggingFaceH4/instruction-dataset",
            split="test",
            num_examples=10,
            output_mappings={"prompt": "instruction"},
        )

        # We create a `TextGeneration` task running Llama 3 on serverless endpoints
        text_generation_llama3 = TextGeneration(
            name="text_generation_llama3",
            llm=InferenceEndpointsLLM(
                model_id="meta-llama/Meta-Llama-3-8B-Instruct",
                tokenizer_id="meta-llama/Meta-Llama-3-8B-Instruct",
            ),
            input_batch_size=10,
            output_mappings={"model_name": "generation_model"},
        )

        # We create a `TextGeneration` task running Gemma 1.1 on serverless endpoints
        text_generation_gemma = TextGeneration(
            name="text_generation_gemma",
            llm=InferenceEndpointsLLM(
                model_id="google/gemma-1.1-7b-it",
            ),
            input_batch_size=10,
            output_mappings={"model_name": "generation_model"},
        )

        # We create a `TextGeneration` task running Phi 3 on serverless endpoints
        text_generation_phi3 = TextGeneration(
            name="text_generation_phi3",
            llm=InferenceEndpointsLLM(
                model_id="microsoft/Phi-3-mini-4k-instruct",
            ),
            input_batch_size=10,
            output_mappings={"model_name": "generation_model"},
        )

        # We create a `TextGeneration` task running Mistral v0.2 on serverless endpoints
        text_generation_mistral = TextGeneration(
            name="text_generation_mistral",
            llm=InferenceEndpointsLLM(
                model_id="mistralai/Mistral-7B-Instruct-v0.2",
            ),
            input_batch_size=10,
            output_mappings={"model_name": "generation_model"},
        )

        # Combine the `generation` and `generation_model` columns from the previous steps
        # under a single column name as a list
        combine_generation_columns = CombineColumns(
            name="combine_generation_columns",
            columns=["generation", "generation_model"],
            output_columns=["generations", "generation_models"],
        )

        # We create the UltraFeedback task with the `instruction-following` aspect to evaluate
        # the LLM capabilities on following instructions, running Command R+ on serverless
        # endpoints and GPT-3.5 from OpenAI
        ultrafeedback_cmdr_plus = UltraFeedback(
            name="ultrafeedback_cmdr_plus",
            llm=InferenceEndpointsLLM(
                model_id="CohereForAI/c4ai-command-r-plus",
            ),
            input_batch_size=5,
            aspect="instruction-following",
        )

        ultrafeedback_gpt35 = UltraFeedback(
            name="ultrafeedback_gpt35",
            llm=OpenAILLM(
                model="gpt-3.5-turbo-0125",
            ),
            input_batch_size=5,
            aspect="instruction-following",
        )

        # Then we combine again the generated `ratings` and `rationales` into a single column
        combine_ultrafeedback_columns = CombineColumns(
            name="combine_ultrafeedback_columns",
            columns=["ratings", "rationales", "model_name"],
            output_columns=["poll_ratings", "poll_rationales", "poll_models"],
        )

        # Finally, we call our custom task to calculate the average of the ratings for each generation
        avg_pooling = AveragePooling(name="avg_pooling", input_batch_size=1)

        # Here we define the orchestration of the steps using the `rshift` operator, showing how the
        # different steps are connected to each other in the `Pipeline`
        (
            load_dataset
            >> [text_generation_llama3, text_generation_gemma, text_generation_phi3, text_generation_mistral]
            >> combine_generation_columns
            >> [ultrafeedback_cmdr_plus, ultrafeedback_gpt35]
            >> combine_ultrafeedback_columns
            >> avg_pooling
        )
Finally, once the Pipeline has been defined, you can run it as follows, defining some runtime parameters mainly to control the generation behaviour of the LLMs.
    distiset = pipeline.run(
        parameters={
            "text_generation_llama3": {
                "llm": {
                    "generation_kwargs": {
                        "temperature": 0.7,
                        "max_new_tokens": 1024,
                        "stop_sequences": ["<|eot_id|>", "<|end_of_text|>"],
                    },
                },
            },
            "text_generation_gemma": {
                "llm": {
                    "generation_kwargs": {
                        "temperature": 0.7,
                        "max_new_tokens": 1024,
                        "stop_sequences": ["<eos>", "<end_of_turn>"],
                    },
                },
            },
            "text_generation_phi3": {
                "llm": {
                    "generation_kwargs": {
                        "temperature": 0.7,
                        "max_new_tokens": 1024,
                        "stop_sequences": ["</s>", "<|endoftext|>"],
                    },
                },
            },
            "text_generation_mistral": {
                "llm": {
                    "generation_kwargs": {
                        "temperature": 0.7,
                        "max_new_tokens": 1024,
                        "stop_sequences": ["</s>"],
                    },
                },
            },
            # "ultrafeedback_haiku": {
            #     "llm": {"generation_kwargs": {"temperature": 0.7, "max_tokens": 4096}},
            # },
            "ultrafeedback_cmdr_plus": {
                "llm": {
                    "generation_kwargs": {
                        "temperature": 0.7,
                        "max_new_tokens": 4096,
                        "stop_sequences": ["<EOS_TOKEN>", "<|END_OF_TURN_TOKEN|>"],
                    },
                },
            },
            "ultrafeedback_gpt35": {
                "llm": {
                    "generation_kwargs": {"temperature": 0.7, "max_new_tokens": 4096},
                },
            },
        }
    )
Finally, we can optionally push the generated dataset, i.e. the distilabel.Distiset, to the Hugging Face Hub via the push_to_hub method, so that each subset generated by the leaf steps is pushed to the Hub. In this case, since there's only one leaf step, only that one will be pushed; but if there were many, then each leaf step would be pushed under a different configuration in the Hub.
distiset.push_to_hub("replacing-judges-with-juries-distilabel")
🤗 Dataset available at alvarobartt/replacing-judges-with-juries-distilabel
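As mentioned in the introduction, the averaged PoLL scores can then be used to binarize the dataset into a preference dataset. A hedged sketch of how that could look, assuming the Distiset subset is keyed by the leaf step name (avg_pooling) with a train split, and that the instruction and generations columns are carried through to the leaf step (both details may differ depending on the distilabel version):

def binarize(example: dict) -> dict:
    """Picks the highest and lowest rated generations as chosen / rejected."""
    ratings = example["avg_poll_ratings"]
    best, worst = ratings.index(max(ratings)), ratings.index(min(ratings))
    return {
        "prompt": example["instruction"],
        "chosen": example["generations"][best],
        "rejected": example["generations"][worst],
    }

preference_dataset = distiset["avg_pooling"]["train"].map(binarize)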
Notes (as of May 3rd, 2024)
Note that you can replace the LLMs used above with ones of your choice; the idea of using these was that the ones used for the TextGeneration task are provided as serverless endpoints within the Hugging Face Hub, and the ones used for UltraFeedback are essentially the same ones as used in the official paper (with Command R+ in place of Command R).
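For instance, swapping in a different model for one of the generation tasks is just a matter of changing the model_id; the model below is only an example and may or may not be available as a serverless endpoint:

# Hypothetical extra `TextGeneration` task; replace the `model_id` with any
# model you have access to via Inference Endpoints
text_generation_other = TextGeneration(
    name="text_generation_other",
    llm=InferenceEndpointsLLM(
        model_id="HuggingFaceH4/zephyr-7b-beta",
    ),
    input_batch_size=10,
    output_mappings={"model_name": "generation_model"},
)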
In order to make extensive use of the serverless Inference Endpoints deployed in the Hugging Face Hub, subscribing to Pro is recommended (see pricing), since Inference for PROs will be enabled and you will get improved rate limits for the usage of the free Inference API.
I've encountered issues when using Claude Haiku with the UltraFeedback prompts, since apparently it's not able to generate something compliant with the expected formatting, but I'll investigate that; in the meantime, the code for running Claude Haiku with UltraFeedback has been commented out. That could be fixed by just ignoring the values with rating=None, but until further investigation is done, I feel it's better to leave that aside for the moment.
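For reference, a None-tolerant variant of the pooling logic could simply skip the ratings a judge failed to produce; a minimal sketch (not what the pipeline above does):

from typing import List, Optional


def safe_avg(ratings_per_judge: List[List[Optional[float]]]) -> List[Optional[float]]:
    """Averages the ratings per generation, ignoring judges that returned None."""
    averages = []
    for per_generation in zip(*ratings_per_judge):
        valid = [rating for rating in per_generation if rating is not None]
        averages.append(sum(valid) / len(valid) if valid else None)
    return averages


print(safe_avg([[4, None, 5], [5, 3, None]]))  # [4.5, 3.0, 5.0]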
What's next?
Currently we're actively working on distilabel v1.1.0, trying to make it as developer-friendly as possible and encouraging everyone in the community to build with distilabel, as well as aiming to bridge the gap on synthetic data generation with open models and on consumer hardware.