This repository is publicly accessible, but you have to accept the conditions to access its files and content.
By clicking “Access repository” below, you confirm your understanding that this resource is permitted for use as a test set, but not as a training set, and should not be uploaded to the internet where web-crawlers can access it (such as plain text on GitHub, or in an academic PDF). Please ensure adherence to the terms detailed in the paper. If you are unsure about your specific case, don't hesitate to contact: alonjacovi@gmail.com.
Reveal: A Benchmark for Verifiers of Reasoning Chains
Paper: A Chain-of-Thought Is as Strong as Its Weakest Link: A Benchmark for Verifiers of Reasoning Chains
Link: https://arxiv.org/abs/2402.00559
Website: https://reveal-dataset.github.io/
Abstract: Prompting language models to provide step-by-step answers (e.g., "Chain-of-Thought") is the prominent approach for complex reasoning tasks, where more accurate reasoning chains typically improve downstream task performance. Recent literature discusses automatic methods to verify reasoning steps to evaluate and improve their correctness. However, no fine-grained step-level datasets are available to enable thorough evaluation of such verification methods, hindering progress in this direction. We introduce Reveal: Reasoning Verification Evaluation, a new dataset to benchmark automatic verifiers of complex Chain-of-Thought reasoning in open-domain question answering settings. Reveal includes comprehensive labels for the relevance, attribution to evidence passages, and logical correctness of each reasoning step in a language model's answer, across a wide variety of datasets and state-of-the-art language models.
Usage
To load the dataset:
! pip install datasets
from datasets import load_dataset
reveal = load_dataset("google/reveal")
reveal_eval = reveal['eval'] # select Reveal-Eval, the evaluation split
reveal_open = reveal['open'] # select Reveal-Open, the hard-cases split with low-confidence annotations
Note: The above provides a table from eval/reveal_eval.csv for easily working with the data at scale. There is another file, eval/reveal_eval.json, with a more intuitive JSON structure, if you prefer this format.
Some examples of how to handle the data by deriving step-level tasks:
import pandas as pd
reveal_eval = pd.DataFrame(reveal_eval)
# Step Attribution task
eval_attr = reveal_eval[~reveal_eval.evidence.isna()].reset_index(drop=True)
eval_attr['decontextualized_step'] = eval_attr['decontextualized_step'].fillna(eval_attr['step'])
# Fields:
# Premise: [evidence]
# Hypothesis: [decontextualized_step]
# Gold label: [attribution_label]
# Step Logic task
def _make_history(row):
    # History = the question plus everything in the full answer that precedes this step
    return row['question'] + ' ' + row['full_answer'].split(row['step'].strip())[0]
eval_logic = reveal_eval.drop_duplicates(subset=['answer_id', 'step_idx']).reset_index(drop=True)
eval_logic = eval_logic[(eval_logic['type_label'] == 'Logical step.') & (eval_logic['logic_relevance_label'] == 'Relevant') & (~eval_logic['correctness_label'].isna())]
eval_logic['history'] = eval_logic.apply(_make_history, axis=1)
# Fields:
# Premise: [history]
# Hypothesis: [step]
# Gold label: [correctness_label]
# Step Relevance task
eval_relevance = reveal_eval.drop_duplicates(subset=['answer_id', 'step_idx']).reset_index(drop=True)
eval_relevance['relevance_label'] = (eval_relevance['logic_relevance_label'] == 'Relevant') | (eval_relevance['attribution_relevance_label'] == 'Yes')
# Fields:
# Question: [question]
# Answer: [full_answer]
# Step: [step]
# Gold label: [relevance_label]
# Step Type task
eval_type = reveal_eval.drop_duplicates(subset=['answer_id', 'step_idx']).reset_index(drop=True)
# Fields:
# Question: [question]
# Answer: [full_answer]
# Step: [step]
# Gold label: [type_label]
# CoT Full Correctness task
# Get a list of the final rated evidence passages for each answer_id and concatenate the list into one string:
rated_evidence_per_answer = {
    answer_id: reveal_eval[(reveal_eval.answer_id == answer_id) & reveal_eval.is_final_rated_evidence_for_step]['evidence']
    for answer_id in reveal_eval['answer_id'].unique()
}
rated_evidence_per_answer = {
    k: '\n'.join([f'Evidence {i+1}: {e}' for i, e in enumerate(v)]) for k, v in rated_evidence_per_answer.items()
}
# Prepare the eval DataFrame:
answer_correctness_eval = reveal_eval.drop_duplicates(subset=['answer_id']).reset_index(drop=True)
answer_correctness_eval['all_rated_evidence'] = answer_correctness_eval['answer_id'].apply(lambda x: rated_evidence_per_answer[x])
answer_correctness_eval = answer_correctness_eval[['answer_id','question','full_answer','all_rated_evidence','answer_is_fully_attributable','answer_is_logically_correct','answer_is_fully_attributable_and_correct']]
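Once a task DataFrame is prepared, a verifier's predictions can be compared against the gold label column. The following is a minimal sketch of one way to score such predictions with per-label and macro-averaged F1. The choice of metric, the helper name `macro_f1`, and the label values `"Correct"` and `"Incorrect"` are assumptions made for illustration; they are not taken from the dataset or the paper, so check the actual label values in the data before using it.

```python
from collections import Counter


def macro_f1(gold, pred):
    """Return (macro_f1, per_label_f1) for two equal-length label sequences."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p, but it was wrong
            fn[g] += 1  # gold label g was missed
    per_label = {}
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        per_label[label] = 2 * tp[label] / denom if denom else 0.0
    macro = sum(per_label.values()) / len(per_label) if per_label else 0.0
    return macro, per_label


# Placeholder labels (hypothetical values, for illustration only).
gold = ["Correct", "Correct", "Incorrect", "Correct"]
pred = ["Correct", "Incorrect", "Incorrect", "Correct"]
macro, per_label = macro_f1(gold, pred)
print(per_label, macro)
```

With the DataFrames above, `gold` could be, for example, `eval_logic['correctness_label'].tolist()`, and `pred` the list of labels produced by the verifier under evaluation.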
This is an evaluation benchmark. It should not be included in training data for NLP models.
Please do not redistribute any part of the dataset without sufficient protection against web-crawlers.
A 64-character identifier string is added to each instance in the dataset to assist in future detection of contamination in web-crawl corpora.
The Reveal dataset's string is: Reveal:Mn12GAs2I3S0eWjbTUFC0Y51ijGFB7rGBLnzGGhCQ7OtJPfVg7e6qt9zb5RPL36U
The same has been done to the few-shot prompting demonstrations, to detect whether these demonstrations have been in a model's training data (if so, these demonstrations should not be used for few-shot evaluation of that model).
The few-shot demonstrations' string is: Reveal:HlyeWxw8BRcQ2dPGShTUUjn03uULZOyeNbzKzRIg4QihZ45k1lrye46OoUzi3kkW
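The identifier strings above can be searched for in any text sample, for example a slice of a training corpus or a model's output. This is a minimal sketch of such a check using plain substring matching. The helper name `find_reveal_canaries` and the dictionary keys are hypothetical, and exact matching is an assumption: it will miss strings that were altered by tokenization or normalization.

```python
# The two identifier strings, copied from this dataset card.
REVEAL_CANARIES = {
    "dataset": "Reveal:Mn12GAs2I3S0eWjbTUFC0Y51ijGFB7rGBLnzGGhCQ7OtJPfVg7e6qt9zb5RPL36U",
    "few_shot": "Reveal:HlyeWxw8BRcQ2dPGShTUUjn03uULZOyeNbzKzRIg4QihZ45k1lrye46OoUzi3kkW",
}


def find_reveal_canaries(text):
    """Return the names of the Reveal identifier strings that occur in `text`."""
    return [name for name, canary in REVEAL_CANARIES.items() if canary in text]


sample = "some crawled page ... " + REVEAL_CANARIES["dataset"] + " ... more text"
print(find_reveal_canaries(sample))
print(find_reveal_canaries("a clean document"))
```

A non-empty result for a training corpus suggests that part of the benchmark (or its few-shot demonstrations) leaked into it.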
Fields and Descriptions
- dataset: Source dataset
- question_id: ID of the original instance from the source dataset
- question: The question text
- answer_model: Model which generated the CoT answer
- answer_id: ID of a particular model's answer to a question (question_id + answer_model)
- step_idx: Step index in the answer for this row
- full_answer: Full CoT answer generated by the model
- step: The step from the full CoT answer which matches "step_idx", the subject of the row
- decontextualized_step: The decontextualized version of the step that we used for evidence retrieval (and for the NLI classification evaluations settings)
- attribution_relevance_label: Majority label for the relevance annotations in the attribution task
- attribution_relevance_majority: Max # of raters which agreed with each other for this rating
- attribution_relevance_annotations: The annotations for each rater (ordered list)
- attribution_relevance_raters: The raters (ordered list)
- attribution_relevance_num_ratings: The number of raters/ratings
- evidence_id: The evidence id (from 1 to 3) used for the annotation in this row
- evidence: The evidence used for the annotation in this row
- attribution_label: The majority label for whether the evidence supports the step
- attribution_majority: Max # of raters which agreed with each other for this rating
- attribution_annotations: The annotations for each rater (ordered list)
- attribution_raters: The raters (ordered list)
- attribution_num_ratings: The number of raters/ratings
- attribution_justifications: The justifications of each rater (ordered list) - note that the raters gave one justification for every step, not for every evidence
- annotated_in_attribution_batch: Which batch this was annotated in (we had 5 annotation batches)
- type_label: Majority label for whether the step is an attribution step, logical step or both
- type_majority: Max # of raters which agreed with each other for this rating
- type_annotations: The annotations for each rater (ordered list)
- type_raters: The raters (ordered list)
- type_num_ratings: The number of raters/ratings
- logic_relevance_label: Majority label for relevance annotations in the logic task
- logic_relevance_majority: Max # of raters which agreed with each other for this rating
- logic_relevance_annotations: The annotations for each rater (ordered list)
- logic_relevance_raters: The raters (ordered list)
- logic_relevance_num_ratings: The number of raters/ratings
- logic_justifications: Justifications of each rater (ordered list) - note that the raters gave one justification to all ratings of every step (i.e., one justification for the ratings of type + relevance + correctness together)
- annotated_in_logic_batch: Which batch this was annotated in (we had 5 annotation batches)
- correctness_label: Majority label for whether the step is logically correct given the question + previous steps
- correctness_majority: Max # of raters which agreed with each other for this rating
- correctness_annotations: The annotations for each rater (ordered list)
- correctness_raters: The raters (ordered list)
- correctness_num_ratings: The number of raters/ratings
- agreement_majority_all_steps: Minimum agreement majority across the attribution and logic ratings for all steps
- is_low_agreement_hard_case: agreement_majority_all_steps <= 2. This boolean indicates whether the annotations for this answer contain a step with non-trustworthy annotations. This is the difference between Reveal-Eval and Reveal-Open.
- contamination_identifier: An identification string for contamination detection.
- is_final_rated_evidence_for_step: Whether this step-evidence pair is the final attribution rating for this step (we try up to 3 evidence passages, and stop when we find a supporting or contradicting one. The rating in this row is the final attribution rating for the step across all evidence passages)
- answer_is_fully_attributable: Whether all attribution steps in the answer are fully attributable to some evidence
- answer_is_logically_correct: Whether all logic steps are logically correct
- answer_is_fully_attributable_and_correct: Whether all steps are correct (fully attributable or logical)
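To make the relationship between the `*_annotations`, `*_label`, `*_majority`, and `is_low_agreement_hard_case` fields concrete, here is a minimal sketch of how a majority label, its rater count, and the hard-case flag could be derived. It is an illustration only, not the authors' code: the helper names, the example label `"Not relevant"`, the number of raters, and the tie-breaking behavior (first label encountered wins) are all assumptions.

```python
from collections import Counter


def majority(annotations):
    """Return (label, count) for the most frequent annotation in the list."""
    if not annotations:
        raise ValueError("annotations must not be empty")
    label, count = Counter(annotations).most_common(1)[0]
    return label, count


def is_low_agreement_hard_case(majorities, threshold=2):
    """Flag an answer whose weakest rating has at most `threshold` agreeing raters.

    `majorities` holds the *_majority values of every attribution and logic
    rating across all steps of one answer; their minimum corresponds to
    agreement_majority_all_steps.
    """
    return min(majorities) <= threshold


# Hypothetical annotations from three raters for one rating.
label, count = majority(["Relevant", "Relevant", "Not relevant"])
print(label, count)

# Hypothetical *_majority values collected over all ratings of one answer.
print(is_low_agreement_hard_case([5, 4, 2]))
print(is_low_agreement_hard_case([5, 3]))
```

Under this reading, answers flagged as hard cases belong to Reveal-Open, and the rest to Reveal-Eval.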