Reveal: A Benchmark for Verifiers of Reasoning Chains

Paper: A Chain-of-Thought Is as Strong as Its Weakest Link: A Benchmark for Verifiers of Reasoning Chains

Website: https://reveal-dataset.github.io/

Abstract: Prompting language models to provide step-by-step answers (e.g., "Chain-of-Thought") is the prominent approach for complex reasoning tasks, where more accurate reasoning chains typically improve downstream task performance. Recent literature discusses automatic methods to verify reasoning steps to evaluate and improve their correctness. However, no fine-grained step-level datasets are available to enable thorough evaluation of such verification methods, hindering progress in this direction. We introduce Reveal: Reasoning Verification Evaluation, a new dataset to benchmark automatic verifiers of complex Chain-of-Thought reasoning in open-domain question answering settings. Reveal includes comprehensive labels for the relevance, attribution to evidence passages, and logical correctness of each reasoning step in a language model's answer, across a wide variety of datasets and state-of-the-art language models.

Usage

To load the dataset:

! pip install datasets
from datasets import load_dataset

reveal = load_dataset("google/reveal")
reveal_eval = reveal['eval']  # select Reveal-Eval, the evaluation split
reveal_open = reveal['open']  # select Reveal-Open, the hard-cases split with low-confidence annotations

Note: The code above loads the table from eval/reveal_eval.csv, which is convenient for working with the data at scale. A second file, eval/reveal_eval.json, offers a more intuitive JSON structure if you prefer that format.

Some examples of how to derive the step-level and answer-level tasks from the data:

import pandas as pd
reveal_eval = pd.DataFrame(reveal_eval)

# Step Attribution task
eval_attr = reveal_eval[~reveal_eval.evidence.isna()].reset_index(drop=True)
eval_attr['decontextualized_step'] = eval_attr['decontextualized_step'].fillna(eval_attr['step'])
# Fields:
# Premise: [evidence]
# Hypothesis: [decontextualized_step]
# Gold label: [attribution_label]

# Step Logic task
def _make_history(row):
  return row['question'] + ' ' + row['full_answer'].split(row['step'].strip())[0]
eval_logic = reveal_eval.drop_duplicates(subset=['answer_id', 'step_idx']).reset_index(drop=True)
eval_logic = eval_logic[(eval_logic['type_label'] == 'Logical step.') & (eval_logic['logic_relevance_label'] == 'Relevant') & (~eval_logic['correctness_label'].isna())].copy()  # copy so the column assignment below does not trigger SettingWithCopyWarning
eval_logic['history'] = eval_logic.apply(_make_history, axis=1)
# Fields:
# Premise: [history]
# Hypothesis: [step]
# Gold label: [correctness_label]

# Step Relevance task
eval_relevance = reveal_eval.drop_duplicates(subset=['answer_id', 'step_idx']).reset_index(drop=True)
eval_relevance['relevance_label'] = (eval_relevance['logic_relevance_label'] == 'Relevant') | (eval_relevance['attribution_relevance_label'] == 'Yes')
# Fields:
# Question: [question]
# Answer: [full_answer]
# Step: [step]
# Gold label: [relevance_label]

# Step Type task
eval_type = reveal_eval.drop_duplicates(subset=['answer_id', 'step_idx']).reset_index(drop=True)
# Fields:
# Question: [question]
# Answer: [full_answer]
# Step: [step]
# Gold label: [type_label]

# CoT Full Correctness task
# Get a list of the final rated evidence passages for each answer_id and concatenate the list into one string:
rated_evidence_per_answer = {
    answer_id: reveal_eval[(reveal_eval.answer_id == answer_id) & reveal_eval.is_final_rated_evidence_for_step]['evidence']
    for answer_id in reveal_eval['answer_id'].unique()
}
rated_evidence_per_answer = {
    k: '\n'.join([f'Evidence {i+1}: {e}' for i, e in enumerate(v)]) for k, v in rated_evidence_per_answer.items()
}
# Prepare the eval DataFrame:
answer_correctness_eval = reveal_eval.drop_duplicates(subset=['answer_id']).reset_index(drop=True)
answer_correctness_eval['all_rated_evidence'] = answer_correctness_eval['answer_id'].apply(lambda x: rated_evidence_per_answer[x])
answer_correctness_eval = answer_correctness_eval[['answer_id','question','full_answer','all_rated_evidence','answer_is_fully_attributable','answer_is_logically_correct','answer_is_fully_attributable_and_correct']]
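
To make the Step Logic premise concrete, here is a minimal sketch of what the `_make_history` helper above produces for a single row. The question, answer, and step strings are invented for illustration and are not taken from Reveal; the function mirrors `_make_history` but takes plain strings instead of a DataFrame row.

```python
def make_history(question: str, full_answer: str, step: str) -> str:
    """Return the question plus the part of the answer that precedes `step`."""
    # str.split cuts at every occurrence of the step text; element [0] is
    # everything before the first occurrence (or the whole answer if the
    # step text is not found at all).
    return question + " " + full_answer.split(step.strip())[0]


# Hypothetical toy row, not a real Reveal instance.
question = "How many apples does Tom have?"
full_answer = "Tom has 3 apples. He buys 2 more. So Tom has 5 apples."
step = "So Tom has 5 apples."

history = make_history(question, full_answer, step)
print(repr(history))
# prints 'How many apples does Tom have? Tom has 3 apples. He buys 2 more. '
```

Because the split is purely textual, a step whose text also appears earlier in the answer truncates the history at that earlier occurrence, so it may be worth spot-checking rows where `history` looks too short.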

This is an evaluation benchmark. It should not be included in training data for NLP models.

Please do not redistribute any part of the dataset without sufficient protection against web-crawlers.

A 64-character identifier string is added to each instance in the dataset to assist in future detection of contamination in web-crawl corpora.

The reveal dataset's string is: Reveal:Mn12GAs2I3S0eWjbTUFC0Y51ijGFB7rGBLnzGGhCQ7OtJPfVg7e6qt9zb5RPL36U

The same has been done to the few-shot prompting demonstrations, to detect whether these demonstrations have been in a model's training data (if so, these demonstrations should not be used for few-shot evaluation of that model).

The few-shot demonstrations' string is: Reveal:HlyeWxw8BRcQ2dPGShTUUjn03uULZOyeNbzKzRIg4QihZ45k1lrye46OoUzi3kkW
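
One way to use these identifiers is a plain substring scan over candidate training documents. The sketch below is only an illustration: the function name `find_reveal_contamination` and the in-memory list of documents are assumptions, and a real corpus would likely need streaming and tolerance for tokenization or whitespace changes.

```python
# Identifier strings copied from this page.
REVEAL_DATASET_ID = "Reveal:Mn12GAs2I3S0eWjbTUFC0Y51ijGFB7rGBLnzGGhCQ7OtJPfVg7e6qt9zb5RPL36U"
REVEAL_FEWSHOT_ID = "Reveal:HlyeWxw8BRcQ2dPGShTUUjn03uULZOyeNbzKzRIg4QihZ45k1lrye46OoUzi3kkW"


def find_reveal_contamination(documents):
    """Return the indices of documents that contain either identifier.

    `documents` is any iterable of strings (hypothetical input format).
    """
    hits = {"dataset": [], "few_shot": []}
    for i, doc in enumerate(documents):
        if REVEAL_DATASET_ID in doc:
            hits["dataset"].append(i)
        if REVEAL_FEWSHOT_ID in doc:
            hits["few_shot"].append(i)
    return hits


# Toy corpus for illustration only.
docs = [
    "an unrelated web page",
    "scraped row ... " + REVEAL_DATASET_ID + " ...",
    "prompt file " + REVEAL_FEWSHOT_ID,
]
hits = find_reveal_contamination(docs)
print(hits)
# prints {'dataset': [1], 'few_shot': [2]}
```

A hit on the few-shot identifier suggests the demonstrations were seen in training, in which case they should not be used for few-shot evaluation of that model.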

Fields and Descriptions

dataset: Source dataset
question_id: ID of the original instance from the source dataset
question: The question text
answer_model: Model which generated the CoT answer
answer_id: ID of a particular model's answer to a question (question_id + answer_model)
step_idx: Step index in the answer for this row
full_answer: Full CoT answer generated by the model
step: The step from the full CoT answer which matches "step_idx", the subject of the row
decontextualized_step: The decontextualized version of the step that we used for evidence retrieval (and for the NLI classification evaluation settings)
attribution_relevance_label: Majority label for the relevance annotations in the attribution task
attribution_relevance_majority: Max # of raters which agreed with each other for this rating
attribution_relevance_annotations: The annotations for each rater (ordered list)
attribution_relevance_raters: The raters (ordered list)
attribution_relevance_num_ratings: The number of raters/ratings
evidence_id: The evidence id (from 1 to 3) used for the annotation in this row
evidence: The evidence used for the annotation in this row
attribution_label: The majority label for whether the evidence supports the step
attribution_majority: Max # of raters which agreed with each other for this rating
attribution_annotations: The annotations for each rater (ordered list)
attribution_raters: The raters (ordered list)
attribution_num_ratings: The number of raters/ratings
attribution_justifications: The justifications of each rater (ordered list) - note that the raters gave one justification for every step, not for every evidence
annotated_in_attribution_batch: Which batch this was annotated in (we had 5 annotation batches)
type_label: Majority label for whether the step is an attribution step, logical step or both
type_majority: Max # of raters which agreed with each other for this rating
type_annotations: The annotations for each rater (ordered list)
type_raters: The raters (ordered list)
type_num_ratings: The number of raters/ratings
logic_relevance_label: Majority label for relevance annotations in the logic task
logic_relevance_majority: Max # of raters which agreed with each other for this rating
logic_relevance_annotations: The annotations for each rater (ordered list)
logic_relevance_raters: The raters (ordered list)
logic_relevance_num_ratings: The number of raters/ratings
logic_justifications: Justifications of each rater (ordered list) - note that the raters gave one justification to all ratings of every step (i.e., one justification for the ratings of type + relevance + correctness together)
annotated_in_logic_batch: Which batch this was annotated in (we had 5 annotation batches)
correctness_label: Majority label for whether the step is logically correct given the question + previous steps
correctness_majority: Max # of raters which agreed with each other for this rating
correctness_annotations: The annotations for each rater (ordered list)
correctness_raters: The raters (ordered list)
correctness_num_ratings: The number of raters/ratings
agreement_majority_all_steps: Minimum agreement majority across the attribution and logic ratings for all steps
is_low_agreement_hard_case: agreement_majority_all_steps <= 2. This boolean indicates whether the annotations for this answer contain a step with non-trustworthy annotations. This is the difference between Reveal-Eval and Reveal-Open.
contamination_identifier: An identification string for contamination detection.
is_final_rated_evidence_for_step: Whether this step-evidence pair is the final attribution rating for this step (we try up to 3 evidence passages and stop when we find one that supports or contradicts the step; the rating in this row is the final attribution rating for the step across all evidence passages)
answer_is_fully_attributable: Whether all attribution steps in the answer are fully attributable to some evidence
answer_is_logically_correct: Whether all logic steps are logically correct
answer_is_fully_attributable_and_correct: Whether all steps are correct (fully attributable or logical)
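
The `*_label`, `*_majority`, and agreement fields above fit together roughly as sketched below. This is a sketch under assumptions: the helper names `majority_vote` and `is_low_agreement_hard_case` are hypothetical, the annotation values are made up, and the tie-breaking rule behind the released labels is not documented here, so ties simply resolve to the first label seen.

```python
from collections import Counter


def majority_vote(annotations):
    """Return (label, majority): the most common annotation and how many
    raters chose it. Ties go to the label seen first (an assumption)."""
    counts = Counter(annotations)
    label, majority = counts.most_common(1)[0]
    return label, majority


def is_low_agreement_hard_case(step_majorities):
    """Apply the rule stated above: the minimum majority across all rated
    steps of an answer (agreement_majority_all_steps) is <= 2."""
    agreement_majority_all_steps = min(step_majorities)
    return agreement_majority_all_steps <= 2


# Made-up annotations from three raters for one rating.
label, majority = majority_vote(["Yes", "Yes", "No"])
print(label, majority)
# prints Yes 2

print(is_low_agreement_hard_case([3, 2, 3]))
# prints True
```

Under this reading, a single rating where only two raters agree is enough to flag the whole answer as a hard case.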
