| Author | Birth Year | # of sitelinks | WikiData ID | OpenLibrary ID |
|---|---|---|---|---|
| Gabriel García Márquez | 1927 | 190 | Q5878 | OL4586796A |
| Toni Morrison | 1931 | 122 | Q72334 | OL31120A |
| Erich Maria Remarque | 1898 | 119 | Q47293 | OL122169A |
| Nadine Gordimer | 1923 | 117 | Q47619 | OL20580A |
| Isabel Allende | 1942 | 91 | Q83566 | OL228079A |
| Arundhati Roy | 1961 | 85 | Q212801 | OL104867A |
| Nikos Kazantzakis | 1883 | 82 | Q214622 | OL29174A |
| Oriana Fallaci | 1929 | 73 | Q153700 | OL781814A |
| Edith Stein | 1891 | 71 | Q76749 | OL51184A |
| Thomas Pynchon | 1937 | 70 | Q35155 | OL4423376A |
| Michael Ende | 1929 | 69 | Q76498 | OL296646A |
| Amin Maalouf | 1949 | 69 | Q115243 | OL46671A |
| Jean Dubuffet | 1901 | 66 | Q170076 | OL143386A |
| Julia Kristeva | 1941 | 65 | Q159876 | OL31606A |
| Joseph Heller | 1923 | 64 | Q208101 | OL33512A |
| Amos Oz | 1939 | 64 | Q151872 | OL170730A |
| Romain Gary | 1914 | 63 | Q157322 | OL123692A |
| Julia Child | 1912 | 62 | Q214477 | OL218264A |
| Pierre Boulez | 1925 | 61 | Q156193 | OL273016A |
| Mika Waltari | 1908 | 60 | Q193111 | OL2688714A |
| Gianni Rodari | 1920 | 59 | Q193018 | OL299925A |
| Thomas Bernhard | 1931 | 58 | Q44336 | OL4326320A |
| Manuel Azaña | 1880 | 57 | Q203708 | OL149031A |
| Christiaan Barnard | 1922 | 57 | Q188803 | OL361761A |
| Oskar Kokoschka | 1886 | 54 | Q154260 | OL28175A |
| James Hopwood Jeans | 1877 | 54 | Q315545 | OL166245A |
| David Foster Wallace | 1962 | 53 | Q313246 | OL448939A |
| Ivan Illich | 1926 | 51 | Q84186 | OL428194A |
| bell hooks | 1952 | 50 | Q259507 | OL2631291A |
| Kevin Smith | 1970 | 49 | Q489831 | OL2721414A |
| Carson McCullers | 1917 | 47 | Q230591 | OL22420A |
| Brendan Behan | 1923 | 47 | Q313063 | OL143442A |
| Peter Weiss | 1916 | 46 | Q52191134 | OL396053A |
| Marcel Aymé | 1902 | 46 | Q318026 | OL75696A |
| Olaf Stapledon | 1886 | 45 | Q337373 | OL538087A |
| Murray Bookchin | 1921 | 45 | Q315910 | OL333834A |
| Marianne Moore | 1887 | 44 | Q278495 | OL545371A |
| Veronica Roth | 1988 | 43 | Q328212 | OL6895646A |
| Leopoldo Alas | 1852 | 43 | Q312747 | OL28169A |
| Carl Zuckmayer | 1896 | 43 | Q76820 | OL75772A |
| Heinrich Harrer | 1912 | 42 | Q84211 | OL207981A |
| Frank McCourt | 1930 | 42 | Q208869 | OL26363A |
| David Suzuki | 1936 | 42 | Q354534 | OL18944A |
| Hermann Broch | 1886 | 41 | Q84150 | OL61295A |
| Richard Hammond | 1969 | 40 | Q297265 | OL5572088A |
| Maeve Binchy | 1940 | 40 | Q152690 | OL21305A |
| Ignazio Silone | 1900 | 40 | Q168431 | OL124945A |
| Herman Wouk | 1915 | 40 | Q49072 | OL4352886A |
| Eudora Welty | 1909 | 40 | Q259364 | OL32584A |
| Viktor Suvorov | 1947 | 39 | Q130786 | OL284950A |
| Knud Rasmussen | 1879 | 39 | Q312769 | OL18679A |
| Gary Snyder | 1930 | 39 | Q315963 | OL22849A |
| Frederick Jackson Turner | 1861 | 39 | Q548462 | OL146604A |
| Edith Nesbit | 1858 | 39 | Q231708 | OL18053A |
| Colm Tóibín | 1955 | 38 | Q470758 | OL82249A |
| Sarah Kane | 1971 | 37 | Q231141 | OL1614632A |
| Martin Andersen Nexø | 1869 | 36 | Q168569 | OL137086A |
| Timothy Garton Ash | 1955 | 35 | Q311729 | OL81428A |
| Nevil Shute | 1899 | 35 | Q356639 | OL410117A |
| Kostis Palamas | 1859 | 35 | Q317967 | OL5868580A |
| Fan S. Noli | 1882 | 35 | Q366307 | OL46244A |
| Arnold Wesker | 1932 | 35 | Q202385 | OL22347A |
| Daniel Ellsberg | 1931 | 34 | Q431085 | OL1260683A |
| Peter Shaffer | 1926 | 33 | Q318188 | OL73801A |
| Nancy Mitford | 1904 | 33 | Q260026 | OL288327A |
| Michael Ignatieff | 1947 | 33 | Q311684 | OL235573A |
| Leon Uris | 1924 | 33 | Q269129 | OL19269A |
| Gianni Vattimo | 1936 | 33 | Q159648 | OL37112A |
| Colin Dexter | 1930 | 33 | Q457092 | OL34485A |
| Pyotr Krasnov | 1869 | 32 | Q35448 | OL188777A |
| Linn Ullmann | 1966 | 32 | Q256738 | OL31551A |
| Ali Smith | 1962 | 32 | Q468523 | OL6496199A |
| Peter Atkins | 1940 | 31 | Q369627 | OL3409121A |
| Joanne Harris | 1964 | 31 | Q234718 | OL25453A |
| Dodie Smith | 1896 | 31 | Q449085 | OL161177A |
| Ernst Troeltsch | 1865 | 30 | Q60285 | OL173237A |
| Yu Hua | 1960 | 28 | Q379520 | OL528199A |
| Yrsa Sigurðardóttir | 1963 | 28 | Q262253 | OL2631877A |
| Stephen E. Ambrose | 1936 | 28 | Q443953 | OL29987A |
| Paolo Soleri | 1919 | 28 | Q447351 | OL1123646A |
| Guglielmo Ferrero | 1871 | 28 | Q689713 | OL115322A |
| Eric Temple Bell | 1883 | 28 | Q548140 | OL766341A |
| Cornelius Ryan | 1920 | 28 | Q463975 | OL482577A |
| Beverly Cleary | 1916 | 28 | Q1316719 | OL22132A |
| Karin Fossum | 1954 | 27 | Q256789 | OL41672A |
| John Perkins | 1945 | 27 | Q465028 | OL1542161A |
| Charles Fort | 1874 | 27 | Q443325 | OL21506A |
| Andre Gunder Frank | 1929 | 27 | Q58040 | OL392296A |
| William Beebe | 1877 | 26 | Q956868 | OL155998A |
| Murray Leinster | 1896 | 26 | Q550449 | OL1232076A |
| Mary Midgley | 1919 | 26 | Q2898525 | OL448425A |
| John Hersey | 1914 | 26 | Q535812 | OL394640A |
| Flora Nwapa | 1931 | 26 | Q5460344 | OL2703239A |
| Dietrich von Hildebrand | 1889 | 26 | Q14678 | OL152394A |
| Lauren Weisberger | 1977 | 25 | Q176049 | OL1427597A |
| Farley Mowat | 1921 | 25 | Q966679 | OL30012A |
| Ernst Haas | 1921 | 25 | Q78767 | OL830687A |
| August Derleth | 1909 | 25 | Q509002 | OL6925253A |
| Andrew Young | 1932 | 25 | Q959635 | OL534748A |
| Xiao Hong | 1911 | 24 | Q464825 | OL1126811A |

Overview

🕮 KITAB is a challenging dataset and a dynamic data collection approach for testing the abilities of Large Language Models (LLMs) in answering information retrieval queries with constraint filters. A filtering query with constraints can be of the form "List all books written by Toni Morrison that were published between 1970-1980". The dataset was originally contributed by the paper "KITAB: Evaluating LLMs on Constraint Satisfaction for Information Retrieval" by Marah I Abdin, Suriya Gunasekar, Varun Chandrasekaran, Jerry Li, Mert Yuksekgonul, Rahee Ghosh Peshawaria, Ranjita Naik, and Besmira Nushi (2023). The dataset is named after the word kitab, which is the word for "book" in Arabic, Swahili, Urdu, Hindi and various Indian and Turkic languages.

KITAB consists of book-related data across more than 600 authors and roughly 13,000 queries with a varying number of constraints and complexity. In each query, the first constraint is always fixed to an author, and the remaining constraints vary among the following types of book constraints to test different constraint satisfaction capabilities:

  • lexical (title starts or ends with a letter, word count in title)
  • temporal (published between start and end year)
  • named entity (city or human name present or not present in title)
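The lexical and temporal constraint types above can be illustrated with a few small predicates. This is a minimal sketch, not the code used to build KITAB: the function names and the `title`/`year` record layout are assumptions made for illustration, and named-entity constraints are left out because they depend on external entity recognition services.

```python
def starts_with(title: str, letter: str) -> bool:
    """Lexical constraint: the title starts with the given letter (case-insensitive)."""
    return title.strip().lower().startswith(letter.lower())


def ends_with(title: str, letter: str) -> bool:
    """Lexical constraint: the title ends with the given letter (case-insensitive)."""
    return title.strip().lower().endswith(letter.lower())


def word_count(title: str) -> int:
    """Lexical constraint helper: number of whitespace-separated words in the title."""
    return len(title.split())


def published_between(year: int, start: int, end: int) -> bool:
    """Temporal constraint: publication year within [start, end], inclusive."""
    return start <= year <= end


def filter_books(books: list[dict], *predicates) -> list[str]:
    """Return titles of the books that satisfy every predicate (each takes a book dict)."""
    return [b["title"] for b in books if all(p(b) for p in predicates)]


# Hypothetical record layout; KITAB's own files use different field names.
books = [
    {"title": "The Bluest Eye", "year": 1970},
    {"title": "Sula", "year": 1973},
    {"title": "Song of Solomon", "year": 1977},
    {"title": "Tar Baby", "year": 1981},
    {"title": "Beloved", "year": 1987},
]

in_seventies = lambda b: published_between(b["year"], 1970, 1980)
starts_with_s = lambda b: starts_with(b["title"], "s")

one_constraint = filter_books(books, in_seventies)
two_constraints = filter_books(books, in_seventies, starts_with_s)
```

Passing several predicates to `filter_books` mirrors how a two-constraint query combines two single constraints for the same author.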

What is available in this repository?

This repository contains the following artifacts:

  • All data for the KITAB sample used in the original paper. This consists of the set of authors, their corresponding books, and the set of queries with constraints.
  • Example code for generating a new sample with a different set of authors. Here the sampling and data collection steps do not include the generation of queries, as these may change according to the evaluation needs for the data. The example code also shows how to evaluate a potential model output with a list of books against the provided ground truth in KITAB, by following the same evaluation process as in the original paper. Note that this evaluation tends to relax some of the constraint satisfaction requirements, in particular when the model comes up with only a partial title.
  • All prompts that were used in the original paper to evaluate GPT-4 and GPT-3.5.

Data

  • KITAB-ONE-BOOK-CONSTRAINTS.json and KITAB-TWO-BOOK-CONSTRAINTS.json - correspond to queries with one and two book constraints. Each file has all the information needed to recreate a prompt query, including the author, their birth year, number of sitelinks on WikiData, the constraint type(s), the constraint(s) expressed in natural language, the list of all books by the author, and the mapped list of books by the author that satisfy the constraint(s).
KITAB-ONE-BOOK-CONSTRAINTS_features = {
    "Author": "author name",
    "Birth Year": "author birth year",
    "# of sitelinks": "number of external links related to the author",
    "constraint_id": "unique id for the constraint",
    "constraint_type": "type of the constraint",
    "constraints": "the constraint",
    "mapped_books": "list of books by the author mapped to the constraint",
    "all_books": "full list of books by author post cleaning from openlibrary",
    "raw_books": "raw list of books by author from openlibrary",
}
  • KITAB-author-metadata.json - contains the set of 611 authors along with their birth year, the number of sitelinks in Wikidata, and their corresponding Open Library and WikiData identifiers.
  • KITAB-book-metadata.tar.gz - contains a json file per author with all books retrieved from OpenLibrary for that author. The files contain the following information per title: the Open Library ID for the book, the Wikidata ID (if it exists), the list of languages in which it was published, the number of editions, the number of words in the title, the earliest publishing year, city names found in the title (if any), a modified version of the title in lowercase that strips stop words like "A" and "The" from the title, and a set of other redundant versions of the same title as found in Open Library (if any).
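As a rough illustration of working with the query files, the sketch below parses records and summarizes one of them using the field names listed above. The on-disk layout is an assumption (the parser accepts either a JSON array or one JSON object per line), the sample record's values are invented for illustration, and `parse_records` and `summarize` are hypothetical helper names.

```python
import json


def parse_records(text: str) -> list[dict]:
    """Parse query records from text that is either a JSON array or JSON Lines.

    The actual file layout is an assumption; adjust if the files differ.
    """
    text = text.strip()
    if text.startswith("["):
        return json.loads(text)
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def summarize(record: dict) -> dict:
    """Collect the pieces needed to recreate a prompt query from one record."""
    return {
        "author": record["Author"],
        "birth_year": record["Birth Year"],
        "constraint": record["constraints"],
        "num_all_books": len(record["all_books"]),
        "num_mapped_books": len(record["mapped_books"]),
    }


# Illustrative record only; the values are not taken from the dataset files.
sample = {
    "Author": "Toni Morrison",
    "Birth Year": 1931,
    "# of sitelinks": 122,
    "constraint_id": 0,
    "constraint_type": "example-type",
    "constraints": "the book was published between 1970 and 1980",
    "mapped_books": ["Sula", "Song of Solomon"],
    "all_books": ["Sula", "Song of Solomon", "Beloved"],
    "raw_books": ["Sula", "Song of Solomon", "Beloved"],
}

records = parse_records(json.dumps([sample]))
summary = summarize(records[0])
```

In practice the text would come from reading KITAB-ONE-BOOK-CONSTRAINTS.json or KITAB-TWO-BOOK-CONSTRAINTS.json from disk.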

Code and evaluation scripts

Example scripts and notebooks included in this repository:

  • collect_authors_from_wikidata.py and wikidata_open_library_author_profiling.ipynb - example code for generating a new author sample from WikiData and OpenLibrary. Here, we also make available the longer list of authors that was originally sampled from WikiData to facilitate the sampling process although future work may also choose to repeat this step as needed. The full list can be found in: wikidata_authors_crawl.csv.
  • fetch_book_data.py - example code for collecting book data for the set of authors sampled in the previous steps. Pulls data from OpenLibrary and WikiData to curate and clean the sample.
  • evaluation.ipynb - example code for evaluating model outputs from our prompts against ground truth KITAB data. Here, we also make available the GPT-4 output on human name detection, although as models improve future work may also choose to repeat this step as needed. Results can be found in: gpt_4_name_data_processed.csv.
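To give a feel for what evaluating a model's book list against the ground truth can involve, here is a simplified sketch. It is not the logic in evaluation.ipynb: the normalization (lowercasing, dropping a small stop-word set) and the rule that a partial title matches when it is a whole-word prefix of a known title are assumptions, as are the function names and the three output categories.

```python
import re

STOP_WORDS = {"a", "an", "the"}  # assumed stop-word list for this sketch


def normalize(title: str) -> str:
    """Lowercase, drop punctuation, and remove stop words."""
    words = re.findall(r"\w+", title.lower())
    return " ".join(w for w in words if w not in STOP_WORDS)


def is_match(model_title: str, true_title: str) -> bool:
    """Exact match after normalization, or a whole-word prefix (partial title)."""
    m, t = normalize(model_title), normalize(true_title)
    if not m:
        return False
    return m == t or t.startswith(m + " ")


def classify(model_titles, all_books, mapped_books) -> dict:
    """Sort model outputs into satisfied, unsatisfied, or irrelevant titles."""
    result = {"satisfied": [], "unsatisfied": [], "irrelevant": []}
    for title in model_titles:
        if any(is_match(title, b) for b in mapped_books):
            result["satisfied"].append(title)    # by the author, meets constraints
        elif any(is_match(title, b) for b in all_books):
            result["unsatisfied"].append(title)  # by the author, fails constraints
        else:
            result["irrelevant"].append(title)   # not found among the author's books
    return result
```

Here `all_books` and `mapped_books` correspond to the fields of the same name in the query files, and `model_titles` is the list of titles parsed from a model response.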

Prompts

We use the following prompt templates for different experimental conditions on the KITAB data:

ALL-BOOKS (Template 1): List all books from the author. This condition enables us to estimate an upper bound of model performance in retrieving relevant information for all queries, regardless of other constraints.

NO-CONTEXT (Template 2a): List all books from the author that also satisfy other book constraints.

WITH-CONTEXT (Template 2b): First, provide a full list of books from the author as input context to the model. Then, ask the model to list all books from the author that also satisfy other book constraints.

SELF-CONTEXT (Template 3): Ask the model to first self-retrieve all books from the author, and then use that list to find those that also satisfy book constraints.

NAME-CHECK (Template 4): Ask the model to find all books in a given list that contain a human name.
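The experimental conditions above differ mainly in what context accompanies the query. The sketch below shows that structure only: the wording is invented for illustration and is not the text of the prompt templates used in the paper, and all function names are hypothetical.

```python
def all_books_prompt(author: str) -> str:
    """ALL-BOOKS: ask for every book by the author, with no further constraints."""
    return f"List all books written by {author}."


def no_context_prompt(author: str, constraint: str) -> str:
    """NO-CONTEXT: author constraint plus book constraints, no supporting context."""
    return f"List all books written by {author} such that {constraint}."


def with_context_prompt(author: str, constraint: str, books: list[str]) -> str:
    """WITH-CONTEXT: supply the author's full book list, then ask for the filtered list."""
    listing = "\n".join(f"- {title}" for title in books)
    return (
        f"Here is a list of books written by {author}:\n{listing}\n\n"
        f"From this list, give all books such that {constraint}."
    )


def self_context_prompt(author: str, constraint: str) -> str:
    """SELF-CONTEXT: the model first retrieves all books itself, then filters them."""
    return (
        f"First, list all books written by {author}. "
        f"Then, using that list, give all books such that {constraint}."
    )


def name_check_prompt(titles: list[str]) -> str:
    """NAME-CHECK: ask which titles in a given list contain a human name."""
    listing = "\n".join(f"- {title}" for title in titles)
    return f"Which of the following book titles contain a human name?\n{listing}"
```

For example, `with_context_prompt("Toni Morrison", "the book was published between 1970 and 1980", ["Sula", "Beloved"])` produces a prompt that embeds both titles before the question.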

Data Collection and Statistics

The author list was initially randomly sampled from WikiData and then filtered down to 611 authors to avoid potentially inaccurate data and extreme outliers. For example, this involved removing authors that have very few or too many books and authors that were born before 1850. The collected book data was derived from Open Library and contains all books from the author that are tagged to be in English by Open Library or detected to be in English by the Language Detection service from the Azure Cognitive Services API. More details about author sampling and book data collection and cleaning are present in the paper.

Since there exists a large number of constraint instances depending on their cardinality, we subsample from the potentially large set of queries in a way that ensures a balanced representation across constraint types and a variety of constraints with different constrainedness (defined as the complement of the ratio between the number of books that satisfy the constraints and the total number of books from the author). The dataset also contains “unsatisfiable” constraints, which do not match any book titles in our data. These constitute 7.99% of the queries with only one book constraint. The final dataset contains 8239 single-constraint queries and 4750 double-constraint queries. The table below shows how these queries are distributed across different constraint types. For all double-constraint queries, both constraints are individually satisfiable and generated by combining our single constraint data. Only 0.76% of the queries are jointly unsatisfiable across both constraints.
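The constrainedness definition in the paragraph above translates directly into a small function. This is a minimal sketch; the function name and the input validation are assumptions added for illustration.

```python
def constrainedness(num_satisfying: int, num_all: int) -> float:
    """Complement of the fraction of an author's books that satisfy the constraints.

    1.0 means no book satisfies the constraints (an unsatisfiable query);
    0.0 means every book by the author satisfies them.
    """
    if num_all <= 0:
        raise ValueError("the author must have at least one book")
    if not 0 <= num_satisfying <= num_all:
        raise ValueError("num_satisfying must be between 0 and num_all")
    return 1.0 - num_satisfying / num_all
```

For a query record, `num_satisfying` would be the length of `mapped_books` and `num_all` the length of `all_books`, so an author with 20 books of which 5 satisfy the constraints gives a constrainedness of 0.75.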

Distribution of KITAB queries across author popularity as measured by the number of sitelinks on Wikidata, for queries with a single book constraint (left) and two book constraints (right).
Distribution of queries across author constrainedness as measured by the complement of the ratio between the number of books that satisfy the book constraints and the total number of books from the author. Distribution is shown for queries with a single book constraint (left) and two book constraints (right). Note that most of the distribution in the lower range of constrainedness is dominated by constraints that require no human name or no city name in the title, which are naturally easier to satisfy.

Responsible AI Considerations

Data Cleaning: Despite our best efforts in collecting a complete and accurate set of books, we also faced a variety of challenges in retrieval and cleaning, which we further describe in Appendix C.1 in the paper. To estimate the extent to which potential data cleaning issues may impact the data quality of KITAB and further evaluation, we also undertook a manual data annotation exercise during which we searched on the web for titles provided by GPT4 and GPT3.5 but that were marked as not from the author in our dataset. In summary, we find that based on a manual annotation of a subsample of queries, less than 5% of the queries to GPT4 and less than 6% of the queries to GPT3.5 may potentially be affected by cases where the model finds a book title that is not in KITAB and that will consequently be marked as not from the author during our evaluation. While this can be remediated by using further data sources, the impact of missing information on model comparison is minor.

Human Names: Entity recognition for human names was done using both Azure Cognitive Services API and GPT4 (Template 4 in Appendix D in the paper), as we found the two approaches to be complementary for detecting names from different cultures. Note that even after using both these resources, there may still be names that are not recognized by either of them, which shows that more work is required to improve the quality of service of entity recognition for fairness across different languages and cultures.

City Names: For city names, we use Azure Cognitive Services API along with Geonames, a database of cities with more than 1000 inhabitants.

Author representation: The list of authors in KITAB was sampled randomly from a large set of authors present in Open Library. We see that the rate of irrelevant information generated by current models increases with a lower number of sitelinks in Wikidata. Since the number of sitelinks may also correlate with the age (birth year) of the author or even their nationality and how well their community is linked to the World Wide Web, this observation has important implications on model quality of service across different geographical regions and author popularity and age. While KITAB naturally does contain more authors with a lower number of sitelinks (as indicated by its long-tail distribution of author count vs. their popularity), future fairness measurement investigations in this regard may also need to oversample explicitly from cohorts belonging to given demographic and geographical attributes.

State-of-the-art results on KITAB

How to cite

@article{abdin2023kitab,
  title={KITAB: Evaluating LLMs on Constraint Satisfaction for Information Retrieval},
  author={Abdin, Marah I and Gunasekar, Suriya and Chandrasekaran, Varun and Li, Jerry and Yuksekgonul, Mert and Peshawaria, Rahee Ghosh and Naik, Ranjita and Nushi, Besmira},
  journal={arXiv preprint arXiv:2310.15511},
  year={2023}
}

Contributors

Marah I Abdin, Suriya Gunasekar, Varun Chandrasekaran, Jerry Li, Mert Yuksekgonul, Rahee Ghosh Peshawaria, Ranjita Naik, Besmira Nushi
