You need to agree to share your contact information to access this dataset

This repository is publicly accessible, but you have to accept the conditions to access its files and content.

Access to this dataset is automatically granted upon accepting the AI2 ImpACT License - Low Risk Artifacts (“LR Agreement”) and completing all fields below.

Log in or Sign Up to review the conditions and access this dataset content.

Dataset Card for WildChat

Dataset Description

Dataset Summary

WildChat is a collection of 1 million conversations between human users and ChatGPT, alongside demographic data, including state, country, hashed IP addresses, and request headers. We collected WildChat by offering online users free access to OpenAI's GPT-3.5 and GPT-4. In this version, 25.53% of the conversations come from the GPT-4 chatbot, while the rest come from the GPT-3.5 chatbot. The dataset contains a broad spectrum of user-chatbot interactions that are not previously covered by other instruction fine-tuning datasets: for example, interactions include ambiguous user requests, code-switching, topic-switching, political discussions, etc. WildChat can serve both as a dataset for instructional fine-tuning and as a valuable resource for studying user behaviors. Note that this dataset contains both toxic and non-toxic user inputs/ChatGPT responses.

WildChat has been openly released under AI2's ImpACT license as a low-risk artifact. The use of WildChat to cause harm is strictly prohibited.

Languages

68 languages were detected in WildChat.

Personal and Sensitive Information

The data has been de-identified with Microsoft Presidio and hand-written rules by the authors.

Data Fields

  • conversation_hash (string): The hash of each conversation's content. This is not a unique key, as different conversations with the same content will share the same hash. For unique identifiers, use turn_identifier within each turn.
  • model (string): The underlying OpenAI model, such as gpt-3.5-turbo or gpt-4.
  • timestamp (timestamp): The timestamp of the last turn in the conversation in UTC.
  • conversation (list): A list of user/assistant utterances. Each utterance is a dictionary containing the role of the speaker (user or assistant), the content of the utterance, the detected language of the utterance, whether the content of the utterance is considered toxic, and whether PII has been detected and anonymized (redacted). For user turns, there's also the hashed IP address hashed_ip of the turn, the state state and country country inferred from the original IP address, and the request headers header (which might be useful for linking multiple conversations from the same user when used in conjunction with hashed_ip). For assistant turns, there's a field timestamp which is the time when the backend server receives the full response from ChatGPT. For both user and assistant turns, there's a unique idenifier turn_identifier.
  • turn (int): The number of turns in the conversation. A turn refers to one round of user-assistant interaction.
  • language (string): The language of the conversation. Note that this is the most frequently detected language in the utterances of the conversation.
  • openai_moderation (list): A list of OpenAI Moderation results. Each element in the list corresponds to one utterance in the conversation. When the content of an utterance is an empty string, the corresponding moderation reult is set to be an empty dictionary.
  • detoxify_moderation (list): A list of Detoxify results. Each element in the list corresponds to one utterance in the conversation. When the content of an utterance is an empty string, the corresponding Detoxify reult is set to be an empty dictionary.
  • toxic (bool): Whether this conversation contains any utterances considered to be toxic by either OpenAI Moderation or Detoxify.
  • redacted (bool): Whether this conversation contains any utterances in which PII is detected and anonymized.
  • state (string): The state inferred from the most common IP address in the conversation. Its value is sometimes None when GeoIP2 does not identify the state of an IP address.
  • country (string): The country inferred from the most common IP address in the conversation. Its value is sometimes None when GeoIP2 does not identify the country of an IP address.
  • hashed_ip (string): The most common hashed IP address in the conversation.
  • header (string): The request header containing information about operating system, browser versions, and accepted languages. This field might be useful for linking multiple conversations from the same user when used in conjunction with hashed_ip. Note that every turn in a conversation has the same header, as this is the way we linked turns into conversations.

Empty User Inputs

This dataset includes a small subset of conversations where users submitted empty inputs, sometimes leading to hallucinated responses from the assistant. This issue, first noticed by @yuchenlin, arises from the design of our Huggingface chatbot used for data collection, which did not restrict the submission of empty inputs. As a result, users could submit without entering any text, causing the assistant to generate responses without any user prompts. This occurs in a small fraction of the dataset.

Licensing Information

WildChat is made available under the AI2 ImpACT License - Low Risk Artifacts ("LR Agreement")

Citation Information

Please consider citing our paper if you find this dataset useful:

@inproceedings{
zhao2024wildchat,
title={WildChat: 1M Chat{GPT} Interaction Logs in the Wild},
author={Wenting Zhao and Xiang Ren and Jack Hessel and Claire Cardie and Yejin Choi and Yuntian Deng},
booktitle={The Twelfth International Conference on Learning Representations},
year={2024},
url={https://openreview.net/forum?id=Bl8u7ZRlbM}
}
Downloads last month
7
Edit dataset card