Sentence Similarity

Sentence Similarity is the task of determining how similar two texts are. Sentence similarity models convert input texts into vectors (embeddings) that capture semantic information, then calculate how close (similar) those vectors are to each other. This task is particularly useful for information retrieval and clustering/grouping.

Inputs
Source sentence

Machine learning is so easy.

Sentences to compare to

Deep learning is so straightforward.

This is so difficult, like rocket science.

I can't believe how much I struggled with this.

Sentence Similarity Model
Output (similarity score for each sentence)

0.623  Deep learning is so straightforward.
0.413  This is so difficult, like rocket science.
0.256  I can't believe how much I struggled with this.

About Sentence Similarity

Use Cases 🔍

Information Retrieval

You can extract information from documents using Sentence Similarity models. The first step is to rank documents using Passage Ranking models. You can then take the top-ranked document and search it with a Sentence Similarity model, selecting the sentence that is most similar to the input query.
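
The two steps above can be sketched in a few lines of Python. This is only an illustration: `embed` is a hypothetical toy stand-in (a bag-of-words count vector) for a real embedding model, the same cosine score is used for both the document ranking and the sentence search, and the documents and query are made up.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for a real embedding model: a bag-of-words count vector."""
    cleaned = text.lower().replace(".", "").replace("?", "")
    return Counter(cleaned.split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Made-up documents and query, for illustration only
documents = [
    "Cats sleep most of the day. They hunt at night.",
    "The Hub hosts many models. Sentence similarity models produce embeddings.",
]
query = "Which models produce embeddings?"

# Step 1: rank whole documents against the query and keep the best one
q = embed(query)
top_document = max(documents, key=lambda d: cosine(q, embed(d)))

# Step 2: split the top document into sentences and pick the closest one
sentences = [s.strip() for s in top_document.split(".") if s.strip()]
best_sentence = max(sentences, key=lambda s: cosine(q, embed(s)))
print(best_sentence)
```

In practice you would replace `embed` with a Passage Ranking model for step 1 and a Sentence Similarity model for step 2. The control flow stays the same.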

The Sentence Transformers library

The Sentence Transformers library is very powerful for calculating embeddings of sentences, paragraphs, and entire documents. An embedding is just a vector representation of a text and is useful for finding how similar two texts are.

You can find and use hundreds of Sentence Transformers models from the Hub by directly using the library, playing with the widgets in the browser or using Inference Endpoints.

Task Variants

Passage Ranking

Passage Ranking is the task of ranking documents based on their relevance to a given query. The task is evaluated on Mean Reciprocal Rank. These models take one query and multiple documents and return the documents ranked by their relevance to the query. 📄

You can run inference with Passage Ranking models using Inference Endpoints. The model takes two inputs: a query and the documents you want to search. It returns a relevance score for each document with respect to the query.

import requests

API_URL = "https://api-inference.huggingface.co/models/sentence-transformers/msmarco-distilbert-base-tas-b"
api_token = "YOUR_API_TOKEN"  # placeholder: replace with your own access token
headers = {"Authorization": f"Bearer {api_token}"}

def query(payload):
    # Send the payload as JSON and return the decoded JSON response
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

data = query(
    {
        "inputs": {
            "source_sentence": "That is a happy person",
            "sentences": [
                "That is a happy dog",
                "That is a very happy person",
                "Today is a sunny day"
            ]
        }
    }
)
## [0.853, 0.981, 0.655]
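
The response is a plain list of scores in the same order as the input sentences, so it still has to be turned into a ranking. A minimal sketch, assuming the scores are the example values shown above (no request is made here):

```python
sentences = [
    "That is a happy dog",
    "That is a very happy person",
    "Today is a sunny day",
]
scores = [0.853, 0.981, 0.655]  # example scores copied from the output above

# Pair each candidate with its score and sort from most to least relevant
ranked = sorted(zip(sentences, scores), key=lambda pair: pair[1], reverse=True)

for rank, (sentence, score) in enumerate(ranked, start=1):
    print(f"{rank}. {sentence} ({score:.3f})")
```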

Semantic Textual Similarity

Semantic Textual Similarity is the task of evaluating how similar two texts are in terms of meaning. These models take a source sentence and a list of sentences to compare it against, and return a list of similarity scores. The benchmark dataset is the Semantic Textual Similarity Benchmark. The task is evaluated with the Pearson correlation between predicted scores and human ratings, often reported alongside Spearman's rank correlation.

import requests

API_URL = "https://api-inference.huggingface.co/models/sentence-transformers/all-MiniLM-L6-v2"
api_token = "YOUR_API_TOKEN"  # placeholder: replace with your own access token
headers = {"Authorization": f"Bearer {api_token}"}

def query(payload):
    # Send the payload as JSON and return the decoded JSON response
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

data = query(
    {
        "inputs": {
            "source_sentence": "I'm very happy",
            "sentences": ["I'm filled with happiness", "I'm happy"]
        }
    }
)

## [0.605, 0.894]
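
To make the evaluation step concrete, here is a rough sketch of how a Pearson correlation between model scores and human ratings could be computed. The scores and ratings below are hypothetical values invented for illustration, not results from any model or dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: model similarity scores and human ratings on a 0-5 scale
model_scores = [0.91, 0.35, 0.62, 0.10]
human_ratings = [4.8, 2.0, 3.5, 0.5]

print(round(pearson(model_scores, human_ratings), 3))
```

A value close to 1 means the model orders and spaces the sentence pairs much as the human annotators did.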

You can also run inference with the models on the Hub using the Sentence Transformers library.

pip install -U sentence-transformers

from sentence_transformers import SentenceTransformer, util

sentences = ["I'm happy", "I'm full of happiness"]

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# Compute an embedding for each sentence
embedding_1 = model.encode(sentences[0], convert_to_tensor=True)
embedding_2 = model.encode(sentences[1], convert_to_tensor=True)

# Cosine similarity between the two embeddings
util.pytorch_cos_sim(embedding_1, embedding_2)
## tensor([[0.6003]])

Useful Resources

Would you like to learn more about Sentence Transformers and Sentence Similarity? Awesome! Here you can find some curated resources that you may find helpful!

Models for Sentence Similarity

Note This model works well for sentences and paragraphs and can be used for clustering/grouping and semantic searches.

Datasets for Sentence Similarity

Note Bing queries with relevant passages from various web sources.

Spaces using Sentence Similarity

Note An application that leverages sentence similarity to answer questions from YouTube videos.

Note An application that retrieves relevant PubMed abstracts for a given online article which can be used as further references.

Note An application that leverages sentence similarity to summarize text.

Metrics for Sentence Similarity
Mean Reciprocal Rank
Reciprocal Rank measures how high the first relevant document appears in a ranked list of results. It is the reciprocal of that document's rank: if the rank is 3, the Reciprocal Rank is 1/3 (about 0.33), and if the rank is 1, the Reciprocal Rank is 1. Mean Reciprocal Rank is the average of the Reciprocal Rank over a set of queries.
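
As a small worked example, Mean Reciprocal Rank can be computed as below. The function names and the document ids are hypothetical, chosen only for illustration.

```python
def reciprocal_rank(ranked_ids, relevant_ids):
    """Return 1 / rank of the first relevant item, or 0.0 if none is found."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(results):
    """Average the reciprocal rank over (ranked_ids, relevant_ids) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in results) / len(results)

# Hypothetical results for three queries: ranked ids and the relevant set
results = [
    (["d3", "d1", "d7"], {"d1"}),  # first relevant at rank 2 -> 1/2
    (["d2", "d5", "d9"], {"d2"}),  # first relevant at rank 1 -> 1
    (["d4", "d6", "d8"], {"d8"}),  # first relevant at rank 3 -> 1/3
]
print(round(mean_reciprocal_rank(results), 3))  # 0.611
```
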
Cosine Similarity
The similarity of the embeddings is evaluated mainly with cosine similarity, which is calculated as the cosine of the angle between two vectors. Because it depends only on the direction of the vectors and not on their magnitude, it is particularly useful when your texts are not the same length.
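
A minimal sketch of the calculation in plain Python, using small hand-written vectors rather than real embeddings:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]   # same direction as a, twice the magnitude
c = [-3.0, 0.0, 1.0]  # orthogonal to a

print(cosine_similarity(a, b))  # approximately 1.0
print(cosine_similarity(a, c))  # 0.0
```

Scaling a vector does not change the result, which is why `a` and `b` score as near-identical even though `b` is twice as long.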