led-large-book-summary
This model is a fine-tuned version of allenai/led-large-16384 on the BookSum dataset (kmfoda/booksum). It aims to generalize well and be useful in summarizing lengthy text for both academic and everyday purposes.
- Handles up to 16,384 tokens input
- See the Colab demo linked above or try the demo on Spaces
Note: Due to inference API timeout constraints, outputs may be truncated before the full summary is returned (try Python or the demo instead).
Basic Usage
To improve summary quality, use encoder_no_repeat_ngram_size=3 when calling the pipeline object. This setting encourages the model to utilize new vocabulary and construct an abstractive summary.
Load the model into a pipeline object:
import torch
from transformers import pipeline

hf_name = 'pszemraj/led-large-book-summary'

summarizer = pipeline(
    "summarization",
    hf_name,
    device=0 if torch.cuda.is_available() else -1,
)
Feed the text into the pipeline object:
wall_of_text = "your words here"

result = summarizer(
    wall_of_text,
    min_length=16,
    max_length=256,
    no_repeat_ngram_size=3,
    encoder_no_repeat_ngram_size=3,
    repetition_penalty=3.5,
    num_beams=4,
    early_stopping=True,
)
Important: For optimal summary quality, use the global attention mask when decoding, as demonstrated in this community notebook (see the definition of generate_answer(batch)).
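The linked notebook is not reproduced here, so the following is only a minimal sketch of one common LED convention: put global attention on the first token and leave local attention everywhere else. The function names `build_global_attention_mask` and `summarize_with_global_attention` are hypothetical, and the generation settings simply mirror the pipeline call above; the notebook may differ in its details.

```python
from typing import List


def build_global_attention_mask(attention_mask: List[List[int]]) -> List[List[int]]:
    """Return a mask of the same shape with 1 on the first token of each row.

    1 = global attention, 0 = local (sliding-window) attention.
    Assumption: only the first token attends globally.
    """
    mask = [[0] * len(row) for row in attention_mask]
    for row in mask:
        if row:
            row[0] = 1
    return mask


def summarize_with_global_attention(
    text: str,
    model_name: str = "pszemraj/led-large-book-summary",
) -> str:
    """Summarize `text` by calling generate() with an explicit global attention mask."""
    # Heavy imports stay local so the helper above works without them.
    import torch
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    # Truncate to the model's 16,384-token input limit.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=16384)
    global_attention_mask = torch.tensor(
        build_global_attention_mask(inputs["attention_mask"].tolist())
    )

    with torch.no_grad():
        summary_ids = model.generate(
            inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            global_attention_mask=global_attention_mask,
            max_length=256,
            num_beams=4,
            no_repeat_ngram_size=3,
            early_stopping=True,
        )
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)
```

Calling `summarize_with_global_attention(wall_of_text)` downloads the model on first use and runs on CPU unless you move the model and tensors to a GPU yourself.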
If you're facing computing constraints, consider using the base version, pszemraj/led-base-book-summary.
Training Information
Data
The model was fine-tuned on the booksum dataset. During training, the chapter column was the input, while the summary_text column was the output.
Procedure
Fine-tuning was run on the BookSum dataset across 13+ epochs. Notably, the final four epochs combined the training and validation sets as 'train' to enhance generalization.
Hyperparameters
The training process involved different settings across stages:
- Initial Three Epochs: Low learning rate (5e-05), batch size of 1, 4 gradient accumulation steps, and a linear learning rate scheduler.
- In-between Epochs: Learning rate reduced to 4e-05, increased batch size to 2, 16 gradient accumulation steps, and switched to a cosine learning rate scheduler with a 0.05 warmup ratio.
- Final Two Epochs: Further reduced learning rate (2e-05), batch size reverted to 1, maintained gradient accumulation steps at 16, and continued with a cosine learning rate scheduler, albeit with a lower warmup ratio (0.03).
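To make the three stages easier to compare, here is a small sketch that records the stated settings and computes the effective batch size (batch size times gradient accumulation steps) for each. The key names follow `transformers.TrainingArguments`, which is an assumption about the tooling, and the stage labels and `effective_batch_size` helper are hypothetical. No warmup ratio is given for the first stage, so none is listed. The calculation assumes a single device.

```python
# Settings copied from the list above; key names mirror
# transformers.TrainingArguments (an assumption about the training setup).
STAGES = {
    "initial": {
        "learning_rate": 5e-5,
        "per_device_train_batch_size": 1,
        "gradient_accumulation_steps": 4,
        "lr_scheduler_type": "linear",
    },
    "in_between": {
        "learning_rate": 4e-5,
        "per_device_train_batch_size": 2,
        "gradient_accumulation_steps": 16,
        "lr_scheduler_type": "cosine",
        "warmup_ratio": 0.05,
    },
    "final": {
        "learning_rate": 2e-5,
        "per_device_train_batch_size": 1,
        "gradient_accumulation_steps": 16,
        "lr_scheduler_type": "cosine",
        "warmup_ratio": 0.03,
    },
}


def effective_batch_size(stage: dict, num_devices: int = 1) -> int:
    """Samples contributing to each optimizer step (single device assumed by default)."""
    return (
        stage["per_device_train_batch_size"]
        * stage["gradient_accumulation_steps"]
        * num_devices
    )


for name, cfg in STAGES.items():
    print(name, effective_batch_size(cfg))
# initial 4
# in_between 32
# final 16
```

Under the single-device assumption, the in-between stage used the largest effective batch (32), and the final stage halved it to 16 while also lowering the learning rate.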
Versions
- Transformers 4.19.2
- PyTorch 1.11.0+cu113
- Datasets 2.2.2
- Tokenizers 0.12.1
Simplified Usage with TextSum
To streamline the process of using this and other models, I've developed a Python package utility named textsum. This package offers simple interfaces for applying summarization models to text documents of arbitrary length.
Install TextSum:
pip install textsum
Then use it in Python with this model:
from textsum.summarize import Summarizer

model_name = "pszemraj/led-large-book-summary"
summarizer = Summarizer(
    model_name_or_path=model_name,  # you can use any Seq2Seq model on the Hub
    token_batch_length=4096,  # tokens to batch summarize at a time, up to 16384
)

long_string = "This is a long string of text that will be summarized."
out_str = summarizer.summarize_string(long_string)
print(f"summary: {out_str}")
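To give a feel for what a setting like token_batch_length=4096 implies, the sketch below splits a long sequence of token IDs into fixed-size windows that could each be summarized separately. This is only a conceptual illustration, not textsum's actual implementation (which may handle overlap, padding, or other details differently), and `iter_token_batches` is a hypothetical helper.

```python
from typing import Iterator, List, Sequence


def iter_token_batches(
    token_ids: Sequence[int],
    batch_length: int = 4096,
) -> Iterator[List[int]]:
    """Yield consecutive, non-overlapping windows of at most `batch_length` tokens."""
    if batch_length <= 0:
        raise ValueError("batch_length must be positive")
    for start in range(0, len(token_ids), batch_length):
        yield list(token_ids[start:start + batch_length])


# Stand-in for the token IDs of a long document (10,000 tokens).
fake_ids = list(range(10_000))
batches = list(iter_token_batches(fake_ids, batch_length=4096))
print([len(b) for b in batches])  # [4096, 4096, 1808]
```

A smaller window lowers memory use per batch at the cost of giving the model less context for each partial summary.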
Currently implemented interfaces include a Python API, a Command-Line Interface (CLI), and a demo/web UI.
For detailed explanations and documentation, check the README or the wiki.
Related Models
Check out these other related models, also trained on the BookSum dataset:
- LED-large continued - experiment with further fine-tuning
- Long-T5-tglobal-base
- BigBird-Pegasus-Large-K
- Pegasus-X-Large
- Long-T5-tglobal-XL
There are also other variants, fine-tuned on other datasets, on my HF profile; feel free to try them out :)
Evaluation results
- ROUGE-1 on kmfoda/booksum (test set, verified): 31.731
- ROUGE-2 on kmfoda/booksum (test set, verified): 5.331
- ROUGE-L on kmfoda/booksum (test set, verified): 16.146
- ROUGE-LSUM on kmfoda/booksum (test set, verified): 29.088
- loss on kmfoda/booksum (test set, verified): 4.816
- gen_len on kmfoda/booksum (test set, verified): 154.904
- ROUGE-1 on samsum (test set, verified): 33.448
- ROUGE-2 on samsum (test set, verified): 10.425
- ROUGE-L on samsum (test set, verified): 24.580
- ROUGE-LSUM on samsum (test set, verified): 29.823
- loss on samsum (test set, verified): 4.176
- gen_len on samsum (test set, verified): 65.400
- ROUGE-1 on billsum (test set, verified): 40.584
- ROUGE-2 on billsum (test set, verified): 17.340
- ROUGE-L on billsum (test set, verified): 25.126
- ROUGE-LSUM on billsum (test set, verified): 34.662
- loss on billsum (test set, verified): 4.793
- gen_len on billsum (test set, verified): 163.939
- ROUGE-1 on multi_news (test set, verified): 39.083
- ROUGE-2 on multi_news (test set, verified): 11.404
- ROUGE-L on multi_news (test set, verified): 19.181
- ROUGE-LSUM on multi_news (test set, verified): 35.158
- loss on multi_news (test set, verified): 4.655
- gen_len on multi_news (test set, verified): 186.249
- ROUGE-1 on cnn_dailymail (test set, verified): 32.877
- ROUGE-2 on cnn_dailymail (test set, verified): 13.371
- ROUGE-L on cnn_dailymail (test set, verified): 20.436
- ROUGE-LSUM on cnn_dailymail (test set, verified): 30.441
- loss on cnn_dailymail (test set, verified): 5.349
- gen_len on cnn_dailymail (test set, verified): 181.833