In the realm of natural language processing (NLP), both BERT and Sumy have established their presence. We described the highlights of analyzing stories in a previous musing. While both tools process text, they have distinct purposes and characteristics. To recap, BERT is a deep learning model renowned for understanding word context, thanks to its Transformer architecture. After training on extensive text corpora, BERT can be tailored for various NLP tasks, including summarization. Sumy, on the other hand, is a Python library crafted specifically for automatic text summarization. Eschewing the intricacies of deep learning, Sumy leans on traditional NLP algorithms such as LSA, LexRank, and TextRank. It offers a compact, efficient solution for distilling lengthy texts.
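To make the extractive, heuristic style concrete, here is a toy frequency-based summarizer. To be clear, this is not Sumy's code or its API; it is a minimal stdlib-only sketch of the general idea behind algorithms like those Sumy uses: score sentences against the document and pull the highest-scoring ones verbatim.

```python
import re
from collections import Counter

def extractive_summary(text, num_sentences=2):
    """Toy frequency-based extractive summarizer (illustrative only,
    not Sumy itself).

    Scores each sentence by the average document-wide frequency of its
    words, then returns the top-scoring sentences in original order.
    """
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    words = re.findall(r'[a-z]+', text.lower())
    freq = Counter(words)

    scored = []
    for i, sent in enumerate(sentences):
        sent_words = re.findall(r'[a-z]+', sent.lower())
        if not sent_words:
            continue
        score = sum(freq[w] for w in sent_words) / len(sent_words)
        scored.append((score, i, sent))

    # Pick the strongest sentences, then restore document order.
    top = sorted(scored, reverse=True)[:num_sentences]
    return [sent for _, _, sent in sorted(top, key=lambda t: t[1])]
```

Because the output is built from sentences that already exist in the source, this style of summarizer degrades more gracefully on messy input than an abstractive model does, a point that matters later.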
From recent testing, Sumy is incredible at efficiently capturing the heart of the tale, the overview. BERT, however, seems to grab the turn, the point where the plot changes. Many of the works that come across the desk, although lengthy, are structured efficiently: characters are clearly defined and the text is well formatted.
In the data analysis field, the clarity and structure of the source document play a pivotal role in the performance of both BERT and Sumy. Here’s why:
- Contextual Understanding: BERT thrives on context. A disjointed or muddled text can impede BERT’s ability to pinpoint main ideas and curate a coherent summary.
- Heuristic Dependence: Sumy’s traditional algorithms hinge on heuristics, from keyword frequency to sentence positioning. An unstructured text can diminish the efficacy of these heuristics.
- Fragmented Sentences: Incoherent sentences or text fragments can lead BERT to produce seemingly truncated summaries. Sumy might fare slightly better here, as it often pulls existing sentences from the text.
- Noise in the Data: Irrelevant details or inconsistencies can mislead both models. While BERT might be somewhat resilient due to its contextual prowess, it’s not entirely immune.
- Document Length: Extremely lengthy documents pose unique challenges. BERT has inherent token limits, necessitating chunking for long texts and risking context loss. Sumy, being extractive, can handle length more gracefully.
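The chunking mentioned above can be sketched in a few lines. Note the assumptions: real BERT pipelines count subword tokens from the model's own tokenizer (commonly capped at 512), whereas this sketch splits on whitespace words purely to stay self-contained, and the overlap size is an arbitrary illustrative choice meant to reduce context loss at chunk boundaries.

```python
def chunk_text(text, max_words=400, overlap=50):
    """Split text into overlapping word-level chunks.

    Illustrative stand-in for token-based chunking: a production
    pipeline would measure length in the model tokenizer's subword
    tokens, not whitespace words.
    """
    words = text.split()
    if len(words) <= max_words:
        return [text]

    chunks = []
    step = max_words - overlap  # each chunk re-reads `overlap` words
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
    return chunks
```

Each chunk would then be summarized separately and the partial summaries merged, which is exactly where context can slip through the cracks.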
While BERT stands out for its versatility across NLP tasks and Sumy excels in its niche of text summarization, the quality of the input text is a significant determinant of their success. Proper preprocessing, ensuring text clarity, and structured content can greatly enhance the outcomes of both tools.
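What might that preprocessing look like for a PDF-extracted document? The snippet below is a minimal sketch, assuming the usual PDF-conversion damage (words hyphenated across line breaks, hard-wrapped lines mid-sentence, runs of stray whitespace); it is not the cleanup routine we actually ran, just an illustration of the kind of repair that helps both tools.

```python
import re

def clean_pdf_text(raw):
    """Light cleanup for PDF-extracted text (illustrative sketch).

    Joins words hyphenated across line breaks, unwraps hard line
    breaks that fall mid-sentence, and collapses repeated spaces.
    """
    text = re.sub(r'-\n(?=\w)', '', raw)              # "sum-\nmary" -> "summary"
    text = re.sub(r'(?<![.!?:])\n(?=\S)', ' ', text)  # unwrap mid-sentence breaks
    text = re.sub(r'[ \t]+', ' ', text)               # collapse whitespace runs
    return text.strip()
```

Skipping steps like these is precisely the stress test we run next.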
To test this theory, we grabbed a long, onerous document: the Durham report. To make it more challenging, the document was converted from PDF without cleanup. And then the machine hummed away on an expansive amount of text. Note that our preprocessing distills the results down to two to three sentences, no matter the document size. Here is what the query returned:
Given the significant quantity of materials the FBI and other government agencies did in fact receive during the 2016 presidential election season and afterwards that originated with and/or were funded by the Clinton campaign or affiliated persons (i.e., the Steele Dossier reports, the Alfa Bank allegations, and the Yotaphone allegations), the Clinton Plan intelligence prompted the Office to consider (i) whether there was in fact a plan by the Clinton campaign to tie Trump to Russia in order to “stir up a scandal” in advance of the 2016 presidential election, and (ii) if such a plan existed, whether an aspect or component of that plan was to intentionally provide knowingly false and/or misleading information to the FBI or other agencies in furtherance of such a plan.
Second, the Clinton Plan intelligence was also highly relevant to the Office’s review and investigation because it was part of the mosaic of information that became known to certain U.S. officials at or before the time they made critical decisions in the Crossfire Hurricane case and in related law enforcement and intelligence efforts. Despite these concerns, the fact that Steele’s information was being financed by the DNC and/or the Clinton campaign was not included in the affidavit’s source description of Steele. The failure to provide this information to the FISC was a major omission in that the information clearly had the potential to affect the analysis of any bias in the reporting.
BERT came to a far different conclusion, and it churned over the same document for a noticeably longer period of time. Still, its output was also interesting.
They never asked and don’t want to ask. October 14, 2016.
The Suspense of the Story
Now, both tools provide different analyses, different summaries, of the same record. That said, Sumy is a tool that churns through documents, tosses data challenges aside, and delivers a summary built from sentences scattered throughout the text. Still, there is something about BERT and its output. Who knows what happened on October 14, 2016? There is a certain amount of suspense here, a hook or a turn. There are different ways to craft a simple yet compelling tale.
Everyone has a story to tell,
Second Act Fables Team