As part of testing on the Fable Forge application, the team has played around with large language models (LLMs) of various shapes and sizes, including BERT, Llama, and Mistral. Tinkering leads to solving different problems. One that kept rising to the top was overused words.
Why the fuss?
Ever read something that felt as monotonous as chewing the same flavor of gum until it loses its taste? Overused words can have that effect on writing, rendering it dull and uninteresting. They’re like unwelcome guests at a party, popping up more than they should.
Sure, a little repetition is fine.
But simple mistakes that find their way into texts can reduce the quality of what might otherwise be a great story. It’s crazy common to have a book proofread multiple times only to later catch phantoms in a first edition. What’s a phantom? Think duplicated words, like “the the.” But having thousands of proofreaders isn’t exactly feasible either.

To solve the challenge, the team wrote a significant amount of Python code to find, break apart, and recreate repeated words. One would think this is simple. But there are varying types of repetition. There are phrases. Often-used entities too. And then n-grams, any run of n consecutive words that keeps coming back. Also, when pulling these together, stop words should be taken into account. Words like the, and, a, and or bounce around everywhere. They should. Still, the nuance in filtering is deceptively complicated.
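To make that concrete, here is a minimal sketch of n-gram counting with a stop-word filter. It is not the Fable Forge code: the function names, the tiny stop-word list, and the sample sentence are all assumptions made up for illustration.

```python
import re
from collections import Counter

# Illustrative stop-word list (an assumption); a real one would be much longer.
STOP_WORDS = {"the", "and", "a", "or", "to", "of", "in", "it", "is"}

def tokenize(text):
    """Lowercase the text and pull out runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def repeated_ngrams(text, n=3, min_count=2):
    """Count n-word phrases, skipping phrases made up only of stop words."""
    words = tokenize(text)
    counts = Counter(
        tuple(words[i:i + n]) for i in range(len(words) - n + 1)
    )
    return {
        " ".join(gram): count
        for gram, count in counts.items()
        if count >= min_count and not all(w in STOP_WORDS for w in gram)
    }

sample = "She went back to you. He came back to you. They ran to the car."
print(repeated_ngrams(sample, n=3))  # {'back to you': 2}
```

Note the filter only drops phrases made entirely of stop words. A phrase like “back to you” survives because “back” carries meaning, which is one small taste of the nuance mentioned above.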
There are solid tools on the market able to spot overused words in short chapters. But when analyzing a work of 100,000 words, complications enter the picture. During development, the Fable Forge team batted around numerous approaches before settling on z-scores, p-values, and deviation while leveraging Python visualization techniques. So, for example, let’s say an author overuses “back to you” or “just to the.” It’s like playing poker. Everyone has a tell. In the movie Rounders, Teddy KGB was an unstoppable force until Matt Damon’s character discovered that the way he ate Oreos tipped his hand.
Yeah, blind spots exist even for master Russian poker players.
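For the statistically curious, here is a rough sketch of how a z-score can surface a tell. It is only one way to apply the idea and should not be read as the Fable Forge implementation: the phrase counts are invented for illustration, and the cutoff of 1.0 is an arbitrary assumption.

```python
from statistics import mean, pstdev

def z_scores(counts):
    """Return how many standard deviations each count sits from the mean."""
    values = list(counts.values())
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        # Every phrase appears equally often, so nothing stands out.
        return {phrase: 0.0 for phrase in counts}
    return {phrase: (count - mu) / sigma for phrase, count in counts.items()}

# Hypothetical phrase counts from a manuscript (made-up numbers).
phrase_counts = {
    "back to you": 42,
    "just to the": 38,
    "out of the": 9,
    "in the end": 7,
    "at the door": 6,
    "on the way": 8,
}

scores = z_scores(phrase_counts)
# Flag phrases that sit well above the average (1.0 is an arbitrary cutoff).
flagged = [phrase for phrase, z in scores.items() if z > 1.0]
print(flagged)  # ['back to you', 'just to the']
```

A phrase with a high z-score is used far more often than the author's typical phrase, which is exactly the kind of tell worth a second look.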
But now that an author has discovered these blind spots, what next? With the immense number of editing applications on the market, how do you find and adjust efficiently?
Enter Regex.
The ultimate search tool for combing through text documents. Regex, short for regular expressions, is a language for describing search patterns as a sequence of characters, which can include symbols representing specific types of characters. Learning it can be intimidating, but it is extremely valuable. Here are a few simple commands to get started:
- A simple spell: \b(so|very|just|really)\b. This charm will find any instance of the words “so,” “very,” “just,” or “really” in your text, allowing you to see how often you’re using these common modifiers. Swap in your own words between the | symbols to hunt for any others.
- Need to find all uppercase words in a large text? Try: \b[A-Z]+\b (this also catches single capital letters such as “I” and “A”).
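- For finding pesky extra spaces, try the following: [ \t]{2,} (two or more spaces or tabs in a row; a plain [ \t]+ would also match every ordinary single space).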
- How about an adverb hunter? Here: \b\w+ly\b (expect a few false positives, such as “family” or “reply”).
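Since the Fable Forge work is in Python, here is a small sketch of running patterns like these with Python's built-in re module. The helper name, the pattern labels, and the sample sentence are assumptions for illustration. Two deliberate choices: the modifier pattern uses a non-capturing group, (?:...), so that findall returns whole matches, and the space pattern looks for two or more spaces or tabs in a row so ordinary single spaces are not counted.

```python
import re

# Labels are illustrative; the patterns follow the list above.
PATTERNS = {
    "modifiers": r"\b(?:so|very|just|really)\b",
    "uppercase words": r"\b[A-Z]+\b",
    "extra spaces": r"[ \t]{2,}",  # two or more spaces/tabs in a row
    "-ly words": r"\b\w+ly\b",
}

def find_matches(text):
    """Return every match for each named pattern, in order of appearance."""
    return {name: re.findall(pattern, text) for name, pattern in PATTERNS.items()}

sample = "NASA was really late,  so she just walked very slowly home."

for name, matches in find_matches(sample).items():
    print(f"{name}: {len(matches)} found -> {matches}")
```

Matching is case-sensitive by default. Passing re.IGNORECASE as a third argument to re.findall would also catch “Really” or “VERY,” though it would make the uppercase pattern match every word.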
Unfortunately, not every tool on the market leverages the Regex language. In the land of text editors and word processors, several heroes wield the power of Regex:
- Notepad++ (Windows): A trusty sidekick for disciples of Gates, ideal for quick edits and coding, with a knack for Regex searches.
- Sublime Text and Visual Studio Code: These editors are like the Swiss Army knives in your toolbox, versatile for coding and text editing.
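- BBEdit (macOS): For Apple aficionados, with full Regex search under its “Grep” option. TextEdit, by contrast, offers only a limited pattern menu in its Find bar.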
- GNU Emacs and Vim: The old guards of the text-editing realm, complex yet powerful.
- Google Docs: A friendly giant, not the most adept at Regex, but it offers basic support in Find and replace.
- LibreOffice Writer: The open-source champion.
Once you settle on a tool and a few commands, you’ll find Regex a powerful ally in a writer’s journey. It’s not just about finding words; it’s about refining your craft, making your writing more engaging and effective.
