Fourier Transform Applications in Literary Analysis
How mathematics and data analysis can offer a head start to analysing poetry, before even reading the words.

Poetry is often seen as a pure art form, ranging from the rigid structure of a haiku to the fluid, unconstrained nature of free-verse poetry. In analysing these works, though, to what extent can mathematics and data analysis be used to glean meaning from this free-flowing literature? Of course, rhetoric can be analysed, references can be found, and word choice can be questioned, but can the underlying, even subconscious, thought process of an author be uncovered using analytic tactics on literature? As an initial exploration into computer-assisted literary analysis, we’ll attempt to use a Fourier transform program to search for periodicity in a poem. To test our code, we’ll use two case studies: “Do Not Go Gentle into That Good Night” by Dylan Thomas, followed by Lewis Carroll’s “Jabberwocky.”
1. Data acquisition
a. Line splitting and word count
Before doing any calculations, all necessary data must be collected. For our purposes, we’ll want a data set of the number of letters, words, syllables, and visual length of each line. First, we need to parse the poem itself (which is inputted as a plain text file) into substrings for each line. This is easily done in Python with the .split() method: passing the delimiter "\n" into the method will split the file by line, returning a list of strings, one per line (the full call is poem.split("\n")). Counting the number of words follows nicely from this: iterating across all lines, apply the .split() method again, this time with no delimiter, so that it defaults to splitting on whitespace, turning each line string into a list of word strings. Then, to count the number of words on any given line, simply call the built-in len() function on that list; since each line has been broken into a list of words, len() returns the number of items in the list, which is the word count.
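For instance, a minimal sketch of these two steps (the variable names here are illustrative, not taken from the post’s full program):

lines = poem.split("\n")      # one string per line of the poem
for line in lines:
    words = line.split()      # split on whitespace -> list of word strings
    num_words = len(words)    # word count for this line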
b. Letter count
To calculate the number of letters in each line, all we need to do is take the sum of the letter counts of each word: for a given line, we iterate over each word, calling len() to get the character count of that word. After iterating over all words in a line, the counts are summed for the total number of characters on the line; the code to perform this is sum(len(word) for word in words).
c. Visual length
Calculating the visual length of each line is simple: assuming a monospace font, the visual length of a line is just the total number of characters (including spaces!) present on the line, i.e. len(line). However, most fonts are not monospace, especially common literary fonts like Caslon, Garamond, and Georgia. This presents an issue because, without knowing the exact font an author was writing with, we can’t calculate the precise line length. While the monospace assumption does leave room for error, considering the visual length in some capacity is important, so it will have to do.
d. Syllable count
Getting the syllable count without manually reading each line is the most challenging part of data collection. To identify a syllable, we’ll use vowel clusters. Note that in my program I defined a function, count_syllables(word), to count the syllables in each word. To preformat the word, we set it to all lowercase using word = word.lower() and remove any punctuation that may be contained in the word using word = re.sub(r'[^a-z]', '', word). Next, find all vowels or vowel clusters; each should be a syllable, since a syllable is a unit of pronunciation built around one continuous vowel sound, usually surrounded by consonants. To find each vowel cluster, we can use a regex of all vowels, including y: syllables = re.findall(r'[aeiouy]+', word). After this line, syllables will be a list of all vowel clusters in the given word. Finally, there must be at least one syllable per word, so even if you input a vowelless word (“cwm”, for example), the function will return one syllable. The function is:
import re

def count_syllables(word):
    """Estimate syllable count in a word using a simple vowel-grouping method."""
    word = word.lower()
    word = re.sub(r'[^a-z]', '', word)          # Remove punctuation and digits
    syllables = re.findall(r'[aeiouy]+', word)  # Find vowel clusters
    return max(1, len(syllables))               # At least one syllable per word
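A quick sanity check (the example words are mine, not from the original post):

print(count_syllables("poetry"))       # 3 -> clusters 'o', 'e', 'y'
print(count_syllables("Jabberwocky"))  # 4 -> clusters 'a', 'e', 'o', 'y'
print(count_syllables("cwm"))          # 1 -> no vowels, falls back to 1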
That function returns the syllable count for any inputted word, so to find the syllable count for a full line of text, return to the previous loop (used for data collection in 1.a-1.c) and iterate over the words list, getting the syllable count of each word. Summing those counts gives the count for the full line: num_syllables = sum(count_syllables(word) for word in words).
e. Data collection summary
The data collection algorithm is compiled into a single function, which splits the inputted poem into its lines, iterates over each line performing all of the previously described operations, appends each data point to a designated list for that metric, and finally builds a dictionary of all data points for that line and appends it to a master data set. While time complexity is effectively irrelevant for the small amounts of input data being used here, the function runs in linear time, which is helpful should it ever be used to analyze large amounts of data. The data collection function in its entirety is:
def analyze_poem(poem):
    """Analyzes the poem line by line."""
    data = []
    # Per-metric lists (one entry per line of the poem)
    word_counts, letter_counts, visual_lengths, syllable_counts = [], [], [], []
    lines = poem.split("\n")
    for line in lines:
        words = line.split()
        num_words = len(words)
        num_letters = sum(len(word) for word in words)
        visual_length = len(line)  # Approximate visual length (monospace)
        num_syllables = sum(count_syllables(word) for word in words)
        word_counts.append(num_words)
        letter_counts.append(num_letters)
        visual_lengths.append(visual_length)
        syllable_counts.append(num_syllables)
        data.append({
            "line": line,
            "words": num_words,
            "letters": num_letters,
            "visual_length": visual_length,
            "syllables": num_syllables
        })
    return data
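A minimal usage sketch, using the first three lines of Thomas’s poem as input (variable names illustrative):

poem = """Do not go gentle into that good night,
Old age should burn and rave at close of day;
Rage, rage against the dying of the light."""

data = analyze_poem(poem)
print(data[0]["words"], data[0]["syllables"])  # -> 8 10 for the first line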
2. Discrete Fourier transform
Preface: This section assumes an understanding of the (discrete) Fourier Transform; for a relatively brief and manageable introduction, try this article by Sho Nakagome.
a. Specific DFT algorithm
To address with some specificity the particular DFT algorithm I’ve used, we need to touch on NumPy’s fast Fourier transform method. Suppose N is the number of discrete values being transformed: if N is a power of 2, NumPy uses the radix-2 Cooley-Tukey algorithm, which recursively splits the input into even and odd indices. If N is not a power of 2, NumPy applies a mixed-radix approach, where the input length is factorized into smaller prime factors and FFTs are computed using efficient base cases.
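To make the even/odd split concrete, here is a toy recursive radix-2 implementation; it illustrates the idea only (it is not NumPy’s actual code) and assumes the input length is a power of two:

import numpy as np

def fft_radix2(x):
    """Toy radix-2 Cooley-Tukey FFT; assumes len(x) is a power of two."""
    x = np.asarray(x, dtype=complex)
    N = len(x)
    if N == 1:
        return x
    even = fft_radix2(x[0::2])  # DFT of even-indexed samples
    odd = fft_radix2(x[1::2])   # DFT of odd-indexed samples
    twiddle = np.exp(-2j * np.pi * np.arange(N // 2) / N)
    return np.concatenate([even + twiddle * odd, even - twiddle * odd])

# Agrees with NumPy's FFT on a random power-of-two-length signal
x = np.random.rand(8)
assert np.allclose(fft_radix2(x), np.fft.fft(x))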
b. Applying the DFT
To apply the DFT to the previously collected data, I’ve created a function fourier_analysis, which takes a single data series (one of the per-line metrics collected above, such as the list of word counts) as an argument. Luckily, since NumPy is so adept at mathematics, the code is simple. First, find N, the number of data points to be transformed; this is simply N = len(data). Next, apply NumPy’s FFT algorithm to the data using np.fft.fft(data), which returns an array of complex coefficients representing the amplitude and phase of the Fourier series. Finally, np.abs(fft_result) extracts the magnitude of each coefficient, representing its strength in the original data. The function returns the Fourier magnitude spectrum as a list of frequency-magnitude pairs.
import numpy as np

def fourier_analysis(data):
    """Performs Fourier Transform and returns frequency data."""
    N = len(data)
    fft_result = np.fft.fft(data)    # Compute Fourier Transform
    frequencies = np.fft.fftfreq(N)  # Get frequency bins
    magnitudes = np.abs(fft_result)  # Get magnitude of FFT coefficients
    return list(zip(frequencies, magnitudes))  # Return (freq, magnitude) pairs
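For example, feeding it the word-count series extracted from analyze_poem’s output (a hypothetical usage sketch, not code from the original post):

data = analyze_poem(poem)
word_counts = [entry["words"] for entry in data]
spectrum = fourier_analysis(word_counts)
for freq, mag in spectrum:
    if freq > 0:  # skip the DC term and negative-frequency bins
        print(f"period = {1/freq:.2f} lines, magnitude = {mag:.2f}")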
The full code can be found here, on GitHub.
3. Case studies
a. Introduction
We’ve made it through all of the code and tongue-twister algorithms; it’s finally time to put the program to the test. For the sake of time, the literary analysis done here will be minimal, putting the stress on the data analysis. Note that while this Fourier transform algorithm returns a frequency spectrum, we want a period spectrum, so the relationship \( T = \frac{1}{f} \) will be used to obtain a period spectrum. For the purpose of comparing different spectrums’ noise levels, we’ll use the metric of signal-to-noise ratio (SNR). The average signal noise is calculated as an arithmetic mean, given by \( P_{noise} = \frac{1}{N-1} \sum_{k=0,\, k \neq k_{peak}}^{N-1} |X_k| \), where \( X_k \) is the coefficient at index \( k \) and the sum excludes \( X_{peak} \), the coefficient of the signal peak. To find the SNR, simply take \( \frac{|X_{peak}|}{P_{noise}} \); a higher SNR means a higher signal strength relative to background noise. SNR is a strong choice for detecting poetic periodicity because it quantifies how much of the signal (i.e., structured rhythmic patterns) stands out against background noise (random variations in word length or syllable count). Unlike variance, which measures overall dispersion, or autocorrelation, which captures repetition at specific lags, SNR directly highlights how dominant a periodic pattern is relative to irregular fluctuations, making it ideal for identifying metrical structures in poetry.
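A minimal sketch of the period conversion and SNR calculation described above (my own helper, not code from the original post; it works on the (frequency, magnitude) pairs returned by fourier_analysis and considers only the positive-frequency half of the spectrum):

import numpy as np

def peak_period_and_snr(spectrum):
    """Find the dominant period (T = 1/f) and its signal-to-noise ratio."""
    positive = [(f, m) for f, m in spectrum if f > 0]  # drop DC and negative bins
    periods = np.array([1.0 / f for f, _ in positive])
    mags = np.array([m for _, m in positive])
    peak = np.argmax(mags)
    noise = np.delete(mags, peak)   # all coefficients except the peak
    p_noise = noise.mean()          # arithmetic mean of the noise
    return periods[peak], mags[peak] / p_noise

# Usage: peak_period, snr = peak_period_and_snr(fourier_analysis(word_counts))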
b. “Do Not Go Gentle into That Good Night” – Dylan Thomas
This work has a definite and visible periodic structure, so it makes great testing data. Unfortunately, the syllable data won’t reveal anything interesting here (Thomas’s poem is written in iambic pentameter); the word count data, on the other hand, has the highest SNR value of any of the four metrics, 6.086.
The spectrum above shows a dominant signal at a period of 4 lines and relatively little noise in the other period ranges. Furthermore, the fact that word count has the highest SNR, ahead of letter count, syllable count, and visual length, yields an interesting observation: the poem follows a rhyme scheme of ABA(blank), meaning the word count of each line repeats perfectly in tandem with the rhyme scheme. The SNRs of the other two relevant spectrums are not far behind the word-count SNR, with letter count at 5.724 and visual length at 5.905. Those two spectrums also have their peaks at a period of 4 lines, indicating that they too match the poem’s rhyme scheme.
c. “Jabberwocky” – Lewis Carroll
Carroll’s writing is also mostly periodic in structure, but has some irregularities; in the word-count period spectrum there is a distinct peak at ~5 lines, but the otherwise low noise (SNR = 3.55) is broken by three distinct sub-peaks at 3.11 lines, 2.54 lines, and 2.15 lines. This secondary peak structure is shown in figure 2, implying that there is a significant secondary repeating pattern in the words Carroll used. Furthermore, given that the peaks grow as they approach a period of 2 lines, one conclusion is that Carroll tends to alternate word counts from line to line.
This alternating pattern is reflected in the period spectrums of visual length and letter count, both of which have secondary peaks at 2.15 lines. However, the syllable spectrum in figure 3 shows a low magnitude at the 2.15-line period, indicating that the word count, letter count, and visual length of each line are correlated, but not the syllable count.
Interestingly, the poem follows an ABAB rhyme scheme, suggesting a connection between the visual length of each line and the rhyming pattern itself. One possible conclusion is that Carroll found it more visually appealing when writing for the rhyming ends of words to line up vertically on the page. This conclusion, that the visual aesthetic of each line altered Carroll’s writing style, can be drawn before ever reading the text.
4. Conclusion
Applying Fourier analysis to poetry reveals that mathematical tools can uncover hidden structures in literary works: patterns that may reflect an author’s stylistic tendencies or even subconscious choices. In both case studies, a quantifiable relationship was found between the structure of the poem and metrics (word count, letter count, and so on) that are often overlooked in literary analysis. While this approach does not replace traditional literary analysis, it provides a new way to explore the formal qualities of writing. The intersection of mathematics, computer science, data analytics, and literature is a promising frontier, and this is just one way that technology can lead to new discoveries, holding potential in broader data science fields like stylometry, sentiment and emotion analysis, and topic modeling.