Text Chunker

This script works its way through a long text file and “chunks” the text into smaller files, each with 100 sentences. You can easily change the length of the chunked files.

Usage

This is useful for creating smaller text files for summarisation, or for splitting a long book into sections and then using a text-to-speech script HERE to create bite-sized audio files.

Prerequisites

You will need to install the “nltk” library first, for this to work:

pip install nltk
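The sentence splitting relies on NLTK’s “punkt” tokenizer data, so if you have not used NLTK before you will also need a one-off download (this is standard NLTK setup, not something specific to this script):

python -c "import nltk; nltk.download('punkt')"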

How to use it

  1. Save the script below in the folder of your choice.
  2. Rename the text file you want to chunk to “test.txt” and save it in the same folder as the script.
  3. Run the script.
  4. The chunked files are saved in the “chunks” folder.
  5. You can change the length of each chunk by adjusting the “chunk_size” value, currently set to 100.
import nltk
import re
import os

# Define the path to the input file
input_path = 'test.txt'

# Read in the input file
with open(input_path, 'r', encoding='utf-8', errors='ignore') as f:
    text = f.read()
    text = text.replace('\uf0b7', '#')  # replace a problematic private-use character with '#'

# Replace line breaks and page breaks with spaces so that words at line ends do not run together
text = re.sub(r'[\n\f]', ' ', text)

# Use NLTK to split the text into individual sentences
sentences = nltk.sent_tokenize(text)

# Group the sentences into chunks of chunk_size sentences each
chunk_size = 100  # number of sentences per output file; change this to alter the chunk length
chunks = [sentences[i:i + chunk_size] for i in range(0, len(sentences), chunk_size)]

# Create a directory to store the output files
if not os.path.exists('chunks'):
    os.mkdir('chunks')

# Loop through the chunks and save each one as a separate text file
for i, chunk in enumerate(chunks):
    chunk_text = ' '.join(chunk)
    chunk_num = i + 1
    output_path = f'chunks/chunk {chunk_num}.txt'
    with open(output_path, 'w', encoding='utf-8') as f:
        f.write(chunk_text)
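
If you would rather set the chunk size on the command line instead of editing the script, the sketch below does the same job using Python’s argparse module. The script name “chunker.py” and the “--input” and “--chunk-size” options are illustrative choices, not part of the original script.

import argparse
import os

import nltk

# Command-line options: input file and number of sentences per chunk
parser = argparse.ArgumentParser(description='Split a text file into chunks of N sentences.')
parser.add_argument('--input', default='test.txt', help='path to the input text file')
parser.add_argument('--chunk-size', type=int, default=100, help='sentences per output file')
args = parser.parse_args()

# Read the text and replace line/page breaks with spaces
with open(args.input, 'r', encoding='utf-8', errors='ignore') as f:
    text = f.read().replace('\n', ' ').replace('\f', ' ')

sentences = nltk.sent_tokenize(text)
os.makedirs('chunks', exist_ok=True)

# Write each group of chunk-size sentences to its own numbered file
for i in range(0, len(sentences), args.chunk_size):
    chunk_num = i // args.chunk_size + 1
    with open(f'chunks/chunk {chunk_num}.txt', 'w', encoding='utf-8') as out:
        out.write(' '.join(sentences[i:i + args.chunk_size]))

For example, running python chunker.py --chunk-size 50 produces 50-sentence files instead of 100-sentence ones.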