Text Chunker

This script works its way through a long text file and “chunks” the text into smaller files, each with 100 sentences. You can easily change the length of the chunked files.

Usage

This is useful for creating smaller text files for summarisation, or for splitting a long book into sections and then using a text-to-speech script HERE to create bite-sized audio files.

Prerequisites

You will need to install the “nltk” library first, for this to work:

pip install nltk
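The sentence splitting relies on NLTK’s “punkt” tokenizer data, so if you have not used NLTK before you will also need a one-off download (this is standard NLTK setup, not something specific to this script):

python -c "import nltk; nltk.download('punkt')"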

How to use it

  1. Save the script below in the folder of your choice.
  2. Rename the text file you want to chunk to “test.txt” and save it in the same folder as the script.
  3. Run the script.
  4. The chunked files are saved in the “chunks” folder.
  5. You can change the length of each chunk by adjusting the “chunk_size” value, currently set to 100.
import nltk
import re
import os

# Define the path to the input file
input_path = 'test.txt'

# Read in the input file
with open(input_path, 'r', encoding='utf-8', errors='ignore') as f:
    text = f.read()
    text = text.replace('\uf0b7', '#')  # replace a problematic private-use character with '#'

# Replace line breaks and page breaks with spaces so that words at line ends do not run together
text = re.sub(r'[\n\f]', ' ', text)

# Use NLTK to split the text into individual sentences
sentences = nltk.sent_tokenize(text)

# Group the sentences into chunks of chunk_size sentences each
chunk_size = 100  # number of sentences per output file; change this to alter the chunk length
chunks = [sentences[i:i + chunk_size] for i in range(0, len(sentences), chunk_size)]

# Create a directory to store the output files
if not os.path.exists('chunks'):
    os.mkdir('chunks')

# Loop through the chunks and save each one as a separate text file
for i, chunk in enumerate(chunks):
    chunk_text = ' '.join(chunk)
    chunk_num = i + 1
    output_path = f'chunks/chunk {chunk_num}.txt'
    with open(output_path, 'w', encoding='utf-8') as f:
        f.write(chunk_text)
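
If you would rather set the chunk size on the command line instead of editing the script, the sketch below does the same job using Python’s argparse module. The script name “chunker.py” and the “--input” and “--chunk-size” options are illustrative choices, not part of the original script.

import argparse
import os

import nltk

# Command-line options: input file and number of sentences per chunk
parser = argparse.ArgumentParser(description='Split a text file into chunks of N sentences.')
parser.add_argument('--input', default='test.txt', help='path to the input text file')
parser.add_argument('--chunk-size', type=int, default=100, help='sentences per output file')
args = parser.parse_args()

# Read the text and replace line/page breaks with spaces
with open(args.input, 'r', encoding='utf-8', errors='ignore') as f:
    text = f.read().replace('\n', ' ').replace('\f', ' ')

sentences = nltk.sent_tokenize(text)
os.makedirs('chunks', exist_ok=True)

# Write each group of chunk-size sentences to its own numbered file
for i in range(0, len(sentences), args.chunk_size):
    chunk_num = i // args.chunk_size + 1
    with open(f'chunks/chunk {chunk_num}.txt', 'w', encoding='utf-8') as out:
        out.write(' '.join(sentences[i:i + args.chunk_size]))

For example, running python chunker.py --chunk-size 50 produces 50-sentence files instead of 100-sentence ones.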