The NLTK module in Python can be used to load a text, or corpus. The included texts are stored in the nltk_data folder. This assumes the data files have already been downloaded to the computer using nltk.download().
Here Shakespeare’s Julius Caesar is read as a raw string. The XML loader could be used instead; it returns a parsed tree whose elements, for example <LINE>, can be traversed directly.
The <LINE> elements are extracted using a regular expression. Only a subset of the lines is printed: those containing the word ‘Pompey’.
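from __future__ import print_function, division
import re
from nltk.corpus import shakespeare
sp = " " * 2  # indent used when printing
jc = shakespeare.raw("j_caesar.xml")  # the whole play as one raw XML string
# Each <LINE> element sits on its own line, so ".+" does not run across lines
jc_lines = re.findall(r"<LINE>.+</LINE>", jc)
for line in jc_lines:
    if "Pompey" in line:
        lin = line[6:-7]  # strip the leading <LINE> and trailing </LINE>
        print(sp + lin)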
# Knew you not Pompey? Many a time and oft
# To see great Pompey pass the streets of Rome:
# That comes in triumph over Pompey's blood? Be gone!
# In Pompey's porch: for now, this fearful night,
# Repair to Pompey's porch, where you shall find us.
# That done, repair to Pompey's theatre.
# Who rated him for speaking well of Pompey:
# That now on Pompey's basis lies along
# Even at the base of Pompey's statua,
# As Pompey was, am I compell'd to set
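The XML route mentioned earlier can be sketched as follows. This is a minimal illustration, not the code used above: the helper name lines_containing and the small SAMPLE fragment are made up for the example, and the fragment is hand-written rather than read from the corpus so that the snippet runs without the NLTK data. With the corpus installed, shakespeare.xml("j_caesar.xml") should return an ElementTree element that can be passed to the same helper.

```python
import xml.etree.ElementTree as ET


def lines_containing(root, word):
    """Return the text of every <LINE> element under root that contains word."""
    matches = []
    for elem in root.iter("LINE"):
        # itertext() also picks up text around nested tags (e.g. stage directions)
        text = "".join(elem.itertext())
        if word in text:
            matches.append(text)
    return matches


# Hand-made fragment for illustration only; the real file nests
# speeches inside acts and scenes.
SAMPLE = """<SPEECH>
<SPEAKER>MARULLUS</SPEAKER>
<LINE>Knew you not Pompey? Many a time and oft</LINE>
<LINE>Have you climb'd up to walls and battlements,</LINE>
<LINE>To see great Pompey pass the streets of Rome:</LINE>
</SPEECH>"""

sample_root = ET.fromstring(SAMPLE)
pompey_lines = lines_containing(sample_root, "Pompey")
for text in pompey_lines:
    print(text)
```

Compared with the regular expression, walking the tree avoids slicing off the tags by hand and does not depend on each <LINE> element sitting on its own line in the file.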