nlp5. Word Tokenization in Python NLTK

Usually we want a text broken into individual words; nltk.tokenize.word_tokenize does exactly that.

As we can see from the DocString, it applies sentence tokenization as well.

Instead of using enumerate, we can also iterate over the indices directly with range(len(...)).


# nlp5.py
from __future__ import print_function, division
from nltk.tokenize import word_tokenize

lines = """This is the first sentence. Dr. Brown gave a speech.
Finally, he praised Python! At 8 o'clock, he went home."""

A = word_tokenize(lines)
print("DocString for %s:\n%s" % ("word_tokenize",
                                 word_tokenize.__doc__.strip()))
for i in range(len(A)):
    print(i, A[i])

# DocString for word_tokenize:
# Return a tokenized copy of *text*,
# using NLTK's recommended word tokenizer
# (currently :class:`.TreebankWordTokenizer`
# along with :class:`.PunktSentenceTokenizer`).
# 0 This
# 1 is
# 2 the
# 3 first
# 4 sentence
# 5 .
# 6 Dr.
# 7 Brown
# 8 gave
# 9 a
# 10 speech
# 11 .
# 12 Finally
# 13 ,
# 14 he
# 15 praised
# 16 Python
# 17 !
# 18 At
# 19 8
# 20 o'clock
# 21 ,
# 22 he
# 23 went
# 24 home
# 25 .
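The index loop above can equally be written with enumerate, which yields index/token pairs directly. A minimal sketch on a plain list standing in for the tokenizer's output:

```python
# enumerate pairs each token with its position,
# so there is no need for range(len(...))
tokens = ["This", "is", "the", "first", "sentence", "."]
for i, tok in enumerate(tokens):
    print(i, tok)
```

Both loops print the same numbered listing; enumerate just saves the explicit indexing.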
