nlp11. RegexpTokenizer in Python NLTK

We can use RegexpTokenizer to write our own tokenizers.

The sample string here alternates words and numbers with no spaces between them. The regular expression splits it into runs of letters and runs of digits. A period (.) counts as part of a number, so 2.2 stays together as one token.

Only characters matched by the pattern end up in a token: letters, apostrophes, digits, and periods. Anything else, such as ? or !, is dropped from the output.
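
To see which characters survive, here is a small sketch that uses only the standard library. It assumes that RegexpTokenizer in its default mode returns the non-overlapping matches of the pattern, much like re.findall. The function name sketch_tokenize and the test sentence are made up for illustration; the pattern is the one used in nlp11.py, written with the range A-Z.

```python
import re

# Pattern from nlp11.py: a run of letters/apostrophes, or a run of digits/periods.
PATTERN = r"[a-zA-Z']+|[0-9.]+"


def sketch_tokenize(text):
    """Hypothetical helper: return every non-overlapping match of PATTERN."""
    return re.findall(PATTERN, text)


# The "..." is kept because a period is in the "number" class;
# the "?", "!" and the spaces match nothing and are skipped.
print(sketch_tokenize("Wait...what?! It's 3.14"))
# ['Wait', '...', 'what', "It's", '3.14']
```

Note that a run of periods with no digits at all still comes out as a token, which is a side effect of putting the period inside the digit class.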


# nlp11.py
from __future__ import print_function, division
from nltk.tokenize import RegexpTokenizer
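A = "I'll3finish45my987project2.2today!3a"
# A token is a run of letters/apostrophes or a run of digits/periods.
# Note the range is A-Z: A-z would also match [ \ ] ^ _ and the backtick.
tok = RegexpTokenizer(r"[a-zA-Z']+|[0-9.]+")
B = tok.tokenize(A)
# Print one token per line, indented with a tab.
for b in B:
    print('\t' + b)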
# I'll
# 3
# finish
# 45
# my
# 987
# project
# 2.2
# today
# 3
# a
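
Since the pattern treats any period as part of a number, a stray period next to a digit gets pulled into the token. If that is not wanted, one possible variant (not from nlp11.py, just a sketch) keeps a period only when it sits between digits. It is shown with re.findall so it runs without NLTK, on the assumption that the same pattern string could be passed to RegexpTokenizer; the names STRICT_PATTERN and strict_tokenize are hypothetical.

```python
import re

# Hypothetical stricter variant: digits, optionally followed by ".digits".
# (?:...) is a non-capturing group, so findall still returns whole matches.
STRICT_PATTERN = r"[a-zA-Z']+|[0-9]+(?:\.[0-9]+)?"


def strict_tokenize(text):
    """Return words and numbers; a period is kept only inside a number."""
    return re.findall(STRICT_PATTERN, text)


# Same tokens as the listing above for the original sample string.
print(strict_tokenize("I'll3finish45my987project2.2today!3a"))
# ["I'll", '3', 'finish', '45', 'my', '987', 'project', '2.2', 'today', '3', 'a']

# The sentence-ending period is dropped instead of becoming part of ".3".
print(strict_tokenize("done.3"))
# ['done', '3']
```

The trade-off is that numbers written like ".5" lose their leading period with this variant, so which pattern fits depends on the text being tokenized.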
