Tokenizer usage examples

The Tokenizer class maps arbitrary inputs to integer classes.

[1]:
from nfp.preprocessing import Tokenizer
[2]:
tokenizer = Tokenizer()
tokenizer.train = True

Classes 0 and 1 are reserved for the <MASK> and missing labels, respectively, so newly seen items are numbered starting from 2.

[3]:
[tokenizer(item) for item in ['A', 'B', 'C', 'A']]
[3]:
[2, 3, 4, 2]

When train is set to False, unknown items are assigned the missing label (class 1).

[4]:
tokenizer.train = False
[tokenizer(item) for item in ['A', 'D']]
[4]:
[2, 1]
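
To make the two modes concrete, here is a minimal sketch of a tokenizer that reproduces the outputs shown above. It is not nfp's implementation: the class name SketchTokenizer and its internals are hypothetical, and it assumes items are hashable.

```python
class SketchTokenizer:
    """Hypothetical stand-in that mimics the behavior shown above; not nfp's code."""

    MASK = 0     # reserved for the <MASK> label
    MISSING = 1  # reserved for unknown / missing items

    def __init__(self):
        self.train = True
        self._classes = {}    # item -> integer class (assumes hashable items)
        self.num_classes = 1  # highest class index handed out so far

    def __call__(self, item):
        if item in self._classes:
            return self._classes[item]
        if not self.train:
            # Frozen vocabulary: fall back to the missing label.
            return self.MISSING
        # Training mode: assign the next unused integer class.
        self.num_classes += 1
        self._classes[item] = self.num_classes
        return self.num_classes


sketch = SketchTokenizer()
trained = [sketch(item) for item in ['A', 'B', 'C', 'A']]  # [2, 3, 4, 2]

sketch.train = False
frozen = [sketch(item) for item in ['A', 'D']]  # [2, 1]
```

Starting the counter at 1 keeps both reserved classes out of reach of real items, which is why the first item seen receives class 2.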

The total number of seen classes is available from the num_classes property, which is useful for initializing embedding layer weights.

[5]:
tokenizer.num_classes
[5]:
4
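
As a rough illustration of how this value might be used, the sketch below builds a plain-Python embedding table. In the example above the largest assigned class equals num_classes (4), so the table needs num_classes + 1 rows for every index from 0 to 4 to be valid. The embedding width, the initialization range, and the variable names are assumptions made for illustration; a real model would normally use its framework's embedding layer instead.

```python
import random

num_classes = 4    # value returned by tokenizer.num_classes above
embedding_dim = 8  # hypothetical embedding width

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# One row per class index 0..num_classes, hence num_classes + 1 rows.
embedding_table = [
    [rng.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
    for _ in range(num_classes + 1)
]

# Classes for 'A', 'B', 'C', 'A' plus one unknown item (missing label 1).
tokens = [2, 3, 4, 2, 1]
vectors = [embedding_table[token] for token in tokens]
```

Sizing the table with only num_classes rows would leave the highest class index out of range in this example.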