Quantcast
Viewing latest article 21
Browse Latest Browse All 25

Between words and characters: A Brief History of Open-Vocabulary Modeling and Tokenization in NLP

In this survey, we connect several lines of work from the pre-neural and neural era, by showing how hybrid approaches of words and characters as well as subword-based approaches based on learned segmentation have been proposed and evaluated. We conclude that there is and likely will never be a silver bullet singular solution for all applications and that thinking seriously about tokenization remains important for many applications

Viewing latest article 21
Browse Latest Browse All 25

Trending Articles