1. Intro
2. Dhivehi Project
3. Hurdles for Low Resource Domains
4. Dhivehi Dataset
5. Download Dhivehi Corpus
6. Tokenizer Components
7. Normalizer Component
8. Pre-tokenization Component
9. Post-tokenization Component
10. Decoder Component
11. Tokenizer Implementation
12. Tokenizer Training
13. Post-processing Implementation
14. Decoder Implementation
15. Saving for Transformers
16. Tokenizer Test and Usage
17. Download Dhivehi Models
18. First Steps
Description:
Learn how to build an effective WordPiece tokenizer for Dhivehi, a low-resource language with a complex writing system. Explore the challenges of applying NLP to Dhivehi and follow a step-by-step demonstration of building a custom tokenizer. Discover the key components of tokenizer design, including normalization, pre-tokenization, post-tokenization, and decoding. Implement and train the tokenizer, test its functionality, and gain insights into working with low-resource languages in NLP. By the end of this tutorial, you'll have a solid understanding of tokenizer development for unique linguistic contexts and be able to apply these techniques to other low-resource languages.
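
The pipeline the chapters walk through (normalizer, pre-tokenizer, trainer, post-processor, decoder) can be sketched with the Hugging Face `tokenizers` library. This is a minimal illustration, not the course's actual code: the inline corpus is a placeholder standing in for the Dhivehi dataset, and the NFD/StripAccents normalizer is an assumption you would tune for Thaana script.

```python
# Sketch: a WordPiece tokenizer built from the same components the
# course covers. Placeholder corpus; real training uses Dhivehi text.
from tokenizers import Tokenizer, decoders, normalizers
from tokenizers.models import WordPiece
from tokenizers.normalizers import NFD, StripAccents
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import TemplateProcessing
from tokenizers.trainers import WordPieceTrainer

# Model: WordPiece with an unknown-token fallback.
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))

# Normalizer: Unicode decomposition + accent stripping (an assumption;
# a Dhivehi tokenizer would pick rules suited to Thaana).
tokenizer.normalizer = normalizers.Sequence([NFD(), StripAccents()])

# Pre-tokenizer: split on whitespace and punctuation.
tokenizer.pre_tokenizer = Whitespace()

# Train from an iterator of raw text lines (placeholder data).
corpus = ["this is a tiny placeholder corpus", "replace it with real text"]
trainer = WordPieceTrainer(
    vocab_size=1000,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
tokenizer.train_from_iterator(corpus, trainer=trainer)

# Post-processor: wrap sequences BERT-style in [CLS] ... [SEP].
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B [SEP]",
    special_tokens=[
        ("[CLS]", tokenizer.token_to_id("[CLS]")),
        ("[SEP]", tokenizer.token_to_id("[SEP]")),
    ],
)

# Decoder: merge "##" sub-word pieces back into whole words.
tokenizer.decoder = decoders.WordPiece()

encoding = tokenizer.encode("this is a test")
print(encoding.tokens)
```

Saving with `tokenizer.save("tokenizer.json")` produces a file that `transformers` can load via `PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")`, which is the hand-off step the "Saving for Transformers" chapter refers to.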

Building Transformer Tokenizers - Dhivehi NLP #1

James Briggs