1. Intro
2. Dhivehi Project
3. Hurdles for Low Resource Domains
4. Dhivehi Dataset
5. Download Dhivehi Corpus
6. Tokenizer Components
7. Normalizer Component
8. Pre-tokenization Component
9. Post-tokenization Component
10. Decoder Component
11. Tokenizer Implementation
12. Tokenizer Training
13. Post-processing Implementation
14. Decoder Implementation
15. Saving for Transformers
16. Tokenizer Test and Usage
17. Download Dhivehi Models
18. First Steps
Description:
Learn how to build an effective WordPiece tokenizer for Dhivehi, a low-resource language with a complex writing system. Explore the challenges of applying NLP to Dhivehi and follow a step-by-step demonstration of building a custom tokenizer. Discover the key components of tokenizer design, including normalization, pre-tokenization, post-tokenization, and decoding. Implement and train the tokenizer, test its functionality, and gain insights into working with low-resource languages in NLP. By the end of this tutorial, you'll have a solid understanding of tokenizer development for unique linguistic contexts and be able to apply these techniques to other low-resource languages.
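
The pipeline the chapters walk through (normalizer, pre-tokenizer, trainer, post-processor, decoder) can be sketched with the Hugging Face `tokenizers` library. This is a minimal illustration, not the course's actual code: the inline corpus is a placeholder standing in for the Dhivehi dataset, and the NFD/StripAccents normalizer is an assumption you would tune for Thaana script.

```python
# Sketch: a WordPiece tokenizer built from the same components the
# course covers. Placeholder corpus; real training uses Dhivehi text.
from tokenizers import Tokenizer, decoders, normalizers
from tokenizers.models import WordPiece
from tokenizers.normalizers import NFD, StripAccents
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import TemplateProcessing
from tokenizers.trainers import WordPieceTrainer

# Model: WordPiece with an unknown-token fallback.
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))

# Normalizer: Unicode decomposition + accent stripping (an assumption;
# a Dhivehi tokenizer would pick rules suited to Thaana).
tokenizer.normalizer = normalizers.Sequence([NFD(), StripAccents()])

# Pre-tokenizer: split on whitespace and punctuation.
tokenizer.pre_tokenizer = Whitespace()

# Train from an iterator of raw text lines (placeholder data).
corpus = ["this is a tiny placeholder corpus", "replace it with real text"]
trainer = WordPieceTrainer(
    vocab_size=1000,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
tokenizer.train_from_iterator(corpus, trainer=trainer)

# Post-processor: wrap sequences BERT-style in [CLS] ... [SEP].
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B [SEP]",
    special_tokens=[
        ("[CLS]", tokenizer.token_to_id("[CLS]")),
        ("[SEP]", tokenizer.token_to_id("[SEP]")),
    ],
)

# Decoder: merge "##" sub-word pieces back into whole words.
tokenizer.decoder = decoders.WordPiece()

encoding = tokenizer.encode("this is a test")
print(encoding.tokens)
```

Saving with `tokenizer.save("tokenizer.json")` produces a file that `transformers` can load via `PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")`, which is the hand-off step the "Saving for Transformers" chapter refers to.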

Building Transformer Tokenizers - Dhivehi NLP #1

James Briggs