1. Introduction to dataset building for fine-tuning
2. The Reddit dataset options: Torrent, Archive.org, BigQuery
3. Exporting the BigQuery Reddit data (and some other data)
4. Decompressing all of the gzip archives
5. Re-combining the archives for the target subreddits (steps 4-5 are sketched in code after this list)
6. How to structure the data
7. Building training samples and saving them to a database
8. Creating customized training JSON files
9. QLoRA training and results
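A minimal sketch of steps 4-5, assuming the BigQuery export produced gzipped newline-delimited JSON shards in a `bq_exports/` directory and that each row carries a `subreddit` field (as in the public Reddit comment schema). The paths, filenames, and subreddit names below are placeholders, not the ones used in the video.

```python
# Minimal sketch: decompress gzipped newline-delimited JSON exports and
# combine the rows for a few target subreddits into one .jsonl per subreddit.
# Paths and the "subreddit" field name are assumptions, not from the video.
import gzip
import json
from pathlib import Path

EXPORT_DIR = Path("bq_exports")    # directory of *.json.gz shards (assumed)
OUTPUT_DIR = Path("combined")      # one combined .jsonl file per subreddit
TARGET_SUBREDDITS = {"askscience", "programming"}  # example targets

OUTPUT_DIR.mkdir(exist_ok=True)
writers = {}  # subreddit -> open file handle

try:
    for shard in sorted(EXPORT_DIR.glob("*.json.gz")):
        with gzip.open(shard, "rt", encoding="utf-8") as fh:
            for line in fh:
                row = json.loads(line)
                sub = row.get("subreddit", "").lower()
                if sub not in TARGET_SUBREDDITS:
                    continue
                if sub not in writers:
                    writers[sub] = open(OUTPUT_DIR / f"{sub}.jsonl", "w", encoding="utf-8")
                writers[sub].write(json.dumps(row) + "\n")
finally:
    for fh in writers.values():
        fh.close()
```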
Description:
Learn how to build a QLoRA fine-tuning dataset for language models in this comprehensive video tutorial. Explore various Reddit dataset options, including torrent files, Archive.org, and BigQuery. Follow step-by-step instructions on exporting BigQuery Reddit data, decompressing gzip archives, and recombining archives for target subreddits. Discover the proper data structure, build training samples, and save them to a database. Create customized training JSON files and dive into QLoRA training and results. Gain valuable insights into dataset building for fine-tuning language models through practical demonstrations and explanations.
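To make the structuring, sample-building, and JSON-export steps concrete, here is a minimal sketch under assumed conventions: each comment is paired with its parent via the `id`/`parent_id` fields of the public Reddit comment schema, the pairs are stored in SQLite, and an instruction-style JSONL file is dumped at the end. The pairing rules, score threshold, and output format are illustrative assumptions, not necessarily the exact method shown in the video.

```python
# Minimal sketch: pair each comment with its parent to form (prompt, reply)
# training samples, keep them in SQLite, then dump a JSONL training file.
# Field names (id, parent_id, body, score) follow the public Reddit comment
# schema; everything else here is an assumption for illustration.
import json
import sqlite3

conn = sqlite3.connect("training_samples.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS samples "
    "(parent_id TEXT PRIMARY KEY, parent TEXT, reply TEXT, score INTEGER)"
)

SOURCE = "combined/askscience.jsonl"  # output of the previous sketch

# First pass: index comment bodies by id so replies can find their parents
# (in-memory lookup for brevity).
comments = {}
with open(SOURCE, encoding="utf-8") as fh:
    for line in fh:
        row = json.loads(line)
        comments[row["id"]] = row["body"]

# Second pass: store parent/reply pairs. With parent_id as the primary key,
# the last reply seen per parent wins; a fuller pipeline might keep the
# highest-scored reply instead.
with open(SOURCE, encoding="utf-8") as fh:
    for line in fh:
        row = json.loads(line)
        parent_key = row.get("parent_id", "").split("_")[-1]  # "t1_abc" -> "abc"
        parent_body = comments.get(parent_key)
        if not parent_body or row["body"] in ("[deleted]", "[removed]"):
            continue
        conn.execute(
            "INSERT OR REPLACE INTO samples VALUES (?, ?, ?, ?)",
            (row["parent_id"], parent_body, row["body"], row.get("score", 0)),
        )
conn.commit()

# Dump an instruction-style JSONL training file (format and score filter
# are example choices).
with open("train.jsonl", "w", encoding="utf-8") as out:
    for parent, reply in conn.execute("SELECT parent, reply FROM samples WHERE score >= 2"):
        out.write(json.dumps({"input": parent, "output": reply}) + "\n")
```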
Building an LLM Fine-Tuning Dataset - From Reddit Comments to QLoRA Training
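For the final outline step, the sketch below shows one common way to run QLoRA training with transformers, peft, and bitsandbytes on the JSONL file produced above. The base model, LoRA hyperparameters, prompt template, and training arguments are placeholder assumptions for illustration; the video's actual settings may differ.

```python
# Minimal QLoRA training sketch: load the base model in 4-bit (the "Q" in
# QLoRA), attach LoRA adapters, and fine-tune on the JSONL samples.
# Model name, template, and hyperparameters are assumptions, not the video's.
import json
import torch
from datasets import Dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import (AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "meta-llama/Llama-2-7b-hf"  # example base model

bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb,
                                             device_map="auto")
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                                         target_modules=["q_proj", "v_proj"],
                                         task_type="CAUSAL_LM"))

# Turn the JSONL samples into prompt/response text and tokenize.
with open("train.jsonl", encoding="utf-8") as fh:
    records = [json.loads(line) for line in fh]
texts = [f"### Input:\n{r['input']}\n### Response:\n{r['output']}" for r in records]
dataset = Dataset.from_dict({"text": texts}).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="qlora-out", per_device_train_batch_size=2,
                           gradient_accumulation_steps=8, num_train_epochs=1,
                           learning_rate=2e-4, logging_steps=10),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
model.save_pretrained("qlora-adapter")  # saves only the LoRA adapter weights
```

Because the model is a PEFT-wrapped model, `save_pretrained` stores just the small adapter, which can later be loaded on top of the original base model for inference.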