31. Defining a custom learning schedule with annealing (first sketch below)
32. How to train on completions only, like OpenAI's default (second sketch below)
33. Running training on Llama 3.2 1B
34. Performance evaluation after fine-tuning Llama 3.2
35. Using augmented synthetic data to improve maths performance (Advanced / Speculative!)
36. Evaluating the baseline maths performance of Llama 3.2 1B
37. Fine-tuning on a training split of the lighteval/MATH dataset
38. Training on synthetic data from Llama 3.1 8B instead of the training split (third sketch below)
39. Comparing results of training on the training split vs. on synthetic Llama 3.1 8B answers
40. Training on an augmented synthetic dataset generated with Llama 3.1 8B and ground-truth answers
41. Comparing all results: base vs. fine-tuned on the raw dataset vs. 8B synthetic vs. 8B synthetic with augmentation
42. How to use augmented data if you have access to user conversations or feedback
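A note on item 31: "annealing" here means decaying the learning rate after a warmup phase. Below is a minimal sketch of one common variant, linear warmup followed by cosine annealing, built with PyTorch's LambdaLR; the toy model, peak learning rate, and step counts are illustrative assumptions, not the tutorial's actual settings.

```python
# Minimal sketch: linear warmup + cosine annealing via LambdaLR.
# The model, peak LR, and step counts below are illustrative only.
import math
import torch

model = torch.nn.Linear(10, 2)  # stand-in for the LLM being fine-tuned
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)

warmup_steps, total_steps = 100, 1000

def lr_factor(step: int) -> float:
    if step < warmup_steps:
        # Linear warmup: scale the peak LR from 0 up to 1.
        return step / max(1, warmup_steps)
    # Cosine annealing: decay the factor from 1 back down to 0.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_factor)

for step in range(total_steps):
    # forward pass, loss.backward(), etc. would go here
    optimizer.step()
    scheduler.step()  # advance the schedule once per optimizer step
```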
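For item 32, completion-only training, the standard trick with Hugging Face models is to set the label of every prompt token to -100 (PyTorch's cross-entropy ignore index) so the loss is computed only over the completion. A minimal sketch, with the tokenizer name chosen purely for illustration:

```python
# Minimal sketch of completion-only loss masking.
# The tokenizer name is illustrative; gated models require authentication.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")

def build_example(prompt: str, completion: str) -> dict:
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    completion_ids = tokenizer(completion, add_special_tokens=False)["input_ids"]
    # -100 on prompt positions excludes them from the cross-entropy loss,
    # so gradients come only from the completion tokens.
    return {
        "input_ids": prompt_ids + completion_ids,
        "labels": [-100] * len(prompt_ids) + completion_ids,
    }
```

TRL's DataCollatorForCompletionOnlyLM applies the same masking automatically if you train with its SFTTrainer.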
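For the synthetic-data chapters (items 38–41), the core step is sampling answers to the training questions from the larger 8B model. Here is a minimal sketch using the transformers text-generation pipeline, assuming a recent version that accepts chat-style message lists; the system prompt and sampling settings are illustrative assumptions:

```python
# Minimal sketch of generating synthetic answers with Llama 3.1 8B.
# The model name, system prompt, and sampling settings are illustrative.
from transformers import pipeline

generator = pipeline("text-generation", model="meta-llama/Llama-3.1-8B-Instruct")

def synthesize_answer(question: str) -> str:
    messages = [
        {"role": "system", "content": "Solve the problem step by step."},
        {"role": "user", "content": question},
    ]
    out = generator(messages, max_new_tokens=512, do_sample=True, temperature=0.7)
    # The pipeline returns the full conversation; the last message is the answer.
    return out[0]["generated_text"][-1]["content"]
```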
Description:
Dive into an extensive tutorial on synthetic data generation and fine-tuning techniques for large language models such as OpenAI's GPT-4o and Llama 3. Learn how to create synthetic questions and answers, implement chain-of-thought reasoning, and augment data from various sources, including documents and structured data. Explore GPU setup, data extraction from PDFs, and the process of fine-tuning both OpenAI and open-source models. Master advanced concepts such as LoRA adapters, custom learning schedules, and performance evaluation methods. Discover strategies to improve model performance in specific domains such as mathematics using augmented synthetic datasets. Gain practical insights into leveraging user conversations and feedback to enhance model capabilities.
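As one concrete example of the adapter-based fine-tuning mentioned above, here is a minimal sketch of attaching a LoRA adapter with the peft library; the rank, alpha, and target module names are illustrative assumptions, not the tutorial's settings.

```python
# Minimal sketch of wrapping a causal LM with a LoRA adapter via peft.
# r, lora_alpha, and target_modules are illustrative choices.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")
config = LoraConfig(
    r=16,                                 # adapter rank
    lora_alpha=32,                        # scaling factor
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # only a small fraction is trainable
```

Because only the adapter weights receive gradients, this style of fine-tuning typically fits on a single GPU for models of this size.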
Synthetic Data Generation and Fine-tuning for OpenAI GPT-4 or Llama 3