Chat Data Format for LLMs - Unsloth Templates
# [NOTE] To train only on completions (ignoring the user's input), read TRL's docs here.
# We use our get_chat_template function to get the correct chat template. We support zephyr,
# chatml, mistral, llama, alpaca, vicuna, vicuna_old and our own optimized unsloth template.
# Normally one has to train <|im_start|> and <|im_end|>. We instead map <|im_end|> to be the
# EOS token and leave <|im_start|> as is, so no additional tokens need to be trained.
# Note ShareGPT uses {"from": "human", "value": "Hi"} and not {"role": "user", "content": "Hi"},
# so we use a mapping to convert between the two.
# For text completions like novel writing, try this notebook.

from unsloth.chat_templates import get_chat_template

# `tokenizer` is assumed to come from an earlier model load, e.g. FastLanguageModel.from_pretrained.
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "chatml", # Supports zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, unsloth
    mapping = {"role" : "from", "content" : "value", "user" : "human", "assistant" : "gpt"}, # ShareGPT style
    map_eos_token = True, # Maps <|im_end|> to </s> instead
)

def formatting_prompts_func(examples):
    # Render each ShareGPT conversation into a single training string via the chat template.
    convos = examples["conversations"]
    texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False) for convo in convos]
    return { "text" : texts, }
pass

from datasets import load_dataset

dataset = load_dataset("philschmid/guanaco-sharegpt-style", split = "train")
dataset = dataset.map(formatting_prompts_func, batched = True,)
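
# Quick sanity check (not in the original snippet): print one formatted example to
# confirm the template was applied. With map_eos_token = True, each turn should end
# in the model's EOS token (e.g. </s>) rather than a separately trained <|im_end|>.
print(dataset[0]["text"])
# Roughly expected shape:
# <|im_start|>user
# ...user message...</s>
# <|im_start|>assistant
# ...assistant reply...</s>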
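
# The [NOTE] at the top refers to training only on completions. One way to do that is
# TRL's DataCollatorForCompletionOnlyLM; this is a sketch, not part of the original
# snippet, and the "<|im_start|>assistant\n" response template assumes the chatml
# template chosen above. The collator masks out everything before the assistant turn
# so the loss is computed only on the model's responses.
from trl import DataCollatorForCompletionOnlyLM

collator = DataCollatorForCompletionOnlyLM(
    response_template = "<|im_start|>assistant\n", # Where assistant output begins in chatml
    tokenizer = tokenizer,
)
# Pass data_collator = collator to SFTTrainer so user turns are ignored in the loss.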
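
# For inference, the same template can be applied with add_generation_prompt = True so
# the prompt ends with an open assistant turn. A minimal sketch, assuming the mapping
# above lets apply_chat_template accept ShareGPT-style keys directly:
messages = [{"from": "human", "value": "Continue the Fibonacci sequence: 1, 1, 2, 3"}]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Appends the assistant header so the model replies
    return_tensors = "pt",
)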