So I've been wanting to get into finetuning; after all, it seems to be behind the huge drop in GPU rental prices.


TL;DR: foundation models are extremely expensive to train, and it's not worth the effort/expense in most cases. Some cases do need a more specific model (IMO, most just need an off-the-shelf model with some prompting), and finetuning is the answer there. And for sufficiently small models, we can do the base work on a consumer GPU!


I was looking for use cases for finetuning, and I think the one that gives me the most return (it's important for me to have some "visual" feedback, that is, to actually see a notable difference in the model outputs) while still fitting on my GPU (a 3050 Ti with only 4GB of VRAM, very GPU poor) is finetuning the Llama 3.2 1B base model to turn it into an Instruct model. This is great because the difference is 100% clear, and you know whether it worked or not.

For example, if you load this model from Meta's repo on the Hugging Face Hub using HF transformers and run some inference:

from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

model_name = "meta-llama/Llama-3.2-1B"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# don't shadow the imported `pipeline` function with the pipeline object
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, device=0)
out = pipe("This is a piece of text where I want ")
print(out[0]["generated_text"])

We get 'This is a piece of text where I want 3 different things to be printed in a single', since these models are initially trained only to predict the next tokens, not to answer questions.

And this was pretty simple to do, even with my GPU-poor status, but as soon as I tried to do some training on it, I felt the pain:

RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

And the thing is, as we can see here, the memory needed just to load an LLM is not that big, and I can fit the 1B and even the 3B without trouble.



But as shown here, training the same model takes around 3-4x the memory needed for inference! So it becomes impossible for me to do it even on the 1B.
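
To get a feel for the numbers, here is a rough back-of-the-envelope estimate (my own simplification, ignoring activations and framework overhead): loading the weights in fp16 costs about 2 bytes per parameter, while training also has to keep gradients and optimizer states around.

def estimate_gb(n_params, bytes_per_param):
    # convert "bytes per parameter" into GB for a given model size
    return n_params * bytes_per_param / 1024**3

n = 1.2e9  # roughly the parameter count of Llama 3.2 1B

# inference in fp16: just the weights, ~2 bytes per parameter
print(f"load fp16:         ~{estimate_gb(n, 2):.1f} GB")

# full finetuning: fp16 weights + fp16 grads + 8-bit Adam states (~6 bytes/param)
print(f"train, 8-bit Adam: ~{estimate_gb(n, 6):.1f} GB")

# same but with fp32 Adam states (~12 bytes/param), before counting activations
print(f"train, fp32 Adam:  ~{estimate_gb(n, 12):.1f} GB")

Even the optimistic case already blows past my 4GB card, which matches the OOM error above.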

Thankfully, thanks to a great ML community, there is a solution for this problem, and it's open source (at least part of it). Unsloth is a framework that lets us train LLMs with up to 70% less memory and up to 30x faster.

They go into more detail here (and you can also check the code), but the main thing about the open-source framework is that they apply a bunch of optimizations on top of plain PyTorch Autograd (the mechanism that computes the gradients of the math functions, which tell us in which direction to change the model params; the core of backpropagation), such as:

  1. Manual Derivation: The Unsloth team manually derived all compute-heavy mathematical steps involved in the gradient calculations (there is a toy sketch of this idea right after this list).
  2. Handwritten GPU Kernels: They handwrote GPU kernels specifically for these calculations, resulting in highly optimized code for training.
  3. Memory Efficiency: Unsloth's implementation uses significantly less memory compared to standard approaches like Flash Attention 2.
  4. GPU Optimization: The system is optimized for a wide range of GPUs, including NVIDIA, AMD, and Intel models.
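
To make the first point a bit more concrete, here is a toy illustration of the idea (this is not Unsloth's code, just a minimal example of a hand-derived backward pass replacing what Autograd would otherwise build op by op):

import torch

# SiLU(x) = x * sigmoid(x); its gradient, derived by hand, is
# sigmoid(x) * (1 + x * (1 - sigmoid(x)))
class ManualSiLU(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        s = torch.sigmoid(x)
        ctx.save_for_backward(s, x)
        return x * s

    @staticmethod
    def backward(ctx, grad_output):
        s, x = ctx.saved_tensors
        return grad_output * s * (1 + x * (1 - s))

x = torch.randn(4, requires_grad=True)
ManualSiLU.apply(x).sum().backward()
print(x.grad)  # matches the gradient of torch.nn.functional.silu

Unsloth does this kind of thing (plus the custom kernels) for the heavy parts of the transformer, which is where the speed and memory savings come from.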

Kindly, they have provided Colab notebooks showing how to finetune using their framework, and it's pretty straightforward. There is no published benchmark for the 1B, but I'd guess it's in line with the 3B numbers, that is, 2.4x faster and 58% less memory.

I did a small modification and cleanup on their notebook and swapped the base Alpaca dataset for iamtarun/python_code_instructions_18k_alpaca, which is meant for instruction tuning on Python coding. I saved the LoRA adapter here and the notebook is available here. As Colab is super great, you can freely use a T4 GPU, which is plenty for this finetuning.
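
For reference, the core of the notebook looks roughly like this (a condensed sketch following the Unsloth Colab notebooks; exact SFTTrainer argument names vary between TRL versions, and the dataset formatting step is shown a bit further below):

from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

max_seq_length = 2048

# Load the base model in 4-bit to stay inside a small VRAM budget
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-1B",  # Unsloth's mirror of the base model
    max_seq_length=max_seq_length,
    load_in_4bit=True,
)

# Attach LoRA adapters so only a small fraction of the weights gets trained
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
)

dataset = load_dataset("iamtarun/python_code_instructions_18k_alpaca", split="train")
# ...map each row into a single "text" field here (see the prompt template below)...

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        max_steps=60,
        learning_rate=2e-4,
        fp16=True,  # the T4 doesn't support bf16
        logging_steps=10,
        output_dir="outputs",
    ),
)
trainer.train()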

I messed up on the first try because I malformed the prompt, which basically destroys the whole training. The final loss was about 3x the one I got with the correct prompt formatting, and the model just couldn't generate Python code at all! So yes, it matters a lot.
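
For context, the formatting step is just mapping every dataset row into one Alpaca-style string (the field names below are the ones from the iamtarun dataset; appending the EOS token also matters, otherwise the model never learns when to stop generating):

# Alpaca-style template, the same one that shows up in the outputs below
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token

def formatting_prompts_func(examples):
    texts = [
        alpaca_prompt.format(instruction, inp, output) + EOS_TOKEN
        for instruction, inp, output in zip(
            examples["instruction"], examples["input"], examples["output"]
        )
    ]
    return {"text": texts}

dataset = dataset.map(formatting_prompts_func, batched=True)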

As an example, I asked the model to generate code to remove some letters from a word, and it works just great:

['<|begin_of_text|>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nCreate python function to remove first and last letter from word\n\n### Input:\ndiego\n\n### Response:\ndef remove_first_and_last_letter(word):\n    return word[1:-1]\n\nprint(remove_first_and_last_letter("diego"))\n# 2\nprint(remove_first_and_last_letter("hello"))\n# 2\nprint(remove_first_and_last_letter("world"))\n# 2\n<|end_of_text|>']

To compare, if we run the same prompt through the base model:

### Instruction:
Create python function to remove first and last letter from word
### Input:
diego
### Response: 
dgo
### Explanation:
A simple function to remove the first and last letter from a string
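
If you want to reproduce the comparison without rerunning the training, one way is to attach the saved LoRA adapter to the base model with plain transformers + peft (a sketch: the adapter path below is a placeholder for the one linked above, and it assumes the adapter was saved in the standard LoRA format):

from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

base_name = "meta-llama/Llama-3.2-1B"
base = AutoModelForCausalLM.from_pretrained(base_name, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(base_name)

# "path/to/lora_adapter" is a placeholder for the adapter repo linked above
model = PeftModel.from_pretrained(base, "path/to/lora_adapter")

# reuse the alpaca_prompt template from the formatting step, with an empty response
prompt = alpaca_prompt.format(
    "Create python function to remove first and last letter from word", "diego", ""
)
inputs = tokenizer(prompt, return_tensors="pt").to(base.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0]))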

Pretty cool! And since it's super fast to test, it gives me lots of ideas for what to learn/test next and write about here.