Fine-Tuning vLLMs for Document Understanding
Learn how you can fine-tune visual language models for specific tasks

In this article, I discuss how you can fine-tune a visual language model (VLM) to extract handwritten text from images.
Overview
The main goal of this article is to fine-tune a VLM on a dataset, an important machine-learning technique in today’s world, where language models are revolutionizing the way data scientists and ML engineers work. I will be discussing the following topics:
- Motivation and Goal: Why use VLMs for text extraction
- Advantages of VLMs
- The dataset
- Annotation and fine-tuning
- SFT technical details
- Results and plots
Note: This article is written as part of the work done at Findable. We do not profit financially from this work. It is done to highlight the technical capabilities of modern vision-language models and to digitize and share a valuable handwritten phenology dataset, which may have a significant impact on climate research. Furthermore, the topic of this article was covered in a presentation during the Data & Draft event by Netlight.
You can view all the code used for this article in our GitHub repository, and all data is available on HuggingFace. If you’re specifically interested in the extracted phenology data from Norway, including geographical coordinates corresponding to the data, the information is directly available in this Excel sheet.
Motivation and Goal
The goal of this article is to show you how you can fine-tune a VLM such as Qwen for optimized performance on a particular task. The task we are working on here is extracting handwritten text from a series of images. The work in this article is based on a Norwegian phenology dataset, which you can read more about in the README in this GitHub repository. The main point is that the information contained in these images is highly valuable and can, for example, be used to conduct climate research. There is also genuine scientific interest in this topic; see, for example, this article on analysing long-term changes in when plants flower, or the Eastern Pennsylvania Phenology Project.
Note that the data extracted is presented in good faith, and I do not make any claims as to what the data implies. The main goal of this article is to show you how to extract this data and present you with the extracted data, to be used for scientific research.
The resulting model I train in this article can be used to extract the text from all the images. This data can then be converted to tables, and you can plot the information as you see in the image below:
If you are only interested in viewing the data extracted in this study, you can view it in this parquet file.
Why do we need to use VLMs?
When looking at the images, you may think we should apply traditional OCR to this problem. OCR is the science of extracting text from images, and in previous years, it has been dominated by engines like Tesseract, DocTR, and EasyOCR.
However, these models are often outperformed by modern large language models, particularly the ones incorporating vision (typically referred to as VLMs or VLLMs). The image below highlights why you would want to use a VLM instead of a traditional OCR engine. The first column shows example images from our dataset, and the other two columns compare EasyOCR with the fine-tuned Qwen model we will train in this article.
This comparison highlights the main reason to use a VLM over a traditional OCR engine: VLMs often extract text from images more accurately, particularly when the text is handwritten.
Advantages of VLMs
There are several advantages to using VLMs when extracting text from images. In the last section, you saw how the output quality from the VLM exceeds the output quality of a traditional OCR engine. Another advantage is that you can provide instructions to VLMs on how you want them to act, something traditional OCR engines do not support.
The two main advantages of VLMs are thus:
- VLMs excel at OCR (particularly handwriting)
- You can provide instructions
VLMs are good at OCR because it’s part of their training process. This is, for example, mentioned in the Qwen 2.5 VL technical report, section 2.2.1 (Pre-Training Data), where the authors list an OCR dataset as part of their pre-training data.
Handwriting
Extracting handwritten text has been notoriously difficult in the past and is still a challenge today. The reason for this is that handwriting is non-standardized.
By non-standardized, I mean that the characters look vastly different from person to person. A typed character, by contrast, is standardized: if you type a character on a computer, it looks essentially the same regardless of which computer or which person produced it. The typed character “a”, for instance, looks very similar no matter where it is written. This makes it simpler for an OCR engine to pick up the character, since the characters it extracts from images most likely look quite similar to the characters it encountered in its training set.
Handwritten text, however, is the opposite. Handwriting varies widely from person to person, which is why you sometimes struggle with reading other people’s handwriting. OCR engines also have this exact problem. If characters vary widely, there is a lower chance that it has encountered a specific character variation in its training set, thus making extracting the correct character from an image more difficult.
You can, for example, look at the image below. Imagine looking only at the ones in the image (so masking out the 7). Viewed in isolation, the “1” looks quite similar to a “7”. You are, of course, able to separate the two characters because you see them in context and can reason that if a seven looks the way it does here (with a horizontal stroke), the first two characters in the image must be ones.
Traditional OCR engines, however, don’t have this ability. They don’t look at the entire image, think critically about how one character looks, and use that to determine the other characters. They simply have to guess which character it is when looking at an isolated digit.
How to separate the digit “1” from the digit “7” ties nicely into the next section, about providing instructions to VLMs when extracting text.
I would also like to add that some OCR engines, such as TrOCR, are made to extract handwritten text. From experience, however, such models are not comparable in performance to state-of-the-art VLMs such as Qwen 2.5 VL.
Providing instructions
Another significant advantage of using VLMs for extracting text is that you can provide instructions to the model. This is naturally impossible with traditional OCR engines, since they simply extract all the text in the image: they accept only an image as input, not separate text instructions for how to extract the text. When we want to extract text using Qwen 2.5 VL, we provide a system prompt, such as the one below.
SYSTEM_PROMPT = """
Below is an instruction that describes a task, write a response that appropriately completes the request.
You are an expert at reading handwritten table entries. I will give you a snippet of a table and you will
read the text in the snippet and return the text as a string.
The texts can consist of the following:
1) A number only, the number can have from 1 to 3 digits.
2) A number surrounded by ordinary parenthesis.
3) A number surrounded by square brackets.
5) The letter 'e', 's' or 'k'
6) The percent sign '%'
7) No text at all (blank image).
Instructions:
**General Rules**:
- Return the text as a string.
- If the snippet contains no text, return: "unknown".
- In order to separate the digit 1 from the digit 7, know that the digit 7 always will have a horizontal stroke appearing in the middle of the digit.
If there is no such horizontal stroke, the digit is a 1 even if it might look like a 7.
- Beware that the text will often be surrounded by a black border, do not confuse this with the text. In particular
it is easy to confuse the digit 1 with parts of the border. Borders should be ignored.
- Ignore anything OUTSIDE the border.
- Do not use any code formatting, backticks, or markdown in your response. Just output the raw text.
- Respond **ONLY** with the string. Do not provide explanations or reasoning.
"""
The system prompt sets the outline for how Qwen should extract the text, which gives Qwen a major advantage over traditional OCR engines.
There are mainly two points that give it an advantage:
- We can tell Qwen which characters to expect in the image
- We can tell Qwen what the characters look like (particularly important for handwritten text).
You can see point one addressed in points 1) to 7), where we inform the model that it will only see 1–3 digits, which digits and letters it can encounter, and so on. This is a significant advantage, since Qwen knows that if it detects characters outside this range, it is most likely misinterpreting the image or facing a particularly challenging sample, and it can make a better-informed guess about which character is in the image.
The second point is particularly relevant for the problem I mentioned earlier of separating “1” from “7,” which look quite similar. Luckily for us, the author of this dataset was consistent with how he wrote 1s and 7s. The 1s were always written diagonally, and the 7s always included the horizontal stroke, which clearly separates a “7” from a “1”, at least to a human looking at the image.
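To make this concrete, below is a minimal inference sketch showing how a system prompt like the one above can be passed to Qwen 2.5 VL together with an image, using the Hugging Face transformers library and the qwen_vl_utils helper. This is an illustrative sketch, not the exact code used in this project, and the image file name is a placeholder.
# Minimal sketch (not the exact project code): prompting Qwen 2.5 VL with the
# SYSTEM_PROMPT defined above and a single cell image.
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": [
        {"type": "image", "image": "cell_0001.png"},  # placeholder file name
        {"type": "text", "text": "Read the text in this snippet."},
    ]},
]

# Build the chat-formatted prompt and the image tensors, then generate.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=32)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])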
However, providing such detailed prompts and specifications to the model is only possible once you really understand the dataset you are working on and its challenges. This is why you always have to spend time manually inspecting the data when working on a machine-learning problem such as this. In the next section, I will discuss the dataset we are working on.
The dataset
I start this section with a quote from Greg Brockman (President of OpenAI at the time of writing), highlighting an important point. In his tweet, he refers to the fact that data annotation and inspection are not prestigious work, but nonetheless, it’s one of the most important tasks you can spend time on when working on a machine-learning project.
At Findable, I started as a data annotator and proceeded to manage the labeling team at Findable before I now work as a data scientist. The work with annotation highlighted the importance of manually inspecting and understanding the data you are working on, and taught me how to do it effectively. Greg Brockman is referring to the fact that this work is not prestigious, which is often correct, since data inspection and annotation can be monotonous. However, you should always spend considerable time inspecting your dataset when working on a machine-learning problem. This time will provide you with insights that you can, for example, use to provide the detailed system prompt I highlighted in the last section.
The dataset we are working on consists of around 82,000 images, such as the ones you see below. The cells vary in width from 81 to 93 pixels and in height from 48 to 57 pixels, meaning we are working with very small images.
When starting this project, I first spent time looking at the different images to understand the variations in the dataset. I noticed, for example, that:
- The “1”s look similar to the “7”s
- There is some faint text in some of the images (for example, the “8” in the bottom left image above, and the “6” in the bottom right image)
- From a human perspective, all the images are very readable, so we should be able to extract all the text correctly
I then continued by using the base version of Qwen 2.5 VL 7B to predict on some of the images and see which areas the model struggles with. I immediately noticed that the model (unsurprisingly) had problems separating “1”s from “7”s.
After this process of first manually inspecting the data, then predicting a bit with the model to see where it struggles, I noted down the following data challenges:
- “1” and “7” look similar
- Dots in the background on some images
- Cell borders can be misinterpreted as characters
- Parentheses and brackets can sometimes be confused
- The text is faint in some images
We have to solve these challenges when fine-tuning the model to extract the text from the images, which I discuss in the next section.
Annotation and fine-tuning
After properly inspecting your dataset, it’s time to work on annotation and fine-tuning. Annotation is the process of assigning a label to each image, and fine-tuning is using these labels to improve the quality of your model.
The main goal when doing the annotation is to create a dataset efficiently. This means quickly producing a lot of labels and ensuring the quality of the labels is high. To achieve this goal of rapidly creating a high-quality dataset, I divided the process into three main steps:
- Predict
- Review & correct model mistakes
- Retrain
You should note that this process works well when you have a model already quite good at performing the task. In this problem, for example, Qwen is already quite good at extracting the text from the images, and only makes mistakes in 5–10% of the cases. If you have a completely new task for the model, this process will not work as well.
Step 1: Predict
The first step is to predict (extract the text) from a few hundred images using the base model. The specific number of images you predict on does not really matter, but you should strike a balance between gathering enough labels for a training run to improve the model (step 3) and the overhead required to train a model.
Step 2: Review & correct model mistakes
After you have predicted on a few hundred samples, it’s time to review and correct the model’s mistakes. You should set up your environment so you can easily display the images and labels and fix the errors. In the image below, you can see my setup for reviewing and correcting mistakes. On the left side, I have a Jupyter notebook where I can run a cell to display the next five samples and the line number each label belongs to. On the right side, all my labels are listed on the corresponding lines. To review and correct mistakes, I run the Jupyter notebook cell, make sure the labels on the right match the images on the left, and then rerun the cell to get the next five images. I repeat this process until I have looked through all the samples.
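As an illustration, a review cell along these lines could look like the sketch below. The folder name and the idea of bumping a start index by five between runs are placeholders, not my exact setup; the actual code is in the GitHub repository.
# Sketch of a simple review cell: display the next five cell images so that
# the labels in the text file on the right can be checked against them.
from pathlib import Path
from PIL import Image
import matplotlib.pyplot as plt

image_paths = sorted(Path("cells").glob("*.png"))  # placeholder folder name
start = 0  # increase by 5 after each pass through the cell

fig, axes = plt.subplots(1, 5, figsize=(15, 3))
for ax, path in zip(axes, image_paths[start:start + 5]):
    ax.imshow(Image.open(path), cmap="gray")
    ax.set_title(path.name, fontsize=8)  # shows which label line this image matches
    ax.axis("off")
plt.show()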
Step 3: Retrain
Now that you have a few hundred correct samples, it is time to train the model. In my case, I take Qwen 2.5 VL 7B and fine-tune it on my current set of labels. I fine-tune using the Unsloth package, which provides this notebook on fine-tuning Qwen (the notebook is for Qwen 2 VL, but all the code is the same except the model name, as you see in the code below). You can check out the next section to learn more details about the fine-tuning process.
The training creates a fine-tuned version of the model, and I go back to step 1 to predict on a few hundred new samples. I repeat this cycle of predicting, correcting, and training until I notice that model performance has converged.
# this is the original code in the notebook
from unsloth import FastVisionModel  # import needed for the snippets below

model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/Qwen2-VL-7B-Instruct",
    load_in_4bit = False, # this is originally set to True, but if you have the processing power, I recommend setting it to False
    use_gradient_checkpointing = "unsloth",
)

# to train Qwen 2.5 VL (instead of Qwen 2 VL), make sure you use this instead:
model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/Qwen2.5-VL-7B-Instruct",
    load_in_4bit = False,
    use_gradient_checkpointing = "unsloth",
)
To determine how well my model is performing, I also create a test set on which I test each fine-tuned model. I never train on this test set to ensure unbiased results. This test set is how I can determine whether the model’s performance is converging.
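Since each label is just a short string, a natural way to score a model on the test set is exact-match accuracy. The helper below is a minimal sketch of that idea, assuming you have lists of predicted and ground-truth strings (the variable names in the usage comment are placeholders).
# Sketch: exact-match accuracy between predicted and ground-truth cell texts.
def exact_match_accuracy(predictions, labels):
    assert len(predictions) == len(labels), "prediction/label count mismatch"
    correct = sum(p.strip() == l.strip() for p, l in zip(predictions, labels))
    return correct / len(labels)

# Example usage (both lists are placeholders):
# accuracy = exact_match_accuracy(model_outputs, test_labels)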
SFT technical details
SFT stands for supervised fine-tuning, which is the process of updating the model’s weights to perform better on the dataset we provide. The problem we are working on here is quite interesting, as the base Qwen 2.5 VL model is already quite good at OCR. This differs from many other tasks we apply VLMs to at Findable, where we typically teach the VLM a completely new task with which it essentially has no prior experience.
When fine-tuning a VLM such as Qwen on a new task, model performance rapidly increases once you start training it. However, the task we are working on here is quite different, since we only want to nudge Qwen to be a little bit better at reading the handwriting in our particular images. As I mentioned, the base model is already around 90–95% accurate on this dataset (depending on the specific images we test on).
This requirement of only nudging the model makes it very sensitive to the tuning parameters. To ensure we nudge the model properly, we do the following (a code sketch follows the list):
- Set a low learning rate, to only slightly update the weights
- Set a low LoRA rank to only update a small set of the model weights
- Ensure all labels are correct (the model is super sensitive to just a few annotation errors)
- Balance the dataset (there are a lot of blank images, so we filter some of them out)
- Tune all layers of the VLM
- Perform a hyperparameter search
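As a rough sketch of what such a gentle fine-tune looks like with Unsloth, the setup below follows the structure of the Unsloth vision notebook. The hyperparameter values and the converted_dataset variable are illustrative placeholders, not the exact settings from my hyperparameter search.
# Illustrative sketch of a "gentle" LoRA fine-tune with Unsloth; hyperparameter
# values are placeholders, not the exact ones used in this project.
from unsloth import FastVisionModel
from unsloth.trainer import UnslothVisionDataCollator
from trl import SFTTrainer, SFTConfig

model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/Qwen2.5-VL-7B-Instruct",
    load_in_4bit = False,
    use_gradient_checkpointing = "unsloth",
)

model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers = True,       # tune all parts of the VLM
    finetune_language_layers = True,
    finetune_attention_modules = True,
    finetune_mlp_modules = True,
    r = 8,                               # low LoRA rank: only a small nudge
    lora_alpha = 8,
    lora_dropout = 0.0,
)

FastVisionModel.for_training(model)
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = converted_dataset,   # chat-formatted image/label pairs (placeholder)
    data_collator = UnslothVisionDataCollator(model, tokenizer),
    args = SFTConfig(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        num_train_epochs = 1,
        learning_rate = 1e-5,            # low learning rate: small weight updates
        output_dir = "outputs",
        remove_unused_columns = False,
        dataset_text_field = "",
        dataset_kwargs = {"skip_prepare_dataset": True},
        max_seq_length = 2048,
    ),
)
trainer.train()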
I will add some additional notes on some of the points:
Label correctness
Label correctness is of utmost importance. Just a few labeling errors can have a detrimental effect on model performance. As an example, when I was working on fine-tuning my model, I noticed the model started confusing parentheses “( )” with brackets “[ ]”. This is, of course, a significant error, so I started looking into why this occurred. My first intuition was that this was due to issues with some of my labels (i.e., some images that in fact contained parentheses had received a bracket label). I went through my labels and noticed this error in around 0.5% of them.
This led to an interesting observation. I had a set of around 1,000 labels; 99.5% of them were correct, while 0.5% (5 labels!) were incorrect. Yet after fine-tuning my model on this set, it actually performed worse on the test set. This highlights how just a few incorrect labels can hurt your model’s performance.
The reason so few mistakes can have such a large impact is that the model blindly trusts the labels you give it. The model doesn’t look at the image and think, “Hmmm, why is this labelled a bracket when the image shows a parenthesis?” (like you might do). It blindly trusts the labels and accepts as fact that this image (which contains a parenthesis) contains a bracket. This really degrades model performance, as you are feeding it incorrect information, which it then uses for future predictions.
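One practical way to catch such label errors is to compare the model’s predictions against your labels and manually re-inspect every disagreement, since the mistake may be in the label rather than the prediction. The sketch below assumes a hypothetical predict helper that wraps model inference for one image.
# Sketch: surface suspicious labels by re-checking every sample where the model
# disagrees with the annotation. `predict` is a hypothetical inference helper.
def find_disagreements(samples, predict):
    """samples: list of dicts with 'image_path' and 'label' keys."""
    disagreements = []
    for sample in samples:
        prediction = predict(sample["image_path"])
        if prediction.strip() != sample["label"].strip():
            disagreements.append({**sample, "prediction": prediction})
    # Manually review these: sometimes the label, not the prediction, is wrong.
    return disagreements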
Data balancing
Another detail of the fine-tuning is that I balance the dataset to limit the number of blank images. Around 70% of the cells are blank, and I want to avoid spending too much of the fine-tuning on those images (the model already ignores those cells really well). Thus, I ensure that a maximum of 30% of the data we fine-tune on consists of blank images.
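A minimal sketch of this balancing step is shown below, assuming each sample is a dict carrying its label and that blank cells are labelled "unknown", as specified in the system prompt.
# Sketch: cap blank cells (labelled "unknown") at roughly 30% of the training data.
import random

def balance_blanks(samples, blank_label="unknown", max_blank_fraction=0.30, seed=42):
    blanks = [s for s in samples if s["label"] == blank_label]
    non_blanks = [s for s in samples if s["label"] != blank_label]
    # Keep at most max_blank_fraction of the final dataset as blank images.
    max_blanks = int(len(non_blanks) * max_blank_fraction / (1 - max_blank_fraction))
    rng = random.Random(seed)
    rng.shuffle(blanks)
    balanced = non_blanks + blanks[:max_blanks]
    rng.shuffle(balanced)
    return balanced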
Selecting layers to tune
The image below shows the general architecture of a VLM:
A consideration to make when fine-tuning VLMs is which layers you fine-tune. Ideally, you want to tune all the layers (marked in green in the image below), which I also did when working on this problem. However, sometimes you will have compute constraints that make tuning all layers difficult, and you might not need to tune all of them. An example of this could be if you have a very image-dependent task. At Findable, for example, we classify drawings from architects, civil engineers, etc. This is naturally a very vision-dependent task, and it is an example of a case where you can potentially get away with only tuning the vision layers of the model (the ViT, or vision transformer, and the vision-language adapter, sometimes referred to as a projector).
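With Unsloth, restricting the fine-tune to the vision side is a matter of flipping the layer-selection flags. The snippet below is an illustrative sketch; check the Unsloth documentation for exactly which modules each flag covers.
# Sketch: tune only the vision side of the model (for very image-dependent tasks
# under compute constraints); see the Unsloth docs for exact module coverage.
model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers = True,     # vision side of the model
    finetune_language_layers = False,  # leave the language layers untouched
    finetune_attention_modules = True,
    finetune_mlp_modules = True,
    r = 8,
    lora_alpha = 8,
)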
Hyperparameter search
I also did a hyperparameter search to find the optimal set of parameters to fine-tune the model. It is worth noting, however, that a hyperparameter search will not always be possible. Some training processes for large language models can take several days, and in such scenarios, performing an extensive hyperparameter search is not feasible, so you will have to work with your intuition to find a good set of parameters.
However, for this problem of extracting handwritten text, I had access to an A100 80 GB GPU. The images are quite small (less than 100px in each direction), and I’m working with the 7B model. This made each training run take 10–20 minutes, which makes an overnight hyperparameter search feasible.
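An overnight search can then be as simple as looping over a small grid. The sketch below assumes a hypothetical train_and_evaluate helper that runs one fine-tune with the given settings and returns accuracy on the test set; the grid values are illustrative.
# Sketch of a small overnight grid search; train_and_evaluate is a hypothetical
# helper that runs one fine-tune and returns test-set accuracy.
import itertools

learning_rates = [5e-6, 1e-5, 2e-5]
lora_ranks = [4, 8, 16]

results = []
for lr, rank in itertools.product(learning_rates, lora_ranks):
    accuracy = train_and_evaluate(learning_rate=lr, lora_rank=rank)
    results.append({"learning_rate": lr, "lora_rank": rank, "accuracy": accuracy})

best = max(results, key=lambda r: r["accuracy"])
print("Best configuration:", best)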
Results and plots
After repeating the cycle of training the model, creating more labels, retraining, and so on, I have created a high-performing fine-tuned model. Now it’s time to see the final results. I have made four test sets, each consisting of 278 samples. I run EasyOCR, the base Qwen 2.5 VL 7B model (Qwen base model), and the fine-tuned model on the data, and you can see the results in the table below:
Thus, the results clearly show that the fine-tuning has worked as expected, vastly improving model performance.
To wrap up, I would also like to share some plots you can make with the data.
If you want to investigate the data further, it is all contained in this parquet file on HuggingFace.
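If you want to explore it yourself, loading the parquet file is a one-liner with pandas. The file name and column name in the sketch below are placeholders; check the file on HuggingFace for the actual schema.
# Sketch: load the extracted phenology data with pandas and take a first look.
import pandas as pd

df = pd.read_parquet("phenology_extracted.parquet")  # placeholder file name
print(df.head())
print(df.columns)
# From here you can build plots, e.g. df["some_column"].value_counts().plot(kind="bar")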
Conclusion
In this article, I have introduced you to a phenology dataset consisting of small images with handwritten text. The problem I have addressed is how to effectively extract the handwritten text from these images. First, we inspected the dataset to understand what it looks like, the variance in the data, and the challenges the vision language model faces when extracting the text from the images. I then discussed the three-step pipeline you can use to create a labelled dataset and fine-tune a model to improve performance. Finally, I highlighted some results, showing how the fine-tuned Qwen model outperforms the base Qwen model, and I showed some plots representing the data we extracted.
The work in this article is performed by Eivind Kjosbakken and Lars Aurdal.