How to Benchmark DeepSeek-R1 Distilled Models on GPQA Using Ollama and OpenAI’s simple-evals

Set up and run the GPQA-Diamond benchmark on DeepSeek-R1’s distilled models locally to evaluate their reasoning capabilities.


The recent launch of the DeepSeek-R1 model sent ripples across the global AI community. It delivered breakthroughs on par with reasoning models from OpenAI, achieved in a fraction of the time and at a significantly lower cost.

Beyond the headlines and online buzz, how can we assess the model’s reasoning abilities using recognized benchmarks? 

DeepSeek’s user interface makes it easy to explore its capabilities, but using the model programmatically offers deeper insights and more seamless integration into real-world applications. Understanding how to run such models locally also provides greater control and offline access.

In this article, we explore how to use Ollama and OpenAI’s simple-evals to evaluate the reasoning capabilities of DeepSeek-R1’s distilled models based on the famous GPQA-Diamond benchmark.

Contents

(1) What are Reasoning Models?
(2) What is DeepSeek-R1?
(3) Understanding Distillation and DeepSeek-R1 Distilled Models
(4) Selection of Distilled Model
(5) Benchmarks for Evaluating Reasoning
(6) Tools Used
(7) Results of Evaluation
(8) Step-by-Step Walkthrough

Here is the link to the accompanying GitHub repo for this article.


(1) What are Reasoning Models?

Reasoning models, such as DeepSeek-R1 and OpenAI’s o-series models (e.g., o1, o3), are large language models (LLMs) trained using reinforcement learning to perform reasoning. 

Reasoning models think before they answer, producing a long internal chain of thought before responding. They excel in complex problem-solving, coding, scientific reasoning, and multi-step planning for agentic workflows.


(2) What is DeepSeek-R1?

DeepSeek-R1 is a state-of-the-art open-source LLM designed for advanced reasoning, introduced in January 2025 in the paper DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.

The model is a 671-billion-parameter LLM trained with extensive use of reinforcement learning (RL), based on this pipeline:

  • Two reinforcement learning stages aimed at discovering improved reasoning patterns and aligning the model with human preferences
  • Two supervised fine-tuning stages serving as the seed for the model’s reasoning and non-reasoning capabilities

To be precise, DeepSeek trained two models:

  • The first model, DeepSeek-R1-Zero, is a reasoning model trained purely with reinforcement learning; it generates data for training the second model, DeepSeek-R1
  • It does so by producing reasoning traces, of which only high-quality outputs are retained based on their final results
  • This means that, unlike in most training pipelines, the RL examples are not curated by humans but generated by the model itself

The outcome is that DeepSeek-R1 achieved performance comparable to leading models such as OpenAI’s o1 across tasks like mathematics, coding, and complex reasoning.


(3) Understanding Distillation and DeepSeek-R1’s Distilled Models

Alongside the full model, the DeepSeek team also open-sourced six smaller dense models (released under the DeepSeek-R1-Distill name) in different sizes (1.5B, 7B, 8B, 14B, 32B, and 70B), distilled from DeepSeek-R1 using Qwen or Llama as the base model.

Distillation is a technique where a smaller model (the “student”) is trained to replicate the performance of a larger, more powerful pre-trained model (the “teacher”). 

Illustration of DeepSeek-R1 distillation process | Image by author

In this case, the teacher is the 671B DeepSeek-R1 model, and the students are the six smaller models distilled from the open-source Qwen and Llama base models.

DeepSeek-R1 was used as the teacher model to generate 800,000 training samples (a mix of reasoning and non-reasoning data), which were then used for supervised fine-tuning of the base models to produce the 1.5B, 7B, 8B, 14B, 32B, and 70B distilled variants.

So why do we do distillation in the first place? 

The goal is to transfer the reasoning abilities of larger models, such as DeepSeek-R1 671B, into smaller, more efficient models. This empowers the smaller models to handle complex reasoning tasks while being faster and more resource-efficient.

Furthermore, DeepSeek-R1 has a massive number of parameters (671 billion), making it challenging to run on most consumer-grade machines. 

Even the most powerful MacBook Pro, with a maximum of 128GB of unified memory, is inadequate to run a 671-billion-parameter model.

As such, distilled models open up the possibility of being deployed on devices with limited computational resources.

Unsloth achieved an impressive feat by quantizing the original 671B-parameter DeepSeek-R1 model down to just 131GB — a remarkable 80% reduction in size. However, a 131GB VRAM requirement remains a significant hurdle.


(4) Selection of Distilled Model

With six distilled model sizes to choose from, selecting the right one largely depends on the capabilities of the local device hardware. 

For those with high-performance GPUs or CPUs who want maximum performance, the larger distilled models (32B and up) are ideal, and even the quantized 671B version is viable.

However, if one has limited resources or prefers quicker generation times (as I do), the smaller distilled variants, such as 8B or 14B, are a better fit.

For this project, I will be using the DeepSeek-R1 distilled Qwen-14B model, which aligns with the hardware constraints I faced.


(5) Benchmarks for Evaluating Reasoning

LLMs are typically evaluated using standardized benchmarks that assess their performance across various tasks, including language understanding, code generation, instruction following, and question answering. Common examples include MMLU, HumanEval, and MGSM.

To measure an LLM’s capacity for reasoning, we need more challenging, reasoning-heavy benchmarks that go beyond surface-level tasks. Here are some popular examples focused on evaluating advanced reasoning capabilities:

(i) AIME 2024 — Competition Math

  • The American Invitational Mathematics Examination (AIME) 2024 serves as a strong benchmark for evaluating an LLM’s mathematical reasoning capabilities. 
  • It is a challenging math contest with complex, multi-step problems that test an LLM’s ability to interpret intricate questions, apply advanced reasoning, and perform precise symbolic manipulation.

(ii) Codeforces — Competition Code

  • The Codeforces Benchmark evaluates an LLM’s reasoning ability using real competitive programming problems from Codeforces, a platform known for algorithmic challenges. 
  • These problems test an LLM’s capacity to comprehend complex instructions, perform logical and mathematical reasoning, plan multi-step solutions, and generate correct, efficient code.

(iii) GPQA Diamond — PhD-Level Science Questions

  • GPQA-Diamond is a curated subset of the most difficult questions from the broader GPQA (Graduate-Level Google-Proof Q&A) benchmark, specifically designed to push the limits of LLM reasoning in advanced PhD-level topics.
  • While GPQA includes a range of conceptual and calculation-heavy graduate questions, GPQA-Diamond isolates only the most challenging and reasoning-intensive ones.
  • These questions are considered Google-proof, meaning they are difficult to answer correctly even with unrestricted web access.

In this project, we use GPQA-Diamond as the reasoning benchmark, as OpenAI and DeepSeek used it to evaluate their reasoning models.


(6) Tools Used

For this project, we primarily use Ollama and OpenAI’s simple-evals.

(i) Ollama

Ollama is an open-source tool that simplifies running LLMs on a local computer or server.

It acts as a model manager and runtime, handling tasks such as model downloads and environment setup. This allows users to interact with LLMs without requiring a constant internet connection or relying on cloud services.

It supports many open-source LLMs, including DeepSeek-R1, and is cross-platform compatible with macOS, Windows, and Linux. Additionally, it offers a straightforward setup with minimal fuss and efficient resource utilization.

Important: Ensure your local device has GPU access for Ollama, as this dramatically accelerates performance and makes the subsequent benchmarking runs far more efficient than running on CPU alone. On NVIDIA systems, run nvidia-smi in the terminal to check whether the GPU is detected.
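To make this concrete, here is a minimal sketch of querying a locally served model through Ollama’s REST API (it listens on port 11434 by default). The deepseek-r1:14b tag is an assumption here; any model you have already pulled locally works.

```python
# Minimal sketch: query a local Ollama server via its REST API.
# Assumes Ollama is running and the deepseek-r1:14b model has been pulled.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "deepseek-r1:14b",
        "prompt": "In one sentence, what is a reasoning model?",
        "stream": False,  # return the full completion as a single JSON object
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["response"])
```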


(ii) OpenAI simple-evals

simple-evals is a lightweight library designed to evaluate language models using a zero-shot, chain-of-thought prompting approach. It includes famous benchmarks like MMLU, MATH, GPQA, MGSM, and HumanEval, aiming to reflect realistic usage scenarios.

Some of you may know about OpenAI’s more famous and comprehensive evaluation library called Evals, which is distinct from simple-evals.

In fact, the README of simple-evals also specifically indicates that it is not intended to replace the Evals library.

So why are we using simple-evals? 

The simple answer is that simple-evals comes with built-in evaluation scripts for the reasoning benchmarks we are targeting (such as GPQA), which are missing in Evals.

Additionally, I did not find any other tool or platform that provides a straightforward, Python-native way to run key benchmarks such as GPQA, particularly when working with Ollama.


(7) Results of Evaluation

As part of the evaluation, I selected 20 random questions from the GPQA-Diamond 198-question set for the 14B distilled model to work on. The total time taken was 216 minutes, which is ~11 minutes per question. 

The outcome was admittedly disappointing, as it scored only 10%, far below the reported 73.3% score for the 671B DeepSeek-R1 model.

The main issue I noticed is that during its intensive internal reasoning, the model often either failed to produce any answer (e.g., returning reasoning tokens as the final lines of output) or provided a response that did not match the expected multiple-choice format (e.g., Answer: A).

Evaluation output printout from the 20-example benchmark run | Image by author

As shown above, many outputs ended up as None because the regex logic in simple-evals could not detect the expected answer pattern in the LLM response.
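For illustration, the snippet below mimics this kind of answer extraction. The exact regex used by simple-evals may differ, but the failure mode is the same: if the model never emits a final line such as "Answer: C", the match is None and the question is scored as unanswered.

```python
import re

# Simplified multiple-choice answer extraction, similar in spirit to what
# simple-evals applies to model output (the exact pattern may differ).
ANSWER_PATTERN = r"(?i)Answer\s*:\s*([A-D])"

complete = "...therefore the second option is consistent.\n\nAnswer: C"
truncated = "...so we should compare the activation energies, which means"

for text in (complete, truncated):
    match = re.search(ANSWER_PATTERN, text)
    print(match.group(1) if match else None)  # prints "C", then "None"
```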

While the human-like reasoning logic was interesting to observe, I had expected stronger performance in terms of question-answering accuracy.

I have also seen online users mention that even the larger 32B model does not perform as well as o1. This has raised doubts about the utility of distilled reasoning models, especially when they struggle to give correct answers despite generating long reasoning.

That said, GPQA-Diamond is a highly challenging benchmark, so these models could still be useful for simpler reasoning tasks. Their lower computational demands also make them more accessible.

Furthermore, the DeepSeek team recommended conducting multiple tests and averaging the results as part of the benchmarking process — something I omitted due to time constraints.


(8) Step-by-Step Walkthrough

At this point, we’ve covered the core concepts and key takeaways. 

If you’re ready for a hands-on, technical walkthrough, this section provides a deep dive into the inner workings and step-by-step implementation. 

Check out (or clone) the accompanying GitHub repo to follow along. The requirements for the virtual environment setup can be found here.

(i) Initial Setup — Ollama

We begin by downloading Ollama. Visit the Ollama download page, select your operating system, and follow the corresponding installation instructions.

Once installation is complete, launch Ollama by double-clicking the Ollama app (for Windows and macOS) or running ollama serve in the terminal.
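As an optional sanity check, you can confirm the server is reachable on its default port (11434) and list the models it currently has available:

```python
# Optional check that the Ollama server is up; /api/tags lists locally pulled models.
import requests

r = requests.get("http://localhost:11434/api/tags", timeout=5)
r.raise_for_status()
print([m["name"] for m in r.json().get("models", [])])
```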


(ii) Initial Setup — OpenAI simple-evals

The setup of simple-evals is somewhat unusual.

While simple-evals presents itself as a library, the absence of __init__.py files in the repository means it is not structured as a proper Python package, leading to import errors after cloning the repo locally. 

Since it is also not published to PyPI and lacks standard packaging files like setup.py or pyproject.toml, it cannot be installed via pip.

Fortunately, we can utilize Git submodules as a straightforward workaround.

A Git submodule lets us include the contents of another Git repository inside our own project. It pulls the files from an external repo (e.g., simple-evals) while keeping that repo’s history separate.

You can choose one of two ways (A or B) to pull the simple-evals contents:

(A) If You Cloned My Project Repo

My project repo already includes simple-evals as a submodule, so you can just run:

git submodule update --init --recursive

(B) If You’re Adding It to a Newly Created Project
To manually add simple-evals as a submodule, run this:

git submodule add https://github.com/openai/simple-evals.git simple_evals

Note: The simple_evals at the end of the command (with an underscore) is crucial. It sets the folder name; without it, the folder defaults to the repository name simple-evals, and since hyphens are not valid in Python module names, this leads to import issues later.


Final Step (For Both Methods)

After pulling the repo contents, you must create an empty __init__.py in the newly created simple_evals folder so that it is importable as a module. You can create it manually, or use the following command:

touch simple_evals/__init__.py
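As a quick, optional sanity check (run from the project root), you can try importing the GPQA evaluation class from the submodule. The module path follows the folder name chosen above; depending on how the submodule’s own files import each other, you may also need its folder on sys.path, as sketched here:

```python
# Optional import check. Appending the submodule folder to sys.path helps if
# its internal files use absolute imports (e.g., "import common").
import sys

sys.path.append("simple_evals")

from simple_evals.gpqa_eval import GPQAEval  # noqa: E402,F401

print("simple_evals imported successfully")
```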

(iii) Pull DeepSeek-R1 model via Ollama

The next step is to locally download the distilled model of your choice (e.g., 14B) using this command:

ollama pull deepseek-r1:14b

The list of DeepSeek-R1 models available on Ollama can be found here.


(iv) Define configuration

We define the run parameters in a configuration YAML file stored in the repo.

The model temperature is set to 0.6 (rather than the value of 0 that is often used for deterministic outputs). This follows DeepSeek’s usage recommendations, which suggest a temperature range of 0.5 to 0.7 (0.6 recommended) to prevent endless repetition or incoherent outputs.

Do check out DeepSeek-R1’s rather specific usage recommendations, especially those for benchmarking, to ensure optimal performance when using these models.

EVAL_N_EXAMPLES is the parameter for setting the number of questions from the full 198-question set to use for evaluation.
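As a rough sketch, loading such a config in Python might look like the following; apart from EVAL_N_EXAMPLES and the 0.6 temperature, the key names are illustrative and may differ from the actual file in the repo.

```python
# Illustrative config loading with PyYAML. Only EVAL_N_EXAMPLES and the 0.6
# temperature come from this article; the other key names are hypothetical.
import yaml

with open("config.yaml") as f:
    config = yaml.safe_load(f)

MODEL_NAME = config.get("MODEL_NAME", "deepseek-r1:14b")  # hypothetical key
TEMPERATURE = config.get("TEMPERATURE", 0.6)              # DeepSeek recommends 0.5 to 0.7
EVAL_N_EXAMPLES = config.get("EVAL_N_EXAMPLES", 20)       # questions drawn from the 198-question set
```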


(v) Set up Sampler code

To support Ollama-based language models within the simple-evals framework, we create a custom wrapper class named OllamaSampler saved inside utils/samplers/ollama_sampler.py.

In this context, a sampler is a Python class that generates outputs from a language model based on a given prompt. 

Since the existing samplers in simple-evals only cover providers such as OpenAI and Anthropic (Claude), we need a sampler class that provides a compatible interface for Ollama. 

The OllamaSampler extracts the GPQA question prompt, sends it to the model with a specified temperature, and returns the plain text response. 

The _pack_message method is included to ensure the output format matches what the evaluation scripts in simple-evals expect.
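A minimal sketch of such a wrapper is shown below. It assumes Ollama’s REST chat endpoint; the exact sampler interface simple-evals expects can vary between versions, so treat this as illustrative rather than a verbatim copy of the repo’s implementation.

```python
# Illustrative OllamaSampler sketch (not the repo's exact code). It exposes the
# two behaviours the evaluation scripts rely on: being callable on a list of
# chat messages and packing content into {"role", "content"} dicts.
import requests


class OllamaSampler:
    def __init__(self, model="deepseek-r1:14b", temperature=0.6,
                 url="http://localhost:11434/api/chat"):
        self.model = model
        self.temperature = temperature
        self.url = url

    def _pack_message(self, content, role="user"):
        # simple-evals builds prompts as lists of {"role", "content"} dicts
        return {"role": role, "content": content}

    def __call__(self, message_list):
        payload = {
            "model": self.model,
            "messages": message_list,
            "stream": False,
            "options": {"temperature": self.temperature},
        }
        resp = requests.post(self.url, json=payload, timeout=1200)
        resp.raise_for_status()
        return resp.json()["message"]["content"]  # plain text response
```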


(vi) Create evaluation run script

The following code sets up the evaluation execution in main.py, including the use of the GPQAEval class from simple-evals to run GPQA benchmarking.

The run_eval() function is a configurable evaluation runner that tests LLMs through Ollama on benchmarks like GPQA.

It loads settings from the config file, sets up the appropriate evaluation class from simple-evals, and runs the model through a standardized evaluation process. It is saved in main.py, which can be executed with python main.py.
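Below is a hedged sketch of what such a runner might look like. The GPQAEval constructor arguments and the attributes of the returned result object may differ slightly across simple-evals versions, so check the repo’s main.py for the exact wiring.

```python
# Illustrative evaluation runner (a sketch, not the repo's exact main.py).
from simple_evals.gpqa_eval import GPQAEval
from utils.samplers.ollama_sampler import OllamaSampler  # wrapper from the previous step


def run_eval(model="deepseek-r1:14b", temperature=0.6, n_examples=20):
    sampler = OllamaSampler(model=model, temperature=temperature)
    # num_examples limits the run to a subset of the 198 GPQA-Diamond questions;
    # argument names may vary by simple-evals version.
    gpqa = GPQAEval(num_examples=n_examples, n_repeats=1)
    result = gpqa(sampler)
    print("GPQA-Diamond score:", result.score)


if __name__ == "__main__":
    run_eval()
```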

Following the steps above, we have successfully set up and executed the GPQA-Diamond benchmarking on the DeepSeek-R1 distilled model.


Wrapping It Up

In this article, we showcased how we can combine tools like Ollama and OpenAI’s simple-evals to explore and benchmark DeepSeek-R1’s distilled models.

The distilled models may not yet rival the 671B parameter original model on challenging reasoning benchmarks like GPQA-Diamond. Still, they demonstrate how distillation can expand access to LLM reasoning capabilities.

Despite subpar scores in complex PhD-level tasks, these smaller variants may remain viable for less demanding scenarios, paving the way for efficient local deployment on a wider range of hardware.

Before you go

I welcome you to follow my GitHub and LinkedIn to stay updated with more engaging and practical content. Meanwhile, have fun benchmarking LLMs with Ollama and simple-evals!
