RavenDB 7.1: Clocking at 200 fsync/second
I have been delaying the discussion about the performance numbers for a reason. Once we did all the work that I described in the previous posts, we put it to an actual test and ran it against the benchmark suite. In particular, we were interested in the following scenario:
High insert rate, with about 100 indexes active at the same time. Target: Higher requests / second, lowered latency
Previously, after hitting a tipping point, we would settle at under 90 requests/second, with latency spikes that hit over 700(!) ms. That was the trigger for much of this work, after all. The initial results were quite promising: we were showing massive improvements across the board, with over 300 requests/second and latency peaks of 350 ms.
On the one hand, that is a really amazing boost, over 300% improvement. On the other hand, just 300 requests/second - I hoped for much higher numbers. When we started looking into exactly what was going on, it became clear that I seriously messed up.
Under load, RavenDB would issue fsync() at a rate of over 200/sec. That is… a lot, and it means that we are seriously limited in the amount of work we can do. That is strange, since we worked so hard on reducing the amount of work the disk has to do. Looking deeper, it turned out to be an interesting combination of issues.
Whenever Voron changes the active journal file, we'll register the new journal number in the header file, which requires an fsync() call. Because we are using shared journals, the writes from both the database and all the indexes go to the same file, filling it up very quickly. That meant we were creating new journal files at a rate of more than one per second, mostly as a result of the sheer number of writes that we funnel into a single file.
The catch here is that on each journal file creation, we need to register every environment that shares the journal. In this case, we have over a hundred environments participating, and we need to update the header file for each of them. With the rate of churn we have in the new shared journal mode, that alone multiplies the number of fsync() calls generated.
It gets more annoying when you realize that in order to actually share the journal, we need to create a hard link between the environments. On Linux, for example, we need to write the following code:
#include <fcntl.h>
#include <libgen.h>
#include <limits.h>
#include <stdbool.h>
#include <string.h>
#include <unistd.h>

bool create_hard_link_durably(const char *src, const char *dest) {
    if (link(src, dest) == -1)
        return false;
    // fsync() the *parent directory* of the new link; dirname() may
    // modify its argument, so work on a copy of the path.
    char dir[PATH_MAX];
    strncpy(dir, dest, PATH_MAX - 1);
    dir[PATH_MAX - 1] = '\0';
    int dirfd = open(dirname(dir), O_DIRECTORY);
    if (dirfd == -1)
        return false;
    int rc = fsync(dirfd);
    close(dirfd);
    return rc != -1;
}
You need to make 4 system calls to do this properly, and most crucially, one of them is fsync() on the destination directory. This is required because the man page states:
Calling fsync() does not necessarily ensure that the entry in the directory containing the file has also reached disk. For that an explicit fsync() on a file descriptor for the directory is also needed.
Shared journals mode requires that we link the journal file when we record a transaction from an environment to the shared journal. In our benchmark scenario, that means that each second, we write from each environment to each journal. We need an fsync() for the directory of each environment per journal, and in total, we get to over 200 fsync() calls per journal file, which we replace at a rate of more than one per second.
Doing nothing as fast as possible…
Even with this cost, we are still 3 times faster than before, which is great, but I think we can do better. To do that, we need to figure out a way to reduce the number of fsync() calls being generated. The first task to handle is updating the file header whenever we create a new journal. We are already calling fsync() on the directory when we create the journal file, which ensures that the file is properly persisted in the directory. There is no need to also record it in the header file; instead, we can use the directory listing to handle this scenario. That change alone saved us about 100 fsync() calls / second.
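The directory-listing approach can be sketched as follows. This is a minimal illustration, assuming journals are named "&lt;number&gt;.journal"; the naming convention is my assumption, not necessarily Voron's:

```c
#include <dirent.h>
#include <stdio.h>
#include <string.h>

// Sketch: recover the latest journal number from the directory listing
// instead of reading it from a header file. Since creating a journal
// already fsync()s the directory, the listing is durable on its own.
long find_latest_journal(const char *dir_path) {
    DIR *dir = opendir(dir_path);
    if (dir == NULL)
        return -1;
    long latest = -1;
    struct dirent *entry;
    while ((entry = readdir(dir)) != NULL) {
        long num;
        // Consider only entries that look like "<number>.journal".
        if (strstr(entry->d_name, ".journal") != NULL &&
            sscanf(entry->d_name, "%ld", &num) == 1 && num > latest)
            latest = num;
    }
    closedir(dir);
    return latest;
}
```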
The second problem is the hard links: we need to make sure that these are persisted, but calling fsync() for each one is cost-prohibitive. Luckily, we already have a transactional journal, and we can reuse it. As part of committing a set of transactions to the shared journal, we'll also record an entry in the journal with the associated linked paths. That means we can skip calling fsync() after creating the hard link: if we run into a hard crash, we can recover the linked journals during journal recovery. That allows us to skip the other 100 fsync() calls / second.
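The recovery side of that idea can be sketched like this. The one-record-per-line text format and the function name are illustrations of the technique, not Voron's actual binary entry layout:

```c
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

// Sketch: replay "link records" that were written as ordinary journal
// entries. Any hard link that was lost in the crash is re-created;
// EEXIST means the link survived and nothing needs to be done.
int recover_hard_links(const char *log_path) {
    FILE *log = fopen(log_path, "r");
    if (log == NULL)
        return -1;
    char src[4096], dest[4096];
    int restored = 0;
    while (fscanf(log, "%4095s %4095s", src, dest) == 2) {
        if (link(src, dest) == 0)
            restored++;               // the link was lost in the crash
        else if (errno != EEXIST) {   // EEXIST: the link survived
            fclose(log);
            return -1;
        }
    }
    fclose(log);
    return restored;
}
```

After replaying these records, recovery would fsync() the affected directories once, which is where the deferred cost is finally paid.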
Another action we can take to reduce costs is to increase the size of the journal files. Since we are writing entries from both the database and indexes to the same file, we are going through them a lot faster now, so increasing the default maximum size will allow us to amortize the new file costs across more transactions.
The devil is in the details
The idea of delaying the fsync() of the parent directory of the linked journals until recovery is a great one, because we usually don't recover. That means we delay the cost indefinitely, after all. Recovery is rare, and adding a few milliseconds to the recovery time is usually not meaningful.
However… There is a problem when you look at this closely. The whole idea behind shared journals is that transactions from multiple storage environments are written to a single file, which is hard-linked to multiple directories. Once written, each storage environment deals with the journal files independently.
That means that if the root environment is done with a journal file and deletes it before the journal file hard link was properly persisted to disk and there was a hard crash… on recovery, we won’t have a journal to re-create the hard links, and the branch environments will be in an invalid state.
That set of circumstances is unlikely, but it is something we have to prevent nevertheless.
The solution is to keep track of all the journal directories, and whenever we are about to delete a journal file, sync all the associated journal directories first. The key here is that when we do so, we record the current journal written to each directory. Instead of having to run fsync() for each directory per journal file, we can amortize the cost: because we delayed the actual syncing, we had time to create more journal files, so a single fsync() on the directory persists multiple files at once.
We still have to sync the directories, but at least it is not to the tune of hundreds of times per second.
Performance results
After making all of those changes, we ran the benchmark again. We are looking at over 500 req/sec and latency peaks that hover around 100 ms under load. As a reminder, that is almost twice as much as the improved version (and a huge improvement in latency).
If we compare this to the initial result, we increased the number of requests per second by over 500%.
But I have some more ideas about that, I’ll discuss them in the next post…