Bloom Filters Applied in a Real-Life Application - Laravel Prototype

The Problem
An operation that is both frequent and intensive is one of the biggest challenges we might face.
A good example of that is uniqueness validation: we need to add a new email to the database, but to do that we have to check it against every single email that's already in there.
And that's the intensive part of it. We perform a full table scan (or, in the best case, an index scan) to make sure this table doesn't have a single instance of that email.
Take a case where you have millions of users in your users table, and you need to make sure the email is unique for each user who signs up. The validation will take a long time, and the user will be annoyed waiting for it.
Not to mention the load you put on the database by running that huge scan over a huge amount of data for a task as simple as validation.
Taking our example of around 10 million users, the check might need ~10-60 seconds without an index, while with an index it might take a few seconds.
These numbers are not solid; they're just a reference for how long it might take.
Being in the industry for a while, you'll know from the get-go that fewer queries on the database is always better.
That is why we use caching layers like Redis (which will also be used in this post), search engine tools like Elasticsearch, and message brokers like RabbitMQ for queueing.
Using other tools to minimize the need for queries is something we often do in software engineering, and this case is no different.
The best solution for such a case is Bloom filters, which I'll explain in detail in the next section.
Solution: Bloom Filters
Talking about Bloom filters, we need to understand that it's not actually just a tool. It's a data structure, and a great fit for cases like the one we're discussing here.
General Principle
A Bloom filter consists of two main components:
- bit array
- multiple hash functions
When a new item is inserted, we run each of the hash functions on it; each function gives us an index in the bit array to set to 1. After applying all the hash functions, certain positions in the bit array are filled with 1.
When another item comes in later, we run the hash functions on it again; if all the resulting indexes were already set to 1 in the bit array, that means this new item "might have been inserted before" (I'll elaborate on this later).
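The mechanism above can be sketched in a few lines. This is a language-neutral illustration in Python, not the Redis implementation used later; the array size, number of hash functions, and the salted-SHA-256 hashing are arbitrary choices for demonstration:

```python
# Minimal Bloom filter sketch (illustrative only; real implementations
# use faster hashes like MurmurHash or FNV and a packed bit array).
import hashlib

class BloomFilter:
    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [0] * size  # the bit array, all zeros initially

    def _indexes(self, item):
        # Derive `num_hashes` indexes by salting a single hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for i in self._indexes(item):
            self.bits[i] = 1  # set every resulting index to 1

    def might_contain(self, item):
        # True only when *all* indexes are set: "might be inserted before".
        return all(self.bits[i] for i in self._indexes(item))
```

With `bf = BloomFilter()`, after `bf.add("some@email.com")` a later `bf.might_contain("some@email.com")` is always `True`, while an email that was never added comes back `False` unless its indexes happen to collide.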
Example
Let's take our email case as an example of that.
When we insert a new email (for example "abdelrahman.dwedar.7350@gmail.com"), we start by running the hash functions on it. For simplicity, imagine the results were: 7, 1, 5.
So our bit array will now look like this:
[0, 1, 0, 0, 0, 1, 0, 1, 0, 0]
Notice that we have 1 in the indexes produced by the hash functions.
If we now try to insert "abdelrahman.dwedar.7350@gmail.com" again, all of those positions are already 1, so we suspect that it might have been inserted before.
False-positive nature of bloom filters
Bloom filters are a false-positive data structure: when the answer is false, it is a solid, guaranteed false. But when the answer is true, it might actually be false.
How does that happen exactly?
Let's take an example on the same bit array, with indexes 7, 1, and 5 filled. You might be inserting "new_email@gmail.com", yet your hash functions also result in 7, 1, 5. All those positions are filled in the bit array, so the filter says the item might have been inserted, and you now need to do a full scan to be sure.
The good part is that whenever the positions are not all filled, we're sure the new item was never inserted before.
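Both guarantees are easy to see in a small experiment. The sizes, email addresses, and salted-SHA-256 hashing below are made up for the demonstration; the theoretical false-positive rate is the standard approximation (1 - e^(-kn/m))^k for k hash functions, n inserted items, and m bits:

```python
# Measure the false-positive rate of a tiny Bloom filter:
# inserted items are never reported absent, but a small fraction
# of never-inserted items still collide and come back as "maybe".
import hashlib
import math

SIZE, K = 1000, 3          # m = 1000 bits, k = 3 hash functions
bits = [0] * SIZE

def indexes(item):
    return [int(hashlib.sha256(f"{s}:{item}".encode()).hexdigest(), 16) % SIZE
            for s in range(K)]

def add(item):
    for i in indexes(item):
        bits[i] = 1

def might_contain(item):
    return all(bits[i] for i in indexes(item))

inserted = [f"user{n}@example.com" for n in range(100)]   # n = 100 items
for email in inserted:
    add(email)

# Probe 10,000 emails that were never inserted.
probes = [f"other{n}@example.com" for n in range(10_000)]
false_positives = sum(might_contain(p) for p in probes)
rate = false_positives / len(probes)

# Theoretical rate for these parameters is roughly 1.7%.
expected = (1 - math.exp(-K * 100 / SIZE)) ** K
print(f"measured {rate:.3f} vs. theoretical {expected:.3f}")
```

Every inserted email is always found (no false negatives), while the measured false-positive rate lands close to the theoretical value.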
Why use it?
Because Bloom filters use a bit array and very fast hash functions like FNV or MurmurHash, they give us a very fast check of whether an item might have been inserted before, with only false positives possible.
Using Bloom filters means you perform full scans much less often, as whenever the filter says false, you're 100% sure the item was never added to the database before.
I recommend reading more about Bloom filters to gain a better understanding of them, so I've added a few valuable resources I used in the references.
Bloom Filters: Redis Implementation
Redis Implementation
Redis is a key-value data store, most often used as a caching layer. It stores data in memory, which makes it fast to reach at any time, and its key-value nature (similar to hash maps) gives it great read and write speeds.
I'll be using Redis for this. Redis doesn't implement Bloom filters by default, but you can get them by adding the Redis probabilistic data structures module to Redis, or by using the redis-stack Docker image. (In my prototype I used the Docker image.)
Using Redis takes away the overhead of building the Bloom filter from scratch in our application, and it also helps with auto-scaling later on (which is out of the scope of this article).
Redis Commands
If you've ever used Redis before, you'll know that we use commands like APPEND {key} {value}.
We have commands for Bloom filters in Redis as well, and all of them start with BF.
I won't be mentioning every single one of them, but I'll tell you the most important ones.
- BF.RESERVE {key} {error_rate} {capacity} [EXPANSION expansion]: Creates an empty Bloom filter with a single sub-filter for the initial specified capacity and with an upper bound error_rate.
- BF.ADD {key} {item}: Adds an item to a Bloom filter.
- BF.INSERT {key} ITEMS {item} [item ...]: Creates a new Bloom filter if the key does not exist (using the specified error rate, capacity, and expansion), then adds all the specified items to it. (Accepts at least one item.)
- BF.EXISTS {key} {item}: Determines whether a given item was added to a Bloom filter.
I think from the naming alone you can understand how they work. We use BF.RESERVE to create a new Bloom filter, then BF.ADD or BF.INSERT to add new items to it (the hash functions run internally). Whenever we want to check whether an item exists in the Bloom filter, we use BF.EXISTS (which also runs the hash functions internally).
Just like that, we have a ready-to-go Bloom filter for a high-scale application, also powered by the auto-scaling features of Redis.
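To make the command flow concrete without a running redis-stack container, here's a tiny in-memory stand-in that mimics the semantics of BF.RESERVE, BF.ADD, and BF.EXISTS. This is a simulation, not how Redis implements them; real Redis derives the array size from the error rate and capacity, while here the sizing is deliberately hard-coded:

```python
# In-memory stand-in for the RedisBloom command flow (not real Redis!).
import hashlib

class FakeRedisBloom:
    def __init__(self):
        self.filters = {}  # key -> (size, num_hashes, bit array)

    def _indexes(self, key, item):
        size, k, _ = self.filters[key]
        return [int(hashlib.sha256(f"{s}:{item}".encode()).hexdigest(), 16) % size
                for s in range(k)]

    def reserve(self, key, error_rate, capacity):
        # BF.RESERVE: real Redis sizes the filter from error_rate and
        # capacity; we just pick fixed parameters for illustration.
        self.filters[key] = (8192, 3, [0] * 8192)

    def add(self, key, item):
        # BF.ADD: set every index produced by the hash functions.
        _, _, bits = self.filters[key]
        for i in self._indexes(key, item):
            bits[i] = 1

    def exists(self, key, item):
        # BF.EXISTS: returns 1 ("maybe present") or 0 ("definitely absent"),
        # matching the integer reply of the real command.
        _, _, bits = self.filters[key]
        return int(all(bits[i] for i in self._indexes(key, item)))

bf = FakeRedisBloom()
bf.reserve("emails", 0.01, 1_000_000)           # BF.RESERVE emails 0.01 1000000
bf.add("emails", "user@example.com")            # BF.ADD emails user@example.com
print(bf.exists("emails", "user@example.com"))  # 1: maybe present
print(bf.exists("emails", "new@example.com"))   # 0: definitely never added
```

The sequence mirrors exactly what we'll send to real Redis from Laravel: reserve once, add on insert, and check existence before touching the database.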
Laravel Custom Validation Rules
What are Laravel validation rules?
To give a more real-life example we need an application, and for the application layer I'll use Laravel. In Laravel we can define validation rules, like the following:
$validated = $request->validate([
    'title' => 'required|unique:posts|max:255',
    'body' => 'required',
]);
In this example, we have title and body as the fields of the request, and the validation rules applied are required, unique, and max.
To make it simple:
- required: as its name indicates, it makes the field required; if it's missing, you get an error
- unique: the field's value has to be unique, and the name after : is the name of the table it checks uniqueness against
- max: checks the maximum character length, in our case
The rules used here are default ones that come with Laravel itself. You can learn more about them in the documentation.
Making a Custom Validation Rule
Laravel also gives you the flexibility to add custom rules as you wish.
In our case, the unique validation rule is not enough, as it goes directly to the database and performs a full table scan.
We need something custom for that.
By following the documentation for making custom validation rules, we can create a new rule by running this command:
php artisan make:rule BloomUnique
Then we get a template for the new rule.
After making my changes to use Redis with BF.EXISTS, this is the final code:
<?php

namespace App\Rules;

use Closure;
use Illuminate\Contracts\Validation\ValidationRule;
use Illuminate\Support\Facades\Redis;
use App\Models\User;

class BloomUnique implements ValidationRule
{
    protected $filterName;

    public function __construct(string $filterName)
    {
        $this->filterName = $filterName;
    }

    /**
     * Run the validation rule.
     *
     * @param  \Closure(string): \Illuminate\Translation\PotentiallyTranslatedString  $fail
     */
    public function validate(string $attribute, mixed $value, Closure $fail): void
    {
        if (Redis::executeRaw(['BF.EXISTS', $this->filterName, $value])) {
            // Check database only if Bloom filter indicates possible existence
            if (User::where($this->filterName, $value)->exists()) {
                $fail('The :attribute has already been taken.');
            }
        }
    }
}
I use the Redis class to execute the BF.EXISTS command with the filterName and the value from the request. We're not done yet: if the item exists in the Bloom filter, that doesn't mean it's there for sure, so we still need to query the table to check whether it's actually in the database.
But if we don't find it in the Bloom filter at all, we're sure it was never inserted before.
We use the $fail closure to indicate that validation didn't pass, and we pass the error message as a parameter.
Prototype
Explained
I have made a prototype for this using Laravel and Redis. I also used Docker to run them.
First I made the case where we just use the default unique validation rule, then the case where we use the custom BloomUnique rule.
Each case lives in a separate Git branch: the case with the normal unique rule is in the branch named without-bloom-filters, and the case with the custom rule is in the branch named with-bloom-filers.
I also added seeders, using Laravel's built-in seeder, to insert testing data.
How to run?
Clone the Repository
git clone https://github.com/AbdelrahmanDwedar/bloom-filters-validation-laravel-prototype.git
cd bloom-filters-validation-laravel-prototype
Setup Sail (For using docker)
composer require laravel/sail --dev
php artisan sail:install
Setup the environment
cp .env.example .env
./vendor/bin/sail artisan key:generate
Then add the configuration for the database
DB_CONNECTION=pgsql
DB_HOST=127.0.0.1
DB_PORT=5432
DB_DATABASE=laravel
DB_USERNAME=root
DB_PASSWORD=laravel
Also, add the configuration for Redis
REDIS_HOST=redis
REDIS_PASSWORD=null
REDIS_PORT=6379
Start the docker containers, and install dependencies
./vendor/bin/sail up -d
./vendor/bin/sail composer install
Use the custom-made commands to prepare the setup
./vendor/bin/sail artisan benchmark:prepare
./vendor/bin/sail artisan benchmark:perform
These are two custom commands I made for this prototype. The benchmark:prepare command's job is to:
- run migrations on a fresh database
- run the seeder
And the benchmark:perform command's job is to:
- run the benchmarks
- display the results in a table in the terminal
Conclusion
Bloom filters are a great tool for solving the problem of always making full-table scans when checking whether some field value already exists in the database.
We can use them in real-life scenarios by leveraging the Redis implementation, which I personally admire.
This prototype just shows how Bloom filters might be used in a real-life scenario with a Laravel application; we leveraged Laravel's custom rules for better code readability and separation of concerns.
If you'd like more information, I've added some resources you can learn from in the references below.
References
- Wikipedia: Bloom Filter (article)
- What Are Bloom Filters? (video)
- Bloom Filters Explained by Example (video)
- Redis: Bloom Filters In Redis (video)
- Redis: How to Use Bloom Filters in Redis (video)
- Redis: Bloom Filter Datatype for Redis (article)
- Redis: Redis Bloom Filters Commands (documentation)
- Laravel: Custom Validation Rules (documentation)
- Laravel: Redis database (documentation)