Song recommendations as an Impureim Sandwich
Does your data fit in RAM?
This article is part of a series on functional programming design alternatives. In a previous article you saw how to add enough Characterisation Tests to capture the intended behaviour of the example song recommendations system originally presented by Oleksii Holub in the article Pure-Impure Segregation Principle.
Problem statement #
After showing how one problem can be refactored to pure functions, Oleksii Holub writes:
"Although very useful, the type of "lossless" refactoring shown earlier only works if the data required by the function can be easily encapsulated within its input parameters. Unfortunately, this is not always the case.
"Often a function may need to dynamically resolve data from an external API or a database, with no way of knowing about it beforehand. This typically results in an implementation where pure and impure concerns are interleaved with each other, creating a tightly coupled cohesive structure."
The article then proceeds to present the song recommendations example. It's a single C# method that queries a data store or third-party service to recommend songs. I'm imagining that it queries a third-party web service that contains usage data for a system like Spotify.
"The above algorithm works by retrieving the user's most listened songs, finding other people who have also listened to the same titles, and then extracting their top songs as well. Those songs are then aggregated into a list of recommendations and returned to the caller.
"It's quite clear that this function would benefit greatly from being pure, seeing how much business logic is encapsulated within it. Unfortunately, the technique we relied upon earlier won't work here.
"In order to fully isolate
GetRecommendationsAsync(...)
from its impure dependencies, we would have to somehow supply the function with an entire list of songs, users, and their scrobbles upfront. If we assume that we're dealing with data on millions of users, it's obvious that this would be completely impractical and likely even impossible."
It does, indeed, sound impractical.
Data sizes #
Can you, however, trust your intuition? Research suggests that the human brain is ill-equipped to think about randomness and probabilities, and I've observed something similar when it comes to data sizes.
In the real world, a million of anything countable is an almost incomprehensible amount, so it's no wonder if our intuition fails us. A million records sounds like a lot, but if it's only a few integers, is it really that bad?
Many systems use 32-bit integers for various IDs. A million IDs, then, is 32 million bits, or approximately 4 MB. As I'm writing this, the smallest Azure instance (Free F1) has 1 GB of memory, and while the OS takes a big bite out of that, 4 MB is nothing.
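The arithmetic is simple enough to sketch in a couple of lines of C#:

long idCount = 1_000_000;
long bytes = idCount * sizeof(int); // 1,000,000 × 4 bytes = 4,000,000 bytes ≈ 4 MB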
The song recommendations problem implies larger memory pressure. It may not fit on every machine, but it's worth considering whether it might, after all, fit in RAM.
My real-life experience with developing streaming services #
It just so happens that I have professional experience developing REST APIs for a white-label audio streaming service. Back in the early 2010s I helped design and implement the company's online music catalogue, user system, and a few other services. The catalogue is particularly interesting in this regard, since it only changed nightly, and we were planning on relying on HTTP for caching.
I vividly recall a meeting we had with the IT operations specialist responsible for the new catalogue service. We explained that we'd set HTTP cache timeouts to 6 hours, and asked if he'd be able to set up a reverse proxy so that we didn't have to implement caching in our code base.
He asked how much cache space we needed.
We knew the size of a typical HTTP response, and the number of tracks, artists, and albums in the system, so after a back-of-the-envelope calculation, we told him: 18 GB.
He just shrugged and said "OK".
In 2012 I thought that 18 GB was a fair amount of data (I actually still think that). Even so, the operations team had plenty of servers with that capacity.
Later, I did more work for that company, but most of it is less relevant to the song recommendations example. What does turn out to be relevant to the topic is something I learned the first few days of my engagement.
Early on, before I was involved, the company needed a recommendation system, but hadn't been able to find any off-the-shelf component. This was in the early 2000s and before Solr, but after Lucene. I'm not aware of all the forces that affected my then future client, but in the end, they decided to write their own search and recommendations engine.
Essentially, during the night a beefy server would read all relevant data from the database, crunch it, create data structures, and keep all data in memory. Like the reverse proxy, it required a server with more RAM than a normal pizza box, but not prohibitively so.
Costs #
Consider the cost of hardware compared to developer time. A few specialised servers may set your organisation back a few thousand dollars/pounds/euros. That's an amount you can easily burn through in salary if the code is too complicated, or has too many bugs.
You may argue that if you already have programmers on staff, they don't cost extra, but a too-complicated code base is still going to slow them down. Thus, the wrong software design could incur an opportunity cost greater than the cost of a server.
One of many reasons I'm interested in functional programming (FP) is its potential to make code bases simpler. The Impureim Sandwich is a wonderfully simple design, so it's worth pursuing; not only for some FP ideal, but because of its simplifying potential.
Intuition may tell us that the song recommendations scenario is prohibitively big, and that an Impureim Sandwich is therefore out of the question. As this overall article series explores, it's not the only alternative, but given its advantages, it's worth giving it a second chance.
Analysis #
The GetRecommendationsAsync method from the example makes a lot of external calls, with its nested loops. The method uses the first call to GetTopScrobblesAsync to produce the scrobblesSnapshot variable, which is capped at 100 objects. If we assume that this method call returns at least 100 objects, the outer foreach loop will make 100 calls to GetTopListenersAsync.
If we again assume that each of these returns enough data, the inner foreach loop will make 20 calls to GetTopScrobblesAsync for each object in the outer loop. That's 2,000 external calls, plus the 100 calls in the outer loop, plus the initial call to GetTopScrobblesAsync, for a total of 2,101.
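To make the arithmetic explicit, here's the count as a small C# sketch; the constants are the assumptions stated above, not values from the original code:

const int outerCalls = 100;        // one GetTopListenersAsync call per top scrobble
const int innerCallsPerOuter = 20; // GetTopScrobblesAsync calls per batch of listeners
int totalCalls = 1 + outerCalls + outerCalls * innerCallsPerOuter; // 1 + 100 + 2,000 = 2,101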
When I first saw the example, I didn't know much about the overall context. I didn't know if these impure actions were database queries or web service calls, so I asked Oleksii Holub.
It turns out that it's all web service calls, and as I interpret the response, GetRecommendationsAsync is being invoked from a background maintenance process.
"It takes around 10 min in total while maintaining it."
That's good to know, because if we're going to consider an Impureim Sandwich, it implies reading gigabytes of data in the first phase. That's going to take some time, but if this is a background process, we do have time.
Memory estimates #
One thing is to load an entire song catalogue into memory. That's what required 18 GB in 2012. Another thing is to load all scrobbles; i.e. statistics about plays. Fortunately, in order to produce song recommendations, we only need IDs. Consider again the data structures from the previous article:
public sealed record Song(int Id, bool IsVerifiedArtist, byte Rating);
public sealed record Scrobble(Song Song, int ScrobbleCount);
public sealed record User(string UserName, int TotalScrobbleCount);
Apart from the UserName, all values are small, predictable values: int, byte, and bool. While a string may be arbitrarily long, we can make a guess at the average size of a user name. In the previous article, I assumed that the user name would be an alphanumeric string between one and twenty characters.
How many songs might a system contain? Some numbers thrown around for a system like Spotify suggest a number on the order of 100 million. With an int, a bool, and a byte, we can estimate that a song requires 6 bytes, plus some overhead. Let's guess 8 bytes. 100 million songs would then require 800 million bytes, or around 800 MB. That eliminates the smallest cloud instances, but is in itself easily within reach for all modern computers. Your phone has more memory than that.
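In code, that estimate looks like this; the 8-byte figure is the guess from above:

long songCount = 100_000_000;
long bytesPerSong = 8; // int + bool + byte, rounded up to allow for overhead
long catalogueBytes = songCount * bytesPerSong; // 800,000,000 bytes ≈ 800 MB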
How about scrobbles? While I don't use Spotify, I do scrobble plays to Last.fm. At the moment I have around 114,000 scrobbles, and while I don't listen to music as much as I used to when I was younger, I have, on the other hand, been at it for a long time: Since 2007. If we assume that each user has 200,000 scrobbles, and a scrobble requires 8 bytes, that's 1,600,000 bytes, or 1.6 MB. Practically nothing.
The size of a User object depends on how long the user name is, but will probably, on average, be less than 32 bytes. Compared to the user's scrobbles, we can ignore the memory pressure of the user object itself.
As the number of users grows, the scrobble data will dominate the memory requirements. How many users should we assume?
A million is probably too few, but for a frame of reference, that would require 1.6 TB. This is where it starts to sound unfeasible to keep all data in RAM. Even though servers with that much RAM exist, they're so expensive (still) that the above cost consideration no longer applies.
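Summarised as a sketch, still under the stated assumptions:

long scrobblesPerUser = 200_000;
long bytesPerScrobble = 8;
long bytesPerUser = scrobblesPerUser * bytesPerScrobble; // 1,600,000 bytes ≈ 1.6 MB
long userCount = 1_000_000;
long totalBytes = userCount * bytesPerUser; // 1,600,000,000,000 bytes ≈ 1.6 TB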
Still, there are some naive assumptions above. Instead of storing each scrobble in a separate Scrobble object, you could store repeated plays as a single object with the appropriate ScrobbleCount value. If you've listened to the same song 50 times, it doesn't require 400 bytes of storage, but only 8 bytes. That is, after all, orders of magnitude less.
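As a sketch, using the record types shown earlier (the values are, of course, made up):

var song = new Song(Id: 42, IsVerifiedArtist: true, Rating: 5);
var scrobble = new Scrobble(song, ScrobbleCount: 50); // one small object instead of fifty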
In the end, back-of-the-envelope calculations are fine, but measurements are better. It might be worthwhile to develop a proof of concept and measure how much memory it requires.
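Such a proof of concept could, for example, compare GC.GetTotalMemory readings before and after loading the data. A minimal sketch, where LoadAllScrobbles is a hypothetical loader:

long before = GC.GetTotalMemory(forceFullCollection: true);
var scrobbles = LoadAllScrobbles(); // hypothetical: load the entire data set
long after = GC.GetTotalMemory(forceFullCollection: true);
Console.WriteLine($"Approximately {after - before:N0} bytes used");
GC.KeepAlive(scrobbles); // keep the data alive until after the measurement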
In three articles, I'll explore how a song recommendations Impureim Sandwich looks in various constellations:
- Song recommendations as a C# Impureim Sandwich
- Song recommendations as an F# Impureim Sandwich
- Song recommendations as a Haskell Impureim Sandwich
In the end, it may turn out that for this particular system, an Impureim Sandwich truly is unfeasible. Keep in mind, though, that the purpose of this article series is to demonstrate alternative designs. The song recommendations problem is just a placeholder example. Perhaps you have another system where, intuitively, an Impureim Sandwich sounds impossible, but once you run the numbers, it might actually be not only possible, but desirable.
Conclusion #
Modern computers have such vast storage capacities that intuition often fails us. We may think that billions of data points can't possibly fit in RAM. When you run the numbers, however, it may turn out that the required data fits on a normal server.
If so, an Impureim Sandwich may still be an option. Load data into memory, pass it as argument to a pure function, and handle the return value.
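In outline, and with hypothetical names, such a sandwich might look like this:

var users = await repository.GetAllUsersAsync();           // impure: load everything up front
var recommendations = GetRecommendations(userName, users); // pure: all decision logic
await writer.SaveRecommendationsAsync(recommendations);    // impure: handle the result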
The song recommendations scenario is interesting because an Impureim Sandwich seems to be pushing the envelope. It probably is impractical, but still worth a proof of concept. On the other hand, if it's impractical, it's worthwhile to also explore alternatives. Later articles will do that, but first, if you're interested, the next articles look at the proofs of concept in three languages.
Next: Song recommendations as a C# Impureim Sandwich.
This blog is totally free, but if you like it, please consider supporting it.