Does finding the number of unique elements in a set actually require comparison ...

ReleaseCandidat · on May 17, 2024

> Does finding the number of unique elements in a set

That's easy, that's all of them. Sorry, could not resist.

Yes, hashing is the usual method. In a sorted list you can compare to the following element.
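For concreteness, here is a minimal sketch of the two exact approaches just mentioned: a hash set, and a single pass over an already sorted list comparing each element to the one after it. The function names are made up for illustration.

```python
from typing import Hashable, Iterable, Sequence


def count_unique_hashing(items: Iterable[Hashable]) -> int:
    """Exact count via a hash set; memory grows with the number of unique items."""
    return len(set(items))


def count_unique_sorted(sorted_items: Sequence) -> int:
    """Exact count for an already sorted sequence, using O(1) extra memory."""
    if len(sorted_items) == 0:
        return 0
    count = 1  # the last element always closes a run of equal values
    for i in range(len(sorted_items) - 1):
        # A run of equal values ends wherever an element differs from the next one.
        if sorted_items[i] != sorted_items[i + 1]:
            count += 1
    return count


words = "to be or not to be".split()
print(count_unique_hashing(words))         # hash-set approach
print(count_unique_sorted(sorted(words)))  # sorted-neighbour approach
```

Both give the same exact answer. The difference is where the cost goes: the set holds every unique element in memory, while the sorted version needs the data sorted first.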

seadan83 · on May 17, 2024

Imagine 1PB of data, of which you expect 30% to be unique. That needs 300TB of RAM to store the unique elements. Keep in mind that the values in a hash table are the elements themselves, so that is a hashtable of perhaps 300TB. Doing that without that much RAM, even by swapping to disk, can be tough.

jangxx · on May 17, 2024

Using a hashtable is the "normal" approach mentioned in the article. It works of course, but requires memory to store each unique element (or their hashes). If you have less memory available, the described algorithm can still give a very good approximation.
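A rough sketch of what such a bounded-memory estimator can look like, assuming the article describes a sampling scheme in the style of the CVM algorithm (an assumption, since the thread never names it). The name `estimate_distinct` is hypothetical. It also uses a common simplification: it keeps halving the buffer until there is room, rather than giving up when one halving removes nothing.

```python
import random
from typing import Hashable, Iterable, Optional


def estimate_distinct(stream: Iterable[Hashable], thresh: int,
                      rng: Optional[random.Random] = None) -> float:
    """Estimate the number of distinct items while holding fewer than `thresh` of them."""
    rng = rng or random.Random()
    p = 1.0          # current probability that a seen item is kept in the buffer
    kept = set()     # bounded sample of the stream
    for item in stream:
        # Forget any earlier decision about this item, then re-decide with probability p.
        kept.discard(item)
        if rng.random() < p:
            kept.add(item)
        # Buffer full: drop each kept item with probability 1/2 and halve p.
        # (Assumed simplification: repeat until there is room.)
        while len(kept) >= thresh:
            kept = {x for x in kept if rng.random() < 0.5}
            p /= 2.0
    # Each surviving item stands in for roughly 1/p distinct items.
    return len(kept) / p


print(estimate_distinct("abracadabra", thresh=100))  # buffer never fills, so the count is exact
print(estimate_distinct(range(10_000), thresh=1_000, rng=random.Random(0)))  # approximate
```

When `thresh` exceeds the true number of distinct items, nothing is ever dropped and the answer is exact. Otherwise accuracy depends on `thresh`: a larger buffer gives a tighter estimate at the cost of more memory.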

Anduia · on May 17, 2024

Using a hashtable is effective because you only compare elements within their hash buckets, not across the entire set. However, hashtables can become inefficient with very large datasets due to memory usage and processing time, which is where approximate counts shine.

datavirtue · on May 17, 2024

This algorithm is still generating a lot of random numbers. I would guess that this is much less overhead than hashing, but it still seems like it could be significant.

theginger · on May 17, 2024

That is fine when you have, say, 1 million values and only 1,000 are unique. But when you have 1 million values and about 900 thousand are unique, you are putting more or less the whole data set into memory.
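To make that scaling concrete, here is a small sketch that counts how many entries an exact hash set ends up holding for a synthetic stream. The helper name and the modulo-based stream are invented for the illustration.

```python
def entries_held_by_exact_set(total_values: int, distinct_values: int) -> int:
    """Feed a synthetic stream into a set and report how many entries it holds."""
    seen = set()
    for i in range(total_values):
        # i % distinct_values cycles through exactly `distinct_values` different values.
        seen.add(i % distinct_values)
    return len(seen)


few = entries_held_by_exact_set(1_000_000, 1_000)
many = entries_held_by_exact_set(1_000_000, 900_000)
print(few, many)  # the set's size tracks the distinct count, not the stream length
```

The stream length is the same in both runs. Only the number of distinct values decides how much the set has to hold.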

xwolfi · on May 17, 2024

Imagine a million elements. How big must your hashtable be? The article explains it very well, did you miss it? It's a way to save memory.

But to be honest, I implemented it and ran it on Hamlet, and the result was very wrong. It's barely useful, except maybe if you just need a vague idea...

cb321 · on May 17, 2024

How big was your thresh? I found it pretty accurate: https://news.ycombinator.com/item?id=40388878