Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

Does finding the number of unique elements in a set actually require comparison of each element with everything else? Can't you use a hashtable? For every element, add it to the table (ignore if already exists), and finally, take a count of keys.


> Does finding the number of unique elements in a set

That's easy, that's all of them. Sorry, could not resist.

Yes, hashing is the usual method. In a sorted list you can compare to the following element.


Imagine 1PB of data and you expect 30% of it to be unique. That needs 300TB RAM to store unique elements. Keep in mind the values in a hash table are the elements themselves, so a hastable of perhaps 300TB. Doing that without that much RAM, even swapping to disk can be tough.


Using a hashtable is the "normal" approach mentioned in the article. It works of course, but requires memory to store each unique element (or their hashes). If you have less memory available, the described algorithm can still give a very good approximation.


Using a hashtable is effective because you only compare elements within their hash buckets, not the entire set. However, they can become inefficient with very large datasets due to memory usage and processing time, which is where approximate counts shine.


This algorithm is still spinning a lot of random. I would guess that this is much less overhead than hashing but still seems like it could be significant.


That is fine when you have say 1 million values and only 1000 are unique. But when you have 1 million values and about 900 thousand are unique you are putting more or less the whole data set into memory.


Imagine a million elements. How big must your hashtable be ? The article explains it very well, did you miss it ? It's a way to save memory.

But to be honest I implemented it, ran it on Hamlet, and it's very wrong, it's barely useful but maybe if you just need a vague idea...


How big was your thresh? I found it pretty accurate: https://news.ycombinator.com/item?id=40388878




Consider applying for YC's Summer 2026 batch! Applications are open till May 4

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: