Bloom filters, CID bit distribution, and index complexity?

I made a post on the IPFS subreddit (link in context) a while ago asking what others there thought about possible ways of building an index for this kind of decentralized storage system. Unfortunately, while the post did get some upvotes, it didn't generate much conversation (the only comment is, pleasantly enough, from the user there who most consistently has insightful things to say).

I won't repeat myself here, but I do have a strange feeling that, with a large enough CID hash, something like a per-node bloom filter might actually make sense. Whether that means the current hash size or something dramatically larger, I don't know (there is a joke about the computational complexity of multiplication implemented with FFT in there somewhere).
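To make "per-node bloom filter" concrete, here is a minimal sketch in Python. It is only an illustration, not how IPFS works today: the class name, the double-hashing scheme over SHA-256, and the treatment of CIDs as opaque byte strings are all my own assumptions. The false-positive estimate is the standard approximation for a filter with m bits, k hash functions, and n inserted items.

```python
import hashlib
import math


class BloomFilter:
    """A minimal Bloom filter over byte strings (illustrative only)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Assumption: derive k bit positions from one SHA-256 digest
        # using double hashing (h1 + i * h2).
        digest = hashlib.sha256(item).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd stride
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def expected_false_positive_rate(num_bits, num_hashes, num_items):
    """Standard approximation: (1 - e^(-k*n/m))^k."""
    return (1.0 - math.exp(-num_hashes * num_items / num_bits)) ** num_hashes


# Hypothetical usage: placeholder strings stand in for real CIDs.
node_filter = BloomFilter(num_bits=8192, num_hashes=5)
for i in range(100):
    node_filter.add(f"placeholder-cid-{i}".encode())
print(b"placeholder-cid-7" in node_filter)
print(expected_false_positive_rate(8192, 5, 100))
```

A filter like this never gives false negatives, so a node could advertise it instead of a full CID list, with a false-positive rate that depends on how many CIDs the node provides relative to the filter size.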

The annoying thing is that I could actually run a relevant experiment on this idea if I could query the nodes for what they are providing. This data isn't private: nodes broadcast it to their peers when connecting (or at least to the nodes they think should be part of the index; I'm not quite sure how that part of the system works). It just isn't exposed in a convenient way, which might be for a good reason, or might simply be because it didn't seem to be the point or wasn't worth the maintenance effort.

However, if I could get that information, I could at least collect some live network data and emulate the behaviour of my proposed system pretty easily. Testing with different hashes would be more complicated, but knowing which CIDs each node provides would be enough to test with the current hash.
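As a rough idea of what that emulation could look like, here is a sketch that uses synthetic data in place of the live per-node CID lists I don't have. Everything here is an assumption: random 32-byte strings stand in for CID digests, the function names and parameters are made up for illustration, and no real IPFS API is called. It builds one filter per node, asks every filter about every CID, and counts how often a node that does not provide a CID claims it might.

```python
import hashlib
import random


def bit_positions(item, num_bits, num_hashes):
    # Assumption: double hashing over a SHA-256 digest, as a stand-in
    # for whatever hashing a real design would use.
    digest = hashlib.sha256(item).digest()
    h1 = int.from_bytes(digest[:8], "big")
    h2 = int.from_bytes(digest[8:16], "big") | 1
    return [(h1 + i * h2) % num_bits for i in range(num_hashes)]


def build_filter(items, num_bits, num_hashes):
    # The filter is stored as one Python int used as a bitmask.
    bits = 0
    for item in items:
        for pos in bit_positions(item, num_bits, num_hashes):
            bits |= 1 << pos
    return bits


def simulate(num_nodes=20, cids_per_node=100, num_bits=2048,
             num_hashes=4, seed=0):
    rng = random.Random(seed)
    # Synthetic stand-in for "what each node provides".
    provided = [
        {rng.getrandbits(256).to_bytes(32, "big") for _ in range(cids_per_node)}
        for _ in range(num_nodes)
    ]
    filters = [build_filter(s, num_bits, num_hashes) for s in provided]

    false_negatives = false_positives = non_provider_checks = 0
    for cids in provided:
        for cid in cids:
            positions = bit_positions(cid, num_bits, num_hashes)
            for node, bits in enumerate(filters):
                claims = all((bits >> pos) & 1 for pos in positions)
                if cid in provided[node]:
                    if not claims:
                        false_negatives += 1  # should never happen
                else:
                    non_provider_checks += 1
                    if claims:
                        false_positives += 1
    return {
        "false_negatives": false_negatives,
        "false_positives": false_positives,
        "false_positive_rate": false_positives / non_provider_checks,
    }


print(simulate())
```

With real per-node provider lists, the synthetic `provided` sets would be swapped for the collected data, and `num_bits` could be varied to see where the filters stop being useful.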

I could be talking total nonsense on this topic, but I do have an odd feeling that there is a hash size threshold where this idea begins to make sense.

Hmm,
Jeff.

View discussion: https://www.reddit.com/r/ipfs/comments/1ljyvpx/bloom_filters_cid_bit_distribution_and_index/