On Tue, Feb 18, 2014 at 9:28 PM, Alex Rousskov <rousskov@xxxxxxxxxxxxxxxxxxxxxxx> wrote: > On 02/18/2014 04:11 PM, Rajiv Desai wrote: > >> It seems like there is a bug where an object overwrites previously >> written object: > > This is not really a bug -- hash collisions do happen, as you have > discovered and Rock store resolves them by overwriting older entries if > possible (or not caching the newer ones otherwise). The smaller your > cache, the more collisions you will have for the same number of unique > cache keys. > > How many total slots does your rock cache_dir have? 13107198 (200 GB with 16KB slots). I think I was thrown off by the hit % in mgr:info stats. Are these stats accurate when using 8 SMP workers? Is the % underreported? I couldnt find an absolute value of hits/misses from stats. :/ Any pointers? > > >> Needs a better hash as pointed out by the TODO :) >> I will send out a patch. > > Thank you. No hash function will eliminate collisions though. If > somebody wants to eliminate side-effects of hash collisions on recently > added objects, they would have to implement a different anchor > allocation algorithm that resolves hash collisions through chains (for > example). I doubt that would be easy though! > > Yes, I realized that as I read the code further. I will try to give it a shot but I agree that this will be non trivial. Currently sharding data across multiple cache_dirs will reduce collisions, correct? >> Is the key guaranteed to be a null terminated string? > > No. The key is a 128-bit opaque buffer, essentially. > > >> I intend to use std::tr1::hash > > Please keep in mind that the key is an MD5 sum already. See > store_key_md5.cc. I am not a hashing expert by any means, but the MD5 > sum should not really need much further hashing. > > It is possible that the k[0]+k[1] sum in the code below should be > removed in favor of either k[1] or k[0] alone, but again, this may not > have any significant effect (except for any specific case where the two > given URLs used to collide). > I tried k[0] ^ k[1] which seemed to provide fewer collisions (67 collisions in 39129 GETs). Haven't quantified and compared i detail though. > >> <code> >> sfileno >> Ipc::StoreMap::anchorIndexByKey(const cache_key *const key) const >> { >> const uint64_t *const k = reinterpret_cast<const uint64_t *>(key); >> // TODO: use a better hash function >> return (k[0] + k[1]) % shared->limit; >> } >> </code> > > > HTH, > > Alex. >