FWIW, the actual memory cost is higher than the numbers we've quoted -- possibly considerably higher. The usage consists principally of a large number of small allocations, and no attempt has been made to account for the allocator's own overhead.

I'm not that concerned with the amount of memory consumed by a decoded oNode/Shard, because I don't expect to have many of them around at once. Realistically, I only see a benefit in caching decoded oNode/Shards that are being used sequentially, and I doubt we need more than 2 or 3 of those per client connection (hundreds in total?). For randomly accessed oNode/Shards, you're better off caching them in the encoded format within the KV subsystem itself.

Hence I care A LOT about the time needed to serialize/deserialize the oNode/Shard, and that's the part of the problem we should focus on. I believe that as we optimize the TIME required to serialize/deserialize, we will end up shrinking the SPACE required as well, as an unintended consequence.
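To make that concrete, below is a rough, self-contained micro-benchmark sketch of the kind of measurement I mean. FakeShardEntry, encode_shard() and decode_shard() are hypothetical stand-ins, not the real BlueStore types or encode paths; the point is only to report nanoseconds per encode+decode round trip for a fully fragmented 4M object (1024 entries at 4K granularity), so that number can be tracked as the encoding evolves.

// Rough sketch of an encode/decode micro-benchmark. FakeShardEntry,
// encode_shard() and decode_shard() are hypothetical stand-ins -- a real
// harness would exercise the actual onode / extent-map shard encoding.
#include <chrono>
#include <cstdint>
#include <cstring>
#include <iostream>
#include <vector>

struct FakeShardEntry {          // stand-in for one extent-map entry
  uint64_t logical_offset;
  uint64_t blob_offset;
  uint32_t length;
};

// Naive flat encode of a whole shard into a contiguous byte buffer.
static void encode_shard(const std::vector<FakeShardEntry>& shard,
                         std::vector<uint8_t>& out) {
  out.resize(shard.size() * sizeof(FakeShardEntry));
  std::memcpy(out.data(), shard.data(), out.size());
}

static void decode_shard(const std::vector<uint8_t>& in,
                         std::vector<FakeShardEntry>& shard) {
  shard.resize(in.size() / sizeof(FakeShardEntry));
  std::memcpy(shard.data(), in.data(), in.size());
}

int main() {
  // 1024 entries ~= one fully fragmented 4M object at 4K granularity.
  std::vector<FakeShardEntry> shard(1024);
  for (uint64_t i = 0; i < shard.size(); ++i)
    shard[i] = {i * 4096, 0, 4096};

  std::vector<uint8_t> buf;
  std::vector<FakeShardEntry> decoded;
  const int iters = 100000;

  auto t0 = std::chrono::steady_clock::now();
  for (int i = 0; i < iters; ++i) {
    encode_shard(shard, buf);
    decode_shard(buf, decoded);
  }
  auto t1 = std::chrono::steady_clock::now();

  auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0);
  std::cout << ns.count() / iters << " ns per encode+decode round trip, "
            << buf.size() << " encoded bytes per shard\n";
  return 0;
}

Swapping the stand-in encode/decode for the real onode/extent-map shard encode path (bluestore_onode_t and friends), run against the same 1024-fragment layout Igor used, would give us the number that actually matters here.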
Allen Samuels
SanDisk | a Western Digital brand
2880 Junction Avenue, Milpitas, CA 95134
T: +1 408 801 7030 | M: +1 408 780 6416
allen.samuels@xxxxxxxxxxx

> -----Original Message-----
> From: Igor Fedotov [mailto:ifedotov@xxxxxxxxxxxx]
> Sent: Monday, December 19, 2016 6:46 AM
> To: Allen Samuels <Allen.Samuels@xxxxxxxxxxx>; Sage Weil <sage@xxxxxxxxxxxx>
> Cc: ceph-devel <ceph-devel@xxxxxxxxxxxxxxx>
> Subject: Re: BlueStore in-memory Onode footprint
>
> On 17.12.2016 1:08, Allen Samuels wrote:
> > I'm not sure what the conclusion from this is.
> IMHO the numbers I shared are pretty high and we should consider some
> ways to reduce them.
>
> > The point of the sharding exercise was to eliminate the need to
> > serialize/deserialize all 1024 Extents/Blobs/SharedBlobs on each I/O
> > transaction.
> >
> > This shows that a fully populated oNode with ALL of the shards present
> > is large. But that ought to be a rare occurrence.
> Actually the cost of each individual Blob entry is pretty high as well,
> and that hurts cache effectiveness in the general case, since we can
> keep fewer entries cached in total.
>
> > This test shows that each Blob is 248 bytes and that each SharedBlob
> > is 216 bytes. That matches the sizeof(...), so the MemPool logic got
> > the right answer! Yay!
> >
> > Looking at the Blob I see:
> >
> > Bluestore_blob_t  72 bytes
> > Bufferlist        88 bytes
> > Extentrefmap      64 bytes
> >
> > That's most of the 248. I suspect that trying to fix this will require
> > a new strategy, etc.
> >
> > Allen Samuels
> > SanDisk | a Western Digital brand
> > 2880 Junction Avenue, San Jose, CA 95134
> > T: +1 408 801 7030 | M: +1 408 780 6416
> > allen.samuels@xxxxxxxxxxx
> >
> >> -----Original Message-----
> >> From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Sage Weil
> >> Sent: Friday, December 16, 2016 7:20 AM
> >> To: Igor Fedotov <ifedotov@xxxxxxxxxxxx>
> >> Cc: ceph-devel <ceph-devel@xxxxxxxxxxxxxxx>
> >> Subject: Re: BlueStore in-memory Onode footprint
> >>
> >> On Fri, 16 Dec 2016, Igor Fedotov wrote:
> >>> Hey All!
> >>>
> >>> Recently I realized that I'm unable to fit all my onodes (32768
> >>> objects / 4MB each / 4K alloc unit / no csum) in a 15G RAM cache.
> >>>
> >>> Hence I decided to estimate the in-memory Onode size.
> >>>
> >>> At first I filled a 4MB object with a single 4M write -- the mempools
> >>> indicate ~5K mem usage for the total onode metadata. Good enough.
> >>>
> >>> Then I refilled that object with 4K writes. Resulting mem usage = 574K!!!
> >>> The Onode itself is 704 bytes in 1 object; 4120 other metadata items
> >>> occupy all the remaining space.
> >>>
> >>> Next I removed SharedBlob from the mempools. Resulting mem usage =
> >>> 355K, with the same Onode size and 3096 other metadata objects. Hence
> >>> the 1024 SharedBlob instances took ~220K.
> >>>
> >>> And finally I removed the Blob instances from the measurements.
> >>> Resulting mem usage = 99K and 2072 other objects. Hence the Blob
> >>> instances take another ~250K.
> >> Yikes!
> >>
> >> BTW you can get a full breakdown by type with 'mempool debug = true'
> >> in ceph.conf (-o 'mempool debug = true' on the vstart.sh command line)
> >> without having to recompile. Do you mind repeating the test and
> >> including the full breakdown?
> >>
> >>> Yeah, that's the worst case (actually enabling csum will use even
> >>> more memory), but shouldn't we revisit some Onode internals given
> >>> such numbers?
> >>
> >> Yep!
> >>
> >> sage
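A quick back-of-the-envelope cross-check of the figures above (arithmetic only, not an additional measurement): the fully fragmented 4M object carries 1024 Blobs and 1024 SharedBlobs, so

  1024 SharedBlobs x 216 bytes = ~221K   (measured delta: 574K - 355K = ~219K)
  1024 Blobs       x 248 bytes = ~254K   (measured delta: 355K -  99K = ~256K)

The per-instance sizeof() figures and the mempool deltas therefore agree to within a few K, and the residual ~99K covers the Onode, Extents and the remaining metadata items -- all before any allocator overhead, per the note at the top of this message.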