On Mon, Mar 6, 2017 at 1:47 PM, Igor Fedotov <ifedotov@xxxxxxxxxxxx> wrote:
>
> On 3/6/2017 9:44 PM, Gregory Farnum wrote:
>>
>> On Mon, Mar 6, 2017 at 9:35 AM, Igor Fedotov <ifedotov@xxxxxxxxxxxx>
>> wrote:
>>>
>>> Hi Cephers,
>>>
>>> I've just created a ticket related to BlueStore object content caching
>>> in particular and buffer::create_page_aligned in general.
>>>
>>> But I'd like to additionally share this information here as well, since
>>> the root cause seems to be pretty global.
>>>
>>> Ticket URL:
>>>
>>> http://tracker.ceph.com/issues/19198
>>>
>>> Description:
>>>
>>> When caching object content, BlueStore uses twice as much memory as it
>>> really needs for that amount of data.
>>>
>>> The root cause seems to be in the buffer::create_page_aligned
>>> implementation. It results in the sequence
>>>
>>> new raw_posix_aligned()
>>>
>>> calling mempool::buffer_data::alloc_char.allocate_aligned(len, align);
>>>
>>> calling posix_memalign((void**)(void*)&ptr, align, total);
>>>
>>> which in fact does two allocations:
>>>
>>> 1) one for the raw_posix_aligned struct
>>> 2) one for the data itself (4096 bytes).
>>>
>>> It looks like this sequence ends up consuming 2 * 4096 bytes instead of
>>> sizeof(raw_posix_aligned) + alignment + 4096.
>>> To make things worse, the mempool machinery is unable to account for
>>> such an overhead, and hence BlueStore cache cleanup doesn't work
>>> properly.
>>>
>>> It's not clear to me why the allocator(s) behave that inefficiently for
>>> such a pattern, though.
>>>
>>> The issue is reproducible under Ubuntu 16.04.1 LTS for both jemalloc
>>> and tcmalloc builds.
>>>
>>> The ticket contains a patch to reproduce the issue, and one can see
>>> that for 16 GB of content, system memory usage tends to be ~32 GB.
>>>
>>> The patch first allocates 4K pages 0x400000 times using:
>>>
>>> ...
>>>
>>> +  size_t alloc_count = 0x400000; // allocate 16 GB total
>>> +  allocs.resize(alloc_count);
>>> +  for (auto i = 0u; i < alloc_count; ++i) {
>>> +    bufferptr p = buffer::create_page_aligned(bsize);
>>> +    bufferlist* bl = new bufferlist;
>>> +    bl->append(p);
>>> +    *(bl->c_str()) = 0; // touch the page to increment system mem use
>>>
>>> ...
>>>
>>> and then does the same while reproducing the create_page_aligned()
>>> implementation by hand:
>>>
>>> +  struct fake_raw_posix_aligned {
>>> +    char stub[8];
>>> +    void* data;
>>> +    fake_raw_posix_aligned() {
>>> +      ::posix_memalign(&data, 0x1000, 0x1000); //mempool::buffer_data::alloc_char.allocate_aligned(0x1000, 0x1000);
>>> +      *((char*)data) = 0; // touch the page
>>> +    }
>>> +    ~fake_raw_posix_aligned() {
>>> +      ::free(data);
>>> +    }
>>> +  };
>>> +  vector<fake_raw_posix_aligned*> allocs2;
>>> +  allocs2.resize(alloc_count);
>>> +  for (auto i = 0u; i < alloc_count; ++i) {
>>> +    allocs2[i] = new fake_raw_posix_aligned();
>>> ...
>>>
>>> The output shows ~32 GB usage in both cases:
>>>
>>> Mem before: VmRSS: 45232 kB
>>> Mem after: VmRSS: 33599524 kB
>>> Mem actually used: 33554292 kB
>>> Mem pool reports: 16777216 kB
>>> Mem before2: VmRSS: 2161412 kB
>>> Mem after2: VmRSS: 33632268 kB
>>> Mem actually used: 32226156544 bytes
>>>
>>> In general there are two issues here:
>>> 1) Doubled memory usage.
>>> 2) mempool is unaware of such an overhead and miscalculates the actual
>>> memory usage.
>>>
>>> There is probably a way to resolve 2) by forcing raw_combined::create()
>>> use in buffer::create_page_aligned and tuning the mempool calculation
>>> to take page alignment into account. But I'd like to get some
>>> comments/thoughts first...
>>
>> Is this memory being allocated and then freed, so it's "just" imposing
>> extra work on malloc? Or are we leaking the old unaligned page as
>> well?
>
> I don't see any issues after the free call. I'm mostly concerned about
> the unexpectedly high memory usage while the data block is allocated,
> and the mempool miscalculation related to that.
> Surely this is more critical for long-lived allocations, e.g. data
> blocks in the BlueStore cache.

Yeah, that's what I meant by "leak", which I realize isn't quite the
typical usage. Do you have any proposed patches or fixes to deal with
it? :)
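
For readers who want to try this outside a Ceph build, here is a condensed,
standalone sketch along the lines of the second half of Igor's reproducer
(illustrative code written for this summary, not the ticket's patch): it
interleaves a small header allocation with a page-aligned page allocation,
as raw_posix_aligned does, and reads VmRSS from /proc/self/status before
and after. The struct name is taken from the patch; everything else is
hypothetical.

  #include <cstdio>
  #include <cstdlib>
  #include <vector>

  // Read VmRSS in kB from /proc/self/status (Linux-specific).
  static long vmrss_kb() {
    FILE *f = std::fopen("/proc/self/status", "r");
    char line[256];
    long kb = -1;
    while (f && std::fgets(line, sizeof(line), f)) {
      if (std::sscanf(line, "VmRSS: %ld kB", &kb) == 1)
        break;
    }
    if (f)
      std::fclose(f);
    return kb;
  }

  // Mirrors the struct from Igor's patch: an 8-byte stub standing in for
  // the raw_posix_aligned bookkeeping, plus one 4 KB page-aligned buffer.
  struct fake_raw_posix_aligned {
    char stub[8];
    void *data;
    fake_raw_posix_aligned() {
      ::posix_memalign(&data, 0x1000, 0x1000);
      *static_cast<char *>(data) = 0;  // touch the page so it lands in RSS
    }
    ~fake_raw_posix_aligned() { ::free(data); }
  };

  int main() {
    // 0x100000 * 4 KB = 4 GB of payload (the ticket used 16 GB); adjust
    // to taste for the machine at hand.
    const size_t count = 0x100000;
    std::printf("VmRSS before: %ld kB\n", vmrss_kb());
    std::vector<fake_raw_posix_aligned *> allocs(count);
    for (size_t i = 0; i < count; ++i)
      allocs[i] = new fake_raw_posix_aligned();
    // On the reporter's Ubuntu 16.04.1 jemalloc/tcmalloc builds, RSS grew
    // by roughly twice the payload actually requested.
    std::printf("VmRSS after: %ld kB\n", vmrss_kb());
    for (auto *a : allocs)
      delete a;
    return 0;
  }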
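
And a minimal sketch of the combined-allocation direction Igor mentions
(all names below are hypothetical; the real buffer::raw_combined in Ceph
differs in detail): the bookkeeping header lives at the tail of the same
aligned region as the data, so each buffer costs a single allocation whose
full size, alignment padding included, a mempool could account for.

  #include <cstddef>
  #include <cstdlib>
  #include <new>

  struct combined_sketch {
    char *data;        // start of the aligned region
    std::size_t len;   // usable data length

    static combined_sketch *create(std::size_t len, std::size_t align) {
      // One allocation holds the data followed by the header itself.
      void *ptr = nullptr;
      if (::posix_memalign(&ptr, align, len + sizeof(combined_sketch)) != 0)
        throw std::bad_alloc();
      // Construct the header in place at the tail of the region; assumes
      // len leaves the tail suitably aligned for the header, which holds
      // for page-sized buffers. A mempool would charge the single combined
      // size here instead of guessing at two separate allocations.
      char *base = static_cast<char *>(ptr);
      return new (base + len) combined_sketch{base, len};
    }

    static void destroy(combined_sketch *c) {
      char *base = c->data;  // the region starts at the data pointer
      c->~combined_sketch();
      ::free(base);          // one free releases header and data together
    }
  };

Note that this mainly addresses issue 2 (the accounting): with a full 4 KB
payload the combined request still exceeds one page, so the allocator may
still reserve two pages per buffer.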