On Mon, Mar 6, 2017 at 1:47 PM, Igor Fedotov <ifedotov@xxxxxxxxxxxx> wrote:
>
> On 3/6/2017 9:44 PM, Gregory Farnum wrote:
>>
>> On Mon, Mar 6, 2017 at 9:35 AM, Igor Fedotov <ifedotov@xxxxxxxxxxxx>
>> wrote:
>>>
>>> Hi Cephers,
>>>
>>> I've just created a ticket related to BlueStore object content caching
>>> in particular and buffer::create_page_aligned in general.
>>>
>>> But I'd like to additionally share this information here as well, since
>>> the root cause seems to be pretty global.
>>>
>>> Ticket URL:
>>>
>>> http://tracker.ceph.com/issues/19198
>>>
>>> Description:
>>>
>>> When caching object content, BlueStore uses twice as much memory as it
>>> really needs for that amount of data.
>>>
>>> The root cause seems to be in the buffer::create_page_aligned
>>> implementation. It results in the sequence
>>>
>>> new raw_posix_aligned()
>>>
>>> calling mempool::buffer_data::alloc_char.allocate_aligned(len, align);
>>>
>>> calling posix_memalign((void**)(void*)&ptr, align, total);
>>>
>>> which in fact does two allocations:
>>>
>>> 1) one for the raw_posix_aligned struct
>>> 2) one for the data itself (4096 bytes).
>>>
>>> It looks like this sequence ends up consuming 2 * 4096 bytes instead of
>>> sizeof(raw_posix_aligned) + alignment + 4096.
>>> To make things worse, the mempool machinery is unable to account for
>>> such an overhead, and hence BlueStore cache cleanup doesn't work
>>> properly.
>>>
>>> It's not clear to me why the allocator(s) behave that inefficiently for
>>> such a pattern, though.
>>>
>>> The issue is reproducible under Ubuntu 16.04.1 LTS for both jemalloc
>>> and tcmalloc builds.
>>>
>>> The ticket contains a patch to reproduce the issue, and one can see
>>> that for 16 GB of content, system memory usage tends to be ~32 GB.
>>>
>>> The patch first allocates 4K pages 0x400000 times using:
>>>
>>> ...
>>>
>>> +  size_t alloc_count = 0x400000; // allocate 16 GB total
>>> +  allocs.resize(alloc_count);
>>> +  for (auto i = 0u; i < alloc_count; ++i) {
>>> +    bufferptr p = buffer::create_page_aligned(bsize);
>>> +    bufferlist* bl = new bufferlist;
>>> +    bl->append(p);
>>> +    *(bl->c_str()) = 0; // touch the page to increment system mem use
>>>
>>> ...
>>>
>>> and then does the same while reproducing the create_page_aligned()
>>> implementation by hand:
>>>
>>> +  struct fake_raw_posix_aligned {
>>> +    char stub[8];
>>> +    void* data;
>>> +    fake_raw_posix_aligned() {
>>> +      ::posix_memalign(&data, 0x1000, 0x1000); //mempool::buffer_data::alloc_char.allocate_aligned(0x1000, 0x1000);
>>> +      *((char*)data) = 0; // touch the page
>>> +    }
>>> +    ~fake_raw_posix_aligned() {
>>> +      ::free(data);
>>> +    }
>>> +  };
>>> +  vector<fake_raw_posix_aligned*> allocs2;
>>> +  allocs2.resize(alloc_count);
>>> +  for (auto i = 0u; i < alloc_count; ++i) {
>>> +    allocs2[i] = new fake_raw_posix_aligned();
>>> ...
>>>
>>> The output shows ~32 GB usage in both cases:
>>>
>>> Mem before: VmRSS: 45232 kB
>>> Mem after: VmRSS: 33599524 kB
>>> Mem actually used: 33554292 kB
>>> Mem pool reports: 16777216 kB
>>> Mem before2: VmRSS: 2161412 kB
>>> Mem after2: VmRSS: 33632268 kB
>>> Mem actually used: 32226156544 bytes
>>>
>>> In general there are two issues here:
>>> 1) Doubled memory usage.
>>> 2) mempool is unaware of such an overhead and miscalculates the actual
>>> memory usage.
>>>
>>> There is probably a way to resolve 2) by forcing raw_combined::create()
>>> use in buffer::create_page_aligned and tuning the mempool calculation
>>> to take page alignment into account. But I'd like to get some
>>> comments/thoughts first...
>>
>> Is this memory being allocated and then freed, so it's "just" imposing
>> extra work on malloc? Or are we leaking the old unaligned page as
>> well?
>
> I don't see any issues after the free call. I'm mostly concerned about
> the unexpectedly high memory usage while the data block is allocated,
> and the mempool miscalculation related to that.
> Surely this is more critical for long-lived allocations, e.g. data
> blocks in the BlueStore cache.

Yeah, that's what I meant by "leak", which I realize isn't quite the
typical usage. Do you have any proposed patches or fixes to deal with
it? :)
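
For readers who want to try this outside a Ceph build, here is a condensed,
standalone sketch along the lines of the second half of Igor's reproducer
(illustrative code written for this summary, not the ticket's patch): it
interleaves a small header allocation with a page-aligned page allocation,
as raw_posix_aligned does, and reads VmRSS from /proc/self/status before
and after. The struct name is taken from the patch; everything else is
hypothetical.

  #include <cstdio>
  #include <cstdlib>
  #include <vector>

  // Read VmRSS in kB from /proc/self/status (Linux-specific).
  static long vmrss_kb() {
    FILE *f = std::fopen("/proc/self/status", "r");
    char line[256];
    long kb = -1;
    while (f && std::fgets(line, sizeof(line), f)) {
      if (std::sscanf(line, "VmRSS: %ld kB", &kb) == 1)
        break;
    }
    if (f)
      std::fclose(f);
    return kb;
  }

  // Mirrors the struct from Igor's patch: an 8-byte stub standing in for
  // the raw_posix_aligned bookkeeping, plus one 4 KB page-aligned buffer.
  struct fake_raw_posix_aligned {
    char stub[8];
    void *data;
    fake_raw_posix_aligned() {
      ::posix_memalign(&data, 0x1000, 0x1000);
      *static_cast<char *>(data) = 0;  // touch the page so it lands in RSS
    }
    ~fake_raw_posix_aligned() { ::free(data); }
  };

  int main() {
    // 0x100000 * 4 KB = 4 GB of payload (the ticket used 16 GB); adjust
    // to taste for the machine at hand.
    const size_t count = 0x100000;
    std::printf("VmRSS before: %ld kB\n", vmrss_kb());
    std::vector<fake_raw_posix_aligned *> allocs(count);
    for (size_t i = 0; i < count; ++i)
      allocs[i] = new fake_raw_posix_aligned();
    // On the reporter's Ubuntu 16.04.1 jemalloc/tcmalloc builds, RSS grew
    // by roughly twice the payload actually requested.
    std::printf("VmRSS after: %ld kB\n", vmrss_kb());
    for (auto *a : allocs)
      delete a;
    return 0;
  }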
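
And a minimal sketch of the combined-allocation direction Igor mentions
(all names below are hypothetical; the real buffer::raw_combined in Ceph
differs in detail): the bookkeeping header lives at the tail of the same
aligned region as the data, so each buffer costs a single allocation whose
full size, alignment padding included, a mempool could account for.

  #include <cstddef>
  #include <cstdlib>
  #include <new>

  struct combined_sketch {
    char *data;        // start of the aligned region
    std::size_t len;   // usable data length

    static combined_sketch *create(std::size_t len, std::size_t align) {
      // One allocation holds the data followed by the header itself.
      void *ptr = nullptr;
      if (::posix_memalign(&ptr, align, len + sizeof(combined_sketch)) != 0)
        throw std::bad_alloc();
      // Construct the header in place at the tail of the region; assumes
      // len leaves the tail suitably aligned for the header, which holds
      // for page-sized buffers. A mempool would charge the single combined
      // size here instead of guessing at two separate allocations.
      char *base = static_cast<char *>(ptr);
      return new (base + len) combined_sketch{base, len};
    }

    static void destroy(combined_sketch *c) {
      char *base = c->data;  // the region starts at the data pointer
      c->~combined_sketch();
      ::free(base);          // one free releases header and data together
    }
  };

Note that this mainly addresses issue 2 (the accounting): with a full 4 KB
payload the combined request still exceeds one page, so the allocator may
still reserve two pages per buffer.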