Re: mem use doubles due to buffer::create_page_aligned + bluestore obj content caching

On Mon, Mar 6, 2017 at 9:35 AM, Igor Fedotov <ifedotov@xxxxxxxxxxxx> wrote:
> Hi Cephers,
>
> I've just created a ticket related to bluestore object content caching in
> particular and buffer::create_page_aligned in general.
>
> But I'd like to additionally share this information here as well since the
> root cause seems to be pretty global.
>
> Ticket URL:
>
> http://tracker.ceph.com/issues/19198
>
> Description:
>
> When caching object content, BlueStore uses twice as much memory as it
> really needs for that amount of data.
>
> The root cause seems to be in the buffer::create_page_aligned implementation.
> In effect it results in
> new raw_posix_aligned()
>
>   calling mempool::buffer_data::alloc_char.allocate_aligned(len, align);
>
>       calling  posix_memalign((void**)(void*)&ptr, align, total);
>
> a sequence that in fact does two allocations:
>
> 1) one for the raw_posix_aligned struct itself,
> 2) one for the data (4096 bytes).
>
> It looks like this sequence causes a 2 * 4096 byte allocation instead of
> sizeof(raw_posix_aligned) + alignment + 4096.
> The additional catch is that the mempool machinery is unable to account for
> such overhead, and hence BlueStore cache cleanup doesn't work properly.
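>
> To make the pattern explicit, the per-buffer cost boils down to something
> like this (a simplified sketch only, not the actual buffer.cc code):
>
>   #include <cstdlib>   // posix_memalign, free
>
>   struct raw_posix_aligned_sketch {      // stands in for raw_posix_aligned
>     char* data = nullptr;
>     explicit raw_posix_aligned_sketch(std::size_t len) {
>       // allocation #2: the page-aligned payload; this is the part the
>       // buffer_data mempool accounts for
>       ::posix_memalign((void**)(void*)&data, 0x1000, len);
>     }
>     ~raw_posix_aligned_sketch() { ::free(data); }
>   };
>
>   // allocation #1: the small bookkeeping object, a separate heap chunk
>   raw_posix_aligned_sketch* r = new raw_posix_aligned_sketch(0x1000);
>
> i.e. two heap chunks per cached 4K page, with only the 4096 payload bytes
> visible to the buffer_data accounting.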
>
> It's not clear to me why the allocator(s) behave that inefficiently for such
> a pattern, though.
>
> The issue is reproducible under Ubuntu 16.04.1 LTS for both jemalloc and
> tcmalloc builds.
>
>
> The ticket contains a patch to reproduce the issue, and one can see that
> for 16 GB of content the system memory usage tends to be ~32 GB.
>
> The patch first allocates 4K pages 0x400000 times using:
>
> ...
>
> +  size_t alloc_count = 0x400000; // allocate 16 Gb total
> +  allocs.resize(alloc_count);
> +  for( auto i = 0u; i < alloc_count; ++i) {
> +    bufferptr p = buffer::create_page_aligned(bsize);
> +    bufferlist* bl = new bufferlist;
> +    bl->append(p);
> +    *(bl->c_str()) = 0; // touch the page to increment system mem use
>
> ...
>
> then does the same, reproducing the create_page_aligned() implementation:
>
> +  struct fake_raw_posix_aligned{
> +    char stub[8];
> +    void* data;
> +    fake_raw_posix_aligned() {
> +      ::posix_memalign(&data, 0x1000, 0x1000);
> //mempool::buffer_data::alloc_char.allocate_aligned(0x1000, 0x1000);
> +      *((char*)data) = 0; // touch the page
> +    }
> +    ~fake_raw_posix_aligned() {
> +      ::free(data);
> +    }
> +  };
> +  vector <fake_raw_posix_aligned*> allocs2;
>
> +  allocs2.resize(alloc_count);
> +  for( auto i = 0u; i < alloc_count; ++i) {
> +    allocs2[i] = new fake_raw_posix_aligned();
> ...
>
> The output shows ~32 GB usage in both cases.
>
> Mem before: VmRSS: 45232 kB
> Mem after: VmRSS: 33599524 kB
> Mem actually used: 33554292 kB
> Mem pool reports: 16777216 kB
> Mem before2: VmRSS: 2161412 kB
> Mem after2: VmRSS: 33632268 kB
> Mem actually used: 32226156544 bytes
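>
> Put differently, for the first run:
>
>   accounted by the mempool:  0x400000 * 4 KiB        = 16 GiB
>   actual RSS growth:         33599524 kB - 45232 kB  ≈ 32 GiB
>
> i.e. roughly 8 KiB of resident memory for every 4 KiB buffer.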
>
>
> In general there are two issues here:
> 1) Doubled memory usage
> 2) mempool is unaware of this overhead and miscalculates the actual memory
> usage.
>
> There is probably a way to resolve 2) by forcing the use of raw_combined::create()
> in buffer::create_page_aligned and tuning the mempool accounting to take page
> alignment into account. But I'd like to get some comments/thoughts first...
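>
> For illustration, something along these lines could fold the two allocations
> into one (just a rough sketch of the raw_combined idea, names and details are
> illustrative, not the actual code):
>
>   #include <cstdlib>
>   #include <new>
>
>   struct raw_combined_sketch {
>     char*       data;
>     std::size_t len;
>
>     // One aligned chunk holds both the payload and this header, so there is
>     // a single allocation whose full size the mempool can account for.
>     static raw_combined_sketch* create(std::size_t len, std::size_t align) {
>       // the header lands at data + len; fine while len is a multiple of the
>       // header's alignment (true for page-sized buffers)
>       std::size_t total = len + sizeof(raw_combined_sketch);
>       void* p = nullptr;
>       if (::posix_memalign(&p, align, total) != 0)
>         return nullptr;
>       raw_combined_sketch* r =
>         new (static_cast<char*>(p) + len) raw_combined_sketch;
>       r->data = static_cast<char*>(p);
>       r->len = len;
>       return r;
>     }
>
>     static void destroy(raw_combined_sketch* r) {
>       void* p = r->data;               // start of the combined chunk
>       r->~raw_combined_sketch();
>       ::free(p);
>     }
>   };
>
> That would also help with 2), since the accounting could then be done on
> 'total' rather than on the bare data length.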

Is this memory being allocated and then freed, so it's "just" imposing
extra work on malloc? Or are we leaking the old unaligned page as
well?

I think we have (prior to BlueStore) only used these functions when
sending data over the wire or speaking to certain kinds of disks
(though I could be totally misremembering), at which point it's going
to be freed really quickly. That might explain why it's not come up
before; I hope we can just massage the implementation or interfaces
rather than this bubbling up way beyond the bufferlist internals...
-Greg


