Re: mem use doubles due to buffer::create_page_aligned + bluestore obj content caching

On 3/6/2017 9:44 PM, Gregory Farnum wrote:
On Mon, Mar 6, 2017 at 9:35 AM, Igor Fedotov <ifedotov@xxxxxxxxxxxx> wrote:
Hi Cephers,

I've just created a ticket related to BlueStore object content caching in
particular and buffer::create_page_aligned in general.

But I'd like to share this information here as well, since the root cause
seems to be pretty global.

Ticket URL:

http://tracker.ceph.com/issues/19198

Description:

When caching object content, BlueStore uses twice as much memory as it
really needs for that amount of data.

The root cause seems to be in the buffer::create_page_aligned() implementation.
It results in the sequence

new raw_posix_aligned()

   calling mempool::buffer_data::alloc_char.allocate_aligned(len, align);

       calling posix_memalign((void**)(void*)&ptr, align, total);

which in fact performs two allocations:

1) one for the raw_posix_aligned struct,
2) one for the data itself (4096 bytes).

It looks like this sequence ends up consuming 2 * 4096 bytes instead of
sizeof(raw_posix_aligned) + alignment + 4096.
An additional problem is that the mempool machinery is unable to account for
this overhead, and hence BlueStore cache trimming doesn't work properly.

It's not clear to me why the allocator(s) behave so inefficiently for this
pattern, though.
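
For what it's worth, one way to peek at what the allocator hands back for this
pattern is malloc_usable_size() (a glibc extension that jemalloc and tcmalloc
also export). The following is just a standalone probe, not code from the tree,
and the usable size may still understate what an aligned size class really
pins, so the RSS numbers from the patch below remain the authoritative
measurement:

#include <malloc.h>   // malloc_usable_size()
#include <cstdio>
#include <cstdlib>

int main() {
  // Mimic the raw_posix_aligned pattern: a small bookkeeping allocation
  // followed by a page-aligned data allocation.
  void* hdr = ::malloc(24);            // roughly sizeof(raw_posix_aligned)
  void* data = nullptr;
  if (::posix_memalign(&data, 0x1000, 0x1000) != 0)
    return 1;
  printf("header: asked 24,   usable %zu\n", malloc_usable_size(hdr));
  printf("data:   asked 4096, usable %zu\n", malloc_usable_size(data));
  ::free(data);
  ::free(hdr);
  return 0;
}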

The issue is reproducible under Ubuntu 16.04.1 LTS with both jemalloc and
tcmalloc builds.


The ticket contains a patch to reproduce the issue; one can see that for 16 GB
of content (0x400000 pages of 4 KB each) the system memory usage tends to be
~32 GB.

The patch first allocates a 4 KB page 0x400000 times using:

...

+  size_t alloc_count = 0x400000; // allocate 16 Gb total
+  allocs.resize(alloc_count);
+  for( auto i = 0u; i < alloc_count; ++i) {
+    bufferptr p = buffer::create_page_aligned(bsize);
+    bufferlist* bl = new bufferlist;
+    bl->append(p);
+    *(bl->c_str()) = 0; // touch the page to increment system mem use

...

and then does the same thing while reproducing the create_page_aligned()
implementation:

+  struct fake_raw_posix_aligned{
+    char stub[8];
+    void* data;
+    fake_raw_posix_aligned() {
+      ::posix_memalign(&data, 0x1000, 0x1000); // mempool::buffer_data::alloc_char.allocate_aligned(0x1000, 0x1000);
+      *((char*)data) = 0; // touch the page
+    }
+    ~fake_raw_posix_aligned() {
+      ::free(data);
+    }
+  };
+  vector <fake_raw_posix_aligned*> allocs2;

+  allocs2.resize(alloc_count);
+  for( auto i = 0u; i < alloc_count; ++i) {
+    allocs2[i] = new fake_raw_posix_aligned();
...

The output shows ~32 GB of usage in both cases, exactly double the 16 GB that
the mempool reports:

Mem before: VmRSS: 45232 kB
Mem after: VmRSS: 33599524 kB
Mem actually used: 33554292 kB
Mem pool reports: 16777216 kB
Mem before2: VmRSS: 2161412 kB
Mem after2: VmRSS: 33632268 kB
Mem actually used: 32226156544 bytes


In general there are two issues here:
1) doubled memory usage;
2) mempool is unaware of this overhead and miscalculates the actual memory
usage.

There is probably a way to resolve 2) by forcing the use of
raw_combined::create() in buffer::create_page_aligned and tuning the mempool
calculation to take page alignment into account. But I'd like to get some
comments/thoughts first...
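
To illustrate what I mean, here is a minimal sketch of the combined-allocation
idea (the names and layout below are mine for illustration, not the actual
raw_combined code): carve the data and the bookkeeping struct out of a single
aligned allocation, so there is one allocator round-trip and the full
rounded-up size is known up front for accounting:

#include <cstddef>
#include <cstdlib>
#include <new>

struct combined_raw {
  char* data;
  size_t len;

  // Carve both the data and this header out of ONE aligned allocation:
  // |<---------- len ---------->|<- padding ->|<- combined_raw ->|
  static combined_raw* create(size_t len, size_t align = 0x1000) {
    size_t total = len + sizeof(combined_raw);
    total = (total + align - 1) & ~(align - 1);  // round up to alignment
    void* ptr = nullptr;
    if (::posix_memalign(&ptr, align, total) != 0)
      return nullptr;
    char* p = static_cast<char*>(ptr);
    // Data sits at the aligned start; the header lives at the tail.
    combined_raw* r = new (p + total - sizeof(combined_raw)) combined_raw;
    r->data = p;
    r->len = len;
    return r;  // accounting could charge 'total' here, not just 'len'
  }

  void destroy() {
    void* base = data;  // header and data share the same allocation
    this->~combined_raw();
    ::free(base);
  }
};

With a layout like this there is a single allocation per buffer, and the
accounting side knows the exact rounded-up footprint, which would also address
issue 2) above.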
Is this memory being allocated and then freed, so it's "just" imposing
extra work on malloc? Or are we leaking the old unaligned page as
well?
I don't see any issues after the free call. My concern is mostly about the
unexpectedly high memory usage while the data block is allocated, and the
mempool miscalculation related to that.
Surely this is more critical for long-lived allocations, e.g. data blocks in
the BlueStore cache.

I think we have (prior to BlueStore) only used these functions when
sending data over the wire or speaking to certain kinds of disks
(though I could be totally misremembering), at which point it's going
to be freed really quickly. That might explain why it hasn't come up
before; I hope we can just massage the implementation or interfaces
rather than have this bubble up way beyond the bufferlist internals...
Yeah, that explains the case a bit.
-Greg
