Hi Cephers,
I've just created a ticket about BlueStore object content caching in
particular and buffer::create_page_aligned in general.
I'd like to share the details here as well, since the root cause seems
to be fairly global.
Ticket URL:
http://tracker.ceph.com/issues/19198
Description:
When caching object content, BlueStore uses twice as much memory as it
actually needs for that amount of data.
The root cause appears to be in the buffer::create_page_aligned
implementation. It boils down to the sequence
new raw_posix_aligned()
calling mempool::buffer_data::alloc_char.allocate_aligned(len, align);
calling posix_memalign((void**)(void*)&ptr, align, total);
which in fact performs 2 allocations:
1) one for the raw_posix_aligned struct
2) one for the data itself (4096 bytes).
This sequence ends up consuming 2 * 4096 bytes per buffer instead of
sizeof(raw_posix_aligned) + alignment + 4096.
An additional catch is that the mempool machinery cannot account for
this overhead, and hence BlueStore cache trimming doesn't work properly.
It's not clear to me why the allocator(s) behave that inefficiently for
such a pattern, though.
The issue is reproducible on Ubuntu 16.04.1 LTS with both jemalloc and
tcmalloc builds.
The ticket contains a patch to reproduce the issue; with it one can see
that for 16 GB of content the system memory usage tends to be ~32 GB.
The patch first allocates 4K pages 0x400000 times using:
...
+ size_t alloc_count = 0x400000; // allocate 16 Gb total
+ allocs.resize(alloc_count);
+ for( auto i = 0u; i < alloc_count; ++i) {
+ bufferptr p = buffer::create_page_aligned(bsize);
+ bufferlist* bl = new bufferlist;
+ bl->append(p);
+ *(bl->c_str()) = 0; // touch the page to increment system mem use
...
and then does the same while mimicking the create_page_aligned()
implementation:
+ struct fake_raw_posix_aligned{
+ char stub[8];
+ void* data;
+ fake_raw_posix_aligned() {
+ ::posix_memalign(&data, 0x1000, 0x1000);
+ //mempool::buffer_data::alloc_char.allocate_aligned(0x1000, 0x1000);
+ *((char*)data) = 0; // touch the page
+ }
+ ~fake_raw_posix_aligned() {
+ ::free(data);
+ }
+ };
+ vector <fake_raw_posix_aligned*> allocs2;
+ allocs2.resize(alloc_count);
+ for( auto i = 0u; i < alloc_count; ++i) {
+ allocs2[i] = new fake_raw_posix_aligned();
...
The output shows ~32 GB usage in both cases:
Mem before: VmRSS: 45232 kB
Mem after: VmRSS: 33599524 kB
Mem actually used: 33554292 kB
Mem pool reports: 16777216 kB
Mem before2: VmRSS: 2161412 kB
Mem after2: VmRSS: 33632268 kB
Mem actually used: 32226156544 bytes
In general there are two issues here:
1) doubled memory usage;
2) mempool is unaware of this overhead and hence miscalculates the
actual memory usage.
There is probably a way to resolve 2) by forcing
buffer::create_page_aligned to use raw_combined::create() and by tuning
the mempool accounting to take page alignment into account. But I'd
like to get some comments/thoughts first...
Thanks,
Igor