Re: mem use doubles due to buffer::create_page_aligned + bluestore obj content caching

On Mon, 13 Mar 2017, Igor Fedotov wrote:
> On 11.03.2017 0:24, Gregory Farnum wrote:
> > On Mon, Mar 6, 2017 at 1:47 PM, Igor Fedotov <ifedotov@xxxxxxxxxxxx> wrote:
> > > On 3/6/2017 9:44 PM, Gregory Farnum wrote:
> > > > On Mon, Mar 6, 2017 at 9:35 AM, Igor Fedotov <ifedotov@xxxxxxxxxxxx>
> > > > wrote:
> > > > > Hi Cephers,
> > > > > 
> > > > > I've just created a ticket related to bluestore object content caching
> > > > > in
> > > > > particular and buffer::create_page_aligned in general.
> > > > > 
> > > > > But I'd like to additionally share this information here as well since
> > > > > the
> > > > > root cause seems to be pretty global.
> > > > > 
> > > > > Ticket URL:
> > > > > 
> > > > > http://tracker.ceph.com/issues/19198
> > > > > 
> > > > > Description:
> > > > > 
> > > > > When caching object content, BlueStore uses twice as much memory as
> > > > > it really needs for that amount of data.
> > > > > 
> > > > > The root cause seems to be in buffer::create_page_aligned
> > > > > implementation.
> > > > > Actually it results in
> > > > > new raw_posix_aligned()
> > > > > 
> > > > >     calling mempool::buffer_data::alloc_char.allocate_aligned(len, align);
> > > > > 
> > > > >         calling  posix_memalign((void**)(void*)&ptr, align, total);
> > > > > 
> > > > > sequence that in fact does 2 allocations:
> > > > > 
> > > > > 1) for raw_posix_aligned struct
> > > > > 2) for data itself (4096 bytes).
> > > > > 
> > > > > It looks like this sequence causes 2 * 4096 bytes allocation instead
> > > > > of
> > > > > sizeof(raw_posix_aligned) + alignment + 4096.
> > > > > The additional trick is that mempool stuff is unable to estimate such
> > > > > an
> > > > > overhead and hence BlueStore cache cleanup doesn't work properly.
> > > > > 
> > > > > It's not clear to me why the allocator(s) behave so inefficiently for
> > > > > such a pattern, though.
> > > > > 
> > > > > The issue is reproducible under Ubuntu 16.04.1 LTS for both jemalloc
> > > > > and
> > > > > tcmalloc builds.
> > > > > 
> > > > > 
> > > > > The ticket contains a patch to reproduce the issue, and one can see
> > > > > that for 16 GB of content, system memory usage tends to be ~32 GB.
> > > > > 
> > > > > The patch first allocates 4K pages 0x400000 times using:
> > > > > 
> > > > > ...
> > > > > 
> > > > > +  size_t alloc_count = 0x400000; // allocate 16 Gb total
> > > > > +  allocs.resize(alloc_count);
> > > > > +  for( auto i = 0u; i < alloc_count; ++i) {
> > > > > +    bufferptr p = buffer::create_page_aligned(bsize);
> > > > > +    bufferlist* bl = new bufferlist;
> > > > > +    bl->append(p);
> > > > > +    *(bl->c_str()) = 0; // touch the page to increment system mem use
> > > > > 
> > > > > ...
> > > > > 
> > > > > and then does the same with a reproduction of the create_page_aligned() implementation:
> > > > > 
> > > > > +  struct fake_raw_posix_aligned{
> > > > > +    char stub[8];
> > > > > +    void* data;
> > > > > +    fake_raw_posix_aligned() {
> > > > > +      ::posix_memalign(&data, 0x1000, 0x1000); // mempool::buffer_data::alloc_char.allocate_aligned(0x1000, 0x1000);
> > > > > +      *((char*)data) = 0; // touch the page
> > > > > +    }
> > > > > +    ~fake_raw_posix_aligned() {
> > > > > +      ::free(data);
> > > > > +    }
> > > > > +  };
> > > > > +  vector <fake_raw_posix_aligned*> allocs2;
> > > > > 
> > > > > +  allocs2.resize(alloc_count);
> > > > > +  for( auto i = 0u; i < alloc_count; ++i) {
> > > > > +    allocs2[i] = new fake_raw_posix_aligned();
> > > > > ...
> > > > > 
> > > > > The output shows 32 GB usage in both cases.

This is really disconcerting.  If you take out the memalign in the 
fake_raw_posix_aligned ctor, does it use 16 GB?  Or is it really just that the 
order of the allocations (new, then posix_memalign, then new, ...) 
makes the allocator consume a full page for each 
fake_raw_posix_aligned?  And/or, can you confirm that the 
fake_raw_posix_aligned pointers are on page boundaries?

What if all the fake_raw_posix_aligned structs are allocated first, and 
*then* the data pages?
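
One way to answer both questions is a small standalone experiment. The
following is only a sketch under assumptions: FakeRaw, alloc_page and
run_pattern are hypothetical names that mimic the reproducer from the ticket,
not Ceph code. It allocates the small headers and the 4K data pages either
interleaved or in two separate passes, and counts how many header pointers
land exactly on a page boundary. It does not measure memory use itself; RSS
would have to be observed separately for each mode.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

constexpr std::size_t kPage = 0x1000;

// Hypothetical stand-in for buffer::raw_posix_aligned: a small header that
// owns one separately allocated, page-aligned data page.
struct FakeRaw {
  char stub[8];
  void* data = nullptr;
};

struct PatternResult {
  std::size_t headers_on_page_boundary = 0;
  std::size_t data_on_page_boundary = 0;
  std::size_t failed_allocs = 0;
};

// Allocate one page-aligned data page for the header and touch it.
static bool alloc_page(FakeRaw* h) {
  assert(h != nullptr);
  if (::posix_memalign(&h->data, kPage, kPage) != 0) {
    h->data = nullptr;
    return false;
  }
  *static_cast<char*>(h->data) = 0;  // touch the page so it becomes resident
  return true;
}

static bool on_page_boundary(const void* p) {
  return reinterpret_cast<std::uintptr_t>(p) % kPage == 0;
}

// interleaved == true : new, posix_memalign, new, posix_memalign, ...
// interleaved == false: all headers first, then all data pages.
PatternResult run_pattern(std::size_t count, bool interleaved) {
  std::vector<FakeRaw*> headers(count, nullptr);
  PatternResult r;
  if (interleaved) {
    for (auto& h : headers) {
      h = new FakeRaw;
      if (!alloc_page(h)) ++r.failed_allocs;
    }
  } else {
    for (auto& h : headers) h = new FakeRaw;
    for (auto& h : headers) {
      if (!alloc_page(h)) ++r.failed_allocs;
    }
  }
  for (const FakeRaw* h : headers) {
    if (on_page_boundary(h)) ++r.headers_on_page_boundary;
    if (h->data && on_page_boundary(h->data)) ++r.data_on_page_boundary;
  }
  for (FakeRaw* h : headers) {
    ::free(h->data);
    delete h;
  }
  return r;
}
```

Calling run_pattern(n, true) and run_pattern(n, false) with the same n and
comparing headers_on_page_boundary shows whether the interleaved order pushes
each header onto its own page. The result depends on the allocator in use
(glibc malloc, tcmalloc, jemalloc), so no particular outcome is claimed here.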

> > Do you have any proposed patches or fixes to deal with it? :)
> Just some thoughts, none of them seems ideal though...
> 1) Get rid of bluestore *content* cache. IMO object metadata caching is much
> more important and hence it's better to use memory for that. As a result no
> long-living aligned page allocations...
> 2) Do not use raw_posix_aligned in buffer::create_aligned and roll back to
> raw_combined::create(). For the latter, mempool's mechanics can be fixed to
> measure alignment overhead properly, and hence the BlueStore cache will handle
> memory limits properly. Memory use for page-aligned buffers is still inefficient,
> though...

The simple heuristic that only does raw_combined for smaller buffers is 
based on the assumption that the allocator isn't stupid and can consume 
less than a full page of overhead for the buffer::raw stuff.  If that's 
truly not the case, then I think there's no reason not to unconditionally 
use raw_combined.  That doesn't seem right, though!
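
For reference, the "combined" idea discussed above can be sketched as a single
aligned allocation that holds the data first and the small header right after
it. This is a minimal illustration only: CombinedRaw, combined_create,
combined_destroy and combined_footprint are hypothetical names, and the layout
is an assumption about the general technique, not a copy of Ceph's
buffer::raw_combined.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <new>

// Hypothetical header type; it lives in the same allocation as its data.
struct CombinedRaw {
  char* data;
  std::size_t len;
};

constexpr std::size_t round_up(std::size_t n, std::size_t a) {
  return (n + a - 1) / a * a;
}

// Bytes requested from the allocator for a buffer of `len` bytes.
std::size_t combined_footprint(std::size_t len) {
  return round_up(len, alignof(CombinedRaw)) + sizeof(CombinedRaw);
}

// One posix_memalign call: [ data (padded) | CombinedRaw header ].
// Assumes `align` is a power of two and at least alignof(CombinedRaw).
CombinedRaw* combined_create(std::size_t len, std::size_t align) {
  const std::size_t datalen = round_up(len, alignof(CombinedRaw));
  void* p = nullptr;
  if (::posix_memalign(&p, align, datalen + sizeof(CombinedRaw)) != 0) {
    return nullptr;
  }
  char* base = static_cast<char*>(p);
  // Construct the header in place, directly after the data region.
  return new (base + datalen) CombinedRaw{base, len};
}

void combined_destroy(CombinedRaw* r) {
  if (r == nullptr) return;
  char* base = r->data;  // start of the single underlying allocation
  r->~CombinedRaw();
  ::free(base);
}
```

With this layout there is only one allocation per buffer, so its size is known
exactly and can be accounted for. Note that for a 4096-byte page-aligned buffer
the request becomes 4096 + sizeof(CombinedRaw) bytes, which an allocator may
well round up to a larger size class; that would match the concern above that
memory use for page-aligned buffers stays inefficient.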

sage