Re: Fwd: [ceph-users] ceph osd commit latency increase over time, until restart

On Mon, 4 Mar 2019, Xiaoxi Chen wrote:
> [Resend with pure text]
> 
> Hi List,
> 
>      After a 3+ day bake, the bitmap allocator shows much better
> performance characteristics compared to stupid. osd.44 is nautilus +
> bitmap allocator, osd.19 is luminous + bitmap, and as a comparison,
> osd.406 just did a fresh restart but continues with luminous + stupid.
> See https://pasteboard.co/I3SkfuN.png for the figure.
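> 
>      (For anyone who wants to repeat the comparison: as far as I know the
> allocator is selected per OSD with the bluestore_allocator option and is
> picked up on OSD restart, e.g. in ceph.conf:
> 
>     [osd]
>     bluestore_allocator = bitmap
> 
> with "stupid" being the luminous default.)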

I just want to follow this up with a warning:

  ** DO NOT USE BITMAP ALLOCATOR ON LUMINOUS IN PRODUCTION **

We made luminous default to StupidAllocator because we saw instability
with the (old) bitmap implementation.  In Nautilus, there is a completely
new implementation (with a similar design).  Now that we've pinpointed the
problem, it is likely we'll backport the new implementation to luminous.

Thanks!
sage


> 
>      Nautilus shows the best performance consistency, with a max of 20 ms
> compared to 92 ms in luminous (https://pasteboard.co/I3SkxxK.png).
> 
>       At the same time, as Igor pointed out, the stupid allocator can get
> fragmented and there is no de-fragmentation functionality in it, so it gets
> slower over time. This theory can be proven by the mempool status of the
> OSDs before and after a reboot; you can see the tree shrink by roughly 9x.
> 
> Before reboot:
> 
>     "bluestore_alloc": {
>         "items": 915127024,
>         "bytes": 915127024
>     },
> 
> After reboot:
> 
>     "bluestore_alloc": {
>         "items": 104727568,
>         "bytes": 104727568
>     },
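> 
> (These numbers come from the OSD admin socket; something like
> 
>     ceph daemon osd.<id> dump_mempools
> 
> should print the per-pool items/bytes, including the bluestore_alloc
> section quoted above.)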
> 
> 
> 
>      I am extending the Nautilus deployment to 1 rack, and in another
> rack I changed the min_alloc_size from 4K to 32K, to see if it can
> relieve the b-tree a bit.
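> 
> (For reproducibility: as far as I know bluestore_min_alloc_size is baked
> in when an OSD is created, so the larger value only applies to OSDs that
> are re-deployed after setting it, e.g. in ceph.conf:
> 
>     [osd]
>     bluestore_min_alloc_size = 32768
> 
> Existing OSDs keep the min_alloc_size they were created with.)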
>      Also trying Igor's testing branch at
> https://github.com/ceph/ceph/commits/wip-ifed-reset-allocator-luminous.
> 
> Xiaoxi
> 
> Igor Fedotov <ifedotov@xxxxxxx> wrote on Sun, Mar 3, 2019 at 3:24 AM:
> >
> > Hi Xiaoxi,
> >
> > Please note that this PR is a proof-of-concept, hence I didn't try to
> > implement the best algorithm.
> >
> > But IMO your approach is not viable (or at least isn't that simple),
> > since the freelist manager (and RocksDB) doesn't contain an up-to-date
> > allocator state at an arbitrary moment in time - in-flight transactions
> > might have pending allocations that were already processed by the
> > allocator but haven't landed in the DB yet. So one is unable to restore
> > a valid allocator state from the FreelistManager without finalizing all
> > the transactions first. Which looks a bit troublesome...
> >
> >
> > Thanks,
> >
> > Igor
> >
> >
> > On 3/2/2019 6:41 PM, Xiaoxi Chen wrote:
> > > Hi Igor,
> > >      Thanks, no worry, I will build it locally and test; I will update
> > > this thread if I get anything.
> > >
> > >
> > > The commit
> > > https://github.com/ceph/ceph/commit/8ee87c22bcd88a8911d58936cec9049e0932fb77 makes
> > > sense, though the concern is that the full defragmentation will take a long time.
> > >
> > >      Do you think it would be faster to use the freelist manager rather
> > > than iterating over every btree, inserting into a common one and
> > > re-building the original b-tree?  The freelist manager should always be
> > > de-fragmented at any point in time.
> > >
> > >   // Walk the persisted free list and feed every free extent back into
> > >   // the allocator; this is essentially what happens on OSD start-up.
> > >   uint64_t num = 0, bytes = 0;
> > >   fm->enumerate_reset();
> > >   uint64_t offset, length;
> > >   while (fm->enumerate_next(&offset, &length)) {
> > >     alloc->init_add_free(offset, length);  // re-insert free extent
> > >     ++num;                                 // count of free extents
> > >     bytes += length;                       // total free bytes
> > >   }
> > >
> > > Xiaoxi
> > >
> > > Igor Fedotov <ifedotov@xxxxxxx> wrote on Sat, Mar 2, 2019 at 3:23 AM:
> > >
> > >     Xiaoxi,
> > >
> > >     Here is the luminous patch which performs a StupidAllocator reset
> > >     once every 12 hours.
> > >
> > >     https://github.com/ceph/ceph/tree/wip-ifed-reset-allocator-luminous
> > >
> > >     Sorry, didn't have enough time today to learn how to make a
> > >     package from it, just sources for now.
> > >
> > >
> > >     Thanks,
> > >
> > >     Igor
> > >
> > >
> > >     On 3/1/2019 11:46 AM, Xiaoxi Chen wrote:
> > >>     Igor,
> > >>        I can test the patch if we have a package.
> > >>        My environment and workload can consistently reproduce the
> > >>     latency 2-3 days after restarting.
> > >>         Sage told me to try the bitmap allocator to make sure the
> > >>     stupid allocator is the bad guy. I have some OSDs on luminous +
> > >>     bitmap and some OSDs on 14.1.0 + bitmap. Both look positive so
> > >>     far, but I need more time to be sure.
> > >>          The perf, log and admin socket analysis lead to the theory
> > >>     that in alloc_int the loop sometimes takes a long time with the
> > >>     allocator locks held. This blocks the release part called from
> > >>     _txc_finish in the kv_finalize_thread, and that thread is also
> > >>     the one that calculates state_kv_committing_lat and the overall
> > >>     commit_lat. You can see from the admin socket that
> > >>     state_done_latency has a similar trend to commit_latency.
> > >>         But we cannot find a theory to explain why a reboot helps:
> > >>     the allocator btree will be rebuilt from the freelist manager
> > >>     and it should be exactly the same as it was prior to the reboot.
> > >>     Anything related to PG recovery?
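> > >>
> > >>         To make the suspected pattern concrete, here is a tiny,
> > >>     self-contained sketch (not Ceph code; the names are only
> > >>     stand-ins): one thread holds the allocator lock during a long
> > >>     scan while the finalize thread, whose time feeds commit_lat,
> > >>     blocks on the same lock.
> > >>
> > >>     #include <chrono>
> > >>     #include <iostream>
> > >>     #include <mutex>
> > >>     #include <thread>
> > >>
> > >>     std::mutex alloc_lock;  // stands in for the allocator's internal lock
> > >>
> > >>     void alloc_loop() {     // stands in for the long search in alloc_int
> > >>       std::lock_guard<std::mutex> l(alloc_lock);
> > >>       std::this_thread::sleep_for(std::chrono::milliseconds(90));
> > >>     }
> > >>
> > >>     void kv_finalize() {    // stands in for the release path in _txc_finish
> > >>       auto t0 = std::chrono::steady_clock::now();
> > >>       std::lock_guard<std::mutex> l(alloc_lock);  // waits for alloc_loop
> > >>       auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
> > >>           std::chrono::steady_clock::now() - t0).count();
> > >>       std::cout << "finalize blocked for " << ms << " ms\n";
> > >>     }
> > >>
> > >>     int main() {
> > >>       std::thread a(alloc_loop);
> > >>       std::this_thread::sleep_for(std::chrono::milliseconds(10));
> > >>       std::thread b(kv_finalize);
> > >>       a.join();
> > >>       b.join();
> > >>       return 0;
> > >>     }
> > >>
> > >>         The longer the allocator scan holds the lock, the longer the
> > >>     finalize thread is blocked, which would show up exactly as the
> > >>     commit latency growth we observe.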
> > >>
> > >>        Anyway, as I have a live env and workload, I am more than
> > >>     willing to work with you on further investigation.
> > >>
> > >>     -Xiaoxi
> > >>
> > >>     Igor Fedotov <ifedotov@xxxxxxx> wrote on Fri, Mar 1, 2019 at 6:21 AM:
> > >>
> > >>         Also I think it makes sense to create a ticket at this point.
> > >>         Any
> > >>         volunteers?
> > >>
> > >>         On 3/1/2019 1:00 AM, Igor Fedotov wrote:
> > >>         > Wondering if somebody would be able to apply a simple patch that
> > >>         > periodically resets StupidAllocator?
> > >>         >
> > >>         > Just to verify/disprove the hypothesis that it's allocator related.
> > >>         >
> > >>         > On 2/28/2019 11:57 PM, Stefan Kooman wrote:
> > >>         >> Quoting Wido den Hollander (wido@xxxxxxxx):
> > >>         >>> Just wanted to chime in, I've seen this with
> > >>         >>> Luminous+BlueStore+NVMe OSDs as well. Over time their
> > >>         >>> latency increased until we started to notice I/O-wait
> > >>         >>> inside VMs.
> > >>         >> On a Luminous 12.2.8 cluster with only SSDs we also hit
> > >>         >> this issue I guess. After restarting the OSD servers the
> > >>         >> latency would drop to normal values again. See
> > >>         >> https://owncloud.kooman.org/s/BpkUc7YM79vhcDj
> > >>         >>
> > >>         >> Reboots were finished at ~ 19:00.
> > >>         >>
> > >>         >> Gr. Stefan
> > >>         >>
> > >>
> 
> 
> 
