>> Now that we've pinpointed the problem, it is likely we'll backport the
>> new implementation to luminous.

Hi Sage, is it also planned for mimic?

----- Original Message -----
From: "Sage Weil" <sage@xxxxxxxxxxxx>
To: "Xiaoxi Chen" <superdebuger@xxxxxxxxx>
Cc: "ceph-devel" <ceph-devel@xxxxxxxxxxxxxxx>
Sent: Monday, March 4, 2019 16:18:47
Subject: Re: Fwd: [ceph-users] ceph osd commit latency increase over time, until restart

On Mon, 4 Mar 2019, Xiaoxi Chen wrote:
> [Resend with pure text]
>
> Hi List,
>
>      After a 3+ day bake, the bitmap allocator shows much better
> performance characteristics than stupid. osd.44 is Nautilus + bitmap
> allocator and osd.19 is Luminous + bitmap; as a comparison, osd.406
> just did a fresh restart but continues with Luminous + stupid.
> See https://pasteboard.co/I3SkfuN.png for the figure.

I just want to follow this up with a warning:

** DO NOT USE BITMAP ALLOCATOR ON LUMINOUS IN PRODUCTION **

We made luminous default to StupidAllocator because we saw instability
with the (old) bitmap implementation. In Nautilus, there is a completely
new implementation (with a similar design).

Now that we've pinpointed the problem, it is likely we'll backport the
new implementation to luminous.

Thanks!
sage

>      Nautilus shows the best performance consistency, with a max of
> 20 ms compared to 92 ms on Luminous. (https://pasteboard.co/I3SkxxK.png)
>
>      At the same time, as Igor pointed out, the stupid allocator can
> become fragmented, and there is no defragmentation functionality in
> there, so it gets slower over time. This theory can be proven by the
> mempool status of OSDs before and after a reboot: you can see the tree
> shrank by roughly 9x.
>
> Before reboot:
>
>     "bluestore_alloc": {
>         "items": 915127024,
>         "bytes": 915127024
>     },
>
> After reboot:
>
>     "bluestore_alloc": {
>         "items": 104727568,
>         "bytes": 104727568
>     },
>
>      I am extending the Nautilus deployment to one rack, and in
> another rack I changed min_alloc_size from 4K to 32K, to see if it
> can relieve the b-tree a bit.
>      Also trying Igor's testing branch at
> https://github.com/ceph/ceph/commits/wip-ifed-reset-allocator-luminous.
>
> Xiaoxi
>
> On Sun, Mar 3, 2019 at 3:24 AM Igor Fedotov <ifedotov@xxxxxxx> wrote:
> >
> > Hi Xiaoxi,
> >
> > Please note that this PR is a proof of concept, hence I didn't try to
> > implement the best algorithm.
> >
> > But IMO your approach is not viable (or at least isn't that simple),
> > since the freelist manager (and RocksDB) doesn't contain up-to-date
> > allocator state at an arbitrary moment in time - running transactions
> > might have some pending allocations that were processed by the
> > allocator but haven't landed in the DB yet. So one is unable to
> > restore a valid allocator state from the freelist manager without
> > finalizing all the transactions first. Which looks a bit troublesome...
> >
> > Thanks,
> >
> > Igor
> >
> > On 3/2/2019 6:41 PM, Xiaoxi Chen wrote:
> > > Hi Igor,
> > >      Thanks, and no worries - I will build it locally and test it,
> > > and will update this thread if I find anything.
> > >
> > >      The commit
> > > https://github.com/ceph/ceph/commit/8ee87c22bcd88a8911d58936cec9049e0932fb77
> > > makes sense, though the concern is that the full defragmentation
> > > will take a long time.
> > >
> > >      Do you think it would be faster to use the freelist manager,
> > > rather than iterating over every btree, inserting into a common one,
> > > and rebuilding the original b-tree? The freelist manager should
> > > always be defragmented at any point in time.
> > >
> > >     fm->enumerate_reset();
> > >     uint64_t offset, length;
> > >     while (fm->enumerate_next(&offset, &length)) {
> > >       alloc->init_add_free(offset, length);
> > >       ++num;
> > >       bytes += length;
> > >     }
> > >
> > > Xiaoxi
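(For context: the loop Xiaoxi quotes above is essentially what BlueStore
runs at startup to prime the allocator from the persisted freelist. Below
is a self-contained sketch of that rebuild idea; the two classes are
simplified stand-ins for Ceph's FreelistManager and Allocator interfaces,
not the real implementations, and the extent data in main() is made up.)

    #include <cstdint>
    #include <iostream>
    #include <map>
    #include <utility>
    #include <vector>

    // Stand-in for Ceph's FreelistManager: iterates the free extents
    // (offset, length) that BlueStore persists in RocksDB.
    class FreelistManager {
     public:
      explicit FreelistManager(std::vector<std::pair<uint64_t, uint64_t>> v)
          : extents_(std::move(v)) {}
      void enumerate_reset() { pos_ = 0; }
      bool enumerate_next(uint64_t* offset, uint64_t* length) {
        if (pos_ >= extents_.size()) return false;
        *offset = extents_[pos_].first;
        *length = extents_[pos_].second;
        ++pos_;
        return true;
      }
     private:
      std::vector<std::pair<uint64_t, uint64_t>> extents_;
      size_t pos_ = 0;
    };

    // Stand-in for the in-memory allocator being (re)built.
    class Allocator {
     public:
      void init_add_free(uint64_t offset, uint64_t length) {
        free_[offset] = length;
      }
     private:
      std::map<uint64_t, uint64_t> free_;  // offset -> length
    };

    // Rebuild the allocator from the freelist, as in the quoted snippet.
    // Igor's caveat applies: this is only valid once all in-flight
    // transactions have been finalized, because the persisted freelist
    // lags behind the allocator's live state.
    void rebuild_allocator(FreelistManager* fm, Allocator* alloc) {
      uint64_t offset = 0, length = 0, num = 0, bytes = 0;
      fm->enumerate_reset();
      while (fm->enumerate_next(&offset, &length)) {
        alloc->init_add_free(offset, length);
        ++num;
        bytes += length;
      }
      std::cout << "loaded " << num << " extents, "
                << bytes << " bytes free\n";
    }

    int main() {
      FreelistManager fm({{0, 4096}, {8192, 65536}});
      Allocator alloc;
      rebuild_allocator(&fm, &alloc);  // prints: loaded 2 extents, 69632 bytes free
      return 0;
    }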
> > > On Sat, Mar 2, 2019 at 3:23 AM Igor Fedotov <ifedotov@xxxxxxx> wrote:
> > >
> > >     Xiaoxi,
> > >
> > >     Here is the luminous patch which performs a StupidAllocator
> > >     reset once per 12 hours:
> > >
> > >     https://github.com/ceph/ceph/tree/wip-ifed-reset-allocator-luminous
> > >
> > >     Sorry, I didn't have enough time today to learn how to make a
> > >     package from it - just sources for now.
> > >
> > >     Thanks,
> > >
> > >     Igor
> > >
> > >     On 3/1/2019 11:46 AM, Xiaoxi Chen wrote:
> > >>     Igor,
> > >>          I can test the patch if we have a package.
> > >>          My environment and workload can consistently reproduce
> > >>     the latency issue 2-3 days after restarting.
> > >>          Sage told me to try the bitmap allocator to make sure the
> > >>     stupid allocator is the bad guy. I have some OSDs on Luminous +
> > >>     bitmap and some OSDs on 14.1.0 + bitmap. Both look positive so
> > >>     far, but I need more time to be sure.
> > >>          The perf, log and admin socket analysis leads to the
> > >>     theory that the loop in alloc_int sometimes takes a long time
> > >>     with the allocator locks held. This blocks the release path
> > >>     called from _txc_finish in the kv_finalize_thread, which is
> > >>     also the thread that calculates state_kv_committing_lat and
> > >>     the overall commit_lat. You can see from the admin socket that
> > >>     state_done_latency shows a similar trend to commit_latency.
> > >>          But we cannot find a theory to explain why a reboot helps:
> > >>     the allocator btree will be rebuilt from the freelist manager
> > >>     and should be exactly the same as it was prior to the reboot.
> > >>     Anything related to PG recovery?
> > >>
> > >>          Anyway, as I have a live environment and workload, I am
> > >>     more than willing to work with you on further investigation.
> > >>
> > >>     -Xiaoxi
> > >>
> > >>     On Fri, Mar 1, 2019 at 6:21 AM Igor Fedotov <ifedotov@xxxxxxx> wrote:
> > >>
> > >>         Also I think it makes sense to create a ticket at this
> > >>         point. Any volunteers?
> > >>
> > >>         On 3/1/2019 1:00 AM, Igor Fedotov wrote:
> > >>         > Wondering if somebody would be able to apply a simple
> > >>         > patch that periodically resets the StupidAllocator?
> > >>         >
> > >>         > Just to verify/disprove the hypothesis that it's
> > >>         > allocator-related.
> > >>         >
> > >>         > On 2/28/2019 11:57 PM, Stefan Kooman wrote:
> > >>         >> Quoting Wido den Hollander (wido@xxxxxxxx):
> > >>         >>> Just wanted to chime in: I've seen this with
> > >>         >>> Luminous + BlueStore + NVMe OSDs as well. Over time
> > >>         >>> their latency increased until we started to notice
> > >>         >>> I/O-wait inside VMs.
> > >>         >> On a Luminous 12.2.8 cluster with only SSDs we also hit
> > >>         >> this issue, I guess. After restarting the OSD servers
> > >>         >> the latency would drop to normal values again. See
> > >>         >> https://owncloud.kooman.org/s/BpkUc7YM79vhcDj
> > >>         >>
> > >>         >> Reboots were finished at ~19:00.
> > >>         >>
> > >>         >> Gr. Stefan
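(A footnote on the fragmentation theory discussed above: the roughly 9x
mempool shrink Xiaoxi observed is what you would expect if a restart
re-inserts the free extents with coalescing. Below is a minimal,
self-contained toy model of that effect - it uses a single std::map of
free extents rather than the real StupidAllocator's per-size interval
btrees, so the names and numbers are illustrative only.)

    #include <cstdint>
    #include <iostream>
    #include <iterator>
    #include <map>

    // offset -> length of each free extent
    using ExtentMap = std::map<uint64_t, uint64_t>;

    // Insert a free extent, merging it with adjacent neighbours.
    void insert_free(ExtentMap& m, uint64_t off, uint64_t len) {
      auto next = m.lower_bound(off);
      if (next != m.begin()) {
        auto prev = std::prev(next);
        if (prev->first + prev->second == off) {  // touches on the left
          off = prev->first;
          len += prev->second;
          m.erase(prev);
        }
      }
      if (next != m.end() && next->first == off + len) {  // touches on the right
        len += next->second;
        m.erase(next);
      }
      m.emplace(off, len);
    }

    int main() {
      // Fragmented state: nine contiguous 4K extents that were freed
      // one by one and never merged (the "before reboot" picture).
      ExtentMap fragmented;
      for (uint64_t i = 0; i < 9; ++i)
        fragmented.emplace(i * 4096, 4096);

      // "Reset": rebuild from scratch with coalescing, which is what a
      // restart effectively does when the allocator is re-initialized
      // from the freelist manager.
      ExtentMap rebuilt;
      for (const auto& [off, len] : fragmented)
        insert_free(rebuilt, off, len);

      std::cout << "before: " << fragmented.size() << " extents, "
                << "after: " << rebuilt.size() << " extent\n";  // 9 -> 1
      return 0;
    }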