Re: [ceph-users] ceph osd commit latency increase over time, until restart

Answering for Sage...

Yes, if it goes to Luminous, it goes to Mimic too.


Thanks,

Igor

On 3/5/2019 12:00 PM, Alexandre DERUMIER wrote:
> Now that we've pinpointed the problem, it is likely we'll backport the
> new implementation to luminous.

Hi Sage, is it also planned for mimic?

----- Original Message -----
From: "Sage Weil" <sage@xxxxxxxxxxxx>
To: "Xiaoxi Chen" <superdebuger@xxxxxxxxx>
Cc: "ceph-devel" <ceph-devel@xxxxxxxxxxxxxxx>
Sent: Monday, 4 March 2019, 16:18:47
Subject: Re: Fwd: [ceph-users] ceph osd commit latency increase over time, until restart

On Mon, 4 Mar 2019, Xiaoxi Chen wrote:
[Resend with pure text]

Hi List,

After a 3+ day bake, the bitmap allocator shows much better
performance characteristics compared to stupid. osd.44 is nautilus
+ bitmap allocator, osd.19 is luminous + bitmap; as a comparison,
osd.406 just did a fresh restart but continues with luminous + stupid.
See https://pasteboard.co/I3SkfuN.png for the figure.
I just want to follow this up with a warning:

** DO NOT USE BITMAP ALLOCATOR ON LUMINOUS IN PRODUCTION **

We made luminous default to StupidAllocator because we saw instability
with the (old) bitmap implementation. In Nautilus, there is a completely
new implementation (with a similar design). Now that we've pinpointed the
problem, it is likely we'll backport the new implementation to luminous.

Thanks!
sage


Nautilus shows the best performance consistency, with a max of 20 ms
compared to 92 ms in luminous (https://pasteboard.co/I3SkxxK.png).

At the same time, as Igor pointed out, the stupid allocator can become
fragmented, and there is no de-fragmentation functionality in it, so it
gets slower over time. This theory can be proven by the mempool status
of the OSDs before and after a reboot: you can see the tree shrink by
roughly 9x.

Before reboot:

"bluestore_alloc": {
    "items": 915127024,
    "bytes": 915127024
},

After reboot:

"bluestore_alloc": {
    "items": 104727568,
    "bytes": 104727568
},
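
For reference, per-OSD mempool stats like the ones above come from the
OSD admin socket; something like the following, with osd.44 just as an
example id:

    ceph daemon osd.44 dump_mempools

and then look at the bluestore_alloc section.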



I am extending the Nautilus deployment to one rack, and in
another rack I changed min_alloc_size from 4K to 32K, to see if it
can relieve pressure on the b-tree a bit (config sketch below).
Also trying Igor's testing branch at
https://github.com/ceph/ceph/commits/wip-ifed-reset-allocator-luminous.
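
A sketch of that min_alloc_size change in ceph.conf, assuming SSD-backed
OSDs; note that (as far as I know) bluestore_min_alloc_size_* is only
read at OSD creation time, so the OSDs have to be redeployed for it to
take effect:

    [osd]
    # 32 KiB instead of the 4 KiB SSD default; applied at mkfs time only
    bluestore_min_alloc_size_ssd = 32768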

Xiaoxi

Igor Fedotov <ifedotov@xxxxxxx> wrote on Sunday, 3 March 2019 at 3:24 AM:
Hi Xiaoxi,

Please note that this PR is a proof of concept, hence I didn't try to
implement the best algorithm.

But IMO your approach is not viable (or at least isn't that simple),
since the freelist manager (and RocksDB) doesn't contain up-to-date
allocator state at an arbitrary moment in time - running transactions
might have some pending allocations that were processed by the allocator
but haven't landed in the DB yet. So one is unable to restore a valid
allocator state from the freelist manager unless all transactions are
finalized first. Which looks a bit troublesome...
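
A toy illustration of that race (this is not BlueStore code; every name
in it is made up for the example): the allocator's in-memory state runs
ahead of the persisted freelist, so rebuilding from the DB mid-flight
would resurrect extents that are actually in use.

#include <cstdint>
#include <iostream>
#include <set>

int main() {
  // Toy model: in-memory allocator state vs. the persisted freelist.
  std::set<uint64_t> allocator_free = {0, 1, 2, 3};
  std::set<uint64_t> persisted_free = {0, 1, 2, 3};

  // A running transaction allocates extent 2: the allocator is updated
  // immediately, but the matching freelist update only lands at commit.
  allocator_free.erase(2);
  bool txn_committed = false;

  // Rebuilding the allocator from the persisted freelist before the
  // commit would mark extent 2 as free again, i.e. a double allocation
  // waiting to happen.
  if (!txn_committed && persisted_free.count(2) && !allocator_free.count(2)) {
    std::cout << "states diverge: extent 2 is allocated in memory "
                 "but still free in the DB\n";
  }
  return 0;
}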


Thanks,

Igor


On 3/2/2019 6:41 PM, Xiaoxi Chen wrote:
Hi Igor,
Thanks, no worries; I will build it locally and test it, and will
update this thread if I find anything.


The commit
https://github.com/ceph/ceph/commit/8ee87c22bcd88a8911d58936cec9049e0932fb77 makes
sense, though the concern is that the full defragmentation will take a long time.

Do you think it would be faster to use the freelist manager rather than
iterating over every btree, inserting into a common one, and rebuilding
the original b-tree? The freelist manager should always be defragmented,
at any point in time:

fm->enumerate_reset();
uint64_t offset, length;
uint64_t num = 0, bytes = 0;
// Walk the freelist manager's extents and feed them back into the allocator.
while (fm->enumerate_next(&offset, &length)) {
  alloc->init_add_free(offset, length);
  ++num;           // extents restored
  bytes += length; // total free space restored
}

Xiaoxi

Igor Fedotov <ifedotov@xxxxxxx> wrote on Saturday, 2 March 2019 at 3:23 AM:

Xiaoxi,

Here is the luminous patch, which resets the StupidAllocator once
every 12 hours.

https://github.com/ceph/ceph/tree/wip-ifed-reset-allocator-luminous

Sorry, I didn't have enough time today to figure out how to build a
package from it, so just sources for now.


Thanks,

Igor


On 3/1/2019 11:46 AM, Xiaoxi Chen wrote:
Igor,
I can test the patch if we have a package.
My environment and workload can consistently reproduce the
latency increase 2-3 days after restarting.
Sage told me to try the bitmap allocator to make sure the stupid
allocator is the bad guy. I have some OSDs on luminous + bitmap
and some OSDs on 14.1.0 + bitmap. Both look positive so far,
but I need more time to be sure.
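
For anyone who wants to repeat this: switching the allocator is just a
ceph.conf setting plus an OSD restart. A sketch, assuming the
bluestore_allocator option spelling used in luminous and later:

    [osd]
    bluestore_allocator = bitmap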
The perf, log and admin socket analysis leads to the theory
that in alloc_int the loop sometimes takes a long time with the
allocator locks held. This blocks the release path called from
_txc_finish in the kv_finalize_thread, which is also the thread that
calculates state_kv_committing_lat and the overall commit_lat. You can
see from the admin socket that state_done_latency has a similar trend
to commit_latency.
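
(Those latency counters come from the OSD perf counters; something
along the lines of

    ceph daemon osd.<id> perf dump

and then looking at the bluestore section, e.g. the commit_lat and
state_done_lat entries.)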
But we cannot find a theory to explain why a reboot helps: the
allocator btree is rebuilt from the freelist manager, and it should be
exactly the same as it was prior to the reboot. Anything related to
PG recovery?

Anyway, as I have a live environment and workload, I am more than
willing to work with you on further investigation.

-Xiaoxi

Igor Fedotov <ifedotov@xxxxxxx> wrote on Friday, 1 March 2019 at 6:21 AM:

Also I think it makes sense to create a ticket at this point. Any
volunteers?

On 3/1/2019 1:00 AM, Igor Fedotov wrote:
Wondering if somebody would be able to apply a simple patch that
periodically resets the StupidAllocator?

Just to verify/disprove the hypothesis that it's allocator related.

On 2/28/2019 11:57 PM, Stefan Kooman wrote:
Quoting Wido den Hollander (wido@xxxxxxxx):
> Just wanted to chime in, I've seen this with Luminous+BlueStore+NVMe
> OSDs as well. Over time their latency increased until we started to
> notice I/O-wait inside VMs.

On a Luminous 12.2.8 cluster with only SSDs we also hit this issue, I
guess. After restarting the OSD servers the latency would drop to
normal values again. See https://owncloud.kooman.org/s/BpkUc7YM79vhcDj

Reboots were finished at ~19:00.

Gr. Stefan





