Re: snaptrim blocks IO on ceph nautilus

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



This blocking io problem is coming I think from somewhere deeper and we got it in different areas of ceph.
I got it from rados gateway, with octopus 15.2.14, crashing osd, laggy pgs … would be good to figure out where is the bottleneck …
Feel enough od troubleshooting a month and still haven’t found out the real problem :(

You may try to increase osd_suicide_timeout and osd_op_threads_timeout like multiply it with 5-10.

Istvan Szabo
Senior Infrastructure Engineer
---------------------------------------------------
Agoda Services Co., Ltd.
e: istvan.szabo@xxxxxxxxx<mailto:istvan.szabo@xxxxxxxxx>
---------------------------------------------------

On 2021. Nov 6., at 17:59, Francois Legrand <fleg@xxxxxxxxxxxxxx> wrote:

Email received from the internet. If in doubt, don't click any link nor open any attachment !
________________________________

Le 06/11/2021 à 16:57, Francois Legrand a écrit :
Hi,

Can you confirm that the changing bluefs_buffered_io to true solved
your problem ?

Because I have a rather similar problem. My Nautilus cluster was with
bluefs_buffered_io = false. It was working (even with snaptrim lasting
a lot, i.e. several hours). I upgraded to octopus and it seems that
creating/deleting snapshot now creates a lot of instabilities (leading
to osd marked down or crashing, mgr and mds crashing, MON_DISK_BIG
warning, mon out of quorum and a tons of slowops and MOSDScrubReserve
messages in the logs). Compaction of the failed osd seems more or less
to solve the problem (osds stop to crash). Thus I desactivated the
snapshots for the moment.

F.

Le 27/07/2020 à 15:59, Manuel Lausch a écrit :
Hi,

since some days I try to debug a problem with snaptrimming under
nautilus.

I have a cluster with Nautilus (v14.2.10) , 44 Nodes á 24 OSDs á 14 TB
I create every day a snapshot for 7 days.

Every time the old snapshot is deleting I have bad IO performcance
and blocked requests for several seconds until the snaptrim is done.
Settings like snaptrim_sleep and osd_pg_max_concurrent_snap_trims
don't affect this behavior.

In the debug_osd 10/10 log I see the following:

2020-07-27 11:45:49.976 7fd8b8404700 10 osd.411 22457 dequeue_op
0x557886edda20 prio 196 cost 0 latency 0.019545
osd_repop_reply(client.22731418.0:615257 3.636 e22457/22372) v2 pg
pg[3.636( v 22457'100855 (21737'97756,22457'100855]
local-lis/les=22372/22374 n=27762 ec=2842/2839 lis/c 22372/22372
les/c/f 22374/22374/0 22372/22372/22343) [411,36,956,763] r=0
lpr=22372 luod=22457'100854 crt=22457'100855 lcod 22457'100853 mlcod
22457'100853 active+clean+snaptrim_wait trimq=[1d~1]]
2020-07-27 11:45:49.976 7fd8b8404700 10 osd.411 22457 dequeue_op
0x557886edda20 finish
2020-07-27 11:45:49.976 7fd8b8404700 10 osd.411 22457 dequeue_op
0x557886edc2c0 prio 127 cost 0 latency 0.043165
MOSDScrubReserve(2.2645 RELEASE e22457) v1 pg pg[2.2645( empty
local-lis/les=22359/22364 n=0 ec=2403/2403 lis/c 22359/22359 les/c/f
22364/22367/0 22359/22359/22359) [379,411,884,975] r=1 lpr=22359
crt=0'0 active mbc={}]
2020-07-27 11:45:49.976 7fd8b8404700 10 osd.411 22457 dequeue_op
0x557886edc2c0 finish
2020-07-27 11:45:50.039 7fd8b8404700 10 osd.411 pg_epoch: 22457
pg[3.278e( v 22457'99491 (21594'96426,22457'99491]
local-lis/les=22359/22362 n=27669 ec=2859/2839 lis/c 22359/22359
les/c/f 22362/22365/0 22359/22359/22343) [411,379,848,924] r=0
lpr=22359 crt=22457'99491 lcod 22457'99489 mlcod 22457'99489
active+clean+snaptrim trimq=[1d~1]] snap_trimmer posting
2020-07-27 11:45:57.801 7fd8b8404700 10 osd.411 pg_epoch: 22457
pg[3.278e( v 22457'99493 (21594'96426,22457'99493]
local-lis/les=22359/22362 n=27669 ec=2859/2839 lis/c 22359/22359
les/c/f 22362/22365/0 22359/22359/22343) [411,379,848,924] r=0
lpr=22359 luod=22457'99491 crt=22457'99493 lcod 22457'99489 mlcod
22457'99489 active+clean+snaptrim trimq=[1d~1]] snap_trimmer complete
2020-07-27 11:45:57.801 7fd8b8404700 10 osd.411 22457 dequeue_op
0x557880ac3760 prio 127 cost 663 latency 7.761823
osd_repop(osd.217.0:3025 3.1ca5 e22457/22378) v2 pg pg[3.1ca5( v
22457'100370 (21716'97357,22457'100370] local-lis/les=22378/22379
n=27532 ec=2855/2839 lis/c 22378/22378 les/c/f 22379/22379/0
22378/22378/22378) [217,411,551,1055] r=1 lpr=22378 luod=0'0
lua=22294'100006 crt=22457'100370 lcod 22457'100369 active mbc={}]
2020-07-27 11:45:57.801 7fd8b8404700 10 osd.411 22457 dequeue_op
0x557880ac3760 finish
2020-07-27 11:45:57.801 7fd8b8404700 10 osd.411 22457 dequeue_op
0x5578813e1e40 prio 127 cost 0 latency 7.494296
MOSDScrubReserve(2.37e2 REQUEST e22457) v1 pg pg[2.37e2( empty
local-lis/les=22355/22356 n=0 ec=2412/2412 lis/c 22355/22355 les/c/f
22356/22356/0 22355/22355/22355) [245,411,834,768] r=1 lpr=22355
crt=0'0 active mbc={}]
2020-07-27 11:45:57.801 7fd8b8404700 10 osd.411 22457 dequeue_op
0x5578813e1e40 finish

the dequeueing of ops works without pauses until the „snap_trimmer
posting“ and „snap_trimmer complete“ loglines. This task takes in
this example about 7 Seconds. The following operations which are
dequeued have now a latency of about this time.

I tried to drill down this in the code. (Developers are asked here)
It seems, that the PG will be locked for every operation.
The snap_trimmer posting and complete message comes from
„osd/PrimaryLogPG.cc“ on line 4700. This indicates me, that the
process of deleting a snapshot object will sometimes take some time.

After further poking around. I see in „osd/SnapMapper.cc“ the method
„SnapMapper::get_next_objects_to_trim“ which takes several seconds to
get finished. I followed this further to the „common/map_cacher.hpp“
to the line 94: „int r = driver->get_next(key, &store);“
>From there I lost the path.

The slowness is not on all OSDs at the same time. Somteime, this few
OSDs are affected, sometimes some others. Restart of an OSD does not
help.

With luminous and filestore, snapshot deletion was not an issue at all.
With nautilus and bluestore this is not acceptable for my usecase.

I don‘t know so far, if this is a bluestore specific problem or some
general issue.
I wonder a bit why there are no other who have this problem.


Regards
Manuel

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx




[Index of Archives]     [Information on CEPH]     [Linux Filesystem Development]     [Ceph Development]     [Ceph Large]     [Ceph Dev]     [Linux USB Development]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [xfs]


  Powered by Linux