Hi, I just checked and all OSDs have it set to true. It also does not
seem to be a problem with the snaptrim operation. We just had two
occasions in the last 7 days where nearly all OSDs logged these
messages very frequently (around 3k times in 20 minutes):

2022-09-12T20:27:19.146+0200 7f576de49700 -1 osd.9 786378
get_health_metrics reporting 1 slow ops, oldest is
osd_op(client.153241560.0:42288714 8.56
8:6a19e4ee:::rbd_data.4c64dc3662fb05.0000000000000c00:head
[write 2162688~4096 in=4096b] snapc 9835e=[]
ondisk+write+known_if_redirected e786375)

On Tue, 13 Sept 2022 at 20:20, Wesley Dillingham <wes@xxxxxxxxxxxxxxxxx> wrote:

> I haven't read through this entire thread, so forgive me if this was
> already mentioned:
>
> What is the parameter "bluefs_buffered_io" set to on your OSDs? We
> once saw a terrible slowdown on our OSDs during snaptrim events, and
> setting bluefs_buffered_io to true alleviated that issue. That was on
> a Nautilus cluster.
>
> Respectfully,
>
> Wes Dillingham
> wes@xxxxxxxxxxxxxxxxx
> LinkedIn <http://www.linkedin.com/in/wesleydillingham>
>
>
> On Tue, Sep 13, 2022 at 10:48 AM Boris Behrens <bb@xxxxxxxxx> wrote:
>
>> The cluster is SSD only with 2 TB, 4 TB and 8 TB disks. I would
>> expect this to be done fairly fast.
>> For now I will recreate every OSD in the cluster and check if this
>> helps.
>>
>> Do you experience slow ops (i.e. the cluster shows a message like
>> "cluster [WRN] Health check update: 679 slow ops, oldest one blocked
>> for 95 sec, daemons
>> [osd.0,osd.106,osd.107,osd.108,osd.113,osd.116,osd.123,osd.124,osd.125,osd.134]...
>> have slow ops. (SLOW_OPS)")?
>>
>> I can also see a huge spike in the load of all hosts in our cluster
>> for a couple of minutes.
>>
>>
>> On Tue, 13 Sept 2022 at 13:14, Frank Schilder <frans@xxxxxx> wrote:
>>
>> > Hi Boris.
>> >
>> > > 3. wait some time (took around 5-20 minutes)
>> >
>> > Sounds short. It might just have been the compaction that the OSDs
>> > do anyway on startup after the upgrade. I don't know how to check
>> > for a completed format conversion. What I see in your MON log is
>> > exactly what I have seen with default snap trim settings until all
>> > OSDs were converted. Once an OSD falls behind and slow ops start
>> > piling up, everything comes to a halt. Your logs clearly show a
>> > sudden drop of IOP/s on snap trim start, and I would guess this is
>> > the cause of the slowly growing ops backlog of the OSDs.
>> >
>> > If it's not that, I don't know what else to look for.
>> >
>> > Best regards,
>> > =================
>> > Frank Schilder
>> > AIT Risø Campus
>> > Bygning 109, rum S14
>> >
>> > ________________________________________
>> > From: Boris Behrens <bb@xxxxxxxxx>
>> > Sent: 13 September 2022 12:58:19
>> > To: Frank Schilder
>> > Cc: ceph-users@xxxxxxx
>> > Subject: Re: laggy OSDs and stalling krbd IO after upgrade from
>> > nautilus to octopus
>> >
>> > Hi Frank,
>> > we converted the OSDs directly on the upgrade:
>> >
>> > 1. install the new ceph version
>> > 2. restart all OSD daemons
>> > 3. wait some time (took around 5-20 minutes)
>> > 4. all OSDs were online again
>> >
>> > So I would expect that the OSDs are all upgraded correctly.
>> > I also checked when the trimming happens, and it does not seem to
>> > be an issue on its own, as the trim happens all the time in
>> > various sizes.
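
For reference, the settings and activity discussed above can be
inspected roughly like this; the throttle values at the end are only
examples, not recommendations:

    # configured default and the value each OSD actually uses
    ceph config get osd bluefs_buffered_io
    ceph tell 'osd.*' config get bluefs_buffered_io

    # PGs currently trimming or queued for trimming
    ceph pg ls snaptrim
    ceph pg ls snaptrim_wait

    # slow down snap trimming if it overwhelms the OSDs (example values)
    ceph config set osd osd_snap_trim_sleep 1
    ceph config set osd osd_pg_max_concurrent_snap_trims 1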
>> > On Tue, 13 Sept 2022 at 12:45, Frank Schilder <frans@xxxxxx> wrote:
>> >
>> > Are you observing this here:
>> > https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/message/LAN6PTZ2NHF2ZHAYXZIQPHZ4CMJKMI5K/
>> >
>> > =================
>> > Frank Schilder
>> > AIT Risø Campus
>> > Bygning 109, rum S14
>> >
>> > ________________________________________
>> > From: Boris Behrens <bb@xxxxxxxxx>
>> > Sent: 13 September 2022 11:43:20
>> > To: ceph-users@xxxxxxx
>> > Subject: laggy OSDs and stalling krbd IO after upgrade from
>> > nautilus to octopus
>> >
>> > Hi, I really need your help.
>> >
>> > We are currently experiencing very bad cluster hangups that happen
>> > sporadically (once on 2022-09-08 at midday, 48 hrs after the
>> > upgrade, and once on 2022-09-12 in the evening).
>> > We use krbd without cephx for the qemu clients, and when the OSDs
>> > are getting laggy, the krbd connection comes to a grinding halt,
>> > to the point that all IO is stalling and we can't even unmap the
>> > rbd device.
>> >
>> > From the logs, it looks like the cluster starts to snaptrim a lot
>> > of PGs, then PGs become laggy, and then the cluster snowballs into
>> > laggy OSDs.
>> > I have attached the monitor log and the osd log (from one OSD)
>> > around the time where it happened.
>> >
>> > - is this a known issue?
>> > - what can I do to debug it further?
>> > - can I downgrade back to nautilus?
>> > - should I increase the PG count for the pool to 4096 or 8192?
>> >
>> > The cluster contains a mixture of 2, 4 and 8 TB SSDs (no rotating
>> > disks), where the 8 TB disks got ~120 PGs and the 2 TB disks got
>> > ~30 PGs. All hosts have a minimum of 128 GB RAM, and the kernel
>> > logs of all ceph hosts do not show anything for the timeframe.
>> >
>> > Cluster stats:
>> >   cluster:
>> >     id:     74313356-3b3d-43f3-bce6-9fb0e4591097
>> >     health: HEALTH_OK
>> >
>> >   services:
>> >     mon: 3 daemons, quorum ceph-rbd-mon4,ceph-rbd-mon5,ceph-rbd-mon6 (age 25h)
>> >     mgr: ceph-rbd-mon5(active, since 4d), standbys: ceph-rbd-mon4, ceph-rbd-mon6
>> >     osd: 149 osds: 149 up (since 6d), 149 in (since 7w)
>> >
>> >   data:
>> >     pools:   4 pools, 2241 pgs
>> >     objects: 25.43M objects, 82 TiB
>> >     usage:   231 TiB used, 187 TiB / 417 TiB avail
>> >     pgs:     2241 active+clean
>> >
>> >   io:
>> >     client: 211 MiB/s rd, 273 MiB/s wr, 1.43k op/s rd, 8.80k op/s wr
>> >
>> > --- RAW STORAGE ---
>> > CLASS  SIZE     AVAIL    USED     RAW USED  %RAW USED
>> > ssd    417 TiB  187 TiB  230 TiB   231 TiB      55.30
>> > TOTAL  417 TiB  187 TiB  230 TiB   231 TiB      55.30
>> >
>> > --- POOLS ---
>> > POOL                   ID  PGS   STORED   OBJECTS  USED     %USED  MAX AVAIL
>> > isos                    7    64  455 GiB  117.92k  1.3 TiB   1.17     38 TiB
>> > rbd                     8  2048   76 TiB   24.65M  222 TiB  66.31     38 TiB
>> > archive                 9   128  2.4 TiB  669.59k  7.3 TiB   6.06     38 TiB
>> > device_health_metrics  10     1   25 MiB      149   76 MiB      0     38 TiB
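
Regarding the SLOW_OPS warnings and the pg_num question above, a rough
sketch of how to dig further (osd.9 and 4096 are only example values;
the daemon commands have to be run on the host that carries the OSD):

    # which daemons currently report slow ops
    ceph health detail

    # inspect the stuck ops on one of the affected OSDs
    ceph daemon osd.9 dump_ops_in_flight
    ceph daemon osd.9 dump_blocked_ops

    # PG count and utilisation per OSD, to judge whether a split helps
    ceph osd df tree

    # if you decide to split, pgp_num follows automatically on
    # Nautilus and newer
    ceph osd pool set rbd pg_num 4096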
--
The self-help group "UTF-8 problems" will meet in the big hall this
time, as an exception.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx