Re: laggy OSDs and staling krbd IO after upgrade from nautilus to octopus

Hey Frank,

What happens if you issue a compact ('ceph tell osd.XXX compact') to
those affected OSDs? Does it change how they perform during snaptrim?
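
If several OSDs are affected, the compaction could be scripted. The following is only a rough sketch: the `ceph tell osd.XXX compact` command is the one mentioned above, while the helper names and the example OSD ids are made up for illustration. It defaults to a dry run so nothing is sent to the cluster unless you ask for it.

```python
import subprocess
from typing import Iterable, List


def compact_command(osd_id: int) -> List[str]:
    """Return the argv for 'ceph tell osd.<id> compact'."""
    return ["ceph", "tell", f"osd.{osd_id}", "compact"]


def compact_osds(osd_ids: Iterable[int], dry_run: bool = True) -> List[List[str]]:
    """Compact OSDs one at a time; with dry_run=True only return the commands."""
    commands = [compact_command(i) for i in osd_ids]
    if not dry_run:
        for cmd in commands:
            # check=True raises CalledProcessError if ceph exits non-zero.
            subprocess.run(cmd, check=True)
    return commands


# Hypothetical OSD ids; replace with the OSDs that are actually affected.
for cmd in compact_osds([0, 3, 7]):
    print(" ".join(cmd))
```

Running the OSDs sequentially rather than in parallel is a deliberate choice here, since compaction adds load to an OSD that is already struggling.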

Josh

On Tue, Sep 13, 2022 at 7:13 AM Frank Schilder <frans@xxxxxx> wrote:
>
> Hi Boris,
>
> some more information you probably don't want to hear. We just checked all OSD hosts and it is only the OSDs that were converted during the upgrade (by accident we upgraded a host while bluestore_fsck_quick_fix_on_mount=true) that are acting up. All other OSDs that were upgraded after setting require-osd-release=octopus are operating normally. This is my best lead at the moment.
>
> It might be possible that converting OSDs before setting require-osd-release=octopus leads to a broken state of the converted OSDs. I could not yet find a way out of this situation. We will soon perform a third upgrade test to test this hypothesis.
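
To test that hypothesis, it may help to guard the conversion with a check of the cluster's release flag. This is a sketch under assumptions: it expects the JSON text from `ceph osd dump --format json` and assumes that output carries a top-level `require_osd_release` field; the sample documents below are trimmed, made-up stand-ins, not real cluster output.

```python
import json


def release_allows_conversion(osd_dump_json: str, target: str = "octopus") -> bool:
    """True only if the dump reports require_osd_release == target.

    Assumes the JSON from 'ceph osd dump --format json' has a top-level
    'require_osd_release' key; a missing key is treated as "not safe".
    """
    dump = json.loads(osd_dump_json)
    return dump.get("require_osd_release") == target


# Trimmed, hypothetical samples for illustration only.
before = '{"epoch": 100, "require_osd_release": "nautilus"}'
after = '{"epoch": 101, "require_osd_release": "octopus"}'

print(release_allows_conversion(before))
print(release_allows_conversion(after))
```

The idea would be to leave bluestore_fsck_quick_fix_on_mount off until this check passes, and only then restart OSDs with it enabled.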
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Frank Schilder <frans@xxxxxx>
> Sent: 13 September 2022 14:55:35
> To: Boris Behrens
> Cc: ceph-users@xxxxxxx
> Subject:  Re: laggy OSDs and staling krbd IO after upgrade from nautilus to octopus
>
> Hi Boris,
>
> we just completed a second test-upgrade of our test cluster, following the extended procedure I posted. This time, however, the problem with snaptrim does not disappear.
>
> The symptoms are that (1) IO is basically dead while snaptrim is running, (2) PGs get laggy and (3) all OSDs are spinning at 100-200% CPU without doing any useful work. Ceph status looks just like yours:
>
> # ceph status
>   cluster:
>     id:     bf1f51f5-b381-4cf7-b3db-88d044c1960c
>     health: HEALTH_WARN
>             1 nearfull osd(s)
>             4 pool(s) nearfull
>             0 slow ops, oldest one blocked for 31 sec, osd.0 has slow ops
>
>   services:
>     mon: 3 daemons, quorum tceph-01,tceph-02,tceph-03 (age 24m)
>     mgr: tceph-01(active, since 23m), standbys: tceph-02, tceph-03
>     mds: fs:1 {0=tceph-01=up:active} 2 up:standby
>     osd: 9 osds: 9 up (since 12m), 9 in
>
>   data:
>     pools:   4 pools, 321 pgs
>     objects: 11.46M objects, 401 GiB
>     usage:   1.9 TiB used, 535 GiB / 2.4 TiB avail
>     pgs:     237 active+clean+snaptrim_wait
>              65  active+clean
>              18  active+clean+snaptrim
>              1   active+clean+scrubbing+snaptrim_wait
>
>   io:
>     client:   691 KiB/s wr, 0 op/s rd, 0 op/s wr
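
To put the PG states above into one number, here is a small sketch that adds up the PGs that are trimming or waiting to trim. The counts are copied from the status output above; the function name is made up for illustration.

```python
def summarize_pg_states(pg_states):
    """Return (trimming, total), where trimming counts every PG whose state
    has a component starting with 'snaptrim' (snaptrim, snaptrim_wait, ...)."""
    total = sum(pg_states.values())
    trimming = sum(
        count
        for state, count in pg_states.items()
        if any(part.startswith("snaptrim") for part in state.split("+"))
    )
    return trimming, total


# PG state counts taken from the 'ceph status' output above.
pg_states = {
    "active+clean+snaptrim_wait": 237,
    "active+clean": 65,
    "active+clean+snaptrim": 18,
    "active+clean+scrubbing+snaptrim_wait": 1,
}

trimming, total = summarize_pg_states(pg_states)
print(f"{trimming} of {total} PGs are trimming or waiting to trim")
```

For this status output that is 256 of 321 PGs, so most of the cluster is queued behind snaptrim.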
>
> We are running a copy of a large ISO and client IO should be 120MB/s. The OSD logs contain loads of messages like:
>
> 2022-09-13T14:54:30.070+0200 7fdac316e700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fdaa50c2700' had timed out after 15
> 2022-09-13T14:54:30.070+0200 7fdac316e700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fdaa68c5700' had timed out after 15
> 2022-09-13T14:54:30.153+0200 7fdaa50c2700  1 heartbeat_map reset_timeout 'OSD::osd_op_tp thread 0x7fdaa50c2700' had timed out after 15
> 2022-09-13T14:54:30.181+0200 7fdaa68c5700  1 heartbeat_map reset_timeout 'OSD::osd_op_tp thread 0x7fdaa68c5700' had timed out after 15
>
> Could you check if you observe the same symptoms?
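
One way to check an OSD log for these messages is a small script like the one below. It is a sketch, assuming the log lines have exactly the shape quoted above; the regular expression and function name are illustrative, and the sample lines are the four from this mail.

```python
import re
from collections import Counter

# Matches the 'heartbeat_map is_healthy ... had timed out' lines quoted above.
TIMEOUT_RE = re.compile(
    r"heartbeat_map is_healthy '(?P<pool>\S+) thread (?P<thread>0x[0-9a-f]+)' "
    r"had timed out after (?P<secs>\d+)"
)


def count_timeouts(lines):
    """Count is_healthy timeout reports per worker thread id."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[match.group("thread")] += 1
    return counts


sample = [
    "2022-09-13T14:54:30.070+0200 7fdac316e700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fdaa50c2700' had timed out after 15",
    "2022-09-13T14:54:30.070+0200 7fdac316e700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fdaa68c5700' had timed out after 15",
    "2022-09-13T14:54:30.153+0200 7fdaa50c2700  1 heartbeat_map reset_timeout 'OSD::osd_op_tp thread 0x7fdaa50c2700' had timed out after 15",
    "2022-09-13T14:54:30.181+0200 7fdaa68c5700  1 heartbeat_map reset_timeout 'OSD::osd_op_tp thread 0x7fdaa68c5700' had timed out after 15",
]

for thread, hits in count_timeouts(sample).items():
    print(thread, hits)
```

Only the is_healthy lines are counted; the reset_timeout lines report the same thread recovering and would otherwise double the numbers.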
>
> The first time we upgraded the cluster we had the same problem: mimic OSDs were running at 1-2% CPU while the non-converted octopus OSDs were running at 100-200% and IO was dead. Disabling snaptrim resolved the problem immediately. After converting a few OSDs and enabling snaptrim for a short while, we observed that the converted octopus OSDs now consumed 1-2% CPU while the unconverted ones were still spinning. Hence, after completing all conversions, CPU usage and behaviour went back to normal.
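
The mail does not say how snaptrim was disabled. One way, assumed here, is the cluster-wide nosnaptrim flag (`ceph osd set nosnaptrim` / `ceph osd unset nosnaptrim`). The sketch below wraps that in a context manager so the flag is always cleared again; the helper names are hypothetical, and the demo uses a recording stand-in instead of really calling ceph.

```python
import subprocess
from contextlib import contextmanager


def flag_command(action, flag="nosnaptrim"):
    """Return the argv for 'ceph osd set|unset <flag>'."""
    if action not in ("set", "unset"):
        raise ValueError(f"unsupported action: {action}")
    return ["ceph", "osd", action, flag]


@contextmanager
def snaptrim_paused(run=subprocess.run):
    """Set nosnaptrim on entry and unset it on exit, even after an error."""
    run(flag_command("set"), check=True)
    try:
        yield
    finally:
        run(flag_command("unset"), check=True)


# Demo with a recording stand-in, so no cluster is touched.
issued = []


def record(cmd, check=True):
    issued.append(cmd)


with snaptrim_paused(run=record):
    pass  # e.g. convert or restart OSDs here

print(issued)
```

With the default `run=subprocess.run` the same block would issue the real commands, so it should only be used that way on a cluster where pausing snaptrim is acceptable.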
>
> This time, however, the OSDs are stuck in the low-performing state even after conversion. I currently have no idea how this differs from our last test upgrade, but we are trying to find out how to get the OSDs out of this situation.
>
> In case you have any news, please update. We are really interested in what might help. We tried restarting everything, both all at once and individually, to no avail.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Frank Schilder <frans@xxxxxx>
> Sent: 13 September 2022 13:14:01
> To: Boris Behrens
> Cc: ceph-users@xxxxxxx
> Subject:  Re: laggy OSDs and staling krbd IO after upgrade from nautilus to octopus
>
> Hi Boris.
>
> > 3. wait some time (took around 5-20 minutes)
>
> Sounds short. It might just have been the compaction that the OSDs do anyway on startup after an upgrade. I don't know how to check for a completed format conversion. What I see in your MON log is exactly what I have seen with default snap trim settings until all OSDs were converted. Once an OSD falls behind and slow ops start piling up, everything comes to a halt. Your logs clearly show a sudden drop in IOPS when snap trim starts, and I would guess this is the cause of the slowly growing ops backlog on the OSDs.
>
> If it's not that, I don't know what else to look for.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Boris Behrens <bb@xxxxxxxxx>
> Sent: 13 September 2022 12:58:19
> To: Frank Schilder
> Cc: ceph-users@xxxxxxx
> Subject: Re:  laggy OSDs and staling krbd IO after upgrade from nautilus to octopus
>
> Hi Frank,
> we converted the OSDs directly during the upgrade.
>
> 1. installing new ceph versions
> 2. restart all OSD daemons
> 3. wait some time (took around 5-20 minutes)
> 4. all OSDs were online again.
>
> So I would expect that the OSDs were all upgraded correctly.
> I also checked when the trimming happens, and it does not seem to be an issue on its own, as trims of various sizes happen all the time.
>
> On Tue, 13 Sept 2022 at 12:45, Frank Schilder <frans@xxxxxx<mailto:frans@xxxxxx>> wrote:
> Are you observing this here: https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/message/LAN6PTZ2NHF2ZHAYXZIQPHZ4CMJKMI5K/
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Boris Behrens <bb@xxxxxxxxx<mailto:bb@xxxxxxxxx>>
> Sent: 13 September 2022 11:43:20
> To: ceph-users@xxxxxxx<mailto:ceph-users@xxxxxxx>
> Subject:  laggy OSDs and staling krbd IO after upgrade from nautilus to octopus
>
> Hi, I really need your help.
>
> We are currently experiencing very bad cluster hangups that happen
> sporadically (once on 2022-09-08 around midday, 48 hrs after the upgrade,
> and once on 2022-09-12 in the evening).
> We use krbd without cephx for the qemu clients, and when the OSDs get
> laggy, the krbd connection comes to a grinding halt, to the point
> that all IO stalls and we can't even unmap the rbd device.
>
> From the logs, it looks like the cluster starts to snaptrim a lot of
> PGs, then PGs become laggy, and then the cluster snowballs into laggy OSDs.
> I have attached the monitor log and the OSD log (from one OSD) around the
> time it happened.
>
> - is this a known issue?
> - what can I do to debug it further?
> - can I downgrade back to nautilus?
> - should I increase the PG count for the pool to 4096 or 8192?
>
> The cluster contains a mixture of 2, 4 and 8 TB SSDs (no rotating disks),
> where the 8 TB disks have ~120 PGs and the 2 TB disks have ~30 PGs. All hosts
> have a minimum of 128 GB RAM, and the kernel logs of all ceph hosts do not
> show anything for the timeframe.
>
> Cluster stats:
>   cluster:
>     id:     74313356-3b3d-43f3-bce6-9fb0e4591097
>     health: HEALTH_OK
>
>   services:
>     mon: 3 daemons, quorum ceph-rbd-mon4,ceph-rbd-mon5,ceph-rbd-mon6 (age 25h)
>     mgr: ceph-rbd-mon5(active, since 4d), standbys: ceph-rbd-mon4, ceph-rbd-mon6
>     osd: 149 osds: 149 up (since 6d), 149 in (since 7w)
>
>   data:
>     pools:   4 pools, 2241 pgs
>     objects: 25.43M objects, 82 TiB
>     usage:   231 TiB used, 187 TiB / 417 TiB avail
>     pgs:     2241 active+clean
>
>   io:
>     client:   211 MiB/s rd, 273 MiB/s wr, 1.43k op/s rd, 8.80k op/s wr
>
> --- RAW STORAGE ---
> CLASS  SIZE     AVAIL    USED     RAW USED  %RAW USED
> ssd    417 TiB  187 TiB  230 TiB   231 TiB      55.30
> TOTAL  417 TiB  187 TiB  230 TiB   231 TiB      55.30
>
> --- POOLS ---
> POOL                   ID  PGS   STORED   OBJECTS  USED     %USED  MAX AVAIL
> isos                    7    64  455 GiB  117.92k  1.3 TiB   1.17     38 TiB
> rbd                     8  2048   76 TiB   24.65M  222 TiB  66.31     38 TiB
> archive                 9   128  2.4 TiB  669.59k  7.3 TiB   6.06     38 TiB
> device_health_metrics  10     1   25 MiB      149   76 MiB      0     38 TiB
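
For the question about raising the rbd pool to 4096 or 8192 PGs, a quick back-of-the-envelope calculation may help. This sketch takes the PG counts and the OSD count from the stats above and assumes 3 replicas per pool; that replica count is not stated in the mail and is only inferred from rbd showing 76 TiB stored against 222 TiB used.

```python
def avg_pg_replicas_per_osd(pool_pgs, replicas, num_osds):
    """Average number of PG replicas each OSD holds, ignoring disk size."""
    return sum(pool_pgs.values()) * replicas / num_osds


# PG counts per pool and OSD count from the cluster stats above.
pools = {"isos": 64, "rbd": 2048, "archive": 128, "device_health_metrics": 1}
REPLICAS = 3  # assumption: inferred from used/stored, not stated in the mail
NUM_OSDS = 149

for rbd_pgs in (2048, 4096, 8192):
    trial = dict(pools, rbd=rbd_pgs)  # copy with a different rbd pg_num
    avg = avg_pg_replicas_per_osd(trial, REPLICAS, NUM_OSDS)
    print(f"rbd pg_num={rbd_pgs}: ~{avg:.0f} PG replicas per OSD on average")
```

This is only an average. Since PGs are distributed by disk size (~120 on the 8 TB disks versus ~30 on the 2 TB disks today), the largest disks would end up well above the average after any increase.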
>
>
>
> --
> The "UTF-8 problems" self-help group is meeting in the large hall this time,
> as an exception.
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx<mailto:ceph-users@xxxxxxx>
> To unsubscribe send an email to ceph-users-leave@xxxxxxx<mailto:ceph-users-leave@xxxxxxx>