Re: 1 OSD laggy: log_latency_fn slow; heartbeat_map is_healthy had timed out after 15

Hi Michel,

Are you sure there isn't a hardware problem with the disk? E.g. maybe you
have SCSI timeouts in dmesg or high I/O utilization (%util) in iostat?
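For reference, a quick health check along those lines could look like this (a
sketch; the device name /dev/sdX is a placeholder for whichever disk backs the
laggy OSD):

```shell
# Look for SCSI/ATA errors, resets, or I/O errors in the kernel log
# (sdX is hypothetical -- substitute the device behind the sick OSD)
dmesg -T | grep -iE 'sdX|scsi|ata|i/o error' | tail -n 50

# Watch extended device stats for a few intervals; one device sitting at
# ~100 %util with low throughput while its siblings are idle is suspicious
iostat -x 5 3

# SMART health and attributes, if smartmontools is installed
smartctl -H -A /dev/sdX
```

A disk can be failing without logging a single I/O error, so the SMART
attributes (reallocated/pending sectors, command timeouts) are worth a look
even when dmesg is clean.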

Anyway I don't think there's a big risk related to draining and stopping
the osd. Just consider this a disk failure, which can happen at any time
anyway.

Start by marking it out. If there are still too many slow requests or laggy
PGs, try setting its primary affinity to zero. And if that still doesn't
work, I wouldn't hesitate to stop that sick osd so objects backfill from the
replicas.
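As a sketch, the escalation above would be (assuming the sick OSD is osd.10,
the one named later in this thread):

```shell
# 1. Mark the OSD out so data starts migrating off it
#    (the daemon keeps running and still serves reads during backfill)
ceph osd out osd.10

# 2. If laggy PGs persist, stop routing client I/O through it as primary
ceph osd primary-affinity osd.10 0

# 3. Last resort: stop the daemon so its PGs recover from the replicas
systemctl stop ceph-osd@10
```

The point of ordering it this way is that each step is less disruptive than
the next: marking out and lowering primary affinity keep the OSD's copies
available, while stopping the daemon forces recovery from the other two
replicas.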

(We had a somewhat similar issue today, btw .. some brand of SSDs
occasionally hangs IO across a whole SCSI bus when failing. Stopping the
osd revives the rest of the disks on the box).

Cheers, Dan



On Sun, Oct 16, 2022, 22:08 Michel Jouvin <jouvin@xxxxxxxxxxxx> wrote:

> Hi,
>
> We have a production cluster made of 12 OSD servers with 16 OSDs each
> (all the same HW) which has been running fine for 5 years (initially
> installed with Luminous). It has been running Octopus (15.2.16) for a
> year and was recently upgraded to 15.2.17 (one week before the problem
> started, but the problem doesn't seem to be linked to this upgrade).
> Since the beginning of October, we started to see PGs in state
> "active+laggy" and slow requests always related to the same OSD, and
> looking at its log we saw "log_latency_fn slow" messages. There was no
> disk error logged in any system log file. Restarting the OSD didn't
> really help, but no functional problems were seen.
>
> Looking again at the problem in the last few days, we saw that the
> cluster was in HEALTH_WARN state because several PGs had not been
> deep-scrubbed in time. In the logs we also saw (but maybe we just missed
> them initially) "heartbeat_map is_healthy 'OSD::osd_op_tp thread...' had
> timed out after 15" messages. This number increased day after day and is
> now almost 3 times the number of PGs hosted by the laggy OSD (despite
> hundreds of deep scrubs running successfully; the cluster has 4297 PGs).
> It seems that the list contains all PGs that have a replica on the laggy
> OSD (all the pools use 3 replicas, no EC). We confirmed that no disk
> error was detected in the system.
>
> Today we restarted the server hosting this OSD, without much hope. It
> didn't help, and the same OSD (and only this one) continues to have the
> same problem. In addition to the messages mentioned, the admin socket
> for this OSD became unresponsive: despite commands being executed (see
> below), they were not returning in a decent amount of time (several
> minutes).
>
> As the OSD's RocksDB had probably never been compacted, we decided to
> compact the laggy OSD. Although "ceph tell osd.10 compact" never
> returned (it was killed after a few hours, as the OSD had been marked
> down for a few seconds), the compaction started and lasted ~5 hours...
> and completed successfully. But the only improvement seen after the
> compaction was that the admin socket is now responsive (though a bit
> slow). The messages about log_latency_fn and heartbeat_map are still
> present (and frequent), and the deep scrubs are still blocked.
>
> We are looking for advice on what to do to fix this issue. We had in
> mind to stop this OSD, zap it and reinstall it, but we worry it may be
> risky to do this with an OSD that has not been deep-scrubbed for a long
> time. And we are sure there is a better solution! Understanding the
> cause would be a much better approach!
>
> Thanks in advance for any help. Best regards,
>
> Michel
>
>
>
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>


