OSD_TOO_MANY_REPAIRS on random OSDs causing clients to hang

Hi all,

Over the last 2 weeks we have experienced several OSD_TOO_MANY_REPAIRS errors that we struggle to handle in a non-intrusive manner. Restarting the MDS and the hypervisor that accessed the object in question seems to be the only way we can clear the error so we can repair the PG and recover access. Any pointers on how to handle this issue more gently than rebooting the hypervisor and failing the MDS would be welcome!


The problem seems to only affect one specific pool (id 42) that is used for cephfs_data; it is our second cephfs data pool in this cluster. The data in the pool is served over Samba from an LXC container that has the cephfs filesystem bind-mounted from the hypervisor.
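
For context, the bind mount into the Samba container is just a plain Proxmox mount point, roughly along these lines (container ID and paths here are illustrative, not our real ones):

# excerpt from /etc/pve/lxc/<vmid>.conf -- paths are illustrative
mp0: /mnt/pve/cephfs2/shares,mp=/srv/shares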

Ceph was recently updated to version 16.2.11 (pacific) -- the kernel version is 5.13.19-6-pve on the OSD hosts/Samba containers and 5.19.17-2-pve on the MDS hosts.


The following warnings are issued:
$ ceph health detail
HEALTH_WARN 1 clients failing to respond to capability release; Too many repaired reads on 1 OSDs; Degraded data redundancy: 1/2648430090 objects degraded (0.000%), 1 pg degraded; 1 slow ops, oldest one blocked for 608 sec, osd.34 has slow ops
[WRN] MDS_CLIENT_LATE_RELEASE: 1 clients failing to respond to capability release
    mds.hk-cephnode-65(mds.0): Client hk-cephnode-56 failing to respond to capability release client_id: 9534859837
[WRN] OSD_TOO_MANY_REPAIRS: Too many repaired reads on 1 OSDs
    osd.34 had 9936 reads repaired
[WRN] PG_DEGRADED: Degraded data redundancy: 1/2648430090 objects degraded (0.000%), 1 pg degraded
    pg 42.e2 is active+recovering+degraded+repair, acting [34,275,284]
[WRN] SLOW_OPS: 1 slow ops, oldest one blocked for 608 sec, osd.34 has slow ops



The logs for OSD.34 are flooded with these messages:
root@hk-cephnode-53:~# tail /var/log/ceph/ceph-osd.34.log
2023-04-26T11:41:00.760+0200 7f03921f3700 -1 log_channel(cluster) log [ERR] : 42.e2 missing primary copy of 42:4703efac:::10003d86a99.00000001:head, will try copies on 275,284
2023-04-26T11:41:00.784+0200 7f03921f3700 -1 log_channel(cluster) log [ERR] : 42.e2 full-object read crc 0xebd673ed != expected 0xffffffff on 42:4703efac:::10003d86a99.00000001:head
2023-04-26T11:41:00.812+0200 7f03921f3700 -1 log_channel(cluster) log [ERR] : 42.e2 full-object read crc 0xebd673ed != expected 0xffffffff on 42:4703efac:::10003d86a99.00000001:head
2023-04-26T11:41:00.812+0200 7f03921f3700 -1 log_channel(cluster) log [ERR] : 42.e2 missing primary copy of 42:4703efac:::10003d86a99.00000001:head, will try copies on 275,284
2023-04-26T11:41:00.824+0200 7f03a821f700 -1 osd.34 1352563 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.9534859837.0:20412906 42.e2 42:4703efac:::10003d86a99.00000001:head [read 0~1048576 [307@0] out=1048576b] snapc 0=[] RETRY=5 ondisk+retry+read+known_if_redirected e1352553)
2023-04-26T11:41:00.824+0200 7f03a821f700  0 log_channel(cluster) log [WRN] : 1 slow requests (by type [ 'delayed' : 1 ] most affected pool [ 'qa-cephfs_data' : 1 ])
2023-04-26T11:41:00.840+0200 7f03921f3700 -1 log_channel(cluster) log [ERR] : 42.e2 full-object read crc 0xebd673ed != expected 0xffffffff on 42:4703efac:::10003d86a99.00000001:head
2023-04-26T11:41:00.864+0200 7f03921f3700 -1 log_channel(cluster) log [ERR] : 42.e2 full-object read crc 0xebd673ed != expected 0xffffffff on 42:4703efac:::10003d86a99.00000001:head
2023-04-26T11:41:00.864+0200 7f03921f3700 -1 log_channel(cluster) log [ERR] : 42.e2 missing primary copy of 42:4703efac:::10003d86a99.00000001:head, will try copies on 275,284
2023-04-26T11:41:00.888+0200 7f03921f3700 -1 log_channel(cluster) log [ERR] : 42.e2 full-object read crc 0xebd673ed != expected 0xffffffff on 42:4703efac:::10003d86a99.00000001:head
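
The client id in the slow op above matches the client flagged by MDS_CLIENT_LATE_RELEASE; we cross-check it against the MDS session list roughly like this (the grep context values are arbitrary):

ceph tell mds.hk-cephnode-65 session ls | grep -B2 -A8 9534859837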



We have tried the following (concrete command sketches follow after this list):
 - Restarting the OSD in question clears the error for a few seconds, but then we also get OSD_TOO_MANY_REPAIRS on other OSDs whose PGs hold the object with blocked I/O.

 - Trying to repair the PG seems to restart every 10 seconds without actually making any progress. (Is there a way to check repair progress?)

 - Restarting the MDS and the hypervisor clears the error (the hypervisor hangs for several minutes before timing out). However, if the object is requested again the error reoccurs. If we don't access the object we are eventually able to repair the PG.

 - Occasionally, setting the primary-affinity to 0 for the primary OSD in the PG clears the error after restarting all affected OSDs, and we are then able to repair the PG (unless the object is accessed during recovery); access to the object is OK afterwards.

 - Finding and deleting the file pointing to the object (10003d86a99) and restarting the OSDs clears the error.

 - Killing the Samba process that accessed the object does not clear the SLOW_OPS, and hence the error persists.

 - Normal scrubs have revealed a handful of other PGs in the same pool (id 42) that are damaged; those repairs run without any problems.

 - We believe the MDS_CLIENT_LATE_RELEASE and SLOW_OPS errors are symptoms of the blocked I/O.

 - We have verified that there are no SMART errors of any kind on any of our disks in the cluster.

 - If we don't handle this issue rather promptly, we experience a full lockup of the Samba container, and rebooting the hypervisor seems to be the only cure. Trying to force-unmount and remount cephfs does not help.
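
For reference, the steps above boil down to roughly these commands (OSD/PG IDs are from the current incident; /mnt/cephfs is an illustrative mount path, and the find trick relies on the cephfs data object name prefix being the file's inode number in hex):

# restart the primary OSD of the affected PG
systemctl restart ceph-osd@34

# trigger a repair and inspect the PG state
ceph pg repair 42.e2
ceph pg 42.e2 query

# stop osd.34 from acting as primary for its PGs
ceph osd primary-affinity osd.34 0

# locate the file behind object 10003d86a99.00000001 on a mounted cephfs
find /mnt/cephfs -inum $((0x10003d86a99))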



This has now happened 6-7 times over the last 2 weeks, and we suspect that a hardware or memory error on one of our nodes may have caused the objects to be written to disk with bad checksums. We have replaced the mainboard in the node we think is the most likely culprit and are currently testing its memory. Can these random checksum errors be caused by anything else that we should investigate? It's a bit suspicious that the error only occurs on one specific pool -- if the mainboard were to blame, shouldn't we have seen these errors in more pools by now?

Regardless, we are stumped by how Ceph handles this error. Checksum errors should not leave clients hanging like this, should they? Should this be considered a bug? Is there a way to cancel the blocking I/O request to clear the error? And why is the PG flapping between active+recovering+degraded+repair, active+recovering+repair and active+clean+repair every few seconds?
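
For what it's worth, we can see the blocked request via the primary OSD's admin socket, but have not found a way to cancel it:

# run on the host where osd.34 lives
ceph daemon osd.34 dump_blocked_ops
ceph daemon osd.34 dump_ops_in_flight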

Any ideas on how to gracefully battle this problem? Thanks!


--thomas


Thomas Hukkelberg
thomas@xxxxxxxxxxxxxxxxx


_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


