OSD containers lose connectivity after change from Rocky 8.7->9.2

I recently updated one of the hosts (an older Dell PowerEdge R515) in my Ceph Quincy (17.2.6) cluster. I needed to change its IP address, so I removed the host from the cluster (gracefully removed the OSDs and daemons, then removed the host). I also took the opportunity to upgrade the host from Rocky 8.7 to 9.2 before re-joining it to the cluster with cephadm. I zapped the storage, so for all intents and purposes it should have been a completely clean install, and the process went smoothly. I have two other hosts (new Dell PowerEdge R450s) running Rocky 9.2 with no problems. Before the upgrade, the R515 host was well-behaved and unremarkable.

Our cluster is connected to our internal network and has a 10G private network used as the interconnect between the nodes.
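
In Ceph terms that should be the usual public_network / cluster_network split; the effective settings can be double-checked with something like the following (or from ceph.conf, if they are only set there):

    ceph config get mon public_network
    ceph config get osd cluster_network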

Since the upgrade, the OSDs on the R515 host regularly drop out of the cluster after a period ranging from minutes to hours (usually a few hours). I can restart the OSDs and they immediately reconnect and rejoin the cluster, which returns to HEALTH_OK after a short period. The OSD logs show:

Aug 15 06:53:57 ceph99.cecnet.gmu.edu ceph-osd[193725]: log_channel(cluster) log [WRN] : Monitor daemon marked osd.9 down, but it is still running
Aug 15 06:53:57 ceph99.cecnet.gmu.edu ceph-osd[193725]: log_channel(cluster) log [DBG] : map e17993 wrongly marked me down at e17988
Aug 15 06:53:57 ceph99.cecnet.gmu.edu ceph-osd[193725]: osd.9 17993 start_waiting_for_healthy
Aug 15 06:53:57 ceph99.cecnet.gmu.edu ceph-osd[193725]: osd.9 pg_epoch: 17988 pg[16.f( v 16315'2765702 (15197'2757704,16315'2765702] local-lis/les=17902/17903 n=188 ec=211/211 lis/c=17902/17902 les/c/f=17903/17903/0 sis=17988 pruub=8.000660896s) [23,18] r=-1 lpr=1798>
Aug 15 06:53:57 ceph99.cecnet.gmu.edu ceph-osd[193725]: osd.9 17993 is_healthy false -- only 0/10 up peers (less than 33%)
Aug 15 06:53:57 ceph99.cecnet.gmu.edu ceph-osd[193725]: osd.9 17993 not healthy; waiting to boot
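
For what it's worth, the restart that brings them back is nothing exotic; it is just the usual orchestrator restart of the affected daemons, roughly:

    # restart the four OSDs on ceph99 (run from a node with the admin keyring)
    for id in 9 10 11 12; do ceph orch daemon restart osd.$id; done

    # watch them rejoin; the "down" list empties and health clears once peering/backfill settles
    ceph osd tree down
    ceph -s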

The MON logs show:

Aug 15 06:53:53 os-storage-1.cecnet.gmu.edu ceph-mon[4684]: osd.9 reported immediately failed by osd.3
Aug 15 06:53:53 os-storage-1.cecnet.gmu.edu ceph-mon[4684]: osd.9 failed (root=default,pod=openstack,host=ceph99) (connection refused reported by osd.3)
Aug 15 06:53:53 os-storage-1.cecnet.gmu.edu ceph-mon[4684]: osd.11 reported immediately failed by osd.3
Aug 15 06:53:53 os-storage-1.cecnet.gmu.edu ceph-mon[4684]: osd.11 failed (root=default,pod=openstack,host=ceph99) (connection refused reported by osd.3)
Aug 15 06:53:53 os-storage-1.cecnet.gmu.edu ceph-mon[4684]: osd.12 reported immediately failed by osd.3
Aug 15 06:53:53 os-storage-1.cecnet.gmu.edu ceph-mon[4684]: osd.12 failed (root=default,pod=openstack,host=ceph99) (connection refused reported by osd.3)
Aug 15 06:53:53 os-storage-1.cecnet.gmu.edu ceph-mon[4684]: osd.10 reported immediately failed by osd.3
Aug 15 06:53:53 os-storage-1.cecnet.gmu.edu ceph-mon[4684]: osd.10 failed (root=default,pod=openstack,host=ceph99) (connection refused reported by osd.3)
Aug 15 06:53:53 os-storage-1.cecnet.gmu.edu ceph-mon[4684]: osd.9 reported immediately failed by osd.3
Aug 15 06:53:53 os-storage-1.cecnet.gmu.edu ceph-mon[4684]: osd.10 reported immediately failed by osd.3
Aug 15 06:53:53 os-storage-1.cecnet.gmu.edu ceph-mon[4684]: osd.11 reported immediately failed by osd.3
Aug 15 06:53:53 os-storage-1.cecnet.gmu.edu ceph-mon[4684]: osd.12 reported immediately failed by osd.3
Aug 15 06:53:54 os-storage-1.cecnet.gmu.edu ceph-mon[4684]: mon.os-storage-1@1(peon).osd e17989 e17989: 26 total, 23 up, 26 in
Aug 15 06:53:54 os-storage-1.cecnet.gmu.edu ceph-mon[4684]: mon.os-storage-1@1(peon).osd e17989 _set_new_cache_sizes cache_size:1020054731 inc_alloc: 339738624 full_alloc: 356515840 kv_alloc: 318767104
Aug 15 06:53:54 os-storage-1.cecnet.gmu.edu ceph-mon[4684]: 15.13 scrub starts
Aug 15 06:53:54 os-storage-1.cecnet.gmu.edu ceph-mon[4684]: Health check failed: 4 osds down (OSD_DOWN)
Aug 15 06:53:54 os-storage-1.cecnet.gmu.edu ceph-mon[4684]: Health check failed: 2 hosts (4 osds) down (OSD_HOST_DOWN)
Aug 15 06:53:54 os-storage-1.cecnet.gmu.edu ceph-mon[4684]: osdmap e17988: 26 total, 22 up, 26 in
Aug 15 06:53:54 os-storage-1.cecnet.gmu.edu ceph-mon[4684]: osd.11 marked itself dead as of e17988
Aug 15 06:53:54 os-storage-1.cecnet.gmu.edu ceph-mon[4684]: from='mgr.16700599 10.192.126.85:0/2567473893' entity='mgr.os-storage.cecnet.gmu.edu.mouglb' cmd=[{"prefix": "config dump", "format": "json"}]: dispatch
Aug 15 06:53:54 os-storage-1.cecnet.gmu.edu ceph-mon[4684]: pgmap v121406: 374 pgs: 3 peering, 20 stale+active+clean, 3 active+remapped+backfilling, 348 active+clean; 2.1 TiB data, 4.1 TiB used, 76 TiB / 80 TiB avail; 102 B/s rd, 338 KiB/s wr, 6 op/s; 36715/1268774 objects >
Aug 15 06:53:54 os-storage-1.cecnet.gmu.edu ceph-mon[4684]: Health check cleared: OSD_HOST_DOWN (was: 2 hosts (4 osds) down)
Aug 15 06:53:54 os-storage-1.cecnet.gmu.edu ceph-mon[4684]: from='mgr.16700599 10.192.126.85:0/2567473893' entity='mgr.os-storage.cecnet.gmu.edu.mouglb' cmd=[{"prefix": "osd metadata", "id": 11}]: dispatch
Aug 15 06:53:54 os-storage-1.cecnet.gmu.edu ceph-mon[4684]: osd.11 [v2:10.192.126.76:6808/3768579449,v1:10.192.126.76:6809/3768579449] boot
Aug 15 06:53:54 os-storage-1.cecnet.gmu.edu ceph-mon[4684]: osdmap e17989: 26 total, 23 up, 26 in
Aug 15 06:53:54 os-storage-1.cecnet.gmu.edu ceph-mon[4684]: osd.12 marked itself dead as of e17989
Aug 15 06:53:54 os-storage-1.cecnet.gmu.edu ceph-mon[4684]: osd.11 reported immediately failed by osd.24
Aug 15 06:53:54 os-storage-1.cecnet.gmu.edu ceph-mon[4684]: osd.11 failed (root=default,pod=openstack,host=ceph99) (connection refused reported by osd.24)
Aug 15 06:53:54 os-storage-1.cecnet.gmu.edu ceph-mon[4684]: osd.11 reported immediately failed by osd.5
Aug 15 06:53:54 os-storage-1.cecnet.gmu.edu ceph-mon[4684]: osd.11 reported immediately failed by osd.3
Aug 15 06:53:55 os-storage-1.cecnet.gmu.edu ceph-mon[4684]: mon.os-storage-1@1(peon).osd e17990 e17990: 26 total, 23 up, 26 in
Aug 15 06:53:55 os-storage-1.cecnet.gmu.edu ceph-mon[4684]: Monitor daemon marked osd.11 down, but it is still running
Aug 15 06:53:55 os-storage-1.cecnet.gmu.edu ceph-mon[4684]: map e17988 wrongly marked me down at e17988
Aug 15 06:53:55 os-storage-1.cecnet.gmu.edu ceph-mon[4684]: 25.16 continuing backfill to osd.22 from (17451'16681348,17987'16685608] 25:6977376b:::rbd_data.4040a64151e028.000000000000a3c3:head to 17987'16685608
Aug 15 06:53:55 os-storage-1.cecnet.gmu.edu ceph-mon[4684]: Monitor daemon marked osd.12 down, but it is still running

The system logs show no problems around the time of the drop.
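(By "system logs" I mean journalctl on ceph99 around the time of each drop, along the lines of:

    journalctl -p warning --since "06:45" --until "07:00"

adjusted to the window in question.)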

My best guess at the moment is that it's a networking issue with podman, but I've found no evidence of a problem. Output from ethtool doesn't show any errors or drops:

[root@ceph99 ~]# ethtool -S enp3s0f0 | egrep 'error|drop|timeout'
     rx_errors: 0
     tx_errors: 0
     rx_dropped: 0
     tx_dropped: 0
     rx_over_errors: 0
     rx_crc_errors: 0
     rx_frame_errors: 0
     rx_fifo_errors: 0
     rx_missed_errors: 0
     tx_aborted_errors: 0
     tx_carrier_errors: 0
     tx_fifo_errors: 0
     tx_heartbeat_errors: 0
     tx_timeout_count: 0
     rx_length_errors: 0
     rx_long_length_errors: 0
     rx_short_length_errors: 0
     rx_csum_offload_errors: 0
     tx_hwtstamp_timeouts: 0
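
On the podman theory: as I understand it, cephadm runs the OSD containers with host networking (--net=host), so podman's own network stack shouldn't really be in the data path. That can be sanity-checked with something like the following, where <osd-container-name> is whatever the first command prints:

    podman ps --format '{{.Names}}' | grep osd
    podman inspect --format '{{.HostConfig.NetworkMode}}' <osd-container-name>

which should report "host".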

So far I have:
- Updated the Intel X550-T NIC's firmware and driver (ixgbe) to the latest from Intel
- Reverted the kernel, podman and NetworkManager packages to match the other Rocky 9.2 hosts that are working (version comparison sketched below this list)
- Reverted the Intel driver to the ixgbe included in the kernel
- Sworn, cried and pleaded with the gods to spare me further anguish, to no avail.
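
The version matching above was done by comparing the hosts side by side, roughly as follows (exact NVRs omitted; substitute whatever the working hosts report):

    # on a working R450
    ethtool -i enp3s0f0                  # driver and firmware versions
    rpm -q kernel podman NetworkManager

    # on ceph99, pull in the matching builds
    dnf downgrade podman-<NVR> NetworkManager-<NVR>
    dnf install kernel-<NVR>             # older kernels install alongside the current one
    grubby --set-default /boot/vmlinuz-<kernel-version>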

Other, possibly relevant, information:
- There are no MON or MGR daemons on the host; just node exporter, crash, alerter and promtail.
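
If the exact daemon placement matters, it can be listed per host with:

    ceph orch ps ceph99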

Is there anything else I should be looking at before I remove the host from the cluster and re-install Rocky 8.7 (and hope it works again)? This host was used while we were standing up the cluster and is due to be retired as we repurpose some of our other storage (standalone NFS servers with RAID6) and move them into the cluster.


