Hi,
we're facing an issue during upgrades (and sometimes server reboots).
It appears to occur when at least one of the MONs has to do a full
sync, and I'm wondering if the upgrade procedure could be improved in
that regard; I'll come back to that later. First, I'll try to
summarize the events.
We upgraded to the latest Pacific release (16.2.15) last week. Note
that the MON filesystems are on HDDs (we're planning to move to flash;
we already noticed issues during disaster recovery on HDDs last year).
According to the logs the MGRs and MONs were upgraded successfully and
a quorum was formed at:
2024-04-30T12:57:22.347182+0000 mon.ndeceph03 (mon.0) 1208046 :
cluster [INF] Health check cleared: MON_DOWN (was: 1/3 mons down,
quorum ndeceph03,ndeceph01)
A few minutes later ceph started upgrading the OSDs, the first one at:
2024-04-30T13:00:42.733528+0000 mon.ndeceph03 (mon.0) 101 : cluster
[INF] osd.25 marked itself down and dead
The OSDs on ndeceph01 were upgraded one by one:
2024-04-30T13:01:46.517+0000 7fbd78a31700 0 [cephadm INFO
cephadm.upgrade] Upgrade: Updating osd.19 (5/7)
2024-04-30T13:01:46.517+0000 7fbd78a31700 0
log_channel(cephadm) log [INF] : Upgrade: Updating osd.19 (5/7)
2024-04-30T13:02:24.800+0000 7fbd78a31700 0 [cephadm INFO
cephadm.upgrade] Upgrade: Updating osd.22 (6/7)
2024-04-30T13:02:24.800+0000 7fbd78a31700 0
log_channel(cephadm) log [INF] : Upgrade: Updating osd.22 (6/7)
2024-04-30T13:02:48.220+0000 7fbd78a31700 0 [cephadm INFO
cephadm.upgrade] Upgrade: Updating osd.29 (7/7)
2024-04-30T13:02:48.220+0000 7fbd78a31700 0
log_channel(cephadm) log [INF] : Upgrade: Updating osd.29 (7/7)
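For anyone following along, that progress can also be watched live
with the orchestrator, roughly like this:

ceph orch upgrade status   # target image and which daemons/services are done
ceph -W cephadm            # follow the cephadm log channel live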
But the mon service on ndeceph02 (which is also an OSD server) was
still syncing (for around 6 minutes):
2024-04-30T13:02:33.124+0000 7f1c24444700 1
mon.ndeceph02@2(synchronizing) e58 handle_auth_request failed to
assign global_id
2024-04-30T13:08:30.123+0000 7f1c24444700 1
mon.ndeceph02@2(synchronizing) e58 handle_auth_request failed to
assign global_id
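For reference, the sync state can be watched on the mon host itself
via the admin socket, roughly like this (the daemon name is of course
specific to our cluster):

cephadm enter --name mon.ndeceph02     # drop into the mon container
ceph daemon mon.ndeceph02 mon_status   # "state" stays at "synchronizing"
                                       # until the full sync has finished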
All HDD OSDs on ndeceph02 (not upgraded yet) were complaining about
timeouts and apparently went through the boot process multiple times
(I'm wondering why the SSD OSDs didn't complain, though):
2024-04-30T13:03:54.374+0000 7f2e2d1ce700 1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f2e1399b700' had timed out after 15.000000954s
2024-04-30T13:03:54.374+0000 7f2e2d1ce700 1 osd.0 367411 is_healthy
false -- internal heartbeat failed
2024-04-30T13:03:54.374+0000 7f2e2d1ce700 1 osd.0 367411 not healthy;
waiting to boot
2024-04-30T13:03:55.330+0000 7f2e2d1ce700 1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f2e1399b700' had timed out after 15.000000954s
2024-04-30T13:03:55.330+0000 7f2e2d1ce700 1 osd.0 367411 is_healthy
false -- internal heartbeat failed
2024-04-30T13:03:55.330+0000 7f2e2d1ce700 1 osd.0 367411 not healthy;
waiting to boot
...
2024-04-30T13:03:59.322+0000 7f2e201b4700 1 osd.0 367414 state:
booting -> active
2024-04-30T13:05:22.184+0000 7f2e201b4700 1 osd.0 367433 state:
booting -> active
2024-04-30T13:06:39.602+0000 7f2e201b4700 1 osd.0 367447 state:
booting -> active
2024-04-30T13:08:02.612+0000 7f2e201b4700 1 osd.0 367454 state:
booting -> active
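If I'm not mistaken, the 15s in those heartbeat messages is
osd_op_thread_timeout (default 15). Raising it would only be a
band-aid and wouldn't address the IO contention, but for completeness
this is roughly how one could inspect/bump it temporarily (the value
is picked arbitrarily):

ceph config get osd osd_op_thread_timeout      # default 15
ceph config set osd osd_op_thread_timeout 30   # band-aid only, revert afterwards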
During this syncing period the disk utilization of the OS filesystem
was at 100%, but ceph kept upgrading other OSD daemons. There are 3
main hosts, failure domain is host, replicated pools with min_size 2,
size 3.
If the OSDs from one host are struggling (because of disk IO on the
filesystem, apparently) and ceph keeps upgrading others, we get
inactive PGs.
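To spell out the math: with size 3, min_size 2 and failure domain
host, each PG has exactly one copy per host. So as soon as the copies
on two of the three hosts are unavailable at the same time (one host's
OSDs flapping, another host's OSD stopped for the upgrade), the PG
drops below min_size and goes inactive. That's easy to confirm while
it happens, roughly like this (the PG id is just an example):

ceph osd pool ls detail        # confirm size 3 / min_size 2 per pool
ceph pg dump_stuck inactive    # list the PGs that went inactive
ceph pg map 2.1f               # example PG id, shows which OSDs hold its copies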
So what I'm wondering is: how does the orchestrator decide whether
it's ok to stop one OSD ('ceph osd ok-to-stop <ID>' would be the
manual check) while others are obviously not healthy, so that stopping
a healthy one causes inactive PGs? The cluster did notice slow
requests and reported messages like these:
2024-04-30T13:02:58.867499+0000 osd.28 (osd.28) 10529 : cluster [WRN]
Monitor daemon marked osd.28 down, but it is still running
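For comparison, this is what I would check manually before stopping a
daemon in that situation (ids taken from the upgrade log above):

ceph osd ok-to-stop 22      # reports whether stopping osd.22 would leave
                            # all PGs available
ceph osd ok-to-stop 22 29   # several ids can be checked at once

Presumably the orchestrator relies on the same check; maybe the
flapping (but nominally 'up') OSDs on ndeceph02 still looked healthy
enough at the moment the check ran.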
If some OSDs clearly aren't healthy, I would expect the orchestrator
to pause the upgrade. After the mon full sync completed, it started
upgrading the OSDs on ndeceph02 as well, but the inactive PGs were
only resolved after several more minutes when almost all OSDs from
that tree branch had been upgraded:
2024-04-30T13:16:17.536359+0000 mon.ndeceph03 (mon.0) 1461 : cluster
[INF] Health check cleared: PG_AVAILABILITY (was: Reduced data
availability: 2 pgs inactive, 11 pgs peering)
During a planned upgrade this issue can be mitigated via staggered
upgrades (now that I know what the cause is): upgrade MGRs and MONs
first, wait until everything has settled, and only then continue with
the OSDs.
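Roughly what I have in mind for the next planned upgrade (a sketch; as
far as I know the staggered upgrade options are available in recent
Pacific releases, and the image tag is just the default one for
16.2.15):

ceph orch upgrade start --image quay.io/ceph/ceph:v16.2.15 --daemon-types mgr,mon
# wait until the mons are back in quorum and any full sync has finished, then:
ceph orch upgrade start --image quay.io/ceph/ceph:v16.2.15 --daemon-types osd
# optionally add --hosts ndeceph01 (etc.) to go through the OSD hosts one by one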
But after a reboot there's no way to control that, of course. I helped
a customer with a mon sync issue last year, so we might be able to
improve things a bit until we have flash disks.
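The knobs I would look at for the mon sync itself (just a sketch, not
claiming these are the right values for everyone; defaults quoted from
memory) are the sync payload settings, so the syncing mon pulls
smaller chunks at a time:

ceph config set mon mon_sync_max_payload_size 4096   # default 1048576 (1 MiB)
ceph config set mon mon_sync_max_payload_keys 1000   # default 2000, I believe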
I don't know how newer ceph versions behave in this regard during
upgrades, but I suspect it wouldn't make a difference right now. Is
there any way to improve the orchestrator so it considers unhealthy
OSDs before stopping healthy ones?
Thanks!
Eugen
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx