cephadm upgrade: heartbeat failures not considered

Hi,

we're facing an issue during upgrades (and sometimes server reboots); it appears to occur when (at least) one of the MONs has to do a full sync, and I'm wondering if the upgrade procedure could be improved in that regard, but I'll come back to that later. First, I'll try to summarize the events. We upgraded to the latest Pacific (16.2.15) last week. Note that the filesystems of the MONs are on HDDs (we're planning to move to flash; we already noticed issues during disaster recovery on HDDs last year).

According to the logs, the MGRs and MONs were upgraded successfully and a quorum was formed again at:

2024-04-30T12:57:22.347182+0000 mon.ndeceph03 (mon.0) 1208046 : cluster [INF] Health check cleared: MON_DOWN (was: 1/3 mons down, quorum ndeceph03,ndeceph01)

A few minutes later ceph started upgrading the OSDs, the first one at:

2024-04-30T13:00:42.733528+0000 mon.ndeceph03 (mon.0) 101 : cluster [INF] osd.25 marked itself down and dead

The OSDs on ndeceph01 were upgraded one by one:

2024-04-30T13:01:46.517+0000 7fbd78a31700 0 [cephadm INFO cephadm.upgrade] Upgrade: Updating osd.19 (5/7)
2024-04-30T13:01:46.517+0000 7fbd78a31700 0 log_channel(cephadm) log [INF] : Upgrade: Updating osd.19 (5/7)
2024-04-30T13:02:24.800+0000 7fbd78a31700 0 [cephadm INFO cephadm.upgrade] Upgrade: Updating osd.22 (6/7)
2024-04-30T13:02:24.800+0000 7fbd78a31700 0 log_channel(cephadm) log [INF] : Upgrade: Updating osd.22 (6/7)
2024-04-30T13:02:48.220+0000 7fbd78a31700 0 [cephadm INFO cephadm.upgrade] Upgrade: Updating osd.29 (7/7)
2024-04-30T13:02:48.220+0000 7fbd78a31700 0 log_channel(cephadm) log [INF] : Upgrade: Updating osd.29 (7/7)

But the mon service on ndeceph02 (also an OSD server) was still syncing (for around 6 minutes):

2024-04-30T13:02:33.124+0000 7f1c24444700 1 mon.ndeceph02@2(synchronizing) e58 handle_auth_request failed to assign global_id
2024-04-30T13:08:30.123+0000 7f1c24444700 1 mon.ndeceph02@2(synchronizing) e58 handle_auth_request failed to assign global_id
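Just as a side note, the sync state can be watched via the admin socket on the mon host, something like:

ceph daemon mon.ndeceph02 mon_status | grep state

which keeps showing "state": "synchronizing" until the full sync has finished.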

All HDD OSDs on ndeceph02 (not upgraded yet) were complaining about timeouts and apparently tried to reboot multiple times (I'm wondering why the SSD OSDs didn't complain, though):

2024-04-30T13:03:54.374+0000 7f2e2d1ce700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f2e1399b700' had timed out after 15.000000954s
2024-04-30T13:03:54.374+0000 7f2e2d1ce700 1 osd.0 367411 is_healthy false -- internal heartbeat failed
2024-04-30T13:03:54.374+0000 7f2e2d1ce700 1 osd.0 367411 not healthy; waiting to boot
2024-04-30T13:03:55.330+0000 7f2e2d1ce700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f2e1399b700' had timed out after 15.000000954s
2024-04-30T13:03:55.330+0000 7f2e2d1ce700 1 osd.0 367411 is_healthy false -- internal heartbeat failed
2024-04-30T13:03:55.330+0000 7f2e2d1ce700 1 osd.0 367411 not healthy; waiting to boot
...
2024-04-30T13:03:59.322+0000 7f2e201b4700 1 osd.0 367414 state: booting -> active
2024-04-30T13:05:22.184+0000 7f2e201b4700 1 osd.0 367433 state: booting -> active
2024-04-30T13:06:39.602+0000 7f2e201b4700 1 osd.0 367447 state: booting -> active
2024-04-30T13:08:02.612+0000 7f2e201b4700 1 osd.0 367454 state: booting -> active
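If I'm not mistaken, the 15 second value in those heartbeat messages is simply the default osd_op_thread_timeout, which can be checked with:

ceph config get osd osd_op_thread_timeout

I'm not suggesting to raise it, it just shows that the op worker threads were blocked for longer than expected.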

During this syncing period the disk utilization of the OS filesystem was at 100%, yet ceph kept upgrading other OSD daemons. There are 3 main hosts, the failure domain is host, and the pools are replicated with size 3 and min_size 2. If the OSDs of one host are struggling (apparently because of the disk IO on the OS filesystem) and ceph keeps stopping and upgrading OSDs on another host, we get inactive PGs.
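For reference, the affected PGs can be listed while this is happening with something like:

ceph health detail
ceph pg dump_stuck inactive

which shows the stuck PGs and their acting OSDs.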

So what I'm wondering about is: how does the orchestrator decide whether it's ok to stop one OSD ('ceph osd ok-to-stop <ID>' is the manual command) while others are obviously not healthy, so that stopping another one causes inactive PGs? The cluster did notice the slow requests and reported messages like these:

2024-04-30T13:02:58.867499+0000 osd.28 (osd.28) 10529 : cluster [WRN] Monitor daemon marked osd.28 down, but it is still running
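For comparison, this is the manual check I mean, with OSD IDs just taken from the logs above as an example:

ceph osd ok-to-stop 25
ceph osd ok-to-stop 25 19 22    # several IDs can be checked at once

It reports whether stopping those OSDs would leave PGs without enough active replicas.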

If some OSDs clearly aren't healthy, I would expect the orchestrator to pause the upgrade. After the mon full sync had completed, it started upgrading the OSDs on ndeceph02 as well, but the inactive PGs were only resolved several minutes later, once almost all OSDs in that tree branch had been upgraded:

2024-04-30T13:16:17.536359+0000 mon.ndeceph03 (mon.0) 1461 : cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 2 pgs inactive, 11 pgs peering)
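If someone is watching the upgrade, a manual workaround is of course to pause it until the mon has finished syncing and the PGs are active again:

ceph orch upgrade pause
ceph orch upgrade resume

But I'd prefer the orchestrator to do that check itself.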

During a planned upgrade this issue can be mitigated via staggered upgrades (now that I know what the cause is): upgrade MGRs and MONs first, wait until everything has settled, then continue with the OSDs. But after a reboot there's no way to control that, of course. I helped a customer with a mon sync issue last year, so we might be able to improve things a bit until we have flash disks.
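For the record, such a staggered run would look roughly like this (the image tag is just an example, and the --daemon-types flag requires a cephadm/mgr version that already supports staggered upgrades):

ceph orch upgrade start --image quay.io/ceph/ceph:v16.2.15 --daemon-types mgr,mon
# wait for 'ceph orch upgrade status' to finish and for the cluster to settle
ceph orch upgrade start --image quay.io/ceph/ceph:v16.2.15 --daemon-types osd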

I don't know how newer ceph versions behave in this regard during upgrades, but I suspect it wouldn't make a difference right now. Is there any way to improve the orchestrator so that it considers unhealthy OSDs before stopping healthy ones?

Thanks!
Eugen
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


