Hi,
we're facing an issue during upgrades (and sometimes server reboots).
It appears to occur when at least one of the MONs has to do a full
sync, and I'm wondering if the upgrade procedure could be improved in
that regard; I'll come back to that later. First, I'll try to
summarize the events.
We upgraded to the latest Pacific release (16.2.15) last week. Note
that the MON filesystems are on HDDs (we're planning to move to flash;
we already noticed issues during disaster recovery on HDDs last year).
According to the logs the MGRs and MONs were upgraded successfully and
a quorum was formed at:
2024-04-30T12:57:22.347182+0000 mon.ndeceph03 (mon.0) 1208046 :
cluster [INF] Health check cleared: MON_DOWN (was: 1/3 mons down,
quorum ndeceph03,ndeceph01)
A few minutes later ceph started upgrading the OSDs, the first one at:
2024-04-30T13:00:42.733528+0000 mon.ndeceph03 (mon.0) 101 : cluster
[INF] osd.25 marked itself down and dead
The OSDs on ndeceph01 were upgraded one by one:
2024-04-30T13:01:46.517+0000 7fbd78a31700 0 [cephadm INFO
cephadm.upgrade] Upgrade: Updating osd.19 (5/7)
2024-04-30T13:01:46.517+0000 7fbd78a31700 0
log_channel(cephadm) log [INF] : Upgrade: Updating osd.19 (5/7)
2024-04-30T13:02:24.800+0000 7fbd78a31700 0 [cephadm INFO
cephadm.upgrade] Upgrade: Updating osd.22 (6/7)
2024-04-30T13:02:24.800+0000 7fbd78a31700 0
log_channel(cephadm) log [INF] : Upgrade: Updating osd.22 (6/7)
2024-04-30T13:02:48.220+0000 7fbd78a31700 0 [cephadm INFO
cephadm.upgrade] Upgrade: Updating osd.29 (7/7)
2024-04-30T13:02:48.220+0000 7fbd78a31700 0
log_channel(cephadm) log [INF] : Upgrade: Updating osd.29 (7/7)
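For anyone following along, that progress can also be watched live
with the orchestrator, roughly like this:

ceph orch upgrade status   # target image and which daemons/services are done
ceph -W cephadm            # follow the cephadm log channel live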
But the mon service on ndeceph02 (which is also an OSD server) was
still syncing (for around 6 minutes):
2024-04-30T13:02:33.124+0000 7f1c24444700 1
mon.ndeceph02@2(synchronizing) e58 handle_auth_request failed to
assign global_id
2024-04-30T13:08:30.123+0000 7f1c24444700 1
mon.ndeceph02@2(synchronizing) e58 handle_auth_request failed to
assign global_id
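For reference, the sync state can be watched on the mon host itself
via the admin socket, roughly like this (the daemon name is of course
specific to our cluster):

cephadm enter --name mon.ndeceph02     # drop into the mon container
ceph daemon mon.ndeceph02 mon_status   # "state" stays at "synchronizing"
                                       # until the full sync has finished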
All HDD OSDs on ndeceph02 (not upgraded yet) were complaining about
timeouts and apparently went through the boot process multiple times
(I'm wondering why the SSD OSDs didn't complain, though):
2024-04-30T13:03:54.374+0000 7f2e2d1ce700 1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f2e1399b700' had timed out after 15.000000954s
2024-04-30T13:03:54.374+0000 7f2e2d1ce700 1 osd.0 367411 is_healthy
false -- internal heartbeat failed
2024-04-30T13:03:54.374+0000 7f2e2d1ce700 1 osd.0 367411 not healthy;
waiting to boot
2024-04-30T13:03:55.330+0000 7f2e2d1ce700 1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f2e1399b700' had timed out after 15.000000954s
2024-04-30T13:03:55.330+0000 7f2e2d1ce700 1 osd.0 367411 is_healthy
false -- internal heartbeat failed
2024-04-30T13:03:55.330+0000 7f2e2d1ce700 1 osd.0 367411 not healthy;
waiting to boot
...
2024-04-30T13:03:59.322+0000 7f2e201b4700 1 osd.0 367414 state:
booting -> active
2024-04-30T13:05:22.184+0000 7f2e201b4700 1 osd.0 367433 state:
booting -> active
2024-04-30T13:06:39.602+0000 7f2e201b4700 1 osd.0 367447 state:
booting -> active
2024-04-30T13:08:02.612+0000 7f2e201b4700 1 osd.0 367454 state:
booting -> active
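If I'm not mistaken, the 15s in those heartbeat messages is
osd_op_thread_timeout (default 15). Raising it would only be a
band-aid and wouldn't address the IO contention, but for completeness
this is roughly how one could inspect/bump it temporarily (the value
is picked arbitrarily):

ceph config get osd osd_op_thread_timeout      # default 15
ceph config set osd osd_op_thread_timeout 30   # band-aid only, revert afterwards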
During this syncing period the disk utilization of the OS filesystem
was at 100%, but ceph kept upgrading other OSD daemons. There are 3
main hosts, failure domain is host, replicated pools with min_size 2,
size 3.
If the OSDs from one host are struggling (because of disk IO on the
filesystem, apparently) and ceph keeps upgrading others, we get
inactive PGs.
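To spell out the math: with size 3, min_size 2 and failure domain
host, each PG has exactly one copy per host. So as soon as the copies
on two of the three hosts are unavailable at the same time (one host's
OSDs flapping, another host's OSD stopped for the upgrade), the PG
drops below min_size and goes inactive. That's easy to confirm while
it happens, roughly like this (the PG id is just an example):

ceph osd pool ls detail        # confirm size 3 / min_size 2 per pool
ceph pg dump_stuck inactive    # list the PGs that went inactive
ceph pg map 2.1f               # example PG id, shows which OSDs hold its copies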
So what I'm wondering is: how does the orchestrator decide whether
it's ok to stop one OSD ('ceph osd ok-to-stop <ID>' would be the
manual check) while others are obviously not healthy, so that stopping
a healthy one causes inactive PGs? The cluster did notice slow
requests and reported messages like these:
2024-04-30T13:02:58.867499+0000 osd.28 (osd.28) 10529 : cluster [WRN]
Monitor daemon marked osd.28 down, but it is still running
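For comparison, this is what I would check manually before stopping a
daemon in that situation (ids taken from the upgrade log above):

ceph osd ok-to-stop 22      # reports whether stopping osd.22 would leave
                            # all PGs available
ceph osd ok-to-stop 22 29   # several ids can be checked at once

Presumably the orchestrator relies on the same check; maybe the
flapping (but nominally 'up') OSDs on ndeceph02 still looked healthy
enough at the moment the check ran.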
If some OSDs clearly aren't healthy, I would expect the orchestrator
to pause the upgrade. After the mon full sync completed, it started
upgrading the OSDs on ndeceph02 as well, but the inactive PGs were
only resolved after several more minutes when almost all OSDs from
that tree branch had been upgraded:
2024-04-30T13:16:17.536359+0000 mon.ndeceph03 (mon.0) 1461 : cluster
[INF] Health check cleared: PG_AVAILABILITY (was: Reduced data
availability: 2 pgs inactive, 11 pgs peering)
During a planned upgrade this issue can be mitigated via staggered
upgrades (now that I know what the cause is): upgrade MGRs and MONs
first, wait until everything has settled, and only then continue with
the OSDs.
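Roughly what I have in mind for the next planned upgrade (a sketch; as
far as I know the staggered upgrade options are available in recent
Pacific releases, and the image tag is just the default one for
16.2.15):

ceph orch upgrade start --image quay.io/ceph/ceph:v16.2.15 --daemon-types mgr,mon
# wait until the mons are back in quorum and any full sync has finished, then:
ceph orch upgrade start --image quay.io/ceph/ceph:v16.2.15 --daemon-types osd
# optionally add --hosts ndeceph01 (etc.) to go through the OSD hosts one by one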
But after a reboot there's no way to control that, of course. I helped
a customer with a mon sync issue last year, so we might be able to
improve things a bit until we have flash disks.
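The knobs I would look at for the mon sync itself (just a sketch, not
claiming these are the right values for everyone; defaults quoted from
memory) are the sync payload settings, so the syncing mon pulls
smaller chunks at a time:

ceph config set mon mon_sync_max_payload_size 4096   # default 1048576 (1 MiB)
ceph config set mon mon_sync_max_payload_keys 1000   # default 2000, I believe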
I don't know how newer ceph versions behave in this regard during
upgrades, but I suspect it wouldn't make a difference right now. Is
there any way to improve the orchestrator so it considers unhealthy
OSDs before stopping healthy ones?
Thanks!
Eugen
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx