Hello,
I have run into an issue while upgrading a Ceph cluster from Hammer to Jewel on CentOS. It's a small cluster with 3 monitor servers and a modest 6 OSDs spread across 3 storage servers.
I've successfully upgraded the 3 monitors to 10.2.7. They appear to be running fine, except for this health warning: "crush map has legacy tunables (require bobtail, min is firefly)". I may well be underestimating the significance of this warning, but it seemed fairly harmless to me, so I decided to upgrade my OSDs (still running 0.94.10) before touching the tunables.
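For completeness, what I did on each monitor was roughly the following, one at a time (exact commands reproduced from memory, the Jewel repo was already in place, so this may not be verbatim):

    yum update ceph
    systemctl restart ceph-mon.target    # or ceph-mon@mgm1, depending on the unit
    ceph -s                              # waited for quorum / clean health before the next one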
However, as soon as I brought down the OSDs on the first storage server to start upgrading them, the cluster immediately went to HEALTH_ERR (see the ceph -s output below), which made me abort the upgrade and just start the OSDs again.
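In case it matters, what I ran on that first storage server was no more than this (the OSD ids shown are just the two that happen to live on that host; Hammer there still uses the sysvinit script):

    service ceph stop osd.0
    service ceph stop osd.1
    # ceph -s immediately showed the HEALTH_ERR below, so I backed out:
    service ceph start osd.0
    service ceph start osd.1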
Now, considering that my crushmap forces the 3 copies to be spread over 3 servers, the cluster can't heal itself while those OSDs are down, which would arguably justify an error status. I'm worried, however, because both my memory and my lab environment tell me this situation should only produce a health warning with degraded PGs, not stuck/inactive ones (or did my lab environment simply not show stuck PGs because no client I/O was hitting them?).
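For context, all pools have size 3 and use a ruleset that places one copy per host, roughly like this (reproduced from memory, names and ids may not be exact):

    rule replicated_ruleset {
            ruleset 0
            type replicated
            min_size 1
            max_size 10
            step take default
            step chooseleaf firstn 0 type host
            step emit
    }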
     health HEALTH_ERR
            199 pgs are stuck inactive for more than 300 seconds
            576 pgs degraded
            199 pgs stuck inactive
            238 pgs stuck unclean
            576 pgs undersized
            recovery 1415496/4246488 objects degraded (33.333%)
            2/6 in osds are down
            crush map has legacy tunables (require bobtail, min is firefly)
     monmap e1: 3 mons at {mgm1=10.10.3.11:6789/0,mgm2=10.10.3.12:6789/0,mgm3=10.10.3.13:6789/0}
            election epoch 1650, quorum 0,1,2 mgm1,mgm2,mgm3
     osdmap e808: 6 osds: 4 up, 6 in; 576 remapped pgs
      pgmap v4309615: 576 pgs, 5 pools, 1483 GB data, 1382 kobjects
            4445 GB used, 7836 GB / 12281 GB avail
            1415496/4246488 objects degraded (33.333%)
                 512 undersized+degraded+peered
                  64 active+undersized+degraded
How should I proceed from here? Am I seeing ghosts and is the HEALTH_ERR status to be expected, so I should just continue, or is something definitely wrong here?
On a side note: the timer for the stuck/inactive PGs jumps to 300 seconds instantly, right after shutting down the OSDs.
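I haven't touched the stuck-PG threshold, so checking it on one of the monitors should still show the default of 300 seconds, something like:

    ceph daemon mon.mgm1 config get mon_pg_stuck_threshold
    # { "mon_pg_stuck_threshold": "300" }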
Any help would be greatly appreciated.
Kind regards,
Eric van Blokland