There are new PG states in Jewel that can push the cluster to HEALTH_ERR. In your case it is the undersized+degraded+peered (i.e. inactive) PGs that are causing it.
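If you want to confirm which PGs the monitors are counting, something like the following should show it (just a quick sketch; the 300 seconds in that warning is the mon_pg_stuck_threshold value rather than an elapsed timer, which would explain why it shows up immediately):

    # full breakdown of what is driving the current health status
    ceph health detail
    # list only the PGs the monitors consider stuck inactive
    ceph pg dump_stuck inactive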
While I decided to upgrade my tunables before upgrading the rest of my cluster, it does not seem to be a requirement. However, I would recommend upgrading them sooner rather than later. It will cause a fair amount of backfilling when you do it. If you are using krbd, don't upgrade your tunables past Hammer.
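For what it's worth, the tunables change itself is a single command; which profile to pick depends on your clients (a sketch only, "hammer" being my assumption given the krbd caveat above):

    # go no further than the hammer profile if krbd clients are attached
    ceph osd crush tunables hammer
    # with no old kernel clients attached you could go all the way:
    # ceph osd crush tunables optimal

Either one kicks off the backfilling I mentioned, so plan it for a quiet time.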
In any case, you should feel safe continuing with your upgrade. You will definitely be safe finishing this first node, as you still have 2 copies of your data if anything goes awry. I would expect this first node to finish, the cluster to get back to a state where all backfilling is done, and then you can continue with the other nodes.
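Roughly the per-node sequence I would use (just a sketch for a CentOS/systemd setup; it leaves out the Jewel-specific step of switching daemon ownership to the ceph user, which the release notes cover):

    ceph osd set noout                   # stop CRUSH from remapping data while the node's OSDs are down
    # ...upgrade the Ceph packages on the node, e.g. via yum on CentOS...
    systemctl restart ceph-osd.target    # bring the node's OSDs back up under systemd
    ceph -s                              # wait until all PGs are active+clean again
    ceph osd unset noout                 # once the last node is done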
On Wed, Sep 27, 2017, 6:32 PM Eric van Blokland <ericvanblokland@xxxxxxxxx> wrote:
Hello,

I have run into an issue while upgrading a Ceph cluster from Hammer to Jewel on CentOS. It's a small cluster with 3 monitoring servers and a humble 6 OSDs distributed over 3 servers.

I've upgraded the 3 monitors successfully to 10.2.7. They appear to be running fine, except for this health warning: "crush map has legacy tunables (require bobtail, min is firefly)". While I might completely underestimate the significance of this warning, it seemed pretty harmless to me and I decided to upgrade my OSDs (running 0.94.10) before touching the tunables.

However, as soon as I brought down the OSDs on the first storage server to start upgrading them, the cluster immediately went to a HEALTH_ERR status (see ceph -s output below), which made me abort the update process and just start the OSDs again.

Now, considering that my crush map forces distribution of 3 copies over 3 servers, the cluster can't heal itself when I take those OSDs down, which would justify an error status. I'm worried, however, because my memory and my lab environment tell me that this situation should only give a health warning and only degraded PGs, not stuck/inactive (or did my lab environment not get the stuck PGs because they were not being addressed?).

     health HEALTH_ERR
            199 pgs are stuck inactive for more than 300 seconds
            576 pgs degraded
            199 pgs stuck inactive
            238 pgs stuck unclean
            576 pgs undersized
            recovery 1415496/4246488 objects degraded (33.333%)
            2/6 in osds are down
            crush map has legacy tunables (require bobtail, min is firefly)
     monmap e1: 3 mons at {mgm1=10.10.3.11:6789/0,mgm2=10.10.3.12:6789/0,mgm3=10.10.3.13:6789/0}
            election epoch 1650, quorum 0,1,2 mgm1,mgm2,mgm3
     osdmap e808: 6 osds: 4 up, 6 in; 576 remapped pgs
      pgmap v4309615: 576 pgs, 5 pools, 1483 GB data, 1382 kobjects
            4445 GB used, 7836 GB / 12281 GB avail
            1415496/4246488 objects degraded (33.333%)
                 512 undersized+degraded+peered
                  64 active+undersized+degraded

How should I proceed from here? Am I seeing ghosts, is the HEALTH_ERR status to be expected and should I just continue, or is something definitively wrong here?

On a side note: the timer for the stuck/inactive PGs is instantly at 300 seconds, right after shutting down the OSDs.

Any help would be greatly appreciated.

Kind regards,
Eric van Blokland
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com