I was upgrading a really old cluster from Infernalis (9.2.1) to Jewel (10.2.3) and ran into some weird but interesting issues. This cluster started its life on Bobtail and has gone Bobtail -> Dumpling -> Emperor -> Firefly -> Giant -> Hammer -> Infernalis and now Jewel.

When I upgraded the first MON (out of 3) everything worked as it should. When I upgraded the second, the first and second both crashed. I reverted the binaries on one of them to Infernalis, deleted the store.db folder on the other and started it as Jewel (so I then had 2x Infernalis and 1x Jewel), and let it sync the store. After that I upgraded the remaining nodes and everything was fine, or so it mostly seemed, apart from the usual "failed to encode map xxx with expected crc" messages.

I then noticed some weird size graphs in calamari, and looking closer, "ceph df" showed:

GLOBAL:
    SIZE    AVAIL    RAW USED    %RAW USED
    10E     932P     5E          52.46

Oooh, I have a really big cluster; it is usually a lot smaller (actual size is 655T).

A snippet cut from "ceph -s":

     health HEALTH_ERR
            1 full osd(s)
            flags full
     pgmap v77393779: 6384 pgs, 26 pools, 66584 GB data, 52605 kobjects
            5502 PB used, 17316 PB / 10488 PB avail

"ceph health detail" shows:

    osd.89 is full at 266%

That is one of the OSDs that was being upgraded at the time.

The cluster ends up recovering on its own and showing the regular sane values, but this does seem to indicate some sort of underlying issue. Has anyone seen such an issue?
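
(For anyone hitting the same mon crash, here is a rough sketch of the store resync step described above. It assumes a hypothetical monitor id "a", the default data path /var/lib/ceph/mon/ceph-a, and the systemd unit names used on Jewel; adapt to your own deployment, and keep a quorum of healthy mons running while the rebuilt one syncs.)

    # stop the crashed mon (use "service ceph stop mon.a" on pre-systemd hosts)
    systemctl stop ceph-mon@a

    # move the old store aside rather than deleting it outright,
    # so it can be put back if the resync does not work out
    mv /var/lib/ceph/mon/ceph-a/store.db /var/lib/ceph/mon/ceph-a/store.db.bak

    # start the mon on the Jewel binaries and let it rebuild store.db
    # by syncing from the monitors still in quorum
    systemctl start ceph-mon@a

    # watch it catch up and rejoin quorum
    ceph mon stat
    ceph -s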