Re: Upgrade from Jewel to Luminous. REQUIRE_JEWEL OSDMap

On Tue, 28 Nov 2017, Cary wrote:
> Hello,
> 
>  Could someone please help me complete my botched upgrade from Jewel
> 10.2.3-r1 to Luminous 12.2.1. I have 9 Gentoo servers, 4 of which have
> 2 OSDs each.
> 
>  My OSD servers were accidentally rebooted before the monitor servers,
> causing them to be running Luminous before the monitors. All services
> have been restarted, and "ceph versions" gives the following:
> 
> # ceph versions
> 2017-11-27 21:27:24.356940 7fed67efe700 -1 WARNING: the following
> dangerous and experimental features are enabled: btrfs
> 2017-11-27 21:27:24.368469 7fed67efe700 -1 WARNING: the following
> dangerous and experimental features are enabled: btrfs
> 
> {
>     "mon": {
>         "ceph version 12.2.1
> (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous (stable)": 4
>     },
>     "mgr": {
>         "ceph version 12.2.1
> (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous (stable)": 3
>     },
>     "osd": {},
>     "mds": {
>         "ceph version 12.2.1
> (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous (stable)": 1
>     },
>     "overall": {
>         "ceph version 12.2.1
> (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous (stable)": 8
>     }
> }
> 
> 
> For some reason the OSDs do not show what version they are running,
> and "ceph osd tree" shows all of the OSDs as being down.
> 
>  # ceph osd tree
> 2017-11-27 21:32:51.969335 7f483d9c2700 -1 WARNING: the following
> dangerous and experimental features are enabled: btrfs
> 2017-11-27 21:32:51.980976 7f483d9c2700 -1 WARNING: the following
> dangerous and experimental features are enabled: btrfs
> ID CLASS WEIGHT   TYPE NAME              STATUS REWEIGHT PRI-AFF
> -1       27.77998 root default
> -3       27.77998     datacenter DC1
> -6       27.77998         rack 1B06
> -5        6.48000             host ceph3
>  1        1.84000                 osd.1    down        0 1.00000
>  3        4.64000                 osd.3    down        0 1.00000
> -2        5.53999             host ceph4
>  5        4.64000                 osd.5    down        0 1.00000
>  8        0.89999                 osd.8    down        0 1.00000
> -4        9.28000             host ceph6
>  0        4.64000                 osd.0    down        0 1.00000
>  2        4.64000                 osd.2    down        0 1.00000
> -7        6.48000             host ceph7
>  6        4.64000                 osd.6    down        0 1.00000
>  7        1.84000                 osd.7    down        0 1.00000
> 
> The OSD logs all have this message:
> 
> 20235 osdmap REQUIRE_JEWEL OSDMap flag is NOT set; please set it.

This is an annoying corner condition.  12.2.2 (out soon!) will have a 
--force option to set the flag even though no OSDs are up.  Until then, 
the workaround is to downgrade one host to jewel, start one jewel OSD, 
and then set the flag.  Then upgrade that host to luminous again and 
restart all OSDs.

sage


> 
> When I try to set it with "ceph osd set require_jewel_osds" I get this error:
> 
> Error EPERM: not all up OSDs have CEPH_FEATURE_SERVER_JEWEL feature
> 
> 
> 
> A "ceph features" returns:
> 
> {
>     "mon": {
>         "group": {
>             "features": "0x1ffddff8eea4fffb",
>             "release": "luminous",
>             "num": 4
>         }
>     },
>     "mds": {
>         "group": {
>             "features": "0x1ffddff8eea4fffb",
>             "release": "luminous",
>             "num": 1
>         }
>     },
>     "osd": {
>         "group": {
>             "features": "0x1ffddff8eea4fffb",
>             "release": "luminous",
>             "num": 8
>         }
>     },
>     "client": {
>         "group": {
>             "features": "0x1ffddff8eea4fffb",
>             "release": "luminous",
>             "num": 3
>         }
>     }
> }
>  # ceph tell osd.* versions
> 2017-11-28 02:29:28.565943 7f99c6aee700 -1 WARNING: the following
> dangerous and experimental features are enabled: btrfs
> 2017-11-28 02:29:28.578956 7f99c6aee700 -1 WARNING: the following
> dangerous and experimental features are enabled: btrfs
> Error ENXIO: problem getting command descriptions from osd.0
> osd.0: problem getting command descriptions from osd.0
> Error ENXIO: problem getting command descriptions from osd.1
> osd.1: problem getting command descriptions from osd.1
> Error ENXIO: problem getting command descriptions from osd.2
> osd.2: problem getting command descriptions from osd.2
> Error ENXIO: problem getting command descriptions from osd.3
> osd.3: problem getting command descriptions from osd.3
> Error ENXIO: problem getting command descriptions from osd.5
> osd.5: problem getting command descriptions from osd.5
> Error ENXIO: problem getting command descriptions from osd.6
> osd.6: problem getting command descriptions from osd.6
> Error ENXIO: problem getting command descriptions from osd.7
> osd.7: problem getting command descriptions from osd.7
> Error ENXIO: problem getting command descriptions from osd.8
> osd.8: problem getting command descriptions from osd.8
> 
>  # ceph daemon osd.1 status
> 
> {
>     "cluster_fsid": "CENSORED",
>     "osd_fsid": "CENSORED",
>     "whoami": 1,
>     "state": "preboot",
>     "oldest_map": 19482,
>     "newest_map": 20235,
>     "num_pgs": 141
> }
> 
>  # ceph -s
> 2017-11-27 22:04:10.372471 7f89a3935700 -1 WARNING: the following
> dangerous and experimental features are enabled: btrfs
> 2017-11-27 22:04:10.375709 7f89a3935700 -1 WARNING: the following
> dangerous and experimental features are enabled: btrfs
>   cluster:
>     id:     CENSORED
>     health: HEALTH_ERR
>             513 pgs are stuck inactive for more than 60 seconds
>             126 pgs backfill_wait
>             52 pgs backfilling
>             435 pgs degraded
>             513 pgs stale
>             435 pgs stuck degraded
>             513 pgs stuck stale
>             435 pgs stuck unclean
>             435 pgs stuck undersized
>             435 pgs undersized
>             recovery 854719/3688140 objects degraded (23.175%)
>             recovery 838607/3688140 objects misplaced (22.738%)
>             mds cluster is degraded
>             crush map has straw_calc_version=0
> 
>   services:
>     mon: 4 daemons, quorum 0,1,3,2
>     mgr: 0(active), standbys: 1, 5
>     mds: cephfs-1/1/1 up  {0=a=up:replay}, 1 up:standby
>     osd: 8 osds: 0 up, 0 in
> 
>   data:
>     pools:   7 pools, 513 pgs
>     objects: 1199k objects, 4510 GB
>     usage:   13669 GB used, 15150 GB / 28876 GB avail
>     pgs:     854719/3688140 objects degraded (23.175%)
>              838607/3688140 objects misplaced (22.738%)
>              257 stale+active+undersized+degraded
>              126 stale+active+undersized+degraded+remapped+backfill_wait
>              78  stale+active+clean
>              52  stale+active+undersized+degraded+remapped+backfilling
> 
> 
> I ran "ceph auth list", and client.admin has the following permissions.
> auid: 0
> caps: [mds] allow
> caps: [mgr] allow *
> caps: [mon] allow *
> caps: [osd] allow *
> 
> Thank you for your time.
> 
> Is there any way I can get these OSDs to join the cluster now, or
> recover my data?
> 
> Cary
> -Dynamic
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 


