Hi,
I have a small dev/test ceph cluster that sat neglected for quite some time. It was on the firefly release until very recently. I successfully upgraded from firefly to hammer without issue as an intermediate step to get to the latest jewel release.
This cluster has 3 Ubuntu 14.04 hosts running kernel 3.13.0-40-generic. MONs and OSDs are colocated on the same hosts, with 11 OSDs total across the 3 hosts.
The 3 MONs have been updated to jewel and are running successfully. I set noout on the cluster, shut down the first 3 OSD processes, and ran chown -R ceph:ceph on /var/lib/ceph/osd (the exact commands I used are listed after the log excerpt below). The OSD processes start and run, but they never show as up. After setting debug osd = 20, I see the following in the logs:
2016-04-27 15:55:19.042230 7fd3854c7700 10 osd.1 13324 tick
2016-04-27 15:55:19.042244 7fd3854c7700 10 osd.1 13324 do_waiters -- start
2016-04-27 15:55:19.042247 7fd3854c7700 10 osd.1 13324 do_waiters -- finish
2016-04-27 15:55:19.061083 7fd384cc6700 10 osd.1 13324 tick_without_osd_lock
2016-04-27 15:55:19.061096 7fd384cc6700 20 osd.1 13324 scrub_random_backoff lost coin flip, randomly backing off
2016-04-27 15:55:20.042351 7fd3854c7700 10 osd.1 13324 tick
2016-04-27 15:55:20.042364 7fd3854c7700 10 osd.1 13324 do_waiters -- start
2016-04-27 15:55:20.042368 7fd3854c7700 10 osd.1 13324 do_waiters -- finish
2016-04-27 15:55:20.061192 7fd384cc6700 10 osd.1 13324 tick_without_osd_lock
2016-04-27 15:55:20.061206 7fd384cc6700 20 osd.1 13324 can_inc_scrubs_pending 0 -> 1 (max 1, active 0)
2016-04-27 15:55:20.061212 7fd384cc6700 20 osd.1 13324 scrub_time_permit should run between 0 - 24 now 15 = yes
2016-04-27 15:55:20.061247 7fd384cc6700 20 osd.1 13324 scrub_load_below_threshold loadavg 0.04 < max 0.5 = yes
2016-04-27 15:55:20.061259 7fd384cc6700 20 osd.1 13324 sched_scrub load_is_low=1
2016-04-27 15:55:20.061261 7fd384cc6700 20 osd.1 13324 sched_scrub done
2016-04-27 15:55:20.861872 7fd368ded700 20 osd.1 13324 update_osd_stat osd_stat(61789 MB used, 865 GB avail, 926 GB total, peers []/[] op hist [])
2016-04-27 15:55:20.861886 7fd368ded700 5 osd.1 13324 heartbeat: osd_stat(61789 MB used, 865 GB avail, 926 GB total, peers []/[] op hist [])
The fact that no peers show up in the heartbeat seems problematic, but I can't see why the OSDs start and run yet never get marked up.
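For reference, this is roughly the sequence I ran, shown here for osd.1 (the upstart job syntax and the debug injection are from memory, so they may not be exactly what I typed):

    # keep the stopped OSDs from being marked out while they are down
    ceph osd set noout

    # stop the OSD (Ubuntu 14.04, so the upstart job rather than systemd)
    stop ceph-osd id=1

    # jewel runs the daemons as the 'ceph' user, so fix ownership of the OSD data dirs
    chown -R ceph:ceph /var/lib/ceph/osd

    # bring the OSD back up
    start ceph-osd id=1

    # raise the debug level (debug osd = 20) to capture the log excerpt above
    ceph daemon osd.1 config set debug_osd 20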
A ceph status gives this:
    cluster 9e3f9cab-6f1b-4c7c-ab13-e01cb774f752
     health HEALTH_WARN
            725 pgs degraded
            3584 pgs stuck unclean
            725 pgs undersized
            recovery 23363/180420 objects degraded (12.949%)
            recovery 49218/180420 objects misplaced (27.280%)
            too many PGs per OSD (651 > max 300)
            3/11 in osds are down
            noout flag(s) set
     monmap e3: 3 mons at {DAL1S4UTIL6=10.2.0.116:6789/0,DAL1S4UTIL7=10.2.0.117:6789/0,DAL1S4UTIL8=10.2.0.118:6789/0}
            election epoch 32, quorum 0,1,2 DAL1S4UTIL6,DAL1S4UTIL7,DAL1S4UTIL8
     osdmap e13324: 11 osds: 8 up, 11 in; 2859 remapped pgs
            flags noout
      pgmap v6332775: 3584 pgs, 7 pools, 180 GB data, 60140 objects
            703 GB used, 9483 GB / 10186 GB avail
            23363/180420 objects degraded (12.949%)
            49218/180420 objects misplaced (27.280%)
                2238 active+remapped
                 725 active+undersized+degraded
                 621 active
Disk utilization is low. Nothing interesting in syslog or dmesg. Any ideas or suggestions on where to start debugging this?
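If it would help, I can also attach what the daemon itself reports over the admin socket, e.g. (assuming the default socket name):

    ceph daemon osd.1 status
    # or, going through the socket path directly:
    ceph --admin-daemon /var/run/ceph/ceph-osd.1.asok status

which should at least show whether osd.1 thinks it is still booting or already active, and which osdmap epoch it has.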
Thanks,
Randy