Hi,
I have a small dev/test ceph cluster that sat neglected for quite some time. It was on the firefly release until very recently. I successfully upgraded from firefly to hammer without issue as an intermediate step to get to the latest jewel release.
This cluster has 3 Ubuntu 14.04 hosts running kernel 3.13.0-40-generic. MONs and OSDs are colocated on the same hosts, with 11 OSDs total across the 3 hosts.
The 3 MONs have been updated to jewel and are running successfully. I set noout on the cluster, shut down the first 3 OSD processes, and ran chown -R ceph:ceph on /var/lib/ceph/osd (the exact commands I used are listed after the log excerpt below). The OSD processes start and run, but they never show as up. After setting debug osd = 20, I see the following in the logs:
2016-04-27 15:55:19.042230 7fd3854c7700 10 osd.1 13324 tick
2016-04-27 15:55:19.042244 7fd3854c7700 10 osd.1 13324 do_waiters -- start
2016-04-27 15:55:19.042247 7fd3854c7700 10 osd.1 13324 do_waiters -- finish
2016-04-27 15:55:19.061083 7fd384cc6700 10 osd.1 13324 tick_without_osd_lock
2016-04-27 15:55:19.061096 7fd384cc6700 20 osd.1 13324 scrub_random_backoff lost coin flip, randomly backing off
2016-04-27 15:55:20.042351 7fd3854c7700 10 osd.1 13324 tick
2016-04-27 15:55:20.042364 7fd3854c7700 10 osd.1 13324 do_waiters -- start
2016-04-27 15:55:20.042368 7fd3854c7700 10 osd.1 13324 do_waiters -- finish
2016-04-27 15:55:20.061192 7fd384cc6700 10 osd.1 13324 tick_without_osd_lock
2016-04-27 15:55:20.061206 7fd384cc6700 20 osd.1 13324 can_inc_scrubs_pending 0 -> 1 (max 1, active 0)
2016-04-27 15:55:20.061212 7fd384cc6700 20 osd.1 13324 scrub_time_permit should run between 0 - 24 now 15 = yes
2016-04-27 15:55:20.061247 7fd384cc6700 20 osd.1 13324 scrub_load_below_threshold loadavg 0.04 < max 0.5 = yes
2016-04-27 15:55:20.061259 7fd384cc6700 20 osd.1 13324 sched_scrub load_is_low=1
2016-04-27 15:55:20.061261 7fd384cc6700 20 osd.1 13324 sched_scrub done
2016-04-27 15:55:20.861872 7fd368ded700 20 osd.1 13324 update_osd_stat osd_stat(61789 MB used, 865 GB avail, 926 GB total, peers []/[] op hist [])
2016-04-27 15:55:20.861886 7fd368ded700 5 osd.1 13324 heartbeat: osd_stat(61789 MB used, 865 GB avail, 926 GB total, peers []/[] op hist [])
The fact that no peers show up in the heartbeat seems problematic, but I can't see why the OSDs start and run yet never get marked up.
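For reference, this is roughly the sequence I ran, shown here for osd.1 (the upstart job syntax and the debug injection are from memory, so they may not be exactly what I typed):

    # keep the stopped OSDs from being marked out while they are down
    ceph osd set noout

    # stop the OSD (Ubuntu 14.04, so the upstart job rather than systemd)
    stop ceph-osd id=1

    # jewel runs the daemons as the 'ceph' user, so fix ownership of the OSD data dirs
    chown -R ceph:ceph /var/lib/ceph/osd

    # bring the OSD back up
    start ceph-osd id=1

    # raise the debug level (debug osd = 20) to capture the log excerpt above
    ceph daemon osd.1 config set debug_osd 20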
A ceph status gives this:
    cluster 9e3f9cab-6f1b-4c7c-ab13-e01cb774f752
     health HEALTH_WARN
            725 pgs degraded
            3584 pgs stuck unclean
            725 pgs undersized
            recovery 23363/180420 objects degraded (12.949%)
            recovery 49218/180420 objects misplaced (27.280%)
            too many PGs per OSD (651 > max 300)
            3/11 in osds are down
            noout flag(s) set
     monmap e3: 3 mons at {DAL1S4UTIL6=10.2.0.116:6789/0,DAL1S4UTIL7=10.2.0.117:6789/0,DAL1S4UTIL8=10.2.0.118:6789/0}
            election epoch 32, quorum 0,1,2 DAL1S4UTIL6,DAL1S4UTIL7,DAL1S4UTIL8
     osdmap e13324: 11 osds: 8 up, 11 in; 2859 remapped pgs
            flags noout
      pgmap v6332775: 3584 pgs, 7 pools, 180 GB data, 60140 objects
            703 GB used, 9483 GB / 10186 GB avail
            23363/180420 objects degraded (12.949%)
            49218/180420 objects misplaced (27.280%)
                2238 active+remapped
                 725 active+undersized+degraded
                 621 active
Disk utilization is low. Nothing interesting in syslog or dmesg. Any ideas or suggestions on where to start debugging this?
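If it would help, I can also attach what the daemon itself reports over the admin socket, e.g. (assuming the default socket name):

    ceph daemon osd.1 status
    # or, going through the socket path directly:
    ceph --admin-daemon /var/run/ceph/ceph-osd.1.asok status

which should at least show whether osd.1 thinks it is still booting or already active, and which osdmap epoch it has.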
Thanks,
Randy