Re: osd problem upgrading from hammer to jewel

Hi,

I have a little bit of additional information here that might help debug this situation. From the OSD logs:

2016-04-29 14:32:46.886538 7fa4cd004800  0 osd.2 14422 done with init, starting boot process
2016-04-29 14:32:46.886555 7fa4cd004800  1 -- 10.2.0.116:6808/32079 --> 10.2.0.117:6789/0 -- mon_subscribe({osd_pg_creates=0+}) v2 -- ?+0 0x55d8389ee200 con 0x55d8549c4e80
2016-04-29 14:32:46.886568 7fa4cd004800  1 osd.2 14422 We are healthy, booting
2016-04-29 14:32:46.886577 7fa4cd004800  1 -- 10.2.0.116:6808/32079 --> 10.2.0.117:6789/0 -- mon_get_version(what=osdmap handle=1) v1 -- ?+0 0x55d837dc61e0 con 0x55d8549c4e80
2016-04-29 14:32:46.887063 7fa4b66bc700  1 -- 10.2.0.116:6808/32079 <== mon.1 10.2.0.117:6789/0 8 ==== mon_get_version_reply(handle=1 version=14422) v2 ==== 24+0+0 (1829608329 0 0) 0x55d837dc65a0 con 0x55d8549c4e80
2016-04-29 14:32:46.887087 7fa4adeab700  1 osd.2 14422 osdmap indicates one or more pre-v0.94.4 hammer OSDs is running
2016-04-29 14:32:46.887100 7fa4adeab700  1 -- 10.2.0.116:6808/32079 --> 10.2.0.117:6789/0 -- mon_subscribe({osdmap=14423}) v2 -- ?+0 0x55d854d65c00 con 0x55d8549c4e80

So, it's saying there is an older OSD running, but:

# ceph tell osd.* version
Error ENXIO: problem getting command descriptions from osd.0
osd.0: problem getting command descriptions from osd.0
Error ENXIO: problem getting command descriptions from osd.1
osd.1: problem getting command descriptions from osd.1
Error ENXIO: problem getting command descriptions from osd.2
osd.2: problem getting command descriptions from osd.2
Error ENXIO: problem getting command descriptions from osd.3
osd.3: problem getting command descriptions from osd.3
osd.4: {
    "version": "ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403)"
}
osd.5: {
    "version": "ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403)"
}
osd.6: {
    "version": "ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403)"
}
osd.7: {
    "version": "ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403)"
}
osd.8: {
    "version": "ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403)"
}
osd.9: {
    "version": "ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403)"
}
osd.10: {
    "version": "ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403)"
}
osd.11: {
    "version": "ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403)"
}
root@DAL1S4UTIL8:~# ceph tell mon.* version
mon.DAL1S4UTIL6: ceph version 10.2.0 (3a9fba20ec743699b69bd0181dd6c54dc01c64b9)
mon.DAL1S4UTIL7: ceph version 10.2.0 (3a9fba20ec743699b69bd0181dd6c54dc01c64b9)
mon.DAL1S4UTIL8: ceph version 10.2.0 (3a9fba20ec743699b69bd0181dd6c54dc01c64b9)

osd.1, osd.2 and osd.3 are the ones that have been upgraded and restarted, so it looks to me like every OSD in the cluster is newer than 0.94.4...
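(For the OSDs that ceph tell can't reach while they sit in boot, the version can also be checked locally through the daemon's admin socket; the commands below assume the default cluster name and socket path under /var/run/ceph:)

# on the host that carries osd.2
ceph daemon osd.2 version
# the same query with the socket spelled out explicitly
ceph --admin-daemon /var/run/ceph/ceph-osd.2.asok version
# and the epoch/flags the monitors are currently publishing
ceph osd dump | head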

What could be causing this?

Thanks,
Randy

On Wed, Apr 27, 2016 at 4:57 PM, Randy Orr <randy.orr@xxxxxxxxxx> wrote:
Hi,

I have a small dev/test ceph cluster that sat neglected for quite some time. It was on the firefly release until very recently. I successfully upgraded from firefly to hammer without issue as an intermediate step to get to the latest jewel release.

This cluster has 3 Ubuntu 14.04 hosts with kernel 3.13.0-40-generic. MONs and OSDs are colocated on the same hosts, with 11 OSDs total across the 3 hosts.

The 3 MONs have been updated to jewel and are running successfully. I set noout on the cluster, shut down the first 3 OSD processes, and ran chown -R ceph:ceph on /var/lib/ceph/osd before starting them again (the rough sequence is sketched below). The OSD processes start and run, but they never show as up. After setting debug osd = 20 I see the log excerpt that follows the sketch:
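Roughly what I ran per host, for reference (upstart job names on Ubuntu 14.04; the OSD id below is illustrative and was repeated for each OSD on the box):

ceph osd set noout                     # keep the stopped OSDs from being marked out
stop ceph-osd id=1                     # stop the hammer OSD (upstart job)
chown -R ceph:ceph /var/lib/ceph/osd   # jewel runs the daemons as the ceph user
start ceph-osd id=1                    # start it again on the jewel packages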

2016-04-27 15:55:19.042230 7fd3854c7700 10 osd.1 13324 tick
2016-04-27 15:55:19.042244 7fd3854c7700 10 osd.1 13324 do_waiters -- start
2016-04-27 15:55:19.042247 7fd3854c7700 10 osd.1 13324 do_waiters -- finish
2016-04-27 15:55:19.061083 7fd384cc6700 10 osd.1 13324 tick_without_osd_lock
2016-04-27 15:55:19.061096 7fd384cc6700 20 osd.1 13324 scrub_random_backoff lost coin flip, randomly backing off
2016-04-27 15:55:20.042351 7fd3854c7700 10 osd.1 13324 tick
2016-04-27 15:55:20.042364 7fd3854c7700 10 osd.1 13324 do_waiters -- start
2016-04-27 15:55:20.042368 7fd3854c7700 10 osd.1 13324 do_waiters -- finish
2016-04-27 15:55:20.061192 7fd384cc6700 10 osd.1 13324 tick_without_osd_lock
2016-04-27 15:55:20.061206 7fd384cc6700 20 osd.1 13324 can_inc_scrubs_pending0 -> 1 (max 1, active 0)
2016-04-27 15:55:20.061212 7fd384cc6700 20 osd.1 13324 scrub_time_permit should run between 0 - 24 now 15 = yes
2016-04-27 15:55:20.061247 7fd384cc6700 20 osd.1 13324 scrub_load_below_threshold loadavg 0.04 < max 0.5 = yes
2016-04-27 15:55:20.061259 7fd384cc6700 20 osd.1 13324 sched_scrub load_is_low=1
2016-04-27 15:55:20.061261 7fd384cc6700 20 osd.1 13324 sched_scrub done
2016-04-27 15:55:20.861872 7fd368ded700 20 osd.1 13324 update_osd_stat osd_stat(61789 MB used, 865 GB avail, 926 GB total, peers []/[] op hist [])
2016-04-27 15:55:20.861886 7fd368ded700  5 osd.1 13324 heartbeat: osd_stat(61789 MB used, 865 GB avail, 926 GB total, peers []/[] op hist [])

The fact that no peers show up in the heartbeat seems problematic, but I can't see why the OSDs are failing to start correctly. 
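In case it helps, these admin-socket queries should show a stuck daemon's own view of its state and addresses (again assuming the default socket path under /var/run/ceph):

ceph daemon osd.1 status               # whoami, state, oldest/newest osdmap epochs
ceph daemon osd.1 config show | grep -E 'public_addr|cluster_addr|osd_heartbeat'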

A ceph status gives this:

    cluster 9e3f9cab-6f1b-4c7c-ab13-e01cb774f752
     health HEALTH_WARN
            725 pgs degraded
            3584 pgs stuck unclean
            725 pgs undersized
            recovery 23363/180420 objects degraded (12.949%)
            recovery 49218/180420 objects misplaced (27.280%)
            too many PGs per OSD (651 > max 300)
            3/11 in osds are down
            noout flag(s) set
            election epoch 32, quorum 0,1,2 DAL1S4UTIL6,DAL1S4UTIL7,DAL1S4UTIL8
     osdmap e13324: 11 osds: 8 up, 11 in; 2859 remapped pgs
            flags noout
      pgmap v6332775: 3584 pgs, 7 pools, 180 GB data, 60140 objects
            703 GB used, 9483 GB / 10186 GB avail
            23363/180420 objects degraded (12.949%)
            49218/180420 objects misplaced (27.280%)
                2238 active+remapped
                 725 active+undersized+degraded
                 621 active

Disk utilization is low. Nothing interesting in syslog or dmesg. Any ideas or suggestions on where to start debugging this?

Thanks,
Randy

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
