Just some more info -- this also happens when I simply restart an OSD that *was* working -- it won't start back up.
In the mon log I see the lines below, which correspond to the OSDs that I've been trying to start. osd.13 was working just now, until I stopped the service and tried to start it again.
2017-07-25 14:42:49.249076 7f2386806700 0 cephx server osd.10: couldn't find entity name: osd.10
2017-07-25 14:43:24.323603 7f2386806700 0 cephx server osd.13: couldn't find entity name: osd.13
2017-07-25 14:43:25.033487 7f2386806700 0 cephx server osd.7: couldn't find entity name: osd.7
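For anyone wanting to see at a glance which OSDs the mon is rejecting, here is a minimal Python sketch that pulls the entity names out of mon log lines like the three above. The regular expression is only an assumption based on the format of those lines, and `missing_entities` is a hypothetical helper name, not part of any ceph tooling.

```python
import re
from collections import Counter

# Assumed line format, taken from the mon log excerpt above:
#   <date> <time> <thread> 0 cephx server osd.10: couldn't find entity name: osd.10
MISSING_ENTITY = re.compile(r"cephx server (\S+): couldn't find entity name: (\S+)")


def missing_entities(lines):
    """Count how many times each entity name was reported as not found."""
    counts = Counter()
    for line in lines:
        match = MISSING_ENTITY.search(line)
        if match:
            counts[match.group(2)] += 1
    return counts


sample = [
    "2017-07-25 14:42:49.249076 7f2386806700 0 cephx server osd.10: couldn't find entity name: osd.10",
    "2017-07-25 14:43:24.323603 7f2386806700 0 cephx server osd.13: couldn't find entity name: osd.13",
    "2017-07-25 14:43:25.033487 7f2386806700 0 cephx server osd.7: couldn't find entity name: osd.7",
]

for name, count in sorted(missing_entities(sample).items()):
    print(name, count)
```

To run it against a real log, pass an open file object instead of `sample`; the path to the mon log depends on the installation.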
Still reading and learning.
On Tue, Jul 25, 2017 at 2:38 PM, Daniel K <sathackr@xxxxxxxxx> wrote:
Update to this -- I tried building a new host and a new OSD, new disk, and I am having the same issue.

I set osd debug level to 10 -- the issue looks like it's coming from a mon daemon. Still trying to learn enough about the internals of ceph to understand what's happening here.

Relevant debug logs (I think):

2017-07-25 14:21:58.889016 7f25a88af700 1 -- 10.0.15.142:6800/16150 <== mon.1 10.0.15.51:6789/0 1 ==== mon_map magic: 0 v1 ==== 541+0+0 (2831459213 0 0) 0x556640ecd900 con 0x556641949800
2017-07-25 14:21:58.889109 7f25a88af700 1 -- 10.0.15.142:6800/16150 <== mon.1 10.0.15.51:6789/0 2 ==== auth_reply(proto 2 0 (0) Success) v1 ==== 33+0+0 (248727397 0 0) 0x556640ecdb80 con 0x556641949800
2017-07-25 14:21:58.889204 7f25a88af700 1 -- 10.0.15.142:6800/16150 --> 10.0.15.51:6789/0 -- auth(proto 2 32 bytes epoch 0) v1 -- 0x556640ecd400 con 0
2017-07-25 14:21:58.889966 7f25a88af700 1 -- 10.0.15.142:6800/16150 <== mon.1 10.0.15.51:6789/0 3 ==== auth_reply(proto 2 0 (0) Success) v1 ==== 206+0+0 (3141870879 0 0) 0x556640ecd400 con 0x556641949800
2017-07-25 14:21:58.890066 7f25a88af700 1 -- 10.0.15.142:6800/16150 --> 10.0.15.51:6789/0 -- auth(proto 2 165 bytes epoch 0) v1 -- 0x556640ecdb80 con 0
2017-07-25 14:21:58.890759 7f25a88af700 1 -- 10.0.15.142:6800/16150 <== mon.1 10.0.15.51:6789/0 4 ==== auth_reply(proto 2 0 (0) Success) v1 ==== 564+0+0 (1715764650 0 0) 0x556640ecdb80 con 0x556641949800
2017-07-25 14:21:58.890871 7f25a88af700 1 -- 10.0.15.142:6800/16150 --> 10.0.15.51:6789/0 -- mon_subscribe({monmap=0+}) v2 -- 0x556640e77680 con 0
2017-07-25 14:21:58.890901 7f25a88af700 1 -- 10.0.15.142:6800/16150 --> 10.0.15.51:6789/0 -- auth(proto 2 2 bytes epoch 0) v1 -- 0x556640ecd400 con 0
2017-07-25 14:21:58.891494 7f25a88af700 1 -- 10.0.15.142:6800/16150 <== mon.1 10.0.15.51:6789/0 5 ==== mon_map magic: 0 v1 ==== 541+0+0 (2831459213 0 0) 0x556640ecde00 con 0x556641949800
2017-07-25 14:21:58.891555 7f25a88af700 1 -- 10.0.15.142:6800/16150 <== mon.1 10.0.15.51:6789/0 6 ==== auth_reply(proto 2 0 (0) Success) v1 ==== 194+0+0 (1036670921 0 0) 0x556640ece080 con 0x556641949800
2017-07-25 14:21:58.892003 7f25b5e71c80 10 osd.7 0 mon_cmd_maybe_osd_create cmd: {"prefix": "osd crush set-device-class", "class": "hdd", "ids": ["7"]}
2017-07-25 14:21:58.892039 7f25b5e71c80 1 -- 10.0.15.142:6800/16150 --> 10.0.15.51:6789/0 -- mon_command({"prefix": "osd crush set-device-class", "class": "hdd", "ids": ["7"]} v 0) v1 -- 0x556640e78d00 con 0
2017-07-25 14:21:58.894596 7f25a88af700 1 -- 10.0.15.142:6800/16150 <== mon.1 10.0.15.51:6789/0 7 ==== mon_command_ack([{"prefix": "osd crush set-device-class", "class": "hdd", "ids": ["7"]}]=-2 (2) No such file or directory v10406) v1 ==== 133+0+0 (3400959855 0 0) 0x556640ece300 con 0x556641949800
2017-07-25 14:21:58.894797 7f25b5e71c80 1 -- 10.0.15.142:6800/16150 --> 10.0.15.51:6789/0 -- mon_command({"prefix": "osd create", "id": 7, "uuid": "92445e4f-850e-453b-b5ab-569d1414f72d"} v 0) v1 -- 0x556640e79180 con 0
2017-07-25 14:21:58.896301 7f25a88af700 1 -- 10.0.15.142:6800/16150 <== mon.1 10.0.15.51:6789/0 8 ==== mon_command_ack([{"prefix": "osd create", "id": 7, "uuid": "92445e4f-850e-453b-b5ab-569d1414f72d"}]=0 v10406) v1 ==== 115+0+2 (2540205126 0 1371665406) 0x556640ece580 con 0x556641949800
2017-07-25 14:21:58.896473 7f25b5e71c80 10 osd.7 0 mon_cmd_maybe_osd_create cmd: {"prefix": "osd crush set-device-class", "class": "hdd", "ids": ["7"]}
2017-07-25 14:21:58.896516 7f25b5e71c80 1 -- 10.0.15.142:6800/16150 --> 10.0.15.51:6789/0 -- mon_command({"prefix": "osd crush set-device-class", "class": "hdd", "ids": ["7"]} v 0) v1 -- 0x556640e793c0 con 0
2017-07-25 14:21:58.898180 7f25a88af700 1 -- 10.0.15.142:6800/16150 <== mon.1 10.0.15.51:6789/0 9 ==== mon_command_ack([{"prefix": "osd crush set-device-class", "class": "hdd", "ids": ["7"]}]=-2 (2) No such file or directory v10406) v1 ==== 133+0+0 (3400959855 0 0) 0x556640ecd900 con 0x556641949800
2017-07-25 14:21:58.898276 7f25b5e71c80 -1 osd.7 0 mon_cmd_maybe_osd_create fail: '(2) No such file or directory': (2) No such file or directory
2017-07-25 14:21:58.898380 7f25b5e71c80 1 -- 10.0.15.142:6800/16150 >> 10.0.15.51:6789/0 conn(0x556641949800 :-1 s=STATE_OPEN pgs=367879 cs=1 l=1).mark_down

On Mon, Jul 24, 2017 at 1:33 PM, Daniel K <sathackr@xxxxxxxxx> wrote:

List --

I have a 4-node cluster running on baremetal and have a need to use the kernel client on 2 nodes. As I read you should not run the kernel client on a node that runs an OSD daemon, I decided to move the OSD daemons into a VM on the same device.

Original host is stor-vm2 (bare metal), new host is stor-vm2a (virtual).

All went well -- I did these steps (for each OSD, 5 total per host):

- setup the VM
- install the OS
- installed ceph (using ceph-deploy)
- set noout
- stopped ceph osd on bare metal host
- unmount /dev/sdb1 from /var/lib/ceph/osd/ceph-0
- add /dev/sdb to the VM
- ceph detected the osd and started automatically.
- moved VM host to the same bucket as physical host in crushmap

I did this for each OSD, and despite some recovery IO because of the updated crushmap, all OSDs were up.

I rebooted the physical host, which rebooted the VM, and now the OSDs are refusing to start.

I've tried moving them back to the bare metal host with the same results.

Any ideas?

Here are what seem to be the relevant osd log lines:

2017-07-24 13:21:53.561265 7faf1752fc80 0 osd.10 8854 crush map has features 2200130813952, adjusting msgr requires for clients
2017-07-24 13:21:53.561284 7faf1752fc80 0 osd.10 8854 crush map has features 2200130813952 was 8705, adjusting msgr requires for mons
2017-07-24 13:21:53.561298 7faf1752fc80 0 osd.10 8854 crush map has features 720578140510109696, adjusting msgr requires for osds
2017-07-24 13:21:55.626834 7faf1752fc80 0 osd.10 8854 load_pgs
2017-07-24 13:22:20.970222 7faf1752fc80 0 osd.10 8854 load_pgs opened 536 pgs
2017-07-24 13:22:20.972659 7faf1752fc80 0 osd.10 8854 using weightedpriority op queue with priority op cut off at 64.
2017-07-24 13:22:20.976861 7faf1752fc80 -1 osd.10 8854 log_to_monitors {default=true}
2017-07-24 13:22:20.998233 7faf1752fc80 -1 osd.10 8854 mon_cmd_maybe_osd_create fail: '(2) No such file or directory': (2) No such file or directory
2017-07-24 13:22:20.999165 7faf1752fc80 1 bluestore(/var/lib/ceph/osd/ceph-10) umount
2017-07-24 13:22:21.016146 7faf1752fc80 1 freelist shutdown
2017-07-24 13:22:21.016243 7faf1752fc80 4 rocksdb: [/build/ceph-12.1.1/src/rocksdb/db/db_impl.cc:217] Shutdown: canceling all background work
2017-07-24 13:22:21.020440 7faf1752fc80 4 rocksdb: [/build/ceph-12.1.1/src/rocksdb/db/db_impl.cc:343] Shutdown complete
2017-07-24 13:22:21.274481 7faf1752fc80 1 bluefs umount
2017-07-24 13:22:21.275822 7faf1752fc80 1 bdev(0x558bb1f82d80 /var/lib/ceph/osd/ceph-10/block) close
2017-07-24 13:22:21.485226 7faf1752fc80 1 bdev(0x558bb1f82b40 /var/lib/ceph/osd/ceph-10/block) close
2017-07-24 13:22:21.551009 7faf1752fc80 -1 ** ERROR: osd init failed: (2) No such file or directory
2017-07-24 13:22:21.563567 7faf1752fc80 -1 /build/ceph-12.1.1/src/common/HeartbeatMap.cc: In function 'ceph::HeartbeatMap::~HeartbeatMap()' thread 7faf1752fc80 time 2017-07-24 13:22:21.558275
/build/ceph-12.1.1/src/common/HeartbeatMap.cc: 39: FAILED assert(m_workers.empty())
ceph version 12.1.1 (f3e663a190bf2ed12c7e3cda288b9a159572c800) luminous (rc)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x558ba6ba6b72]
2: (()+0xb81cf1) [0x558ba6cc0cf1]
3: (CephContext::~CephContext()+0x4d9) [0x558ba6ca77b9]
4: (CephContext::put()+0xe6) [0x558ba6ca7ab6]
5: (main()+0x563) [0x558ba650df73]
6: (__libc_start_main()+0xf0) [0x7faf14999830]
7: (_start()+0x29) [0x558ba6597cf9]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- begin dump of recent events ---
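
In the debug output from osd.7, the `osd create` command is acknowledged with 0 while both `osd crush set-device-class` attempts come back with -2. A rough Python sketch for listing every `mon_command_ack` with its return code is below. It assumes the ack format shown in those debug lines and that a negative code is a negated errno value; `command_results` is a hypothetical helper name, not something ceph ships.

```python
import errno
import re

# Assumed format, taken from the osd debug log above:
#   mon_command_ack([{"prefix": "<command>", ...}]=<return code> ...
ACK = re.compile(r'mon_command_ack\(\[\{"prefix": "([^"]+)".*?\}\]=(-?\d+)')


def command_results(log_text):
    """Return (command prefix, return code, errno name) for each ack found."""
    results = []
    for prefix, code in ACK.findall(log_text):
        rc = int(code)
        # Assumption: a negative return code is a negated errno value.
        name = "OK" if rc == 0 else errno.errorcode.get(-rc, "unknown")
        results.append((prefix, rc, name))
    return results


sample = "\n".join([
    'mon_command_ack([{"prefix": "osd crush set-device-class", "class": "hdd", '
    '"ids": ["7"]}]=-2 (2) No such file or directory v10406) v1',
    'mon_command_ack([{"prefix": "osd create", "id": 7, '
    '"uuid": "92445e4f-850e-453b-b5ab-569d1414f72d"}]=0 v10406) v1',
])

for prefix, rc, name in command_results(sample):
    print(f"{prefix}: {rc} ({name})")
```

This only summarizes what the mon answered; it does not say why the mon returned ENOENT for the device-class command.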
_______________________________________________ ceph-users mailing list ceph-users@xxxxxxxxxxxxxx http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com