Never mind, they just came back. Looks like I had some other issues as well, such as manually enabled ceph-osd@#.service units in the systemd config for OSDs that had been moved to different nodes.
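For anyone hitting the same thing, cleanup on the affected host should look roughly like this (osd 42/43 are the stale IDs from the listing quoted below; adjust for your own layout):

# check whether the stale instances were enabled by hand
systemctl is-enabled ceph-osd@42.service ceph-osd@43.service

# make sure they can't be started on this host again
systemctl stop ceph-osd@42.service ceph-osd@43.service
systemctl disable ceph-osd@42.service ceph-osd@43.service

# the leftover directories are just root-owned stubs with no superblock,
# so removing them keeps the prestart hook away from those IDs
rm -rf /var/lib/ceph/osd/ceph-42 /var/lib/ceph/osd/ceph-43

# restart the real OSDs and confirm placement
systemctl start ceph.target
ceph osd tree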
The root problem is clearly that ceph-osd-prestart updates the CRUSH map before the OSD has successfully started at all. If there are duplicate IDs, for example due to leftover files or some such, then a working OSD on another node may be forcibly moved in the CRUSH map to a node where it doesn't actually exist. I would expect OSDs to update their own location in CRUSH once they come up, rather than having this be a prestart step.
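Until that changes, the workaround seems to be to stop the startup hook from touching CRUSH at all and manage locations by hand; something like this should do it (the weight and host name below are placeholders, not from my cluster):

# ceph.conf on the OSD hosts: don't let startup (re)insert OSDs into CRUSH
[osd]
osd crush update on start = false

# then put a misplaced OSD back under the host that really has it
# (create-or-move keeps the existing weight if the OSD is already in the map)
ceph osd crush create-or-move osd.42 1.0 root=default host=<real-host>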
-Ben
On Wed, May 4, 2016 at 10:27 PM, Ben Hines <bhines@xxxxxxxxx> wrote:
CentOS 7.2... and I think I just figured it out. One node had directories from former OSDs in /var/lib/ceph/osd. When restarting other OSDs on this host, ceph apparently added those to the crush map, too.

[root@sm-cld-mtl-013 osd]# ls -la /var/lib/ceph/osd/
total 128
drwxr-x---  8 ceph ceph  90 Feb 24 14:44 .
drwxr-x---  9 ceph ceph 106 Feb 24 14:44 ..
drwxr-xr-x  2 root root   6 Jul  2  2015 ceph-42
drwxr-xr-x  2 root root   6 Jul  2  2015 ceph-43
drwxr-xr-x  1 root root 278 May  4 22:21 ceph-44
drwxr-xr-x  1 root root 278 May  4 22:21 ceph-45
drwxr-xr-x  1 root root 278 May  4 22:25 ceph-67
drwxr-xr-x  1 root root 304 May  4 22:25 ceph-86

(42 and 43 are on a different host, yet when 'systemctl start ceph.target' is used, the osd preflight adds them to the crush map anyway):

May 4 22:13:26 sm-cld-mtl-013 ceph-osd: starting osd.67 at :/0 osd_data /var/lib/ceph/osd/ceph-67 /var/lib/ceph/osd/ceph-67/journal
May 4 22:13:26 sm-cld-mtl-013 ceph-osd: starting osd.45 at :/0 osd_data /var/lib/ceph/osd/ceph-45 /var/lib/ceph/osd/ceph-45/journal
May 4 22:13:26 sm-cld-mtl-013 ceph-osd: WARNING: will not setuid/gid: /var/lib/ceph/osd/ceph-42 owned by 0:0 and not requested 167:167
May 4 22:13:26 sm-cld-mtl-013 ceph-osd: 2016-05-04 22:13:26.529176 7f00cca7c900 -1 ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-43: (2) No such file or directory
May 4 22:13:26 sm-cld-mtl-013 ceph-osd: 2016-05-04 22:13:26.534657 7fb55c17e900 -1 ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-42: (2) No such file or directory
May 4 22:13:26 sm-cld-mtl-013 systemd: ceph-osd@43.service: main process exited, code=exited, status=1/FAILURE
May 4 22:13:26 sm-cld-mtl-013 systemd: Unit ceph-osd@43.service entered failed state.
May 4 22:13:26 sm-cld-mtl-013 systemd: ceph-osd@43.service failed.
May 4 22:13:26 sm-cld-mtl-013 systemd: ceph-osd@42.service: main process exited, code=exited, status=1/FAILURE
May 4 22:13:26 sm-cld-mtl-013 systemd: Unit ceph-osd@42.service entered failed state.
May 4 22:13:26 sm-cld-mtl-013 systemd: ceph-osd@42.service failed.

-Ben

On Tue, May 3, 2016 at 7:16 PM, Wade Holler <wade.holler@xxxxxxxxx> wrote:

Hi Ben,

What OS+Version ?

Best Regards,
Wade

On Tue, May 3, 2016 at 2:44 PM Ben Hines <bhines@xxxxxxxxx> wrote:

My crush map keeps putting some OSDs on the wrong node. Restarting them fixes it temporarily, but they eventually hop back to the other node that they aren't really on.

Is there anything that can cause this to look for?

Ceph 9.2.1

-Ben

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com