Re: Incorrect crush map

Never mind, they just came back. It looks like I had some other issues, such as manually enabled ceph-osd@#.service units in the systemd config for OSDs that had since been moved to different nodes.
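
For anyone wanting to check for the same kind of leftover units, here is a minimal Python sketch that compares enabled ceph-osd@N.service entries against the OSD data directories present on the node. The helper names are hypothetical, and the directory holding the enabled-unit symlinks is an assumption you must supply yourself (it depends on the unit's [Install] section on your system):

```python
import os
import re

UNIT_RE = re.compile(r"^ceph-osd@(\d+)\.service$")
DIR_RE = re.compile(r"^ceph-(\d+)$")


def enabled_osd_ids(wants_dir):
    """Return the OSD ids that have a ceph-osd@N.service entry in wants_dir."""
    ids = set()
    for name in os.listdir(wants_dir):
        match = UNIT_RE.match(name)
        if match:
            ids.add(int(match.group(1)))
    return ids


def local_osd_ids(osd_root):
    """Return the OSD ids that have a ceph-N data directory under osd_root."""
    ids = set()
    for name in os.listdir(osd_root):
        match = DIR_RE.match(name)
        if match and os.path.isdir(os.path.join(osd_root, name)):
            ids.add(int(match.group(1)))
    return ids


def orphaned_units(wants_dir, osd_root):
    """Enabled unit ids with no matching data directory on this node."""
    return sorted(enabled_osd_ids(wants_dir) - local_osd_ids(osd_root))
```

Calling orphaned_units() with the systemd wants directory and /var/lib/ceph/osd lists ids worth a manual look. It only reports; it does not disable anything.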

The root problem is clearly that ceph-osd-prestart updates the crush map before the OSD has successfully started. If there are duplicate IDs, for example due to leftover files, then a working OSD on another node may be forcibly moved in the crush map to a node where it doesn't exist. I would expect OSDs to update their own location in CRUSH, rather than having this be a prestart step.

-Ben


On Wed, May 4, 2016 at 10:27 PM, Ben Hines <bhines@xxxxxxxxx> wrote:
CentOS 7.2.

And I think I just figured it out. One node had directories from former OSDs in /var/lib/ceph/osd. When restarting other OSDs on this host, Ceph apparently added those to the crush map, too.

[root@sm-cld-mtl-013 osd]# ls -la /var/lib/ceph/osd/
total 128
drwxr-x--- 8 ceph ceph  90 Feb 24 14:44 .
drwxr-x--- 9 ceph ceph 106 Feb 24 14:44 ..
drwxr-xr-x 2 root root   6 Jul  2  2015 ceph-42
drwxr-xr-x 2 root root   6 Jul  2  2015 ceph-43
drwxr-xr-x 1 root root 278 May  4 22:21 ceph-44
drwxr-xr-x 1 root root 278 May  4 22:21 ceph-45
drwxr-xr-x 1 root root 278 May  4 22:25 ceph-67
drwxr-xr-x 1 root root 304 May  4 22:25 ceph-86


42 and 43 are on a different host, yet when 'systemctl start ceph.target' is used, the OSD preflight adds them to the crush map anyway:


May  4 22:13:26 sm-cld-mtl-013 ceph-osd: starting osd.67 at :/0 osd_data /var/lib/ceph/osd/ceph-67 /var/lib/ceph/osd/ceph-67/journal
May  4 22:13:26 sm-cld-mtl-013 ceph-osd: starting osd.45 at :/0 osd_data /var/lib/ceph/osd/ceph-45 /var/lib/ceph/osd/ceph-45/journal
May  4 22:13:26 sm-cld-mtl-013 ceph-osd: WARNING: will not setuid/gid: /var/lib/ceph/osd/ceph-42 owned by 0:0 and not requested 167:167
May  4 22:13:26 sm-cld-mtl-013 ceph-osd: 2016-05-04 22:13:26.529176 7f00cca7c900 -1  ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-43: (2) No such file or directory
May  4 22:13:26 sm-cld-mtl-013 ceph-osd: 2016-05-04 22:13:26.534657 7fb55c17e900 -1  ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-42: (2) No such file or directory
May  4 22:13:26 sm-cld-mtl-013 systemd: ceph-osd@43.service: main process exited, code=exited, status=1/FAILURE
May  4 22:13:26 sm-cld-mtl-013 systemd: Unit ceph-osd@43.service entered failed state.
May  4 22:13:26 sm-cld-mtl-013 systemd: ceph-osd@43.service failed.
May  4 22:13:26 sm-cld-mtl-013 systemd: ceph-osd@42.service: main process exited, code=exited, status=1/FAILURE
May  4 22:13:26 sm-cld-mtl-013 systemd: Unit ceph-osd@42.service entered failed state.
May  4 22:13:26 sm-cld-mtl-013 systemd: ceph-osd@42.service failed.
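
The superblock errors above are what you would expect if the ceph-42 and ceph-43 directories are empty leftovers. A minimal Python sketch to flag such directories follows. It assumes that "stale" simply means an empty ceph-N directory, which matches this case but is not a general test for a valid OSD, and the function name is hypothetical:

```python
import os
import re

OSD_DIR_RE = re.compile(r"^ceph-(\d+)$")


def empty_osd_dirs(osd_root="/var/lib/ceph/osd"):
    """Return ids of ceph-N directories under osd_root that contain nothing."""
    stale = []
    for name in os.listdir(osd_root):
        match = OSD_DIR_RE.match(name)
        path = os.path.join(osd_root, name)
        # Assumption: an empty data directory cannot hold a working OSD.
        if match and os.path.isdir(path) and not os.listdir(path):
            stale.append(int(match.group(1)))
    return sorted(stale)
```

Run it as a user that can read the OSD directories. It only lists candidates, so whether to remove a directory remains a manual decision.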



-Ben

On Tue, May 3, 2016 at 7:16 PM, Wade Holler <wade.holler@xxxxxxxxx> wrote:
Hi Ben, 

What OS and version?

Best Regards,
Wade


On Tue, May 3, 2016 at 2:44 PM Ben Hines <bhines@xxxxxxxxx> wrote:
My crush map keeps putting some OSDs on the wrong node. Restarting them fixes it temporarily, but they eventually hop back to the other node that they aren't really on. 

Is there anything I should look for that could cause this?

Ceph 9.2.1

-Ben
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


