Re: cannot reboot one of 3 nodes without locking a cluster OSDs stay in...

Can you give some specifics?  Ceph version, whether the disks were created with ceph-disk or ceph-volume, filestore or bluestore, whether the cluster was upgraded from another version, whether anything has changed recently (an upgrade, migrating some OSDs from filestore to bluestore), etc.
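
If it helps, something like this should collect most of that from the cluster itself (a rough sketch, assuming a Luminous-or-later cluster, an admin keyring on the node you run it from, and the default tooling; <id> is a placeholder for one of your OSD ids):

  # Ceph release reported by every daemon (Luminous and later)
  ceph versions

  # Objectstore type (filestore/bluestore) and version for a single OSD
  ceph osd metadata <id> | grep -E 'osd_objectstore|ceph_version'

  # On the OSD node: which tool the OSDs were created with
  ceph-volume lvm list    # lists ceph-volume-created OSDs, if any
  ceph-disk list          # lists ceph-disk-created OSDs (older tooling)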

Sometimes I've found that a node simply fails to mark its OSDs down, for no apparent reason.  Perhaps it's a race condition where the networking stops before the OSD process gets to the part where it tells the MONs it is going down.  If you manually run `ceph osd down #`, the OSD is marked down without interfering with cluster communication.  This has happened to me sporadically and has never been reproducible.  Have you tried rebooting this server again to see if it keeps happening?  You might find some useful information towards the end of the OSD log once the node comes back up.  That log is easier to read if you disable the OSD from starting automatically with the server, so that the shutdown is the last thing in it.
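
Roughly like this (just a sketch; osd.7 stands in for whichever OSD stays up, and the paths assume the default systemd units and log location):

  # Mark the OSD down in the map by hand; a daemon that is still healthy
  # and able to reach the MONs will simply report itself up again
  ceph osd down 7

  # Before the next reboot test: stop the OSD from starting automatically,
  # so the end of its log is the shutdown you are interested in
  systemctl disable ceph-osd@7

  # After the reboot: read the tail of the log, then bring the OSD back
  tail -n 200 /var/log/ceph/ceph-osd.7.log
  systemctl enable ceph-osd@7
  systemctl start ceph-osd@7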

On Tue, Feb 27, 2018 at 9:05 AM Philip Schroth <philip.schroth@xxxxxxxxxx> wrote:
They are stopped gracefully. I did a reboot two days ago, but now it doesn't work.



2018-02-27 14:24 GMT+01:00 David Turner <drakonstein@xxxxxxxxx>:
`systemctl list-dependencies ceph.target`

I'm guessing that you might need to enable your OSDs to be managed by systemd so that they can be stopped cleanly when the server goes down.

`systemctl enable ceph-osd@{osd number}`
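
For example, to enable every OSD that has a data directory on that node (a sketch assuming the default /var/lib/ceph/osd/ceph-<id> layout):

  # Derive each OSD id from its data directory and enable its unit,
  # e.g. /var/lib/ceph/osd/ceph-3 -> ceph-osd@3
  for dir in /var/lib/ceph/osd/ceph-*; do
      id=${dir##*-}
      systemctl enable ceph-osd@"$id"
  done

  # The OSD instances should now show up under ceph-osd.target here
  systemctl list-dependencies ceph.target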

On Tue, Feb 27, 2018, 4:13 AM Philip Schroth <philip.schroth@xxxxxxxxxx> wrote:
I have a 3-node production cluster. Everything works fine, but I have one failing node. I replaced one disk on Sunday and everything went fine. Last night another disk broke, and Ceph nicely marked it as down. But when I want to reboot this node now, all remaining OSDs are kept in and not marked as down, and the whole cluster locks up during the reboot of this node. Once the failing node is back, I can reboot either of the other two nodes and it works like a charm. Only this node I can no longer reboot without locking, which I could still do on Sunday...

--
Met vriendelijke groet / With kind regards

Philip Schroth




--
Met vriendelijke groet / With kind regards

Philip Schroth

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
