Re: osd down detection broken in jewel?


 



It's right there in your config. 

mon osd report timeout = 900

See: http://docs.ceph.com/docs/jewel/rados/configuration/mon-osd-interaction/
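If you need faster detection you can drop it at runtime; something along these lines should do it (adjust the mon ID and the value to taste):

ceph daemon mon.$(hostname -s) config get mon_osd_report_timeout
ceph tell mon.* injectargs '--mon_osd_report_timeout 300'

Note that this timeout is only the fallback for OSDs that stop sending pg stats to the mons; peer failure reports should normally get an OSD marked down much sooner.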

___

John Petrini

NOC Systems Administrator   //   CoreDial, LLC   //   coredial.com
Hillcrest I, 751 Arbor Way, Suite 150, Blue Bell PA, 19422 
P: 215.297.4400 x232   //   F: 215.297.4401   //   E: jpetrini@xxxxxxxxxxxx




On Wed, Nov 30, 2016 at 6:39 AM, Manuel Lausch <manuel.lausch@xxxxxxxx> wrote:
Hi,

In a test with Ceph Jewel we measured how long the cluster needs to detect and mark OSDs down after they are killed (with kill -9). The result: 900 seconds.

In Hammer this took about 20 - 30 seconds.
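We timed this roughly as follows (just a sketch; osd.7 stands in for any killed OSD, and <PID> is its ceph-osd process):

date
kill -9 <PID>
while ceph osd dump | grep '^osd\.7 ' | grep -q ' up '; do sleep 1; done
date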

The log file of the leader monitor contains a lot of messages like:
2016-11-30 11:32:20.966567 7f158f5ab700  0 log_channel(cluster) log [DBG] : osd.7 10.78.43.141:8120/106673 reported failed by osd.272 10.78.43.145:8106/117053
A deeper look at this shows that a lot of OSDs reported this exactly once each. In Hammer, the OSDs reported a down OSD a few more times.
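We counted the reports per reporting OSD with something like this (the log path depends on the setup):

grep 'reported failed' /var/log/ceph/ceph-mon.*.log | awk '{print $(NF-1)}' | sort | uniq -c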

Finally the following message appears and the OSD is marked down:
2016-11-30 11:36:22.633253 7f158fdac700  0 log_channel(cluster) log [INF] : osd.7 marked down after no pg stats for 900.982893seconds

In my ceph.conf I have the following lines in the global section:
mon osd min down reporters = 10
mon osd min down reports = 3
mon osd report timeout = 900

It seems the parameter "mon osd min down reports" was removed in Jewel, but the documentation was not updated -> http://docs.ceph.com/docs/jewel/rados/configuration/mon-osd-interaction/
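As far as I can tell, Jewel instead counts failure reports per CRUSH subtree, steered by something like the following (the defaults, if I read this correctly, are 2 and "host"):

mon osd min down reporters = 2
mon osd reporter subtree level = host

But I am not sure how this is supposed to interact with our settings above.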


Can someone tell me how Ceph Jewel detects down OSDs, and how to get them marked down within a reasonable time?


The Cluster:
ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
24 hosts with 60 OSDs each -> 1440 OSDs
2 pools with replication factor 4
65536 PGs
5 Mons

--
Manuel Lausch

Systemadministrator
Cloud Services

1&1 Mail & Media Development & Technology GmbH | Brauerstraße 48 | 76135 Karlsruhe | Germany
Phone: +49 721 91374-1847
E-Mail: manuel.lausch@xxxxxxxx | Web: www.1und1.de



_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

