Re: OSDs are flapping and marked down wrongly

Wido den Hollander <wido@xxxxxxxx> · Mon, 17 Oct 2016 09:23:02 +0200 (CEST)

> On 17 October 2016 at 9:16, Somnath Roy <Somnath.Roy@xxxxxxxxxxx> wrote:
> 
> 
> Hi Sage et al.,
> 
> I know this issue has been reported a number of times in the community and attributed to either network issues or unresponsive OSDs.
> Recently, we have been seeing this issue when our all-SSD cluster (Jewel based) is stressed with a large block size and a very high queue depth (QD). With a lower QD it works just fine.
> We are seeing the lossy connection message like the one below, followed by the OSD being marked down by the monitor.
> 
> 2016-10-15 14:30:13.957534 7f6297bff700  0 -- 10.10.10.94:6810/2461767 submit_message osd_op_reply(1463 rbd_data.55246b8b4567.000000000000d633 [set-alloc-hint object_size 4194304 write_size 4194304,write 3932160~262144] v222'95890 uv95890 ondisk = 0) v7 remote, 10.10.10.98:0/1174431362, dropping message
> 
> In the monitor log, I see that the OSD is reported down by its peers and the monitor subsequently marks it down.
> The OSD rejoins the cluster after detecting that it was wrongly marked down, and rebalancing starts. This is hurting performance very badly.
> 
> My questions are the following.
> 
> 1. I have a 40Gb network and I see that it is not utilized beyond 10-12Gb/s, with no network errors reported. So why is this lossy connection message appearing? What could be going wrong here? Is it a network prioritization issue for the smaller ping packets? I watched the ping round-trip time during this and nothing seemed abnormal.
> 
> 2. Nothing is saturated on the OSD side; plenty of network/memory/CPU/disk is left. So I doubt my OSDs are unresponsive, although the IO path is really busy. Heartbeats go through a separate messenger and separate threads as well, so busy op threads should not delay the heartbeats. Increasing osd heartbeat grace only delays this phenomenon; it still happens eventually, after several hours. Is there anything else we can tune here?
> 
> 3. What could be the side effects of a big grace period? I understand that detecting a faulty OSD will be delayed; anything else?
> 

You might want to look at:

OPTION(mon_osd_min_down_reporters, OPT_INT, 1)   // number of OSDs who need to report a down OSD for it to count
OPTION(mon_osd_min_down_reports, OPT_INT, 3)     // number of times a down OSD must be reported for it to count

Setting 'mon_osd_min_down_reporters' to 3 means that 3 individual OSDs have to report an OSD as down before it counts. You could also increase the number of reports.

On larger environments I always set reporters to 3 or 5, just to prevent such flapping.
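
To illustrate how the two thresholds combine, here is a rough Python sketch. It is a toy model and not the actual monitor code: the class and method names are made up for illustration, and I am assuming that both conditions (enough distinct reporters and enough total reports) have to hold before an OSD counts as down.

```python
from collections import defaultdict


class FailureTracker:
    """Toy model of a monitor counting failure reports per target OSD.

    Hypothetical helper for illustration only; not Ceph's implementation.
    """

    def __init__(self, min_down_reporters=1, min_down_reports=3):
        self.min_down_reporters = min_down_reporters
        self.min_down_reports = min_down_reports
        # target OSD -> reporting OSD -> number of reports received
        self.reports = defaultdict(lambda: defaultdict(int))

    def report_failure(self, target, reporter):
        """Record one failure report and say whether the target now counts as down."""
        self.reports[target][reporter] += 1
        return self.should_mark_down(target)

    def should_mark_down(self, target):
        per_reporter = self.reports.get(target, {})
        distinct_reporters = len(per_reporter)
        total_reports = sum(per_reporter.values())
        # Assumption: both thresholds must be met.
        return (distinct_reporters >= self.min_down_reporters
                and total_reports >= self.min_down_reports)


# With reporters = 1, a single peer that repeats its report is enough.
loose = FailureTracker(min_down_reporters=1, min_down_reports=3)
loose_results = [loose.report_failure("osd.7", "osd.1") for _ in range(3)]

# With reporters = 3, one noisy peer alone can never mark osd.7 down.
strict = FailureTracker(min_down_reporters=3, min_down_reports=3)
noisy_results = [strict.report_failure("osd.7", "osd.1") for _ in range(5)]
second_peer = strict.report_failure("osd.7", "osd.2")
third_peer = strict.report_failure("osd.7", "osd.3")
```

In this model a single busy peer that keeps missing heartbeats can get an OSD marked down when reporters is 1, which is why raising it helps against flapping.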

> 4. I saw that if an OSD crashes, the monitor detects the down OSD almost instantaneously and does not wait for this grace period. How does it distinguish between unresponsive and crashed OSDs? In which scenario does the heartbeat grace come into the picture?
> 

A crashed OSD will not be detected by the MON itself. It is the other OSDs that inform the monitor about this OSD crashing, but they first have to wait for the heartbeats to time out.

Only when an OSD shuts down gracefully will it mark itself down instantly.
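
The difference between the two cases can be sketched like this. Again a simplified Python model and not Ceph code: the class and its methods are hypothetical, and the 20 second grace is what I believe the default for osd heartbeat grace to be, so check it against your version.

```python
class HeartbeatPeer:
    """Toy view of one OSD as seen by a peer that exchanges heartbeats with it.

    Hypothetical helper for illustration only.
    """

    def __init__(self, name, grace=20.0, now=0.0):
        self.name = name
        self.grace = grace          # seconds without a reply before reporting
        self.last_reply = now       # time of the last heartbeat reply
        self.marked_down = False    # set when the OSD marks itself down

    def on_reply(self, now):
        """A heartbeat reply arrived; reset the timer."""
        self.last_reply = now

    def on_graceful_shutdown(self):
        """The OSD shut down cleanly and marked itself down."""
        self.marked_down = True

    def looks_failed(self, now):
        """True if the OSD is known down or has been silent longer than the grace."""
        return self.marked_down or (now - self.last_reply) > self.grace


# Crashed OSD: it just goes silent, so peers only notice after the grace period.
crashed = HeartbeatPeer("osd.7", grace=20.0, now=0.0)
crashed_at_10s = crashed.looks_failed(10.0)
crashed_at_21s = crashed.looks_failed(21.0)

# Graceful shutdown: known down right away, no waiting for the grace period.
stopped = HeartbeatPeer("osd.8", grace=20.0, now=0.0)
stopped.on_graceful_shutdown()
stopped_at_1s = stopped.looks_failed(1.0)
```

A larger grace moves the point at which a silent OSD gets reported further out, which is the delayed detection you mention in question 3.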

Wido

> Any help in clarifying this would be much appreciated.
> 
> Thanks & Regards
> Somnath
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com