Re: Why one crippled osd can slow down or block all request to the whole ceph cluster?

"shadow_lin"<shadow_lin@xxxxxxx> · Wed, 7 Mar 2018 12:22:20 +0800

Hi Turner,
Thanks for your insight.
I am wondering: if the mon can detect slow/blocked requests on a certain 
OSD, why can't the mon mark that OSD down once its requests have been 
blocked for a certain time?

2018-03-07 

shadow_lin 

  From: David Turner <drakonstein@xxxxxxxxx>
  Sent: 2018-03-06 23:56
  Subject: Re: [ceph-users] Why one crippled osd can slow down 
  or block all request to the whole ceph cluster?
  To: "shadow_lin"<shadow_lin@xxxxxxx>
  Cc: "ceph-users"<ceph-users@xxxxxxxxxxxxxx>

  There are multiple settings that affect this.  
  osd_heartbeat_grace is probably the most apt.  If an OSD is not getting a 
  response from another OSD for more than the heartbeat_grace period, then it 
  will tell the mons that the OSD is down.  Once at least 
  mon_osd_min_down_reporters OSDs have reported that an OSD is down, it will be marked down by 
  the cluster.  If the OSD does not then talk to the mons directly to say 
  that it is up, it will be marked out after mon_osd_down_out_interval is 
  reached.  If it does talk to the mons to say that it is up, then it 
  should be responding again and be fine.
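
To make the sequence above concrete, here is a toy Python model of it. It is only a sketch and not Ceph's implementation: the class `OsdState` and its methods are hypothetical names, and the three constants are assumed values for illustration (check your own cluster's settings). It also shows why a half-dead OSD slips through: as long as it answers heartbeats, no peer ever reports it.

```python
from dataclasses import dataclass, field
from typing import Optional

# Assumed values for illustration only; read the real ones from your cluster.
OSD_HEARTBEAT_GRACE = 20         # seconds of heartbeat silence before a peer reports
MON_OSD_MIN_DOWN_REPORTERS = 2   # distinct reporters needed to mark an OSD down
MON_OSD_DOWN_OUT_INTERVAL = 600  # seconds an OSD stays down before it is marked out


@dataclass
class OsdState:
    """Toy view of how the mons track one OSD (hypothetical, not Ceph code)."""
    up: bool = True
    in_cluster: bool = True
    down_since: Optional[float] = None
    reporters: set = field(default_factory=set)

    def peer_report(self, reporter_id, silent_for, now):
        """A peer reports this OSD only if heartbeats were silent past the grace."""
        if silent_for <= OSD_HEARTBEAT_GRACE:
            return  # heartbeats still arrive, so nothing is reported
        self.reporters.add(reporter_id)
        if self.up and len(self.reporters) >= MON_OSD_MIN_DOWN_REPORTERS:
            self.up = False
            self.down_since = now

    def osd_says_up(self):
        """The OSD talks to the mons directly and says that it is up."""
        self.up = True
        self.down_since = None
        self.reporters.clear()

    def tick(self, now):
        """Mark the OSD out once it has been down for the whole interval."""
        if (not self.up and self.in_cluster
                and now - self.down_since >= MON_OSD_DOWN_OUT_INTERVAL):
            self.in_cluster = False


if __name__ == "__main__":
    dead = OsdState()
    dead.peer_report("osd.1", silent_for=25, now=0)
    dead.peer_report("osd.2", silent_for=25, now=1)
    dead.tick(now=700)
    print("dead OSD:", dead.up, dead.in_cluster)

    half_dead = OsdState()  # still heartbeating, but not serving requests
    half_dead.peer_report("osd.1", silent_for=0, now=0)
    half_dead.peer_report("osd.2", silent_for=0, now=1)
    print("half-dead OSD:", half_dead.up, half_dead.in_cluster)
```

In this model nothing looks at request latency at all, which is the gap the original question is about.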

  In your case where the OSD is half up, half down... I believe all you can 
  really do is monitor your cluster and troubleshoot OSDs causing problems like 
  this.  Basically every storage solution is vulnerable to this.  
  Sometimes an OSD just needs to be restarted due to being in a bad state 
  somehow, or simply removed from the cluster because the disk is going 
  bad.
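
The monitoring suggested above could start from something like the following sketch. It assumes you already collect, per OSD, the age in seconds of its oldest blocked request (for example by parsing `ceph health detail`; the parsing is left out here because the output format differs between releases). The function name, the input dict, and the 60-second threshold are all hypothetical.

```python
def osds_to_investigate(oldest_blocked_secs, threshold_secs=60.0):
    """Return the ids of OSDs whose oldest blocked request is older than
    threshold_secs, worst offender first.

    oldest_blocked_secs: dict mapping OSD id -> age in seconds of that OSD's
    oldest blocked request (hypothetical input, gathered by your monitoring).
    """
    suspects = [
        (age, osd_id)
        for osd_id, age in oldest_blocked_secs.items()
        if age > threshold_secs
    ]
    suspects.sort(reverse=True)  # largest age first
    return [osd_id for _age, osd_id in suspects]


if __name__ == "__main__":
    sample = {3: 120.0, 7: 0.0, 9: 75.5}  # made-up numbers for illustration
    for osd_id in osds_to_investigate(sample):
        print(f"osd.{osd_id} has long-blocked requests, check the daemon and its disk")
```

It only produces a list to look at; whether to restart an OSD or remove it from the cluster remains an operator decision.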

  On Sun, Mar 4, 2018 at 2:28 AM shadow_lin <shadow_lin@xxxxxxx> wrote:

    Hi list,
    During my tests of Ceph, I sometimes find that the whole cluster is 
    blocked, and the cause is a single malfunctioning OSD. Ceph can heal 
    itself if an OSD is down, but if an OSD is half dead (it still has a 
    heartbeat but can't handle requests), then all requests directed to that 
    OSD are blocked. If all OSDs are in one pool, the whole cluster ends up 
    blocked because of that one hung OSD.
    I think this is because Ceph tries to distribute requests across all 
    OSDs, and if one OSD won't confirm that a request is done, everything is 
    blocked.
    Is there a way to make Ceph mark the crippled OSD down if requests 
    directed to it have been blocked for more than a certain time, so that 
    the whole cluster is not blocked?

    2018-03-04

    shadow_lin 

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com