Re: Why can one crippled OSD slow down or block all requests to the whole Ceph cluster?

"shadow_lin"<shadow_lin@xxxxxxx> · Wed, 7 Mar 2018 16:55:29 +0800

What you said makes sense.
I have encountered a few hardware-related issues that caused one OSD
to behave abnormally and block all IO across the whole cluster (all OSDs are in
one pool), which got me thinking about how to avoid this situation.

2018-03-07 

shadow_lin 

  From: David Turner <drakonstein@xxxxxxxxx>
  Sent: 2018-03-07 13:51
  Subject: Re: Re: [ceph-users] Why one crippled osd can slow 
  down or block all request to the whole ceph cluster?
  To: "shadow_lin"<shadow_lin@xxxxxxx>
  Cc: "ceph-users"<ceph-users@xxxxxxxxxxxxxx>

  Marking OSDs down is not without risk. You are taking away one of the
  copies of data for every PG on that OSD. You are also causing every PG on that
  OSD to peer. If that OSD comes back up, every PG on it needs to peer again and
  then recover.

  That is a lot of load and risk to automate into the system. Now consider
  other causes of slow requests: more IO load than your spindles can handle,
  backfilling settings set too aggressively (related to the first cause), or
  networking problems. If the mons marked OSDs down whenever they detected slow
  requests, you could end up marking half of your cluster down or corrupting
  data through flapping OSDs.

  The mon will mark OSDs down when the conditions of the settings I mentioned
  are met. If the OSD is still responsive enough to answer other OSDs and the
  mons, then there really isn't much that Ceph can do to automate this safely.
  There are just so many variables. If Ceph were a closed system on specific
  hardware, it could certainly monitor that hardware closely for early warning
  signs... But people are running Ceph on everything they can compile it for,
  including Raspberry Pis. The cluster admin, however, should be able to add
  their own early detection for failures.

  You can monitor a lot about disks, including things such as the average
  await on a host, to see whether the disks are taking longer than normal to
  respond. That particular check led us to discover that several of our storage
  nodes had bad cache batteries on their controllers. Finding that explained some
  slowness we had noticed in the cluster, and it also gave us a better way to
  catch that scenario sooner.
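One way to compute the average await mentioned above is the minimal sketch below. It assumes a Linux host and the standard `/proc/diskstats` layout, where (counting from 1) field 4 is reads completed, field 7 is milliseconds spent reading, field 8 is writes completed and field 11 is milliseconds spent writing. The function names and the choice to diff two samples are illustrative, not anything from this thread. Await here is the change in IO time divided by the change in completed IOs over the interval.

```python
"""Sketch: estimate per-disk average await (ms per completed IO) on Linux.

Assumes the Linux /proc/diskstats layout: after major, minor and device name,
the fields are reads completed, reads merged, sectors read, ms reading,
writes completed, writes merged, sectors written, ms writing, ...
"""


def parse_diskstats(text):
    """Return {device: (ios_completed, ms_spent)} from /proc/diskstats text."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 11:
            continue  # skip malformed or short lines
        name = fields[2]
        reads, ms_reading = int(fields[3]), int(fields[6])
        writes, ms_writing = int(fields[7]), int(fields[10])
        stats[name] = (reads + writes, ms_reading + ms_writing)
    return stats


def average_await(before, after):
    """Average await in ms per device between two parse_diskstats samples."""
    result = {}
    for dev, (ios_after, ms_after) in after.items():
        if dev not in before:
            continue  # device appeared between samples; no baseline
        ios_before, ms_before = before[dev]
        d_ios = ios_after - ios_before
        # No completed IO in the interval means await is undefined; report 0.0.
        result[dev] = (ms_after - ms_before) / d_ios if d_ios > 0 else 0.0
    return result


def sample(path="/proc/diskstats"):
    """Read and parse one snapshot of the kernel's disk counters."""
    with open(path) as f:
        return parse_diskstats(f.read())
```

To use it, call `sample()` twice a few seconds apart and pass both snapshots to `average_await()`. A disk whose await stays well above its peers on the same host is the kind of signal that can point to problems like failing controller cache batteries; the alert threshold is up to you.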

  On Tue, Mar 6, 2018, 11:22 PM shadow_lin <shadow_lin@xxxxxxx> wrote:

    Hi Turner,
    Thanks for your insight.
    I am wondering: if the mon can detect slow/blocked requests from a certain
    OSD, why can't it mark that OSD down once a request has been blocked for a
    certain time?

    2018-03-07 

    shadow_lin 

      From: David Turner <drakonstein@xxxxxxxxx>
      Sent: 2018-03-06 23:56
      Subject: Re: [ceph-users] Why one crippled osd can slow 
      down or block all request to the whole ceph cluster?
      To: "shadow_lin"<shadow_lin@xxxxxxx>
      Cc: "ceph-users"<ceph-users@xxxxxxxxxxxxxx>

      There are multiple settings that affect this.
      osd_heartbeat_grace is probably the most apt. If an OSD does not get a
      response from another OSD for longer than the heartbeat grace period, it
      will tell the mons that the other OSD is down. Once
      mon_osd_min_down_reporters OSDs have reported it down, the OSD will be
      marked down by the cluster. If the OSD does not then contact the mons
      directly to say that it is up, it will be marked out once
      mon_osd_down_out_interval is reached. If it does tell the mons that it is
      up, then it should be responding again and be fine.
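The sequence above can be restated as a small decision function. This is a hypothetical toy model, not Ceph's implementation. It only reuses the real option names as plain parameters so the order of the checks is easy to follow.

```python
"""Toy model of how the mons decide an OSD is down and then out.

Hypothetical: this is NOT Ceph's code. It restates the sequence described
above using the real option names as plain parameters; the numeric values in
the example are placeholders.
"""
from dataclasses import dataclass


@dataclass
class Config:
    osd_heartbeat_grace: float        # seconds a peer waits for a heartbeat reply
    mon_osd_min_down_reporters: int   # peers that must report the failure
    mon_osd_down_out_interval: float  # seconds down before being marked out


def osd_state(cfg, silent_for_by_peer, seconds_since_marked_down, osd_told_mon_up):
    """Return 'up', 'down' or 'out' for one OSD in this toy model.

    silent_for_by_peer: {peer_id: seconds since that peer got a heartbeat reply}
    seconds_since_marked_down: None if the OSD has not been marked down yet
    osd_told_mon_up: True if the OSD itself contacted the mons to say it is up
    """
    if osd_told_mon_up:
        return "up"
    if seconds_since_marked_down is None:
        # Count peers that have waited longer than the grace period.
        reporters = sum(1 for silent in silent_for_by_peer.values()
                        if silent > cfg.osd_heartbeat_grace)
        return "down" if reporters >= cfg.mon_osd_min_down_reporters else "up"
    if seconds_since_marked_down >= cfg.mon_osd_down_out_interval:
        return "out"
    return "down"


cfg = Config(osd_heartbeat_grace=20, mon_osd_min_down_reporters=2,
             mon_osd_down_out_interval=600)
# Peers still get heartbeat replies, so a half-dead OSD is never reported.
print(osd_state(cfg, {1: 1.0, 2: 2.5, 3: 0.8}, None, False))  # up
```

The values 20, 2 and 600 are placeholders; the real ones for a given cluster can be read through the admin socket with `ceph daemon <daemon> config get <option>`. The point the model makes is that an OSD whose peers still receive heartbeat replies stays `up` no matter how long its client requests hang.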

      In your case, where the OSD is half up and half down... I believe all you
      can really do is monitor your cluster and troubleshoot OSDs that cause
      problems like this. Basically every storage solution is vulnerable to
      this. Sometimes an OSD just needs to be restarted because it has somehow
      gotten into a bad state, or simply removed from the cluster because the
      disk is going bad.
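For the "monitor your cluster" part, one simple approach is to flag OSDs whose latency stands out from the rest of the cluster and alert a person, rather than marking anything down automatically. This is a minimal sketch that assumes you already collect one latency figure per OSD in milliseconds (for example from `ceph osd perf` or an existing metrics pipeline) into a dict. The median-based rule and the `factor` and `floor_ms` knobs are hypothetical choices, not Ceph settings.

```python
"""Sketch: flag OSDs whose latency is far above the rest of the cluster.

Assumption: latency_ms is {osd_id: latency in ms}, collected elsewhere.
factor and floor_ms are hypothetical tuning knobs, not Ceph options.
"""
from statistics import median


def suspicious_osds(latency_ms, factor=5.0, floor_ms=10.0):
    """Return sorted OSD ids whose latency exceeds both factor * median and floor_ms."""
    if not latency_ms:
        return []
    typical = median(latency_ms.values())
    # The floor avoids alerting on tiny absolute differences in a fast cluster.
    threshold = max(factor * typical, floor_ms)
    return sorted(osd for osd, ms in latency_ms.items() if ms > threshold)


print(suspicious_osds({0: 3.0, 1: 4.0, 2: 5.0, 3: 250.0}))  # [3]
```

Comparing against the cluster median rather than a fixed number keeps the check meaningful across very different hardware, and leaving the final decision to an admin avoids the flapping risk described earlier in the thread.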

      On Sun, Mar 4, 2018 at 2:28 AM shadow_lin <shadow_lin@xxxxxxx> 
      wrote:

        Hi list,
        During my testing of Ceph, I found that sometimes the whole Ceph
        cluster was blocked, and the cause was one malfunctioning OSD. Ceph can
        heal itself if an OSD is down, but it seems that if an OSD is half dead
        (it still sends heartbeats but can't handle requests), then all
        requests directed to that OSD are blocked. If all OSDs are in one pool,
        the whole cluster is blocked because of that one hung OSD.
        I think this is because Ceph distributes requests across all OSDs, and
        if one OSD won't confirm that a request is done, everything is blocked.
        Is there a way to make Ceph mark the crippled OSD down if the requests
        directed to it are blocked for more than a certain time, so that the
        whole cluster isn't blocked?
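The reasoning above can be illustrated with a toy model. All of its assumptions are hypothetical: PGs are placed with a seeded random choice instead of CRUSH, and a replicated write is acknowledged only after every OSD in the PG's acting set has committed it, so the write takes as long as the slowest replica. Under those assumptions, one hung OSD stalls writes to every PG that includes it (on average size/num_osds of all PGs), and a client whose data spans many objects, such as an RBD image, will quickly hit one of those PGs.

```python
"""Toy illustration of why one hung OSD can stall writes across a whole pool.

Hypothetical assumptions, not Ceph internals: PGs are placed on OSDs by a
seeded random choice instead of CRUSH, and a replicated write completes only
when every OSD in the PG's acting set has committed it.
"""
import random


def place_pgs(num_pgs, num_osds, size, seed=0):
    """Map each PG id to `size` distinct OSD ids (a stand-in for CRUSH)."""
    rng = random.Random(seed)
    return {pg: rng.sample(range(num_osds), size) for pg in range(num_pgs)}


def write_latency(acting_set, osd_latency):
    """A replicated write finishes only when the slowest replica commits."""
    return max(osd_latency[osd] for osd in acting_set)


def stalled_pgs(pg_map, hung_osd):
    """PG ids whose writes must wait on the hung OSD."""
    return [pg for pg, acting in pg_map.items() if hung_osd in acting]


pg_map = place_pgs(num_pgs=256, num_osds=12, size=3)
print(f"{len(stalled_pgs(pg_map, hung_osd=5))} of {len(pg_map)} PGs wait on osd.5")

latency = {osd: 2.0 for osd in range(12)}
latency[5] = float("inf")  # osd.5 accepts heartbeats but never commits writes
print(write_latency([5, 1, 7], latency))  # inf
```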

        2018-03-04

        shadow_lin 

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com