Re: Blocked requests

I would recommend pushing forward with the update instead of rolling back.  Ceph doesn't really have a track record of successful rollbacks to a previous version.

I don't have enough information to really make sense of the ceph health detail output.  For example, are the OSDs listed all on the same host?  As you watch this output over time, are some of the requests clearing up?  Are there any other patterns?  I put the following in a script and run it under watch to try to follow patterns when I'm plagued with blocked requests.
    # $cluster is assumed to be set earlier in the script (e.g. cluster=ceph)
    output=$(ceph --cluster $cluster health detail | grep 'ops are blocked' | sort -nrk6 | sed 's/ ops/+ops/' | sed 's/ sec/+sec/' | column -t -s'+')
    # Summary lines that aren't attributed to a specific OSD
    echo "$output" | grep -v 'on osd'
    # Count how many blocked-op buckets each OSD appears in (hide OSDs that appear only once)
    echo "$output" | grep -Eo 'osd\.[0-9]+' | sort -n | uniq -c | grep -v ' 1 '
    # The per-OSD lines themselves
    echo "$output" | grep 'on osd'

Why do you have backfilling?  You hadn't mentioned any backfilling until now.  Installing an update shouldn't cause backfilling, but it's likely related to your blocked requests.
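
A quick thing to check (just a suggestion, the commands below aren't based on anything in your output) is whether an OSD went down or out during the upgrade, since that's the usual reason backfilling starts on its own:

    ceph osd stat                 # how many OSDs are up/in vs. the total
    ceph osd tree | grep down     # any OSDs currently marked down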

On Thu, Sep 7, 2017 at 2:24 PM Matthew Stroud <mattstroud@xxxxxxxxxxxxx> wrote:

Well, in the meantime things have gone from bad to worse: now the cluster isn't rebuilding and clients are unable to pass IO to the cluster. When this first took place, we started rolling back to 10.2.7; though the rollback was successful, it didn't help with the issue. Here is the command output:

 

HEALTH_WARN 39 pgs backfill_wait; 5 pgs backfilling; 43 pgs degraded; 43 pgs stuck degraded; 44 pgs stuck unclean; 43 pgs stuck undersized; 43 pgs undersized; 367 requests are blocked > 32 sec; 14 osds have slow requests; recovery 4678/1097738 objects degraded (0.426%); recovery 10364/1097738 objects misplaced (0.944%)

pg 3.624 is stuck unclean for 1402.022837, current state active+undersized+degraded+remapped+wait_backfill, last acting [12,9]

pg 3.587 is stuck unclean for 2536.693566, current state active+undersized+degraded+remapped+wait_backfill, last acting [18,13]

pg 3.45f is stuck unclean for 1421.178244, current state active+undersized+degraded+remapped+wait_backfill, last acting [14,10]

pg 3.41a is stuck unclean for 1505.091187, current state active+undersized+degraded+remapped+wait_backfill, last acting [9,23]

pg 3.4cc is stuck unclean for 1560.824332, current state active+undersized+degraded+remapped+wait_backfill, last acting [18,10]

<snip>

pg 3.188 is stuck degraded for 1207.118130, current state active+undersized+degraded+remapped+wait_backfill, last acting [14,17]

pg 3.768 is stuck degraded for 1123.722910, current state active+undersized+degraded+remapped+wait_backfill, last acting [11,18]

pg 3.77c is stuck degraded for 1211.981606, current state active+undersized+degraded+remapped+wait_backfill, last acting [9,2]

pg 3.7d1 is stuck degraded for 1074.422756, current state active+undersized+degraded+remapped+wait_backfill, last acting [10,12]

pg 3.7d1 is active+undersized+degraded+remapped+wait_backfill, acting [10,12]

pg 3.77c is active+undersized+degraded+remapped+wait_backfill, acting [9,2]

pg 3.768 is active+undersized+degraded+remapped+wait_backfill, acting [11,18]

pg 3.709 is active+undersized+degraded+remapped+wait_backfill, acting [10,4]

<snip>

pg 3.5d8 is active+undersized+degraded+remapped+wait_backfill, acting [2,10]

pg 3.5dc is active+undersized+degraded+remapped+wait_backfill, acting [8,19]

pg 3.5f8 is active+undersized+degraded+remapped+wait_backfill, acting [2,21]

pg 3.624 is active+undersized+degraded+remapped+wait_backfill, acting [12,9]

2 ops are blocked > 1048.58 sec on osd.9

3 ops are blocked > 65.536 sec on osd.9

7 ops are blocked > 1048.58 sec on osd.8

1 ops are blocked > 524.288 sec on osd.8

1 ops are blocked > 131.072 sec on osd.8

<snip>

1 ops are blocked > 524.288 sec on osd.2

1 ops are blocked > 262.144 sec on osd.2

2 ops are blocked > 65.536 sec on osd.21

9 ops are blocked > 1048.58 sec on osd.5

9 ops are blocked > 524.288 sec on osd.5

71 ops are blocked > 131.072 sec on osd.5

19 ops are blocked > 65.536 sec on osd.5

35 ops are blocked > 32.768 sec on osd.5

14 osds have slow requests

recovery 4678/1097738 objects degraded (0.426%)

recovery 10364/1097738 objects misplaced (0.944%)

 

 

From: David Turner <drakonstein@xxxxxxxxx>
Date: Thursday, September 7, 2017 at 11:33 AM
To: Matthew Stroud <mattstroud@xxxxxxxxxxxxx>, "ceph-users@xxxxxxxxxxxxxx" <ceph-users@xxxxxxxxxxxxxx>
Subject: Re: Blocked requests

 

To be fair, other times I have to go in and tweak configuration settings and timings to resolve chronic blocked requests.
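
To give a concrete (purely illustrative) example of the kind of tweak I mean, throttling backfill and recovery with the stock Jewel options, using example values rather than a recommendation for your cluster:

    ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1 --osd_recovery_op_priority 1'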

 

On Thu, Sep 7, 2017 at 1:32 PM David Turner <drakonstein@xxxxxxxxx> wrote:

`ceph health detail` will give a little more information about the blocked requests, specifically which OSDs the requests are blocked on and how long they have actually been blocked (as opposed to just '> 32 sec').  I usually find a pattern after watching that for a while and narrow things down to an OSD, journal, etc.  Sometimes I just need to restart a specific OSD and all is well.
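
For instance, on a systemd-based host (the OSD id here is just a placeholder), restarting a single OSD is:

    systemctl restart ceph-osd@5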

 

On Thu, Sep 7, 2017 at 10:33 AM Matthew Stroud <mattstroud@xxxxxxxxxxxxx> wrote:

After updating from 10.2.7 to 10.2.9, I have a bunch of blocked requests for ‘currently waiting for missing object’. I have tried bouncing the OSDs and rebooting the OSD nodes, but that just moves the problem around. Prior to this upgrade we had no issues. Any ideas on what to look at?

 

Thanks,

Matthew Stroud

 







_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
