Is it this? https://bugzilla.redhat.com/show_bug.cgi?id=1430588

On Fri, Sep 8, 2017 at 7:01 AM, Matthew Stroud <mattstroud@xxxxxxxxxxxxx> wrote:
> After some troubleshooting, the issues appear to be caused by gnocchi using
> rados. I'm trying to figure out why.
>
> Thanks,
>
> Matthew Stroud
>
> From: Brian Andrus <brian.andrus@xxxxxxxxxxxxx>
> Date: Thursday, September 7, 2017 at 1:53 PM
> To: Matthew Stroud <mattstroud@xxxxxxxxxxxxx>
> Cc: David Turner <drakonstein@xxxxxxxxx>, "ceph-users@xxxxxxxxxxxxxx" <ceph-users@xxxxxxxxxxxxxx>
> Subject: Re: Blocked requests
>
> "ceph osd blocked-by" can do the same thing as that provided script.
>
> Can you post relevant osd.10 logs and a pg dump of an affected placement
> group? I am specifically interested in the recovery_state section.
>
> Hopefully you were careful in how you were rebooting OSDs, and did not reboot
> multiple OSDs in the same failure domain before recovery was able to occur.
>
> On Thu, Sep 7, 2017 at 12:30 PM, Matthew Stroud <mattstroud@xxxxxxxxxxxxx> wrote:
>
> Here is the output of your snippet:
>
> [root@mon01 ceph-conf]# bash /tmp/ceph_foo.sh
>       6 osd.10
> 52 ops are blocked > 4194.3 sec on osd.17
> 9 ops are blocked > 2097.15 sec on osd.10
> 4 ops are blocked > 1048.58 sec on osd.10
> 39 ops are blocked > 262.144 sec on osd.10
> 19 ops are blocked > 131.072 sec on osd.10
> 6 ops are blocked > 65.536 sec on osd.10
> 2 ops are blocked > 32.768 sec on osd.10
>
> Here is some backfilling info:
>
> [root@mon01 ceph-conf]# ceph status
>     cluster 55ebbc2d-c5b7-4beb-9688-0926cefee155
>      health HEALTH_WARN
>             5 pgs backfilling
>             5 pgs degraded
>             5 pgs stuck degraded
>             5 pgs stuck unclean
>             5 pgs stuck undersized
>             5 pgs undersized
>             122 requests are blocked > 32 sec
>             recovery 2361/1097929 objects degraded (0.215%)
>             recovery 5578/1097929 objects misplaced (0.508%)
>      monmap e1: 3 mons at {mon01=10.20.57.10:6789/0,mon02=10.20.57.11:6789/0,mon03=10.20.57.12:6789/0}
>             election epoch 58, quorum 0,1,2 mon01,mon02,mon03
>      osdmap e6511: 24 osds: 21 up, 21 in; 5 remapped pgs
>             flags sortbitwise,require_jewel_osds
>       pgmap v6474659: 2592 pgs, 5 pools, 333 GB data, 356 kobjects
>             1005 GB used, 20283 GB / 21288 GB avail
>             2361/1097929 objects degraded (0.215%)
>             5578/1097929 objects misplaced (0.508%)
>                 2587 active+clean
>                    5 active+undersized+degraded+remapped+backfilling
>
> [root@mon01 ceph-conf]# ceph pg dump_stuck unclean
> ok
> pg_stat  state                                            up         up_primary  acting   acting_primary
> 3.5c2    active+undersized+degraded+remapped+backfilling  [17,2,10]  17          [17,2]   17
> 3.54a    active+undersized+degraded+remapped+backfilling  [10,19,2]  10          [10,17]  10
> 5.3b     active+undersized+degraded+remapped+backfilling  [3,19,0]   3           [10,17]  10
> 5.b3     active+undersized+degraded+remapped+backfilling  [10,19,2]  10          [10,17]  10
> 3.180    active+undersized+degraded+remapped+backfilling  [17,10,6]  17          [22,19]  22
>
> Most of the backfilling was caused by restarting OSDs to clear blocked IO.
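For the recovery_state section Brian asked about, a minimal sketch that pulls it for each of the stuck PGs above (the pg ids are taken from the dump_stuck output, and jq is assumed to be installed):

    # query each stuck PG and print only its recovery_state section
    for pg in 3.5c2 3.54a 5.3b 5.b3 3.180; do
        echo "=== $pg ==="
        ceph pg "$pg" query | jq '.recovery_state'
    done

The recovery_state section is where a PG reports what it is currently waiting on, which is what makes it useful for blocked-request debugging.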
> Here are some of the blocked IOs:
>
> /var/log/ceph/ceph.log:2017-09-07 13:29:36.978559 osd.10 10.20.57.15:6806/7029 9362 : cluster [WRN] slow request 60.834494 seconds old, received at 2017-09-07 13:28:36.143920: osd_op(client.114947.0:2039090 5.e637a4b3 (undecoded) ack+read+balance_reads+skiprwlocks+known_if_redirected e6511) currently queued_for_pg
> /var/log/ceph/ceph.log:2017-09-07 13:29:36.978565 osd.10 10.20.57.15:6806/7029 9363 : cluster [WRN] slow request 240.661052 seconds old, received at 2017-09-07 13:25:36.317363: osd_op(client.246934107.0:3 5.f69addd6 (undecoded) ack+read+known_if_redirected e6511) currently queued_for_pg
> /var/log/ceph/ceph.log:2017-09-07 13:29:36.978571 osd.10 10.20.57.15:6806/7029 9364 : cluster [WRN] slow request 240.660763 seconds old, received at 2017-09-07 13:25:36.317651: osd_op(client.246944377.0:2 5.f69addd6 (undecoded) ack+read+known_if_redirected e6511) currently queued_for_pg
> /var/log/ceph/ceph.log:2017-09-07 13:29:36.978576 osd.10 10.20.57.15:6806/7029 9365 : cluster [WRN] slow request 240.660675 seconds old, received at 2017-09-07 13:25:36.317740: osd_op(client.246944377.0:3 5.f69addd6 (undecoded) ack+read+known_if_redirected e6511) currently queued_for_pg
> /var/log/ceph/ceph.log:2017-09-07 13:29:42.979367 osd.10 10.20.57.15:6806/7029 9366 : cluster [WRN] 72 slow requests, 3 included below; oldest blocked for > 1820.342287 secs
> /var/log/ceph/ceph.log:2017-09-07 13:29:42.979373 osd.10 10.20.57.15:6806/7029 9367 : cluster [WRN] slow request 30.606290 seconds old, received at 2017-09-07 13:29:12.372999: osd_op(client.115008.0:996024003 5.e637a4b3 (undecoded) ondisk+write+skiprwlocks+known_if_redirected e6511) currently queued_for_pg
> /var/log/ceph/ceph.log:2017-09-07 13:29:42.979377 osd.10 10.20.57.15:6806/7029 9368 : cluster [WRN] slow request 30.554317 seconds old, received at 2017-09-07 13:29:12.424972: osd_op(client.115020.0:1831942 5.39f2d3b (undecoded) ack+read+known_if_redirected e6511) currently queued_for_pg
> /var/log/ceph/ceph.log:2017-09-07 13:29:42.979383 osd.10 10.20.57.15:6806/7029 9369 : cluster [WRN] slow request 30.368086 seconds old, received at 2017-09-07 13:29:12.611204: osd_op(client.115014.0:73392774 5.e637a4b3 (undecoded) ack+read+balance_reads+skiprwlocks+known_if_redirected e6511) currently queued_for_pg
> /var/log/ceph/ceph.log:2017-09-07 13:29:43.979553 osd.10 10.20.57.15:6806/7029 9370 : cluster [WRN] 73 slow requests, 1 included below; oldest blocked for > 1821.342499 secs
> /var/log/ceph/ceph.log:2017-09-07 13:29:43.979559 osd.10 10.20.57.15:6806/7029 9371 : cluster [WRN] slow request 30.452344 seconds old, received at 2017-09-07 13:29:13.527157: osd_op(client.115011.0:483954528 5.e637a4b3 (undecoded) ack+read+balance_reads+skiprwlocks+known_if_redirected e6511) currently queued_for_pg
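If you want to see whether the same clients or objects keep showing up in those slow request entries (for example, 5.e637a4b3 appears several times above), here is a rough sketch that counts them from ceph.log; the field formats are assumed from the lines above:

    # count slow requests per client id
    grep 'slow request' /var/log/ceph/ceph.log | grep -Eo 'client\.[0-9]+' | sort | uniq -c | sort -rn | head
    # count slow requests per object (pool.hash as printed in the osd_op lines)
    grep 'slow request' /var/log/ceph/ceph.log | grep -Eo '[0-9]+\.[0-9a-f]{7,8} \(undecoded\)' | sort | uniq -c | sort -rn | head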
> From: David Turner <drakonstein@xxxxxxxxx>
> Date: Thursday, September 7, 2017 at 1:17 PM
> To: Matthew Stroud <mattstroud@xxxxxxxxxxxxx>, "ceph-users@xxxxxxxxxxxxxx" <ceph-users@xxxxxxxxxxxxxx>
> Subject: Re: Blocked requests
>
> I would recommend pushing forward with the update instead of rolling back.
> Ceph doesn't have a track record of rolling back to a previous version.
>
> I don't have enough information to really make sense of the ceph health
> detail output. Like, are the OSDs listed all on the same host? As you watch
> this output over time, are some of the requests clearing up? Are there any
> other patterns? I put the following in a script and run it in a watch
> command to try and follow patterns when I'm plagued with blocked requests.
>
> output=$(ceph --cluster $cluster health detail | grep 'ops are blocked' | sort -nrk6 | sed 's/ ops/+ops/' | sed 's/ sec/+sec/' | column -t -s'+')
> echo "$output" | grep -v 'on osd'
> echo "$output" | grep -Eo 'osd\.[0-9]+' | sort -n | uniq -c | grep -v ' 1 '
> echo "$output" | grep 'on osd'
>
> Why do you have backfilling? You haven't mentioned that you have any
> backfilling yet. Installing an update shouldn't cause backfilling, but it's
> likely related to your blocked requests.
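A sketch of David's snippet as a standalone script, plus the watch invocation he mentions. The /tmp/ceph_foo.sh name matches Matthew's earlier output; the "ceph" default cluster name is an assumption:

    #!/bin/bash
    # /tmp/ceph_foo.sh - summarize blocked requests per OSD from `ceph health detail`
    cluster=${1:-ceph}      # cluster name; "ceph" assumed as the default
    output=$(ceph --cluster "$cluster" health detail | grep 'ops are blocked' | sort -nrk6 | sed 's/ ops/+ops/' | sed 's/ sec/+sec/' | column -t -s'+')
    echo "$output" | grep -v 'on osd'                                            # summary lines not tied to one OSD
    echo "$output" | grep -Eo 'osd\.[0-9]+' | sort -n | uniq -c | grep -v ' 1 '  # OSDs that appear more than once
    echo "$output" | grep 'on osd'                                               # per-OSD breakdown, longest-blocked first

Running it under watch makes the patterns easier to spot, e.g. watch -n 10 'bash /tmp/ceph_foo.sh'.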
> On Thu, Sep 7, 2017 at 2:24 PM Matthew Stroud <mattstroud@xxxxxxxxxxxxx> wrote:
>
> Well, in the meantime things have gone from bad to worse: now the cluster
> isn't rebuilding and clients are unable to pass IO to the cluster. When this
> first took place we started rolling back to 10.2.7; though that was
> successful, it didn't help with the issue. Here is the command output:
>
> HEALTH_WARN 39 pgs backfill_wait; 5 pgs backfilling; 43 pgs degraded; 43 pgs stuck degraded; 44 pgs stuck unclean; 43 pgs stuck undersized; 43 pgs undersized; 367 requests are blocked > 32 sec; 14 osds have slow requests; recovery 4678/1097738 objects degraded (0.426%); recovery 10364/1097738 objects misplaced (0.944%)
> pg 3.624 is stuck unclean for 1402.022837, current state active+undersized+degraded+remapped+wait_backfill, last acting [12,9]
> pg 3.587 is stuck unclean for 2536.693566, current state active+undersized+degraded+remapped+wait_backfill, last acting [18,13]
> pg 3.45f is stuck unclean for 1421.178244, current state active+undersized+degraded+remapped+wait_backfill, last acting [14,10]
> pg 3.41a is stuck unclean for 1505.091187, current state active+undersized+degraded+remapped+wait_backfill, last acting [9,23]
> pg 3.4cc is stuck unclean for 1560.824332, current state active+undersized+degraded+remapped+wait_backfill, last acting [18,10]
> <snip>
> pg 3.188 is stuck degraded for 1207.118130, current state active+undersized+degraded+remapped+wait_backfill, last acting [14,17]
> pg 3.768 is stuck degraded for 1123.722910, current state active+undersized+degraded+remapped+wait_backfill, last acting [11,18]
> pg 3.77c is stuck degraded for 1211.981606, current state active+undersized+degraded+remapped+wait_backfill, last acting [9,2]
> pg 3.7d1 is stuck degraded for 1074.422756, current state active+undersized+degraded+remapped+wait_backfill, last acting [10,12]
> pg 3.7d1 is active+undersized+degraded+remapped+wait_backfill, acting [10,12]
> pg 3.77c is active+undersized+degraded+remapped+wait_backfill, acting [9,2]
> pg 3.768 is active+undersized+degraded+remapped+wait_backfill, acting [11,18]
> pg 3.709 is active+undersized+degraded+remapped+wait_backfill, acting [10,4]
> <snip>
> pg 3.5d8 is active+undersized+degraded+remapped+wait_backfill, acting [2,10]
> pg 3.5dc is active+undersized+degraded+remapped+wait_backfill, acting [8,19]
> pg 3.5f8 is active+undersized+degraded+remapped+wait_backfill, acting [2,21]
> pg 3.624 is active+undersized+degraded+remapped+wait_backfill, acting [12,9]
> 2 ops are blocked > 1048.58 sec on osd.9
> 3 ops are blocked > 65.536 sec on osd.9
> 7 ops are blocked > 1048.58 sec on osd.8
> 1 ops are blocked > 524.288 sec on osd.8
> 1 ops are blocked > 131.072 sec on osd.8
> <snip>
> 1 ops are blocked > 524.288 sec on osd.2
> 1 ops are blocked > 262.144 sec on osd.2
> 2 ops are blocked > 65.536 sec on osd.21
> 9 ops are blocked > 1048.58 sec on osd.5
> 9 ops are blocked > 524.288 sec on osd.5
> 71 ops are blocked > 131.072 sec on osd.5
> 19 ops are blocked > 65.536 sec on osd.5
> 35 ops are blocked > 32.768 sec on osd.5
> 14 osds have slow requests
> recovery 4678/1097738 objects degraded (0.426%)
> recovery 10364/1097738 objects misplaced (0.944%)
>
> From: David Turner <drakonstein@xxxxxxxxx>
> Date: Thursday, September 7, 2017 at 11:33 AM
> To: Matthew Stroud <mattstroud@xxxxxxxxxxxxx>, "ceph-users@xxxxxxxxxxxxxx" <ceph-users@xxxxxxxxxxxxxx>
> Subject: Re: Blocked requests
>
> To be fair, other times I have to go in and tweak configuration settings and
> timings to resolve chronic blocked requests.
>
> On Thu, Sep 7, 2017 at 1:32 PM David Turner <drakonstein@xxxxxxxxx> wrote:
>
> `ceph health detail` will give a little more information about the blocked
> requests, specifically which OSDs the requests are blocked on and how long
> they have actually been blocked (as opposed to '> 32 sec'). I usually find
> a pattern after watching that for a time and narrow things down to an OSD,
> journal, etc. Sometimes I just need to restart a specific OSD and all is
> well.
>
> On Thu, Sep 7, 2017 at 10:33 AM Matthew Stroud <mattstroud@xxxxxxxxxxxxx> wrote:
>
> After updating from 10.2.7 to 10.2.9 I have a bunch of blocked requests for
> 'currently waiting for missing object'. I have tried bouncing the OSDs and
> rebooting the OSD nodes, but that just moves the problems around. Prior to
> this upgrade we had no issues. Any ideas of what to look at?
>
> Thanks,
>
> Matthew Stroud
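When the pattern does point at a single OSD, the "restart a specific OSD" step David describes would look roughly like the sketch below on a systemd-based jewel install (the osd id is only an example); per Brian's earlier caution, let recovery finish before touching another OSD in the same failure domain:

    # see which OSDs the requests are blocked on
    ceph health detail | grep 'ops are blocked'
    # restart only the suspect OSD (osd.10 here), then watch recovery complete
    sudo systemctl restart ceph-osd@10
    ceph -w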
>
> --
> Brian Andrus | Cloud Systems Engineer | DreamHost
> brian.andrus@xxxxxxxxxxxxx | www.dreamhost.com
>

--
Cheers,
Brad

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com