We have had two situations today where I/O just seems to be indefinitely blocked on our production cluster (0.94.3). In the case this morning, it was just normal I/O traffic, no recovery or backfill. In the case this evening, we were backfilling to some new OSDs. I would have loved to bump up the debugging to get an idea of what was going on, but we ran out of time. During the incident this evening I was able to do some additional troubleshooting, but got real anxious after I/O had been blocked for 10 minutes and Ops was getting hot under the collar.

Here are the important parts of the logs:

[osd.30]
2015-09-18 23:05:36.188251 7efed0ef0700 0 log_channel(cluster) log [WRN] : slow request 30.662958 seconds old, received at 2015-09-18 23:05:05.525220: osd_op(client.3117179.0:18654441 rbd_data.1099d2f67aaea.0000000000000f62 [set-alloc-hint object_size 8388608 write_size 8388608,write 1048576~643072] 4.5ba1672c ack+ondisk+write+known_if_redirected e55919) currently waiting for subops from 32,70,72

[osd.72]
2015-09-18 23:05:19.302985 7f3fa19f8700 0 log_channel(cluster) log [WRN] : slow request 30.200408 seconds old, received at 2015-09-18 23:04:49.102519: osd_op(client.4267090.0:3510311 rbd_data.3f41d41bd65b28.0000000000009e2b [set-alloc-hint object_size 4194304 write_size 4194304,write 1048576~421888] 17.40adcada ack+ondisk+write+known_if_redirected e55919) currently waiting for subops from 2,30,90

The other OSDs listed (32, 70, 2, 90) did not have any errors in their logs about blocked I/O. It seems that osd.30 was waiting for osd.72 and vice versa. I looked at top and iostat on these two hosts, and the OSD processes and disk I/O were pretty idle.

I know that this isn't a lot to go on. Our cluster is under very heavy load and we get several blocked I/Os every hour, but they usually clear up within 15 seconds. We seem to get blocked I/O when the op latency of the cluster goes above 1 (average from all OSDs as seen by Graphite).
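To make the "osd.30 waits on osd.72 and vice versa" observation easier to check across many log files, here is a minimal sketch that pulls the "currently waiting for subops from" lists out of slow-request warnings and reports OSD pairs that are waiting on each other. It assumes the log lines have already been grouped by the OSD that wrote them and that they look like the two warnings quoted above; the names `waiting_graph` and `mutual_waits` are made up for illustration and are not part of Ceph.

```python
import re
from collections import defaultdict

# Matches the tail of a slow-request warning in the format quoted above.
WAITING_RE = re.compile(r"currently waiting for subops from ([\d,]+)")


def waiting_graph(logs_by_osd):
    """Map each reporting OSD id to the set of OSD ids it is waiting on.

    logs_by_osd is a hypothetical input shape: {osd_id: [log_line, ...]}.
    """
    graph = defaultdict(set)
    for osd, lines in logs_by_osd.items():
        for line in lines:
            match = WAITING_RE.search(line)
            if match:
                ids = (int(x) for x in match.group(1).split(",") if x)
                graph[osd].update(ids)
    return dict(graph)


def mutual_waits(graph):
    """Return sorted (a, b) pairs with a < b where each waits on the other."""
    pairs = set()
    for a, targets in graph.items():
        for b in targets:
            if a < b and a in graph.get(b, ()):
                pairs.add((a, b))
    return sorted(pairs)


if __name__ == "__main__":
    logs = {
        30: ["slow request 30.662958 seconds old ... "
             "currently waiting for subops from 32,70,72"],
        72: ["slow request 30.200408 seconds old ... "
             "currently waiting for subops from 2,30,90"],
    }
    print(mutual_waits(waiting_graph(logs)))
```

A mutual pair only shows that both OSDs have slow requests pending on each other; it does not by itself prove a deadlock, since the two warnings here refer to different client ops in different PGs.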
Has anyone seen this kind of indefinitely blocked I/O? Bouncing osd.72 immediately cleared all the blocked I/O, and it was fine after rejoining the cluster. Which logs should I increase, and to what level, to get the most useful information for troubleshooting this?

I hope this makes sense; it has been a long day.

----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1