So it sounds like you've got two different things here: 1) You get a lot of slow operations that show up as warnings. 2) Rarely, you get blocked op warnings that don't seem to go away until the cluster state changes somehow. (2) is the interesting one. Since you say the cluster is under heavy load, I presume (1) is just you overloading your servers and getting some hot spots that take time to clear up. There have been bugs in the past where slow op warnings weren't getting removed when they should have. I don't *think* any are in .94.3 but could be wrong. Have you observed these from the other direction, where a client has blocked operations? If you want to go through the logs yourself, you should try and find all the lines about one of the operations which seems to be blocked. They aren't the most readable but if you grep for the operation ID (client.4267090.0:3510311) and then once you're in the right area look for what the threads processing it are doing you should get some idea of where things are going wrong you can share. -Greg On Sun, Sep 20, 2015 at 10:43 PM, Robert LeBlanc <robert@xxxxxxxxxxxxx> wrote: > We set the logging on an OSD that had problems pretty frequently, but > cleared up in less than 30 seconds. The logs are at > http://162.144.87.113/files/ceph-osd.112.log.xz and are uncompressed > at 8.6GB. Some of the messages we were seeing in ceph -w are: > > 2015-09-20 20:55:44.029041 osd.112 [WRN] 10 slow requests, 10 included > below; oldest blocked for > 30.132696 secs > 2015-09-20 20:55:44.029047 osd.112 [WRN] slow request 30.132696 > seconds old, received at 2015-09-20 20:55:13.896286: > osd_op(client.3289538.0:62497509 > rbd_data.29b9ae3f960770.0000000000000200 [stat,set-alloc-hint > object_size 8388608 write_size 8388608,write 2588672~4096] 17.118f0c67 > ack+ondisk+write+known_if_redirected e57590) currently reached_pg > 2015-09-20 20:55:44.029051 osd.112 [WRN] slow request 30.132619 > seconds old, received at 2015-09-20 20:55:13.896363: > osd_op(client.3289538.0:62497510 > rbd_data.29b9ae3f960770.0000000000000200 [stat,set-alloc-hint > object_size 8388608 write_size 8388608,write 2908160~12288] > 17.118f0c67 ack+ondisk+write+known_if_redirected e57590) currently > waiting for rw locks > 2015-09-20 20:55:44.029054 osd.112 [WRN] slow request 30.132520 > seconds old, received at 2015-09-20 20:55:13.896462: > osd_op(client.3289538.0:62497511 > rbd_data.29b9ae3f960770.0000000000000200 [stat,set-alloc-hint > object_size 8388608 write_size 8388608,write 2949120~4096] 17.118f0c67 > ack+ondisk+write+known_if_redirected e57590) currently waiting for rw > locks > 2015-09-20 20:55:44.029058 osd.112 [WRN] slow request 30.132415 > seconds old, received at 2015-09-20 20:55:13.896567: > osd_op(client.3289538.0:62497512 > rbd_data.29b9ae3f960770.0000000000000200 [stat,set-alloc-hint > object_size 8388608 write_size 8388608,write 2957312~4096] 17.118f0c67 > ack+ondisk+write+known_if_redirected e57590) currently waiting for rw > locks > 2015-09-20 20:55:44.029061 osd.112 [WRN] slow request 30.132302 > seconds old, received at 2015-09-20 20:55:13.896680: > osd_op(client.3289538.0:62497513 > rbd_data.29b9ae3f960770.0000000000000200 [stat,set-alloc-hint > object_size 8388608 write_size 8388608,write 2998272~4096] 17.118f0c67 > ack+ondisk+write+known_if_redirected e57590) currently waiting for rw > locks > 2015-09-20 20:55:45.029290 osd.112 [WRN] 9 slow requests, 5 included > below; oldest blocked for > 31.132843 secs > 2015-09-20 20:55:45.029298 osd.112 [WRN] slow request 31.132447 > seconds old, received at 2015-09-20 20:55:13.896759: > osd_op(client.3289538.0:62497514 > rbd_data.29b9ae3f960770.0000000000000200 [stat,set-alloc-hint > object_size 8388608 write_size 8388608,write 3035136~4096] 17.118f0c67 > ack+ondisk+write+known_if_redirected e57590) currently waiting for rw > locks > 2015-09-20 20:55:45.029303 osd.112 [WRN] slow request 31.132362 > seconds old, received at 2015-09-20 20:55:13.896845: > osd_op(client.3289538.0:62497515 > rbd_data.29b9ae3f960770.0000000000000200 [stat,set-alloc-hint > object_size 8388608 write_size 8388608,write 3047424~4096] 17.118f0c67 > ack+ondisk+write+known_if_redirected e57590) currently waiting for rw > locks > 2015-09-20 20:55:45.029309 osd.112 [WRN] slow request 31.132276 > seconds old, received at 2015-09-20 20:55:13.896931: > osd_op(client.3289538.0:62497516 > rbd_data.29b9ae3f960770.0000000000000200 [stat,set-alloc-hint > object_size 8388608 write_size 8388608,write 3072000~4096] 17.118f0c67 > ack+ondisk+write+known_if_redirected e57590) currently waiting for rw > locks > 2015-09-20 20:55:45.029315 osd.112 [WRN] slow request 31.132199 > seconds old, received at 2015-09-20 20:55:13.897008: > osd_op(client.3289538.0:62497517 > rbd_data.29b9ae3f960770.0000000000000200 [stat,set-alloc-hint > object_size 8388608 write_size 8388608,write 3211264~4096] 17.118f0c67 > ack+ondisk+write+known_if_redirected e57590) currently waiting for rw > locks > 2015-09-20 20:55:45.029326 osd.112 [WRN] slow request 31.132127 > seconds old, received at 2015-09-20 20:55:13.897079: > osd_op(client.3289538.0:62497518 > rbd_data.29b9ae3f960770.0000000000000200 [stat,set-alloc-hint > object_size 8388608 write_size 8388608,write 3235840~4096] 17.118f0c67 > ack+ondisk+write+known_if_redirected e57590) currently waiting for rw > locks > ---------------- > Robert LeBlanc > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 > > > On Sun, Sep 20, 2015 at 7:02 PM, Robert LeBlanc <robert@xxxxxxxxxxxxx> wrote: >> -----BEGIN PGP SIGNED MESSAGE----- >> Hash: SHA256 >> >> I was able to catch the tail end of one of these and increased the >> logging on it. I had to kill it a minute or two after the logging was >> increased because of the time of the day. >> >> I've put the logs at https://robert.leblancnet.us/ceph-osd.8.log.xz . >> >> Thanks, >> - ---------------- >> Robert LeBlanc >> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 >> >> >> On Sun, Sep 20, 2015 at 9:03 AM, Robert LeBlanc wrote: >>> -----BEGIN PGP SIGNED MESSAGE----- >>> Hash: SHA256 >>> >>> We had another incident of 100 long blocked I/O this morning, but I >>> didn't get to it in time. I wound up clearing itself after almost >>> 1,000 seconds. On interesting note is that the blocked I/O kept >>> creeping up until I see a bunch of entrys in the log like: >>> >>> 2015-09-20 08:20:01.870141 7f5fbe05c700 0 -- 10.217.72.12:6812/1468 >>>>> 10.217.72.35:0/4027675 pipe(0x2f7a3000 sd=363 :6812 s=0 pgs=0 cs=0 >>> l=1 c=0x1bd40840).accept replacing existing (lossy) channel (new one >>> lossy=1) >>> 2015-09-20 08:20:02.061539 7f5f43de3700 0 -- 10.217.72.12:6812/1468 >>>>> 10.217.72.33:0/2012691 pipe(0x28857000 sd=408 :6812 s=0 pgs=0 cs=0 >>> l=1 c=0x1bd43020).accept replacing existing (lossy) channel (new one >>> lossy=1) >>> 2015-09-20 08:20:02.817884 7f5fb2ca9700 0 -- 10.217.72.12:6812/1468 >>>>> 10.217.72.33:0/2040360 pipe(0x283ff000 sd=605 :6812 s=0 pgs=0 cs=0 >>> l=1 c=0x1bd402c0).accept replacing existing (lossy) channel (new one >>> lossy=1) >>> >>> after almost 100 of these in about a 2 second period, the I/O starts >>> draining over the next 3-4 seconds. What does this message mean? 1468 >>> appears to be the PID of one of the OSD processes, but the number in >>> the same position on the dest is not a PID. What would cause a channel >>> to be dropped and recreated? >>> >>> There are 10 OSDs on this host, but only two other OSDs show anything >>> and only two messages (but it is many seconds AFTER the problem clears >>> up). >>> >>> OSD.128 >>> >>> 2015-09-20 08:25:32.331268 7fe92e7e0700 0 -- 10.217.72.12:6816/7060 >>> submit_message osd_op_reply(5553069 >>> rbd_data.3e30b0275d493f.000000000000d800 [set-alloc-hint object_size >>> 4194304 write_size 4194304,write 143360~4096] v57084'16557680 >>> uv16557680 ack = 0) v6 remote, 10.217.72.33:0/4005620, failed lossy >>> con, dropping message 0xa099180 >>> 2015-09-20 08:25:32.331401 7fe92e7e0700 0 -- 10.217.72.12:6816/7060 >>> submit_message osd_op_reply(5553069 >>> rbd_data.3e30b0275d493f.000000000000d800 [set-alloc-hint object_size >>> 4194304 write_size 4194304,write 143360~4096] v57084'16557680 >>> uv16557680 ondisk = 0) v6 remote, 10.217.72.33:0/4005620, failed lossy >>> con, dropping message 0x16217600 >>> >>> OSD.121 >>> >>> 2015-09-20 08:29:15.192055 7f51c07e2700 0 -- 10.217.72.12:6802/25568 >>> submit_message osd_op_reply(1483497 >>> rbd_data.3fc0847f32a8f4.0000000000003200 [set-alloc-hint object_size >>> 4194304 write_size 4194304,write 135168~4096] v57086'14949171 >>> uv14949171 ack = 0) v6 remote, 10.217.72.31:0/5002159, failed lossy >>> con, dropping message 0x1dd8a840 >>> 2015-09-20 08:29:15.192612 7f51c07e2700 0 -- 10.217.72.12:6802/25568 >>> submit_message osd_op_reply(1483497 >>> rbd_data.3fc0847f32a8f4.0000000000003200 [set-alloc-hint object_size >>> 4194304 write_size 4194304,write 135168~4096] v57086'14949171 >>> uv14949171 ondisk = 0) v6 remote, 10.217.72.31:0/5002159, failed lossy >>> con, dropping message 0x1476ec00 >>> >>> Our Ceph processes start with some real high limits: >>> root 1465 0.0 0.0 127700 3492 ? Ss Jun11 0:00 >>> /bin/bash -c ulimit -n 32768; /usr/bin/ceph-osd -i 126 --pid-file >>> /var/run/ceph/osd.126.pid -c /etc/ceph/ceph.conf --cluster ceph -f >>> root 1468 9.7 1.4 4849560 923536 ? Sl Jun11 14190:59 >>> /usr/bin/ceph-osd -i 126 --pid-file /var/run/ceph/osd.126.pid -c >>> /etc/ceph/ceph.conf --cluster ceph -f >>> >>> [root@ceph2 1468]# cat /etc/security/limits.d/90-local.conf >>> * - nofile 16384 >>> >>> We have 130 OSDs on 13 hosts. We also have ~50 KVM VM running on this storage. >>> >>> cluster 48de182b-5488-42bb-a6d2-62e8e47b435c >>> health HEALTH_WARN >>> 198 pgs backfill >>> 4 pgs backfilling >>> 169 pgs degraded >>> 150 pgs recovery_wait >>> 169 pgs stuck degraded >>> 352 pgs stuck unclean >>> 12 pgs stuck undersized >>> 12 pgs undersized >>> recovery 161065/41285858 objects degraded (0.390%) >>> recovery 2871014/41285858 objects misplaced (6.954%) >>> noscrub,nodeep-scrub flag(s) set >>> monmap e2: 3 mons at >>> {mon1=10.217.72.27:6789/0,mon2=10.217.72.28:6789/0,mon3=10.217.72.29:6789/0} >>> election epoch 180, quorum 0,1,2 mon1,mon2,mon3 >>> osdmap e57086: 130 osds: 130 up, 130 in; 270 remapped pgs >>> flags noscrub,nodeep-scrub >>> pgmap v10921036: 2308 pgs, 3 pools, 39046 GB data, 9735 kobjects >>> 151 TB used, 320 TB / 472 TB avail >>> 161065/41285858 objects degraded (0.390%) >>> 2871014/41285858 objects misplaced (6.954%) >>> 1956 active+clean >>> 183 active+remapped+wait_backfill >>> 82 active+recovery_wait+degraded >>> 67 active+recovery_wait+degraded+remapped >>> 9 active+undersized+degraded+remapped+wait_backfill >>> 6 active+degraded+remapped+wait_backfill >>> 2 active+undersized+degraded+remapped+backfilling >>> 2 active+degraded+remapped+backfilling >>> 1 active+recovery_wait+undersized+degraded+remapped >>> recovery io 25770 kB/s, 6 objects/s >>> client io 78274 kB/s rd, 119 MB/s wr, 5949 op/s >>> >>> Any ideas would be helpful. >>> -----BEGIN PGP SIGNATURE----- >>> Version: Mailvelope v1.1.0 >>> Comment: https://www.mailvelope.com >>> >>> wsFcBAEBCAAQBQJV/sqiCRDmVDuy+mK58QAAdp0QAMEU7JW88NDSzpIidGfT >>> EZpUkN3UqcOWeb/szCasHzjEs+IPPNGVEICs4KoTfQQobJaVIi3T6RmioOER >>> RrI0S4SO/LCtxyb/3iAfSKgFd9nEDlZKBmywQ1HgZZR0a2uMwzbSj3zfnOaq >>> tuLqVj10HGoO3KnjYjv4z+itpGrOOPew1Zjj4mgani+feaR5KxnopVKTNaut >>> ZivEkjEseHbtj6rxio8w4XKj8LzZCARUeiLB6c1d1/NOegcK4sPZvCXV7syx >>> eOeP9spr4YWrBnO0HNKqxNcKW2GfWbVGEIEmEsZYFoj8E3u2uUZ9Y9s41BtT >>> Sy67Edr8prfe2+9Jb29buXnj9k2bnzQ87SjtbGdym8/WgkHRmKeV8enGJ+X4 >>> IpaYj3HuLfGL724tigxt685uYOdqS8ahyeED/1tW6194lvarXgfYycXt1lj5 >>> TGan/hsQQUCDFYA76Gqvfzj//Umytg8Ub1B38X/H1XlMDinqFfcmW8/R7Wwe >>> PubXTM3zNS3j3Fl4/+MZS1T3qNlKEMk+jWRC5nYwE7e1aomABY0QbHHGxgPK >>> pFl9sm9cOKfWCRXQX4w7mMRspiMosW1X1WbmLe2cU17xtudc0rsEZycPOt3n >>> XCwin7/+yxKwp/MSWCk/vR/pY7Q/73Pi4kKRzXpFHLpJnteZ3moATTmuSSQm >>> 3Ggb >>> =SJmS >>> -----END PGP SIGNATURE----- >>> ---------------- >>> Robert LeBlanc >>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 >>> >>> >>> On Sat, Sep 19, 2015 at 1:03 AM, Robert LeBlanc wrote: >>>> -----BEGIN PGP SIGNED MESSAGE----- >>>> Hash: SHA256 >>>> >>>> We have had two situations where I/O just seems to be indefinitely >>>> blocked on our production cluster today (0.94.3). In the case this >>>> morning, it was just normal I/O traffic, no recovery or backfill. The >>>> case this evening, we were backfilling to some new OSDs. I would have >>>> loved to have bumped up the debugging to get an idea of what was going >>>> on, but time was exhausted. The incident this evening I was able to do >>>> some additional troubleshooting, but got real anxious after I/O had >>>> been blocked for 10 minutes and OPs was getting hot around the collar. >>>> >>>> Here are the important parts of the logs: >>>> [osd.30] >>>> 2015-09-18 23:05:36.188251 7efed0ef0700 0 log_channel(cluster) log >>>> [WRN] : slow request 30.662958 seconds old, >>>> received at 2015-09-18 23:05:05.525220: osd_op(client.3117179.0:18654441 >>>> rbd_data.1099d2f67aaea.0000000000000f62 [set-alloc-hint object_size >>>> 8388608 write_size 8388608,write 1048576~643072] 4.5ba1672c >>>> ack+ondisk+write+known_if_redirected e55919) >>>> currently waiting for subops from 32,70,72 >>>> >>>> [osd.72] >>>> 2015-09-18 23:05:19.302985 7f3fa19f8700 0 log_channel(cluster) log >>>> [WRN] : slow request 30.200408 seconds old, >>>> received at 2015-09-18 23:04:49.102519: osd_op(client.4267090.0:3510311 >>>> rbd_data.3f41d41bd65b28.0000000000009e2b [set-alloc-hint object_size >>>> 4194304 write_size 4194304,write 1048576~421888] 17.40adcada >>>> ack+ondisk+write+known_if_redirected e55919) >>>> currently waiting for subops from 2,30,90 >>>> >>>> The other OSDs listed (32,70,2,90) did not have any errors in the logs >>>> about blocked I/O. It seems that osd.30 was waiting for osd.72 and >>>> visa versa. I looked at top and iostat of these two hosts and the OSD >>>> processes and disk I/O were pretty idle. >>>> >>>> I know that this isn't a lot to go on. Our cluster is under very heavy >>>> load and we get several blocked I/Os every hour, but they usually >>>> clear up within 15 seconds. We seem to get I/O blocked when the op >>>> latency of the cluster goes above 1 (average from all OSDs as seen by >>>> Graphite). >>>> >>>> Has anyone seen this infinite blocked I/O? Bouncing osd.72 immediately >>>> cleared all the blocked I/O and then it was fine after rejoining the >>>> cluster. Increasing what logs and to what level would be most >>>> beneficial in this case for troubleshooting? >>>> >>>> I hope this makes sense, it has been a long day. >>>> >>>> - ---------------- >>>> Robert LeBlanc >>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 >>>> -----BEGIN PGP SIGNATURE----- >>>> Version: Mailvelope v1.1.0 >>>> Comment: https://www.mailvelope.com >>>> >>>> wsFcBAEBCAAQBQJV/QiuCRDmVDuy+mK58QAAfskP/A0+RRAtq49pwfJcmuaV >>>> LKMsdaOFu0WL1zNLgnj4KOTR1oYyEShXW3Xn0axw1C2U2qXkJQfvMyQ7PTj7 >>>> cKqNeZl7rcgwkgXlij1hPYs9tjsetjYXBmmui+CqbSyNNo95aPrtUnWPcYnc >>>> K7blP6wuv7p0ddaF8wgw3Jf0GhzlHyykvVlxLYjQWwBh1CTrSzNWcEiHz5NE >>>> 9Y/GU5VZn7o8jeJDh6tQGgSbUjdk4NM2WuhyWNEP1klV+x1P51krXYDR7cNC >>>> DSWaud1hNtqYdquVPzx0UCcUVR0JfVlEX26uxRLgNd0dDkq+CRXIGhakVU75 >>>> Yxf8jwVdbAg1CpGtgHx6bWyho2rrsTzxeul8AFLWtELfod0e5nLsSUfQuQ2c >>>> MXrIoyHUcs7ySP3ozazPOdxwBEpiovUZOBy1gl2sCSGvYsmYokHEO0eop2rl >>>> kVS4dSAvDezmDhWumH60Y661uzySBGtrMlV/u3nw8vfvLhEAbuE+lLybMmtY >>>> nJvJIzbTqFzxaeX4PTWcUhXRNaPp8PDS5obmx5Fpn+AYOeLet/S1Alz1qNM2 >>>> 4w34JKwKO92PtDYqzA6cj628fltdLkxFNoz7DFfqxr80DM7ndLukmSkPY+Oq >>>> qYOQMoownMnHuL0IrC9Jo8vK07H8agQyLF8/m4c3oTqnzZhh/rPRlPfyHEio >>>> Roj5 >>>> =ut4B >>>> -----END PGP SIGNATURE----- >> >> -----BEGIN PGP SIGNATURE----- >> Version: Mailvelope v1.1.0 >> Comment: https://www.mailvelope.com >> >> wsFcBAEBCAAQBQJV/1coCRDmVDuy+mK58QAAV14QAKzoivHyvrlUBNa3ltti >> 7xOQv6QIvRNuaYX1jwUb0JA2JwpcUoZFMukTtrXJmon/MFTbIelIO4tzijLc >> kXGqOcV+z0RpXydOR5Rdu+wdb59TxIjmE/mkUHZfdPVSJGQnTjnCJ7WkAwtz >> D0RREo7odc8C6AfSYRzJ0RPz9grNflKh4fTAkDGdkWVwRPDMHtB0D7FEviZs >> JQlBsBuSNT3OIRqvjQsc0Yl+Xodmr4uaK3yj6+Sx7fQpilS6I5nqRQx0HuRk >> K41425/hkAH0Es9PjR54fl6xi8xPf/zLDw8brMjmZQxu0d+sJWkOTDXmmqk8 >> l0s0UcC/W2RDxkiTudNPvx4PIy8bm5/l0NGFXiA9uaq42w6l45vARrCHWCc6 >> FQkKmQLlW89CzSyCDwYVZEc73F/cr97DFkBQctSXND2FmeLWQNOTsUo02C8n >> qTksSKjH8qI0aAj3vnf1GKxFq62dPejnnmOhwxKiTulksAzbIRmr9F2uPDBo >> +wDD5acAqVCOYCX7QwxoHSW2HR3AUzf6dLOnM1bSYOM5RF09ZAGw7sBmXYvf >> b1CwpIS1hK98Ww8+A+RkEFlyFWJPIO3KEK7z1K5Hjrp0dIdCLcpl3D66v/c8 >> SB6B+KNoNh5IP6GTFBu72sOaAyeSnnPrt5oZbJ8Kt7vJ+wg3fvNhh7at/B1l >> E/IO >> =AVQ7 >> -----END PGP SIGNATURE----- > _______________________________________________ > ceph-users mailing list > ceph-users@xxxxxxxxxxxxxx > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com _______________________________________________ ceph-users mailing list ceph-users@xxxxxxxxxxxxxx http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com