Re: Potential OSD deadlock?

We had another incident of about 100 long-blocked I/O requests this
morning, but I didn't get to it in time; it wound up clearing itself
after almost 1,000 seconds. One interesting note is that the blocked
I/O count kept creeping up until I saw a bunch of entries in the log
like:

2015-09-20 08:20:01.870141 7f5fbe05c700  0 -- 10.217.72.12:6812/1468
>> 10.217.72.35:0/4027675 pipe(0x2f7a3000 sd=363 :6812 s=0 pgs=0 cs=0
l=1 c=0x1bd40840).accept replacing existing (lossy) channel (new one
lossy=1)
2015-09-20 08:20:02.061539 7f5f43de3700  0 -- 10.217.72.12:6812/1468
>> 10.217.72.33:0/2012691 pipe(0x28857000 sd=408 :6812 s=0 pgs=0 cs=0
l=1 c=0x1bd43020).accept replacing existing (lossy) channel (new one
lossy=1)
2015-09-20 08:20:02.817884 7f5fb2ca9700  0 -- 10.217.72.12:6812/1468
>> 10.217.72.33:0/2040360 pipe(0x283ff000 sd=605 :6812 s=0 pgs=0 cs=0
l=1 c=0x1bd402c0).accept replacing existing (lossy) channel (new one
lossy=1)

After almost 100 of these in about a two-second period, the I/O starts
draining over the next 3-4 seconds. What does this message mean? 1468
appears to be the PID of one of the OSD processes, but the number in
the same position in the destination address is not a PID. What would
cause a channel to be dropped and recreated?
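
In case it is useful, something like this should show which peers keep
re-establishing connections (field positions assume the stock log line
format, and the path assumes the default log naming):

  # count "replacing existing (lossy) channel" events per peer address
  grep 'accept replacing existing (lossy) channel' /var/log/ceph/ceph-osd.126.log \
    | awk '{print $8}' | sort | uniq -c | sort -rn | head
  # $8 is the peer address (ip:port/nonce) on these lines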

There are 10 OSDs on this host, but only two other OSDs show anything
in their logs, and each has only two messages (and those come many
seconds AFTER the problem clears up).

OSD.128

2015-09-20 08:25:32.331268 7fe92e7e0700  0 -- 10.217.72.12:6816/7060
submit_message osd_op_reply(5553069
rbd_data.3e30b0275d493f.000000000000d800 [set-alloc-hint object_size
4194304 write_size 4194304,write 143360~4096] v57084'16557680
uv16557680 ack = 0) v6 remote, 10.217.72.33:0/4005620, failed lossy
con, dropping message 0xa099180
2015-09-20 08:25:32.331401 7fe92e7e0700  0 -- 10.217.72.12:6816/7060
submit_message osd_op_reply(5553069
rbd_data.3e30b0275d493f.000000000000d800 [set-alloc-hint object_size
4194304 write_size 4194304,write 143360~4096] v57084'16557680
uv16557680 ondisk = 0) v6 remote, 10.217.72.33:0/4005620, failed lossy
con, dropping message 0x16217600

OSD.121

2015-09-20 08:29:15.192055 7f51c07e2700  0 -- 10.217.72.12:6802/25568
submit_message osd_op_reply(1483497
rbd_data.3fc0847f32a8f4.0000000000003200 [set-alloc-hint object_size
4194304 write_size 4194304,write 135168~4096] v57086'14949171
uv14949171 ack = 0) v6 remote, 10.217.72.31:0/5002159, failed lossy
con, dropping message 0x1dd8a840
2015-09-20 08:29:15.192612 7f51c07e2700  0 -- 10.217.72.12:6802/25568
submit_message osd_op_reply(1483497
rbd_data.3fc0847f32a8f4.0000000000003200 [set-alloc-hint object_size
4194304 write_size 4194304,write 135168~4096] v57086'14949171
uv14949171 ondisk = 0) v6 remote, 10.217.72.31:0/5002159, failed lossy
con, dropping message 0x1476ec00
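
For what it is worth, a quick way to check whether the dropped replies
all point back at the same few clients would be something like this
(again assuming the default log naming under /var/log/ceph):

  grep -h 'failed lossy con, dropping message' /var/log/ceph/ceph-osd.*.log \
    | grep -oE 'remote, [0-9.]+:[0-9]+/[0-9]+' | awk '{print $2}' \
    | sort | uniq -c | sort -rn
  # prints how many osd_op_reply messages were dropped per client address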

Our Ceph processes are started with a pretty high open-file limit:
root      1465  0.0  0.0 127700  3492 ?        Ss   Jun11   0:00
/bin/bash -c ulimit -n 32768; /usr/bin/ceph-osd -i 126 --pid-file
/var/run/ceph/osd.126.pid -c /etc/ceph/ceph.conf --cluster ceph -f
root      1468  9.7  1.4 4849560 923536 ?      Sl   Jun11 14190:59
/usr/bin/ceph-osd -i 126 --pid-file /var/run/ceph/osd.126.pid -c
/etc/ceph/ceph.conf --cluster ceph -f

[root@ceph2 1468]# cat /etc/security/limits.d/90-local.conf
*       -       nofile  16384
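
To double-check that the running daemon actually picked up the 32768
from the wrapper rather than the 16384 from limits.d, something like
this against /proc should confirm it (1468 is the osd.126 PID from the
ps output above):

  # effective open-file limit for the running OSD
  grep 'open files' /proc/1468/limits
  # number of file descriptors it currently has open
  ls /proc/1468/fd | wc -l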

We have 130 OSDs on 13 hosts. We also have ~50 KVM VMs running on this storage.

    cluster 48de182b-5488-42bb-a6d2-62e8e47b435c
     health HEALTH_WARN
            198 pgs backfill
            4 pgs backfilling
            169 pgs degraded
            150 pgs recovery_wait
            169 pgs stuck degraded
            352 pgs stuck unclean
            12 pgs stuck undersized
            12 pgs undersized
            recovery 161065/41285858 objects degraded (0.390%)
            recovery 2871014/41285858 objects misplaced (6.954%)
            noscrub,nodeep-scrub flag(s) set
     monmap e2: 3 mons at
{mon1=10.217.72.27:6789/0,mon2=10.217.72.28:6789/0,mon3=10.217.72.29:6789/0}
            election epoch 180, quorum 0,1,2 mon1,mon2,mon3
     osdmap e57086: 130 osds: 130 up, 130 in; 270 remapped pgs
            flags noscrub,nodeep-scrub
      pgmap v10921036: 2308 pgs, 3 pools, 39046 GB data, 9735 kobjects
            151 TB used, 320 TB / 472 TB avail
            161065/41285858 objects degraded (0.390%)
            2871014/41285858 objects misplaced (6.954%)
                1956 active+clean
                 183 active+remapped+wait_backfill
                  82 active+recovery_wait+degraded
                  67 active+recovery_wait+degraded+remapped
                   9 active+undersized+degraded+remapped+wait_backfill
                   6 active+degraded+remapped+wait_backfill
                   2 active+undersized+degraded+remapped+backfilling
                   2 active+degraded+remapped+backfilling
                   1 active+recovery_wait+undersized+degraded+remapped
recovery io 25770 kB/s, 6 objects/s
  client io 78274 kB/s rd, 119 MB/s wr, 5949 op/s

Any ideas would be helpful.
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Sat, Sep 19, 2015 at 1:03 AM, Robert LeBlanc <robert@xxxxxxxxxxxxx> wrote:
>
> We have had two situations today where I/O just seems to be
> indefinitely blocked on our production cluster (0.94.3). In the case
> this morning, it was just normal I/O traffic, no recovery or backfill.
> In the case this evening, we were backfilling to some new OSDs. I
> would have loved to bump up the debugging to get an idea of what was
> going on, but time ran out. During the incident this evening I was
> able to do some additional troubleshooting, but I got really anxious
> after I/O had been blocked for 10 minutes and OPs was getting hot
> under the collar.
>
> Here are the important parts of the logs:
> [osd.30]
> 2015-09-18 23:05:36.188251 7efed0ef0700  0 log_channel(cluster) log
> [WRN] : slow request 30.662958 seconds old,
>  received at 2015-09-18 23:05:05.525220: osd_op(client.3117179.0:18654441
>  rbd_data.1099d2f67aaea.0000000000000f62 [set-alloc-hint object_size
> 8388608 write_size 8388608,write 1048576~643072] 4.5ba1672c
> ack+ondisk+write+known_if_redirected e55919)
>  currently waiting for subops from 32,70,72
>
> [osd.72]
> 2015-09-18 23:05:19.302985 7f3fa19f8700  0 log_channel(cluster) log
> [WRN] : slow request 30.200408 seconds old,
>  received at 2015-09-18 23:04:49.102519: osd_op(client.4267090.0:3510311
>  rbd_data.3f41d41bd65b28.0000000000009e2b [set-alloc-hint object_size
> 4194304 write_size 4194304,write 1048576~421888] 17.40adcada
> ack+ondisk+write+known_if_redirected e55919)
>  currently waiting for subops from 2,30,90
>
> The other OSDs listed (32,70,2,90) did not have any errors in their
> logs about blocked I/O. It seems that osd.30 was waiting for osd.72
> and vice versa. I looked at top and iostat on these two hosts, and the
> OSD processes and disk I/O were pretty idle.
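>
> In case it helps, this is the sort of thing I am planning to pull from
> the admin socket next time before bouncing anything, to see what a
> stuck op is actually waiting on (daemon name/socket path assume the
> default layout):
>
>   # ops osd.30 currently has in flight, with their age and current state
>   ceph daemon osd.30 dump_ops_in_flight
>   # the most recent slow ops, with per-event timestamps
>   ceph daemon osd.30 dump_historic_ops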
>
> I know that this isn't a lot to go on. Our cluster is under very heavy
> load and we get several blocked I/Os every hour, but they usually
> clear up within 15 seconds. We seem to get I/O blocked when the op
> latency of the cluster goes above 1 (average from all OSDs as seen by
> Graphite).
>
> Has anyone seen this kind of indefinitely blocked I/O? Bouncing osd.72
> immediately cleared all the blocked I/O, and it was fine after
> rejoining the cluster. Which logs, and at what level, would be the
> most beneficial to increase for troubleshooting a case like this?
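>
> Unless someone has a better idea, my current plan is to crank up the
> OSD and messenger logging on the involved OSDs the next time this
> happens, along the lines of (the exact levels are just a guess on my
> part):
>
>   ceph tell osd.72 injectargs '--debug_osd 20 --debug_ms 1'
>   # and back down to the defaults once the incident is captured
>   ceph tell osd.72 injectargs '--debug_osd 0/5 --debug_ms 0/5'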
>
> I hope this makes sense; it has been a long day.
>
> ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1