hung rbd requests for one pool

One guest VM on my test cluster has hung for more than 24 hours while running a fio test on an RBD device, but other VMs accessing other images in the same pool are fine. I was able to reproduce the problem by running “rbd info” on an image in the same pool as the stuck VM, with some debug tracing enabled (log below). How can I narrow this down further or resolve the problem?
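
For reference, debug/ceph.conf is just an ordinary client config with extra logging added; the settings that matter for the trace below are roughly these (a sketch, not the literal file):

[global]
        # usual mon host / keyring entries omitted
        debug ms = 1      # produces the level-1 messenger lines ("... --> ... -- osd_op(...)")
        debug rbd = 20    # produces the librbd::ImageState / OpenRequest lines at levels 5-20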

Here are a few details about the cluster:

ceph version 10.2.7
Three monitors and six OSD nodes with three OSDs each
Each OSD has one SSD with separate partitions for the journal and data, using XFS
Clients are KVM guests using rbd devices with virtio
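
As far as the guest config goes, each rbd disk boils down to a qemu option roughly like this (a sketch only; app/image1 from the trace below as an example, cephx id/keyring and cache options omitted):

-drive format=raw,if=virtio,file=rbd:app/image1:conf=/etc/ceph/ceph.conf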

Cluster is healthy:
ceph7:~$ sudo ceph status
    cluster 876a19e2-7f61-4774-a6b3-eaab4004f45f
     health HEALTH_OK
     monmap e1: 3 mons at {a=192.168.206.10:6789/0,b=192.168.206.11:6789/0,c=192.168.206.12:6789/0}
            election epoch 6, quorum 0,1,2 a,b,c
     osdmap e27: 18 osds: 18 up, 18 in
            flags sortbitwise,require_jewel_osds
      pgmap v240894: 576 pgs, 2 pools, 416 GB data, 104 kobjects
            1248 GB used, 2606 GB / 3854 GB avail
                 576 active+clean
  client io 2548 kB/s rd, 2632 kB/s wr, 493 op/s rd, 1121 op/s wr

Log output from “rbd info” on the client node (not in a VM):
ceph7:~$ sudo rbd -c debug/ceph.conf info app/image1
2017-04-24 11:30:57.048750 7f55365c5d40  1 -- :/0 messenger.start
2017-04-24 11:30:57.049223 7f55365c5d40  1 -- :/3282647735 --> 192.168.206.11:6789/0 -- auth(proto 0 30 bytes epoch 0) v1 -- ?+0 0x55c254e1ccc0 con 0x55c254e17850
2017-04-24 11:30:57.050077 7f55365bd700  1 -- 192.168.206.17:0/3282647735 learned my addr 192.168.206.17:0/3282647735
2017-04-24 11:30:57.051040 7f551a627700  1 -- 192.168.206.17:0/3282647735 <== mon.1 192.168.206.11:6789/0 1 ==== mon_map magic: 0 v1 ==== 473+0+0 (2270207254 0 0) 0x7f5500000b80 con 0x55c254e17850
2017-04-24 11:30:57.051148 7f551a627700  1 -- 192.168.206.17:0/3282647735 <== mon.1 192.168.206.11:6789/0 2 ==== auth_reply(proto 2 0 (0) Success) v1 ==== 33+0+0 (2714966539 0 0) 0x7f5500001040 con 0x55c254e17850
2017-04-24 11:30:57.051328 7f551a627700  1 -- 192.168.206.17:0/3282647735 --> 192.168.206.11:6789/0 -- auth(proto 2 32 bytes epoch 0) v1 -- ?+0 0x7f5504001860 con 0x55c254e17850
2017-04-24 11:30:57.052239 7f551a627700  1 -- 192.168.206.17:0/3282647735 <== mon.1 192.168.206.11:6789/0 3 ==== auth_reply(proto 2 0 (0) Success) v1 ==== 206+0+0 (3323982069 0 0) 0x7f5500001040 con 0x55c254e17850
2017-04-24 11:30:57.052399 7f551a627700  1 -- 192.168.206.17:0/3282647735 --> 192.168.206.11:6789/0 -- auth(proto 2 165 bytes epoch 0) v1 -- ?+0 0x7f5504003370 con 0x55c254e17850
2017-04-24 11:30:57.053313 7f551a627700  1 -- 192.168.206.17:0/3282647735 <== mon.1 192.168.206.11:6789/0 4 ==== auth_reply(proto 2 0 (0) Success) v1 ==== 393+0+0 (1107778031 0 0) 0x7f55000008c0 con 0x55c254e17850
2017-04-24 11:30:57.053415 7f551a627700  1 -- 192.168.206.17:0/3282647735 --> 192.168.206.11:6789/0 -- mon_subscribe({monmap=0+}) v2 -- ?+0 0x55c254e1d290 con 0x55c254e17850
2017-04-24 11:30:57.053477 7f55365c5d40  1 -- 192.168.206.17:0/3282647735 --> 192.168.206.11:6789/0 -- mon_subscribe({osdmap=0}) v2 -- ?+0 0x55c254e12df0 con 0x55c254e17850
2017-04-24 11:30:57.053851 7f551a627700  1 -- 192.168.206.17:0/3282647735 <== mon.1 192.168.206.11:6789/0 5 ==== mon_map magic: 0 v1 ==== 473+0+0 (2270207254 0 0) 0x7f5500001360 con 0x55c254e17850
2017-04-24 11:30:57.054058 7f551a627700  1 -- 192.168.206.17:0/3282647735 <== mon.1 192.168.206.11:6789/0 6 ==== osd_map(27..27 src has 1..27) v3 ==== 13035+0+0 (2602332718 0 0) 0x7f5500000cc0 con 0x55c254e17850
2017-04-24 11:30:57.054376 7f55365c5d40  5 librbd::AioImageRequestWQ: 0x55c254e21c10 : ictx=0x55c254e20760
2017-04-24 11:30:57.054498 7f55365c5d40 20 librbd::ImageState: 0x55c254e19330 open
2017-04-24 11:30:57.054503 7f55365c5d40 10 librbd::ImageState: 0x55c254e19330 0x55c254e19330 send_open_unlock
2017-04-24 11:30:57.054512 7f55365c5d40 10 librbd::image::OpenRequest: 0x55c254e22590 send_v2_detect_header
2017-04-24 11:30:57.054632 7f55365c5d40  1 -- 192.168.206.17:0/3282647735 --> 192.168.206.13:6802/22690 -- osd_op(client.4375.0:1 1.ba46737 rbd_id.image1 [stat] snapc 0=[] ack+read+known_if_redirected e27) v7 -- ?+0 0x55c254e25d00 con 0x55c254e248d0
2017-04-24 11:30:57.056830 7f5518421700  1 -- 192.168.206.17:0/3282647735 <== osd.10 192.168.206.13:6802/22690 1 ==== osd_op_reply(1 rbd_id.image1 [stat] v0'0 uv7 _ondisk_ = 0) v7 ==== 133+0+16 (2025423138 0 1760854024) 0x7f54f0000b40 con 0x55c254e248d0
2017-04-24 11:30:57.056949 7f5512ffd700 10 librbd::image::OpenRequest: handle_v2_detect_header: r=0
2017-04-24 11:30:57.056965 7f5512ffd700 10 librbd::image::OpenRequest: 0x55c254e22590 send_v2_get_id
2017-04-24 11:30:57.057026 7f5512ffd700  1 -- 192.168.206.17:0/3282647735 --> 192.168.206.13:6802/22690 -- osd_op(client.4375.0:2 1.ba46737 rbd_id.image1 [call rbd.get_id] snapc 0=[] ack+read+known_if_redirected e27) v7 -- ?+0 0x7f54f40021f0 con 0x55c254e248d0
2017-04-24 11:30:57.058023 7f5518421700  1 -- 192.168.206.17:0/3282647735 <== osd.10 192.168.206.13:6802/22690 2 ==== osd_op_reply(2 rbd_id.image1 [call] v0'0 uv7 _ondisk_ = 0) v7 ==== 133+0+16 (1719037427 0 128656606) 0x7f54f0000b40 con 0x55c254e248d0
2017-04-24 11:30:57.058069 7f5512ffd700 10 librbd::image::OpenRequest: handle_v2_get_id: r=0
2017-04-24 11:30:57.058081 7f5512ffd700 10 librbd::image::OpenRequest: 0x55c254e22590 send_v2_get_immutable_metadata
2017-04-24 11:30:57.058233 7f5512ffd700  1 -- 192.168.206.17:0/3282647735 --> 192.168.206.13:6804/22934 -- osd_op(client.4375.0:3 1.af6f1e38 rbd_header.1058238e1f29 [call rbd.get_size,call rbd.get_object_prefix] snapc 0=[] ack+read+known_if_redirected e27) v7 -- ?+0 0x7f54f40077f0 con 0x7f54f40064e0
2017-04-24 11:31:12.053858 7f551ae28700  1 -- 192.168.206.17:0/3282647735 --> 192.168.206.11:6789/0 -- mon_subscribe({osdmap=28}) v2 -- ?+0 0x7f54ec000a50 con 0x55c254e17850
2017-04-24 11:31:12.053890 7f551ae28700  1 -- 192.168.206.17:0/3282647735 --> 192.168.206.13:6804/22934 -- ping magic: 0 v1 -- ?+0 0x7f54ec000fc0 con 0x7f54f40064e0
2017-04-24 11:31:17.054035 7f551ae28700  1 -- 192.168.206.17:0/3282647735 --> 192.168.206.13:6804/22934 -- ping magic: 0 v1 -- ?+0 0x7f54ec000a80 con 0x7f54f40064e0
2017-04-24 11:31:22.054211 7f551ae28700  1 -- 192.168.206.17:0/3282647735 --> 192.168.206.13:6804/22934 -- ping magic: 0 v1 -- ?+0 0x7f54ec0014e0 con 0x7f54f40064e0
2017-04-24 11:31:27.054381 7f551ae28700  1 -- 192.168.206.17:0/3282647735 --> 192.168.206.13:6804/22934 -- ping magic: 0 v1 -- ?+0 0x7f54ec001860 con 0x7f54f40064e0
2017-04-24 11:31:32.054558 7f551ae28700  1 -- 192.168.206.17:0/3282647735 --> 192.168.206.13:6804/22934 -- ping magic: 0 v1 -- ?+0 0x7f54ec001be0 con 0x7f54f40064e0
2017-04-24 11:31:37.054723 7f551ae28700  1 -- 192.168.206.17:0/3282647735 --> 192.168.206.13:6804/22934 -- ping magic: 0 v1 -- ?+0 0x7f54ec001f50 con 0x7f54f40064e0
2017-04-24 11:31:42.054895 7f551ae28700  1 -- 192.168.206.17:0/3282647735 --> 192.168.206.13:6804/22934 -- ping magic: 0 v1 -- ?+0 0x7f54ec0022c0 con 0x7f54f40064e0

The last message repeats indefinitely.
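
For reference, the op that never gets a reply is client.4375.0:3, the rbd_header.1058238e1f29 request sent to 192.168.206.13:6804. I assume the next step is something along these lines on that OSD node (the osd id being whichever daemon owns port 6804 there), but I'm not sure what to look for:

sudo ceph osd map app rbd_header.1058238e1f29    # PG and acting set for the header object
sudo ceph daemon osd.<id> dump_ops_in_flight     # run locally on the node at 192.168.206.13
sudo ceph daemon osd.<id> dump_historic_ops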

Thanks,
Phil
