One guest VM on my test cluster has hung for more than 24 hours while running a fio test on an RBD device, but other VMs accessing other images in the same pool are fine. I was able to reproduce the problem by running “rbd info” on the same pool as the stuck VM with some debug tracing on (see log below). How can I narrow this down further or resolve the problem?

Here are a few details about the cluster:

- ceph version 10.2.7
- Three monitors and six OSD nodes with three OSDs each
- Each OSD has one SSD with separate partitions for the journal and data, using XFS
- Clients are KVM guests using rbd devices with virtio
- Cluster is healthy:

ceph7:~$ sudo ceph status
    cluster 876a19e2-7f61-4774-a6b3-eaab4004f45f
     health HEALTH_OK
     monmap e1: 3 mons at {a=192.168.206.10:6789/0,b=192.168.206.11:6789/0,c=192.168.206.12:6789/0}
            election epoch 6, quorum 0,1,2 a,b,c
     osdmap e27: 18 osds: 18 up, 18 in
            flags sortbitwise,require_jewel_osds
      pgmap v240894: 576 pgs, 2 pools, 416 GB data, 104 kobjects
            1248 GB used, 2606 GB / 3854 GB avail
                 576 active+clean
  client io 2548 kB/s rd, 2632 kB/s wr, 493 op/s rd, 1121 op/s wr

Log output from “rbd info” on the client node (not in a VM):

ceph7:~$ sudo rbd -c debug/ceph.conf info app/image1
2017-04-24 11:30:57.048750 7f55365c5d40 1 -- :/0 messenger.start
2017-04-24 11:30:57.049223 7f55365c5d40 1 -- :/3282647735 --> 192.168.206.11:6789/0 -- auth(proto 0 30 bytes epoch 0) v1 -- ?+0 0x55c254e1ccc0 con 0x55c254e17850
2017-04-24 11:30:57.050077 7f55365bd700 1 -- 192.168.206.17:0/3282647735 learned my addr 192.168.206.17:0/3282647735
2017-04-24 11:30:57.051040 7f551a627700 1 -- 192.168.206.17:0/3282647735 <== mon.1 192.168.206.11:6789/0 1 ==== mon_map magic: 0 v1 ==== 473+0+0 (2270207254 0 0) 0x7f5500000b80 con 0x55c254e17850
2017-04-24 11:30:57.051148 7f551a627700 1 -- 192.168.206.17:0/3282647735 <== mon.1 192.168.206.11:6789/0 2 ==== auth_reply(proto 2 0 (0) Success) v1 ==== 33+0+0 (2714966539 0 0) 0x7f5500001040 con 0x55c254e17850
2017-04-24 11:30:57.051328 7f551a627700 1 -- 192.168.206.17:0/3282647735 --> 192.168.206.11:6789/0 -- auth(proto 2 32 bytes epoch 0) v1 -- ?+0 0x7f5504001860 con 0x55c254e17850
2017-04-24 11:30:57.052239 7f551a627700 1 -- 192.168.206.17:0/3282647735 <== mon.1 192.168.206.11:6789/0 3 ==== auth_reply(proto 2 0 (0) Success) v1 ==== 206+0+0 (3323982069 0 0) 0x7f5500001040 con 0x55c254e17850
2017-04-24 11:30:57.052399 7f551a627700 1 -- 192.168.206.17:0/3282647735 --> 192.168.206.11:6789/0 -- auth(proto 2 165 bytes epoch 0) v1 -- ?+0 0x7f5504003370 con 0x55c254e17850
2017-04-24 11:30:57.053313 7f551a627700 1 -- 192.168.206.17:0/3282647735 <== mon.1 192.168.206.11:6789/0 4 ==== auth_reply(proto 2 0 (0) Success) v1 ==== 393+0+0 (1107778031 0 0) 0x7f55000008c0 con 0x55c254e17850
2017-04-24 11:30:57.053415 7f551a627700 1 -- 192.168.206.17:0/3282647735 --> 192.168.206.11:6789/0 -- mon_subscribe({monmap=0+}) v2 -- ?+0 0x55c254e1d290 con 0x55c254e17850
2017-04-24 11:30:57.053477 7f55365c5d40 1 -- 192.168.206.17:0/3282647735 --> 192.168.206.11:6789/0 -- mon_subscribe({osdmap=0}) v2 -- ?+0 0x55c254e12df0 con 0x55c254e17850
2017-04-24 11:30:57.053851 7f551a627700 1 -- 192.168.206.17:0/3282647735 <== mon.1 192.168.206.11:6789/0 5 ==== mon_map magic: 0 v1 ==== 473+0+0 (2270207254 0 0) 0x7f5500001360 con 0x55c254e17850
2017-04-24 11:30:57.054058 7f551a627700 1 -- 192.168.206.17:0/3282647735 <== mon.1 192.168.206.11:6789/0 6 ==== osd_map(27..27 src has 1..27) v3 ==== 13035+0+0 (2602332718 0 0) 0x7f5500000cc0 con 0x55c254e17850
2017-04-24 11:30:57.054376 7f55365c5d40 5 librbd::AioImageRequestWQ: 0x55c254e21c10 : ictx=0x55c254e20760
2017-04-24 11:30:57.054498 7f55365c5d40 20 librbd::ImageState: 0x55c254e19330 open
2017-04-24 11:30:57.054503 7f55365c5d40 10 librbd::ImageState: 0x55c254e19330 0x55c254e19330 send_open_unlock
2017-04-24 11:30:57.054512 7f55365c5d40 10 librbd::image::OpenRequest: 0x55c254e22590 send_v2_detect_header
2017-04-24 11:30:57.054632 7f55365c5d40 1 -- 192.168.206.17:0/3282647735 --> 192.168.206.13:6802/22690 -- osd_op(client.4375.0:1 1.ba46737 rbd_id.image1 [stat] snapc 0=[] ack+read+known_if_redirected e27) v7 -- ?+0 0x55c254e25d00 con 0x55c254e248d0
2017-04-24 11:30:57.056830 7f5518421700 1 -- 192.168.206.17:0/3282647735 <== osd.10 192.168.206.13:6802/22690 1 ==== osd_op_reply(1 rbd_id.image1 [stat] v0'0 uv7 _ondisk_ = 0) v7 ==== 133+0+16 (2025423138 0 1760854024) 0x7f54f0000b40 con 0x55c254e248d0
2017-04-24 11:30:57.056949 7f5512ffd700 10 librbd::image::OpenRequest: handle_v2_detect_header: r=0
2017-04-24 11:30:57.056965 7f5512ffd700 10 librbd::image::OpenRequest: 0x55c254e22590 send_v2_get_id
2017-04-24 11:30:57.057026 7f5512ffd700 1 -- 192.168.206.17:0/3282647735 --> 192.168.206.13:6802/22690 -- osd_op(client.4375.0:2 1.ba46737 rbd_id.image1 [call rbd.get_id] snapc 0=[] ack+read+known_if_redirected e27) v7 -- ?+0 0x7f54f40021f0 con 0x55c254e248d0
2017-04-24 11:30:57.058023 7f5518421700 1 -- 192.168.206.17:0/3282647735 <== osd.10 192.168.206.13:6802/22690 2 ==== osd_op_reply(2 rbd_id.image1 [call] v0'0 uv7 _ondisk_ = 0) v7 ==== 133+0+16 (1719037427 0 128656606) 0x7f54f0000b40 con 0x55c254e248d0
2017-04-24 11:30:57.058069 7f5512ffd700 10 librbd::image::OpenRequest: handle_v2_get_id: r=0
2017-04-24 11:30:57.058081 7f5512ffd700 10 librbd::image::OpenRequest: 0x55c254e22590 send_v2_get_immutable_metadata
2017-04-24 11:30:57.058233 7f5512ffd700 1 -- 192.168.206.17:0/3282647735 --> 192.168.206.13:6804/22934 -- osd_op(client.4375.0:3 1.af6f1e38 rbd_header.1058238e1f29 [call rbd.get_size,call rbd.get_object_prefix] snapc 0=[] ack+read+known_if_redirected e27) v7 -- ?+0 0x7f54f40077f0 con 0x7f54f40064e0
2017-04-24 11:31:12.053858 7f551ae28700 1 -- 192.168.206.17:0/3282647735 --> 192.168.206.11:6789/0 -- mon_subscribe({osdmap=28}) v2 -- ?+0 0x7f54ec000a50 con 0x55c254e17850
2017-04-24 11:31:12.053890 7f551ae28700 1 -- 192.168.206.17:0/3282647735 --> 192.168.206.13:6804/22934 -- ping magic: 0 v1 -- ?+0 0x7f54ec000fc0 con 0x7f54f40064e0
2017-04-24 11:31:17.054035 7f551ae28700 1 -- 192.168.206.17:0/3282647735 --> 192.168.206.13:6804/22934 -- ping magic: 0 v1 -- ?+0 0x7f54ec000a80 con 0x7f54f40064e0
2017-04-24 11:31:22.054211 7f551ae28700 1 -- 192.168.206.17:0/3282647735 --> 192.168.206.13:6804/22934 -- ping magic: 0 v1 -- ?+0 0x7f54ec0014e0 con 0x7f54f40064e0
2017-04-24 11:31:27.054381 7f551ae28700 1 -- 192.168.206.17:0/3282647735 --> 192.168.206.13:6804/22934 -- ping magic: 0 v1 -- ?+0 0x7f54ec001860 con 0x7f54f40064e0
2017-04-24 11:31:32.054558 7f551ae28700 1 -- 192.168.206.17:0/3282647735 --> 192.168.206.13:6804/22934 -- ping magic: 0 v1 -- ?+0 0x7f54ec001be0 con 0x7f54f40064e0
2017-04-24 11:31:37.054723 7f551ae28700 1 -- 192.168.206.17:0/3282647735 --> 192.168.206.13:6804/22934 -- ping magic: 0 v1 -- ?+0 0x7f54ec001f50 con 0x7f54f40064e0
2017-04-24 11:31:42.054895 7f551ae28700 1 -- 192.168.206.17:0/3282647735 --> 192.168.206.13:6804/22934 -- ping magic: 0 v1 -- ?+0 0x7f54ec0022c0 con 0x7f54f40064e0

The last message repeats indefinitely.

Thanks,
Phil
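(Note: the debug/ceph.conf passed with -c is not reproduced in the post. A minimal client-side configuration along the following lines would produce the messenger and librbd tracing seen in the log; the section name, debug levels, and stderr setting below are assumptions for illustration, not the contents of the actual file.)

[client]
    # assumed levels: messenger lines at level 1, librbd lines up to level 20
    debug ms = 1
    debug rbd = 20
    # assumed so the trace prints to the terminal rather than a log file
    log to stderr = true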