I had already nuked the previous hang, but we have another one. osdc output:
70385 osd853 5.fb666328 rbd_data.36f804a163d632a.00000000000370ff read
11024940 osd1072 5.f438406c rbd_id.volume-44c74bb5-14f8-4279-b44f-8e867248531b call
11241684 osd872 5.175f624d rbd_id.volume-3e068bc7-75eb-4504-b109-df851a787f89 call
11689088 osd685 5.1fc9acd5 rbd_header.36f804a163d632a 442390'5605610926112768 watch
# ceph osd map rbd rbd_data.36f804a163d632a.00000000000370ff
osdmap e1309560 pool 'rbd' (5) object 'rbd_data.36f804a163d632a.00000000000370ff' -> pg 5.fb666328 (5.6328) -> up ([853,247,265], p853) acting ([853,247,265], p853)
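For reference, the (5.6328) in that output is just the raw placement hash folded into the pool's pg_num. A quick shell check, assuming pool 5 still has pg_num 32768 (a power of two), as shown in the osdmap dump further down:

# Hedged check: fold the raw hash 0xfb666328 into the PG id. For a
# power-of-two pg_num, Ceph's stable mod reduces to a simple bitmask.
printf '5.%x\n' $(( 0xfb666328 & (32768 - 1) ))    # prints 5.6328

which matches the PG 5.6328 whose acting set is [853,247,265] above.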
As for the output of ops, objecter_requests, dump_ops_in_flight, and dump_historic_ops on osd.853: I see a single request in the output of the ops command:
{
    "ops": [
        {
            "description": "osd_op(client.58016244.1:70385 rbd_data.36f804a163d632a.00000000000370ff [read 7274496~4096] 5.fb666328 RETRY=1 retry+read e1309006)",
            "initiated_at": "2015-12-17 10:03:35.360401",
            "age": 0.000503,
            "duration": 0.000233,
            "type_data": [
                "reached pg",
                {
                    "client": "client.58016244",
                    "tid": 70385
                },
                [
                    {
                        "time": "2015-12-17 10:03:35.360401",
                        "event": "initiated"
                    },
                    {
                        "time": "2015-12-17 10:03:35.360635",
                        "event": "reached_pg"
                    }
                ]
            ]
        }
    ],
    "num_ops": 1
}
The other commands either return nothing (dump_ops_in_flight, objecter_requests) or, in the case of dump_historic_ops, return 20 ops (that's what it's set to keep), but none of them are this request or reference this object. It seems this read is just retrying forever?
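One way to double-check that from both ends is to keep polling the kernel client's osdc file for this tid/object while grepping the acting set's op dumps for the same request. A rough sketch, assuming the usual debugfs layout and that each OSD's admin socket is queried on its own host (adjust the glob and ids as needed):

# Client side: see whether tid 70385 ever leaves the osdc list, or keeps
# reappearing. The glob assumes a single kernel client under debugfs.
while true; do
    date
    grep 36f804a163d632a.00000000000370ff /sys/kernel/debug/ceph/*/osdc
    sleep 10
done

# OSD side, run on the host carrying each OSD of the acting set
# [853,247,265]: look for the same client request in the op dumps.
for cmd in ops dump_ops_in_flight dump_historic_ops; do
    ceph daemon osd.853 $cmd | grep -B1 -A3 'client.58016244.1:70385'
done

If the entry in osdc never goes away while the OSD only ever sees it briefly with a bumped RETRY count, that would support the retrying-forever theory.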
On Sat, Dec 12, 2015 at 12:10 PM, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
On Sat, Dec 12, 2015 at 6:37 PM, Tom Christensen <pavera@xxxxxxxxx> wrote:
> We had a kernel map get hung up again last night/this morning. The rbd is
> mapped but unresponsive; if I try to unmap it I get the following error:
> rbd: sysfs write failed
> rbd: unmap failed: (16) Device or resource busy
>
> Now that this has happened, attempting to map another RBD fails, and lsblk
> fails as well; both of these tasks just hang forever.
>
> We have 1480 OSDs in the cluster, so posting the osdmap seems excessive;
> however, here is the beginning (it didn't change in 5 runs):
> root@wrk-slc-01-02:~# cat /sys/kernel/debug/ceph/f3b7f409-e061-4e39-b4d0-ae380e29ae7e.client55440310/osdmap
> epoch 1284256
> flags
> pool 0 pg_num 2048 (2047) read_tier -1 write_tier -1
> pool 1 pg_num 512 (511) read_tier -1 write_tier -1
> pool 3 pg_num 2048 (2047) read_tier -1 write_tier -1
> pool 4 pg_num 512 (511) read_tier -1 write_tier -1
> pool 5 pg_num 32768 (32767) read_tier -1 write_tier -1
>
> Here is the osdc output; it is unchanged after 5 runs:
>
> root@wrk-slc-01-02:~# cat /sys/kernel/debug/ceph/f3b7f409-e061-4e39-b4d0-ae380e29ae7e.client55440310/osdc
> 93835 osd1206 5.6841959c rbd_data.34df3ac703ced61.0000000000001dff read
> 9065810 osd1382 5.a50fa0ea rbd_header.34df3ac703ced61 474103'5506530325561344 watch
> root@wrk-slc-01-02:~# cat /sys/kernel/debug/ceph/f3b7f409-e061-4e39-b4d0-ae380e29ae7e.client55440310/osdc
> 93835 osd1206 5.6841959c rbd_data.34df3ac703ced61.0000000000001dff read
> 9067286 osd1382 5.a50fa0ea rbd_header.34df3ac703ced61 474103'5506530325561344 watch
> root@wrk-slc-01-02:~# cat /sys/kernel/debug/ceph/f3b7f409-e061-4e39-b4d0-ae380e29ae7e.client55440310/osdc
> 93835 osd1206 5.6841959c rbd_data.34df3ac703ced61.0000000000001dff read
> 9067831 osd1382 5.a50fa0ea rbd_header.34df3ac703ced61 474103'5506530325561344 watch
> root@wrk-slc-01-02:~# ls /dev/rbd/rbd
> none volume-daac5f12-e39b-4d64-a4fa-86c810aeb72d
> volume-daac5f12-e39b-4d64-a4fa-86c810aeb72d-part1
> root@wrk-slc-01-02:~# rbd info volume-daac5f12-e39b-4d64-a4fa-86c810aeb72d
> rbd image 'volume-daac5f12-e39b-4d64-a4fa-86c810aeb72d':
> size 61439 MB in 7680 objects
> order 23 (8192 kB objects)
> block_name_prefix: rbd_data.34df3ac703ced61
> format: 2
> features: layering
> flags:
> parent: rbd/volume-93d9a102-260e-4500-b87d-9696c7fc2b67@snapshot-9ba998b6-ca57-40dd-8895-265023132e99
> overlap: 61439 MB
>
> ceph status indicates the current osdmap epoch
> osdmap e1284866: 1480 osds: 1480 up, 1480 in
> pgmap v10231386: 37888 pgs, 5 pools, 745 TB data, 293 Mobjects
>
> root@wrk-slc-01-02:~# uname -r
> 3.19.0-25-generic
>
> So, the kernel driver is some 600 epochs behind current. This does seem to
> be load related, as we've been running 4 different kernels in our clients in
> our test environment and have not been able to recreate it there in a little
> over a week; however, our production environment has had 2 of these hangs in
> the last 4 days. Unfortunately I wasn't able to grab data from the first
> one.
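A quick way to put a number on the epoch lag described above is to compare the kernel client's debugfs osdmap epoch with the cluster's current epoch; a rough sketch, reusing the debugfs path from the output above (substitute the fsid/client id of the affected mount):

# Kernel client's view of the osdmap epoch (path taken from the output above).
client_epoch=$(awk '/^epoch/ {print $2; exit}' \
    /sys/kernel/debug/ceph/f3b7f409-e061-4e39-b4d0-ae380e29ae7e.client55440310/osdmap)
# Cluster's current epoch; "ceph osd dump" prints "epoch <N>" on its first line.
cluster_epoch=$(ceph osd dump | awk 'NR==1 {print $2}')
echo "client e$client_epoch, cluster e$cluster_epoch, lag $(( cluster_epoch - client_epoch ))"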
If you haven't already nuked it, what's the output of:
$ ceph osd map <pool name of pool with id 5> rbd_data.34df3ac703ced61.0000000000001dff
$ ceph osd map <pool name of pool with id 5> rbd_header.34df3ac703ced61
$ ceph daemon osd.1206 ops
$ ceph daemon osd.1206 objecter_requests
$ ceph daemon osd.1206 dump_ops_in_flight
$ ceph daemon osd.1206 dump_historic_ops
and repeat for osd.1382.
Thanks,
Ilya