On Sat, Dec 12, 2015 at 6:37 PM, Tom Christensen <pavera@xxxxxxxxx> wrote:
> We had a kernel map get hung up again last night/this morning. The rbd is
> mapped but unresponsive; if I try to unmap it I get the following error:
>
>     rbd: sysfs write failed
>     rbd: unmap failed: (16) Device or resource busy
>
> Now that this has happened, attempting to map another RBD fails and lsblk
> fails as well; both of these tasks just hang forever.
>
> We have 1480 OSDs in the cluster, so posting the full osdmap seems
> excessive; here is the beginning (it didn't change over 5 runs):
>
> root@wrk-slc-01-02:~# cat /sys/kernel/debug/ceph/f3b7f409-e061-4e39-b4d0-ae380e29ae7e.client55440310/osdmap
> epoch 1284256
> flags
> pool 0 pg_num 2048 (2047) read_tier -1 write_tier -1
> pool 1 pg_num 512 (511) read_tier -1 write_tier -1
> pool 3 pg_num 2048 (2047) read_tier -1 write_tier -1
> pool 4 pg_num 512 (511) read_tier -1 write_tier -1
> pool 5 pg_num 32768 (32767) read_tier -1 write_tier -1
>
> Here is the osdc output; it did not change over 5 runs:
>
> root@wrk-slc-01-02:~# cat /sys/kernel/debug/ceph/f3b7f409-e061-4e39-b4d0-ae380e29ae7e.client55440310/osdc
> 93835    osd1206  5.6841959c  rbd_data.34df3ac703ced61.0000000000001dff  read
> 9065810  osd1382  5.a50fa0ea  rbd_header.34df3ac703ced61  474103'5506530325561344  watch
> root@wrk-slc-01-02:~# cat /sys/kernel/debug/ceph/f3b7f409-e061-4e39-b4d0-ae380e29ae7e.client55440310/osdc
> 93835    osd1206  5.6841959c  rbd_data.34df3ac703ced61.0000000000001dff  read
> 9067286  osd1382  5.a50fa0ea  rbd_header.34df3ac703ced61  474103'5506530325561344  watch
> root@wrk-slc-01-02:~# cat /sys/kernel/debug/ceph/f3b7f409-e061-4e39-b4d0-ae380e29ae7e.client55440310/osdc
> 93835    osd1206  5.6841959c  rbd_data.34df3ac703ced61.0000000000001dff  read
> 9067831  osd1382  5.a50fa0ea  rbd_header.34df3ac703ced61  474103'5506530325561344  watch
>
> root@wrk-slc-01-02:~# ls /dev/rbd/rbd
> none  volume-daac5f12-e39b-4d64-a4fa-86c810aeb72d
> volume-daac5f12-e39b-4d64-a4fa-86c810aeb72d-part1
>
> root@wrk-slc-01-02:~# rbd info volume-daac5f12-e39b-4d64-a4fa-86c810aeb72d
> rbd image 'volume-daac5f12-e39b-4d64-a4fa-86c810aeb72d':
>         size 61439 MB in 7680 objects
>         order 23 (8192 kB objects)
>         block_name_prefix: rbd_data.34df3ac703ced61
>         format: 2
>         features: layering
>         flags:
>         parent: rbd/volume-93d9a102-260e-4500-b87d-9696c7fc2b67@snapshot-9ba998b6-ca57-40dd-8895-265023132e99
>         overlap: 61439 MB
>
> ceph status indicates the current osdmap epoch:
>
>     osdmap e1284866: 1480 osds: 1480 up, 1480 in
>      pgmap v10231386: 37888 pgs, 5 pools, 745 TB data, 293 Mobjects
>
> root@wrk-slc-01-02:~# uname -r
> 3.19.0-25-generic
>
> So the kernel driver is some 600 epochs behind the current epoch. This does
> seem to be load related: we've been running 4 different kernels on our
> clients in our test environment and have not been able to recreate it there
> in a little over a week, while our production environment has had 2 of
> these hangs in the last 4 days. Unfortunately I wasn't able to grab data
> from the first one.

If you haven't already nuked it, what's the output of:

$ ceph osd map <pool name of pool with id 5> rbd_data.34df3ac703ced61.0000000000001dff
$ ceph osd map <pool name of pool with id 5> rbd_header.34df3ac703ced61

$ ceph daemon osd.1206 ops
$ ceph daemon osd.1206 objecter_requests
$ ceph daemon osd.1206 dump_ops_in_flight
$ ceph daemon osd.1206 dump_historic_ops

and repeat for osd.1382.
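If it helps, something along these lines should collect all of the above in
one pass. This is only a sketch: POOL5 stands in for the (still unknown) name
of the pool with id 5, and the "ceph daemon" calls have to be run on the host
that holds each OSD's admin socket.

    # sketch only: set POOL5 to the actual name of the pool with id 5
    POOL5="<pool name of pool with id 5>"

    ceph osd map "$POOL5" rbd_data.34df3ac703ced61.0000000000001dff
    ceph osd map "$POOL5" rbd_header.34df3ac703ced61

    # dump in-flight/historic ops and objecter state for both OSDs;
    # run each "ceph daemon" invocation on the node hosting that OSD
    for osd in 1206 1382; do
        for cmd in ops objecter_requests dump_ops_in_flight dump_historic_ops; do
            echo "=== osd.$osd: $cmd ==="
            ceph daemon "osd.$osd" "$cmd"
        done
    done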
Thanks,

                Ilya