Hi all,

I have been struggling to map ceph rbd images for the last week, but constantly get kernel crashes.

What has been done:

Previously we had v0.48 set up as a test cluster (4 hosts, 5 osds, 3 mons, 3 mds, custom crushmap) on Ubuntu 12.04, with an Ubuntu Precise client for mapping rbd + iscsi export; I can't remember the exact kernel version when the crashes appeared. At some point it was no longer possible to map rbd images: on "rbd map..." the machine just crashed with lots of dumped info on screen, even though the very same rbd map commands had worked before.

I read some advice on the list to use kernel 3.4.20 or 3.6.7, as those should have all known rbd module bugs fixed. I used one of those (I believe 3.6.7) and managed to map rbd images again for a couple of days.

Then I discovered slow disk I/O on one host, so I removed the OSD from it and moved that OSD to another, new host, following the docs (rough command sketch below). The rbd images stayed mapped while I was doing this. As I was busy moving the osd I didn't notice the exact moment the client crashed again, but I think it was some time after the cluster had already recovered from the degraded state caused by adding the new osd. After this point I could not map rbd images from the client any more: on "rbd map..." the system just crashed, and reboots after the crash did not help. I installed fresh Ubuntu Precise + the 3.6.7 kernel on a spare box and the crashes remained; I then set up a VM with Ubuntu Precise, tried the kernels mentioned below, and still got 100% crashes on "rbd map...".

Well, those are blurry memories of the problem history, but during the last days I tried to solve the problem by updating all possible components, and unfortunately that did not help either.

What I have tried:

I completely removed the demo cluster data (dd over osd data partitions and journal partitions, rm for the remaining files) and purged + upgraded the ceph packages to version 0.55.1 (8e25c8d984f9258644389a18997ec6bdef8e056b), as the upgrade was planned anyway. So ceph is now 0.55.1 on Ubuntu 12.04, with xfs for the osds. Then I compiled kernels 3.4.20, 3.4.24, 3.6.7 and 3.7.1 for the client (config sketch below) and tried to map an rbd image: constant crash with all versions.

An interesting part about the map command itself: when I installed the new rbd client box and the VM, I copy/pasted the "rbd map..." commands that had worked at the very beginning onto these machines. The command was

rbd map fileserver/testimage -k /etc/ceph/ceph.keyring

and it still crashes the kernel even now, when no rbd image "testimage" exists (I recreated the pool "fileserver"). The crash happens on

rbd map notexistantpool/testimage -k /etc/ceph/ceph.keyring

as well. Could this be some backward-compatibility issue, since mapping like this was last done several versions ago?

Then I decided to try a different mapping syntax. Some intro + results:

# rados lspools
data
metadata
rbd
fileserver

# rbd ls -l
NAME            SIZE    PARENT  FMT  PROT  LOCK
testimage1_10G  10000M          1

# rbd ls -l --pool fileserver
rbd: pool fileserver doesn't contain rbd images

Well, I do not understand what the doc (http://ceph.com/docs/master/rbd/rbd-ko/) means by "myimage", so I am omitting that part, but in no case should the kernel crash just because a wrong command was given.
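For completeness, the OSD move mentioned above followed the standard remove/re-add procedure from the docs, roughly as below (osd.4 is a placeholder for the actual id, not the one I used):

ceph osd out 4                  # drain data off the osd, wait for rebalance
/etc/init.d/ceph stop osd.4     # stop the daemon on the old host
ceph osd crush remove osd.4     # remove it from the crush map
ceph auth del osd.4             # delete its authentication key
ceph osd rm 4                   # remove the osd from the cluster
(then prepare and start the osd on the new host per the docs)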
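For anyone trying to reproduce the client setup: the only rbd-relevant bits in the kernel builds are the ceph/rbd modules. A minimal sketch of a build (3.6.7 as an example, paths and job count arbitrary):

cd linux-3.6.7
make oldconfig
grep -E 'CONFIG_CEPH_LIB|CONFIG_BLK_DEV_RBD' .config
  CONFIG_CEPH_LIB=m
  CONFIG_BLK_DEV_RBD=m
make -j4 && make modules_install && make install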
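As far as I understand, "rbd map" itself only resolves the monitor addresses and the key and then writes a single line to /sys/bus/rbd/add, so the kernel module can be poked directly to see whether it is the tool or the module that brings the box down (monitor address and key below are placeholders):

echo "192.168.0.100:6789 name=admin,secret=AQB...placeholder... rbd testimage1_10G" > /sys/bus/rbd/add
# if this also crashes, the bug is in the kernel rbd module, not in the rbd tool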
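To rule out the missing image as the trigger, I can also recreate a small test image in the pool and re-extract the admin key before mapping; a minimal sketch (image name and size arbitrary):

rbd create --pool fileserver --size 1024 testimage    # 1 GB test image
rbd ls --pool fileserver                              # should now show testimage
ceph-authtool --print-key -n client.admin /etc/ceph/ceph.keyring > /tmp/secret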
Excerpt from doc:

sudo rbd map foo --pool rbd myimage --id admin --keyring /path/to/keyring
sudo rbd map foo --pool rbd myimage --id admin --keyfile /path/to/file

My commands:

rbd map testimage1_10G --pool rbd --id admin --keyring /etc/ceph/ceph.keyring   -> crash
rbd map testimage1_10G --pool rbd --id admin --keyfile /tmp/secret              -> crash

(for the keyfile variant, only the key was extracted from the keyring and written to /tmp/secret, as in the sketch above)

As the crashes happen on the client side and are immediate, I have no logs about them. I can post screenshots from the console when the crash happens, but they are all almost the same, containing the strings:

Stack: ... Call Trace: ... Fixing recursive fault but reboot is needed!

Also, when the VM crashes, the virtualization host still shows high CPU load (probably some loop?).

I tried default and custom CRUSH maps, but the crashes are the same.

If anyone could advise how to get out of this magic compile kernel -> "rbd map..." -> crash cycle, I would be happy :) Probably someone can reproduce the crashes with similar commands? If I can send any additional info of value to track down the problem, please let me know what is needed.

BR,
Ugis
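P.S. Since the box dies before anything reaches the disk, I could try capturing the full oops over the network with netconsole instead of posting screenshots; something like the following, where the IPs/MAC are placeholders for the crashing client and a second log-collecting box:

# on the crashing client, before running "rbd map":
modprobe netconsole netconsole=6665@192.168.0.50/eth0,6666@192.168.0.60/00:11:22:33:44:55
# on the collecting box (192.168.0.60):
nc -u -l 6666 | tee rbd-oops.txt    # netcat flags vary; traditional nc needs -p 6666

Then running "rbd map..." on the client should land the full oops text in rbd-oops.txt.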