That looks like a bug that isn't familiar to Josh or me. Can you create a report in the tracker and provide as much debug info as you can come up with? :)

On Friday, May 25, 2012 at 3:15 AM, Andrey Korolyov wrote:
> Hi,
>
> The newer kernel rbd driver throws quite strange messages on map|unmap
> compared to the 3.2 branch:
>
> rbd map 'path'      # device appears as /dev/rbd1 instead of rbd0, then
> rbd unmap /dev/rbd1 # causes the following trace, with vanilla 3.4.0 from kernel.org:
>
> [ 99.700802] BUG: scheduling while atomic: rbd/3846/0x00000002
> [ 99.700857] Modules linked in: btrfs ip6table_filter ip6_tables
> iptable_filter ip_tables ebtable_nat ebtables x_tables iscsi_tcp
> libiscsi_tcp libiscsi scsi_transport_iscsi fuse nfsd nfs nfs_acl
> auth_rpcgss lockd sunrpc kvm_intel kvm bridge stp llc ipv6 rbd libceph
> loop 8250_pnp pcspkr firewire_ohci coretemp firewire_core hwmon 8250
> serial_core
> [ 99.700899] Pid: 3846, comm: rbd Not tainted 3.4.0 #3
> [ 99.700902] Call Trace:
> [ 99.700910] [<ffffffff81464d68>] ? __schedule+0x96/0x625
> [ 99.700916] [<ffffffff8105f98a>] ? __queue_work+0x254/0x27c
> [ 99.700921] [<ffffffff81465d39>] ? _raw_spin_lock_irqsave+0x2a/0x32
> [ 99.700926] [<ffffffff81069b6d>] ? complete+0x31/0x40
> [ 99.700931] [<ffffffff8105f10a>] ? flush_workqueue_prep_cwqs+0x16e/0x180
> [ 99.700947] [<ffffffff81463bd8>] ? schedule_timeout+0x21/0x1af
> [ 99.700951] [<ffffffff8107165d>] ? enqueue_entity+0x67/0x13d
> [ 99.700955] [<ffffffff81464ad9>] ? wait_for_common+0xc5/0x143
> [ 99.700959] [<ffffffff8106d5fc>] ? try_to_wake_up+0x217/0x217
> [ 99.700963] [<ffffffff81063952>] ? kthread_stop+0x30/0x50
> [ 99.700967] [<ffffffff81060979>] ? destroy_workqueue+0x148/0x16b
> [ 99.700977] [<ffffffffa004ce07>] ? ceph_osdc_stop+0x1f/0xaa [libceph]
> [ 99.700984] [<ffffffffa00463b4>] ? ceph_destroy_client+0x10/0x44 [libceph]
> [ 99.700989] [<ffffffffa00652ae>] ? rbd_client_release+0x38/0x4b [rbd]
> [ 99.700993] [<ffffffffa0065719>] ? rbd_put_client.isra.10+0x28/0x3d [rbd]
> [ 99.700998] [<ffffffffa006609d>] ? rbd_dev_release+0xc3/0x157 [rbd]
> [ 99.701003] [<ffffffff81287387>] ? device_release+0x41/0x72
> [ 99.701007] [<ffffffff81202b95>] ? kobject_release+0x4e/0x6a
> [ 99.701025] [<ffffffffa0065156>] ? rbd_remove+0x102/0x11e [rbd]
> [ 99.701035] [<ffffffff8114b058>] ? sysfs_write_file+0xd3/0x10f
> [ 99.701044] [<ffffffff810f8796>] ? vfs_write+0xaa/0x136
> [ 99.701052] [<ffffffff810f8a07>] ? sys_write+0x45/0x6e
> [ 99.701062] [<ffffffff8146a839>] ? system_call_fastpath+0x16/0x1b
> [ 99.701170] BUG: scheduling while atomic: rbd/3846/0x00000002
> [ 99.701220] Modules linked in: btrfs ip6table_filter ip6_tables
> iptable_filter ip_tables ebtable_nat ebtables x_tables iscsi_tcp
> libiscsi_tcp libiscsi scsi_transport_iscsi fuse nfsd nfs nfs_acl
> auth_rpcgss lockd sunrpc kvm_intel kvm bridge stp llc ipv6 rbd libceph
> loop 8250_pnp pcspkr firewire_ohci coretemp firewire_core hwmon 8250
> serial_core
> [ 99.701251] Pid: 3846, comm: rbd Not tainted 3.4.0 #3
> [ 99.701253] Call Trace:
> [ 99.701257] [<ffffffff81464d68>] ? __schedule+0x96/0x625
> [ 99.701261] [<ffffffff81465ef9>] ? _raw_spin_unlock_irq+0x5/0x2e
> [ 99.701265] [<ffffffff81069f92>] ? finish_task_switch+0x4c/0xc1
> [ 99.701268] [<ffffffff8146525b>] ? __schedule+0x589/0x625
> [ 99.701272] [<ffffffff812084b2>] ? ip4_string+0x5a/0xc8
> [ 99.701276] [<ffffffff81208cbd>] ? string.isra.3+0x39/0x9f
> [ 99.701281] [<ffffffff81208e33>] ? ip4_addr_string.isra.5+0x5a/0x76
> [ 99.701285] [<ffffffff81208b7a>] ? number.isra.1+0x10e/0x218
> [ 99.701290] [<ffffffff81463bd8>] ? schedule_timeout+0x21/0x1af
> [ 99.701294] [<ffffffff81464ad9>] ? wait_for_common+0xc5/0x143
> [ 99.701298] [<ffffffff8106d5fc>] ? try_to_wake_up+0x217/0x217
> [ 99.701303] [<ffffffff8105f24c>] ? flush_workqueue+0x130/0x2a5
> [ 99.701309] [<ffffffffa00463b9>] ? ceph_destroy_client+0x15/0x44 [libceph]
> [ 99.701314] [<ffffffffa00652ae>] ? rbd_client_release+0x38/0x4b [rbd]
> [ 99.701319] [<ffffffffa0065719>] ? rbd_put_client.isra.10+0x28/0x3d [rbd]
> [ 99.701324] [<ffffffffa006609d>] ? rbd_dev_release+0xc3/0x157 [rbd]
> [ 99.701328] [<ffffffff81287387>] ? device_release+0x41/0x72
> [ 99.701334] [<ffffffff81202b95>] ? kobject_release+0x4e/0x6a
> [ 99.701343] [<ffffffffa0065156>] ? rbd_remove+0x102/0x11e [rbd]
> [ 99.701352] [<ffffffff8114b058>] ? sysfs_write_file+0xd3/0x10f
> [ 99.701361] [<ffffffff810f8796>] ? vfs_write+0xaa/0x136
> [ 99.701369] [<ffffffff810f8a07>] ? sys_write+0x45/0x6e
> [ 99.701377] [<ffffffff8146a839>] ? system_call_fastpath+0x16/0x1b
>
>
> On Wed, May 16, 2012 at 12:24 PM, Andrey Korolyov <andrey@xxxxxxx> wrote:
> > > This is most likely due to a recently-fixed problem.
> > > The fix is found in this commit, although there were
> > > other changes that led up to it:
> > > 32eec68d2f rbd: don't drop the rbd_id too early
> > > It is present starting in Linux kernel 3.3; it appears
> > > you are running 2.6?
> > >
> >
> > Nope, it's just Debian kernel naming - they continue to name 3.x as
> > 2.6 and I'm following them in my own build naming. I tried that on
> > 3.2 the first time, and just a couple of minutes ago on my notebook with
> > 3.3.4 over a relatively slow vpn connection - rbd failed with almost the
> > same backtrace (I removed the sleep from the cycle and the bug was
> > reproduced immediately after the first map-unmap), and the kernel panicked
> > after approx. four minutes, when I stopped the 'for...' execution;
> > unfortunately there is no bt of the panic because of X and the lack of a
> > configured netconsole :) The symptoms are the same - 'rbd showmapped' shows
> > the latest volume, but unmap failed with 'xxx is not a block device remove
> > failed: (22) Invalid argument' and a couple of 'null pointer
> > dereference' messages in dmesg. I used /dev/rbd0 instead of the
> > symlinks to reduce the probability of involving a udev-related timeout on
> > symlink creation.
> >
> > On Tue, May 15, 2012 at 7:40 PM, Josh Durgin <josh.durgin@xxxxxxxxxxx> wrote:
> > > On 05/15/2012 04:49 AM, Andrey Korolyov wrote:
> > > >
> > > > Hi,
> > > >
> > > > There is a strange bug when I try to map an excessive number of block
> > > > devices inside the pool, like the following:
> > > >
> > > > for vol in $(rbd ls); do rbd map $vol; [some-microsleep]; [some
> > > > operation or nothing, I have stubbed a guestfs mount here];
> > > > [some-microsleep]; rbd unmap /dev/rbd/rbd/$vol; [some-microsleep]; done
> > > >
> > > > udev or rbd seems to be somehow late and the mapping fails. There is no
> > > > real-world harm at all, and such a case can easily be avoided, but on a
> > > > busy cluster the timeout increases and I was able to catch the same thing
> > > > on a two-osd config in a recovering state. With 0.1 second on a healthy
> > > > cluster all works okay; with 0.05 it may fail with the following trace
> > > > (just for me, because I am testing on relatively old and crappy hardware,
> > > > so others may catch that on smaller intervals):
> > > >
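For reference, the loop described above written out as a minimal shell sketch; the 0.05-second sleeps, the default 'rbd' pool, and the stubbed-out mount step are placeholders taken from the description, not the exact script that was run:

    #!/bin/bash
    # Map, briefly touch, and unmap every image in the default pool.
    # Short sleeps between the steps make the udev/rbd race easier to hit.
    for vol in $(rbd ls); do
        rbd map "$vol"          # kernel creates /dev/rbdN; udev adds /dev/rbd/rbd/$vol
        sleep 0.05
        # ... do something with the device here (a guestfs mount was stubbed in) ...
        sleep 0.05
        rbd unmap "/dev/rbd/rbd/$vol"
        sleep 0.05
    done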
> > >
> > > udev is asynchronous by nature. The rbd tool itself doesn't wait for
> > > /dev to be populated because you may not be using the default udev rule
> > > (or not using udev at all). Our test framework polls for the device to
> > > make sure 'rbd map' and udev completed:
> > >
> > > https://github.com/ceph/teuthology/blob/d6b9bd8b63c8c6c1181ece1f6941829d8d1d5152/teuthology/task/rbd.py#L190
> > >
> > > Josh
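The linked teuthology code is Python; a rough shell equivalent of the same "poll until the device shows up" idea might look like the sketch below, with the device path and timeout chosen for illustration rather than taken from that file:

    # After 'rbd map', wait up to ~5 seconds for udev to create the symlink.
    dev=/dev/rbd/rbd/myvol      # illustrative path: pool 'rbd', image 'myvol'
    for i in $(seq 1 50); do
        [ -b "$dev" ] && break
        sleep 0.1
    done
    [ -b "$dev" ] || { echo "timed out waiting for $dev" >&2; exit 1; }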