Re: crash: kernel bug in bch_generic_make_request

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Thu, 2015-07-30 at 18:11 +0200, Jens-U. Mozdzen wrote:
> Zitat von "Jens-U. Mozdzen" <jmozdzen@xxxxxx>:
> > Hi everybody,
> >
> > we experience reproducible server crashes (reboots) when creating  
> > new file systems on bcache'd devices.
> 
> update: If I set cache_mode to "writethrough", I can successfully  
> create the file system.
> 
> On a second system, with identical setup but other "workload" (other  
> file systems) we today experienced problems accessing disk content  
> (via bcache). But as the system partitions (on the same LVM and as  
> such bcache-backed as well) seemed not to be affected, this is a bit  
> more mysterious.
> 
> We're now running both machines in "writethrough" mode, which  
> expectedly leads to a much higher i/o load on the machines, but is  
> hopefully more stable.
> 
> 
> > - kernel is 3.18.8-5-default (it's an OpenSUSE 13.1 install with a  
> > custom kernel)
> >
> > - modinfo bcache
> > filename:        
> > /lib/modules/3.18.8-5-default/kernel/drivers/md/bcache/bcache.ko
> > license:        GPL
> > author:         Kent Overstreet <koverstreet@xxxxxxxxxx>
> > author:         Kent Overstreet <kent.overstreet@xxxxxxxxx>
> > license:        GPL
> > srcversion:     92A8EFA83993DEC9D37BD5E
> > depends:
> > intree:         Y
> > vermagic:       3.18.8-5-default SMP mod_unload modversions
> >
> > - modinfo drbd
> > filename:       /lib/modules/3.18.8-5-default/updates/drbd.ko
> > alias:          block-major-147-*
> > license:        GPL
> > version:        8.4.6
> > description:    drbd - Distributed Replicated Block Device v8.4.6
> > author:         Philipp Reisner <phil@xxxxxxxxxx>, Lars Ellenberg  
> > <lars@xxxxxxxxxx>
> > srcversion:     36DF3A3564C161B02724575
> > depends:        libcrc32c
> > vermagic:       3.18.8-5-default SMP mod_unload modversions
> >
> > - crash details from (remote) syslog:
> >
> > --- cut here ---
> > Jul 23 07:39:47 server kernel: [78293.803010] ------------[ cut here  
> > ]------------
> > Jul 23 07:39:47 server kernel: [78293.803059] kernel BUG at  
> > ../block/bio.c:1836!
> > Jul 23 07:39:47 server kernel: [78293.803097] invalid opcode: 0000 [#1] SMP
> > Jul 23 07:39:47 server kernel: [78293.803134] Modules linked in: md5  
> > binfmt_misc qla2x00tgt(O) scst_vdisk(O) nfsd lockd grace nfs_acl  
> > auth_rpcgss scst(O) sunrpc drbd(O) tcp_diag libcrc32c inet_diag  
> > bonding joydev hid_generic usbhid usb_storage iTCO_wdt  
> > iTCO_vendor_support x86_pkg_temp_thermal intel_powerclamp coretemp  
> > kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel  
> > ghash_clmulni_intel aesni_intel ablk_helper cryptd lrw gf128mul  
> > glue_helper aes_x86_64 pcspkr qla2xxx_scst(O) scsi_transport_fc  
> > sb_edac edac_core lpc_ich i2c_i801 mfd_core xhci_pci xhci_hcd mei_me  
> > mei ioatdma igb dca ptp pps_core ses enclosure tpm_tis shpchp tpm  
> > wmi 8250_fintek ipmi_si ipmi_msghandler acpi_power_meter acpi_pad  
> > button sg linear ast ttm drm_kms_helper ehci_pci drm ehci_hcd  
> > i2c_algo_bit sysimgblt sysfillrect syscopyarea usbcore usb_common  
> > processor scsi_dh_alua scsi_dh_rdac scsi_dh_emc scsi_dh_hp_sw  
> > scsi_dh raid456 async_raid6_recov async_pq raid6_pq async_xor xor  
> > async_memcpy async_tx raid10 raid1 raid0 md_mod dm_snapshot dm_bufio  
> > dm_mirror dm_region_hash dm_log dm_mod mpt3sas scsi_transport_sas  
> > raid_class bcache
> > Jul 23 07:39:47 server kernel: [78293.820577] CPU: 2 PID: 25532  
> > Comm: kworker/u24:0 Tainted: G           O   3.18.8-5-default #1
> > Jul 23 07:39:47 server kernel: [78293.822659] Hardware name:  
> > Supermicro X10DRT-P Series/X10DRT-P, BIOS 1.0b 01/06/2015
> > Jul 23 07:39:47 server kernel: [78293.824771] Workqueue:  
> > drbd122_submit do_submit [drbd]
> > Jul 23 07:39:47 server kernel: [78293.826883] task: ffff880849a5a590  
> > ti: ffff8807d81f4000 task.ti: ffff8807d81f4000
> > Jul 23 07:39:47 server kernel: [78293.828997] RIP:  
> > 0010:[<ffffffff812f9e1e>]  [<ffffffff812f9e1e>] bio_split+0x6e/0x70
> > Jul 23 07:39:47 server kernel: [78293.831131] RSP:  
> > 0018:ffff8807d81f7ae8  EFLAGS: 00010246
> > Jul 23 07:39:47 server kernel: [78293.833225] RAX: 0000000000000000  
> > RBX: ffff88003710a020 RCX: ffff88105785f2c0
> > Jul 23 07:39:47 server kernel: [78293.835318] RDX: 0000000000000010  
> > RSI: 0000000000000000 RDI: ffff88003710a020
> > Jul 23 07:39:47 server kernel: [78293.837403] RBP: 0000000000000000  
> > R08: ffff8810574804c0 R09: ffff8808567f7810
> > Jul 23 07:39:47 server kernel: [78293.839480] R10: ffff88087fc985b8  
> > R11: 0000000000000025 R12: ffff8807cdd24140
> > Jul 23 07:39:47 server kernel: [78293.841544] R13: ffff8807cdd24168  
> > R14: ffff88105785f2c0 R15: ffff880856875e58
> > Jul 23 07:39:47 server kernel: [78293.843593] FS:   
> > 0000000000000000(0000) GS:ffff88087fc80000(0000)  
> > knlGS:0000000000000000
> > Jul 23 07:39:47 server kernel: [78293.845676] CS:  0010 DS: 0000 ES:  
> > 0000 CR0: 0000000080050033
> > Jul 23 07:39:47 server kernel: [78293.847766] CR2: 0000000002382390  
> > CR3: 0000000001e16000 CR4: 00000000001407e0
> > Jul 23 07:39:47 server kernel: [78293.849887] Stack:
> > Jul 23 07:39:47 server kernel: [78293.851999]  ffff88003710a020  
> > ffff88003710a020 ffff8807cdd24140 ffffffffa000d4dd
> > Jul 23 07:39:47 server kernel: [78293.854165]  ffff88003710a020  
> > ffff880856875d10 ffff880856875d10 ffffc90017581000
> > Jul 23 07:39:47 server kernel: [78293.856328]  ffff880856875df0  
> > ffffffffa0012bd3 ffff88003710a020 ffff880856875d10
> > Jul 23 07:39:47 server kernel: [78293.858488] Call Trace:
> > Jul 23 07:39:47 server kernel: [78293.860648]  [<ffffffffa000d4dd>]  
> > bch_generic_make_request+0x14d/0x1e0 [bcache]
> > Jul 23 07:39:47 server kernel: [78293.862842]  [<ffffffffa0012bd3>]  
> > bch_data_insert_start+0xc3/0x550 [bcache]
> > Jul 23 07:39:47 server kernel: [78293.865014]  [<ffffffffa0013dc4>]  
> > cached_dev_make_request+0xc34/0xe60 [bcache]
> > Jul 23 07:39:47 server kernel: [78293.867168]  [<ffffffff812fe210>]  
> > generic_make_request+0xb0/0x100
> > Jul 23 07:39:47 server kernel: [78293.869327]  [<ffffffffa06de3fa>]  
> > drbd_send_and_submit+0x92a/0x1cb0 [drbd]
> > Jul 23 07:39:47 server kernel: [78293.871498]  [<ffffffffa06dff5d>]  
> > send_and_submit_pending+0x7d/0xa0 [drbd]
> > Jul 23 07:39:47 server kernel: [78293.873637]  [<ffffffffa06e032d>]  
> > do_submit+0x3ad/0x490 [drbd]
> > Jul 23 07:39:47 server kernel: [78293.875758]  [<ffffffff810781b8>]  
> > process_one_work+0x148/0x3c0
> > Jul 23 07:39:47 server kernel: [78293.877898]  [<ffffffff81078806>]  
> > worker_thread+0x116/0x470
> > Jul 23 07:39:47 server kernel: [78293.880011]  [<ffffffff8107d57a>]  
> > kthread+0xca/0xe0
> > Jul 23 07:39:47 server kernel: [78293.882097]  [<ffffffff8160fc7c>]  
> > ret_from_fork+0x7c/0xb0
> > Jul 23 07:39:47 server kernel: [78293.884166] Code: 31 f6 89 ea 48  
> > 89 c7 e8 81 08 03 00 8b 73 28 4c 89 e7 e8 86 f1 ff ff 48 89 d8 5b 5d  
> > 41 5c c3 66 0f 1f 44 00 00 31 c0 eb f1 0f 0b <0f> 0b 0f 1f 44 00 00  
> > 41 57 49 89 d2 41 56 41 89 f6 be 01 00 00
> > Jul 23 07:39:47 server kernel: [78293.888557] RIP   
> > [<ffffffff812f9e1e>] bio_split+0x6e/0x70
> > Jul 23 07:39:47 server kernel: [78293.890696]  RSP <ffff8807d81f7ae8>
> > Jul 23 07:39:47 server kernel: [78294.110317] ---[ end trace  
> > ec3665f2de46ff27 ]---
> > Jul 23 07:39:47 server kernel: [78294.142139] BUG: unable to handle  
> > kernel paging request at ffffffffffffffd8
> > Jul 23 07:39:47 server kernel: [78294.142150] IP:  
> > [<ffffffff8107db2c>] kthread_data+0xc/0x20
> > Jul 23 07:39:47 server kernel: [78294.142153] PGD 1e17067 PUD 1e19067 PMD 0
> > Jul 23 07:39:47 server kernel: [78294.142156] Oops: 0000 [#2] SMP
> > Jul 23 07:39:47 server kernel: [78294.142204] Modules linked in: md5  
> > binfmt_misc qla2x00tgt(O) scst_vdisk(O) nfsd lockd grace nfs_acl  
> > auth_rpcgss scst(O) sunrpc drbd(O) tcp_diag libcrc32c inet_diag  
> > bonding joydev hid_generic usbhid usb_storage iTCO_wdt  
> > iTCO_vendor_support x86_pkg_temp_thermal intel_powerclamp coretemp  
> > kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel  
> > ghash_clmulni_intel aesni_intel ablk_helper cryptd lrw gf128mul  
> > glue_helper aes_x86_64 pcspkr qla2xxx_scst(O) scsi_transport_fc  
> > sb_edac edac_core lpc_ich i2c_i801 mfd_core xhci_pci xhci_hcd mei_me  
> > mei ioatdma igb dca ptp pps_core ses enclosure tpm_tis shpchp tpm  
> > wmi 8250_fintek ipmi_si ipmi_msghandler acpi_power_meter acpi_pad  
> > button sg linear ast ttm drm_kms_helper ehci_pci drm ehci_hcd  
> > i2c_algo_bit sysimgblt sysfillrect syscopyarea usbcore usb_common  
> > processor scsi_dh_alua scsi_dh_rdac scsi_dh_emc scsi_dh_hp_sw  
> > scsi_dh raid456 async_raid6_recov async_pq raid6_pq async_xor xor  
> > async_memcpy async_tx raid10 raid1 raid0 md_mod dm_snapshot dm_bufio  
> > dm_mirror dm_region_hash dm_log dm_mod mpt3sas scsi_transport_sas  
> > raid_class bcache
> > Jul 23 07:39:47 server kernel: [78294.142224] CPU: 9 PID: 25532  
> > Comm: kworker/u24:0 Tainted: G      D    O   3.18.8-5-default #1
> > Jul 23 07:39:47 server kernel: [78294.142225] Hardware name:  
> > Supermicro X10DRT-P Series/X10DRT-P, BIOS 1.0b 01/06/2015
> > Jul 23 07:39:47 server kernel: [78294.142241] task: ffff880849a5a590  
> > ti: ffff8807d81f4000 task.ti: ffff8807d81f4000
> > Jul 23 07:39:47 server kernel: [78294.142244] RIP:  
> > 0010:[<ffffffff8107db2c>]  [<ffffffff8107db2c>] kthread_data+0xc/0x20
> > Jul 23 07:39:47 server kernel: [78294.142245] RSP:  
> > 0018:ffff8807d81f7860  EFLAGS: 00010002
> > Jul 23 07:39:47 server kernel: [78294.142246] RAX: 0000000000000000  
> > RBX: 0000000000000009 RCX: 0000000000000000
> > Jul 23 07:39:47 server kernel: [78294.142247] RDX: 0000000000000000  
> > RSI: 0000000000000009 RDI: ffff880849a5a590
> > Jul 23 07:39:47 server kernel: [78294.142248] RBP: ffff8807d81f78d8  
> > R08: 0000000000000000 R09: 0000000000000246
> > Jul 23 07:39:47 server kernel: [78294.142249] R10: ffffffff821ef3b0  
> > R11: 0000000000000000 R12: ffff88107fcd2480
> > Jul 23 07:39:47 server kernel: [78294.142250] R13: 0000000000000009  
> > R14: 0000000000000000 R15: ffff880849a5a590
> > Jul 23 07:39:47 server kernel: [78294.142252] FS:   
> > 0000000000000000(0000) GS:ffff88107fcc0000(0000)  
> > knlGS:0000000000000000
> > Jul 23 07:39:47 server kernel: [78294.142253] CS:  0010 DS: 0000 ES:  
> > 0000 CR0: 0000000080050033
> > Jul 23 07:39:47 server kernel: [78294.142254] CR2: 0000000000000028  
> > CR3: 0000000001e16000 CR4: 00000000001407e0
> > Jul 23 07:39:47 server kernel: [78294.142255] Stack:
> > Jul 23 07:39:47 server kernel: [78294.142258]  ffffffff81078bdd  
> > ffff880849a5a590 ffffffff8160ba69 ffff880849a5a590
> > Jul 23 07:39:47 server kernel: [78294.142259]  ffff8807d81f7fd8  
> > 0000000000012480 0000000000012480 ffff8807d81f7898
> > Jul 23 07:39:47 server kernel: [78294.142261]  ffff880849a5a590  
> > 0000000000000246 ffff880849a5ad08 ffff880849a5a580
> > Jul 23 07:39:47 server kernel: [78294.142262] Call Trace:
> > Jul 23 07:39:47 server kernel: [78294.142277]  [<ffffffff81078bdd>]  
> > wq_worker_sleeping+0xd/0xa0
> > Jul 23 07:39:47 server kernel: [78294.142285]  [<ffffffff8160ba69>]  
> > thread_return+0x28c/0x4c3
> > Jul 23 07:39:47 server kernel: [78294.142296]  [<ffffffff810637ef>]  
> > do_exit+0x6ff/0xa70
> > Jul 23 07:39:47 server kernel: [78294.142304]  [<ffffffff810066c7>]  
> > oops_end+0x97/0xe0
> > Jul 23 07:39:47 server kernel: [78294.142309]  [<ffffffff81003730>]  
> > do_error_trap+0x70/0xe0
> > Jul 23 07:39:47 server kernel: [78294.142316]  [<ffffffff816117be>]  
> > invalid_op+0x1e/0x30
> > Jul 23 07:39:47 server kernel: [78294.142324]  [<ffffffff812f9e1e>]  
> > bio_split+0x6e/0x70
> > Jul 23 07:39:47 server kernel: [78294.142343]  [<ffffffffa000d4dd>]  
> > bch_generic_make_request+0x14d/0x1e0 [bcache]
> > Jul 23 07:39:47 server kernel: [78294.142358]  [<ffffffffa0012bd3>]  
> > bch_data_insert_start+0xc3/0x550 [bcache]
> > Jul 23 07:39:47 server kernel: [78294.142370]  [<ffffffffa0013dc4>]  
> > cached_dev_make_request+0xc34/0xe60 [bcache]
> > Jul 23 07:39:47 server kernel: [78294.142377]  [<ffffffff812fe210>]  
> > generic_make_request+0xb0/0x100
> > Jul 23 07:39:47 server kernel: [78294.142393]  [<ffffffffa06de3fa>]  
> > drbd_send_and_submit+0x92a/0x1cb0 [drbd]
> > Jul 23 07:39:47 server kernel: [78294.142424]  [<ffffffffa06dff5d>]  
> > send_and_submit_pending+0x7d/0xa0 [drbd]
> > Jul 23 07:39:47 server kernel: [78294.142433]  [<ffffffffa06e032d>]  
> > do_submit+0x3ad/0x490 [drbd]
> > Jul 23 07:39:47 server kernel: [78294.142440]  [<ffffffff810781b8>]  
> > process_one_work+0x148/0x3c0
> > Jul 23 07:39:47 server kernel: [78294.142444]  [<ffffffff81078806>]  
> > worker_thread+0x116/0x470
> > Jul 23 07:39:47 server kernel: [78294.142448]  [<ffffffff8107d57a>]  
> > kthread+0xca/0xe0
> > Jul 23 07:39:47 server kernel: [78294.142454]  [<ffffffff8160fc7c>]  
> > ret_from_fork+0x7c/0xb0
> > Jul 23 07:39:47 server kernel: [78294.142475] Code: 00 05 00 00 48  
> > 8b 40 c8 48 c1 e8 02 83 e0 01 c3 66 66 66 66 66 66 2e 0f 1f 84 00 00  
> > 00 00 00 0f 1f 44 00 00 48 8b 87 00 05 00 00 <48> 8b 40 d8 c3 66 66  
> > 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 0f
> > Jul 23 07:39:47 server kernel: [78294.142478] RIP   
> > [<ffffffff8107db2c>] kthread_data+0xc/0x20
> > Jul 23 07:39:47 server kernel: [78294.142479]  RSP <ffff8807d81f7860>
> > Jul 23 07:39:47 server kernel: [78294.142479] CR2: ffffffffffffffd8
> > Jul 23 07:39:47 server kernel: [78294.142482] ---[ end trace  
> > ec3665f2de46ff28 ]---
> > Jul 23 07:39:47 server kernel: [78294.156953] Fixing recursive fault  
> > but reboot is needed!
> > --- cut here ---
> >
> > - bcache bucket size is 512k, tree_depth is 1
> >
> > - setup details:
> >
> > -- md RAID6 "data" via seven 1 TB physical disks ("blockdev --getsz"  
> > reports 9767605760 512byte blocks, approx 5 TB)
> >
> > -- md RAID1 "cache" via two 128 GB SSDs ("blockdev --getsz" reports  
> > 249936512 512byte blocks, approx 125 GB)
> >
> > -- "bcache0" made from "data" and "cache"
> >
> > -- LVM via single physical volume "bcache0"
> >
> > -- many LVs create with older kernel
> >
> > -- DRBD on top of each LV. This server is always primary, no  
> > "fail-over" to a different node!
> >
> > -- file system (to be) created on the DRBD devices
> >
> > We upgraded the kernel incl. DRBD as we experienced random server  
> > reboots (probably in DRBD, but that was just guessing based on DRBD  
> > change logs), to the versions mentioned above, and now have the  
> > first case where we need to create an additional logical volume plus  
> > file system. Creating the LV works nicely, as does DRBD resync, but  
> > when running the "mkfs" command, the above exceptions occur and the  
> > server reboots.
> >
> > I've seen 5 bcache patches that were reported on the list and  
> > back-ported to OpenSUSE 13.2  
> > (https://bugzilla.suse.com/attachment.cgi?id=617628&action=edit),  
> > but none of these immediately seem to apply to the above situation.
> >
> > While I couldn't find references on the net to the symptoms we  
> > experience, can anybody tell this is an already known problem, and  
> > probably point to a solution? Any other help would of course be  
> > appreciated as well :)
> 
> I'm still looking for advice on how to approach this. I've had a look  
> at the source code and noticed that NULL returns from bio_next_split()  
> were not caught in bch_generic_make_request()... but I cannot tell if  

No. From the crash log:

"kernel BUG at ../block/bio.c:1836!"

It's crashed by BUG_ON(sectors >= bio_sectors(bio)) in bio_split().

But that's weired. bio_next_split() calls bio_split()
and it already checked that:

static inline struct bio *bio_next_split(struct bio *bio, int sectors,
                                         gfp_t gfp, struct bio_set *bs)
{       
        if (sectors >= bio_sectors(bio))
                return bio;

        return bio_split(bio, sectors, gfp, bs);
}

Don't know how the BUG_ON() could happened.

> this is actually the cause of above messages, nor how this should get  
> handled (and simply catching that case did not avoid above reboots).
> 
> Anyone out there willing to share some thoughts?
> 
> Regards,
> Jens

--
To unsubscribe from this list: send the line "unsubscribe linux-bcache" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Index of Archives]     [Linux ARM Kernel]     [Linux Filesystem Development]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Security]     [Bugtraq]     [Linux OMAP]     [Linux MIPS]     [ECOS]     [Asterisk Internet PBX]     [Linux API]

  Powered by Linux