On Thu, Jun 02, 2016 at 09:51:27AM +0200, Philipp Hahn wrote: > Hello, > > probably during heavy IO our file server crashed on a BUG_ON in dache.c, > probably triggered by NFS: > > > ------------[ cut here ]------------ > > kernel BUG at /var/build/temp/tmp.BPql4ErveJ/pbuilder/linux-3.16.7-ckt25/fs/dcache.c:2373! > > invalid opcode: 0000 [#1] SMP > > Modules linked in: nfsv4 dns_resolver nfsv3 btrfs raid6_pq xor crc32c_generic ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi > > s_acl auth_rpcgss nfs fscache lockd sunrpc quota_v2 quota_tree lpc_ich mfd_core psmouse i2c_i801 tpm_tis processor pcspkr serio_raw i2c_core tpm rng_core thermal_sys > > rc16 dm_snapshot dm_bufio dm_mirror dm_region_hash dm_log dm_mod sd_mod crc_t10dif crct10dif_common sg sr_mod cdrom ata_generic ehci_pci uhci_hcd ata_piix e1000 ehci_hcd mptspi mptscsih mptbase floppy scsi_transport_spi usbcore libata button usb_common > > CPU: 0 PID: 1455 Comm: nfsd Not tainted 3.16.0-ucs193-amd64 #1 Debian 3.16.7-ckt25-2~bpo70+1~ucs3.3.193.201604181018 > > Hardware name: FUJITSU SIEMENS PRIMERGY RX200S2/D1790/M73IL, BIOS 6.0 Rev. R04A5F5.1790 01/16/2006 > > task: ffff880121cb03d0 ti: ffff8800d0cac000 task.ti: ffff8800d0cac000 > > RIP: 0010:[<ffffffff811d2613>] [<ffffffff811d2613>] __d_rehash+0x53/0x60 > > RSP: 0018:ffff8800d0cafaa0 EFLAGS: 00010282 > > RAX: 0000000000043e72 RBX: ffff8800939d4578 RCX: 000000000000000d > > RDX: ffffc90000005000 RSI: ffffc90000224390 RDI: ffff8800b8d53018 > > RBP: ffff8800b8d53018 R08: 0000000000000000 R09: 0000000000000000 > > R10: 0100000000000b15 R11: 0000000000000000 R12: ffff8800d1388198 > > R13: ffff8800b8d53018 R14: ffff8800b8d53b58 R15: ffff8800d0cafbe0 > > FS: 0000000000000000(0000) GS:ffff880127c00000(0000) knlGS:0000000000000000 > > CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b > > CR2: ffffffffff600400 CR3: 00000001007d7000 CR4: 00000000000007f0 > > Stack: > > ffffffff811d2a3d ffff8800d0cafb27 ffff8800d1388198 ffff8800b8d53b58 > > ffff8800d1388198 ffff8800b8d53b58 ffff8800b8d53b58 ffff880122829860 > > ffff8800b8d53b58 ffff8800d0cafbe0 ffffffff811c44f9 ffff8800b8d53b58 > > Call Trace: > > [<ffffffff811d2a3d>] ? d_materialise_unique+0x8d/0x3d0 > > [<ffffffff811c44f9>] ? lookup_real+0x19/0x50 > > [<ffffffff811c4b27>] ? __lookup_hash+0x37/0x50 > > [<ffffffff811c92fd>] ? lookup_one_len+0xbd/0x110 > > [<ffffffff81241159>] ? reconnect_path+0x109/0x2d0 > > [<ffffffffa0367e20>] ? nfsd_setuser_and_check_port+0xa0/0xa0 [nfsd] > > [<ffffffff8124144f>] ? exportfs_decode_fh+0x12f/0x300 > > [<ffffffffa036cee5>] ? expkey_match+0x45/0x50 [nfsd] > > [<ffffffffa02df314>] ? cache_check+0x114/0x3b0 [sunrpc] > > [<ffffffffa036da5d>] ? exp_find+0xfd/0x1b0 [nfsd] > > [<ffffffffa036898c>] ? fh_verify+0x2fc/0x5a0 [nfsd] > > [<ffffffffa0378df3>] ? nfsd4_proc_compound+0x4c3/0x790 [nfsd] > > [<ffffffffa0364df4>] ? nfsd_dispatch+0xe4/0x230 [nfsd] > > [<ffffffffa02d3dae>] ? svc_process_common+0x33e/0x6a0 [sunrpc] > > [<ffffffffa0364680>] ? nfsd_destroy+0x70/0x70 [nfsd] > > [<ffffffffa02d44d0>] ? svc_process+0x110/0x160 [sunrpc] > > [<ffffffffa0364747>] ? nfsd+0xc7/0x140 [nfsd] > > [<ffffffff8108e141>] ? kthread+0xc1/0xe0 > > [<ffffffff8108e080>] ? flush_kthread_worker+0x80/0x80 > > [<ffffffff81572598>] ? ret_from_fork+0x58/0x90 > > [<ffffffff8108e080>] ? flush_kthread_worker+0x80/0x80 > > Code: 89 47 08 74 04 48 89 50 08 48 89 77 10 48 83 ca 01 48 89 16 0f ba 36 00 c3 0f 1f 80 00 00 00 00 f3 90 48 8b 06 a8 01 75 f7 eb b9 <0f> 0b 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 8b 47 > > RIP [<ffffffff811d2613>] __d_rehash+0x53/0x60 > > RSP <ffff8800d0cafaa0> > > ---[ end trace 18587f42c106645d ]--- > > There was a report 4 years ago where the same BUG_ON() was hit, but > during some renames: > <http://linux-fsdevel.vger.kernel.narkive.com/GcMw2p94/kernel-bug-at-fs-dcache-c-2105-d-rehash-bug-on-d-unhashed-entry> > > > Our kernel is 3.16.7-ckt25 from Debian + ckt26 and ckt27 on top. So this is the BUG_ON(!d_unhashed(..)) at the top of __d_rehash? I don't know the -ckt kernels, and I don't have a BUG at line 2373 of dcache.c, or something else? There have been a lot of changes in the area since 3.16.7. --b. > I don't know exactly what our server was doing at that time, but we were > setting up a Debian mirror at that time on "btrfs" on an iSCSI storage > connected to that host. > The host also imports 2 NFS shares from other servers and exports 3 NFS > volumes itself. > > > Last week-end the same server got stuck while being backed up: > > > BUG: soft lockup - CPU#1 stuck for 22s! [bacula-fd:16612] > > Modules linked in: nfsv4 dns_resolver nfsv3 btrfs raid6_pq xor crc32c_generic ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi rpcsec_gss_krb5 nfsd nfs_acl auth_rpcgss nfs fscache lockd sunrpc quota_v2 quota_tree lpc_ich mfd_core psmouse tpm_tis tpm i2c_i801 rng_core serio_raw processor i2c_core e752x_edac evdev shpchp edac_core pcspkr thermal_sys ext4 jbd2 crc16 dm_snapshot dm_bufio dm_mirror dm_region_hash dm_log dm_mod sd_mod crc_t10dif crct10dif_common usb_storage sg sr_mod cdrom ata_generic floppy ehci_pci uhci_hcd ata_piix ehci_hcd mptspi mptscsih libata usbcore e1000 mptbase scsi_transport_spi button usb_common > > CPU: 1 PID: 16612 Comm: bacula-fd Not tainted 3.16.0-ucs193-amd64 #1 Debian 3.16.7-ckt25-2~bpo70+1~ucs3.3.193.201604181018 > > Hardware name: FUJITSU SIEMENS PRIMERGY RX200S2/D1790/M73IL, BIOS 6.0 Rev. R04A5F5.1790 01/16/2006 > > task: ffff8800c6bf6ce0 ti: ffff8800732cc000 task.ti: ffff8800732cc000 > > RIP: 0010:[<ffffffff811d36c0>] [<ffffffff811d36c0>] __d_lookup_rcu+0x80/0x150 > > RSP: 0018:ffff8800732cfc58 EFLAGS: 00000282 > > RAX: 0000000000000000 RBX: 000102d000000000 RCX: 000000000000000d > > RDX: ffffc90000005000 RSI: ffff8800732cfe10 RDI: ffff880012c2b918 > > RBP: ffff880012c2b918 R08: ffff880104804050 R09: 0000000000000023 > > R10: ffffffffffffffff R11: 7bfe9e189e400000 R12: ffff8800beb5f080 > > R13: ffff88011b2102e8 R14: ffffffff811a035e R15: 00000000000102d0 > > FS: 00007fa2e701f700(0000) GS:ffff880127c80000(0000) knlGS:0000000000000000 > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > CR2: ffffffffff600400 CR3: 00000000bd005000 CR4: 00000000000007e0 > > Stack: > > 0000000000000006 ffff880104804050 0000000000000000 ffffffff811c6449 > > ffff8800732cfcc0 ffff8800732cfd48 ffff8800732cfd38 0000000000000000 > > ffff880012c2b918 ffff880121f0e060 ffff8800732cfe00 ffffffff811c569e > > Call Trace: > > [<ffffffff811c6449>] ? link_path_walk+0x69/0x850 > > [<ffffffff811c569e>] ? lookup_fast+0x3e/0x2c0 > > [<ffffffff811c6d68>] ? path_lookupat+0x138/0x740 > > [<ffffffff814bda33>] ? tcp_sendmsg+0xd3/0xd20 > > [<ffffffff812138f5>] ? posix_acl_xattr_get+0x45/0xb0 > > [<ffffffff811c73a4>] ? filename_lookup+0x34/0xd0 > > [<ffffffff811c5cf1>] ? getname_flags.part.21+0x91/0x140 > > [<ffffffff811cb289>] ? user_path_at_empty+0x99/0x120 > > [<ffffffff811f7db4>] ? fsnotify+0x234/0x300 > > [<ffffffff811ba4cf>] ? do_sync_write+0x5f/0x90 > > [<ffffffff811bfc00>] ? vfs_fstatat+0x40/0x90 > > [<ffffffff811bfcc2>] ? SYSC_newlstat+0x12/0x30 > > [<ffffffff811bb33c>] ? vfs_write+0x16c/0x200 > > [<ffffffff811bb80c>] ? SyS_write+0x7c/0xb0 > > [<ffffffff8157264d>] ? system_call_fast_compare_end+0x10/0x15 > > Code: 00 00 00 4d 89 e9 49 c7 c2 ff ff ff ff 49 c1 e9 20 eb 14 0f 1f 84 00 00 00 00 00 48 8b 1b 48 85 db 0f 84 84 00 00 00 4c 8d 63 f8 <8b> 43 fc 48 39 6b 10 75 e7 48 83 7b 08 00 74 e0 89 c2 83 e2 fe > > The trace-back triggered by arch_trigger_all_cpu_backtrace() shows the > following two functions in addition on top of link_path_walk(): > > > [<ffffffff811d36c7>] ? __d_lookup_rcu+0x87/0x150 > > [<ffffffff811c5f3b>] ? __inode_permission+0x7b/0xd0 > > [<ffffffff811c6449>] ? link_path_walk+0x69/0x850 > ... > > Has someone seen similar problems with 3.16? > > If you need more info, just ask. > > Please cc: on replies as I'm not subscribed to linux-fsdevel (but LKML) > > Philipp > -- > Philipp Hahn > Open Source Software Engineer > > Univention GmbH > be open. > Mary-Somerville-Str. 1 > D-28359 Bremen > Tel.: +49 421 22232-0 > Fax : +49 421 22232-99 > hahn@xxxxxxxxxxxxx > > http://www.univention.de/ > Geschäftsführer: Peter H. Ganten > HRB 20755 Amtsgericht Bremen > Steuer-Nr.: 71-597-02876 > -- > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to majordomo@xxxxxxxxxxxxxxx > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html