Re: XFS kernel bug 3.16.7 assertion failed xfs_log_cil.c

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Mon, May 16, 2016 at 03:04:28PM +0000, Rich, Jason wrote:
> Gents,
> 
> I've been seeing some very strange lockups on XFS recently in our deployment and I'm looking for some help directing me to either a workaround or solution.  
> When I say 'lockup':  the process that is trying to do I/O to the XFS volume appears to be hung.  I have not had a chance to play with a system in this state yet, as the person who identified the issue just rebooted the machine instead of attempting to diagnose further.  So I'm not sure of the actual process state when this occurs.
> 

Are you running with XFS_DEBUG enabled in the kernel config? If so, this
mode converts any assertion failure to an explicit BUG(), which is
effectively a kernel crash. You probably don't want to run a production
system with XFS_DEBUG enabled unless you're actually trying to debug
something.

> A little background on the environment first:
> - We deploy a product running kernel 3.16.7 (debian 8's kernel), xfsprogs 4.3.0, and large storage arrays (40Tb x 14 raid volumes). 
> - We have large data (PB) and high throughput (2+ GB/sec) using async IO to several large volumes. 
> - I think it is important to note that these servers (Dell R720s & R730s) are actually net-booting and tend to be ungracefully power cycled at any time which we try to mitigate at all costs.  That said, the systems seem to run fine for days before hitting this issue.
> - The below error occurred on two different machines about 1 day apart.  I only copy/pasted one of them here.  The system was recovered via reboot so I don't have all the debugging you might hope for.
> - The volumes are configured in a raid 6 using NLSAS drives with 128k chunk size.
> 
> The error:
> 2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.788188] XFS: Assertion failed: !list_empty(&cil->xc_cil), file: fs/xfs/xfs_log_cil.c, line: 666
> 2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.799247] ------------[ cut here ]------------

You're on an older kernel and I don't recall how far back the original
issue goes, but this looks familiar to 4703da7b7877 ("xfs: close xc_cil
list_empty() races with cil commit sequence") from v4.3. Can you try
with that patch applied (it's basically a one-liner)?

Brian

> 2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.804862] kernel BUG at fs/xfs/xfs_message.c:107!
> 2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.810773] invalid opcode: 0000 [#1] PREEMPT SMP
> 2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.816603] Modules linked in: mst_pciconf(O) mst_pci(O) tun ib_uverbs(O) fuse xfs libcrc32c mptctl mptbase ipmi_poweroff mlx4_ib(O) mlx4_en(O) dm_round_robin ib_sa(O) ib_mad(O) ib_core(O) ib_addr(O) mlx4_core(O) mlx_compat(O) mpt2sas(O) ahci megaraid_sas libahci scsi_transport_sas bridge stp llc bonding igb(O) ptp pps_core usb_storage dm_multipath dm_mod edd ipmi_devintf ipmi_si ipmi_msghandler
> 2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.858837] CPU: 6 PID: 59124 Comm: pio_writer_12 Tainted: G           O  3.16.7.TEK.V7.16.1.0.1b11 #1
> 2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.870142] Hardware name: NETSCOUT SSU2000/04N3DF, BIOS 1.3.6 06/03/2015
> 2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.878156] task: ffff882dd397e000 ti: ffff882db3e50000 task.ti: ffff882db3e50000
> 2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.887471] RIP: 0010:[<ffffffffa0497f56>]  [<ffffffffa0497f56>] assfail+0x22/0x28 [xfs]
> 2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.897461] RSP: 0018:ffff882db3e53b38  EFLAGS: 00010292
> 2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.903806] RAX: 0000000000000057 RBX: ffff883e1c7c9e00 RCX: 0000000000000000
> 2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.912227] RDX: ffff882db3e53978 RSI: ffff88407eece1c8 RDI: 0000000000000246
> 2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.920659] RBP: ffff882db3e53b38 R08: 0000000000000000 R09: 0000000000000cc7
> 2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.929032] R10: 0000000000000000 R11: 0000000000000005 R12: ffff883ef69bd800
> 2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.937439] R13: ffff883f768bea40 R14: ffff883f768bea00 R15: ffff882db3e53ba0
> 2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.945858] FS:  00007fffdec51700(0000) GS:ffff88407eec0000(0000) knlGS:0000000000000000
> 2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.955775] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> 2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.962609] CR2: 000000001d593000 CR3: 0000003f7cac3000 CR4: 00000000001407e0
> 2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.971003] Stack:
> 2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.973679]  ffff882db3e53b88 ffffffffa04ea310 00000000b3e53b88 ffff883f768bea00
> 2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.982930]  ffff883d3556b920 ffff883e1c7c9e00 0000000000000041 ffff8840608be000
> 2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.992164]  0000000000000000 0000000000000000 ffff882db3e53bd8 ffffffffa049ef7b
> 2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4124.001348] Call Trace:
> 2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4124.004576]  [<ffffffffa04ea310>] xfs_log_commit_cil+0xcd/0x145 [xfs]
> 2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4124.012232]  [<ffffffffa049ef7b>] xfs_trans_commit+0xc3/0x1fd [xfs]
> 2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4124.019669]  [<ffffffffa0496bba>] xfs_vn_update_time+0x16f/0x181 [xfs]
> 2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4124.027427]  [<ffffffff8114711c>] update_time+0x25/0xac
> 2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4124.033763]  [<ffffffffa047e19a>] ? ftrace_raw_event_xfs_lock_class+0x64/0x73 [xfs]
> 2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4124.043270]  [<ffffffff81147230>] file_update_time+0x8d/0xb0
> 2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4124.050011]  [<ffffffffa048c441>] ? xfs_rw_ilock+0x2c/0x33 [xfs]
> 2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4124.057178]  [<ffffffffa048c940>] xfs_file_aio_write_checks+0xae/0xcf [xfs]
> 2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4124.065414]  [<ffffffffa048ca9d>] xfs_file_dio_aio_write+0x116/0x270 [xfs]
> 2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4124.073593]  [<ffffffffa048dae7>] ? xfs_file_buffered_aio_write+0x176/0x176 [xfs]
> 2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4124.082804]  [<ffffffffa048db61>] xfs_file_write_iter+0x7a/0x105 [xfs]
> 2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4124.090579]  [<ffffffff8116ac40>] aio_run_iocb+0x1dc/0x2dd
> 2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4124.097151]  [<ffffffff8116af02>] io_submit_one+0x1c1/0x201
> 2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4124.103812]  [<ffffffff8116b02c>] do_io_submit+0xea/0x13f
> 2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4124.110326]  [<ffffffff8116b091>] SyS_io_submit+0x10/0x12
> 2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4124.116775]  [<ffffffff815de6e9>] system_call_fastpath+0x16/0x1b
> 2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4124.123899] Code: 45 c8 e8 4a fc ff ff c9 c3 55 48 89 e5 0f 1f 44 00 00 31 c0 48 89 f1 41 89 d0 48 c7 c6 89 84 50 a0 48 89 fa 31 ff e8 89 ff ff ff <0f> 0b eb fe 90 90 55 48 89 e5 53 0f 1f 44 00 00 48 8b 8f c0 02
> 2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4124.147242] RIP  [<ffffffffa0497f56>] assfail+0x22/0x28 [xfs]
> 2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4124.154142]  RSP <ffff882db3e53b38>
> 2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4124.161239] ---[ end trace dbd3bab9608b5697 ]---
> 
> 2016 May 14 17:31:17 g512-ssu-2 kernel: [85687.891368] XFS: Assertion failed: !list_empty(&cil->xc_cil), file: fs/xfs/xfs_log_cil.c, line: 666   <--- appropriate line number
> 2016 May 14 17:31:17 g512-ssu-2 kernel: [85687.902402] ------------[ cut here ]------------
> 
> HOST INFO:
> Proc info:
> processor	: 47
> vendor_id	: GenuineIntel
> cpu family	: 6
> model		: 63
> model name	: Intel(R) Xeon(R) CPU E5-2658 v3 @ 2.20GHz
> 
> Meminfo:
> root@g512-ssu2000-2: ~ # cat /proc/meminfo
> MemTotal:       264516692 kB
> MemFree:        181040604 kB
> MemAvailable:   200066392 kB
> Buffers:              32 kB
> Cached:          3722024 kB
> SwapCached:            0 kB
> Active:         31687480 kB
> Inactive:        3307212 kB
> Active(anon):   31300740 kB
> Inactive(anon):  2534776 kB
> Active(file):     386740 kB
> Inactive(file):   772436 kB
> Unevictable:        5188 kB
> Mlocked:            5188 kB
> SwapTotal:             0 kB
> SwapFree:              0 kB
> Dirty:                 0 kB
> Writeback:             0 kB
> AnonPages:      31303028 kB
> Mapped:          2647080 kB
> Shmem:           2535460 kB
> Slab:           19452480 kB
> SReclaimable:   18120028 kB
> SUnreclaim:      1332452 kB
> KernelStack:       20704 kB
> PageTables:        83504 kB
> NFS_Unstable:          0 kB
> Bounce:                0 kB
> WritebackTmp:          0 kB
> CommitLimit:    125966888 kB
> Committed_AS:   42757972 kB
> VmallocTotal:   34359738367 kB
> VmallocUsed:     1263588 kB
> VmallocChunk:   34358261760 kB
> AnonHugePages:  28125184 kB
> HugePages_Total:    6144
> HugePages_Free:        0
> HugePages_Rsvd:        0
> HugePages_Surp:        0
> Hugepagesize:       2048 kB
> DirectMap4k:       12080 kB
> DirectMap2M:     1986560 kB
> DirectMap1G:    268435456 kB
> 
> root@g512-ssu2000-2: ~ # cat /proc/mounts
> rootfs / rootfs rw,size=132244472k,nr_inodes=33061118 0 0
> mbc:/iris/opt/TEKplat/V7.16.1.0.1b11/x86_64_root / nfs ro,noatime,vers=3,rsize=4096,wsize=4096,namlen=255,acregmin=300,acregmax=300,acdirmin=300,acdirmax=300,hard,nolock,proto=udp,port=200
> none /mnt tmpfs rw,relatime 0 0
> dev /dev tmpfs rw,relatime,size=10240k 0 0
> none /tmp tmpfs rw,relatime 0 0
> proc /proc proc rw,relatime 0 0
> sys /sys sysfs rw,relatime 0 0
> none /local-blade tmpfs rw,relatime 0 0
> mbc:/iris /iris nfs ro,noatime,vers=3,rsize=4096,wsize=4096,namlen=255,acregmin=300,acregmax=300,acdirmin=300,acdirmax=300,hard,nolock,proto=udp,port=2049,timeo=7,retrans=3,sec=sys,local_0
> none /mnt/initramfs tmpfs rw,relatime 0 0
> tmpfs /lib/init/rw tmpfs rw,nosuid,relatime,mode=755 0 0
> tmpfs /dev/shm tmpfs rw,nosuid,nodev,relatime 0 0
> debugfs /sys/kernel/debug debugfs rw,relatime 0 0
> devpts /dev/pts devpts rw,relatime,gid=5,mode=620,ptmxmode=000 0 0
> tek.iris.mbc:/iris/scratch /local-blade/scratch nfs rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,soft,nolock,proto=tcp,timeo=5,retrans=2,sec=sys,mountaddr=192.168.0.1,mountve0
> /dev/mapper/tek_ssu_sdb-SCRATCH /mnt/scratch xfs rw,relatime,attr2,inode64,noquota 0 0
> none /proc/sys/fs/binfmt_misc binfmt_misc rw,relatime 0 0
> nodev /local-blade/hugepages hugetlbfs rw,relatime,mode=1770,gid=1000,size=12G 0 0
> /dev/mapper/3600c0ff0001deaddd676a45501000000 /mnt/storage07 xfs rw,noatime,nodiratime,attr2,nobarrier,inode64,logbsize=128k,sunit=256,swidth=3072,noquota 0 0
> /dev/mapper/3600c0ff0001deaddc378a45501000000 /mnt/storage11 xfs rw,noatime,nodiratime,attr2,nobarrier,inode64,logbsize=128k,sunit=256,swidth=3072,noquota 0 0
> /dev/mapper/3600c0ff0001dcf727774a45501000000 /mnt/storage02 xfs rw,noatime,nodiratime,attr2,nobarrier,inode64,logbsize=128k,sunit=256,swidth=3072,noquota 0 0
> /dev/mapper/3600c0ff0001deadde175a45501000000 /mnt/storage05 xfs rw,noatime,nodiratime,attr2,nobarrier,inode64,logbsize=128k,sunit=256,swidth=3072,noquota 0 0
> /dev/mapper/3600c0ff0001dcf723b7aa45501000000 /mnt/storage14 xfs rw,noatime,nodiratime,attr2,nobarrier,inode64,logbsize=128k,sunit=256,swidth=3072,noquota 0 0
> /dev/mapper/3600c0ff0001dcf724878a45501000000 /mnt/storage10 xfs rw,noatime,nodiratime,attr2,nobarrier,inode64,logbsize=128k,sunit=256,swidth=3072,noquota 0 0
> /dev/mapper/3600c0ff0001deaddfe73a45501000000 /mnt/storage01 xfs rw,noatime,nodiratime,attr2,nobarrier,inode64,logbsize=128k,sunit=256,swidth=3072,noquota 0 0
> /dev/mapper/3600c0ff0001dcf724279a45501000000 /mnt/storage12 xfs rw,noatime,nodiratime,attr2,nobarrier,inode64,logbsize=128k,sunit=256,swidth=3072,noquota 0 0
> /dev/mapper/3600c0ff0001deaddbd79a45501000000 /mnt/storage13 xfs rw,noatime,nodiratime,attr2,nobarrier,inode64,logbsize=128k,sunit=256,swidth=3072,noquota 0 0
> /dev/mapper/3600c0ff0001dcf726875a45501000000 /mnt/storage04 xfs rw,noatime,nodiratime,attr2,nobarrier,inode64,logbsize=128k,sunit=256,swidth=3072,noquota 0 0
> /dev/mapper/3600c0ff0001deaddcb77a45501000000 /mnt/storage09 xfs rw,noatime,nodiratime,attr2,nobarrier,inode64,logbsize=128k,sunit=256,swidth=3072,noquota 0 0
> /dev/mapper/3600c0ff0001deadd9a73a45501000000 /mnt/index01 xfs rw,noatime,nodiratime,attr2,nobarrier,inode64,logbsize=128k,sunit=256,swidth=1280,noquota 0 0
> /dev/mapper/3600c0ff0001dcf72cd73a45501000000 /mnt/index02 xfs rw,noatime,nodiratime,attr2,nobarrier,inode64,logbsize=128k,sunit=256,swidth=1280,noquota 0 0
> /dev/mapper/3600c0ff0001dcf725c76a45501000000 /mnt/storage06 xfs rw,noatime,nodiratime,attr2,nobarrier,inode64,logbsize=128k,sunit=256,swidth=3072,noquota 0 0
> /dev/mapper/3600c0ff0001dcf724f77a45501000000 /mnt/storage08 xfs rw,noatime,nodiratime,attr2,nobarrier,inode64,logbsize=128k,sunit=256,swidth=3072,noquota 0 0
> /dev/mapper/3600c0ff0001deaddee74a45501000000 /mnt/storage03 xfs rw,noatime,nodiratime,attr2,nobarrier,inode64,logbsize=128k,sunit=256,swidth=3072,noquota 0 0
> 
> 
> I am not sure how to effectively reproduce the issue (and to be frank, am far from being knowledgable about the innerds of XFS) and would like to understand the essence of the problem here.  While that rarely leads to root cause, knowing what the failure is trying to tell me would be nice.
> 
> Jason Rich  |  Senior Platform Engineer
> NETSCOUT
> O: 469.330.3678 
> 
> _______________________________________________
> xfs mailing list
> xfs@xxxxxxxxxxx
> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs




[Index of Archives]     [Linux XFS Devel]     [Linux Filesystem Development]     [Filesystem Testing]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux