On Wed, Nov 27, 2013 at 11:01:43AM +0100, Arkadiusz Bubała wrote: > Hello, > > we're running test script that starts and stops > snapshots in a loop while overfilling them. After a few days of running > system hangs. We've captured following call trace: > > [116649.755761] XFS (dm-42): metadata I/O error: block 0xfa2b06 > ("xlog_iodone") error 5 buf count 1024 > [116649.947247] XFS (dm-42): Log I/O Error Detected. Shutting down > filesystem > [116650.073881] XFS (dm-42): Please umount the filesystem and rectify > the problem(s) So, an EIO error on a log IO, resulting in a shutdown.... > [116650.207186] BUG: unable to handle kernel paging request at > 00000000000010a8 That's an interesting offset - quite large for a null pointer dereference. > [116650.335185] IP: [<ffffffff8102e1d6>] __ticket_spin_lock+0x6/0x20 > [116650.451052] PGD 0 > [116650.518151] Oops: 0002 [#1] SMP > [116650.599477] CPU 0 > [116650.622838] Modules linked in: iscsi_scst(O) scst_vdisk(O) scst(O) > drbd(O) twofish_x86_64 twofish_generic twofish_common > serpent_sse2_x86_64 lrw xts gf1] > [116651.479730] > [116651.540674] Pid: 30173, comm: kworker/0:5 Tainted: G O > 3.4.63-oe64-00000-g1a33902 #38 Intel Corporation S1200BTL/S1200BTL Running a custom built 3.4.63 kernel with a bunch of out of tree modules installed. can you reproduce this on a vanilla 3.12 kernel? > [116653.923833] Call Trace: > [116653.995006] [<ffffffff815f4b45>] ? _raw_spin_lock+0x5/0x10 > [116654.103462] [<ffffffff812685f2>] ? xlog_state_done_syncing+0x32/0xc0 > [116654.221716] [<ffffffff81051843>] ? process_one_work+0xf3/0x320 > [116654.333195] [<ffffffff810534f2>] ? worker_thread+0xe2/0x280 > [116654.441031] [<ffffffff81053410>] ? gcwq_mayday_timeout+0x80/0x80 > [116654.553512] [<ffffffff8105776b>] ? kthread+0x9b/0xb0 Which is this line: STATIC void xlog_state_done_syncing( xlog_in_core_t *iclog, int aborted) { struct xlog *log = iclog->ic_log; spin_lock(&log->l_icloglock); So, the icloglock is at offset 296 bytes into the struct xlog, and the iclog structure is only 256 bytes in size itself, so that structure offset is way outside anything the code should be trying to access (ignoring the null pointer issue). Even if we assume that the 0x1000 bit is a memory corruption, offset 0xa8 lands in a hole in the struct xlog_in_core, and isin the middle of a bunch of log size constants in the struct xlog (l_sectBBsize to be exact). So this doesn't make much sense to me. BTW, you should compile you kernels with frame pointers enabled so that the kernel emits stack traces that can be trusted rather than just dumping a list of symbols found on the stack... > It looks like a race condition. Looks more like memory corruption to me.... > Test script source: I'll see if I can reproduce it locally. Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx _______________________________________________ xfs mailing list xfs@xxxxxxxxxxx http://oss.sgi.com/mailman/listinfo/xfs