Re: [BUG] Call trace during snapshot start/stop sequence

Dave Chinner <david@xxxxxxxxxxxxx> · Thu, 28 Nov 2013 09:19:23 +1100

On Wed, Nov 27, 2013 at 11:01:43AM +0100, Arkadiusz Bubała wrote:
> Hello,
> 
> we're running test script that starts and stops
> snapshots in a loop while overfilling them.  After a few days of running
> system hangs. We've captured following call trace:
> 
> [116649.755761] XFS (dm-42): metadata I/O error: block 0xfa2b06
> ("xlog_iodone") error 5 buf count 1024
> [116649.947247] XFS (dm-42): Log I/O Error Detected.  Shutting down
> filesystem
> [116650.073881] XFS (dm-42): Please umount the filesystem and rectify
> the problem(s)

So, an EIO error on a log IO, resulting in a shutdown....

> [116650.207186] BUG: unable to handle kernel paging request at
> 00000000000010a8

That's an interesting offset - quite large for a null pointer
dereference.

> [116650.335185] IP: [<ffffffff8102e1d6>] __ticket_spin_lock+0x6/0x20
> [116650.451052] PGD 0
> [116650.518151] Oops: 0002 [#1] SMP
> [116650.599477] CPU 0
> [116650.622838] Modules linked in: iscsi_scst(O) scst_vdisk(O) scst(O)
> drbd(O) twofish_x86_64 twofish_generic twofish_common
> serpent_sse2_x86_64 lrw xts gf1]
> [116651.479730]
> [116651.540674] Pid: 30173, comm: kworker/0:5 Tainted: G           O
> 3.4.63-oe64-00000-g1a33902 #38 Intel Corporation S1200BTL/S1200BTL

Running a custom built 3.4.63 kernel with a bunch of out of tree
modules installed. can you reproduce this on a vanilla 3.12 kernel?

> [116653.923833] Call Trace:
> [116653.995006]  [<ffffffff815f4b45>] ? _raw_spin_lock+0x5/0x10
> [116654.103462]  [<ffffffff812685f2>] ? xlog_state_done_syncing+0x32/0xc0
> [116654.221716]  [<ffffffff81051843>] ? process_one_work+0xf3/0x320
> [116654.333195]  [<ffffffff810534f2>] ? worker_thread+0xe2/0x280
> [116654.441031]  [<ffffffff81053410>] ? gcwq_mayday_timeout+0x80/0x80
> [116654.553512]  [<ffffffff8105776b>] ? kthread+0x9b/0xb0

Which is this line:

STATIC void
xlog_state_done_syncing(
        xlog_in_core_t  *iclog,
        int             aborted)
{
        struct xlog        *log = iclog->ic_log;

        spin_lock(&log->l_icloglock);

So, the icloglock is at offset 296 bytes into the struct xlog, and
the iclog structure is only 256 bytes in size itself, so that
structure offset is way outside anything the code should be trying
to access (ignoring the null pointer issue). Even if we assume that
the 0x1000 bit is a memory corruption, offset 0xa8 lands in a hole
in the struct xlog_in_core, and isin the middle of a bunch of log
size constants in the struct xlog (l_sectBBsize to be exact).

So this doesn't make much sense to me.

BTW, you should compile you kernels with frame pointers enabled so
that the kernel emits stack traces that can be trusted rather than
just dumping a list of symbols found on the stack...

> It looks like a race condition.

Looks more like memory corruption to me....

> Test script source:

I'll see if I can reproduce it locally.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs