On Thu, Nov 28, 2013 at 09:19:23AM +1100, Dave Chinner wrote: > On Wed, Nov 27, 2013 at 11:01:43AM +0100, Arkadiusz Bubała wrote: > > Hello, > > > > we're running test script that starts and stops > > snapshots in a loop while overfilling them. After a few days of running > > system hangs. We've captured following call trace: > > > > [116649.755761] XFS (dm-42): metadata I/O error: block 0xfa2b06 > > ("xlog_iodone") error 5 buf count 1024 > > [116649.947247] XFS (dm-42): Log I/O Error Detected. Shutting down > > filesystem > > [116650.073881] XFS (dm-42): Please umount the filesystem and rectify > > the problem(s) > > So, an EIO error on a log IO, resulting in a shutdown.... > > > [116650.207186] BUG: unable to handle kernel paging request at > > 00000000000010a8 > > That's an interesting offset - quite large for a null pointer > dereference. > > > [116650.335185] IP: [<ffffffff8102e1d6>] __ticket_spin_lock+0x6/0x20 > > [116650.451052] PGD 0 > > [116650.518151] Oops: 0002 [#1] SMP > > [116650.599477] CPU 0 > > [116650.622838] Modules linked in: iscsi_scst(O) scst_vdisk(O) scst(O) > > drbd(O) twofish_x86_64 twofish_generic twofish_common > > serpent_sse2_x86_64 lrw xts gf1] > > [116651.479730] > > [116651.540674] Pid: 30173, comm: kworker/0:5 Tainted: G O > > 3.4.63-oe64-00000-g1a33902 #38 Intel Corporation S1200BTL/S1200BTL > > Running a custom built 3.4.63 kernel with a bunch of out of tree > modules installed. can you reproduce this on a vanilla 3.12 kernel? > > > [116653.923833] Call Trace: > > [116653.995006] [<ffffffff815f4b45>] ? _raw_spin_lock+0x5/0x10 > > [116654.103462] [<ffffffff812685f2>] ? xlog_state_done_syncing+0x32/0xc0 > > [116654.221716] [<ffffffff81051843>] ? process_one_work+0xf3/0x320 > > [116654.333195] [<ffffffff810534f2>] ? worker_thread+0xe2/0x280 > > [116654.441031] [<ffffffff81053410>] ? gcwq_mayday_timeout+0x80/0x80 > > [116654.553512] [<ffffffff8105776b>] ? kthread+0x9b/0xb0 > > Which is this line: > > STATIC void > xlog_state_done_syncing( > xlog_in_core_t *iclog, > int aborted) > { > struct xlog *log = iclog->ic_log; > > spin_lock(&log->l_icloglock); > > So, the icloglock is at offset 296 bytes into the struct xlog, and > the iclog structure is only 256 bytes in size itself, so that > structure offset is way outside anything the code should be trying > to access (ignoring the null pointer issue). Even if we assume that > the 0x1000 bit is a memory corruption, offset 0xa8 lands in a hole > in the struct xlog_in_core, and isin the middle of a bunch of log > size constants in the struct xlog (l_sectBBsize to be exact). > > So this doesn't make much sense to me. > > BTW, you should compile you kernels with frame pointers enabled so > that the kernel emits stack traces that can be trusted rather than > just dumping a list of symbols found on the stack... > > > It looks like a race condition. > > Looks more like memory corruption to me.... > > > Test script source: > > I'll see if I can reproduce it locally. The script is full of bugs, and i don't have time to debug it - it hard codes /dev/sda in places despite taking the device as a CLI parameter. It has hard coded mount points. It sometimes fails to make the filesystem on the base LV after it's been created. start_snap() appears to fail for some reason, as it doesn't result in mounted snapshots. stop_snap fails as well: Starting snap19 : Thursday 28 November 10:01:26 EST 2013 Logical volume lv1+snap19 converted to snapshot. [ FAIL ] Can't umount snapshot [ FAIL ] Can't remove snapshot [ FAIL ] Can't umount snapshot [ FAIL ] Can't remove snapshot [ FAIL ] Can't umount snapshot [ FAIL ] Can't remove snapshot [ FAIL ] Can't umount snapshot [ FAIL ] Can't remove snapshot [ FAIL ] Can't umount snapshot [ FAIL ] Can't remove snapshot [ FAIL ] Can't umount snapshot [ FAIL ] Can't remove snapshot [ FAIL ] Can't umount snapshot [ FAIL ] Can't remove snapshot [ OK ] lv1+snap19 activated. Starting time : 37 s. --------------------------- Stopping snap0 : Thursday 28 November 10:02:06 EST 2013 [ FAIL ] Can't umount snapshot [ FAIL ] Can't remove snapshot [ FAIL ] lv0+snap00 still active !!! [ OK ] lv0+snap00 umounted. Stopping time : 0 s. I've got no idea is this is intended behaviour, but it sure doesn't seem right to me... Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx _______________________________________________ xfs mailing list xfs@xxxxxxxxxxx http://oss.sgi.com/mailman/listinfo/xfs