Fredrik Andersson wrote:
Hi, here is the data for this process:
Including all of the processes in D state (everything reported by
sysrq-w) would probably be most helpful.
Feel free to file an ext4 bug on bugzilla.kernel.org w/ this
information, too, so it doesn't get lost in busy schedules ...
Thanks,
-Eric
[5958816.744013] drdbmake D ffff88021e4c7800 0 27019 13796
[5958816.744013] ffff8801d1bcda88 0000000000000082 ffff8801f4ce9bf0 ffff8801678b1380
[5958816.744013] 0000000000010e80 000000000000c748 ffff8800404963c0 ffffffff81526360
[5958816.744013] ffff880040496730 00000000f4ce9bf0 000000025819cebe 0000000000000282
[5958816.744013] Call Trace:
[5958816.744013] [<ffffffff813a9639>] schedule+0x9/0x20
[5958816.744013] [<ffffffff81177ea5>] start_this_handle+0x365/0x5d0
[5958816.744013] [<ffffffff8105b900>] ? autoremove_wake_function+0x0/0x40
[5958816.744013] [<ffffffff811781ce>] jbd2_journal_restart+0xbe/0x150
[5958816.744013] [<ffffffff8116243d>] ext4_ext_truncate+0x6dd/0xa20
[5958816.744013] [<ffffffff81095b3b>] ? find_get_pages+0x3b/0xf0
[5958816.744013] [<ffffffff81150a78>] ext4_truncate+0x198/0x680
[5958816.744013] [<ffffffff810ac984>] ? unmap_mapping_range+0x74/0x280
[5958816.744013] [<ffffffff811772c0>] ? jbd2_journal_stop+0x1e0/0x360
[5958816.744013] [<ffffffff810acd25>] vmtruncate+0xa5/0x110
[5958816.744013] [<ffffffff810dda10>] inode_setattr+0x30/0x180
[5958816.744013] [<ffffffff8114d073>] ext4_setattr+0x173/0x310
[5958816.744013] [<ffffffff810ddc79>] notify_change+0x119/0x330
[5958816.744013] [<ffffffff810c6df3>] do_truncate+0x63/0x90
[5958816.744013] [<ffffffff810d0cc3>] ? get_write_access+0x23/0x60
[5958816.744013] [<ffffffff810c70cb>] sys_truncate+0x17b/0x180
[5958816.744013] [<ffffffff8100bfab>] system_call_fastpath+0x16/0x1b
I don't know if this has anything to do with it, but I also noticed that
another process of mine, which is working just fine, is executing a
suspicious-looking function called raid0_unplug. It operates on the same
raid0/ext4 filesystem as the hung process. I include the call trace for it
here too:
[5958816.744013] nodeserv D ffff880167bd7ca8 0 17900 13796
[5958816.744013] ffff880167bd7bf8 0000000000000082 ffff88002800a588 ffff88021e5b56e0
[5958816.744013] 0000000000010e80 000000000000c748 ffff880100664020 ffffffff81526360
[5958816.744013] ffff880100664390 000000008119bd17 000000026327bfa9 0000000000000002
[5958816.744013] Call Trace:
[5958816.744013] [<ffffffffa0039291>] ? raid0_unplug+0x51/0x70 [raid0]
[5958816.744013] [<ffffffff813a9639>] schedule+0x9/0x20
[5958816.744013] [<ffffffff813a9687>] io_schedule+0x37/0x50
[5958816.744013] [<ffffffff81095e35>] sync_page+0x35/0x60
[5958816.744013] [<ffffffff81095e69>] sync_page_killable+0x9/0x50
[5958816.744013] [<ffffffff813a99d2>] __wait_on_bit_lock+0x52/0xb0
[5958816.744013] [<ffffffff81095e60>] ? sync_page_killable+0x0/0x50
[5958816.744013] [<ffffffff81095d74>] __lock_page_killable+0x64/0x70
[5958816.744013] [<ffffffff8105b940>] ? wake_bit_function+0x0/0x40
[5958816.744013] [<ffffffff81095c0b>] ? find_get_page+0x1b/0xb0
[5958816.744013] [<ffffffff81097908>] generic_file_aio_read+0x3b8/0x6b0
[5958816.744013] [<ffffffff810c7dc1>] do_sync_read+0xf1/0x140
[5958816.744013] [<ffffffff8106a5e8>] ? do_futex+0xb8/0xb20
[5958816.744013] [<ffffffff813ab78f>] ? _spin_unlock_irqrestore+0x2f/0x40
[5958816.744013] [<ffffffff8105b900>] ? autoremove_wake_function+0x0/0x40
[5958816.744013] [<ffffffff8105bc73>] ? add_wait_queue+0x43/0x60
[5958816.744013] [<ffffffff81062a6c>] ? getnstimeofday+0x5c/0xf0
[5958816.744013] [<ffffffff810c85b8>] vfs_read+0xc8/0x170
[5958816.744013] [<ffffffff810c86fa>] sys_pread64+0x9a/0xa0
[5958816.744013] [<ffffffff8100bfab>] system_call_fastpath+0x16/0x1b
Hope this makes sense to anyone, and please let me know if there is
more info I can provide.
/Fredrik
On Sun, Oct 18, 2009 at 5:57 PM, Eric Sandeen <sandeen@xxxxxxxxxx> wrote:
Fredrik Andersson wrote:
Hi, I'd like to report what I'm fairly certain is an ext4 bug. I hope
this is the right place to do so.
My program creates a big file (around 30 GB) with posix_fallocate (to
utilize extents), fills it with data, and uses ftruncate to crop it to
its final size (usually somewhere between 20 and 25 GB).
The problem is that in around 5% of the cases, the program locks up
completely in a syscall. The process can thus not be killed even with
kill -9, and a reboot is all that will do.
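For reference, the allocate/fill/truncate sequence described above looks roughly
like the minimal sketch below (the file name, sizes, fill step, and error
handling are illustrative placeholders, not copied from the actual program):

#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        /* Illustrative sizes only; the real program preallocates ~30 GB and
           later crops to a final size of 20-25 GB. */
        const off_t alloc_size = 30LL * 1024 * 1024 * 1024;
        const off_t final_size = 22LL * 1024 * 1024 * 1024;

        /* "bigfile.dat" is a placeholder name. */
        int fd = open("bigfile.dat", O_CREAT | O_RDWR | O_TRUNC, 0644);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        /* Preallocate the whole range up front so ext4 lays the file out
           using extents. posix_fallocate() returns an errno value directly. */
        int err = posix_fallocate(fd, 0, alloc_size);
        if (err != 0) {
                fprintf(stderr, "posix_fallocate: %s\n", strerror(err));
                close(fd);
                return 1;
        }

        /* ... fill the file with data via write()/pwrite() here ... */

        /* Crop the preallocated file down to its final size; this is the
           truncate step during which the process occasionally hangs. */
        if (ftruncate(fd, final_size) != 0) {
                perror("ftruncate");
                close(fd);
                return 1;
        }

        close(fd);
        return 0;
}

The hung trace earlier in this message shows the truncate path
(sys_truncate -> ext4_truncate -> ext4_ext_truncate) blocked in jbd2's
start_this_handle, which appears to correspond to this final crop step.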
Does echo w > /proc/sysrq-trigger (this dumps sleeping processes; use echo t for all processes) show you where the stuck threads are?
-Eric