On 09/26/2013 02:49 PM, Jeff Liu wrote: > From: Jie Liu <jeff.liu@xxxxxxxxxx> > > Write a file with an offset greater than 16TB on 32-bit system and > then trigger page write-back via sync(1) as below will cause the > task hang in a little while: > > INFO: task sync:2590 blocked for more than 120 seconds. > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > sync D c1064a28 0 2590 2097 0x00000000 > ..... > Call Trace: > [<c1064a28>] ? ttwu_do_wakeup+0x18/0x130 > [<c1066d0e>] ? try_to_wake_up+0x1ce/0x220 > [<c1066dbf>] ? wake_up_process+0x1f/0x40 > [<c104fc2e>] ? wake_up_worker+0x1e/0x30 > [<c15b6083>] schedule+0x23/0x60 > [<c15b3c2d>] schedule_timeout+0x18d/0x1f0 > [<c12a143e>] ? do_raw_spin_unlock+0x4e/0x90 > [<c10515f1>] ? __queue_delayed_work+0x91/0x150 > [<c12a12ef>] ? do_raw_spin_lock+0x3f/0x100 > [<c12a143e>] ? do_raw_spin_unlock+0x4e/0x90 > [<c15b5b5d>] wait_for_completion+0x7d/0xc0 > [<c1066d60>] ? try_to_wake_up+0x220/0x220 > [<c116a4d2>] sync_inodes_sb+0x92/0x180 > [<c116fb05>] sync_inodes_one_sb+0x15/0x20 > [<c114a8f8>] iterate_supers+0xb8/0xc0 > [<c116faf0>] ? fdatawrite_one_bdev+0x20/0x20 > [<c116fc21>] sys_sync+0x31/0x80 > [<c15be18d>] sysenter_do_call+0x12/0x28 > > The reason is that the end_index is unsigned long with maximum value > '2^32-1=4294967295' on 32-bit platform, and the given offset cause it > wrapped to 0, so that the following codes will repeat again and again > until the task schedule time out: > > end_index = offset >> PAGE_CACHE_SHIFT; > last_index = (offset - 1) >> PAGE_CACHE_SHIFT; > if (page->index >= end_index) { > unsigned offset_into_page = offset & (PAGE_CACHE_SIZE - 1); > /* > * Just skip the page if it is fully outside i_size, e.g. due > * to a truncate operation that is in progress. > */ > if (page->index >= end_index + 1 || offset_into_page == 0) { > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > unlock_page(page); > return 0; > } > > To check a page is fully outsids i_size or not, we can change the > logic to: > if (page->index > end_index || > (page->index == end_index && offset_into_page == 0)) > > Secondly, there still has another similar issue when calculating the > end offset for mapping the filesystem blocks to the file blocks for > delalloc. With the same tests to above, run unmount(8) will cause > kernel panic if CONFIG_XFS_DEBUG is enabled: > > XFS: Assertion failed: XFS_FORCED_SHUTDOWN(ip->i_mount) || \ > ip->i_delayed_blks == 0, file: fs/xfs/xfs_super.c, line: 964 > > kernel BUG at fs/xfs/xfs_message.c:108! > invalid opcode: 0000 [#1] SMP > task: edddc100 ti: ec6ee000 task.ti: ec6ee000 > EIP: 0060:[<f83d87cb>] EFLAGS: 00010296 CPU: 1 > EIP is at assfail+0x2b/0x30 [xfs] > .............. > Call Trace: > [<f83d9cd4>] xfs_fs_destroy_inode+0x74/0x120 [xfs] > [<c115ddf1>] destroy_inode+0x31/0x50 > [<c115deff>] evict+0xef/0x170 > [<c115dfb2>] dispose_list+0x32/0x40 > [<c115ea3a>] evict_inodes+0xca/0xe0 > [<c1149706>] generic_shutdown_super+0x46/0xd0 > [<c11497b9>] kill_block_super+0x29/0x70 > [<c1149a14>] deactivate_locked_super+0x44/0x70 > [<c114a427>] deactivate_super+0x47/0x60 > [<c1161c3d>] mntput_no_expire+0xcd/0x120 > [<c1162ae8>] SyS_umount+0xa8/0x370 > [<c1162dce>] SyS_oldumount+0x1e/0x20 > [<c15be18d>] sysenter_do_call+0x12/0x28 > > That because the end_offset is evaluated to 0 same to above, hence the > mapping and covertion for dealloc file blocks to file system blocks did > not happened. > > This patch just fixed both issues. > > Reported-by: Michael L. Semon <mlsemon35@xxxxxxxxx> > Signed-off-by: Jie Liu <jeff.liu@xxxxxxxxxx> > --- > v3: Add code comments to reflect this change. > v2: don't reset the s_max_bytes to MAX_LFS_FILESIZE, instead, revise the page offset > check up strategy to avoid the potential overflow. > v1: http://oss.sgi.com/archives/xfs/2013-07/msg00154.html > > fs/xfs/xfs_aops.c | 49 +++++++++++++++++++++++++++++++++++++++++++------ > 1 file changed, 43 insertions(+), 6 deletions(-) > > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c > index e51e581..541c837 100644 > --- a/fs/xfs/xfs_aops.c > +++ b/fs/xfs/xfs_aops.c > @@ -960,7 +960,32 @@ xfs_vm_writepage( > offset = i_size_read(inode); > end_index = offset >> PAGE_CACHE_SHIFT; > last_index = (offset - 1) >> PAGE_CACHE_SHIFT; > - if (page->index >= end_index) { > + > + /* > + * The page index is less than the end_index, adjust the end_offset > + * to the highest offset that this page should represent. > + * ----------------------------------------------------- > + * | file mapping | <EOF> | > + * ----------------------------------------------------- > + * | Page ... | Page N-2 | Page N-1 | Page N | | > + * ^--------------------------------^----------|-------- > + * | desired writeback range | see else | > + * ---------------------------------^------------------| > + */ > + if (page->index < end_index) > + end_offset = (xfs_off_t)(page->index + 1) << PAGE_CACHE_SHIFT; > + else { > + /* > + * Check whether the page to write out is beyond or straddles > + * i_size or not. > + * ------------------------------------------------------- > + * | file mapping | <EOF> | > + * ------------------------------------------------------- > + * | Page ... | Page N-2 | Page N-1 | Page N | Beyond | > + * ^--------------------------------^-----------|--------- > + * | | Straddles | > + * ---------------------------------^-----------|--------| > + */ > unsigned offset_into_page = offset & (PAGE_CACHE_SIZE - 1); > > /* > @@ -968,24 +993,36 @@ xfs_vm_writepage( > * truncate operation that is in progress. We must redirty the > * page so that reclaim stops reclaiming it. Otherwise > * xfs_vm_releasepage() is called on it and gets confused. > + * > + * Note that the end_index is unsigned long, it would cause it > + * overflow if the given offset is greater than 16TB on 32-bit > + * system if we do check the page is fully outside i_size or not > + * via "if (page->index >= end_index + 1)" as "end_idex + 1" > + * will be evaluated to 0. Hence this page will be redirtied > + * and be written out again and again which would result in > + * dead loop, the user program that perform this operation will ^^^^^^^^^^^^^ Should I say this is an infinite or endless loop rather than dead loop, although I was intended to comments the same behavior, but someone was kindly reminded me of that the infinite loop is more suitable in English world just now. Thanks, -Jeff > + * hang. Instead, we can verify this situation by checking if > + * the page to write if totally beyond the i_size or if it's > + * offset just equal to the EOF. > */ > - if (page->index >= end_index + 1 || offset_into_page == 0) > + if (page->index > end_index || > + (page->index == end_index && offset_into_page == 0)) > goto redirty; > > /* > * The page straddles i_size. It must be zeroed out on each > * and every writepage invocation because it may be mmapped. > * "A file is mapped in multiples of the page size. For a file > - * that is not a multiple of the page size, the remaining > + * that is not a multiple of the page size, the remaining > * memory is zeroed when mapped, and writes to that region are > * not written out to the file." > */ > zero_user_segment(page, offset_into_page, PAGE_CACHE_SIZE); > + > + /* Adjust the end_offset to the end of file */ > + end_offset = offset; > } > > - end_offset = min_t(unsigned long long, > - (xfs_off_t)(page->index + 1) << PAGE_CACHE_SHIFT, > - offset); > len = 1 << inode->i_blkbits; > > bh = head = page_buffers(page); _______________________________________________ xfs mailing list xfs@xxxxxxxxxxx http://oss.sgi.com/mailman/listinfo/xfs