Re: process hangs in ext4_sync_file

Sandeep Joshi <sanjos100@xxxxxxxxx> · Tue, 29 Oct 2013 20:30:19 +0530

On Tue, Oct 29, 2013 at 8:16 PM, Jan Kara <jack@xxxxxxx> wrote:
> On Tue 29-10-13 11:00:25, Sandeep Joshi wrote:
>> On Wed, Oct 23, 2013 at 8:28 PM, Sandeep Joshi <sanjos100@xxxxxxxxx> wrote:
>> > On Wed, Oct 23, 2013 at 3:50 PM, Jan Kara <jack@xxxxxxx> wrote:
>> > > On Mon 21-10-13 18:09:02, Sandeep Joshi wrote:
>> > >> I am seeing a problem reported 4 years earlier
>> > >> https://lkml.org/lkml/2009/3/12/226
>> > >> (same stack as seen by Alexander)
>> > >>
>> > >> The problem is reproducible.  Let me know if you need any info in
>> > >> addition to that seen below.
>> > >>
>> > >> I have multiple threads in a process doing heavy IO on an ext4
>> > >> filesystem mounted with (discard, noatime) on an SSD or HDD.
>> > >>
>> > >> This is on Linux 3.8.0-29-generic #42~precise1-Ubuntu SMP Wed Aug 14
>> > >> 16:19:23 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
>> > >>
>> > >> For up to minutes at a time, one of the threads seems to hang in sync to
>> > >> disk.
>> > >>
>> > >> When I check the thread stack in /proc, I find that the stack is one
>> > >> of the following two
>> > >>
>> > >> [<ffffffff81134a4e>] sleep_on_page+0xe/0x20
>> > >> [<ffffffff81134c88>] wait_on_page_bit+0x78/0x80
>> > >> [<ffffffff81134d9c>] filemap_fdatawait_range+0x10c/0x1a0
>> > >> [<ffffffff811367d8>] filemap_write_and_wait_range+0x68/0x80
>> > >> [<ffffffff81236a4f>] ext4_sync_file+0x6f/0x2b0
>> > >> [<ffffffff811cba9b>] vfs_fsync+0x2b/0x40
>> > >> [<ffffffff81168fb3>] sys_msync+0x143/0x1d0
>> > >> [<ffffffff816fc8dd>] system_call_fastpath+0x1a/0x1f
>> > >> [<ffffffffffffffff>] 0xffffffffffffffff
>> > >>
>> > >>
>> > >> OR
>> > >>
>> > >>
>> > >> [<ffffffff812947f5>] jbd2_log_wait_commit+0xb5/0x130
>> > >> [<ffffffff81297213>] jbd2_complete_transaction+0x53/0x90
>> > >> [<ffffffff81236bcd>] ext4_sync_file+0x1ed/0x2b0
>> > >> [<ffffffff811cba9b>] vfs_fsync+0x2b/0x40
>> > >> [<ffffffff81168fb3>] sys_msync+0x143/0x1d0
>> > >> [<ffffffff816fc8dd>] system_call_fastpath+0x1a/0x1f
>> > >> [<ffffffffffffffff>] 0xffffffffffffffff
>> > >>
>> > >> Any clues?
>> > >   We are waiting for IO to complete. As a first step, try to remount
>> > > your filesystem without the 'discard' mount option. That often causes
>> > > problems.
>> > >
>> > >                                                                 Honza
>> >
>>
>>
>> Update: I removed the "discard" option as suggested and I don't see
>> processes hanging in ext4_sync_file anymore.  I also tried ext2 and saw no
>> problems there either.
>>
>> So isn't the "discard" option recommended for SSDs?  Is this a known
>> problem with ext4?
>   No, it isn't really recommended for ordinary SSDs. If you have one of
> those fancy PCIe attached SSDs, 'discard' option might be useful for you
> but for usual SATA attached ones it's usually a disaster. There you might
> be better off running 'fstrim' command once a week or something like that.
>
>                                                                 Honza

Could you briefly point out what problematic code paths come into play
when the "discard" option is enabled?    I want to read the code to
understand the problem better.
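
For anyone else checking their own machines, here is a minimal sketch of
how one might list the ext4 filesystems still mounted with "discard".
The function name ext4_discard_mounts is made up for illustration; it
only assumes the usual /proc/mounts layout (device, mount point, fs type,
comma-separated options).

```python
def ext4_discard_mounts(mounts_text):
    """Return (device, mountpoint) pairs for ext4 mounts that use 'discard'.

    mounts_text is expected to look like the contents of /proc/mounts:
    one mount per line, whitespace-separated fields, options in field 4.
    """
    hits = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mountpoint, fstype, options = fields[:4]
        # Match the exact option name, not a substring such as 'nodiscard'.
        if fstype == "ext4" and "discard" in options.split(","):
            hits.append((device, mountpoint))
    return hits
```

On a live system it could be called as
ext4_discard_mounts(open("/proc/mounts").read()); an empty list would
mean no ext4 filesystem is currently mounted with "discard".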

--Sandeep

> --
> Jan Kara <jack@xxxxxxx>
> SUSE Labs, CR