Re: [syzbot] WARNING in ext4_dirty_folio

syzbot <syzbot+ecab51a4a5b9f26eeaa1@xxxxxxxxxxxxxxxxxxxxxxxxx> · Sat, 29 Apr 2023 14:47:58 -0700

> #syz set subsystems: mm

Your commands are accepted, but please keep syzkaller-bugs@xxxxxxxxxxxxxxxx mailing list in CC next time. It serves as a history of what happened with each bug report. Thank you.

>
> On Wed, Jun 08, 2022 at 04:36:20AM -0700, syzbot wrote:
>> syzbot has found a reproducer for the following issue on:
>> 
>> HEAD commit:    cf67838c4422 selftests net: fix bpf build error
>> git tree:       net
>> console+strace: https://syzkaller.appspot.com/x/log.txt?x=123c2173f00000
>> kernel config:  https://syzkaller.appspot.com/x/.config?x=fc5a30a131480a80
>> dashboard link: https://syzkaller.appspot.com/bug?extid=ecab51a4a5b9f26eeaa1
>> compiler:       gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2
>> syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=1342d5abf00000
>> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=11ecafebf00000
>
> The root cause of this failure is a fundamental bug / design flaw in
> get_user_pages and related functions, which file system developers
> have been complaining about for literally **years**.  See the recent
> discussion at [1] and going back earlier to 2018[2][3] and 2019[4].
>
> [1] https://lore.kernel.org/all/6b73e692c2929dc4613af711bdf92e2ec1956a66.1682638385.git.lstoakes@xxxxxxxxx/
> [2] https://lwn.net/Articles/753027/
> [3] https://lwn.net/Articles/774411/
> [4] https://lwn.net/Articles/784574/
>
> I'm going to reassign this to the mm subsystem, since there's not much
> we can do on the file system end.  The WARNING is considered a good
> thing because users can see silent data corruption/loss if they use
> process_vm_writev() or RDMA to write to memory backed by a file.  And
> while most users at large hyperscale scientific compute farms probably
> won't be paying attention to the system logs, at least we've done
> something to warn them.
>
> Fortunately, data corruption is rare (though when it happens it could
> really screw with your results!).  Still, if they are doing some
> large-scale simulation to evaluate the safety of nuclear weapons (for
> example), it would be nice if they got at least some hint.
>
> There is a potential solution discussed at [1], but there is pushback
> since it could break users by disallowing the thing that might cause
> data corruption.  While breaking user applications is bad, turning a
> possible silent data corruption into a very visible, hard failure is
> arguably a good thing....
>
> 						- Ted
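
For readers unfamiliar with the access pattern described above, here is a minimal sketch of a process_vm_writev() call whose destination is a MAP_SHARED file mapping. It is not the syzbot reproducer and is not expected to trigger the WARNING on its own; it only shows the kind of write that reaches file-backed pages through the get_user_pages family rather than through an ordinary store or write(2). The function name write_into_file_mapping() and the scratch path are hypothetical, chosen for illustration, and the process targets its own pid to keep the example self-contained.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the original thread).
 * Maps one page of `path` MAP_SHARED and writes into that mapping with
 * process_vm_writev(), targeting the calling process itself.
 *
 * Returns  0 if the bytes landed in the mapping,
 *          1 if process_vm_writev() is unavailable or not permitted
 *            (for example, filtered by a sandbox),
 *         -1 on any other error.
 */
int write_into_file_mapping(const char *path)
{
    const size_t len = 4096;
    char payload[] = "written via process_vm_writev";
    int rc = -1;

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        return -1;
    }

    /* File-backed, shared mapping: dirty pages must eventually be written back. */
    char *map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return -1;
    }

    struct iovec local  = { .iov_base = payload, .iov_len = sizeof(payload) };
    struct iovec remote = { .iov_base = map,     .iov_len = sizeof(payload) };

    /* The kernel copies the data into the destination pages on our behalf. */
    ssize_t n = process_vm_writev(getpid(), &local, 1, &remote, 1, 0);
    if (n == (ssize_t)sizeof(payload) &&
        memcmp(map, payload, sizeof(payload)) == 0)
        rc = 0;
    else if (n < 0 && (errno == EPERM || errno == ENOSYS))
        rc = 1;

    munmap(map, len);
    close(fd);
    unlink(path);
    return rc;
}
```

Calling write_into_file_mapping("/tmp/gup_demo.bin") from a small main() is enough to exercise it. Whether the backing file system is affected depends on where the path lives; the sketch makes no attempt to race with writeback, which is what the real failure requires.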