On Fri, Oct 28, 2016 at 5:52 PM, Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> wrote:
(switched to email. Please respond via emailed reply-to-all, not via the
bugzilla web interface).
On Mon, 24 Oct 2016 01:27:15 +0000 bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=180101
>
> Bug ID: 180101
> Summary: BUG: unable to handle kernel paging request at x with
> "mm: remove gup_flags FOLL_WRITE games from
> __get_user_pages()"
> Product: Memory Management
> Version: 2.5
> Kernel Version: 4.8.4
> Hardware: x86-64
> OS: Linux
> Tree: Mainline
> Status: NEW
> Severity: high
> Priority: P1
> Component: Other
> Assignee: akpm@xxxxxxxxxxxxxxxxxxxx
> Reporter: joe.yasi@xxxxxxxxx
> Regression: No
>
> After updating to 4.8.3 and 4.8.4, I am having stability issues. I can also
> reproduce them with 4.7.10. This issue does not occur with 4.8.2. I can also
> not reproduce after reverting the security fix
> 89eeba1594ac641a30b91942961e80fae978f839 "mm: remove gup_flags FOLL_WRITE games
> from __get_user_pages()" with 4.8.4.
That's 19be0eaffa3ac7d8eb ("mm: remove gup_flags FOLL_WRITE games from
__get_user_pages()") in the upstream tree.
I seem to recall a fix for that patch went flying past earlier this
week. Perhaps Linus recalls?
19be0eaffa3ac7d8eb has gone into a billion -stable trees so we'll need
to be attentive...
I've been able to reproduce the issue with 19be0eaffa3ac7d8eb ("mm: remove gup_flags FOLL_WRITE games from __get_user_pages()") reverted. I initially suspected that commit because I hadn't seen the issue until 4.8.3, and I also saw it when I tried 4.7.10. At first I wasn't able to reproduce it with 4.8.2, but I since have. This smells like a race condition somewhere. It's possible I just never happened to hit that race before.
The /home partition in question is btrfs on bcache in writethrough mode. The cache drive is a 180 GB Intel SATA SSD, and the backing device is two WD 3 TB SATA HDDs configured in MD RAID 10 with the f2 layout. / is btrfs on an NVMe SSD.
I've also seen btrfs checksum errors in the kernel log when reproducing this. Rebooting and running btrfs scrub finds nothing, though, so it looks like in-memory corruption rather than on-disk corruption.
Thanks,
Joe