On 12/22/2015 05:09 PM, Steven Rostedt wrote:
OK, I started with 4.4-rc4 to add some urgent ftrace patches and started testing. My tests started to fail, and then I noticed they failed with v4.4-rc4 as well. I got strange errors. Finally, I noticed that I was constantly getting messages like this:

ata2.00: exception Emask 0x60 SAct 0x7800000 SErr 0x800 action 0x6 frozen
ata2.00: irq_stat 0x20000000, host bus error
ata2: SError: { HostInt }
ata2.00: failed command: WRITE FPDMA QUEUED
ata2.00: cmd 61/00:b8:f3:f2:2e/08:00:0e:00:00/40 tag 23 ncq 1048576 out
         res 40/00:d4:f3:0a:2f/00:00:0e:00:00/40 Emask 0x60 (host bus error)
ata2.00: status: { DRDY }
ata2.00: failed command: WRITE FPDMA QUEUED
ata2.00: cmd 61/00:c0:f3:fa:2e/08:00:0e:00:00/40 tag 24 ncq 1048576 out
         res 40/00:d4:f3:0a:2f/00:00:0e:00:00/40 Emask 0x60 (host bus error)
ata2.00: status: { DRDY }
ata2.00: failed command: WRITE FPDMA QUEUED
ata2.00: cmd 61/00:c8:f3:02:2f/08:00:0e:00:00/40 tag 25 ncq 1048576 out
         res 40/00:d4:f3:0a:2f/00:00:0e:00:00/40 Emask 0x60 (host bus error)
ata2.00: status: { DRDY }
ata2.00: failed command: WRITE FPDMA QUEUED
ata2.00: cmd 61/b8:d0:f3:0a:2f/08:00:0e:00:00/40 tag 26 ncq 1142784 out
         res 40/00:d4:f3:0a:2f/00:00:0e:00:00/40 Emask 0x60 (host bus error)
ata2.00: status: { DRDY }
ata2: hard resetting link
ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
ata2.00: configured for UDMA/100
ata2: EH complete

The test box has a relatively new mobo and such, but I know the HD was old. So I thought that the HD was simply failing. I installed a new HD and spent lots of time since last Thursday trying to set it up to work with my testing scripts. Unfortunately, I installed a newer Fedora that no longer supported the older grub1 and I wasted lots of time trying to get grub2 to do what I wanted. I finally gave up and used syslinux/extlinux and got it working again.

Unfortunately, I still got these ata2 errors! I started thinking that the mobo may be bad. But then I decided to try an older kernel, and the errors never showed up.
I booted back and forth several times and the errors were very reliable. I have multiple OSes on this box, so every time I got an error, I would boot into one of the other OSes and do fsck on the filesystems, because if I ran my tests long enough with this bug, it would eventually start corrupting the ext4 filesystem.

Since it seemed very reliable, I started my bisect. It came down to this patch:

From 578270bfbd2803dc7b0b03fbc2ac119efbc73195 Mon Sep 17 00:00:00 2001
From: Ming Lei <ming.lei@xxxxxxxxxxxxx>
Date: Tue, 24 Nov 2015 10:35:29 +0800
Subject: [PATCH] block: fix segment split

I thought this strange, because I don't see anything wrong with this patch. But if I removed it, the problem went away, and when I added it back, the problem would show up easily.

I checked out v4.4-rc6 and tested again, thinking something else may be wrong and has since been fixed. Nope, the error still showed up. I then removed this commit and tried again. Sure enough, the problem went away!
Probably the other way around, I think: it uncovered an issue with the segment counting for certain cases.
My guess is that there's another bug lurking around somewhere, and the bug that this patch fixed hid the problem. Now that this patch fixed a bug that would hide the issue, the issue is showing up. I'll pass this along to the block experts and see what you think of it.

I attached my config, and the test was a script that stresses trace-cmd filters. Oh, and I ran this on my i386 kernel and OS. I haven't tried testing much on x86_64 as my tests start with i386. It originally had issues in x86_64, but that may be because the i386 test corrupted the filesystem, which is shared. There may be a 32bit vs 64bit issue somewhere?
I'm guessing it's the same issue that was recently diagnosed, which would make sense if you hit this on 32-bit with highmem. A patch is pending; if you feel inclined, it'd be great if you could add this patch and retry:
http://git.kernel.dk/cgit/linux-block/commit/?h=for-linus&id=23688bf4f830a89866fd0ed3501e342a7360fe4f

--
Jens Axboe