On Thu, Feb 16, 2017 at 01:54:08AM +0800, Patrick Dung wrote:
> Hello,
>
> I had a RHEL 7.3 vmware workstaion VM (host is Linux running Fedora).
> The VM used LSI SAS adapter. I had previously used it for a few weeks
> without problem.
>
> When I reboot it in yesterday, there is file system corruption (check
> the end of the log):
>
> Feb 15 04:37:51 server02 kernel: EXT4-fs error (device dm-0):
> mb_free_blocks:1448: group 38, block 1249221:freeing already freed
> block (bit 4037); block bitmap corrupt.
> Feb 15 04:37:51 server02 kernel: EXT4-fs error (device dm-0):
> ext4_mb_generate_buddy:757: group 38, block bitmap and bg descriptor
> inconsistent: 14676 vs 14677 free clusters

So one of the things to understand is that this is just a symptom, and
many different causes can produce that particular symptom. It can be
caused by hardware problems; it can be caused by kernel bugs.

For example, there was one bug[1] which had the exact same symptom, but
it was *only* showing up for people using Debian 3.16 kernels in guest
VMs. The people who were reporting this problem were *sure* that it
was an ext4 problem, and there were multiple people who were seeing
it. As it turns out, it was actually a bug in the KVM subsystem, and it
only showed up if you were running in a guest VM ***and*** you were
using a specific generation of buggy Intel CPUs.

[1] https://bugzilla.kernel.org/show_bug.cgi?id=102731

Worse, the bug was fixed in 3.17 in this commit[2] --- and Debian had
frozen their stable kernel on 3.16, which was not a long-term support
kernel as far as the upstream kernel developers were concerned. You
can see the whole debugging experience here[1]. Note that it took over
half a year to finally figure out what was going on, because I
couldn't reproduce it.

[2] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?h=7dec5603b6b8dc4c3e1c65d318bd2a5a8c62a424

Since kernel developers tend to run bleeding edge kernels (I'm
currently running 4.9, and Linus just released 4.10 a few hours ago;
my laptop will probably be upgraded to 4.10 plus the ext4 changes due
to be pushed to Linus in the next few days), *we* certainly tend not
to see any of these problems. So when end-users asked kernel
developers to fix what they were **sure** was an ext4 bug, about all
we could do is:

				¯\_(ツ)_/¯

To be clear, I run extensive regression testing on ext4 as we do
development. (This is why I'm willing to run bleeding edge kernels;
I'm actually very confident in the stability of the ext4 code, because
it's been well tested. When I go to the latest bleeding edge kernel,
my problems tend to be regressions in the WiFi or Intel i915 graphics
drivers; and since these don't cause data loss, it's pretty easy to
roll back to 4.9 if 4.10 proves problematic.) The test system I use
is called gce-xfstests, which runs a *very* exhaustive regression test
suite on Google Compute Engine. (For more information, please see
[3].)

[3] https://thunk.org/gce-xfstests

But the problem is that distributions, especially the enterprise
distributions, tend to freeze on ancient kernels, and then they cherry
pick new features from newer kernels. So even though RHEL 7.3 uses a
3.10 kernel, which was released upstream over ***three*** years ago
(June 30, 2013), Red Hat has applied literally thousands of commits on
top of 3.10. So they are using an ext4 file system which has some
huge number of new features and bug fixes backported to their 3.10
kernel.

I do actually periodically run regression tests on the upstream 3.10
stable kernel branch --- but that has very little to do with what Red
Hat is using, or the 3.10 kernel that was used on a number of Android
handsets.

The main reason I do the 3.10.y testing is that it doesn't take that
much effort (I can kick off a regression test with a single command,
and seven hours later I get a test report e-mailed to me), and what it
tells me is that the latest 3.10.x kernel is passing all or most tests
cleanly. And then when some mobile developer complains (especially if
they work at my company :-), I can say, well.... you're using a
Qualcomm kernel which was branched off of 3.10.23, and the bug fix was
backported to the 3.10.89 kernel as commit xyzzy. Cherry pick that
patch, and you should be good.

And then I curse Qualcomm for doing such a terrible job maintaining
their Board Support Kernels that they foist off what should be *their*
job on me. As a result, I am sometimes not so patient when it's some
other random mobile handset developer asking me for help --- because I
don't get paid enough for this sh*t. (Actually, as far as supporting
random customers and handset vendors, I don't get paid at *all* for
this kind of thing. It's all volunteer work.)

In the case of Red Hat, this is why Red Hat sells support contracts.
They have made so many changes, which represent bug fixes and features
requested by their customers, and their customers **want** to use an
ancient kernel so their proprietary device drivers and kernel modules
from EMC, VMware, et al., still work. But as a result ***only*** Red
Hat can really maintain and support their kernel for their customers.
Which, in fact, is a key part of Red Hat's business model. :-)

The bottom line is that this could very easily not be an ext4 bug, but
rather a Fibre Channel or other device driver bug; it could be that
KVM bug (although I would expect them to have cherry-picked that fix
into their kernel if it was applicable), or any number of other
things. But your best bet is really to open a support ticket with Red
Hat's help desk.

Cheers, and good luck,

					- Ted