The following is a copy of http://neil.brown.name/blog/20120615073245 which I posted a link to under a less obvious subject, so I'm posting it again, complete with a better subject, to maximise the number of people who will see it.

NeilBrown

A Nasty md/raid bug
=-=-=-=-=-=-=-=-=-=

There is a rather nasty RAID bug in some released versions of the Linux kernel.  It won't destroy your data, but it could make it hard to access that data.

If you are concerned that this might affect you, the first thing you should do (after not panicking) is to gather the output of "mdadm -Evvvvs" and save it somewhere that is not on a RAID array.  The second thing to do is read to the end of this note and then proceed accordingly.  You most likely will never need to use the output of that command, but if you do it could be extremely helpful.

Who is vulnerable
=================

The bug was introduced by commit

  c744a65c1e2d59acc54333ce8  md: don't set md arrays to readonly on shutdown.

and fixed by commit

  30b8aa9172dfeaac6d77897c67ee9f9fc574cdbb  md: fix possible corruption of array metadata on shutdown.

These entered the upstream kernel for v3.4-rc1 and v3.4-rc5 respectively, so no mainline released kernel is vulnerable.  However the first patch was tagged "Cc: stable@xxxxxxxxxxxxxxx" as it fixed a bug, and so it was added to some stable releases.

For v3.3.y the bug was introduced by commit ed1b69c5592d1 in v3.3.1 and fixed by commit ff459d1ea87ea7 in v3.3.4, so v3.3.1, v3.3.2, and v3.3.3 are vulnerable.

For v3.2.y the bug was introduced by commit 6bd620a44f7fd in v3.2.14 and fixed by commit 31097a1c490c in v3.2.17, so v3.2.14, v3.2.15, and v3.2.16 are all vulnerable.

The bug was not backported to any other kernel.org kernels, so only those 6 are vulnerable.  Some distributors may have picked up the patch and applied it to their own kernel, so it is possible that other kernels are vulnerable too.  e.g. SLES11-SP2 introduced the bug in 3.0.26-0.7 and fixed it in 3.0.31-0.9.  Ubuntu-precise introduced the bug in Ubuntu-3.2.0-22.35 and fixed it in Ubuntu-3.2.0-24.38.

What does it do
===============

The bug only fires when you shutdown/poweroff/reboot the machine.  While the machine remains up the bug is completely inactive.  So you will only notice the bug when you boot up again.

The effect of the bug is to erase important information from the metadata that is stored on the disk drives.  In particular the level, chunksize, number of devices, data_offset and role of each device in the array are erased ... and probably some other information too.  This means that if you know those details you can recover your data, but if you don't, it will be harder.  Hence the "mdadm -E" command suggested earlier.

The bug will only fire if, while the machine is shutting down, an array is partially assembled but not yet started.  If the array is started and running it is safe ... as long as it stays that way.

The typical way that an array can get into this "partially assembled but not started" state is for "mdadm --incremental" to have been run on some, but not all, of the devices in the array.  If it had been run on all of them it would have started the array, but until then the array is assembled but not started, and this is the dangerous situation.  Another way it can happen is if you use "mdadm -A" to assemble the array and explicitly list some, but not all, of the devices that make up the array.  This also will assemble the array but not start it.
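To make that dangerous state concrete, here is a minimal sketch; the device and array names are only examples, and the exact /proc/mdstat output varies a little between kernel versions:

  mdadm --incremental /dev/sdb1   # one member of a multi-device array: assembled but not started
  cat /proc/mdstat                # the array shows as "inactive", e.g. "md0 : inactive sdb1[1](S)"
  mdadm --stop /dev/md0           # stopping it removes the partial assembly, and with it the risk

An array left "inactive" like this at the moment the kernel shuts down is exactly what the bug needs.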
It is very unlikely that an array will be in this state at shutdown, though certainly not impossible, so it seems unlikely that the bug will affect many people.  But it has affected some.

Most of the people who have reported problems have been running Ubuntu.  This might just be because Ubuntu happened to make a release with a vulnerable kernel and no other distro did, or it might be that Ubuntu does something during shutdown that makes triggering the bug more likely.

For example, if Ubuntu stops arrays as part of shutdown (which is a good idea), and if it has udev scripts which automatically pass changed devices to "mdadm --incremental" in case they are part of an array (which is also a good idea, and I think they do), and further if something causes udev to think that the devices that were part of the array had changed - which seems unlikely but is not completely impossible - then arrays could get partially re-assembled late in the shutdown sequence, which could trigger exactly the bug we see.  Note that I don't know that Ubuntu does this, and I'd be surprised if they did, and there is probably some other explanation.  But I like to try to think of all possibilities to try to understand things.

What should I do to avoid being bitten by the bug
=================================================

If you are running one of the kernels identified above as containing the fix (3.4, 3.3.4, or 3.2.17), or anything newer, then you are not vulnerable and there is nothing else that you need to do.

If you are running a kernel that was compiled before March 2012 (when the bug was created), then you aren't vulnerable, and as long as you don't upgrade to a vulnerable kernel you will continue to be safe.

If you are running a vendor kernel compiled since 19th March 2012 but with a version earlier than those listed above as containing the fix, then I cannot tell you if you are vulnerable or not.  You might need to check with your vendor.

If you decide to upgrade your kernel, you should do so carefully.  Remember that the bug triggers on shutdown/reboot, so you aren't safe until the new kernel is running.  To be completely safe you must ensure that no arrays are partially assembled at the moment that the shutdown process in the kernel looks at md arrays.  To do this you can

  mv /sbin/mdadm /sbin/mdadm.moved
  /sbin/mdadm.moved --stop --scan

Any array that is partially assembled (and any other array that is not in use) will be stopped.  As "/sbin/mdadm" will not exist, no array can be partially - or fully - assembled.

Doing this may cause the shutdown process to complain if it cannot find mdadm to stop arrays itself, but this should not be a problem.  The boot process in the new kernel might also complain, as it won't be able to find /sbin/mdadm either.  You might have to boot into a rescue mode and

  mv /sbin/mdadm.moved /sbin/mdadm

back into place.  Then boot again.

Also, this fiddling with moving mdadm might be completely unnecessary.  Or it might not.  I cannot be sure.  What I can be sure of is that doing it this way is safest.

Once you are running a non-vulnerable kernel, you are safe and cannot be bitten - by this bug at least.

What should I do if I have already been bitten
==============================================

You will know that you have been bitten if some array or arrays don't seem to work and all the devices appear to be spares, and if you then use "mdadm --examine" to look at the devices and find that the RAID level is "-unknown-".
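As a rough illustration of what that looks like (the device name is only an example, and the exact layout of the report depends on your mdadm version), a damaged member reports something like:

  mdadm --examine /dev/sdb1 | grep -i 'raid level'
  #     Raid Level : -unknown-

while /proc/mdstat shows every member of the affected array marked as a spare.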
At this point you might want to send mail to linux-raid@xxxxxxxxxxxxxxx and ask for help.  Or you might be able to work your way through the following and fix it by yourself.

The way to fix an array that has been affected by the bug is to "Create" the array again with mdadm.  Only the metadata has been destroyed, and "mdadm --create" only writes new metadata, so this is a simple and effective fix and makes all data available again.  Constructing the correct "mdadm --create" command might be tricky.  If you have recent "mdadm --examine" output, then that will be a big help.  If not, you can probably get some of the required information from somewhere.  Maybe from your memory, maybe from kernel logs, maybe from guessing and seeing if it works.

First you need to know the metadata version that was in use.  This is one piece of information that is *not* destroyed, so "mdadm --examine" of one of the devices will give it to you.  It might be 0.90 for older arrays, or 1.2 for newer arrays, or possibly 1.0 or 1.1 if they were chosen when the array was originally created.

Next you need to know the number of devices, RAID level, layout, and chunk size.  Some of these you might simply know (maybe you know it is a 5-device RAID6 array) or can determine by examining kernel logs or asking a colleague.  Others you can probably assume were the default if you don't know otherwise.  People rarely set a non-default layout for RAID5 or RAID6, though they do for RAID10.  If you have 0.90 metadata you probably have 64K chunks, while if you have 1.2 you probably have 512K chunks, so if you don't have any reason to think otherwise, this is a good place to start.

It is entirely possible that there is one device that was in the array in the recent past, but wasn't in the array at the moment when it was corrupted.  If this happens, then that device is a good source for some of this information.

Then you need to know the order of devices, and whether the array was degraded, in which case some devices will have to be specified as "missing".  As devices can change order when there are failures and spares are added, you cannot be sure that the obvious order is the correct order.  e.g. it might be sda1, sda2, sda3, sda4, but it could be sda1, sda2, sda4, sda3.  For RAID1 this doesn't matter of course.  For RAID4, RAID5, RAID6 and RAID10 it is very important.  Your kernel logs might have this information in a "RAID conf printout", so it is worth checking.  Remember however that device names can change after a reboot, so prepare to be flexible.

If you have a recent output of "cat /proc/mdstat", that might be useful, but be careful.  The numbers in brackets ("sda1[3]") are not always the position of the device in the array.  For 0.90 metadata they are, but for 1.x they show the order in which devices were added to the array.  When a spare is changed to being an active member of the array, this number does not change.  So it might be an indicator, but it is definitely not a promise.

Finally you need to know the "Data Offset" of each device.  This only applies to 1.1 and 1.2 metadata.  Different versions of mdadm use different values for the default Data Offset, so it is best to try to recreate the array using the same version of mdadm as was used to create the array.  Worse - if you created the array with an older version of mdadm, then added a spare with a newer version, then it is possible that different devices have different data offsets.  If that seems to be the case it would be best to ask for help on linux-raid@xxxxxxxxxxxxxxx, as you'll need a special non-released version of mdadm to fix things up.
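If you do have a member with intact metadata (for example a device that dropped out of the array some time before the corruption), or the "mdadm -E" output saved at the start of this note, something like the following pulls out the values you need; the device name is only an example:

  mdadm --examine /dev/sdc1 | grep -E -i 'raid level|raid devices|chunk size|layout|data offset|device role'
  mdadm --version   # aim to recreate with the same mdadm version that originally built the array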
Once you have assembled all this information you can try creating the array.  Note that you don't need information about whether there was a write-intent bitmap or not.  Just assume there wasn't and don't try adding one.  Once you have the data back you can always add a bitmap later.

When you create the array, give all the details and make sure to specify "--assume-clean".  This ensures that mdadm doesn't start any resync.  This is important because if you get the details wrong and need to try again, a resync could corrupt data and make subsequent attempts pointless.  So something like:

  mdadm --create /dev/md0 --assume-clean --level=5 --raid-devices=4 \
        --chunk=128 --metadata=1.2 /dev/sda1 /dev/sda2 missing /dev/sda4

This will write out new metadata and assemble the array, but will not write to any data.

You should then try to determine if the array contains the correct data.  How you do this depends on what was there before.  If you had an ext2/3/4 filesystem, "fsck -n /dev/md0" is the thing to use.  If XFS, then xfs_check is the tool of choice.  If you used LVM, then you might need to try starting LVM (vgchange -ay) and, if that works, run 'fsck' or whatever on the contents.

Only if 'fsck' reports positive results should you try mounting the filesystem.  Even if you mount the filesystem read-only, some filesystems might try to write to the device anyway, and you don't want that until you are sure.

If you don't know the correct chunk size, or don't know the correct order of devices, you might need to try multiple permutations until you find one that works.  To speed this up a little you can assume that once "fsck" or "vgchange" reports that it sees something that looks vaguely right, even if there are lots of errors, then you probably have the first device in the array correct, so you only need to continue permuting the other devices.

If your array is RAID6, and if it is missing at most one device, then an alternate approach to checking if the order is correct is to

  echo check > /sys/block/md0/md/sync_action

(where "md0" might be replaced by a different mdX in your case).  If /sys/block/md0/md/mismatch_cnt grows into the thousands quickly, then you certainly have the order wrong and should stop the array and try again.  If it reports zero, or only a few hundred, then the order is probably correct and it is worth running 'fsck' etc.  This does not work for RAID5, as the parity used for RAID5 is insensitive to the order of devices.  It can be useful for RAID10, but a positive result is not as strong as a positive result for RAID6.  (A sketch of this create-check-retry loop appears at the end of this section.)

Once you have a valid filesystem and you can see all your data, I recommend a "check" and possibly a "repair" sync_action, a full "fsck", and probably it's a good time to refresh your backups.
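To tie the pieces together, here is a minimal sketch of one recovery attempt; the metadata version, level, chunk size, device names and device order are all assumptions that you must replace with your own values:

  # One guess at the original geometry.  A wrong guess is harmless as long as
  # --assume-clean is given and nothing ever writes to the array.
  mdadm --create /dev/md0 --assume-clean --metadata=1.2 --level=6 \
        --raid-devices=4 --chunk=512 /dev/sda1 /dev/sdb1 missing /dev/sdd1

  # RAID6 only: if mismatch_cnt quickly grows into the thousands, the order is wrong.
  echo check > /sys/block/md0/md/sync_action
  cat /sys/block/md0/md/mismatch_cnt

  # Wrong guess: stop the array and try the next permutation.
  mdadm --stop /dev/md0

  # Plausible guess: verify read-only before ever mounting anything.
  fsck -n /dev/md0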
What was the bug anyway
=======================

So, you want to know the background, do you?  Well..... are you sitting comfortably?

md has always had a "reboot notifier" that tried to make sure nice things happened to arrays at shutdown.  The particular purpose was to reduce the possibility that a resync might be needed on reboot, even if the shutdown wasn't particularly clean.  So at shutdown, md would try to switch all arrays to "read-only" mode so that no more writes were allowed, and so the array would be marked as clean.  This originally applied only to arrays which were not active, but when people started having root on md more often, complaints started coming in that md couldn't stop all arrays - because the array holding the root filesystem would still be busy at this point.

So in 2008, for 2.6.27, md was changed to switch any array to read-only, even if it was still in use.  No more writes should be happening at this point, the only use should effectively be read-only, so everyone should be happy.

However, sometime around the 3.0 kernel there were changes to the way filesystem write-back was happening, so that writes could still arrive while the reboot was progressing.  A normal clean shutdown that called "sync" first should avoid that, but people don't always do that - sometimes with good reason.  md strongly believes that no-one should be writing to a read-only device, so it has a BUG_ON if a write arrives while the device is read-only.  Around 3.0/3.1 people started reporting this BUG being triggered at shutdown.  Obviously a BUG at shutdown isn't a very big problem, but it is untidy, so once the problem was identified it should be fixed.

As switching to read-only was no longer an option, I decided to switch to immediate-safemode instead.  'safemode' is a precursor to write-intent bitmaps.  It is like a WIB, but with only one bit.  Before writing, md makes sure the "dirty" bit is set in the metadata.  200ms after the last write the bit is cleared in the metadata.  immediate-safemode is the same without the 200ms delay.  So as soon as there are no outstanding writes, the bit is cleared.

This is very nearly as safe as switching to read-only mode, and if people are rebooting without calling 'sync' first, then they are deliberately giving up some safety and I don't feel the need to try any harder.  If no writes come, immediate-safemode is just as good as read-only.  So the code was changed to set immediate-safemode instead of read-only, and this patch was marked for 'stable' kernels as it fixed a BUG_ON crash.

Unfortunately the patch was imperfect.  Setting immediate-safemode involves clearing the bit and writing the metadata out immediately (if there are no outstanding writes).  This is normally good ... unless the array is not actually active.  If the array is not active, the code still writes out the metadata, but does this from its knowledge of the array, which is that the array is inactive.  The result is that it corrupts the metadata exactly as described above.

The fix, once the problem was understood, was simple.  Only set immediate-safemode if the array is active.