Neil, Doug's patch looks good to me. I'll apply it to v2.4 mainline.

On Thu, Sep 16, 2004 at 06:34:26PM -0400, Doug Ledford wrote:
> On Mon, 2004-09-13 at 20:56, Neil Brown wrote:
> > On Monday September 13, agustin@xxxxxxxxxxxxxxxx wrote:
> > > Hi Everybody,
> > >
> > > I was wondering if you could take a look at the following and give
> > > some advice... I just have TWO SATA WD800 disks and "mdadm" says I
> > > have "three" and one of them has "failed"!!!
> >
> > It's just some odd accounting in the kernel. Don't worry about it.
>
> It is just some odd accounting issue, but it's hardly something I would
> dismiss with "don't worry about it". I had to chase a similar-looking
> problem down in our RHEL3 product line; in fact, it was raid1 where I
> was seeing it. I've attached the email I sent about the problem to the
> internal kernel people, the patch I originally sent in for it, and the
> updated patch that catches a couple of spots I missed in my first patch
> (apologies if my mailer horks this up, I haven't tried to send sent
> mail in exactly the way I'm trying it now).
>
> From: Doug Ledford <dledford@xxxxxxxxxx>
> To:
> Subject: [Taroon patch] RAID1 oops fix
> Date: Mon, 21 Jun 2004 10:14:09 -0400
>
> Relevant: bz121616
>
> OK, the basic problem is that if you use mdadm to fail a device in a
> raid1 array and then immediately remove that device, you can end up
> triggering a race condition in the raid1 code. This only shows up on
> SMP systems (and the one I have here, a 2 physical / 4 logical
> processor system, shows it very easily, but for some reason
> nmi_watchdog never helped and the system always just locked hard and
> refused to do anything, so I didn't have an oops to work from, just a
> hard lock).
>
> In the raid1 code, we keep an array of the devices that are part of
> the raid1 array. Each of these devices can have multiple states, but
> for the most part we check the operational bit of a device before
> deciding to use it. If we decide to use that device, then we grab the
> device number from the array (kdev_t, i.e. the device's major/minor,
> which is what we pass to generic_make_request in order to hand the
> buffer head on to the underlying device).
>
> When we fail a device, we set that operational bit to 0. When we
> remove a device, we also set the dev item in the struct to MKDEV(0,0).
>
> There is no locking whatsoever between the failing of a device
> (setting the operational bit to 0) and the make_request functions in
> the raid1 code. So, even though it's safe to fail a device without
> this locking, before we can safely remove the device we need to know
> that every possible context that might be checking that operational
> bit has in fact seen the failed operational bit. If not, we can end up
> setting the dev to MKDEV(0,0), after which the other context grabs it
> and tries to pass that off to generic_make_request, and unpleasant
> things ensue.
>
> So, this patch does these things:
>
> 1. Whenever we are calling mark_disk_bad(), hold the
>    conf->device_lock.
> 2. Whenever we are walking the device array looking for an
>    operational device, always grab the conf->device_lock first and
>    hold it until after we have gotten not only the operational bit
>    but also the dev number for the device.
> 3. Correct an accounting problem in the superblock. If we fail a
>    device and it's currently counted as a spare device instead of an
>    active device, then we failed to decrement the superblock's spare
>    disk count.
>    This accounting error is preserved across shutdown and restart of
>    the array, and although it doesn't oops the kernel (the kernel will
>    refuse to try to read beyond disk 26 even if the spare count
>    indicates it should, although I'm not sure it doesn't try to write
>    past 26, so this could be a disk corruptor when the spare count
>    plus active count exceeds the amount of space available in the
>    on-disk superblock format), it does in fact cause mdadm to segfault
>    on trying to read the superblock.
>
> So, that's the description. Testing: well, without this patch, my test
> machine dies on the following command *very* quickly:
>
>   while true; do mdadm /dev/md0 -f /dev/sdc1 -r /dev/sdc1 -a /dev/sdc1; sleep 1; done
>
> In addition, without the patch you can watch the superblock's spare
> count go up with every single invocation of that command.
>
> With my patch, the same machine survived the above command running
> over the weekend, and in addition I mounted the raid1 array and ran a
> continuous loop of bonnie++ sessions to generate as much load as
> possible. I've verified that the spare count stays consistent when
> failing a spare device, and I've verified that once a device is synced
> up, the spare count is also decremented as the device is switched to
> being accounted as an active device.
>
> --
> Doug Ledford <dledford@xxxxxxxxxx>
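[Editor's note] As context for items 1 and 2 above, here is a minimal sketch
of the locking pattern Doug describes. It is not the RHEL3 patch nor the
actual 2.4 raid1 source: only conf->device_lock, the operational bit, the
kdev_t dev field, mark_disk_bad() and generic_make_request() come from the
message; the struct layouts, the helper names and the array size constant
are simplified stand-ins for illustration.

/*
 * Illustrative sketch only -- simplified stand-in structures, not the
 * real 2.4 raid1 code.  It shows why the operational check and the dev
 * read must happen under the same lock that mark_disk_bad() takes.
 */
#include <linux/spinlock.h>
#include <linux/kdev_t.h>

#define SKETCH_MAX_DISKS 27	/* on-disk sb has slots 0-26, cf. "disk 26" above */

struct sketch_mirror {
	int    operational;	/* cleared when the device is failed */
	kdev_t dev;		/* set to MKDEV(0,0) when it is removed */
};

struct sketch_conf {
	spinlock_t           device_lock;
	struct sketch_mirror mirrors[SKETCH_MAX_DISKS];
};

/*
 * Racy pattern: between the operational test and the dev read, another
 * CPU can fail *and* remove the device, so we may hand MKDEV(0,0) on
 * toward generic_make_request().
 */
static kdev_t sketch_pick_disk_unsafe(struct sketch_conf *conf)
{
	int i;

	for (i = 0; i < SKETCH_MAX_DISKS; i++)
		if (conf->mirrors[i].operational)
			return conf->mirrors[i].dev;
	return MKDEV(0, 0);
}

/*
 * Fixed pattern: take conf->device_lock across both the operational
 * test and the dev read.  Because mark_disk_bad() now also runs under
 * conf->device_lock, a remover holding the lock knows every
 * make_request path has either seen operational == 0 or already copied
 * a still-valid dev.
 */
static kdev_t sketch_pick_disk_locked(struct sketch_conf *conf)
{
	kdev_t dev = MKDEV(0, 0);
	unsigned long flags;
	int i;

	spin_lock_irqsave(&conf->device_lock, flags);
	for (i = 0; i < SKETCH_MAX_DISKS; i++) {
		if (conf->mirrors[i].operational) {
			dev = conf->mirrors[i].dev;
			break;
		}
	}
	spin_unlock_irqrestore(&conf->device_lock, flags);
	return dev;
}

The design point is the one item 2 spells out: the reader copies dev while
still holding the lock, so a removal that zeroes dev under the same lock can
never interleave between the operational check and the copy.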
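[Editor's note] Similarly, a hedged sketch of the accounting fix from item 3.
The counter names are modeled on the active/spare/failed disk counts kept in
the md superblock; the struct, the was_spare flag and the helper name are
invented for the example.

struct sketch_sb_counters {
	int active_disks;
	int spare_disks;
	int failed_disks;
};

/*
 * Called when a disk is marked bad.  The bug being fixed: the old code
 * only ever decremented the active count, so failing a disk that was
 * still accounted as a spare left spare_disks one too high, and the
 * inflated value was written back to the on-disk superblock, surviving
 * array shutdown/restart and eventually making mdadm segfault when it
 * read the superblock.
 */
static void sketch_account_failure(struct sketch_sb_counters *sb, int was_spare)
{
	if (was_spare)
		sb->spare_disks--;
	else
		sb->active_disks--;
	sb->failed_disks++;
}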