Neil, Doug's patch looks good to me. I'll apply it to v2.4 mainline.

On Thu, Sep 16, 2004 at 06:34:26PM -0400, Doug Ledford wrote:
> On Mon, 2004-09-13 at 20:56, Neil Brown wrote:
> > On Monday September 13, agustin@xxxxxxxxxxxxxxxx wrote:
> > > Hi Everybody,
> > >
> > > I was wondering if you could take a look at the following and give
> > > some advice... I just have TWO SATA WD800 disks and "mdadm" says I
> > > have "three" and one of them has "failed"!!!
> >
> > It's just some odd accounting in the kernel. Don't worry about it.
>
> It is just some odd accounting issue, but it's hardly something I would
> dismiss with "don't worry about it". I had to chase a similar-looking
> problem down in our RHEL3 product line; in fact, it was raid1 where I
> was seeing it. I've attached the email I sent about the problem to the
> internal kernel people, the patch I originally sent in for it, and the
> updated patch that catches a couple of spots I missed in my first patch
> (apologies if my mailer horks this up, I haven't tried to send sent
> mail in exactly the way I'm trying it now).
>
> From: Doug Ledford <dledford@xxxxxxxxxx>
> To:
> Subject: [Taroon patch] RAID1 oops fix
> Date: Mon, 21 Jun 2004 10:14:09 -0400
>
> Relevant: bz121616
>
> OK, the basic problem is that if you use mdadm to fail a device in a
> raid1 array and then immediately remove that device, you can end up
> triggering a race condition in the raid1 code. This only shows up on
> SMP systems (and the one I have here, a 2 physical / 4 logical
> processor system, shows it very easily, but for some reason
> nmi_watchdog never helped and the system always just locked hard and
> refused to do anything, so I didn't have an oops to work from, just a
> hard lock).
>
> In the raid1 code, we keep an array of the devices that are part of
> the raid1 array. Each of these devices can have multiple states, but
> for the most part we check the operational bit of a device before
> deciding to use it. If we decide to use that device, then we grab the
> device number from the array (kdev_t, i.e. the device's major/minor,
> which is what we pass to generic_make_request in order to hand the
> buffer head on to the underlying device).
>
> When we fail a device, we set that operational bit to 0. When we
> remove a device, we also set the dev item in the struct to MKDEV(0,0).
>
> There is no locking whatsoever between the failing of a device
> (setting the operational bit to 0) and the make_request functions in
> the raid1 code. So, even though it's safe to fail a device without
> this locking, before we can safely remove the device we need to know
> that every possible context that might be checking that operational
> bit has in fact seen the failed operational bit. If not, we can end up
> setting the dev to MKDEV(0,0), after which the other context grabs it
> and tries to pass that off to generic_make_request, and unpleasant
> things ensue.
>
> So, this patch does these things:
>
> 1. Whenever we are calling mark_disk_bad(), hold the
>    conf->device_lock.
> 2. Whenever we are walking the device array looking for an
>    operational device, always grab the conf->device_lock first and
>    hold it until after we have gotten not only the operational bit
>    but also the dev number for the device.
> 3. Correct an accounting problem in the superblock. If we fail a
>    device and it's currently counted as a spare device instead of an
>    active device, then we failed to decrement the superblock's spare
>    disk count.
>    This accounting error is preserved across shutdown and restart of
>    the array, and although it doesn't oops the kernel (the kernel will
>    refuse to try to read beyond disk 26 even if the spare count
>    indicates it should, although I'm not sure it doesn't try to write
>    past 26, so this could be a disk corruptor when the spare count
>    plus active count exceeds the amount of space available in the
>    on-disk superblock format), it does in fact cause mdadm to segfault
>    on trying to read the superblock.
>
> So, that's the description. Testing: well, without this patch, my test
> machine dies on the following command *very* quickly:
>
>   while true; do mdadm /dev/md0 -f /dev/sdc1 -r /dev/sdc1 -a /dev/sdc1; sleep 1; done
>
> In addition, without the patch you can watch the superblock's spare
> count go up with every single invocation of that command.
>
> With my patch, the same machine survived the above command running
> over the weekend, and in addition I mounted the raid1 array and ran a
> continuous loop of bonnie++ sessions to generate as much load as
> possible. I've verified that the spare count stays consistent when
> failing a spare device, and I've verified that once a device is synced
> up, the spare count is also decremented as the device is switched to
> being accounted as an active device.
>
> --
> Doug Ledford <dledford@xxxxxxxxxx>
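[Editor's note] As context for items 1 and 2 above, here is a minimal sketch
of the locking pattern Doug describes. It is not the RHEL3 patch nor the
actual 2.4 raid1 source: only conf->device_lock, the operational bit, the
kdev_t dev field, mark_disk_bad() and generic_make_request() come from the
message; the struct layouts, the helper names and the array size constant
are simplified stand-ins for illustration.

/*
 * Illustrative sketch only -- simplified stand-in structures, not the
 * real 2.4 raid1 code.  It shows why the operational check and the dev
 * read must happen under the same lock that mark_disk_bad() takes.
 */
#include <linux/spinlock.h>
#include <linux/kdev_t.h>

#define SKETCH_MAX_DISKS 27	/* on-disk sb has slots 0-26, cf. "disk 26" above */

struct sketch_mirror {
	int    operational;	/* cleared when the device is failed */
	kdev_t dev;		/* set to MKDEV(0,0) when it is removed */
};

struct sketch_conf {
	spinlock_t           device_lock;
	struct sketch_mirror mirrors[SKETCH_MAX_DISKS];
};

/*
 * Racy pattern: between the operational test and the dev read, another
 * CPU can fail *and* remove the device, so we may hand MKDEV(0,0) on
 * toward generic_make_request().
 */
static kdev_t sketch_pick_disk_unsafe(struct sketch_conf *conf)
{
	int i;

	for (i = 0; i < SKETCH_MAX_DISKS; i++)
		if (conf->mirrors[i].operational)
			return conf->mirrors[i].dev;
	return MKDEV(0, 0);
}

/*
 * Fixed pattern: take conf->device_lock across both the operational
 * test and the dev read.  Because mark_disk_bad() now also runs under
 * conf->device_lock, a remover holding the lock knows every
 * make_request path has either seen operational == 0 or already copied
 * a still-valid dev.
 */
static kdev_t sketch_pick_disk_locked(struct sketch_conf *conf)
{
	kdev_t dev = MKDEV(0, 0);
	unsigned long flags;
	int i;

	spin_lock_irqsave(&conf->device_lock, flags);
	for (i = 0; i < SKETCH_MAX_DISKS; i++) {
		if (conf->mirrors[i].operational) {
			dev = conf->mirrors[i].dev;
			break;
		}
	}
	spin_unlock_irqrestore(&conf->device_lock, flags);
	return dev;
}

The design point is the one item 2 spells out: the reader copies dev while
still holding the lock, so a removal that zeroes dev under the same lock can
never interleave between the operational check and the copy.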
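[Editor's note] Similarly, a hedged sketch of the accounting fix from item 3.
The counter names are modeled on the active/spare/failed disk counts kept in
the md superblock; the struct, the was_spare flag and the helper name are
invented for the example.

struct sketch_sb_counters {
	int active_disks;
	int spare_disks;
	int failed_disks;
};

/*
 * Called when a disk is marked bad.  The bug being fixed: the old code
 * only ever decremented the active count, so failing a disk that was
 * still accounted as a spare left spare_disks one too high, and the
 * inflated value was written back to the on-disk superblock, surviving
 * array shutdown/restart and eventually making mdadm segfault when it
 * read the superblock.
 */
static void sketch_account_failure(struct sketch_sb_counters *sb, int was_spare)
{
	if (was_spare)
		sb->spare_disks--;
	else
		sb->active_disks--;
	sb->failed_disks++;
}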