Re: 2.6.23.1: mdadm/raid5 hung/d-state

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 





On Thu, 8 Nov 2007, BERTRAND Joël wrote:

BERTRAND Joël wrote:
Chuck Ebbert wrote:
On 11/05/2007 03:36 AM, BERTRAND Joël wrote:
Neil Brown wrote:
On Sunday November 4, jpiszcz@xxxxxxxxxxxxxxx wrote:
# ps auxww | grep D
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root       273  0.0  0.0      0     0 ?        D    Oct21  14:40
[pdflush]
root       274  0.0  0.0      0     0 ?        D    Oct21  13:00
[pdflush]

After several days/weeks, this is the second time this has happened,
while doing regular file I/O (decompressing a file), everything on
the device went into D-state.
At a guess (I haven't looked closely) I'd say it is the bug that was
meant to be fixed by

commit 4ae3f847e49e3787eca91bced31f8fd328d50496

except that patch applied badly and needed to be fixed with
the following patch (not in git yet).
These have been sent to stable@ and should be in the queue for 2.6.23.2
    My linux-2.6.23/drivers/md/raid5.c contains your patch for a long
time :

...
        spin_lock(&sh->lock);
        clear_bit(STRIPE_HANDLE, &sh->state);
        clear_bit(STRIPE_DELAYED, &sh->state);

        s.syncing = test_bit(STRIPE_SYNCING, &sh->state);
        s.expanding = test_bit(STRIPE_EXPAND_SOURCE, &sh->state);
        s.expanded = test_bit(STRIPE_EXPAND_READY, &sh->state);
        /* Now to look around and see what can be done */

        /* clean-up completed biofill operations */
        if (test_bit(STRIPE_OP_BIOFILL, &sh->ops.complete)) {
                clear_bit(STRIPE_OP_BIOFILL, &sh->ops.pending);
                clear_bit(STRIPE_OP_BIOFILL, &sh->ops.ack);
                clear_bit(STRIPE_OP_BIOFILL, &sh->ops.complete);
        }

        rcu_read_lock();
        for (i=disks; i--; ) {
                mdk_rdev_t *rdev;
                struct r5dev *dev = &sh->dev[i];
...

but it doesn't fix this bug.


Did that chunk starting with "clean-up completed biofill operations" end
up where it belongs? The patch with the big context moves it to a different
place from where the original one puts it when applied to 2.6.23...

Lately I've seen several problems where the context isn't enough to make
a patch apply properly when some offsets have changed. In some cases a
patch won't apply at all because two nearly-identical areas are being
changed and the first chunk gets applied where the second one should,
leaving nowhere for the second chunk to apply.

I always apply this kind of patches by hands, and no by patch command. Last patch sent here seems to fix this bug :

gershwin:[/usr/scripts] > cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
md7 : active raid1 sdi1[2] md_d0p1[0]
      1464725632 blocks [2/1] [U_]
[=====>...............] recovery = 27.1% (396992504/1464725632) finish=1040.3min speed=17104K/sec

	Resync done. Patch fix this bug.

	Regards,

	JKB


Excellent!

I cannot easily re-produce the bug on my system so I will wait for the next stable patch set to include it and let everyone know if it happens again, thanks.

[Index of Archives]     [Linux RAID Wiki]     [ATA RAID]     [Linux SCSI Target Infrastructure]     [Linux Block]     [Linux IDE]     [Linux SCSI]     [Linux Hams]     [Device Mapper]     [Device Mapper Cryptographics]     [Kernel]     [Linux Admin]     [Linux Net]     [GFS]     [RPM]     [git]     [Yosemite Forum]


  Powered by Linux