John,
yes, your scenario is exactly the one we were hitting. Strangely, I did
not see anybody else complaining about this issue.

Alex.

On Mon, Jan 30, 2012 at 10:55 PM, John Gehring <john.gehring@xxxxxxxxx> wrote:
> Alex,
>
> Thank you very much for bringing that thread to my attention! That fixed the
> problem I outlined below in my earlier post. It also fixed another problem,
> which I'll briefly outline in case someone else is trying to connect the
> dots on the same issue:
>
> - linux 2.6.38.8
> - 3-device RAID5 array
> - mdadm --zero-superblock each of the three devices
> - mdadm create ...
> - start IO on the new md device
> - pull a device from the raid set
>
> At this point, the array gets stuck in a loop: it resyncs, then for some
> reason resyncs again, and again, and so on. Each resync would complete
> virtually instantaneously.
>
> Thanks again.
>
> John G
>
> On Sat, Jan 21, 2012 at 10:16 AM, Alexander Lyakas <alex.bolshoy@xxxxxxxxx> wrote:
>>
>> Hi John,
>> not sure if this is still relevant, but you may be affected by a bug in the
>> 2.6.38-8 kernel. We hit exactly the same issue with raid5/6.
>>
>> Please take a look at this (long) email thread:
>> http://www.spinics.net/lists/raid/msg34881.html
>>
>> Eventually (please look towards the end of the thread) Neil provided a
>> patch, which solved the issue.
>>
>> Thanks,
>> Alex.
>>
>> On Mon, Dec 5, 2011 at 8:15 AM, NeilBrown <neilb@xxxxxxx> wrote:
>> > On Fri, 2 Dec 2011 09:34:40 -0700 John Gehring <john.gehring@xxxxxxxxx> wrote:
>> >
>> >> I am having trouble with a hot-pull scenario.
>> >>
>> >> - linux 2.6.38.8
>> >> - LSI 2008 SAS
>> >> - RAID6 via md
>> >> - 8 drives (2 TB each)
>> >>
>> >> Suspect sequence:
>> >>
>> >> 1 - Create the RAID6 array using all 8 drives (/dev/md1). Each drive is
>> >>     partitioned identically with two partitions; the second partition of
>> >>     each drive is used for the raid set. The size of the partition varies,
>> >>     but I have been using a 4GB partition for testing in order to have
>> >>     quick re-sync times.
>> >> 2 - Wait for the raid re-sync to complete.
>> >> 3 - Start read-only IO against /dev/md1 via the following command:
>> >>     dd if=/dev/md1 of=/dev/null bs=1
>> >>     This step ensures that pulled drives are detected by md.
>> >> 4 - Physically pull a drive from the array.
>> >> 5 - Verify that md has removed the drive from the array.
>> >>     mdadm --detail /dev/md1 should show it as faulty and removed.
>> >> 6 - Remove the device from the raid array: mdadm /dev/md1 -r /dev/sd[?]2
>> >> 7 - Re-insert the drive into the slot.
>> >> 8 - Take a look at dmesg to see what device name has been assigned.
>> >>     Typically it gets the same letter assigned as before.
>> >> 9 - Add the drive back into the raid array: mdadm /dev/md1 -a /dev/sd[?]2
>> >>     Now some folks might say that I should use --re-add, but the mdadm
>> >>     documentation states that re-add will be used anyway if the system
>> >>     detects that a drive has been 're-inserted'. Additionally, the mdadm
>> >>     response to this command shows that an 'add' or 're-add' was executed,
>> >>     depending on the state of the disk inserted.
>> >>     -- All is apparently going fine at this point. The add command succeeds,
>> >>     and cat /proc/mdstat shows the re-sync in progress; it eventually finishes.
>> >>     -- Now for the interesting part.
>> >> 10 - Verify that the dd command is still running.
>> >> 11 - Pull the same drive again.
>> >>
>> >> This time, the device is not removed from the array, although it is
>> >> marked as faulty in the /proc/mdstat report.
>> >>
>> >> In mdadm --detail /dev/md1, the device is still in the raid set and is
>> >> marked as "faulty spare rebuilding". I have not found a command that
>> >> will remove the drive from the raid set at this point. There were a
>> >> couple of instances/tests where, after 10+ minutes, the device came out
>> >> of the array and was simply marked faulty, at which point I could add a
>> >> new drive, but that has been the exception. Usually, it remains in the
>> >> 'faulty spare rebuilding' state.
>> >>
>> >> I don't understand why there is different behavior the second time the
>> >> drive is pulled. I tried zeroing out both partitions on the drive,
>> >> re-partitioning, and mdadm --zero-superblock, but still see the same
>> >> behavior. If I pull a drive and replace it, I am able to do a subsequent
>> >> pull of the new drive without trouble, albeit only once.
>> >>
>> >> Comments? Suggestions? I'm glad to provide more info.
>> >>
>> >
>> > Yes, strange.
>> >
>> > The only thing that should stop you being able to remove the device is
>> > if there are outstanding IO requests.
>> >
>> > Maybe the driver is being slow in aborting requests the second time.
>> > Could be a driver bug on the LSI side.
>> >
>> > You could try using blktrace to watch all the requests and make sure
>> > every request that starts also completes....
>> >
>> > NeilBrown
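
For anyone wanting to reproduce the hot-pull sequence described above, the
steps can be collected into a rough shell sketch. This is only a sketch under
assumptions: the array name /dev/md1 and the member partitions /dev/sd[b-i]2
are illustrative and not taken from the thread, the physical pull and
re-insert steps remain manual, and the blktrace invocation at the end follows
Neil's suggestion but is not quoted from his reply.

    #!/bin/bash
    # Rough sketch of the hot-pull reproduction sequence described above.
    # Assumptions: the array is /dev/md1 and the members are the second
    # partition of each drive, /dev/sd[b-i]2; adjust to the real layout.

    MD=/dev/md1
    MEMBERS=(/dev/sd{b,c,d,e,f,g,h,i}2)

    # 1. Create the RAID6 array from all 8 partitions.
    mdadm --create "$MD" --level=6 --raid-devices=8 "${MEMBERS[@]}"

    # 2. Wait for the initial re-sync to complete.
    while grep -q resync /proc/mdstat; do sleep 5; done

    # 3. Start read-only IO so a pulled drive is actually noticed by md.
    dd if="$MD" of=/dev/null bs=1 &
    DD_PID=$!

    # 4. Physically pull a drive now (manual step).
    read -rp "Pull a drive, then press Enter... "

    # 5. Check that md has marked the pulled member faulty.
    mdadm --detail "$MD"

    # 6. Remove the faulty member (substitute the pulled device for sdX2):
    #      mdadm "$MD" -r /dev/sdX2
    # 7-9. Re-insert the drive, check dmesg for its name, add it back:
    #      dmesg | tail
    #      mdadm "$MD" -a /dev/sdX2
    # 10-11. After the re-sync completes, confirm dd is still running and
    #      pull the same drive again; per the report, this second pull is
    #      where the member gets stuck as "faulty spare rebuilding".

    kill "$DD_PID" 2>/dev/null

If the second pull does leave the member stuck, Neil's suggestion was to watch
the block layer directly, for example with

    blktrace -d /dev/sdX2 -o - | blkparse -i -

and confirm that every request issued to the pulled device also completes;
that is one common way to run blktrace, not the exact command from the thread.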