Thanks, Neil, for looking into this.

On Mon, Nov 21, 2011 at 4:44 AM, NeilBrown <neilb@xxxxxxx> wrote:
> On Thu, 17 Nov 2011 13:13:20 +0200 Alexander Lyakas <alex.bolshoy@xxxxxxxxx>
> wrote:
>
>> Hello Neil,
>>
>> >> However, at least for 1.2 arrays, I believe this is too restrictive,
>> >> don't you think? If the raid slot (not desc_nr) of the device being
>> >> re-added is *not occupied* yet, can't we just select a free desc_nr
>> >> for the new disk on that path?
>> >> Or perhaps mdadm, on the re-add path, can select a free desc_nr
>> >> (disc.number) for it (just as it does for --add), after ensuring that
>> >> the slot is not occupied yet? Where is it better to do this?
>> >> Otherwise the re-add fails, while it could perfectly well succeed
>> >> (it need only pick a different desc_nr).
>> >
>> > I think I see what you are saying.
>> > However, my question is: is this really an issue?
>> > Is there a credible sequence of events that results in the current code
>> > making an undesirable decision? Of course I do not count deliberately
>> > editing the metadata as part of a credible sequence of events.
>>
>> Consider this scenario, in which the code refuses to re-add a drive:
>>
>> Step 1:
>> - I created a raid1 array with 3 drives: A, B, C (with desc_nr=0,1,2).
>> - I failed drives B and C, removed them from the array, and
>>   totally forgot about them for the rest of the scenario.
>> - I added two new drives, D and E, to the array and waited for the
>>   resync to complete. The array now has the following structure:
>>   A: desc_nr=0
>>   D: desc_nr=3 (selected on the "add" path in mdadm, as expected)
>>   E: desc_nr=4 (selected on the "add" path in mdadm, as expected)
>>
>> Step 2:
>> - I failed drives D and E and removed them from the array. Drive E is
>>   not used for the rest of the scenario, so we can forget about it.
>>
>> I then wrote some data to the array. At this point the array bitmap is
>> dirty and will not be cleared, since the array is degraded.
>>
>> Step 3:
>> - I added one new drive (the last one, I promise!) to the array, drive F,
>>   and waited for it to resync. The array now has the following structure:
>>   A: desc_nr=0
>>   F: desc_nr=3
>>
>> So F took drive D's desc_nr (desc_nr=3). This is expected according to
>> the mdadm code.
>>
>> Event counters at this point:
>> A and F: events=149, events_cleared=0
>> D: events=109
>>
>> Step 4:
>> At this point mdadm refuses to re-add drive D to the array, because its
>> desc_nr is already taken (I verified that via gdb). On the other hand,
>> if we simply picked a fresh desc_nr for D, I believe it could be
>> re-added, because:
>> - slots are not important for raid1 (D's slot was actually taken by F);
>> - it should pass the check for bitmap-based resync (events in D's sb >=
>>   events_cleared of the array).
>>
>> Do you agree with this, or have I perhaps missed something?
>>
>> Additional notes:
>> - Of course, such a scenario is relevant only for arrays with more than
>>   single redundancy, so it is not relevant for raid5.
>> - To simulate such a scenario for raid6, at step 3 we need to add the
>>   new drive to a slot which is not the slot of the drive we are going
>>   to re-add in step 4 (otherwise it takes D's slot, and then we really
>>   cannot re-add). This can be done as we discussed earlier.
>>
>> What do you think?
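
For reference, the steps above correspond roughly to the following command
sequence (a sketch from my notes, not verified verbatim; device names are
illustrative, and I assume an internal write-intent bitmap, which is needed
for the re-add to be considered at all):

  # Step 1: 3-drive raid1 (A=/dev/sda, B=/dev/sdb, C=/dev/sdc)
  mdadm --create /dev/md0 --level=1 --raid-devices=3 --bitmap=internal \
        /dev/sda /dev/sdb /dev/sdc
  mdadm /dev/md0 --fail /dev/sdb --remove /dev/sdb
  mdadm /dev/md0 --fail /dev/sdc --remove /dev/sdc
  mdadm /dev/md0 --add /dev/sdd /dev/sde   # D and E get desc_nr=3 and 4
  # ... wait for the resync to complete ...

  # Step 2: fail and remove D and E, then dirty the bitmap with writes
  mdadm /dev/md0 --fail /dev/sdd --remove /dev/sdd
  mdadm /dev/md0 --fail /dev/sde --remove /dev/sde
  dd if=/dev/urandom of=/dev/md0 bs=1M count=16 oflag=direct

  # Step 3: add F, which takes the freed desc_nr=3
  mdadm /dev/md0 --add /dev/sdf
  # ... wait for the resync to complete ...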
> I think some of the details in your steps aren't really right, but I do
> see the point you are making.
> If you keep the array degraded, events_cleared will not be updated, so
> any old array member can safely be re-added.
>
> I'll have a look and see how best to fix the code.
>
> Thanks.
>
> NeilBrown
>
>> Thanks,
>> Alex.
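
And to make the failure itself concrete, here is roughly what step 4 looks
like (again, device names are illustrative and the error text is
approximate):

  # Step 4: the re-add is refused, because desc_nr=3 is now held by F
  mdadm /dev/md0 --re-add /dev/sdd
  # mdadm reports something like:
  #   mdadm: --re-add for /dev/sdd to /dev/md0 is not possible

  # The relevant counters can be compared with:
  mdadm --examine /dev/sdd          # "Events" in D's superblock
  mdadm --examine-bitmap /dev/sda   # "Events Cleared" in the array's bitmap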