Re: Fast RAID 1 Resync

Hi Neil,

(Sorry if this is a repost. I had an error with returned mail)

Thanks for your comments.

I also don't see a need to synchronize the disks at mkraid time. It's nice to have
identical disks, but it isn't necessary as long as the result of reading a sector
that has never been written is undefined. An option to do either could be added if
there is a real need, but adding an option increases complexity, especially during
testing.

I have been thinking of tracking writes to each chunk with a counter. The counters
would be organized into a vector indexed by chunk number. Before a write starts,
the counter is incremented by the number of mirrors, including any currently
unavailable mirrors, and it is decremented as each individual write completes. So
when all writes to a chunk are complete, the counter returns to zero. If a mirror
is missing, the counter will not return to zero (since one of the needed writes was
never done).
When a disk is pulled, the counters increment but don't return to zero. When the
disk is reinserted, the resync only needs to copy the chunks whose counters did not
return to zero. When the resync of a chunk is complete, its counter is set to zero.
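
Here is a minimal sketch of those per-chunk counters in plain C. It is only an
illustration: NR_CHUNKS, the types, and the missing locking are assumptions, not
the actual md data structures.

    #include <stdint.h>

    #define NR_CHUNKS 4096                     /* assumed number of chunks */

    static uint16_t chunk_writes[NR_CHUNKS];   /* one counter per chunk */

    /* Called once before the mirrored writes for a chunk are issued.
     * nr_mirrors counts every mirror, including any currently missing one. */
    static void chunk_write_start(unsigned int chunk, unsigned int nr_mirrors)
    {
            chunk_writes[chunk] += nr_mirrors;
    }

    /* Called as each individual mirror write completes.  A missing mirror
     * never completes, so its chunk's counter stays non-zero. */
    static void chunk_write_done(unsigned int chunk)
    {
            chunk_writes[chunk]--;
    }

    /* When the pulled disk is reinserted, copy every chunk for which this
     * returns non-zero, then set its counter back to zero. */
    static int chunk_needs_resync(unsigned int chunk)
    {
            return chunk_writes[chunk] != 0;
    }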

To deal with recovery after a crash, I am thinking of using your approach: keep a
bit per counter and set the bit when the counter is non-zero. When a bit goes from
0 to 1, the updated bit vector is written to disk before the write to the chunk is
started. On reboot after a crash, the bit vector from the selected mirror is used
(the current mechanism is used to select the base disk), and the counter for each
chunk whose bit is set is incremented. After that, the resync described above can
be performed. I don't see a need for timestamps beyond what is currently being
done.
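
Continuing the sketch above, the crash-recovery bits might look roughly like this
(the bit vector is kept as one byte per chunk purely for clarity, and
write_bitmap_to_disk() is a placeholder for whatever synchronous on-disk update is
actually used):

    /* Crash-recovery bits, one per counter.  Illustrative only. */
    static uint8_t dirty_bit[NR_CHUNKS];   /* would be packed bits in reality */

    static void write_bitmap_to_disk(const uint8_t *bits)
    {
            (void)bits;
            /* placeholder: synchronously update the on-disk bit vector */
    }

    /* Called before the data writes for a chunk are issued. */
    static void before_chunk_write(unsigned int chunk, unsigned int nr_mirrors)
    {
            if (!dirty_bit[chunk]) {
                    dirty_bit[chunk] = 1;
                    write_bitmap_to_disk(dirty_bit);  /* must reach media first */
            }
            chunk_write_start(chunk, nr_mirrors);
    }

    /* On reboot after a crash, the bit vector read from the selected base
     * mirror seeds the counters, so the resync described above copies
     * exactly the chunks that may be out of sync. */
    static void seed_counters_from_bitmap(void)
    {
            unsigned int chunk;

            for (chunk = 0; chunk < NR_CHUNKS; chunk++)
                    if (dirty_bit[chunk])
                            chunk_writes[chunk]++;
    }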

The transitions from 1 to 0 are not all that important, since the worst case is
resyncing a chunk that is already mirrored. Overall performance can be improved by
delaying the 1 to 0 transition for a few seconds: a lazy write of the bits can be
done every 10 seconds or so if there are no 0 to 1 changes during that interval.
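
For example, continuing the same sketch (the 10 second interval and the use of
time() are only for illustration):

    #include <time.h>

    static time_t last_flush;                /* when the lazy flush last ran */
    static int    bits_cleared_since_flush;

    /* 1 -> 0 transition: clear the bit in memory only; the on-disk copy is
     * refreshed lazily, since the worst case is resyncing a chunk that is
     * already mirrored. */
    static void chunk_write_done_lazy(unsigned int chunk)
    {
            chunk_write_done(chunk);
            if (chunk_writes[chunk] == 0) {
                    dirty_bit[chunk] = 0;
                    bits_cleared_since_flush = 1;
            }
    }

    /* Run from a periodic (roughly 10 second) timer.  A synchronous 0 -> 1
     * flush in before_chunk_write() would also pick up the cleared bits, so
     * a fuller version would reset these flags there as well. */
    static void maybe_flush_bitmap(void)
    {
            time_t now = time(NULL);

            if (bits_cleared_since_flush && now - last_flush >= 10) {
                    write_bitmap_to_disk(dirty_bit);
                    bits_cleared_since_flush = 0;
                    last_flush = now;
            }
    }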

I have a third goal: minimizing the resync time for a new (replacement) disk. In
this case every chunk that has ever been written needs to be copied. I have been
thinking about controlling this with a second bit vector: when a chunk is written
for the first time, its bit is set, and it is never reset. When the new disk is
added, the counters corresponding to all of the bits set in this vector are
incremented and a resync (as above) is performed. As above, the vector update needs
to reach disk before the write to the chunk. The length of the resync is then
proportional to how much of the disk has been used.
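
A sketch of that second, never-reset vector, under the same assumptions as the
snippets above:

    /* "Ever written" bits: set on the first write to a chunk, never reset. */
    static uint8_t ever_written[NR_CHUNKS];

    /* Called before the data write; the vector update has to reach disk
     * before the chunk itself is written. */
    static void note_first_write(unsigned int chunk)
    {
            if (!ever_written[chunk]) {
                    ever_written[chunk] = 1;
                    write_bitmap_to_disk(ever_written);
            }
    }

    /* When a replacement disk is added, seed the counters from these bits
     * so only chunks that have ever been used are copied; resync time is
     * then proportional to how much of the disk has been written. */
    static void new_disk_added(void)
    {
            unsigned int chunk;

            for (chunk = 0; chunk < NR_CHUNKS; chunk++)
                    if (ever_written[chunk])
                            chunk_writes[chunk]++;
    }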

A hardware note: the system has two I/O assemblies, each of which contains a PCI
bus, a SCSI HBA, and three hot-pluggable SCSI disk slots. We are using 72GB disks,
so sets of two-mirror RAID1 sets are the most practical configuration.

Phil Cameron

Hi,

 You have two quite different, though admittedly similar, goals here.
  1/ quick resync when a recently removed drive is re-added.
  2/ quick resync after an unclean shutdown.

 I would treat these quite separately.

 For the latter I would have a bitmap which was written to disk
 whenever a bit was set and eventually after a bit was cleared.
 I would use a 16 bit counter for each 'chunk'.
 If the high bit is clear, it stores the number of outstanding writes
 on that chunk.
 If the high bit is set, it stores some sort of time stamp of when the
 number of outstanding writes hit zero.
 Every time you write the bitmap, you increment this timestamp.
 So when you schedule a write, you only need to write out the bitmap
 first if the 16 bit number of this chunk has the high bit set and has a
 timestamp which is different to the current one - which means that
 the bitmap has been written out with a zero in this slot.
 So:
   On write: if the high bit is clear, increment the counter;
             if the high bit is set and the timestamp matches, set the
                counter to 1 and set the bit in the bitmap;
             if the high bit is set and the timestamp doesn't match, set
                the bit in the bitmap, schedule a bitmap write, and set
                the counter to 1.
   On write complete:
        decrement the counter.  If it hits zero, set it to the timestamp
        with the high bit set, clear the bitmap bit, and schedule a
        bitmap writeout a few seconds hence.
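
 A rough, self-contained sketch of those transitions in C (the names, the
 sizes, the initialisation and the flush machinery are assumptions; only the
 state changes follow the description above):

    #include <stdint.h>

    #define HIGH_BIT  0x8000u
    #define NR_CHUNKS 4096

    /* slot[] holds either an outstanding-write count (high bit clear) or
     * HIGH_BIT | timestamp (high bit set).  Slots would be initialised to
     * HIGH_BIT | an old stamp so the first write to a chunk forces a
     * bitmap writeout. */
    static uint16_t slot[NR_CHUNKS];
    static uint16_t cur_stamp;          /* bumped on every bitmap write       */
    static uint8_t  bitmap[NR_CHUNKS];  /* in-memory bitmap, one byte per bit */

    static void write_bitmap(void)      /* placeholder for the real writeout  */
    {
            /* ... push the bitmap out to disk ... */
            cur_stamp = (cur_stamp + 1) & 0x7fffu;
    }

    static void schedule_lazy_bitmap_write(void)
    {
            /* placeholder: arrange for write_bitmap() a few seconds hence */
    }

    static void on_write(unsigned int chunk)
    {
            uint16_t v = slot[chunk];

            if (!(v & HIGH_BIT)) {
                    slot[chunk] = v + 1;    /* writes already outstanding   */
            } else if ((uint16_t)(v & 0x7fffu) == cur_stamp) {
                    bitmap[chunk] = 1;      /* on-disk bit is still set     */
                    slot[chunk] = 1;
            } else {
                    bitmap[chunk] = 1;      /* on-disk bit was written as 0 */
                    write_bitmap();         /* must complete before the data write */
                    slot[chunk] = 1;
            }
    }

    static void on_write_complete(unsigned int chunk)
    {
            if (--slot[chunk] == 0) {
                    slot[chunk] = HIGH_BIT | cur_stamp;
                    bitmap[chunk] = 0;
                    schedule_lazy_bitmap_write();
            }
    }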

 For the former I would just hold a separate bitmap, one bit per
 chunk.
 While all drives are working, this bitmap would be all zeros.
 Whenever a write fails to write to all drives, the relevant bit gets
 set.
 When a recently failed drive comes back online, we resync all chunks
 that have that bit set.
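
 And a sketch of that simpler bitmap for a recently failed drive (again the
 names and resync_chunk() are placeholders):

    #include <stdint.h>

    #define NR_CHUNKS 4096

    static uint8_t missed_write[NR_CHUNKS];  /* all zero while every drive works */

    /* A write failed to reach all drives: remember the chunk. */
    static void note_missed_write(unsigned int chunk)
    {
            missed_write[chunk] = 1;
    }

    static void resync_chunk(unsigned int chunk)
    {
            /* placeholder: copy this chunk from a good mirror to the
             * returning drive */
            (void)chunk;
    }

    /* The recently failed drive is back online: resync only marked chunks. */
    static void drive_readded(void)
    {
            unsigned int chunk;

            for (chunk = 0; chunk < NR_CHUNKS; chunk++)
                    if (missed_write[chunk]) {
                            resync_chunk(chunk);
                            missed_write[chunk] = 0;
                    }
    }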

 I don't see a particular need to sync the drives at device creation
 time, but I would like to keep the option of doing so.  I don't
 really care which behaviour is the default.

NeilBrown

