Neil,

This is very interesting, but I have a couple of questions on your suggestion, as follows:

1) Does a bit in the bitmap correspond to each chunk?

2) There is a counter (or timestamp) for each chunk on each mirror, right? Then how do we find out whether the timestamp matches or not? What do you compare it with?

3) I think "a bit in the bitmap has the same meaning as the MSB (high bit) of the counter". That means you only need a counter for each chunk? Am I wrong?

Bo

----- Original Message -----
From: "Philip Cameron" <pecameron@attbi.com>
To: <linux-raid@vger.kernel.org>; <neilb@cse.unsw.edu.au>
Sent: Thursday, January 02, 2003 5:22 PM
Subject: Re: Fast RAID 1 Resync

> Hi Neil,
>
> (Sorry if this is a repost. I had an error with returned mail.)
>
> Thanks for your comments.
>
> I also don't see a need to synchronize disks at mkraid time. It's nice to have
> identical disks, but not necessary as long as the result of reading a sector
> that has never been written is undefined. An option to do either approach can
> be added if there is a real need, but adding an option increases complexity,
> especially during test.
>
> I have been thinking of tracking writes to each chunk with a counter. The
> counters would be organized into a vector indexed by chunk number. The counter
> is incremented by the number of mirrors, including any currently unavailable
> mirrors, before the write starts. It is decremented as each write completes, so
> when all writes in a chunk are complete the counter returns to zero. If a
> mirror is missing, the counter will not return to zero (since one of the needed
> writes was not done).
>
> When a disk is pulled, the counters increment but don't return to zero. When
> the disk is reinserted, the resync needs to copy the chunks whose counters did
> not return to zero. When the resync of a chunk is complete, its counter is set
> to zero.
>
> To deal with recovery after a crash, I am thinking about using your approach.
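[The per-chunk counter scheme described above can be sketched in user-space C roughly as follows. This is a toy model, not md driver code; names such as `chunk_count`, `write_start`, and `needs_resync` are invented for illustration.]

```c
#include <stdint.h>

#define NCHUNKS 1024

/* One counter per chunk, indexed by chunk number. */
static uint32_t chunk_count[NCHUNKS];

/* Before a write starts: bump the counter once per mirror,
 * including mirrors that are currently unavailable. */
static void write_start(int chunk, int nmirrors)
{
    chunk_count[chunk] += nmirrors;
}

/* As each per-mirror write completes, decrement.  A write to a
 * missing mirror never completes, so its decrement never happens. */
static void write_done(int chunk)
{
    chunk_count[chunk]--;
}

/* A chunk needs resync iff its counter did not return to zero. */
static int needs_resync(int chunk)
{
    return chunk_count[chunk] != 0;
}

/* After resyncing a chunk, clear its counter. */
static void resync_done(int chunk)
{
    chunk_count[chunk] = 0;
}
```

[The point of the sketch is only the invariant: a counter returns to zero exactly when every mirror's write completed, so a non-zero counter marks a chunk that is out of sync on some mirror.]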
> Use a bit per counter and set the bit when the counter is non-zero. When a bit
> goes from 0 to 1, the updated bit vector is written out before the write to the
> chunk starts. On reboot after a crash, the bit vector from the selected mirror
> is used (the current mechanism is used to select the base disk). The counter is
> incremented for each chunk whose bit is set. After this, the resync described
> above can be performed. I don't see a need for timestamps beyond what is
> currently being done.
>
> The transitions from 1 to 0 are not all that important, since the worst case is
> syncing a chunk that is already mirrored. Overall performance can be improved
> by delaying the 1 to 0 transition for a few seconds. A lazy write of the bits
> can be done every 10 seconds or so if there are no 0 to 1 changes during that
> interval.
>
> I have a third goal: minimize the resync time for a new (replacement) disk. In
> this case all of the chunks that have ever been written need to be copied. I
> have been thinking about controlling this through a second bit vector. When a
> chunk is written for the first time, its bit is set and never reset. When the
> new disk is added, the counters corresponding to all of the set bits in this
> vector are incremented and a resync (as above) is performed. As above, the
> vector update needs to be done before the write to the chunk. The length of the
> resync is proportional to how much of the disk has been used.
>
> A hardware note: the system has two I/O assemblies, each of which contains a
> PCI bus, a SCSI HBA, and 3 hot-pluggable SCSI disk slots. We are using 72GB
> disks. Sets of 2-mirror RAID1 sets are the most practical configuration.
>
> Phil Cameron
>
> >> Hi,
> >>
> >> You have two quite different, though admittedly similar, goals here:
> >> 1/ quick resync when a recently removed drive is re-added.
> >> 2/ quick resync after an unclean shutdown.
> >>
> >> I would treat these quite separately.
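[Phil's bit-per-counter crash-recovery idea above, including the ordering rule (flush the bit vector before the chunk write on a 0-to-1 transition) and the lazy 1-to-0 clearing, might be modeled roughly like this. A hypothetical user-space sketch; `flush_bitmap` stands in for writing the bit vector to stable storage, and all names are invented.]

```c
#include <stdint.h>

#define NCHUNKS 64

static uint32_t count[NCHUNKS];    /* in-memory write counters    */
static uint8_t  bit[NCHUNKS];      /* in-memory dirty bits        */
static uint8_t  ondisk[NCHUNKS];   /* last bitmap written to disk */
static int      bitmap_writes;     /* how many flushes happened   */

/* Stand-in for writing the bit vector to the superblock area. */
static void flush_bitmap(void)
{
    for (int i = 0; i < NCHUNKS; i++)
        ondisk[i] = bit[i];
    bitmap_writes++;
}

/* Before writing a chunk: on a 0 -> 1 transition the updated bit
 * vector must reach disk before the chunk write starts. */
static void write_start(int chunk, int nmirrors)
{
    if (!bit[chunk]) {
        bit[chunk] = 1;
        flush_bitmap();     /* ordering: bitmap first, then data */
    }
    count[chunk] += nmirrors;
}

static void write_done(int chunk)
{
    if (--count[chunk] == 0)
        bit[chunk] = 0;     /* cleared lazily; flushed every ~10s */
}

/* Crash recovery: resync every chunk whose on-disk bit is set. */
static int chunk_dirty_after_crash(int chunk)
{
    return ondisk[chunk];
}
```

[Because the 1-to-0 transition is only flushed lazily, a crash can leave a bit set for a chunk that is actually in sync; as the mail says, the worst case is just resyncing an already-mirrored chunk.]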
> >>
> >> For the latter I would have a bitmap which was written to disk
> >> whenever a bit was set, and eventually after a bit was cleared.
> >> I would use a 16-bit counter for each 'chunk'.
> >> If the high bit is clear, it stores the number of outstanding writes
> >> on that chunk.
> >> If the high bit is set, it stores some sort of timestamp of when the
> >> number of outstanding writes hit zero.
> >> Every time you write the bitmap, you increment this timestamp.
> >> So when you schedule a write, you only need to write out the bitmap
> >> first if the 16-bit number for this chunk has the high bit set and
> >> has a timestamp which is different from the current one - which means
> >> that the bitmap has been written out with a zero in this slot.
> >> So:
> >>   On write: if the high bit is clear, increment the counter;
> >>             if the high bit is set and the timestamp matches, set the
> >>             counter to 1 and set the bit in the bitmap;
> >>             if the high bit is set and the timestamp doesn't match,
> >>             set the bit in the bitmap, schedule a bitmap write, and
> >>             set the counter to 1.
> >>   On write completion:
> >>             decrement the counter. If it hits zero, set it to the
> >>             timestamp with the high bit set, clear the bitmap bit,
> >>             and schedule a bitmap writeout a few seconds hence.
> >>
> >> For the former I would just hold a separate bitmap, one bit per
> >> chunk.
> >> While all drives are working, this bitmap would be all zeros.
> >> Whenever a write fails to write to all drives, the relevant bit gets
> >> set.
> >> When a recently failed drive comes back online, we resync all chunks
> >> that have that bit set.
> >>
> >> I don't see a particular need to sync the drives at device creation
> >> time, but I would like to keep the option of doing so. I don't
> >> really care which behaviour is the default.
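[Neil's 16-bit counter/timestamp scheme above can be sketched as follows. This is a hypothetical user-space model: initialization and the delayed bitmap writeout are simplified, and all names (`on_write`, `stamp`, etc.) are invented for illustration.]

```c
#include <stdint.h>

#define NCHUNKS  64
#define HIGHBIT  0x8000u

/* One 16-bit word per chunk.  High bit clear: count of outstanding
 * writes.  High bit set: timestamp of when the count hit zero. */
static uint16_t word[NCHUNKS];
static uint16_t stamp;             /* current timestamp, high bit set */
static uint8_t  bit[NCHUNKS];      /* in-memory dirty bitmap          */
static int      bitmap_writes;

static void init(void)
{
    for (int i = 0; i < NCHUNKS; i++)
        word[i] = HIGHBIT;         /* stale timestamp for every chunk */
    stamp = HIGHBIT | 1;           /* current stamp is already ahead  */
}

/* Every bitmap writeout bumps the timestamp (wraps within 15 bits). */
static void write_bitmap(void)
{
    stamp = HIGHBIT | ((stamp + 1) & 0x7fff);
    bitmap_writes++;
}

/* Returns 1 if the bitmap had to be written before the data write. */
static int on_write(int chunk)
{
    uint16_t w = word[chunk];

    if (!(w & HIGHBIT)) {          /* plain counter: just bump it */
        word[chunk] = w + 1;
        return 0;
    }
    if (w == stamp) {              /* stamp matches: the on-disk bitmap
                                    * still shows this bit set */
        word[chunk] = 1;
        bit[chunk] = 1;
        return 0;
    }
    /* Stamp differs: disk may hold a zero here, write bitmap first. */
    bit[chunk] = 1;
    write_bitmap();
    word[chunk] = 1;
    return 1;
}

static void on_write_complete(int chunk)
{
    if (--word[chunk] == 0) {
        word[chunk] = stamp;       /* record when we hit zero; a lazy
                                    * writeout would be scheduled here */
        bit[chunk] = 0;
    }
}
```

[The payoff is that a burst of writes to the same chunk costs at most one synchronous bitmap write: once the bit is known to be set on disk, later writes in the same bitmap generation skip the flush.]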
> >>
> >> NeilBrown
> -
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html