Hi all,

I hope I won't bore you too much, but I've had these ideas on my mind for a couple of days now, so I just wanted to share them with you. It is not a long document, so I guess it would take about 10 minutes to read. Sorry for my sometimes obscure English, ya know, I did the best I could... I welcome any comments, criticism, suggestions, etc., especially because this is really a first draft. I would of course like to implement what I'm describing here, one day or another...

Best regards,
F.-E.B.
Generic reshape model for MD

1. Introduction

The aim of this document is to present a concept of generic online md level migration, such as raid1 to raid5, raid growth, and so on. Regardless of implementation issues, the only migrations that are strictly impossible are those where the data (and its replication) will not fit on the disks at the end of the migration. For example, raid5 to raid1 with the same number of disks is strictly impossible. The aim of generic migration is to make every migration possible and easily implementable (if not straightforward), including the ones that imply a growth, provided that enough spare disks are supplied.

2. Proposal

2.1. Initial Concept

A migration always implies a total read/re-write of the whole disks, just as a resync or a growth would do. Needless to say, the raid levels are always aware of their own layout, and know how to read or write data to disk with regard to that layout. So, in short, the concept is: let's benefit from that awareness. Let's add a layer on top of the raidX block layer which is responsible for reading a certain amount of data using the previous raid level/layout (feeding the window), and for writing this window back using the target raid level/layout. Both reads and writes use the corresponding level implementation, just as regular use would do. So don't forget: the window is read and written at the md level.

2.2. Layout shape and calculations

(Note: linear md setups are not considered here.)

At this point, we should distinguish three different types of migrations, in terms of how data moves at the disk level. Considering only raid1, 4, 5 and 6, it is clear that we can find an exact ratio M between the size of the window (in sectors) and the corresponding number of sectors involved on each disk. For raid1, M = 1; for raid4 and raid5, M = k - 1 (k being the total number of disks in a non-degraded setup); and for raid6, M = k - 2. In a migration context, we have to know M for the previous setup and M' for the target setup. Let's call W the size of the window. A convenient window size is a multiple of both M and M', such that each window read involves W/M sectors on each disk, and each window write involves W/M' sectors on each disk.

First, a little bit of notation:
- s(n,p,q) denotes sectors p to p+q on the nth disk before the migration.
- s'(n,p,q) denotes sectors p to p+q on the nth disk after the migration.
- w(m) denotes the window starting at md sector m, of size W.
- k is the number of disks before the migration, k' the number of disks after it.

w(m) = s(0,p,W/M) ... s(k,p,W/M) means that the window at sector m is built using sectors p to p+W/M of disks 0...k. Not all disks may be necessary to build the window, but it is asserted that we *can* build the window knowing these sectors. (Note: don't misread this ugly notation, it is not concatenation.)

In the general case, a migration step consists of:
1. Building the window w(m) = s(0,p,W/M) ... s(k,p,W/M)
2. Writing the window back: s'(0,p',W/M') ... s'(k',p',W/M') = w(m)

So m = M * p = M' * p'. It is clear here that we should ask the level-specific code for its M value, given the number of disks we want to use for the migration.
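Just to make the arithmetic above concrete, here is a small userspace C sketch (this is not md code; ratio_for_level, raid_level and so on are names made up for the example) that computes M for a level and disk count, picks W as a multiple of both M and M', and maps a window start sector m to the per-disk offsets p and p'. A real implementation would presumably also align W to the chunk size.

#include <stdio.h>
#include <stdint.h>

typedef uint64_t sector_t;

enum raid_level { RAID1, RAID4, RAID5, RAID6 };

/* M: md sectors stored per sector used on each disk. */
static unsigned int ratio_for_level(enum raid_level level, unsigned int disks)
{
    switch (level) {
    case RAID1:
        return 1;             /* every disk holds a full copy */
    case RAID4:
    case RAID5:
        return disks - 1;     /* one disk's worth of parity   */
    case RAID6:
        return disks - 2;     /* two disks' worth of parity   */
    }
    return 0;
}

static unsigned int gcd(unsigned int a, unsigned int b)
{
    while (b) {
        unsigned int t = a % b;
        a = b;
        b = t;
    }
    return a;
}

int main(void)
{
    /* Example: raid5 on 4 disks migrating to raid6 on 5 disks. */
    unsigned int M  = ratio_for_level(RAID5, 4);    /* M  = 3 */
    unsigned int Mp = ratio_for_level(RAID6, 5);    /* M' = 3 */

    /* W must be a multiple of both M and M': take a multiple of their lcm. */
    unsigned int W = (M / gcd(M, Mp)) * Mp * 128;

    sector_t m  = (sector_t)W * 10;   /* start of the 11th window          */
    sector_t p  = m / M;              /* per-disk offset, previous layout  */
    sector_t pp = m / Mp;             /* per-disk offset, target layout    */

    printf("M=%u M'=%u W=%u m=%llu -> p=%llu p'=%llu\n",
           M, Mp, W, (unsigned long long)m,
           (unsigned long long)p, (unsigned long long)pp);
    return 0;
}

(With these numbers M = M' = 3, so p = p': that is exactly the 'stable migration' case described below.)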
2.3. Different types of migrations

As a result, we can have three different kinds of migration:

- Stable migration: the data before and after the migration sits at the same place on disk, i.e. M = M' and, for each m, p' = p. So data at s(0,p) ... s(k,p) is copied to s'(0,p) ... s'(k',p). A good example? Raid5 to raid6 conversion with one additional disk: k' = k + 1, M = k - 1, M' = k' - 2 = k - 1 = M. A consequence of M = M' is that the md device size does not change after the migration.

- Downsize migration: M' > M, and p' < p. An example? Raid6 to raid5 conversion with the same number of disks: M = k - 2 and M' = k - 1, so M' > M. Indeed, when the migration finishes, space will remain at the end of the disks, so the md device size will increase and an extra resync of the remaining part will be needed.

- Upsize migration: p' > p. This is only possible when the underlying (partition) size is greater than the size actually used. This can be handled as part of a 'standard' growth and is strictly off-topic here.

2.4. Implementation in short

Well, I will NOT underestimate the work to accomplish by saying 'easy, just change that and code this... piece of cake!'. The basic concept is: rely on the current implementation of the involved raid levels to do the job. That is, using request_queue and bio, read the window and write it back. That also implies we have two concurrent raid level drivers on top of the same drives.

During the migration process, the block device at the md level is split in two parts: the part that has already migrated (and is written on disk using the target layout), and the part that is still to be migrated (which is stored on disk using the previous layout). This border moves during the process, starting at the beginning of the md device and finishing at the end of the previous layout. (In the case of a downsize migration, a resync of the remaining part must be performed after the migration of the previous content.)

Let's call current_sector the start sector of the window currently being copied. current_sector starts at 0 and ends at the previous layout size; it can only grow, because we only consider stable and downsize migrations. Let's assume that the mddev->queue is empty and clean at the beginning of the migration, that each level-specific cache is cleared, and that we are able to strictly forbid the underlying level drivers from reading or writing outside of their own side of the border. That is, the previous driver will neither read nor write sectors after current_sector (and the corresponding sectors on disk), and the target driver will neither read nor write before it. That also implies inhibiting the level-specific resync threads, and so on.

So, an idea of the implementation could be a wrapper around the make_request_fn set for the queue of the mddev, plus a migration thread. The make_request_fn of the migration is responsible for choosing the relevant level driver for the (userland) requests it receives: if the request is before current_sector, call the target level driver; if it is after the window, call the previous one; if it falls inside the window, delay it. Of course, overlapping issues will appear here. The migration thread is responsible for fetching the window content, incrementing current_sector, and writing the window with the new layout, locking and unlocking as necessary.
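To illustrate the dispatching this wrapper would have to do, here is a purely illustrative userspace C sketch (not an actual make_request_fn; classify_request and the route values are invented names). It only shows the decision itself; queueing the delayed requests and splitting the ones that straddle the border are left aside.

#include <stdio.h>
#include <stdint.h>

typedef uint64_t sector_t;

enum route {
    ROUTE_TARGET,    /* already migrated: use the target level driver   */
    ROUTE_PREVIOUS,  /* not yet migrated: use the previous level driver */
    ROUTE_DELAY,     /* touches the window being copied: hold it back   */
};

/*
 * Decide which driver should serve a request covering md sectors
 * [start, start + len), given that the window
 * [current_sector, current_sector + window_size) is being migrated.
 */
static enum route classify_request(sector_t start, sector_t len,
                                   sector_t current_sector,
                                   sector_t window_size)
{
    sector_t end = start + len;

    if (end <= current_sector)
        return ROUTE_TARGET;
    if (start >= current_sector + window_size)
        return ROUTE_PREVIOUS;
    /* The request overlaps the window (or straddles the border):
     * delay it until the migration thread has finished this window
     * and moved current_sector forward. */
    return ROUTE_DELAY;
}

int main(void)
{
    sector_t cur = 4096, win = 384;

    printf("%d %d %d\n",
           classify_request(0,    256, cur, win),   /* ROUTE_TARGET   */
           classify_request(8192, 256, cur, win),   /* ROUTE_PREVIOUS */
           classify_request(4000, 256, cur, win));  /* ROUTE_DELAY    */
    return 0;
}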
2.5. Journaling and failure recovery

Consider these points:

- Before the migration starts, we record in the superblocks all the setup information needed for the target raid level and layout. This includes the disk roles in the target layout, the window size, and so on.

- During the migration, we regularly write the current status of the migration (the current_sector) in (some of?) the superblocks.

This way, we could handle crash recovery of migrations, at least for downsize migrations. Stable migrations are a bit more complex to handle, because we write data exactly where we read it. This means that after a crash we don't know whether:

A: the window was being read;
B: the window was being written;
C: the window was completely written.

States A and C are safe, at least if we can determine whether we had started writing the window or not. State B is unsafe: we started writing data but did not finish, so the sectors hold a mix of the old and the new layout. Several options could work here:

- use another device/journal (nvram?) to back up the window before writing it;
- write all data at a certain offset, so that a window is never rewritten exactly where it was read. This would imply a change in all the underlying drivers...

(A rough sketch of such a superblock record, together with the migration thread loop of 2.4, is given at the end of this note.)

3. Open Issues

As we rely on the level-specific code for reading and writing, we benefit from its own redundancy implementation. So we could even migrate a degraded array.
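Finally, to tie 2.4 and 2.5 together, one last hedged sketch: a hypothetical superblock migration record plus the outline of the migration thread loop. Nothing here is existing md code; every field and helper (read_window_previous_layout, write_window_target_layout, record_progress) is invented for illustration, and a real implementation would obviously have to deal with locking, bio plumbing and the stable-migration journaling discussed above.

#include <stdint.h>

typedef uint64_t sector_t;

/* Hypothetical migration record kept in the superblocks: the target
 * setup is written once before the migration starts, current_sector
 * is refreshed regularly as the migration progresses. */
struct migration_record {
    uint32_t target_level;    /* e.g. 6 for raid6                    */
    uint32_t target_layout;
    uint32_t target_disks;    /* k' (disk roles would be stored too) */
    uint32_t window_sectors;  /* W, a multiple of M and M'           */
    sector_t current_sector;  /* start of the window being migrated  */
};

/* Stand-ins for the real work: in md these would issue bios through
 * the previous and target level drivers and rewrite the superblocks. */
static int read_window_previous_layout(sector_t s, void *buf, uint32_t n)
{
    (void)s; (void)buf; (void)n; return 0;
}
static int write_window_target_layout(sector_t s, const void *buf, uint32_t n)
{
    (void)s; (void)buf; (void)n; return 0;
}
static int record_progress(const struct migration_record *rec)
{
    (void)rec; return 0;
}

/* Outline of the migration thread: walk the previous layout window by
 * window, rewriting each window with the target layout and journaling
 * the progress. */
static int migration_thread(struct migration_record *rec,
                            sector_t previous_layout_size, void *window_buf)
{
    while (rec->current_sector < previous_layout_size) {
        if (read_window_previous_layout(rec->current_sector, window_buf,
                                        rec->window_sectors))
            return -1;
        /* For a stable migration the window would have to be journaled
         * (nvram, offset write, ...) before this write, see 2.5. */
        if (write_window_target_layout(rec->current_sector, window_buf,
                                       rec->window_sectors))
            return -1;
        rec->current_sector += rec->window_sectors;
        if (record_progress(rec))
            return -1;
    }
    /* Downsize migration: the space remaining at the end of the disks
     * still needs an extra resync afterwards, not shown here. */
    return 0;
}

int main(void)
{
    static char buf[384 * 512];              /* one window of 384 sectors */
    struct migration_record rec = {
        .target_level = 6, .target_layout = 0,
        .target_disks = 5, .window_sectors = 384,
        .current_sector = 0,
    };
    return migration_thread(&rec, 384ULL * 1000, buf);
}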