On Fri, 26 Nov 2010 09:03:51 +0100 Adam Kwolek <adam.kwolek@xxxxxxxxx> wrote:

> This patch series for mdadm (combining 3 previous series into one) introduces the following features:
> - Freeze array/container and new reshape vectors: patches 0001 to 0015
>   mdadm devel 3.2 contains patches 0001 to 0013 already; patches 0014 and 0016 fix 2 problems in this functionality
> - Takeover: patches 0016 to 0017
> - Online Capacity Expansion (OLCE): patches 0018 to 0036
> - Checkpointing: patches 0037 to 0045
> - Migrations: patches 0045 to 0053
>   1. raid0 to raid5: patch 0051
>   2. raid5 to raid0: patch 0052
>   3. chunk size migration: patch 0053
>
> Patches are for mdadm 3.1.4, and Neil's feedback for the first 6 OLCE patches is included.
> There should be no patch corruption problem now, as it is sent directly from stgit (not Outlook).
>
> For checkpointing, the md patch "md: raid5: update suspend_hi during reshape" is also required (sent before).

I think I've decided that I don't want to apply this patch to raid5.

I discussed this with Dan Williams at the plumbers conference and he took
notes, so hopefully he can correct anything in the following.

I think it was me that suggested this patch in the first place, so it
probably seemed like a good idea at the time. But I no longer think so.

This is how I think it should work - which should probably go in
external-reshape-design.txt.

An important principle is that everything works much like it currently
does for the native metadata case, except that some of the work normally
performed by the kernel is now performed by mdmon. So the only changes
needed for mdadm to work with external metadata in general involve
communicating directly with mdmon where it would normally only communicate
with the kernel. (Of course there will be other changes required to mdadm
to deal with the specifics of reshaping imsm and general container-based
metadata.) Also, the atomicity provided by the kernel may not be
implicitly available to the kernel+mdmon pairing, so mdadm may get
involved in negotiating the required atomicity.

Just to be explicit, we are talking here about a 'reshape' which requires
restriping the array, moving data around and taking a long time.
Reshapes which are atomic, or which just require a resync, are much
simpler than this.

1/ mdadm freezes the array so that no recovery or reshape can start.

2/ mdadm sets sync_max to 0 so that even when the array is unfrozen, no
   data will be relocated. It also sets suspend_lo and suspend_hi to zero.

3/ mdadm tells the kernel about the requested reshape, setting some or
   all of chunk_size, layout, level, raid_disks (and later, data_offset
   for each device).

4/ mdadm checks that mdmon has noticed the changes and has updated the
   metadata to show a reshape-in-progress (ping_monitor).

5/ mdadm unfreezes the array for mdmon (changing the '-' in
   metadata_version back to '/') and calls ping_monitor.

6/ mdmon assigns spares as appropriate and tells the kernel which slot to
   use for each. This requires a kernel change. The slot number will be
   stored in saved_raid_disk. ping_monitor doesn't complete until the
   spares have been assigned.

7/ mdadm asks the kernel to start the reshape (echo reshape >
   sync_action). This causes md_check_recovery to call
   remove_and_add_spares, which will add the chosen spares to the
   required slots and will create the reshape thread. That thread will
   not actually do anything yet as sync_max is still 0.

8/ Now we loop, performing backups, reshaping data, and updating the
   metadata. This proceeds in a 'double-buffered' fashion where we back
   up one section while the previous section is being reshaped.

   8a/ mdadm sets suspend_hi to a larger number. This blocks until
       intervening IO is flushed.
   8b/ mdadm makes a backup copy of the data up to the new suspend_hi.
   8c/ mdadm updates sync_max to match suspend_hi.
   8d/ The kernel starts reshaping data and periodically signals progress
       through sync_completed.
   8e/ mdmon notices sync_completed changing and updates the metadata to
       record how far the reshape has progressed.
   8f/ mdadm notices sync_completed changing and, when it passes the end
       of the older of the two sections being worked on, it uses
       ping_monitor to ensure the metadata is up-to-date, then moves
       suspend_lo to the beginning of the next section and goes back
       to 8a.

9/ When sync_completed reaches the end of the array, mdmon will notice
   and update the metadata to show that the reshape has finished, mdadm
   will set both suspend_lo and suspend_hi to beyond the end of the
   array, and all is done.

In the case where the number of data devices is changing there are large
periods of time when no backup of data is needed. In this case mdmon
still needs to update the metadata from time to time, and the kernel
needs to be made to wait for that update. This is done with sync_max.
So in those cases the primary steps in the above become just 8c, 8d, 8e
and 8f, and suspend_lo/suspend_hi aren't changed.

It is tempting to have mdmon update sync_max, as then mdadm would not be
needed at all when no backup is happening. I think that is the path of
reasoning I followed previously which led to having the kernel update
suspend_hi. But I don't think that is a good design now. Sometimes it
really has to be mdadm updating sync_max, so it should always be mdadm
updating sync_max.

It should be a reasonably simple change to your code to follow this
pattern. If the only problem that I find in any of your patches is that
they don't quite follow this pattern properly, I will happily fix them to
follow the pattern and apply them with the fix.
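To make the sysfs side of steps 1 to 7 concrete, here is a very rough
sketch in plain C. It is not mdadm code: md_attr_write() and the example
reshape values are purely illustrative, and the freeze/unfreeze and
ping_monitor interactions are only indicated in comments because they go
through mdmon rather than through simple sysfs writes.

/*
 * Illustrative sketch only, not mdadm code: the setup phase (steps 1-7)
 * expressed as raw writes to /sys/block/<md>/md/.  Values are examples.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>

static int md_attr_write(const char *md, const char *attr, const char *val)
{
    char path[256];
    ssize_t n;
    int fd;

    snprintf(path, sizeof(path), "/sys/block/%s/md/%s", md, attr);
    fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;
    n = write(fd, val, strlen(val));
    close(fd);
    return n < 0 ? -1 : 0;
}

int start_external_reshape(const char *md)
{
    /* 1/ freeze the array so no recovery or reshape can start yet
     * (for external metadata this is the '-' marker in metadata_version,
     * handled together with mdmon; not shown here) */

    /* 2/ nothing may be relocated, and nothing is suspended yet */
    md_attr_write(md, "sync_max", "0");
    md_attr_write(md, "suspend_lo", "0");
    md_attr_write(md, "suspend_hi", "0");

    /* 3/ describe the requested reshape (example values only;
     * layout and data_offset would be set the same way) */
    md_attr_write(md, "chunk_size", "524288");
    md_attr_write(md, "level", "raid5");
    md_attr_write(md, "raid_disks", "5");

    /* 4/ + 5/ ping_monitor() (mdadm helper, not shown) until mdmon has
     * recorded reshape-in-progress, then unfreeze for mdmon by turning
     * the '-' in metadata_version back into '/', and ping again.
     * 6/ mdmon assigns spares; ping_monitor() returns once that is done. */

    /* 7/ the equivalent of "echo reshape > sync_action" */
    return md_attr_write(md, "sync_action", "reshape");
}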
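And an equally rough sketch of the per-section loop (steps 8a to 8f,
plus 9), reusing the md_attr_write() helper from the setup sketch.
backup_section() and ping_monitor_stub() are placeholders for mdadm's
real backup machinery and its ping_monitor() call, the section size is
arbitrary, and real code would wait on the sync_completed sysfs file
rather than sleep in a loop; all values are sector counts.

/*
 * Illustrative sketch only: the per-section double-buffered loop
 * (steps 8a-8f and 9).
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <inttypes.h>

/* from the setup sketch */
extern int md_attr_write(const char *md, const char *attr, const char *val);

static int md_attr_write_u64(const char *md, const char *attr, uint64_t v)
{
    char buf[32];

    snprintf(buf, sizeof(buf), "%" PRIu64, v);
    return md_attr_write(md, attr, buf);
}

static uint64_t md_sync_completed(const char *md)
{
    char path[256], buf[64];
    uint64_t done = 0;
    FILE *f;

    snprintf(path, sizeof(path), "/sys/block/%s/md/sync_completed", md);
    f = fopen(path, "r");
    if (f && fgets(buf, sizeof(buf), f))
        done = strtoull(buf, NULL, 10);    /* file reads "<done> / <max>" */
    if (f)
        fclose(f);
    return done;
}

/* placeholders for mdadm's real backup code and its ping_monitor() call */
static void backup_section(uint64_t start, uint64_t len) { (void)start; (void)len; }
static void ping_monitor_stub(void) { }

void reshape_loop(const char *md, uint64_t array_size, uint64_t section)
{
    uint64_t hi = 0;

    while (hi < array_size) {
        uint64_t prev = hi;

        /* 8a/ extend the suspended region; this blocks until pending IO
         * below the new suspend_hi has been flushed */
        hi = (prev + section > array_size) ? array_size : prev + section;
        md_attr_write_u64(md, "suspend_hi", hi);

        /* 8b/ back up the newly suspended section */
        backup_section(prev, hi - prev);

        /* 8c/ allow the kernel to reshape up to the backed-up point */
        md_attr_write_u64(md, "sync_max", hi);

        /* 8d/ the kernel reshapes and advances sync_completed;
         * 8e/ mdmon sees that and records progress in the metadata */

        /* 8f/ wait until sync_completed passes the end of the older of
         * the two sections in flight, make sure the metadata is up to
         * date, then move suspend_lo forward and go round again */
        while (md_sync_completed(md) < prev)
            usleep(100000);    /* real code would poll the sysfs file */
        ping_monitor_stub();
        md_attr_write_u64(md, "suspend_lo", prev);
    }

    /* 9/ when sync_completed reaches the end of the array mdmon marks
     * the reshape finished in the metadata; mdadm then moves
     * suspend_lo/suspend_hi past the end of the array */
    md_attr_write_u64(md, "suspend_lo", array_size);
    md_attr_write_u64(md, "suspend_hi", array_size);
}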
> New vectors (introduced by Dan Williams), reshape_super() and
> manage_reshape(), are used in the whole process.
>
> In the next step I'll rebase it to mdadm devel 3.2; meanwhile Krzysztof
> Wojcik will prepare additional fixes for raid10<->raid0 takeover.
>
> I think that a few patches can be taken into devel 3.2 at this moment,
> i.e.:
> 0014-FIX-Cannot-exit-monitor-after-takeover.patch
> 0015-FIX-Unfreeze-not-only-container-for-external-metada.patch
> 0016-Add-takeover-support-for-external-meta.patch
> 0018-Treat-feature-as-experimental.patch
> 0033-Prepare-and-free-fdlist-in-functions.patch
> 0034-Compute-backup-blocks-in-function.patch

I would really rather take as much as is ready. The fewer times I have
to review a patch, the better. So if a patch looks close enough that I
can apply it as-is, or with just a few fixes, then I will. That way you
only need to resend the patches that need serious work.

NeilBrown