On Wed, 5 Aug 2015 14:34:21 -0700 Shaohua Li <shli@xxxxxx> wrote: > On Wed, Aug 05, 2015 at 01:43:30PM +1000, NeilBrown wrote: > > On Wed, 29 Jul 2015 17:38:44 -0700 Shaohua Li <shli@xxxxxx> wrote: > > > > > This is the reclaim support for raid5 log. A stripe write will have > > > following steps: > > > > > > 1. reconstruct the stripe, read data/calculate parity. ops_run_io > > > prepares to write data/parity to raid disks > > > 2. hijack ops_run_io. stripe data/parity is appending to log disk > > > 3. flush log disk cache > > > 4. ops_run_io run again and do normal operation. stripe data/parity is > > > written in raid array disks. raid core can return io to upper layer. > > > 5. flush cache of all raid array disks > > > 6. update super block > > > 7. log disk space used by the stripe can be reused > > > > > > In practice, several stripes consist of an io_unit and we will batch > > > several io_unit in different steps, but the whole process doesn't > > > change. > > > > > > It's possible io return just after data/parity hit log disk, but then > > > read IO will need read from log disk. For simplicity, IO return happens > > > at step 4, where read IO can directly read from raid disks. > > > > > > Currently reclaim run every minute or out of space. Reclaim is just to > > > free log disk spaces, it doesn't impact data consistency. > > > > Having arbitrary times lines "every minute" is a warning sign. > > "As soon as possible" and "Just it time" can both make sense easily. > > "every minute" needs more justification. > > > > I'll probably say more when I find the code. > > The idea is if we reclaim periodically, recovery could scan less log > space. It's insane recovery scans a 1T disk. As I said this is just to > free disk spaces. It's not a signal we will lose data in minute > interval. I can change the relaim to run every 1G reclaimable space for > example. There seem to be two issues here and I might be confusing them. Firstly there is the question of when a stripe gets written back to the array. Once the data is safe in the log this doesn't have to happen in any great hurry, but I suspect it should still happen sooner rather than later. Presumably as soon as data/parity of a stripe is safe in the log, that stripe will be scheduled to be written to the array - is that correct? As these writes-to-the-array complete the counter in the io_unit will decrease. when it reaches zero the io_unit can be freed and the recovery_offset in the superblock can, potentially be updated. Secondly there is the question of how often the superblock is updated. As you say; delaying the updates indefinitely could lead to a recovery having to examine a very large part of the log - maybe more than necessary (though if that might be a problem, the simple solution is to use a smaller log). I would probably feel most comfortable scheduling a superblock update whenever the amount of log space that it would reclaim exceeds 1/4 of the log size. That should be often enough without imposing a completely arbitrary number. Make sense? thanks, NeilBrown -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html