On Wed, Aug 05, 2015 at 01:43:30PM +1000, NeilBrown wrote:
> On Wed, 29 Jul 2015 17:38:44 -0700 Shaohua Li <shli@xxxxxx> wrote:
>
> > This is the reclaim support for raid5 log. A stripe write will have
> > the following steps:
> >
> > 1. reconstruct the stripe, read data/calculate parity. ops_run_io
> >    prepares to write data/parity to the raid disks
> > 2. hijack ops_run_io. stripe data/parity is appended to the log disk
> > 3. flush the log disk cache
> > 4. ops_run_io runs again and does the normal operation. stripe
> >    data/parity is written to the raid array disks. the raid core can
> >    return the io to the upper layer.
> > 5. flush the cache of all raid array disks
> > 6. update the super block
> > 7. log disk space used by the stripe can be reused
> >
> > In practice, several stripes make up an io_unit, and we batch several
> > io_units at the different steps, but the whole process doesn't change.
> >
> > It's possible to return the io just after data/parity hit the log disk,
> > but then read IO would need to read from the log disk. For simplicity,
> > IO return happens at step 4, where read IO can directly read from the
> > raid disks.
> >
> > Currently reclaim runs every minute or when out of space. Reclaim only
> > frees log disk space; it doesn't impact data consistency.
>
> Having arbitrary time limits like "every minute" is a warning sign.
> "As soon as possible" and "just in time" can both make sense easily.
> "every minute" needs more justification.
>
> I'll probably say more when I find the code.

The idea is that if we reclaim periodically, recovery has to scan less
log space. It's insane for recovery to scan a 1T disk. As I said, this
is just to free disk space; it's not a signal that we will lose data
within a one-minute interval. I can change the reclaim to run every 1G
of reclaimable space, for example.

> > +        r5l_move_io_unit_list(&log->running_ios, &log->io_end_ios,
> > +                IO_UNIT_IO_END);
> > +        r5l_move_io_unit_list(&log->io_end_ios, &log->stripe_end_ios,
> > +                IO_UNIT_STRIPE_END);
> > +        r5l_compress_stripe_end_list(log);
> > +        run_stripe = !list_empty(&log->io_end_ios);
> > +        spin_unlock(&log->io_list_lock);
> > +
> > +        if (!run_stripe)
> > +                return;
> > +
> > +        blkdev_issue_flush(r5l_bdev(log), GFP_NOIO, NULL);
> > +
> > +        spin_lock(&log->io_list_lock);
> > +        list_for_each_entry(io, &log->io_end_ios, log_sibling) {
> > +                if (io->state >= IO_UNIT_STRIPE_START)
> > +                        continue;
> > +                r5l_set_io_unit_state(io, IO_UNIT_STRIPE_START);
> > +
> > +                while (!list_empty(&io->stripe_list)) {
> > +                        sh = list_first_entry(&io->stripe_list,
> > +                                struct stripe_head, log_list);
> > +                        list_del_init(&sh->log_list);
> > +                        set_bit(STRIPE_HANDLE, &sh->state);
> > +                        release_stripe(sh);
>
> This code makes me a bit nervous. handle_stripe() can potentially be
> called on any stripe at any time.
> Here you are scheduling a call to handle_stripe() without obviously
> changing the state of the stripe. So whatever is going to happen now
> could potentially have happened before... is that safe?

I'm not fully sure, but it's ok in my test.
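If it helps, I can make the state change explicit before the release,
so handle_stripe() keys off a real transition. Just a sketch, where
STRIPE_LOG_FLUSHED is a hypothetical flag that is not in the current
patch:

                while (!list_empty(&io->stripe_list)) {
                        sh = list_first_entry(&io->stripe_list,
                                struct stripe_head, log_list);
                        list_del_init(&sh->log_list);
                        /* record that log writeback finished for this
                         * stripe; handle_stripe() can then test this
                         * bit instead of relying on implicit state */
                        set_bit(STRIPE_LOG_FLUSHED, &sh->state);
                        set_bit(STRIPE_HANDLE, &sh->state);
                        release_stripe(sh);
                }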
> > + * move proper io_unit to reclaim list. We should not change the order.
> > + * reclaimable/unreclaimable io_units can be mixed in the list; we
> > + * shouldn't reuse the space of an unreclaimable io_unit
> > + */
> > +        while (1) {
> > +                r5l_move_io_unit_list(&log->running_ios, &log->io_end_ios,
> > +                        IO_UNIT_IO_END);
> > +                r5l_move_io_unit_list(&log->io_end_ios, &log->stripe_end_ios,
> > +                        IO_UNIT_STRIPE_END);
> > +                while (!list_empty(&log->stripe_end_ios)) {
> > +                        io = list_first_entry(&log->stripe_end_ios,
> > +                                struct r5l_io_unit, log_sibling);
> > +                        list_move_tail(&io->log_sibling, &list);
> > +                        free += (io->log_end - io->log_start +
> > +                                log->total_blocks) % log->total_blocks;
> > +                }
> > +
> > +                if (free >= reclaim_target ||
> > +                    (list_empty(&log->running_ios) &&
> > +                     list_empty(&log->io_end_ios) &&
> > +                     list_empty(&log->stripe_end_ios)))
> > +                        break;
> > +
> > +                if (!list_empty(&log->io_end_ios)) {
> > +                        io = list_first_entry(&log->io_end_ios,
> > +                                struct r5l_io_unit, log_sibling);
> > +                        spin_unlock(&log->io_list_lock);
> > +                        /* nobody else can delete the io, we are safe */
> > +                        r5l_kick_io_unit(log, io);
> > +                        spin_lock(&log->io_list_lock);
> > +                        continue;
> > +                }
> > +
> > +                if (!list_empty(&log->running_ios)) {
> > +                        io = list_first_entry(&log->running_ios,
> > +                                struct r5l_io_unit, log_sibling);
> > +                        spin_unlock(&log->io_list_lock);
> > +                        /* nobody else can delete the io, we are safe */
> > +                        r5l_kick_io_unit(log, io);
> > +                        spin_lock(&log->io_list_lock);
> > +                        continue;
> > +                }
> > +        }
> > +        spin_unlock(&log->io_list_lock);
>
> Well, here we are with the important parts of the reclaim code...
>
> The main result of the above section is to possibly call
> r5l_flush_stripe_to_raid() a few times, and to wait until 'list'
> contains enough io_units to satisfy the requirement.
>
> As raid5d already calls r5l_flush_stripe_to_raid - which it really must
> to make sure that writes complete quickly - this really comes down to
> some book-keeping and some waiting.
> Book-keeping can be done as changes happen, and waiting is best not
> done at all.
>
> To be more specific: when an io_unit transitions to IO_UNIT_STRIPE_END
> it can immediately be removed from the list, and if it was the first
> io_unit on the list, then the log_start can immediately be updated.

Ok, the original idea was to avoid holding the io_list_lock for every
IO end/stripe end. Maybe over-designed; I'll fix this.

> > +
> > +        if (list_empty(&list))
> > +                return;
> > +
> > +        r5l_flush_all_disks(log);
> > +
> > +        /* super always points to the last valid meta */
> > +        last = list_last_entry(&list, struct r5l_io_unit, log_sibling);
> > +        r5l_write_super(log, r5l_block_to_sector(log, last->log_start));
>
> This bit flushes all the disks and then updates the metadata and writes
> it.
> As md_super_write already uses WRITE_FLUSH_FUA I don't think the extra
> flush is needed.

Great.

> I really think you should just update ->recovery_offset and set
> MD_CHANGE_DEVS (or similar) and let the update happen.

I think we should write the super here. The reclaimed space might be
reused immediately; we don't want to confuse recovery.

> > +
> > +        mutex_lock(&log->io_mutex);
> > +        log->last_checkpoint = last->log_start;
> > +        log->last_cp_seq = last->seq;
> > +        mutex_unlock(&log->io_mutex);
> > +        wake_up(&log->space_waitq);
> > +
> > +        while (!list_empty(&list)) {
> > +                io = list_first_entry(&list, struct r5l_io_unit,
> > +                        log_sibling);
> > +                list_del(&io->log_sibling);
> > +                r5l_free_io_unit(log, io);
> > +        }
> > +}
>
> So I really think all of this can be done as-it-happens (the
> book-keeping) or asynchronously. There is no need to push something
> every minute.
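Agreed on the book-keeping. A sketch of doing it at the transition,
where r5l_io_unit_stripe_end() is a hypothetical helper (not in this
patch) and the caller is assumed to hold log->io_list_lock:

        static void r5l_io_unit_stripe_end(struct r5l_log *log,
                                           struct r5l_io_unit *io)
        {
                struct r5l_io_unit *first;

                r5l_set_io_unit_state(io, IO_UNIT_STRIPE_END);
                /*
                 * Only finished io_units at the head of the list can
                 * advance the checkpoint; space before a still-live
                 * io_unit must not be reused.
                 */
                while (!list_empty(&log->running_ios)) {
                        first = list_first_entry(&log->running_ios,
                                struct r5l_io_unit, log_sibling);
                        if (first->state != IO_UNIT_STRIPE_END)
                                break;
                        /* the super must keep pointing at a valid meta
                         * block, so take the start of the freed unit */
                        log->last_checkpoint = first->log_start;
                        log->last_cp_seq = first->seq;
                        list_del(&first->log_sibling);
                        r5l_free_io_unit(log, first);
                }
                wake_up(&log->space_waitq);
        }

That removes the waiting from the reclaim path; the reclaim thread
would only write the super and trim the log tail.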
Because we need to flush the raid disks' caches, we batch the
operations to avoid that overhead. Once I change reclaim to run per a
specific amount of reclaimable space instead of on a timer, this should
be ok.
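Concretely, the plan is to kick reclaim from io_unit completion once
enough space is reclaimable, instead of from a timer. A sketch, where
r5l_reclaimable_space(), log->reclaim_thread and log->reclaim_threshold
are hypothetical names for illustration:

        /* on io_unit completion: wake the reclaim thread once enough
         * log space (e.g. 1G worth of blocks) is reclaimable */
        if (r5l_reclaimable_space(log) >= log->reclaim_threshold)
                md_wakeup_thread(log->reclaim_thread);

Thanks,
Shaohua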