On 2016/11/24 1:45 PM, NeilBrown wrote:
> On Wed, Nov 23 2016, Shaohua Li wrote:
>>
>> Thanks! The idea is awesome and does make the code easier to
>> understand.
>
> This is a key point - simpler code!!
>
>>
>> For hash conflict, I'm worrying about one case. resync does
>> sequential access. Say we have a sequential normal IO. If the
>> first normal IO and resync IO have a conflict, it's possible they
>> will have conflicts subsequently, since both are doing sequential
>> access. We can change the hash to something like this:
>>
>> For the first 64M * 512, the hash is 0, 1, ... 511
>> For the second 64M * 512, the hash is 1, 3, 5 ...
>> For the third 64M * 512, the hash is 2, 5, 8 ...
>
> This would happen when the write and the resync are at different
> addresses which happen to collide in the hash. They would only
> remain in sync if the two proceeded at the same speed. Normally
> resync is throttled when there is competing IO, so that is fairly
> unlikely.
>
> If this really was important, it might make sense to just use
> hash_long() to make the relevant bits of the sector number into
> an index. (i.e. keep the code simple)
>
>>>
>>> If a user has a (really) large raid1 device, for example 10PB
>>> size, we may just increase the bucket number BARRIER_BUCKETS_NR.
>>> Now this is a macro; it could become a variable defined at raid1
>>> creation time in the future.
>
> I don't think the size of the array is very relevant. No matter
> how big the array is, resync can be expected to directly interfere
> with 0.2% of all writes. So, assuming a fairly random distribution
> over time, most writes will most often not be blocked by resync at
> all.
>
> I wouldn't give any thought to any problems like this until they
> are actually measured.
>

Yes, I agree with you. If we do have to use more buckets in future,
the current patch makes that future modification easier.
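As a side note, below is a minimal sketch of what Neil's hash_long()
suggestion could look like. This is illustrative only, not part of the
posted patch: the helper name sector_to_idx() and the constant
BARRIER_BUCKETS_NR_BITS are assumptions; the 64MB unit and 512 buckets
match the numbers discussed above.

#include <linux/hash.h>

/* 64MB barrier unit = 2^17 sectors of 512 bytes */
#define BARRIER_UNIT_SECTOR_BITS        17
/* 512 buckets, as discussed above */
#define BARRIER_BUCKETS_NR_BITS         9
#define BARRIER_BUCKETS_NR              (1 << BARRIER_BUCKETS_NR_BITS)

/*
 * Map a sector to a barrier bucket index. hash_long() scatters
 * neighbouring barrier units across different buckets, so a resync
 * stream and a write stream that collide once do not keep colliding
 * as both advance sequentially.
 */
static inline int sector_to_idx(sector_t sector)
{
        return hash_long(sector >> BARRIER_UNIT_SECTOR_BITS,
                         BARRIER_BUCKETS_NR_BITS);
}

Every user of the per-bucket arrays would then index them with
sector_to_idx() instead of a plain shift of the sector number.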
>>>  static void reschedule_retry(struct r1bio *r1_bio)
>>> @@ -215,10 +214,15 @@ static void reschedule_retry(struct r1bi
>>>  	unsigned long flags;
>>>  	struct mddev *mddev = r1_bio->mddev;
>>>  	struct r1conf *conf = mddev->private;
>>> +	sector_t sector_nr;
>>> +	long idx;
>
> I wonder about the use of "long" for "idx". As that is an array
> index, it would have to be a REALLY big array before "long" is
> better than "int".
>

Good suggestion! I will modify it to int in the next version.

>>> @@ -255,19 +257,14 @@ static void call_bio_endio(struct r1bio
>>>  	if (!test_bit(R1BIO_Uptodate, &r1_bio->state))
>>>  		bio->bi_error = -EIO;
>>>
>>> -	if (done) {
>>> +	if (done)
>>>  		bio_endio(bio);
>>> -		/*
>>> -		 * Wake up any possible resync thread that waits for the device
>>> -		 * to go idle.
>>> -		 */
>>> -		allow_barrier(conf, start_next_window, bi_sector);
>>> -	}
>>>  }
>>>
>>>  static void raid_end_bio_io(struct r1bio *r1_bio)
>>>  {
>>>  	struct bio *bio = r1_bio->master_bio;
>>> +	struct r1conf *conf = r1_bio->mddev->private;
>>>
>>>  	/* if nobody has done the final endio yet, do it now */
>>>  	if (!test_and_set_bit(R1BIO_Returned, &r1_bio->state)) {
>>> @@ -278,6 +275,12 @@ static void raid_end_bio_io(struct r1bio
>>>
>>>  		call_bio_endio(r1_bio);
>>>  	}
>>> +
>>> +	/*
>>> +	 * Wake up any possible resync thread that waits for the device
>>> +	 * to go idle.
>>> +	 */
>>> +	allow_barrier(conf, r1_bio->sector);
>>
>> Why this change?
>
> I wondered too. I think it may be correct, but it should be in a
> separate patch. When you have a write-mostly device, I think the
> current code will allow_barrier() before the writes to the
> write-mostly devices have completed.
>

I explain this in the reply to Shaohua's email. The current upstream
code increases/decreases conf->nr_pending for each bio (a.k.a.
r1_bio->master_bio) received by raid1_make_request(). In this patch,
conf->nr_pending[idx] increases/decreases for each r1_bio; this is the
reason why allow_barrier() moves to raid_end_bio_io(). You may also
find that wait_barrier() changes its location as well, and is replaced
by wait_barrier() and wait_read_barrier() in raid1_make_request().

>>>
>>> +static sector_t align_to_barrier_unit_end(sector_t start_sector,
>>> +					  sector_t sectors)
>>> +{
>>> +	sector_t len;
>>> +
>>> +	WARN_ON(sectors == 0);
>>> +	/* len is the number of sectors from start_sector to end of the
>>> +	 * barrier unit which start_sector belongs to.
>>> +	 */
>>> +	len = ((start_sector + sectors + (1<<BARRIER_UNIT_SECTOR_BITS) - 1) &
>>> +	       (~(BARRIER_UNIT_SECTOR_SIZE - 1))) -
>>> +	      start_sector;
>>> +
>>> +	if (len > sectors)
>>> +		len = sectors;
>>> +
>>> +	return len;
>>> +}
>>
>> I'd prefer calling bio_split at the start of raid1_make_request
>> and splitting the bio if it crosses the bucket boundary. Please
>> see raid10_make_request for an example. Resync will not cross the
>> boundary, so the code will not need to worry about the boundary.
>> I believe this will make the code simpler (all the
>> align_to_barrier_unit_end calls can be removed) and easy to
>> understand.
>
> align_to_barrier_unit_end() is only called twice, once for writes
> and once for resync/recovery. If bio_split() were used, you would
> still need the same number of calls.
>
> Changing raid1.c to use bio_split() more may well be a valuable
> simplification, but I think it is a separate issue.
>

Using bio_split() would not be simpler; indeed it is a little more
complex, because r1_bio->master_bio would then be the split bio, not
the original master bio. That's why I decided not to use it.
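For context, the split-early pattern Shaohua refers to is roughly what
raid10_make_request() does today; the sketch below is simplified and
from memory (the helper name make_request_split_early() is made up,
while __make_request(), fs_bio_set and conf->geo.chunk_mask are the
raid10 names), so treat it as an illustration rather than the exact
upstream code. For raid1 barrier buckets, the same idea would split at
the 64MB barrier unit boundary instead of the chunk boundary.

static void make_request_split_early(struct mddev *mddev, struct bio *bio)
{
        struct r10conf *conf = mddev->private;
        sector_t chunk_mask = conf->geo.chunk_mask;
        int chunk_sects = chunk_mask + 1;
        struct bio *split;

        do {
                /* If the bio crosses a chunk boundary, split it at the
                 * boundary; the remainder is handled by the next loop
                 * iteration.
                 */
                if ((bio->bi_iter.bi_sector & chunk_mask) +
                    bio_sectors(bio) > chunk_sects) {
                        split = bio_split(bio, chunk_sects -
                                          (bio->bi_iter.bi_sector & chunk_mask),
                                          GFP_NOIO, fs_bio_set);
                        bio_chain(split, bio);
                } else {
                        split = bio;
                }
                __make_request(mddev, split);
        } while (split != bio);
}

With this pattern every __make_request() call sees a bio that fits
inside one unit, which is why the boundary arithmetic could go away;
the trade-off is exactly the one mentioned above, that
r1_bio->master_bio then points at a split bio rather than the original
one.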
>>>  	wait_event_lock_irq(conf->wait_barrier,
>>> -			    !conf->array_frozen &&
>>> -			    conf->barrier < RESYNC_DEPTH &&
>>> -			    conf->current_window_requests == 0 &&
>>> -			    (conf->start_next_window >=
>>> -			     conf->next_resync + RESYNC_SECTORS),
>>> +			    !conf->array_frozen && !conf->nr_pending[idx] &&
>>> +			    conf->barrier[idx] < RESYNC_DEPTH,
>>>  			    conf->resync_lock);
>>
>> So there is a slight behavior change. Originally we guarantee no
>> more than RESYNC_DEPTH sync requests. Now this only checks
>> 'conf->barrier[idx] < RESYNC_DEPTH'. How about barrier[idx-1],
>> barrier[idx-2]...? It's possible conf->barrier[idx] <
>> RESYNC_DEPTH, but barrier[idx] + barrier[idx-1] > RESYNC_DEPTH.
>> Not sure how important this is, but at least we can check
>> barrier[idx] + barrier[idx-1].
>
> I confess that I'm not entirely sure of the point of RESYNC_DEPTH -
> it was already in the code when I started working on it. My best
> guess is that it limits the number of requests we need to wait for
> before regular writes can be permitted in the resync region. In
> that case, the current code is good enough.
>
> Your "barrier[idx] + barrier[idx-1]" idea is interesting, but would
> only make a difference 1/1024 of the time (the bucket is 64Meg, the
> RESYNC_BLOCK_SIZE is 64K).
>

IMHO, RESYNC_DEPTH is for better resync throughput. Since we now have
linear resync with only one resync window, I doubt whether it still
takes effect. With only a single resync sliding window, testing the
resync depth within the sliding window or within a single barrier
bucket is very similar, because a resync I/O won't hit more than one
barrier bucket now.

In the future, when parallel resync is implemented, the modification
here may increase the whole raid1 device's resync depth to
BARRIER_BUCKETS_NR * RESYNC_DEPTH, which results in better resync I/O
when regular I/O is idle.

>>> +
>>> +static void wait_all_barriers(struct r1conf *conf)
>>> +{
>>> +	long idx;
>>> +
>>> +	for (idx = 0; idx < BARRIER_BUCKETS_NR; idx++)
>>> +		_wait_barrier(conf, idx);
>>
>> This is racy. Since resync is sequential, ideally we only need to
>> check the buckets sequentially. The compiler could do
>> _wait_barrier(conf, 1) and then _wait_barrier(conf, 0). It's
>> possible that bucket 1 has a barrier right after the check of
>> bucket 0. Even if the compiler doesn't reorder, there is a case
>> where bucket 511 should be checked before bucket 0, if the resync
>> is just crossing the 512-bucket boundary. It would be better to
>> check the resync position and adjust the bucket wait order
>> according to that position.
>
> I cannot see how it is racy. No new sync requests will be issued
> at this time, so once ->barrier[idx] reaches 0, it will stay at
> zero.
>

Yes, this is what I mean.

>>> @@ -2786,6 +2824,22 @@ static struct r1conf *setup_conf(struct
>>>  	if (!conf)
>>>  		goto abort;
>>>
>>> +	conf->nr_pending = kzalloc(PAGE_SIZE, GFP_KERNEL);
>>> +	if (!conf->nr_pending)
>>> +		goto abort;
>>
>> This allocation is a little weird. I'd define a count and use
>> sizeof(nr_pending) * count to do the allocation. There is nothing
>> related to PAGE_SIZE. Alternatively, just make nr_pending an
>> array in r1conf.
>
> The thing about PAGE_SIZE is that the allocation will succeed and
> won't waste space. If you make all the arrays simple members of
> r1conf, r1conf will be about 4 pages larger, so it will require an
> 8-page allocation, which is more likely to fail. I agree that it
> seems a little weird, but I think it results in the best use of
> memory.

Yes, this is what I mean.

Thanks for your comments!

Coly

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html