On 2017/2/16 10:22 AM, Shaohua Li wrote:
> On Thu, Feb 16, 2017 at 12:35:22AM +0800, colyli@xxxxxxx wrote:
>> Commit 79ef3a8aa1cb ("raid1: Rewrite the implementation of iobarrier.")
>> introduces a sliding resync window for the raid1 I/O barrier. The idea
>> limits I/O barriers to happen only inside a sliding resync window;
>> regular I/Os outside this resync window don't need to wait for the
>> barrier any more. On a large raid1 device, it helps a lot to improve
>> parallel writing I/O throughput when background resync I/Os are
>> running at the same time.
>>
>> The idea of the sliding resync window is awesome, but the code
>> complexity is a challenge. The sliding resync window requires several
>> variables to work collectively, which is complex and very hard to get
>> right. Just grep "Fixes: 79ef3a8aa1" in the kernel git log: there are
>> 8 more patches fixing the original resync window patch. This is not
>> the end; any further related modification may easily introduce more
>> regressions.
>>
>> Therefore I decided to implement a much simpler raid1 I/O barrier, by
>> removing the resync window code; I believe life will be much easier.
>>
>> The brief idea of the simpler barrier is,
>> - Do not maintain a global unique resync window.
>> - Use multiple hash buckets to reduce I/O barrier conflicts. A regular
>>   I/O only has to wait for a resync I/O when both of them hit the same
>>   barrier bucket index, and vice versa.
>> - I/O barrier waits can be reduced to an acceptable number if there
>>   are enough barrier buckets.
>>
>> Here I explain how the barrier buckets are designed,
>> - BARRIER_UNIT_SECTOR_SIZE
>>   The whole LBA address space of a raid1 device is divided into
>>   multiple barrier units, each of size BARRIER_UNIT_SECTOR_SIZE.
>>   A bio request won't cross the border of a barrier unit, which means
>>   the maximum bio size is BARRIER_UNIT_SECTOR_SIZE<<9 (64MB) in bytes.
>>   For random I/O, 64MB is large enough for both read and write
>>   requests; for sequential I/O, considering that the underlying block
>>   layer may merge them into larger requests, 64MB is still good
>>   enough.
>>   Neil also points out that for the resync operation, "we want the
>>   resync to move from region to region fairly quickly so that the
>>   slowness caused by having to synchronize with the resync is averaged
>>   out over a fairly small time frame". At full speed, resyncing 64MB
>>   should take less than 1 second. When resync is competing with other
>>   I/O, it could take up to a few minutes. Therefore 64MB is a fairly
>>   good range for resync.
>>
>> - BARRIER_BUCKETS_NR
>>   There are BARRIER_BUCKETS_NR buckets in total, defined by,
>>         #define BARRIER_BUCKETS_NR_BITS   (PAGE_SHIFT - 2)
>>         #define BARRIER_BUCKETS_NR        (1<<BARRIER_BUCKETS_NR_BITS)
>>   This patch changes the members of struct r1conf below from plain
>>   integers to pointers to arrays of integers,
>>         -       int                     nr_pending;
>>         -       int                     nr_waiting;
>>         -       int                     nr_queued;
>>         -       int                     barrier;
>>         +       int                     *nr_pending;
>>         +       int                     *nr_waiting;
>>         +       int                     *nr_queued;
>>         +       int                     *barrier;
>>   The number of array elements is BARRIER_BUCKETS_NR. For a 4KB kernel
>>   page size, (PAGE_SHIFT - 2) indicates there are 1024 I/O barrier
>>   buckets, and each array of integers occupies a single memory page.
>>   With 1024 buckets, a request smaller than the I/O barrier unit size
>>   has a ~0.1% chance of having to wait for resync to pause, which is a
>>   small enough fraction. Also, requesting a single memory page is more
>>   friendly to the kernel page allocator than a larger memory size.
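
To make the one-page sizing concrete: each of the four counter arrays is
then a single-page kcalloc() in setup_conf(). A minimal sketch of that
part (illustrative only, not the exact patch hunk; the abort label is the
usual setup_conf() error path):

        /* four bucket-indexed counters, one page each with 4KB pages */
        conf->nr_pending = kcalloc(BARRIER_BUCKETS_NR, sizeof(int),
                                   GFP_KERNEL);
        conf->nr_waiting = kcalloc(BARRIER_BUCKETS_NR, sizeof(int),
                                   GFP_KERNEL);
        conf->nr_queued = kcalloc(BARRIER_BUCKETS_NR, sizeof(int),
                                  GFP_KERNEL);
        conf->barrier = kcalloc(BARRIER_BUCKETS_NR, sizeof(int),
                                GFP_KERNEL);
        if (!conf->nr_pending || !conf->nr_waiting ||
            !conf->nr_queued || !conf->barrier)
                goto abort;

With PAGE_SHIFT == 12 this is 1024 * sizeof(int) == 4096 bytes per array,
exactly one page, which is what the description above relies on.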
>>
>> - I/O barrier buckets are indexed by bio start sector
>>   If multiple I/O requests hit different I/O barrier units, they only
>>   need to contend on the I/O barrier with other I/Os which hit the
>>   same I/O barrier bucket index. The index of the barrier bucket which
>>   a bio should look for is calculated by sector_to_idx(), which is
>>   defined in raid1.h as an inline function,
>>         static inline int sector_to_idx(sector_t sector)
>>         {
>>                 return hash_long(sector >> BARRIER_UNIT_SECTOR_BITS,
>>                                  BARRIER_BUCKETS_NR_BITS);
>>         }
>>   Here 'sector' is the start sector number of a bio.
>>
>> - A single bio won't cross the boundary of an I/O barrier unit
>>   If a request crosses the boundary of a barrier unit, it will be
>>   split. A bio may be split in raid1_make_request() or
>>   raid1_sync_request(), if the number of sectors returned by
>>   align_to_barrier_unit_end() is smaller than the original bio size
>>   (a sketch of this helper follows below).
>>
>> Compared to the single sliding resync window,
>> - Currently resync I/O grows linearly, therefore regular and resync
>>   I/O will conflict within a single barrier unit. So the I/O behavior
>>   is similar to that of the single sliding resync window.
>> - But a barrier bucket is shared by all barrier units with an
>>   identical barrier bucket index, so the probability of conflict might
>>   be higher than with the single sliding resync window, in the case
>>   that writing I/Os always hit barrier units which have identical
>>   barrier bucket indexes to the resync I/Os. This is a very rare
>>   condition in real I/O workloads; I cannot imagine how it could
>>   happen in practice.
>> - Therefore we can achieve a good enough, low conflict rate with a
>>   much simpler barrier algorithm and implementation.
>>
>> There are a few changes that should be noticed,
>> - In raid1d(), I change the code to decrease conf->nr_queued[idx]
>>   inside a single loop; it looks like this,
>>         spin_lock_irqsave(&conf->device_lock, flags);
>>         conf->nr_queued[idx]--;
>>         spin_unlock_irqrestore(&conf->device_lock, flags);
>>   This change generates more spin lock operations, but in the next
>>   patch of this patch set it will be replaced by a single line of
>>   code,
>>         atomic_dec(&conf->nr_queued[idx]);
>>   so we don't need to worry about the spin lock cost here.
>> - Mainline raid1 code splits the original raid1_make_request() into
>>   raid1_read_request() and raid1_write_request(). If the original bio
>>   crosses an I/O barrier unit boundary, the bio will be split before
>>   calling raid1_read_request() or raid1_write_request(); this makes
>>   the code logic simpler and clearer.
>> - In this patch wait_barrier() is moved from raid1_make_request() to
>>   raid1_write_request(). In raid1_read_request(), the original
>>   wait_barrier() is replaced by wait_read_barrier().
>>   The difference is that wait_read_barrier() only waits if the array
>>   is frozen; using different barrier functions in different code paths
>>   makes the code cleaner and easier to read.
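
For reference, here is a minimal sketch of the shape of
align_to_barrier_unit_end() under the definitions above (an illustration
only; the patch itself is authoritative):

        static inline sector_t align_to_barrier_unit_end(sector_t start_sector,
                                                         sector_t sectors)
        {
                sector_t len;

                /* sectors from start_sector to the end of its barrier unit;
                 * round_up() is fine because the unit size is a power of 2 */
                len = round_up(start_sector + 1, BARRIER_UNIT_SECTOR_SIZE) -
                      start_sector;
                /* never let a bio cross the 64MB barrier unit boundary */
                return min_t(sector_t, len, sectors);
        }

A bio starting at sector S with N sectors is split at
align_to_barrier_unit_end(S, N) whenever that value is smaller than N.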
>>
>> Changelog
>> V3:
>> - Rebase the patch against the latest upstream kernel code.
>> - Many fixes from review comments by Neil,
>>   - Back to using pointers to replace arrays in struct r1conf.
>>   - Remove total_barriers from struct r1conf.
>>   - Add more patch comments to explain how/why the values of
>>     BARRIER_UNIT_SECTOR_SIZE and BARRIER_BUCKETS_NR are decided.
>>   - Use get_unqueued_pending() to replace get_all_pendings() and
>>     get_all_queued().
>>   - Increase bucket number from 512 to 1024.
>> - Change code comment format from review by Shaohua.
>> V2:
>> - Use bio_split() to split the original bio if it crosses a barrier
>>   unit boundary, to make the code simpler, by suggestion from Shaohua
>>   and Neil.
>> - Use hash_long() to replace the original linear hash, to avoid a
>>   possible conflict between resync I/O and sequential write I/O, by
>>   suggestion from Shaohua.
>> - Add conf->total_barriers to record barrier depth, which is used to
>>   control the number of parallel sync I/O barriers, by suggestion from
>>   Shaohua.
>> - In the V1 patch the barrier bucket related members of r1conf below
>>   were allocated in a memory page. To make the code simpler, the V2
>>   patch moves the memory space into struct r1conf, like this,
>>         -       int                     nr_pending;
>>         -       int                     nr_waiting;
>>         -       int                     nr_queued;
>>         -       int                     barrier;
>>         +       int                     nr_pending[BARRIER_BUCKETS_NR];
>>         +       int                     nr_waiting[BARRIER_BUCKETS_NR];
>>         +       int                     nr_queued[BARRIER_BUCKETS_NR];
>>         +       int                     barrier[BARRIER_BUCKETS_NR];
>>   This change is by suggestion from Shaohua.
>> - Remove some irrelevant code comments, by suggestion from Guoqing.
>> - Add a missing wait_barrier() before jumping to retry_write, in
>>   raid1_make_write_request().
>> V1:
>> - Original RFC patch for comments.
>
> Looks good, two minor issues.
>
>>
>> -static void raid1_read_request(struct mddev *mddev, struct bio *bio,
>> -                              struct r1bio *r1_bio)
>> +static void raid1_read_request(struct mddev *mddev, struct bio *bio)
>>  {
>>         struct r1conf *conf = mddev->private;
>>         struct raid1_info *mirror;
>> +       struct r1bio *r1_bio;
>>         struct bio *read_bio;
>>         struct bitmap *bitmap = mddev->bitmap;
>>         const int op = bio_op(bio);
>> @@ -1083,7 +1101,34 @@ static void raid1_read_request(struct mddev *mddev, struct bio *bio,
>>         int max_sectors;
>>         int rdisk;
>>
>> -       wait_barrier(conf, bio);
>> +       /*
>> +        * Still need barrier for READ in case that whole
>> +        * array is frozen.
>> +        */
>> +       wait_read_barrier(conf, bio->bi_iter.bi_sector);
>> +       bitmap = mddev->bitmap;
>> +
>> +       /*
>> +        * make_request() can abort the operation when read-ahead is
>> +        * being used and no empty request is available.
>> +        */
>> +       r1_bio = mempool_alloc(conf->r1bio_pool, GFP_NOIO);
>> +       r1_bio->master_bio = bio;
>> +       r1_bio->sectors = bio_sectors(bio);
>> +       r1_bio->state = 0;
>> +       r1_bio->mddev = mddev;
>> +       r1_bio->sector = bio->bi_iter.bi_sector;
>
> This part looks unnecessarily complicated. If you change
> raid1_make_request() to something like __raid1_make_request(), add a
> new raid1_make_request() and do the bio split there, then call
> __raid1_make_request() for each split bio, then you don't need to
> duplicate the r1_bio allocation parts for read/write.

Aha, good point! I will do that (see the sketch at the end of this mail).

>> diff --git a/drivers/md/raid1.h b/drivers/md/raid1.h
>> index c52ef42..d3faf30 100644
>> --- a/drivers/md/raid1.h
>> +++ b/drivers/md/raid1.h
>> @@ -1,6 +1,14 @@
>>  #ifndef _RAID1_H
>>  #define _RAID1_H
>>
>> +/* each barrier unit size is 64MB for now
>> + * note: it must be larger than RESYNC_DEPTH
>> + */
>> +#define BARRIER_UNIT_SECTOR_BITS       17
>> +#define BARRIER_UNIT_SECTOR_SIZE       (1<<17)
>> +#define BARRIER_BUCKETS_NR_BITS        (PAGE_SHIFT - 2)
>
> Maybe write this as (PAGE_SHIFT - ilog2(sizeof(int)))? To be honest, I
> don't think it really matters if the array is PAGE_SIZE length, maybe
> just specify a const here.

Nice catch! It makes sense, I will do that, and add some comments to
explain why it is sizeof(int). Thanks.
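
To double check I read the first suggestion right, below is the rough
shape I have in mind for the split wrapper (a sketch only; error handling
is omitted, __raid1_make_request() would carry the common r1_bio
allocation, and the bio_split() pattern follows what we did in V2):

        static void raid1_make_request(struct mddev *mddev, struct bio *bio)
        {
                sector_t sectors;

                /*
                 * Split the bio so that it never crosses a barrier unit
                 * boundary, then handle each part with the common helper.
                 */
                sectors = align_to_barrier_unit_end(bio->bi_iter.bi_sector,
                                                    bio_sectors(bio));
                if (sectors < bio_sectors(bio)) {
                        struct bio *split = bio_split(bio, sectors, GFP_NOIO,
                                                      fs_bio_set);
                        bio_chain(split, bio);
                        /* the remainder re-enters this function */
                        generic_make_request(bio);
                        bio = split;
                }

                /* r1_bio allocation and read/write dispatch live here */
                __raid1_make_request(mddev, bio);
        }

And for the second suggestion, something like below, with a comment
explaining the sizeof(int),

        /*
         * Each of the four counter arrays (nr_pending, nr_waiting,
         * nr_queued, barrier) has BARRIER_BUCKETS_NR int elements, so
         * sizing the bucket count as PAGE_SIZE / sizeof(int) keeps each
         * array at exactly one page.
         */
        #define BARRIER_BUCKETS_NR_BITS (PAGE_SHIFT - ilog2(sizeof(int)))
        #define BARRIER_BUCKETS_NR      (1 << BARRIER_BUCKETS_NR_BITS)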
Coly