On Sat, Feb 18, 2017 at 03:05:56AM +0800, colyli@xxxxxxx wrote:
> 'Commit 79ef3a8aa1cb ("raid1: Rewrite the implementation of iobarrier.")'
> introduces a sliding resync window for the raid1 I/O barrier. This idea
> limits I/O barriers to happen only inside a sliding resync window;
> regular I/Os outside this resync window don't need to wait for the
> barrier any more. On a large raid1 device, it helps a lot to improve
> parallel write I/O throughput when there are background resync I/Os
> running at the same time.
>
> The idea of the sliding resync window is awesome, but the code
> complexity is a challenge. The sliding resync window requires several
> variables to work collectively, which is complex and very hard to make
> work correctly. Just grep "Fixes: 79ef3a8aa1" in the kernel git log:
> there are 8 more patches fixing the original resync window patch. This
> is not the end; any further related modification may easily introduce
> more regressions.
>
> Therefore I decided to implement a much simpler raid1 I/O barrier. By
> removing the resync window code, I believe life will be much easier.
>
> The brief idea of the simpler barrier is,
> - Do not maintain a globally unique resync window
> - Use multiple hash buckets to reduce I/O barrier conflicts; a regular
>   I/O only has to wait for a resync I/O when both of them hit the same
>   barrier bucket index, and vice versa.
> - The number of I/O barriers can be reduced to an acceptable level if
>   there are enough barrier buckets
>
> Here is how the barrier buckets are designed,
> - BARRIER_UNIT_SECTOR_SIZE
>   The whole LBA address space of a raid1 device is divided into
>   multiple barrier units, each of size BARRIER_UNIT_SECTOR_SIZE.
>   Bio requests won't cross the border of a barrier unit, which means
>   the maximum bio size is BARRIER_UNIT_SECTOR_SIZE<<9 (64MB) in bytes.
>   For random I/O, 64MB is large enough for both read and write
>   requests; for sequential I/O, considering that the underlying block
>   layer may merge them into larger requests, 64MB is still good enough.
>   Neil also points out that for the resync operation, "we want the
>   resync to move from region to region fairly quickly so that the
>   slowness caused by having to synchronize with the resync is averaged
>   out over a fairly small time frame". At full speed, resyncing 64MB
>   should take less than 1 second; when resync is competing with other
>   I/O, it could take up to a few minutes. Therefore 64MB is a fairly
>   good range for resync.
>
> - BARRIER_BUCKETS_NR
>   There are BARRIER_BUCKETS_NR buckets in total, defined by,
>         #define BARRIER_BUCKETS_NR_BITS   (PAGE_SHIFT - 2)
>         #define BARRIER_BUCKETS_NR        (1<<BARRIER_BUCKETS_NR_BITS)
>   This patch changes the following members of struct r1conf from
>   integers to arrays of integers,
>         -       int                     nr_pending;
>         -       int                     nr_waiting;
>         -       int                     nr_queued;
>         -       int                     barrier;
>         +       int                     *nr_pending;
>         +       int                     *nr_waiting;
>         +       int                     *nr_queued;
>         +       int                     *barrier;
>   The number of array elements is defined as BARRIER_BUCKETS_NR. For a
>   4KB kernel page size, (PAGE_SHIFT - 2) indicates there are 1024 I/O
>   barrier buckets, and each array of integers occupies a single memory
>   page. With 1024 buckets, a request smaller than the I/O barrier unit
>   size has only a ~0.1% chance of having to wait for resync to pause,
>   which is a small enough fraction. Also, requesting a single memory
>   page is more friendly to the kernel page allocator than a larger
>   memory size.
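To make the sizing above concrete, here is a minimal C sketch. The value
17 for BARRIER_UNIT_SECTOR_BITS is inferred from the 64MB unit size
described above, and alloc_barrier_buckets() is a hypothetical helper
illustrating how the four per-bucket counter arrays could be allocated;
it is not code copied from the patch:

	/* 2^17 sectors * 512 bytes = 64MB per barrier unit (inferred). */
	#define BARRIER_UNIT_SECTOR_BITS	17
	#define BARRIER_UNIT_SECTOR_SIZE	(1<<BARRIER_UNIT_SECTOR_BITS)
	/* 1024 buckets with a 4KB page size; one page per int array. */
	#define BARRIER_BUCKETS_NR_BITS		(PAGE_SHIFT - 2)
	#define BARRIER_BUCKETS_NR		(1<<BARRIER_BUCKETS_NR_BITS)

	/* Hypothetical helper: allocate the per-bucket counters. */
	static int alloc_barrier_buckets(struct r1conf *conf)
	{
		conf->nr_pending = kcalloc(BARRIER_BUCKETS_NR, sizeof(int),
					   GFP_KERNEL);
		conf->nr_waiting = kcalloc(BARRIER_BUCKETS_NR, sizeof(int),
					   GFP_KERNEL);
		conf->nr_queued  = kcalloc(BARRIER_BUCKETS_NR, sizeof(int),
					   GFP_KERNEL);
		conf->barrier    = kcalloc(BARRIER_BUCKETS_NR, sizeof(int),
					   GFP_KERNEL);
		if (!conf->nr_pending || !conf->nr_waiting ||
		    !conf->nr_queued || !conf->barrier)
			return -ENOMEM;	/* caller frees in its error path */
		return 0;
	}

Each kcalloc() here requests BARRIER_BUCKETS_NR * sizeof(int) ==
PAGE_SIZE bytes, which is why a single page per array is all the page
allocator has to provide.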
>
> - The I/O barrier bucket is indexed by the bio's start sector
>   If multiple I/O requests hit different I/O barrier units, they only
>   need to compete for the I/O barrier with other I/Os which hit the
>   same I/O barrier bucket index. The index of the barrier bucket which
>   a bio should look up is calculated by sector_to_idx(), defined in
>   raid1.h as an inline function,
>         static inline int sector_to_idx(sector_t sector)
>         {
>                 return hash_long(sector >> BARRIER_UNIT_SECTOR_BITS,
>                                 BARRIER_BUCKETS_NR_BITS);
>         }
>   Here sector is the start sector number of a bio.
>
> - A single bio won't cross the boundary of an I/O barrier unit
>   If a request crosses the boundary of a barrier unit, it will be
>   split. A bio may be split in raid1_make_request() or
>   raid1_sync_request(), if the number of sectors returned by
>   align_to_barrier_unit_end() is smaller than the original bio size.
>
> Comparing to the single sliding resync window,
> - Currently resync I/O grows linearly, therefore regular and resync
>   I/O will conflict within a single barrier unit. So the I/O behavior
>   is similar to the single sliding resync window.
> - But a barrier bucket is shared by all barrier units with an
>   identical barrier bucket index, so the probability of conflict might
>   be higher than with a single sliding resync window, in the condition
>   that write I/Os always hit barrier units which have identical
>   barrier bucket indexes to the resync I/Os. This is a very rare
>   condition in real I/O workloads; I cannot imagine how it could
>   happen in practice.
> - Therefore we can achieve a good enough low conflict rate with a much
>   simpler barrier algorithm and implementation.
>
> There are several changes which should be noticed,
> - In raid1d(), I change the code to decrease conf->nr_queued[idx]
>   inside a single loop, it looks like this,
>         spin_lock_irqsave(&conf->device_lock, flags);
>         conf->nr_queued[idx]--;
>         spin_unlock_irqrestore(&conf->device_lock, flags);
>   This change generates more spin lock operations, but in the next
>   patch of this patch set it will be replaced by a single line of
>   code,
>         atomic_dec(&conf->nr_queued[idx]);
>   so we don't need to worry about the spin lock cost here.
> - Mainline raid1 code splits the original raid1_make_request() into
>   raid1_read_request() and raid1_write_request(). If the original bio
>   crosses an I/O barrier unit boundary, it will be split before
>   calling raid1_read_request() or raid1_write_request(); this change
>   makes the code logic simpler and clearer.
> - In this patch wait_barrier() is moved from raid1_make_request() to
>   raid1_write_request(). In raid1_read_request(), the original
>   wait_barrier() is replaced by wait_read_barrier(). The difference is
>   that wait_read_barrier() only waits if the array is frozen; using
>   different barrier functions in different code paths makes the code
>   cleaner and easier to read.
>
> Changelog
> V4:
> - Add alloc_r1bio() to remove redundant r1bio memory allocation code.
> - Fix many typos in patch comments.
> - Use (PAGE_SHIFT - ilog2(sizeof(int))) to define BARRIER_BUCKETS_NR_BITS.
> V3:
> - Rebase the patch against the latest upstream kernel code.
> - Many fixes by review comments from Neil,
>   - Go back to using pointers to replace arrays in struct r1conf
>   - Remove total_barriers from struct r1conf
>   - Add more patch comments to explain how/why the values of
>     BARRIER_UNIT_SECTOR_SIZE and BARRIER_BUCKETS_NR are decided.
>   - Use get_unqueued_pending() to replace get_all_pendings() and
>     get_all_queued()
>   - Increase bucket number from 512 to 1024
> - Change code comments format by review from Shaohua.
> V2:
> - Use bio_split() to split the original bio if it crosses a barrier
>   unit boundary, to make the code simpler, by suggestion from Shaohua
>   and Neil.
> - Use hash_long() to replace the original linear hash, to avoid a
>   possible conflict between resync I/O and sequential write I/O, by
>   suggestion from Shaohua.
> - Add conf->total_barriers to record barrier depth, which is used to
>   control the number of parallel sync I/O barriers, by suggestion from
>   Shaohua.
> - In the V1 patch, the barrier-bucket related members of r1conf listed
>   below were allocated in a memory page. To make the code simpler, the
>   V2 patch moves the memory space into struct r1conf, like this,
>         -       int                     nr_pending;
>         -       int                     nr_waiting;
>         -       int                     nr_queued;
>         -       int                     barrier;
>         +       int                     nr_pending[BARRIER_BUCKETS_NR];
>         +       int                     nr_waiting[BARRIER_BUCKETS_NR];
>         +       int                     nr_queued[BARRIER_BUCKETS_NR];
>         +       int                     barrier[BARRIER_BUCKETS_NR];
>   This change is by the suggestion from Shaohua.
> - Remove some irrelevant code comments, by suggestion from Guoqing.
> - Add a missing wait_barrier() before jumping to retry_write, in
>   raid1_make_write_request().
> V1:
> - Original RFC patch for comments

Thanks, I applied the two patches. For the deadlock issue Neil pointed
out, I don't want it to hold up this patch; we can fix it later after we
figure out how to fix it. The r1bio allocation is guaranteed to succeed
because it's a mempool allocation, so I deleted the warn statement.

Neil,
Since you reviewed the patches a lot, I added your 'Reviewed-by'. Please
let me know if you don't want it added.

Thanks,
Shaohua
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html
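For the bio splitting described in the patch comment above, here is a
minimal sketch of what align_to_barrier_unit_end() could look like. The
body is inferred from the stated rule that a bio must never cross a 64MB
barrier unit boundary; only the function name and the constants come from
the patch description, so treat this as an illustration rather than the
actual implementation:

	/*
	 * Clamp an I/O of 'sectors' sectors starting at 'start_sector' so
	 * that it does not run past the end of the barrier unit containing
	 * start_sector.  The caller splits the bio with bio_split() when
	 * the returned length is shorter than the original request.
	 */
	static sector_t align_to_barrier_unit_end(sector_t start_sector,
						  sector_t sectors)
	{
		/* Sectors left until the end of the current barrier unit. */
		sector_t len = round_up(start_sector + 1,
					BARRIER_UNIT_SECTOR_SIZE) - start_sector;

		return min_t(sector_t, sectors, len);
	}

Because every bio is cut down to at most this length before it reaches
raid1_read_request() or raid1_write_request(), each r1bio maps to exactly
one bucket, namely sector_to_idx(bio->bi_iter.bi_sector).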