On Tue, Dec 27, 2016 at 11:47:37PM +0800, Coly Li wrote:
> Commit 79ef3a8aa1cb ("raid1: Rewrite the implementation of iobarrier.")
> introduced a sliding resync window for the raid1 I/O barrier. The idea is
> to limit I/O barriers to a sliding resync window, so that regular I/Os
> outside this resync window no longer need to wait for a barrier. On a
> large raid1 device this helps a lot to improve parallel write I/O
> throughput while background resync I/Os are running at the same time.
>
> The idea of the sliding resync window is awesome, but there are several
> challenges which are very difficult to solve,
> - code complexity
>   The sliding resync window requires several variables to work
>   collectively; this is complex and very hard to get right. Just grep
>   "Fixes: 79ef3a8aa1" in the kernel git log: there are 8 more patches
>   fixing the original resync window patch. And this is not the end; any
>   further related modification may easily introduce more regressions.
> - multiple sliding resync windows
>   The current raid1 code only has a single sliding resync window, so we
>   cannot do parallel resync with the current I/O barrier implementation.
>   Implementing multiple resync windows would be much more complex, and
>   very hard to get right.
>
> Therefore I decided to implement a much simpler raid1 I/O barrier by
> removing the resync window code; I believe life will be much easier.
>
> The brief idea of the simpler barrier is,
> - Do not maintain a globally unique resync window.
> - Use multiple hash buckets to reduce I/O barrier conflicts: a regular
>   I/O only has to wait for a resync I/O when both of them have the same
>   barrier bucket index, and vice versa.
> - I/O barriers can be reduced to an acceptable number if there are
>   enough barrier buckets.
>
> Here is how the barrier buckets are designed,
> - BARRIER_UNIT_SECTOR_SIZE
>   The whole LBA address space of a raid1 device is divided into multiple
>   barrier units, each of size BARRIER_UNIT_SECTOR_SIZE. A bio request
>   won't cross the border of a barrier unit, which means the maximum bio
>   size is BARRIER_UNIT_SECTOR_SIZE<<9 bytes.
> - BARRIER_BUCKETS_NR
>   There are BARRIER_BUCKETS_NR buckets in total, defined by,
>       #define BARRIER_BUCKETS_NR_BITS 9
>       #define BARRIER_BUCKETS_NR      (1<<BARRIER_BUCKETS_NR_BITS)
>   If multiple I/O requests hit different barrier units, they only need
>   to compete for the I/O barrier with other I/Os which hit the same
>   barrier bucket index. The index of the barrier bucket which a bio
>   should look at is calculated by,
>       int idx = hash_long(sector_nr, BARRIER_BUCKETS_NR_BITS)
>   where sector_nr is the start sector number of the bio. We use
>   align_to_barrier_unit_end() to calculate the number of sectors from
>   sector_nr to the next barrier unit boundary; if the requesting bio
>   goes across the boundary, we split the bio in raid1_make_request() to
>   make sure the final bio sent into generic_make_request() won't exceed
>   the barrier unit boundary.
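If I read the design right, the index calculation amounts to something like
the stand-alone user-space sketch below. Treat it as an illustration, not
the patch code: hash_long() here is a simplified stand-in for the kernel
helper, the unit size is an assumed value, and I hash the barrier unit
number (sector_nr >> BARRIER_UNIT_SECTOR_BITS), since all sectors inside
one barrier unit must map to the same bucket:

    #include <stdio.h>

    typedef unsigned long long sector_t;

    #define BARRIER_UNIT_SECTOR_BITS  17  /* assumed: 1<<17 sectors (64MB) per unit */
    #define BARRIER_UNIT_SECTOR_SIZE  (1ULL << BARRIER_UNIT_SECTOR_BITS)
    #define BARRIER_BUCKETS_NR_BITS   9
    #define BARRIER_BUCKETS_NR        (1 << BARRIER_BUCKETS_NR_BITS)

    /* simplified stand-in for the kernel's hash_long(), 64-bit variant */
    static unsigned long hash_long(unsigned long val, unsigned int bits)
    {
        /* 0x61C8864680B583EBULL is the kernel's GOLDEN_RATIO_64 */
        return (unsigned long)((val * 0x61C8864680B583EBULL) >> (64 - bits));
    }

    /* bucket index of the barrier unit which sector_nr falls into */
    static int sector_to_bucket_idx(sector_t sector_nr)
    {
        return (int)hash_long(sector_nr >> BARRIER_UNIT_SECTOR_BITS,
                              BARRIER_BUCKETS_NR_BITS);
    }

    int main(void)
    {
        /* two sectors inside the same barrier unit share one bucket */
        printf("%d %d\n",
               sector_to_bucket_idx(0),
               sector_to_bucket_idx(BARRIER_UNIT_SECTOR_SIZE - 1));
        return 0;
    }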
> Comparing to the single sliding resync window,
> - Currently resync I/O grows linearly, so regular and resync I/O will
>   only conflict within a single barrier unit. In this respect it is
>   similar to the single sliding resync window.
> - But a barrier bucket is shared by all barrier units with an identical
>   barrier bucket index, so the probability of conflict might be higher
>   than with the single sliding resync window, in the case that writing
>   I/Os always hit barrier units which have the same barrier bucket index
>   as the resync I/Os. This is a very rare condition in real I/O
>   workloads; I cannot imagine how it could happen in practice.
> - Therefore we can achieve a good enough low conflict rate with a much
>   simpler barrier algorithm and implementation.
>
> If a user has a (really) large raid1 device, for example 10PB in size,
> we may just increase the bucket number BARRIER_BUCKETS_NR. For now this
> is a macro; it could become a variable defined at raid1 creation time in
> the future.
>
> There are a few changes which should be noticed,
> - In raid1d(), I changed the code to decrease conf->nr_queued[idx]
>   inside a single loop; it looks like this,
>       spin_lock_irqsave(&conf->device_lock, flags);
>       conf->nr_queued[idx]--;
>       spin_unlock_irqrestore(&conf->device_lock, flags);
>   This change generates more spin lock operations, but in the next patch
>   of this patch set it will be replaced by a single line of code,
>       atomic_dec(&conf->nr_queued[idx]);
>   so we don't need to worry about the spin lock cost here.
> - The original function raid1_make_request() is split into two
>   functions,
>   - raid1_make_read_request(): handles regular read requests and calls
>     wait_read_barrier() for the I/O barrier.
>   - raid1_make_write_request(): handles regular write requests and calls
>     wait_barrier() for the I/O barrier.
>   The difference is that wait_read_barrier() only waits if the array is
>   frozen; using different barrier functions in the different code paths
>   makes the code cleaner and easier to read.
> - align_to_barrier_unit_end() is called to make sure that neither
>   regular nor resync I/O goes across a barrier unit boundary.
>
> Changelog
> V1:
> - Original RFC patch for comments.
> V2:
> - Use bio_split() to split the original bio if it goes across a barrier
>   unit boundary, to keep the code simple, as suggested by Shaohua and
>   Neil.
> - Use hash_long() to replace the original linear hash, to avoid a
>   possible conflict between resync I/O and sequential write I/O, as
>   suggested by Shaohua.
> - Add conf->total_barriers to record the barrier depth, which is used to
>   control the number of parallel sync I/O barriers, as suggested by
>   Shaohua.
> - In the V1 patch the barrier-bucket related members of r1conf below
>   were allocated in a memory page. To keep the code simple, the V2 patch
>   moves them into struct r1conf, like this,
>       -       int                     nr_pending;
>       -       int                     nr_waiting;
>       -       int                     nr_queued;
>       -       int                     barrier;
>       +       int                     nr_pending[BARRIER_BUCKETS_NR];
>       +       int                     nr_waiting[BARRIER_BUCKETS_NR];
>       +       int                     nr_queued[BARRIER_BUCKETS_NR];
>       +       int                     barrier[BARRIER_BUCKETS_NR];
>   This change was suggested by Shaohua.
> - Remove some irrelevant code comments, as suggested by Guoqing.
> - Add a missing wait_barrier() before jumping to retry_write in
>   raid1_make_write_request().
>
> Signed-off-by: Coly Li <colyli@xxxxxxx>
> Cc: Shaohua Li <shli@xxxxxx>
> Cc: Neil Brown <neilb@xxxxxxx>
> Cc: Johannes Thumshirn <jthumshirn@xxxxxxx>
> Cc: Guoqing Jiang <gqjiang@xxxxxxxx>
> ---
>
> +static sector_t align_to_barrier_unit_end(sector_t start_sector,
> +					  sector_t sectors)
> +{
> +	sector_t len;
> +
> +	WARN_ON(sectors == 0);
> +	/* len is the number of sectors from start_sector to end of the
> +	 * barrier unit which start_sector belongs to.
> +	 */

The correct format for comments is:

    /*
     * something
     */

There are some other places with the same issue.

> +	len = ((start_sector + sectors + (1<<BARRIER_UNIT_SECTOR_BITS) - 1) &
> +	       (~(BARRIER_UNIT_SECTOR_SIZE - 1))) -
> +	      start_sector;

This one makes me nervous: when a bio crosses a barrier unit boundary,
rounding up (start_sector + sectors) measures to the unit boundary *after*
the end of the bio, not from start_sector to the end of its own unit. For
example, with a 1<<17-sector unit, start_sector = 131000 and sectors = 200,
this yields 131144 instead of the 72 sectors left in start_sector's unit.
Shouldn't this be:

    +	len = ((start_sector + (1<<BARRIER_UNIT_SECTOR_BITS) - 1) &
    +	       (~(BARRIER_UNIT_SECTOR_SIZE - 1))) -
    +	      start_sector;

And you can use round_up() here.
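Something like the following untested sketch is what I have in mind
(round_up() is the existing kernel helper; the "+ 1" and the clamp are my
own additions: the "+ 1" keeps a start_sector sitting exactly on a unit
boundary from producing len == 0, and the clamp covers bios that end
before the boundary):

    static sector_t align_to_barrier_unit_end(sector_t start_sector,
                                              sector_t sectors)
    {
            sector_t len;

            WARN_ON(sectors == 0);
            /*
             * len is the number of sectors from start_sector to the
             * end of the barrier unit which start_sector belongs to.
             */
            len = round_up(start_sector + 1, BARRIER_UNIT_SECTOR_SIZE) -
                  start_sector;

            /* the bio ends before the unit boundary: no split needed */
            if (len > sectors)
                    len = sectors;

            return len;
    }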
> -static void raid1_make_request(struct mddev *mddev, struct bio * bio)
> +static void raid1_make_read_request(struct mddev *mddev, struct bio *bio)
> {

Please rebase the patches onto the latest md-next; raid1_make_request()
was already split into read and write code paths there recently.

Otherwise, the patch looks good. After these issues are fixed, I'll add it
for 4.11.

Thanks,
Shaohua