On 2017/2/21 8:29 AM, NeilBrown wrote:
> On Mon, Feb 20 2017, Coly Li wrote:
>
>>> On Feb 20, 2017, at 3:04 PM, Shaohua Li <shli@xxxxxxxxxx> wrote:
>>>
>>>> On Mon, Feb 20, 2017 at 01:51:22PM +1100, Neil Brown wrote:
>>>>> On Mon, Feb 20 2017, NeilBrown wrote:
>>>>>
>>>>>> On Fri, Feb 17 2017, Coly Li wrote:
>>>>>>
>>>>>>> On 2017/2/16 3:04 PM, NeilBrown wrote:
>>>>>>> I know you are going to change this as Shaohua wants the
>>>>>>> splitting to happen in a separate function, which I agree
>>>>>>> with, but there is something else wrong here.  Calling
>>>>>>> bio_split/bio_chain repeatedly in a loop is dangerous.  It
>>>>>>> is OK for simple devices, but when one request can wait for
>>>>>>> another request to the same device it can deadlock.  This
>>>>>>> can happen with raid1.  If a resync request calls
>>>>>>> raise_barrier() between one request and the next, then the
>>>>>>> next has to wait for the resync request, which has to wait
>>>>>>> for the first request.  As the first request will be stuck
>>>>>>> in the queue in generic_make_request(), you get a deadlock.
>>>>>>
>>>>>> For md raid1, queue in generic_make_request(), can I
>>>>>> understand it as bio_list_on_stack in this function?  And
>>>>>> queue in underlying device, can I understand it as the data
>>>>>> structures like plug->pending and conf->pending_bio_list?
>>>>>
>>>>> Yes, the queue in generic_make_request() is the
>>>>> bio_list_on_stack.  That is the only queue I am talking
>>>>> about.  I'm not referring to plug->pending or
>>>>> conf->pending_bio_list at all.
>>>>>
>>>>>> I still don't get the point of the deadlock, let me try to
>>>>>> explain why I don't see the possible deadlock.  If a bio is
>>>>>> split, and the first part is processed by make_request_fn(),
>>>>>> and then a resync comes and it will raise a barrier, there
>>>>>> are 3 possible conditions,
>>>>>> - the resync I/O tries to raise the barrier on the same
>>>>>>   bucket as the first regular bio.  Then the resync task has
>>>>>>   to wait until the first bio drops its conf->nr_pending[idx]
>>>>>
>>>>> Not quite.  First, the resync task (in raise_barrier()) will
>>>>> wait for ->nr_waiting[idx] to be zero.  We can assume this
>>>>> happens immediately.  Then the resync task will increment
>>>>> ->barrier[idx].  Only then will it wait for the first bio to
>>>>> drop ->nr_pending[idx].  The processing of that first bio will
>>>>> have submitted bios to the underlying device, and they will be
>>>>> in the bio_list_on_stack queue, and will not be processed
>>>>> until raid1_make_request() completes.
>>>>>
>>>>> The loop in raid1_make_request() will then call
>>>>> make_request_fn() which will call wait_barrier(), which will
>>>>> wait for ->barrier[idx] to be zero.
>>>>
>>>> Thinking more carefully about this.. the 'idx' that the second
>>>> bio will wait for will normally be different, so there won't be
>>>> a deadlock after all.
>>>>
>>>> However it is possible for hash_long() to produce the same idx
>>>> for two consecutive barrier_units so there is still the
>>>> possibility of a deadlock, though it isn't as likely as I
>>>> thought at first.
>>>
>>> Wrapped the function pointer issue Neil pointed out into Coly's
>>> original patch.  Also fix a 'use-after-free' bug.  For the
>>> deadlock issue, I'll add the below patch, please check.
>>>
>>> Thanks,
>>> Shaohua
>>>
>> Neil,
>>
>> Thanks for your patient explanation, I feel I am coming to follow
>> what you mean.  Let me try to re-tell what I understand, correct
>> me if I am wrong.
>>
>> Hmm, please hold, I am still thinking of it.
>>
>> With barrier buckets and hash_long(), I don't see a deadlock yet.
>> For raid10 it might happen, but once we have barrier buckets on
>> it, there will be no deadlock.
>>
>> My question is, this deadlock only happens when a big bio is
>> split, the split small bios are contiguous, and the resync I/O
>> visits the barrier buckets in sequential order too.  If adjacent
>> split regular bios or resync bios hit the same barrier bucket, it
>> would be a very big failure of the hash design, and should have
>> been found already.  But no one has complained about it, so I
>> cannot convince myself the deadlock is real with the I/O barrier
>> buckets (this is what Neil is concerned about).
>
> I think you are wrong about the design goal of a hash function.
> When fed a sequence of inputs, with any stride (i.e. with any
> constant difference between consecutive inputs), the output of the
> hash function should appear to be random.  A random sequence can
> produce the same number twice in a row.  If the hash function
> produces a number from 0 to N-1, you would expect two consecutive
> outputs to be the same about once every N inputs.
>

Yes, you are right.  But when I mentioned hash conflicts, I limited
the integers to the range [0, 1<<38].  38 comes from (64-17-9): when
a 64-bit LBA address is divided by the 64MB I/O barrier unit size,
its value range is reduced to [0, 1<<38].

The maximum size of a normal bio is 1MB, so it can be split into at
most 2 bios.  For a DISCARD bio, the maximum size is 4GB, so it can
be split into at most 65 bios.  Then in this patch the hash question
is reduced to: for any 65 consecutive integers in the range
[0, 1<<38], if hash_long() hashes them into the range [0, 1023],
will any hash conflict happen among these integers?

I tried half of the range, [0, 1<<37], to check for hash conflicts,
by writing a simple program to emulate the hash calculation in the
new I/O barrier patch and iterate over all runs of {2, 65, 128, 512}
consecutive integers in [0, 1<<37] (a minimal sketch of that check
is attached at the end of this mail).  On a 20-core CPU each run
took 7+ hours; in the end no hash conflict was detected up to 512
consecutive integers under the above limited condition.  For 1024
consecutive integers, a lot of hash conflicts were detected.  The
[0, 1<<37] unit range maps back to a 63-bit (byte) LBA range, which
is large enough for almost all existing md raid configurations.

So for the current kernel implementation and real-world devices,
there is no possible hash conflict for a single bio with the new I/O
barrier patch.  If bi_iter.bi_size changes from unsigned int to
unsigned long in the future, the above assumption will be wrong:
there will be hash conflicts, and a potential deadlock, which is
quite implicit.  Yes, I agree with you; no, splitting the bio inside
the loop is not perfect.

> Even if there was no possibility of a deadlock from a resync
> request happening between two bios, there are other possibilities.
>

The text below teaches me more about the raid1 code, but confuses me
more as well.  Here come my questions,

> It is not, in general, safe to call mempool_alloc() twice in a
> row, without first ensuring that the first allocation will get
> freed by some other thread.  raid1_write_request() allocates from
> r1bio_pool, and then submits bios to the underlying device, which
> get queued on bio_list_on_stack.  They will not be processed until
> after raid1_make_request() completes, so when raid1_make_request()
> loops around and calls raid1_write_request() again, it will try to
> allocate another r1bio from r1bio_pool, and this might end up
> waiting for the r1bio which is trapped and cannot complete.
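
If I follow, the hazard is the pattern below: the same thread takes a
second element from a fixed-size pool while the first one can only be
returned after the current function has finished.  Just to check my
understanding, here is a hypothetical userspace analogy (a one-slot
pool guarded by a semaphore; write_request() is only a stand-in, not
the real mempool or raid1 code):

	#include <stdio.h>
	#include <semaphore.h>

	static sem_t pool;	/* pretend this is the last free r1bio */

	/* Stand-in for raid1_write_request(): takes one element and
	 * "queues" work that would give it back only after the caller
	 * (the raid1_make_request() loop) has fully returned. */
	static void write_request(void)
	{
		sem_wait(&pool);	/* mempool_alloc() */
		/* the element is now trapped until the outer loop returns */
	}

	int main(void)
	{
		sem_init(&pool, 0, 1);	/* pool with a single free element */

		write_request();	/* first split bio: OK */

		/* second split bio: would block forever, waiting for an
		 * element that can only be freed after we return */
		if (sem_trywait(&pool) != 0)
			printf("second allocation would block -> deadlock\n");
		return 0;
	}
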
Can I say that it is because blk_finish_plug() won't be called before
raid1_make_request() returns?  Then in raid1_write_request() the mbios
will be added into plug->pending, but before blk_finish_plug() is
called they won't be handled.

> As r1bio_pool preallocates 256 entries, this is unlikely but not
> impossible.  If 256 threads all attempt a write (or read) that
> crosses a boundary, then they will consume all 256 preallocated
> entries, and want more.  If there is no free memory, they will
> block indefinitely.
>

If raid1_make_request() is modified in this way,

+	if (bio_data_dir(split) == READ)
+		raid1_read_request(mddev, split);
+	else
+		raid1_write_request(mddev, split);
+	if (split != bio)
+		generic_make_request(bio);

then the original bio will be added into the bio_list_on_stack of the
top-level generic_make_request().  current->bio_list is initialized,
so when generic_make_request() is called nested inside
raid1_make_request(), the split bio will be added into
current->bio_list and nothing else happens.  After the nested
generic_make_request() returns, execution goes back to the next lines
of generic_make_request(),

2022                 ret = q->make_request_fn(q, bio);
2023
2024                 blk_queue_exit(q);
2025
2026                 bio = bio_list_pop(current->bio_list);

bio_list_pop() will return the second half of the split bio, and it
is then handled by raid1_make_request() again via the same call,

2022                 ret = q->make_request_fn(q, bio);

Then there is no deadlock at all.

> bio_alloc_bioset() has punt_bios_to_rescuer() to attempt to work
> around a deadlock very similar to this, but it is a very specific
> solution, wouldn't help raid1, and is much more complex than just
> rearranging the code.
>

Yes, it is, definitely.

>
>> For the function pointer assignment, it is because I see a branch
>> happening in a loop.  If I use a function pointer, I can avoid a
>> redundant branch inside the loop.  raid1_read_request() and
>> raid1_write_request() are not simple functions, I don't know
>> whether gcc may make them inline or not, so I am on the way to
>> check the disassembled code..
>
> It is a long time since I studied how CPUs handle different sorts
> of machine code, but I'm fairly sure that an indirect branch (i.e.
> a branch through a function pointer) is harder to optimize than a
> conditional branch.
>

It makes sense to me, yes, you are right.

> But I think the readability of the code is more important, and
> having an if-then-else in a loop is more familiar to most readers
> than using a function pointer like this.
>

Copied, I agree with you.

>>
>> The loop in raid1_make_request() is quite high level, I am not
>> sure whether CPU branch prediction may work correctly, especially
>> when it is a big DISCARD bio; using a function pointer may drop a
>> possible branch.
>>
>> So I need to check what we gain and lose when using a function
>> pointer or not.  If it is not urgent, please hold this patch for
>> a while.
>>
>> The only thing I worry about in the patch below is, if a very big
>> DISCARD bio comes, will the kernel-space stack tend to overflow?
>
> How would a large request cause extra stack space to be used?

If my understanding is correct, there is no worry here.  Please
correct me.

Coly
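
P.S. For reference, below is a minimal sketch of the kind of conflict
check described above (not the exact program I ran).  It assumes that
hash_long() on a 64-bit kernel reduces to the multiplicative
hash_64() from include/linux/hash.h (the hash64() helper below), that
the bucket index is hash_long(unit, 10) with 1024 buckets as in the
barrier-bucket patch, and it only scans a tiny window instead of the
full [0, 1<<37] range:

	#include <stdio.h>
	#include <string.h>
	#include <stdbool.h>
	#include <stdint.h>

	#define GOLDEN_RATIO_64	0x61C8864680B583EBull	/* include/linux/hash.h */
	#define BUCKET_BITS	10
	#define BUCKETS		(1 << BUCKET_BITS)

	/* emulate 64-bit hash_long()/hash_64(): multiply, keep the top bits */
	static unsigned int hash64(uint64_t val, unsigned int bits)
	{
		return (unsigned int)((val * GOLDEN_RATIO_64) >> (64 - bits));
	}

	/* do any two of 'win' consecutive barrier units starting at
	 * 'start' fall into the same bucket? */
	static bool window_conflict(uint64_t start, int win)
	{
		bool seen[BUCKETS];
		int i;

		memset(seen, 0, sizeof(seen));
		for (i = 0; i < win; i++) {
			unsigned int idx = hash64(start + i, BUCKET_BITS);

			if (seen[idx])
				return true;
			seen[idx] = true;
		}
		return false;
	}

	int main(void)
	{
		const int windows[] = { 2, 65, 128, 512 };
		const uint64_t limit = 1ULL << 20;	/* real run: 1ULL << 37 */
		unsigned int w;
		uint64_t start;

		for (w = 0; w < sizeof(windows) / sizeof(windows[0]); w++) {
			uint64_t conflicts = 0;

			for (start = 0; start < limit; start++)
				if (window_conflict(start, windows[w]))
					conflicts++;
			printf("window %3d: %llu conflicting start positions\n",
			       windows[w], (unsigned long long)conflicts);
		}
		return 0;
	}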