On Tue, Nov 29, 2016 at 5:45 PM, Avi Kivity <avi@xxxxxxxxxxxx> wrote:
> On 11/29/2016 11:14 PM, NeilBrown wrote:
>>
>> On Mon, Nov 28 2016, Avi Kivity wrote:
>>>>
>>>> If it is easy for the upper layer to break a very large request into a
>>>> few very large requests, then I wouldn't necessarily object.
>>>
>>> I can't see why it would be hard.  It's simple arithmetic.
>>
>> That is easy to say before writing the code :-)
>> It probably is easy for RAID0.  Less so for RAID10.  Even less for
>> RAID6.
>>
>
> Pick the largest subrange within the input range whose bounds are 0 (mod
> stripe-size); TRIM it (for all members); apply the regular algorithm to the
> head and tail subranges.  Works for all RAID types.  If the RAID is
> undergoing reshaping, exclude the range undergoing reshaping, and treat the
> two halves separately.
>
>>>> But unless it is very hard for the lower layer to merge requests, it
>>>> should be doing that too.
>>>
>>> Merging has tradeoffs.  When you merge requests R1, R2, ... Rn you make
>>> the latency of request R1 the sum of the latencies of R1..Rn.  You may
>>> gain some efficiency in the process, but that's not going to make up for
>>> a factor of n.  The queuing layer has no way to tell whether the caller
>>> is interested in the latency of individual requests.  By sending large
>>> requests, the caller indicates it's not interested in the latency of
>>> individual subranges.  The queuing layer is still free to internally
>>> split the request into smaller ranges, to satisfy hardware constraints,
>>> or to reduce worst-case latencies for competing request streams.
>>
>> I would have thought that using plug/unplug to group requests is a
>> fairly strong statement that they can be handled as a unit if that is
>> convenient.
>
> It is not.  As an example, consider a read and a few associated read-ahead
> requests submitted in a batch.  The last thing you want is for them to be
> treated as a unit.
>
> Plug/unplug means: I have a bunch of requests here.  Whether they should be
> merged or reordered is orthogonal to whether they are members of a batch or
> not.
>
>>> So I disagree that all the work should be pushed to the merging layer.
>>> It has less information to work with, so the fewer decisions it has to
>>> make, the better.
>>
>> I think that the merging layer should be as efficient as it reasonably
>> can be, and particularly should take into account plugging.  This
>> benefits all callers.
>
> Yes, but plugging does not mean "please merge anything you can until the
> unplug".
>
>> If it can be demonstrated that changes to some of the upper layers bring
>> further improvements with acceptable costs, then certainly it is good to
>> have those too.
>
> Generating millions of requests only to merge them again is inefficient.
> It happens in an edge case (TRIM of the entirety of a very large RAID),
> but it already caused one user to believe the system had failed.  I think
> the system should be more robust than that.

DM's stripe target (raid0) has had Avi's proposed implementation (breaking a
large discard into N discards, one per component device) for a long while.
See commit 7b76ec11fec4 ("dm stripe: support discards").

This improvement applies to other raid levels too.

And I completely agree that it would be a mistake to continue to have the
block layer issue millions of chunk-sized bios only to _maybe_ have them be
merged by the IO scheduler and/or plugging.  Needless make-work that just
eats cpu and memory.
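
For what it's worth, here is a minimal userspace sketch of the split Avi
describes above: carve the largest stripe-aligned middle out of the discard
range (one large TRIM per member) and leave only the unaligned head and tail
for the regular chunk-by-chunk path.  The names (split_discard, struct
subrange) and the sector math are purely illustrative, not existing kernel
interfaces:

/* Sketch only: arithmetic for splitting a discard into head/mid/tail. */
#include <stdio.h>
#include <stdint.h>

struct subrange {
	uint64_t start;	/* in sectors */
	uint64_t len;	/* in sectors */
};

static void split_discard(uint64_t start, uint64_t len,
			  uint64_t stripe_sectors,
			  struct subrange *head, struct subrange *mid,
			  struct subrange *tail)
{
	uint64_t end = start + len;
	/* Round start up and end down to stripe boundaries. */
	uint64_t mid_start = (start + stripe_sectors - 1) / stripe_sectors
			     * stripe_sectors;
	uint64_t mid_end = end / stripe_sectors * stripe_sectors;

	if (mid_start >= mid_end) {
		/* Range holds no full stripe: everything goes the slow way. */
		*head = (struct subrange){ start, len };
		*mid  = (struct subrange){ 0, 0 };
		*tail = (struct subrange){ 0, 0 };
		return;
	}
	*head = (struct subrange){ start, mid_start - start };
	*mid  = (struct subrange){ mid_start, mid_end - mid_start };
	*tail = (struct subrange){ mid_end, end - mid_end };
}

int main(void)
{
	struct subrange head, mid, tail;

	/* e.g. 4-member RAID0 with 512KiB chunks: 4MiB stripe = 8192 sectors */
	split_discard(1000, 10000000, 8192, &head, &mid, &tail);
	printf("head: %llu+%llu (chunk-by-chunk)\n",
	       (unsigned long long)head.start, (unsigned long long)head.len);
	printf("mid:  %llu+%llu (one large TRIM per member)\n",
	       (unsigned long long)mid.start, (unsigned long long)mid.len);
	printf("tail: %llu+%llu (chunk-by-chunk)\n",
	       (unsigned long long)tail.start, (unsigned long long)tail.len);
	return 0;
}

The point is just that the split is a handful of divisions per discard,
rather than one bio per chunk; how the aligned middle is then mapped onto
each member is where the per-level (RAID0/10/6) work would actually live.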