Re: [LSF/MM ATTEND] block: multipage bvecs

Hi Boaz,

It is nice to see you are interested in this topic, :-)

On Sun, Feb 28, 2016 at 7:17 PM, Boaz Harrosh <boaz@xxxxxxxxxxxxx> wrote:
> On 02/26/2016 06:33 PM, Ming Lei wrote:
>> Hi,
>>
>> I'd like to participate in LSF/MM and discuss multipage bvecs.
>>
>> Kent posted the idea[1] before, but it was never pushed out.
>> I have studied multipage bvecs for a while, and think
>> it is a good idea for improving the block subsystem.
>>
>> Multipage bvecs means that one 'struct bio_vec' can hold
>> multiple physically contiguous pages instead of the
>> single page used in the current kernel.
>>
>
> Hi Ming Lei
>
> This is an interesting talk for me.
>
> I don't know if you ever tried it but I did. If I take a regular
> SSD disk or a PCIE flash card that I have in my machine and
> I stick a pointer to a page and bv_len = PAGE_SIZE * 8 and call
> submit_bio, I get 8 pages worth of IO with a single bvec and it
> all just works.

I think bio_for_each_segment() isn't ready for that yet: at least .bv_page
isn't advanced across the pages of such a bvec, but that isn't difficult
to support.

There are also lots of single-page assumptions in the kernel, such as
callers updating .bv_page of the 'bvec' returned by
bio_for_each_segment_all().
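
To make the first point concrete, here is a rough userspace sketch of
handing out single-page views of one multipage bvec. It is not kernel
code: 'struct sketch_bvec', sketch_next_single_page() and the 4K page
size are made up for illustration, and a page frame index stands in for
.bv_page.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size, illustration only */

/* Hypothetical stand-in for the kernel's struct bio_vec. */
struct sketch_bvec {
	unsigned long pfn;	/* first page frame index (plays the role of .bv_page) */
	unsigned int len;	/* length in bytes (.bv_len) */
	unsigned int offset;	/* byte offset from the start of the first page (.bv_offset) */
};

/*
 * Fill 'sp' with the single-page piece of 'mp' that starts 'done' bytes
 * into the multipage vector. Returns the piece length in bytes, or 0
 * (leaving 'sp' untouched) once the whole vector has been consumed.
 */
static unsigned int sketch_next_single_page(const struct sketch_bvec *mp,
					    unsigned int done,
					    struct sketch_bvec *sp)
{
	unsigned int pos, in_page, left;

	assert(mp != NULL && sp != NULL);
	if (done >= mp->len)
		return 0;

	pos = mp->offset + done;	/* byte position relative to the first page */
	in_page = pos % SKETCH_PAGE_SIZE;
	left = mp->len - done;

	sp->pfn = mp->pfn + pos / SKETCH_PAGE_SIZE;	/* the ".bv_page update" */
	sp->offset = in_page;
	sp->len = SKETCH_PAGE_SIZE - in_page;	/* never cross a page boundary */
	if (sp->len > left)
		sp->len = left;
	return sp->len;
}
```

A caller keeps a running 'done' count and adds each returned length to
it until the helper returns 0. The multipage vector itself is never
modified, which is the property a read-only iterator needs.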

>
> Yes Yes I know it would break bunch of other places, probably
> the single bvec case works better. But just to say that current
> code is not that picky in assuming a single page size.

I agree, at least struct bio_vec is defined that way.

>
> I would like to see an audit and test cases done in this regard
> but to keep current API and make this transparent. I think

Yeah, that is one of my goals: to make this transparent wrt. the API,
and introduce as few changes as possible to drivers & fs.

But in the case of bio_clone() followed by updating bvecs, the cloned bio
may have to be split into single-page bvecs, so the source
bio can't be very big. The callers have to be aware of
this.
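
As a rough illustration of that size limit, the userspace sketch below
counts how many single-page vectors each multipage bvec expands to and
reports how many leading bvecs of a source table still fit into a clone
limited to 256 vectors (the per-bio limit mentioned in my original
mail). The struct, the helper names and the 4K page size are all
hypothetical, not existing kernel interfaces.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size, illustration only */
#define SKETCH_MAX_VECS  256u	/* per-bio vector limit quoted in this thread */

/* Hypothetical stand-in for the kernel's struct bio_vec. */
struct sketch_bvec {
	unsigned long pfn;	/* first page frame index */
	unsigned int len;	/* length in bytes */
	unsigned int offset;	/* byte offset from the start of the first page */
};

/* Number of single-page vectors needed to describe one multipage bvec. */
static unsigned int sketch_single_page_count(const struct sketch_bvec *bv)
{
	if (bv->len == 0)
		return 0;
	return (bv->offset % SKETCH_PAGE_SIZE + bv->len + SKETCH_PAGE_SIZE - 1)
		/ SKETCH_PAGE_SIZE;
}

/*
 * Return how many leading entries of 'table' can be converted to
 * single-page vectors without exceeding SKETCH_MAX_VECS. A result
 * smaller than 'nr' means the source has to be split at that entry.
 */
static size_t sketch_clone_fit(const struct sketch_bvec *table, size_t nr)
{
	unsigned int used = 0;
	size_t i;

	assert(table != NULL || nr == 0);
	for (i = 0; i < nr; i++) {
		unsigned int need = sketch_single_page_count(&table[i]);

		if (need > SKETCH_MAX_VECS - used)
			break;
		used += need;
	}
	return i;
}
```

This only splits on bvec boundaries. A single bvec spanning more than
256 pages yields 0 here, so a real implementation would also have to
cut inside a bvec.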

> that all the below places you mentioned can be made transparent
> to "big bvec" if coded carefully, and there need not be a separate
> API for multi-page / single-page bvecs. It should all just work.
> I might be wrong, have not looked at this deeply, but is my gut
> feeling, that it can be possible.

Yeah, so far, I think we can do that, :-)

This work may bring changes to fs, dm, bcache and raid code,
and that is why I proposed the topic and hope we can talk with the
people working on these subsystems.

>
> Thanks for bringing up the issue

You are welcome!

Thanks,
Ming

> Boaz
>
>> IMO, there are several advantages by supporting multipage bvecs:
>>
>> - currently one bio from bio_alloc() can hold at most 256
>> vectors, which means one bio can transfer at most
>> 1 MiB (256 * 4 KiB). With multipage bvecs, fs can submit bigger
>> chunks via a single bio, because big physically contiguous segments
>> are very common.
>>
>> - CPU time spent iterating the bvec table should decrease
>>
>> - block merge gets simplified a lot: segments can be merged
>> right inside bio_add_page(), so single-page bvecs needn't be stored
>> in the bvec table, and finally each segment can be split for the driver
>> at a proper size. blk_bio_map_sg() gets simplified too. Recently,
>> block merge has become a bit complicated and we have seen quite a few
>> bug reports/fixes in block merge.
>>
>> I'd like to hear opinions from fs guys about multipage-bvec-based bios,
>> because this should bring some change to the bio interface (one bio
>> will represent bigger I/O than before).
>>
>> Also I hope to discuss the implementation with guys in fs, dm, md,
>> bcache..., because this feature will bring changes to
>> these subsystems. So far, I have the following ideas:
>>
>> 1) change on bio_for_each_segment()
>>
>> The bvec returned from this iterator helper needs to stay a single-page
>> vector as before, so most users of the bio iterator don't need to change.
>>
>> 2) change on bio_for_each_segment_all()
>>
>> bio_for_each_segment_all() has to be changed because callers may
>> change the bvec, and they currently assume it is always single-page.
>>
>> I think bio_for_each_segment_all() needs to be split into
>> bio_for_each_segment_all_rd() and bio_for_each_segment_all_wt().
>>
>> Both new helpers return a pointer to a bio_vec like before.
>>
>> *_rd() is used to iterate over each vector for reading the pointed-to bvec,
>> and the caller must not write to this vector. This helper can still
>> return a single-page bvec like before, so one extra local/temp 'bio_vec'
>> variable has to be added for the conversion from multipage bvec to
>> single-page bvec.
>>
>> *_wt() is used to iterate over each vector for changing the bvec, and
>> is only allowed for iterating a bio with single-page bvecs. There are
>> just a few such cases, such as bio bounce, bio_alloc_pages(),
>> raid1 and raid10.
>>
>> 3) change bvecs of cloned bio
>> In cases such as bio bounce and raid1, one bio is cloned from the
>> incoming bio, and each bvec of the cloned bio may be updated. We have to
>> introduce a single-page version of bio_clone() so that the cloned bio
>> only includes single-page bvecs; then the bvecs can be updated like
>> before.
>>
>> One problem is that the cloned bio may not be able to hold all the
>> single-page bvecs converted from the multipage bvecs in the source bio.
>> One simple solution is to split the source bio and make sure its size
>> can't be bigger than 1 MiB (256 single-page vectors).
>>
>> 4) introduce bio_for_each_mp_segment()
>>
>> The bvec returned from this iterator helper will be a multipage bvec,
>> which should be the actual/real segment, so drivers may switch to
>> this helper if they can handle multipage segments directly, which
>> should be the common case.
>>
>>
>> [1] http://marc.info/?l=linux-kernel&m=141680246629547&w=2
>>
>> Thanks,
>> Ming Lei
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>



-- 
Ming Lei


