On 12/01/2011 09:33 AM, Boaz Harrosh wrote:
> On 11/30/2011 09:05 PM, tao.peng@xxxxxxx wrote:
>>> -----Original Message-----
>>
>> Why return or forget the 1 lo_seg? What you really need, to avoid the
>> seg list exploding, is LRU-based caching, merging segments when
>> necessary, instead of asking for and dropping lsegs again and again...
>
> Ya, ya. I'm talking in the abstract now, giving the whole picture. If
> in our above case the application has closed the file and will not
> ever open it again, then I'm right, right? Of course, once you've got
> it you can keep it around in a cache. Though I think that with heavy
> segmenting, keeping segs past ROC is much harder to manage. Be careful
> with that in the current implementation.
>
>> Removing the IO_MAX limitation can be a second optimization. I was
>> hoping to remove it if the current IO_MAX thing turns out to hurt
>> performance. And one reason for IO_MAX is to reduce the likelihood
>> that the server returns a short layout, because the current
>> implementation can only retry nfs_read/write_data as a whole instead
>> of splitting it up. I think that if we do it this way, IO_MAX can be
>> removed later when necessary, by introducing a splitting mechanism on
>> either nfs_read/write_data or the desc. Now that you ask for it, I
>> think the following approach is possible:
>>
>> 1. Remove the limit on IO_MAX at .pg_test.
>> 2. Ask for a layout at .pg_doio for the size of the current IO desc.
>> 3. If the server returns a short layout, split nfs_read/write_data
>>    (or the desc) and issue the pagelist covered by the lseg.
>> 4. Do 2 and 3 in a loop until all pages in the current desc are
>>    handled.
>
> So in effect you are doing the two passes I suggest:
> one: find the next hole,
> second: collect pages, slicing by the returned lo_seg.
>
> Don't you think it is simpler to do a 3-line preliminary loop for
> "one:"?
>
> And keep the current code, which is now built to do exactly
> "second:".
>
> You are suggesting to effectively repeat the current code, using the
> first .pg_init ... .pg_doio pass for "one:", plus blocks-specific
> hacking for "second:".
>
> I want 3 lines of code for "one:" and to keep "second:" exactly as is.
>
> <snip>
>
>> Looks like you are suggesting going through the dirty page list twice
>> before issuing IO: once just for getting the IO size information, and
>> once more for page collapsing. The whole point of moving layoutget to
>> .pg_doio is to collect the real IO size there, because we don't know
>> it at .pg_init. And it is done without any changes to the generic NFS
>> IO path. I'm not sure it is appropriate to change the generic IO
>> routine to collect the information before .pg_init (I'd be happy too
>> if we can do it there).
>
> That's what you are suggesting as well; look at your step 4:
> "do 2 and 3 in a loop until all pages in the current desc are handled"
> sounds like another loop over all the pages, no?
>
> BTW: we are already doing two passes in the system; we already look
> through all the pages when building the IO desc at .write/read_pages.
>
> At first we can do the above 3-line loop in .pg_init. Later, we can
> just collect that information in generic NFS during the first loop we
> are already doing.
>
>>> a. We want it at .pg_init so we have a layout at .pg_test to
>>>    inspect.
>>>
>>>    Done properly, this will let you, in blocks, slice by extents at
>>>    .pg_test, and .write_pages can send the complete pagelist to md
>>>    (bio chaining).
>>
>> Unlike objects and files, blocks don't slice by extents, not at
>> .pg_test, nor at .read/write_pagelist.
>
> What? What do you do, send a SCSI scatter-list command? I don't think
> so. Somewhere you have to see an extent boundary in the data and send
> the continuation at the next extent on disk as a new block request, in
> a new bio with a new disk offset. No?
>
> I'm not saying to do this right away, but you could simplify the code
> a lot by slicing by extent inside .pg_test.
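A minimal sketch of the "slice by extent inside .pg_test" idea; all
types and names below are made up for illustration, and the real
.pg_test takes a pageio descriptor rather than these simplified
arguments:

        #include <stdbool.h>
        #include <stdint.h>

        /* Hypothetical, simplified stand-in for one on-disk extent;
         * not the in-tree blocklayout types.
         */
        struct extent {
                uint64_t start; /* file offset where the extent begins */
                uint64_t len;   /* extent length in bytes */
        };

        /*
         * "Slice by extent": accept the next page into the current
         * coalesced request only while the request stays inside the
         * extent covering it so far. Returning false ends the request;
         * the next page then starts a new one (new bio, new disk
         * offset).
         */
        static bool pg_test_extent(const struct extent *ext,
                                   uint64_t req_offset, uint64_t req_len)
        {
                return req_offset >= ext->start &&
                       req_offset + req_len <= ext->start + ext->len;
        }

With the boundary check living in .pg_test, each coalesced request maps
to exactly one extent, which is what would let .write_pages hand the
whole pagelist down in one go (bio chaining), as Boaz describes above.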
> <>
>>> At first version:
>>> A good approximation, which gives you an exact middle point between
>>> blocks B.2 and objects/files B.2, is the dirty count.
>>> At a later patch:
>>> Have generic NFS collect the above O1, N1, and Nn for you and base
>>> your decision on that.
>>
>> Well, unless you put both of the two parts in... The first version
>> ignores the fact that a blocks MDS cannot give out file striping
>> information as easily as objects and files do. And I will stand
>> against it alone, because all it does is benefit objects while
>> hurting blocks (files don't care because they use whole-file layouts,
>> at least for now).
>
> Let's make a compromise: let's find O1..N1 and put it in the generic
> code for everyone: objects, segmented files, and blocks. Because I
> need it too. And I'd like O1..Nn, but as a first step I'd love O1..N1
> a lot.
>
> Please see my point. You created a system that only blocks can benefit
> from, unless objects repeats the same exact, duplicated hacks as
> blocks.
>
> Please do the *very* simple thing I suggest, which we can all enjoy.
>
>> Otherwise, I would suggest having a private hack for blocks, because
>> we have a real problem to solve.
>
> It's just as real for objects, and for files when they do segments.
>
> My suggestion is to have two helpers at pnfs.c, one for blocks and
> objects, one for files, which can be called in .pg_init. The
> blocks/objects one does a simple loop counting the first contiguous
> chunk and asks for a layout, like today. The files one just sends an
> all-file request.
>
> Boaz
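A rough sketch of that two-helper idea; layoutget() below is a
hypothetical stand-in for the real layoutget path, and none of the
names, types, or the fixed page size are actual pnfs.c code:

        #include <stddef.h>
        #include <stdint.h>

        #define SKETCH_PAGE_SIZE 4096ULL

        /* Hypothetical stand-in: request a layout segment covering
         * [offset, offset + length).
         */
        extern void layoutget(uint64_t offset, uint64_t length);

        /* Hypothetical, simplified view of one queued page request. */
        struct page_req {
                uint64_t index; /* page index within the file */
        };

        /*
         * blocks/objects helper: count the first contiguous run of
         * dirty pages (stop at the first hole) and ask for a layout
         * segment covering exactly that range.
         */
        static void pg_init_segmented(const struct page_req *pages,
                                      size_t n)
        {
                size_t i;

                if (n == 0)
                        return;

                for (i = 1; i < n; i++)
                        if (pages[i].index != pages[i - 1].index + 1)
                                break; /* first hole: stop counting */

                layoutget(pages[0].index * SKETCH_PAGE_SIZE,
                          i * SKETCH_PAGE_SIZE);
        }

        /* files helper: one whole-file layout request, as today. */
        static void pg_init_whole_file(void)
        {
                layoutget(0, UINT64_MAX);
        }

Only the first contiguous chunk drives the layoutget size here, which
appears to be the O1..N1 first step Boaz asks for in the compromise
above.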