RE: [PATCH-RESEND 4/4] pnfsblock: do not ask for layout in pg_init

> -----Original Message-----
> From: linux-nfs-owner@xxxxxxxxxxxxxxx [mailto:linux-nfs-owner@xxxxxxxxxxxxxxx] On Behalf Of Boaz
> Harrosh
> Sent: Friday, December 02, 2011 1:33 AM
> To: Peng, Tao
> Cc: bergwolf@xxxxxxxxx; bhalevy@xxxxxxxxxx; Trond.Myklebust@xxxxxxxxxx; linux-nfs@xxxxxxxxxxxxxxx
> Subject: Re: [PATCH-RESEND 4/4] pnfsblock: do not ask for layout in pg_init
> 
> On 11/30/2011 09:05 PM, tao.peng@xxxxxxx wrote:
> >> -----Original Message-----
> > Why return or forget the 1 lo_seg? What you really need to avoid the seg
> > list exploding is LRU-based caching, merging segments when necessary,
> > instead of asking for and dropping lsegs again and again...
> >
> 
> Ya, ya, I'm talking in the abstract now, giving the whole picture. If in our
> above case the application has closed the file and will never open it
> again, then I'm right, right? Of course once you have it you can keep it
> around in a cache. Though I think that with heavy segmenting, keeping segs
> past ROC is much harder to manage. Be careful with that in the current
> implementation.
> 
In the above, your description is not only about keeping segments past ROC; if the file is still open, you will be asking for the same segment again and again... no?
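
Just to spell out what I mean by LRU-based caching and merging, here is a toy, userspace-only sketch. Every type and function below is invented for illustration; it is not the real struct pnfs_layout_segment or anything from pnfs.c.

/*
 * Toy sketch of LRU-cached layout segments with merging.
 * All names are made up; not the real pnfs.c code.
 */
#include <stdlib.h>

struct lseg {                   /* stand-in for a cached layout segment */
	unsigned long start;    /* first byte covered */
	unsigned long end;      /* one past the last byte covered */
	struct lseg *next;
};

struct lseg_cache {
	struct lseg *head;      /* most recently used first */
	int count;
	int max;                /* evict the LRU tail beyond this */
};

/* Find a cached segment covering [start, end); on a hit, move it to the head. */
static struct lseg *lseg_cache_lookup(struct lseg_cache *c,
				      unsigned long start, unsigned long end)
{
	struct lseg **pp, *s;

	for (pp = &c->head; (s = *pp) != NULL; pp = &s->next) {
		if (start >= s->start && end <= s->end) {
			*pp = s->next;          /* unlink */
			s->next = c->head;      /* reinsert at the head */
			c->head = s;
			return s;
		}
	}
	return NULL;                            /* miss: would LAYOUTGET here */
}

/* Insert a freshly returned segment, merging with an overlapping or adjacent
 * cached one instead of letting the list grow without bound. */
static void lseg_cache_insert(struct lseg_cache *c,
			      unsigned long start, unsigned long end)
{
	struct lseg *s, **pp;

	for (s = c->head; s; s = s->next) {
		if (start <= s->end && end >= s->start) {
			if (start < s->start)
				s->start = start;
			if (end > s->end)
				s->end = end;
			return;
		}
	}
	s = malloc(sizeof(*s));
	if (!s)
		return;
	s->start = start;
	s->end = end;
	s->next = c->head;
	c->head = s;
	if (++c->count > c->max) {
		for (pp = &c->head; (*pp)->next; pp = &(*pp)->next)
			;                       /* walk to the LRU tail */
		free(*pp);
		*pp = NULL;
		c->count--;
	}
}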

> > Removing the IO_MAX limitation can be a second optimization. I was
> > hoping to remove it if the current IO_MAX thing turns out to hurt
> > performance. And one reason for IO_MAX is to avoid the likelihood
> > that the server returns a short layout, because the current
> > implementation is limited to retrying nfs_read/write_data as a whole
> > instead of splitting it up. I think that if we do it this way, IO_MAX
> > can be removed later when necessary, by introducing a splitting
> > mechanism either on nfs_read/write_data or on the desc. Now that you
> > ask for it, I think the following approach is possible:
> >
> > 1. Remove the limit on IO_MAX at .pg_test.
> > 2. Ask for layout at .pg_doio for the size of the current IO desc.
> > 3. If the server returns a short layout, split nfs_read/write_data or
> >    the desc and issue the pagelist covered by the lseg.
> > 4. Do 2 and 3 in a loop until all pages in the current desc are handled.
> >
> 
> So in effect you are doing what I suggest: two passes.
> one: what's the next hole,
> second: collect pages, slicing by the returned lo_seg
> 
> Don't you think it is simpler to do a 3-line preliminary
> loop in "one:"?
> 
> and keep the current code, which is now built to do exactly
> "second:"
> 
> You are suggesting to effectively repeat the current code, using
> the first .pg_init...pg_doio for one: and hacking "second:" for
> blocks
> 
> I want 3 lines of code for one: and to keep second: exactly as is.
> 
> <snip>
> 
> > Looks like you are suggesting going through the dirty page list twice
> > before issuing IO, once just for getting the IO size information and
> > once more for page collapsing. The whole point of moving layoutget
> > to .pg_doio is to collect the real IO size there, because we don't
> > know it at .pg_init. And it is done without any changes to the generic
> > NFS IO path. I'm not sure it is appropriate to change the generic IO
> > routine to collect the information before .pg_init (I'd be happy too
> > if we could do it there).
> 
> That's what you are suggesting as well; look at your step 4:
> 	"do 2 and 3 in a loop until all pages in the current desc are handled"
> sounds like another loop over all pages, no?
> 
No. The loop only happens when the server returns less layout than the client asked for, whereas with your solution you need to loop twice in all cases.
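
To make steps 2-4 above concrete, here is roughly the shape of the loop I have in mind, as a standalone sketch. Every type and function is invented for the example and stubbed so it compiles; it is not the actual nfs_pageio/pnfs interface.

/*
 * Sketch of the "ask for layout at .pg_doio, split on a short layout" loop
 * (steps 2-4 above).  All names are invented and stubbed for illustration.
 */
struct io_desc {
	unsigned long offset;   /* byte offset of the first queued page */
	unsigned long count;    /* total bytes queued in the desc */
};

/* Pretend LAYOUTGET: returns how many bytes of layout the server granted
 * for [offset, offset + count); less than 'count' means a short layout. */
static unsigned long layoutget(unsigned long offset, unsigned long count)
{
	(void)offset;
	return count;           /* a real server may grant less */
}

/* Pretend I/O issue for the part of the desc covered by the granted layout. */
static int issue_pages(struct io_desc *desc, unsigned long offset,
		       unsigned long count)
{
	(void)desc; (void)offset; (void)count;
	return 0;               /* pretend the I/O was sent successfully */
}

static int pg_doio_with_short_layouts(struct io_desc *desc)
{
	unsigned long off = desc->offset;
	unsigned long left = desc->count;

	while (left > 0) {
		unsigned long granted = layoutget(off, left);   /* step 2 */

		if (granted == 0)
			return -1;      /* e.g. fall back to I/O through the MDS */
		if (granted > left)
			granted = left;

		if (issue_pages(desc, off, granted))            /* step 3 */
			return -1;

		off += granted;         /* step 4: loop over what is left */
		left -= granted;
	}
	return 0;
}

The loop body runs more than once only when the server actually returns a short layout.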

> BTW: we are already doing two passes in the system; we already looked
> through all the pages when building the io desc at .write/read_pages
> 
> At first we can do the above 3-line loop in .pg_init. Second, we can
> just collect that information in generic NFS during the first loop
> we are already doing
> 
Please understand that I totally agree with you as long as the two parts go in together. It's just that the second half of your work worries me. If we really do it in two phases, I am afraid the second half will never get done, because it goes too deep into generic code. And if we only have the first version of your solution, it is going to hurt blocks.

> >>
> >>     a. We want it at .pg_init so we have a layout at .pg_test to inspect.
> >>
> >>        Done properly, it will let you, in blocks, slice by extents at .pg_test,
> >>        and .write_pages can send the complete pagelist to md (bio chaining)
> >>
> > Unlike objects and files, blocks don't slice by extents, neither at
> > .pg_test nor at .read/write_pagelist.
> >
> 
> What? What do you do? Send a SCSI scatter-list command?
> I don't think so. Somewhere you have to see an extent boundary in the data
> and send the continuation on the next on-disk extent as a new block request
> in a new bio with a new disk offset. No?
> 
> I'm not saying to do this right away, but you could simplify the code a lot
> by slicing by extent inside .pg_test
> 
OK, I see what you mean by "slice by extents". And I admit that blocks could be changed as you suggest. But I do not see your motivation for asking for the change. Even if we do layoutget at .pg_init (which is what we have today), the current blocks code works well under it. And I do not see any restriction at the generic layer saying there is something wrong with the current implementation. So what is the benefit for blocks in changing as you suggest? IMHO the current implementation is better, because blocks only needs to visit the extent cache once and for all, in read/write_pagelist. But with your suggestion, we would have to do it for every .pg_test, and yet again in read/write_pagelist.
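
For reference, this is what I understand "slice by extents at .pg_test" to mean, as a made-up standalone sketch; nothing here is the real bl_pg_test() or the blocks extent cache. Note that the extent lookup now runs once per .pg_test call.

/*
 * Illustration of "slice by extents at .pg_test": only coalesce the next
 * page into the current request if it falls inside the same on-disk extent
 * as what is already queued.  All names are invented for the example.
 */
struct extent {
	unsigned long f_start;  /* file offset where the extent begins */
	unsigned long f_end;    /* file offset one past the extent's end */
};

/* Pretend extent-cache lookup for the extent covering 'off'. */
static int lookup_extent(unsigned long off, struct extent *ex)
{
	ex->f_start = off & ~((1UL << 20) - 1);   /* say, 1MB extents */
	ex->f_end = ex->f_start + (1UL << 20);
	return 0;
}

/* .pg_test-style decision: may the page starting at 'next_off' join the
 * request that currently ends at 'prev_end' (assumed non-zero)? */
static int pg_test_same_extent(unsigned long prev_end, unsigned long next_off)
{
	struct extent ex;

	if (lookup_extent(prev_end - 1, &ex))
		return 0;               /* unknown extent: start a new request */
	return next_off >= ex.f_start && next_off < ex.f_end;
}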

> <>
> >>
> >>     In a first version:
> >>       A good approximation, which gives you an exact middle point
> >>       between blocks B.2 and objects/files B.2, is the dirty count.
> >>     In a later patch:
> >>       Have generic NFS collect the above O1, N1, and Nn for you and base
> >>       your decision on that.
> >>
> >
> > Well, unless you put both parts in... The first version
> > ignores the fact that a blocks MDS cannot give out file striping
> > information as easily as objects and files do. And I will stand
> > against it alone, because all it does is benefit objects while
> > hurting blocks (files don't care because they use whole-file layouts,
> > at least for now).
> 
> Let's make a compromise: let's find O1..N1 and put it in the generic
> code for everyone: objects, segmented files, and blocks. Because I
> need it too. And I'd like O1..Nn, but as a first step I'd love
> O1..N1 a lot.
> 
> Please see my point. You created a system that only blocks can
> benefit from, unless objects repeats the exact same duplicated
> hacks as blocks.
> 
> Please do the *very* simple thing I suggest, which we can all enjoy.
> 
I do want to make it generic so all layouts can benefit. But when I can't, I have to do private hacks...

> >
> > Otherwise, I would suggest having a private hack for blocks, because we have a real problem to solve.
> >
> 
> It's just as real for objects, and for files once they do segments.
> 
> My suggestion is to have two helpers in pnfs.c, one for blocks and
> objects, one for files, which can be called from .pg_init.
> The blocks/objects one does a simple loop counting the first contiguous
> chunk and asks for a layout, like today. The files one just sends an
> all-file request.
> 
I'm sorry but I don't get it... How do you count the first contiguous chunk in .pg_init? All you know at .pg_init is an empty io desc and one nfs page. Are you going to scan the radix tree and find the next page that write_cache_pages is going to flush out? I don't really think anyone will agree to that... Please look into the second part of your solution and then you'll see why I am worried that it may never get done.
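
To make concrete what I think such a count would have to look like, here is a standalone sketch. page_is_dirty() is an invented stand-in for a page-cache/radix-tree lookup, not a real kernel helper, and it is exactly this kind of walk that I don't think belongs in .pg_init.

/*
 * Sketch of "count the first contiguous chunk at .pg_init": walk forward
 * from the page handed to .pg_init and count consecutive dirty pages.
 * All names are invented for illustration.
 */
static int page_is_dirty(unsigned long idx)
{
	(void)idx;
	return 0;               /* stub so the sketch compiles */
}

/* Count consecutive dirty pages starting at 'first_idx', capped so the
 * walk cannot run away on a file with a huge dirty range. */
static unsigned long count_contiguous_dirty(unsigned long first_idx,
					    unsigned long max_pages)
{
	unsigned long n = 1;    /* the page .pg_init was called with */

	while (n < max_pages && page_is_dirty(first_idx + n))
		n++;
	return n;
}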

Best Regards,
Tao

> Boaz