On Wed, Aug 18, 2010 at 11:56 AM, Venkateswararao Jujjuri (JV) <jvrao@xxxxxxxxxxxxxxxxxx> wrote: > Eric Van Hensbergen wrote: >> On Tue, Aug 17, 2010 at 6:31 PM, Venkateswararao Jujjuri (JV) >> <jvrao@xxxxxxxxxxxxxxxxxx> wrote: >>>>> Additionally, given we know the page size constant, couldn't we infer >>>>> this from a negotiated msize? Transports already have an maxsize >>>>> field which limits the msize selections (and defaults? if not maybe it >>>>> should?) -- why not just use that? >>> The IO size in this mode is derived by virtio ring size rather than the msize. >>> msize is restricted to the maximum amount of data that gets copied on the the >>> kmalloc'd buffer. >>> In this case it is just the header. >>> >> >> Actually, msize is supposed to be the maximum amount of data that gets >> copied over the transport which should be a factor determined by the >> client and server and the underlying technology which links them. >> From that aspect I think its a perfectly valid way to set this up and >> is already part of the mechanisms. > Our intention of this patch was to achieve zero copy for reads/writes with > minimal changes, not affecting other transports and other vfs transactions. > > In this method, if the transport has capability to send user pages directly, we > are using/treating > the msize as the maximum allowed size for header, and the actual payload is > limited/controlled > by the virtio ring size. With these minor set of changes, we achieved the zero > copy on the most > important code path, read/write...without any changes to other sections of the code. > I think this is a misuse of msize. iohdrsize is the maximum allowed header size, it is a component of msize and msize is always greater than iohdrsize. It just feels to me like we are duplicating information. > This could be a step in the right direction, in the future when we are ready to > modify > other code paths like readdir, xattr we can take more broader set of changes. > Currently we are using 8k as the msize...kmalloc() on anything beyond 1 page(4k) > may fail > under high load.. so we may have to revisit our entire logic somewhere down the line > on how do we segregate header from the actual payload. This is a separate issue IMHO, but an important one to track. >> >> I don't know if I saw such a use in the existing patch. Can you give >> me a concrete use case where we would use multiple "short" pages >> instead of mostly full pages and potentially a single short page at >> the end? > > I am talking about short reads/writes not short pages. > Yes at the most you will have two partial pages. (un-aligned start, and may be > last page) > > The case where the request into VFS is more than the actual read/write we could do > either because of transport restriction/msize .. or something else. > > Say, if the read size of 512k comes into vfs, and we could put only 480k into one > request because of virtio ring limitation (128-8(header) * 4k). At VFS/client level > we need to retry the read for the rest of the IO..before we return back to the user. > Okay - so the concern is that an unaligned start of the buffer might lead to an oversubscription of the ring elements because msize is just a static limit and does not take into consideration the alignment of the initial buffer. However, given that we are checking this in the client code anyways, it still seems like we can use the msize as an indication of how many elements we have in the scatter-gather ring since it'll always be msize/pagesize = number of ring elements when we are using an sg_array. However, that may be a bit of an oversubscription of the msize field, so I think I'm willing to accept a field in a structure which represents the maximum number of elements in a scatter/gather array. I guess the only remaining question is whether there would ever be case where this would need to be per client or if we could just make it per transport. My feeling right now is that per-transport is fine as its all constants anyways - so why not just add a field to the transport structure? -eric -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html