From: Mike Marshall <hubcap@xxxxxxxxxxxx>

---------- added text for this mail message ----------------

Here's a related patch. Something like this needs to be added to
Christoph's reversion in order for Orangefs to compile and keep
working. Using this static read size, rather than the reverted logic
for determining a read size, seems faster in my tests anyhow.

I accept Christoph's assertion that there would be a race, and I
looked at some of the vfs code (vfs_read, new_sync_read and related).
I guess the race Christoph sees would happen in a threaded userspace
program? I would be a better maintainer if I saw the race more
clearly, if anyone wants to go into it. I guess I could employ a
locking scheme if I wanted to keep the reverted code (I don't) instead
of getting rid of it?

As an aside, the page cache has been a blessing and a curse for us.
Since we started using it, small IO has improved incredibly, but our
max speed hits a plateau before it otherwise would have. I think that
is because of all the page-sized copies we have to do to fill our
4 meg native buffers. I try to read about all the new work going into
the page cache on LWN, and make some sense of the new code :-). One
thing I remember is Christoph Lameter saying "the page cache does not
scale", but I know the new work is focused on that. If anyone has any
thoughts about how we could improve filling our native buffers from
the page cache (larger page sizes?), feel free to offer any help...

Anywho... thanks, and I'll try to get this patch and Christoph's two
pulled if nobody sees a problem with them...

-Mike

------------------------------------------------------------

Logically, optimal Orangefs "pages" are 4 megabytes. Reading large
Orangefs files 4096 bytes at a time is like trying to kick a dead
whale down the beach.

Before Christoph's "Revert orangefs: remember count when reading.",
I tried to give users a knob whereby they could, for example, use
"count" in read(2) or bs with dd(1) to get whatever they considered
an appropriate number of bytes at a time from Orangefs, and fill as
many page cache pages as they could at once.

Without the racy code that Christoph reverted, Orangefs won't even
compile, much less work. So this replaces the logic that used the
private file data that Christoph reverted with a static number of
bytes to read from Orangefs.

I ran tests like the following to determine what a reasonable static
number of bytes might be. Each run reads the same 512 MiB
(count x bs) with a different block size:

  dd if=/pvfsmnt/asdf of=/dev/null count=128 bs=4194304
  dd if=/pvfsmnt/asdf of=/dev/null count=256 bs=2097152
  dd if=/pvfsmnt/asdf of=/dev/null count=512 bs=1048576
  .
  .
  .
  dd if=/pvfsmnt/asdf of=/dev/null count=4194304 bs=128

Reads seem faster using the static number, so my "knob code" wasn't
just racy, it wasn't even a good idea...

Signed-off-by: Mike Marshall <hubcap@xxxxxxxxxxxx>
---
 fs/orangefs/inode.c | 39 ++++++---------------------------------
 1 file changed, 6 insertions(+), 33 deletions(-)

diff --git a/fs/orangefs/inode.c b/fs/orangefs/inode.c
index 961c0fd8675a..fb0884626d18 100644
--- a/fs/orangefs/inode.c
+++ b/fs/orangefs/inode.c
@@ -259,46 +259,19 @@ static int orangefs_readpage(struct file *file, struct page *page)
 	pgoff_t index; /* which page */
 	struct page *next_page;
 	char *kaddr;
-	struct orangefs_read_options *ro = file->private_data;
 	loff_t read_size;
-	loff_t roundedup;
 	int buffer_index = -1; /* orangefs shared memory slot */
 	int slot_index; /* index into slot */
 	int remaining;
 
 	/*
-	 * If they set some miniscule size for "count" in read(2)
-	 * (for example) then let's try to read a page, or the whole file
-	 * if it is smaller than a page. Once "count" goes over a page
-	 * then lets round up to the highest page size multiple that is
-	 * less than or equal to "count" and do that much orangefs IO and
-	 * try to fill as many pages as we can from it.
-	 *
-	 * "count" should be represented in ro->blksiz.
-	 *
-	 * inode->i_size = file size.
+	 * Get up to this many bytes from Orangefs at a time and try
+	 * to fill them into the page cache at once.
+	 * Tests with dd made this seem like a reasonable static
+	 * number, if there was interest perhaps this number could
+	 * be made setable through sysfs...
 	 */
-	if (ro) {
-		if (ro->blksiz < PAGE_SIZE) {
-			if (inode->i_size < PAGE_SIZE)
-				read_size = inode->i_size;
-			else
-				read_size = PAGE_SIZE;
-		} else {
-			roundedup = ((PAGE_SIZE - 1) & ro->blksiz) ?
-				((ro->blksiz + PAGE_SIZE) & ~(PAGE_SIZE -1)) :
-				ro->blksiz;
-			if (roundedup > inode->i_size)
-				read_size = inode->i_size;
-			else
-				read_size = roundedup;
-
-		}
-	} else {
-		read_size = PAGE_SIZE;
-	}
-	if (!read_size)
-		read_size = PAGE_SIZE;
+	read_size = 524288;
 
 	if (PageDirty(page))
 		orangefs_launder_page(page);
-- 
2.25.1