So I took my orangefs_readahead patch and David Howells' patches from his fscache-iter branch and put them on top of Linux 5.12-rc3 so as to get rid of all that RIP: 0010:ttm_bo_release+0x2ea/0x340 [ttm] stuff that was happening to VMs with 5.12-rc2. Then I added a readahead_expand call at the start of orangefs_readahead. I played with various expansion values, but the bottom line is: it works GREAT in my simple tests, speeds reads WAY up. -Mike On Sat, Mar 13, 2021 at 10:31 AM Mike Marshall <hubcap@xxxxxxxxxxxx> wrote: > > Greetings everyone. > > I have made another version of orangefs_readahead, without any > of my hand rolled page cache manipulations. I read a bunch of > the source in other filesystems and mm and fs and pagemap.h to > try and get an idea of how to implement readahead so that my > implementation is "with the program". > > I have described the flawed code I have upstream now in an > earlier message. My flawed code has no readahead implementation, but > it is much faster than with this readahead implementation. > > If this readahead implementation is "the right idea", I can > use it as a framework to implement an async orangefs read function > and start the read at the beginning of my readahead function > and collect the results at the end after the readahead pages > have been marshaled. Also, once some mechanism like David Howells' > code to control the readahead window goes upstream, I should be > able take big enough gulps of readahead to make Orangefs do right. > The heuristically chosen 64 page max that I can get now isn't enough. > > I hope some of y'all have the time to review this implementation of > readahead... > > Thanks! > > -Mike > > static void orangefs_readahead(struct readahead_control *rac) > { > struct page **pages; > unsigned int npages = readahead_count(rac); > loff_t offset = readahead_pos(rac); > struct bio_vec *bvs; > int i; > struct iov_iter iter; > struct file *file = rac->file; > struct inode *inode = file->f_mapping->host; > int ret; > > /* allocate an array of page pointers. */ > pages = kzalloc(npages * (sizeof(struct page *)), GFP_KERNEL); > > /* Get a batch of pages to read. */ > npages = __readahead_batch(rac, pages, npages); > > /* allocate an array of bio_vecs. */ > bvs = kzalloc(npages * (sizeof(struct bio_vec)), GFP_KERNEL); > > /* hook the bio_vecs to the pages. */ > for (i = 0; i < npages; i++) { > bvs[i].bv_page = pages[i]; > bvs[i].bv_len = PAGE_SIZE; > bvs[i].bv_offset = 0; > } > > iov_iter_bvec(&iter, READ, bvs, npages, npages * PAGE_SIZE); > > /* read in the pages. */ > ret = wait_for_direct_io(ORANGEFS_IO_READ, inode, &offset, &iter, > npages * PAGE_SIZE, inode->i_size, NULL, NULL, file); > > /* clean up. */ > for (i = 0; i < npages; i++) { > SetPageUptodate(bvs[i].bv_page); > unlock_page(bvs[i].bv_page); > put_page(bvs[i].bv_page); > } > kfree(pages); > kfree(bvs); > } > > On Mon, Feb 1, 2021 at 10:32 PM Mike Marshall <hubcap@xxxxxxxxxxxx> wrote: > > > > >> This is not the way to do it. You need to actually kick > > >> off readahead in this routine so that you get pipelining > > >> (ie the app is working on pages 0-15 at the same time > > >> the server is getting you pages 16-31). > > > > Orangefs isn't very good at reading or writing a few > > pages at a time. Its optimal block size is four megabytes. > > I'm trying to do IOs big enough to make Orangefs > > start flowing like it needs to and then have pages > > on hand to fill with the data. Perhaps I can figure > > how to use Dave Howell's code to control the > > readahead window and make adjustments to > > how many pages Orangefs reads per IO and > > end up with something that is closer to how > > readahead is intended to be used. > > > > This patch is a big performance improvement over > > the code that's upstream even though I'm > > not using readahead as intended. > > > > >> I don't see much support in orangefs for doing async > > >> operations; everything seems to be modelled on > > >> "submit an I/O and wait for it to complete". > > > > Yep... when we were polishing up the kernel module to > > attempt to go upstream, the code in there for async was > > left behind... I might be able to make sense of it now, > > Ida know... You've helped me to see this place where > > it is needed. > > > > >> adding async > > >> support to orangefs is a little bigger task than I'm willing to put > > >> significant effort into right now. > > > > The effort and help that you're providing is much > > appreciated and just what I need, thanks! > > > > -Mike > > > > On Mon, Feb 1, 2021 at 8:08 AM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote: > > > > > > On Sun, Jan 31, 2021 at 05:25:02PM -0500, Mike Marshall wrote: > > > > I wish I knew how to specify _nr_pages in the readahead_control > > > > structure so that all the extra pages I need could be obtained > > > > in readahead_page instead of part there and the rest in my > > > > open-coded stuff in orangefs_readpage. But it looks to me as > > > > if values in the readahead_control structure are set heuristically > > > > outside of my control over in ondemand_readahead? > > > > > > That's right (for now). I pointed you at some code from Dave Howells > > > that will allow orangefs to enlarge the readahead window beyond that > > > determined by the core code's algorithms. > > > > > > > [root@vm3 linux]# git diff master..readahead > > > > diff --git a/fs/orangefs/inode.c b/fs/orangefs/inode.c > > > > index 48f0547d4850..682a968cb82a 100644 > > > > --- a/fs/orangefs/inode.c > > > > +++ b/fs/orangefs/inode.c > > > > @@ -244,6 +244,25 @@ static int orangefs_writepages(struct > > > > address_space *mapping, > > > > > > > > static int orangefs_launder_page(struct page *); > > > > > > > > +/* > > > > + * Prefill the page cache with some pages that we're probably > > > > + * about to need... > > > > + */ > > > > +static void orangefs_readahead(struct readahead_control *rac) > > > > +{ > > > > + pgoff_t index = readahead_index(rac); > > > > + struct page *page; > > > > + > > > > + while ((page = readahead_page(rac))) { > > > > + prefetchw(&page->flags); > > > > + put_page(page); > > > > + unlock_page(page); > > > > + index++; > > > > + } > > > > + > > > > + return; > > > > +} > > > > > > This is not the way to do it. You need to actually kick off readahead in > > > this routine so that you get pipelining (ie the app is working on pages > > > 0-15 at the same time the server is getting you pages 16-31). I don't > > > see much support in orangefs for doing async operations; everything > > > seems to be modelled on "submit an I/O and wait for it to complete". > > > > > > I'm happy to help out with pagecache interactions, but adding async > > > support to orangefs is a little bigger task than I'm willing to put > > > significant effort into right now.