Re: [RFC PATCH v2] implement orangefs_readahead

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



So I took my orangefs_readahead patch and
David Howells' patches from his fscache-iter
branch and put them on top of Linux 5.12-rc3
so as to get rid of all that
 RIP: 0010:ttm_bo_release+0x2ea/0x340 [ttm]
stuff that was happening to VMs with
5.12-rc2.  Then I added a readahead_expand
call at the start of orangefs_readahead. I
played with various expansion values, but
the bottom line is: it works GREAT in my
simple tests, speeds reads WAY up.

-Mike

On Sat, Mar 13, 2021 at 10:31 AM Mike Marshall <hubcap@xxxxxxxxxxxx> wrote:
>
> Greetings everyone.
>
> I have made another version of orangefs_readahead, without any
> of my hand rolled page cache manipulations. I read a bunch of
> the source in other filesystems and mm and fs and pagemap.h to
> try and get an idea of how to implement readahead so that my
> implementation is "with the program".
>
> I have described the flawed code I have upstream now in an
> earlier message. My flawed code has no readahead implementation, but
> it is much faster than with this readahead implementation.
>
> If this readahead implementation is "the right idea", I can
> use it as a framework to implement an async orangefs read function
> and start the read at the beginning of my readahead function
> and collect the results at the end after the readahead pages
> have been marshaled. Also, once some mechanism like David Howells'
> code to control the readahead window goes upstream, I should be
> able take big enough gulps of readahead to make Orangefs do right.
> The heuristically chosen 64 page max that I can get now isn't enough.
>
> I hope some of y'all have the time to review this implementation of
> readahead...
>
> Thanks!
>
> -Mike
>
> static void orangefs_readahead(struct readahead_control *rac)
> {
> struct page **pages;
> unsigned int npages = readahead_count(rac);
> loff_t offset = readahead_pos(rac);
> struct bio_vec *bvs;
> int i;
> struct iov_iter iter;
> struct file *file = rac->file;
> struct inode *inode = file->f_mapping->host;
> int ret;
>
> /* allocate an array of page pointers. */
> pages = kzalloc(npages * (sizeof(struct page *)), GFP_KERNEL);
>
> /* Get a batch of pages to read. */
> npages = __readahead_batch(rac, pages, npages);
>
> /* allocate an array of bio_vecs. */
> bvs = kzalloc(npages * (sizeof(struct bio_vec)), GFP_KERNEL);
>
> /* hook the bio_vecs to the pages. */
> for (i = 0; i < npages; i++) {
> bvs[i].bv_page = pages[i];
> bvs[i].bv_len = PAGE_SIZE;
> bvs[i].bv_offset = 0;
> }
>
> iov_iter_bvec(&iter, READ, bvs, npages, npages * PAGE_SIZE);
>
> /* read in the pages. */
> ret = wait_for_direct_io(ORANGEFS_IO_READ, inode, &offset, &iter,
> npages * PAGE_SIZE, inode->i_size, NULL, NULL, file);
>
> /* clean up. */
> for (i = 0; i < npages; i++) {
> SetPageUptodate(bvs[i].bv_page);
> unlock_page(bvs[i].bv_page);
> put_page(bvs[i].bv_page);
> }
> kfree(pages);
> kfree(bvs);
> }
>
> On Mon, Feb 1, 2021 at 10:32 PM Mike Marshall <hubcap@xxxxxxxxxxxx> wrote:
> >
> > >> This is not the way to do it. You need to actually kick
> > >> off readahead in this routine so that you get pipelining
> > >> (ie the app is working on pages 0-15 at the same time
> > >> the server is getting you pages 16-31).
> >
> > Orangefs isn't very good at reading or writing a few
> > pages at a time. Its optimal block size is four megabytes.
> > I'm trying to do IOs big enough to make Orangefs
> > start flowing like it needs to and then have pages
> > on hand to fill with the data. Perhaps I can figure
> > how to use Dave Howell's code to control the
> > readahead window and make adjustments to
> > how many pages Orangefs reads per IO and
> > end up with something that is closer to how
> > readahead is intended to be used.
> >
> > This patch is a big performance improvement over
> > the code that's upstream even though I'm
> > not using readahead as intended.
> >
> > >> I don't see much support in orangefs for doing async
> > >> operations; everything seems to be modelled on
> > >> "submit an I/O and wait for it to complete".
> >
> > Yep... when we were polishing up the kernel module to
> > attempt to go upstream, the code in there for async was
> > left behind... I might be able to make sense of it now,
> > Ida know... You've helped me to see this place where
> > it is needed.
> >
> > >> adding async
> > >> support to orangefs is a little bigger task than I'm willing to put
> > >> significant effort into right now.
> >
> > The effort and help that you're providing is much
> > appreciated and just what I need, thanks!
> >
> > -Mike
> >
> > On Mon, Feb 1, 2021 at 8:08 AM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
> > >
> > > On Sun, Jan 31, 2021 at 05:25:02PM -0500, Mike Marshall wrote:
> > > > I wish I knew how to specify _nr_pages in the readahead_control
> > > > structure so that all the extra pages I need could be obtained
> > > > in readahead_page instead of part there and the rest in my
> > > > open-coded stuff in orangefs_readpage. But it looks to me as
> > > > if values in the readahead_control structure are set heuristically
> > > > outside of my control over in ondemand_readahead?
> > >
> > > That's right (for now).  I pointed you at some code from Dave Howells
> > > that will allow orangefs to enlarge the readahead window beyond that
> > > determined by the core code's algorithms.
> > >
> > > > [root@vm3 linux]# git diff master..readahead
> > > > diff --git a/fs/orangefs/inode.c b/fs/orangefs/inode.c
> > > > index 48f0547d4850..682a968cb82a 100644
> > > > --- a/fs/orangefs/inode.c
> > > > +++ b/fs/orangefs/inode.c
> > > > @@ -244,6 +244,25 @@ static int orangefs_writepages(struct
> > > > address_space *mapping,
> > > >
> > > >  static int orangefs_launder_page(struct page *);
> > > >
> > > > +/*
> > > > + * Prefill the page cache with some pages that we're probably
> > > > + * about to need...
> > > > + */
> > > > +static void orangefs_readahead(struct readahead_control *rac)
> > > > +{
> > > > +       pgoff_t index = readahead_index(rac);
> > > > +       struct page *page;
> > > > +
> > > > +       while ((page = readahead_page(rac))) {
> > > > +               prefetchw(&page->flags);
> > > > +               put_page(page);
> > > > +               unlock_page(page);
> > > > +               index++;
> > > > +       }
> > > > +
> > > > +       return;
> > > > +}
> > >
> > > This is not the way to do it.  You need to actually kick off readahead in
> > > this routine so that you get pipelining (ie the app is working on pages
> > > 0-15 at the same time the server is getting you pages 16-31).  I don't
> > > see much support in orangefs for doing async operations; everything
> > > seems to be modelled on "submit an I/O and wait for it to complete".
> > >
> > > I'm happy to help out with pagecache interactions, but adding async
> > > support to orangefs is a little bigger task than I'm willing to put
> > > significant effort into right now.



[Index of Archives]     [Linux Ext4 Filesystem]     [Union Filesystem]     [Filesystem Testing]     [Ceph Users]     [Ecryptfs]     [AutoFS]     [Kernel Newbies]     [Share Photos]     [Security]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux Cachefs]     [Reiser Filesystem]     [Linux RAID]     [Samba]     [Device Mapper]     [CEPH Development]

  Powered by Linux