On Thu, Sep 03, 2020 at 06:01:57PM +0100, Matthew Wilcox wrote:
> On Thu, Sep 03, 2020 at 01:50:51PM -0300, Jason Gunthorpe wrote:
> > At least from an RDMA NIC perspective I've heard from a lot of users
> > that higher order pages at the DMA level are giving big speed ups too.
> >
> > It is basically the same dynamic as CPU TLB, except missing a 'TLB'
> > cache in a PCI-E device is dramatically more expensive to refill. With
> > 200G and soon 400G networking these misses are a growing problem.
> >
> > With HPC nodes now pushing 1TB of actual physical RAM and single
> > applications basically using all of it, there is definitely some
> > meaningful return - if pages can be reliably available.
> >
> > At least for HPC where the node returns to an idle state after each
> > job and most of the 1TB memory becomes freed up again, it seems more
> > believable to me that a large cache of 1G pages could be available?
>
> You may be interested in trying out my current THP patchset:
>
> http://git.infradead.org/users/willy/pagecache.git
>
> It doesn't allocate pages larger than PMD size, but it does allocate pages
> *up to* PMD size for the page cache, which means that larger pages are
> easier to create as larger pages aren't fragmented all over the system.

Yeah, I saw that; it looks like a great direction.

> If someone wants to opportunistically allocate pages larger than PMD
> size, I've put some preliminary support in for that, but I've never
> tested any of it. That's not my goal at the moment.
>
> I'm not clear whether these HPC users primarily use page cache or
> anonymous memory (with O_DIRECT).

Probably a mixture. There are definitely HPC systems now that are
filesystem-less - they import data for computation from the network
using things like blob storage or some other kind of non-POSIX,
userspace-based data storage scheme.

Jason
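
As a rough illustration of what consuming "a large cache of 1G pages"
looks like from userspace (this sketch is not from the thread; the flags
are real Linux APIs, but the sizes and the ibv_reg_mr() mention are
assumptions about how such a job might use the buffer): an HPC
application can back its working set with 1G hugepages via
mmap(MAP_HUGETLB | MAP_HUGE_1GB), provided the administrator reserved a
pool of them, e.g. with hugepagesz=1G hugepages=N on the kernel command
line or via /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages.

/* Minimal sketch: map two 1G hugepages for an application buffer. */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MAP_HUGE_SHIFT
#define MAP_HUGE_SHIFT 26
#endif
#ifndef MAP_HUGE_1GB
#define MAP_HUGE_1GB (30 << MAP_HUGE_SHIFT)   /* log2(1G) encoded in flags */
#endif

int main(void)
{
	size_t len = 2UL << 30;   /* two 1G pages */

	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB,
			 -1, 0);
	if (buf == MAP_FAILED) {
		/* Fails if the 1G pool is empty or was never reserved. */
		perror("mmap(MAP_HUGETLB | MAP_HUGE_1GB)");
		return EXIT_FAILURE;
	}

	/* Touch the memory so the hugepages are actually faulted in. */
	memset(buf, 0, len);

	/* ... hand buf to the RDMA stack (e.g. ibv_reg_mr()) so a single
	 * NIC IOTLB entry can cover 1G instead of 4K ... */

	munmap(buf, len);
	return 0;
}

Whether that mmap() succeeds depends entirely on how many 1G pages are
still free in the reserved pool, which is exactly the "reliably
available" question above.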