Re: [RFC]numa: improve I/O performance by optimizing numa interleave allocation

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Mon, 2011-11-21 at 09:39 +0800, Shaohua Li wrote:
> On Sat, 2011-11-19 at 01:30 +0800, Andi Kleen wrote:
> > On Fri, Nov 18, 2011 at 03:12:12PM +0800, Shaohua Li wrote:
> > > If mem plicy is interleaves, we will allocated pages from nodes in a round
> > > robin way. This surely can do interleave fairly, but not optimal.
> > > 
> > > Say the pages will be used for I/O later. Interleave allocation for two pages
> > > are allocated from two nodes, so the pages are not physically continuous. Later
> > > each page needs one segment for DMA scatter-gathering. But maxium hardware
> > > segment number is limited. The non-continuous pages will use up maxium
> > > hardware segment number soon and we can't merge I/O to bigger DMA. Allocating
> > > pages from one node hasn't such issue. The memory allocator pcp list makes
> > > we can get physically continuous pages in several alloc quite likely.
> > 
> > FWIW it depends a lot on the IO hardware if the SG limitation
> > really makes a measurable difference for IO performance. I saw some wins from 
> > clustering using the IOMMU before, but that was a long time ago. I wouldn't 
> > consider it a truth without strong numbers, and then also only
> > for that particular device measured.
> > 
> > My understanding is that modern IO devices like NHM Express will
> > be faster at large SG lists.
> This is a LSI SAS1068E HBA card attaching some hard disks. The
> clustering has real benefit here. I/O throughput increases 3% or so.
> Not sure about NHM Express, wondering why large SG list could be faster.
> doesn't large SG means large DMA descriptor?
> 
> > > So can we make both interleave fairness and continuous allocation happy?
> > > Simplily we can adjust the round robin algorithm. We switch to another node
> > > after several (N) allocation happens. If N isn't too big, we can still get
> > > fair allocation. And we get N continuous pages. I use N=8 in below patch.
> > > I thought 8 isn't too big for modern NUMA machine. Applications which use
> > > interleave are unlikely run short time, so I thought fairness still works.
> > 
> > It depends a lot on the CPU access pattern.
> > 
> > Some workloads seem to do reasonable well with 2MB huge page interleaving.
> > But others actually prefer the cache line interleaving supplied by 
> > the BIOS.
> > 
> > So you can have a trade off between IO and CPU performance.
> > When in doubt I usually opt for CPU performance by default.
> Can you elaborate this more? the cache line interleaving can only be
> supplied by BIOS. OS can provide N*PAGE_SIZE interleave. I'm wondering
> what's the difference for example a 4k or 8k interleave for CPU
> performance. Actually if adjacent pages interleaved in two nodes could
> be in the same coloring, while two adjacent pages allocated from one
> node not. So clustering could be more cache efficient from coloring
> point of view.
> 
> > I definitely wouldn't make it default, but if there are workloads
> > that benefits a lot it could be an additional parameter to the
> > interleave policy.
> Christoph suggested the same way. the problem is we need change the API,
> right? And how are users supposed to use it? It would be difficult to
> determine the correct parameter.
> 
> If 8 pages clustering is too big, maybe we can use small. I guess a 2
> pages clustering is a big win too.
> 
> And I didn't change the allocation with a VMA case, which is supposed to
> be used for anonymous pages.
I tried a 2 pages clustering, it has the same effect like 8 page
clustering in my test environment.
would making the clustering a config option or sysctl be better?

Thanks,
Shaohua


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>


[Index of Archives]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Bugtraq]     [Linux]     [Linux OMAP]     [Linux MIPS]     [ECOS]     [Asterisk Internet PBX]     [Linux API]