On Thu, Aug 15, 2013 at 12:13:09PM -0600, Khalid Aziz wrote:
> I am working with a tool that simulates oracle database I/O workload.
> This tool (orion to be specific -
> <http://docs.oracle.com/cd/E11882_01/server.112/e16638/iodesign.htm#autoId24>)
> allocates hugetlbfs pages using shmget() with SHM_HUGETLB flag. It then
> does aio into these pages from flash disks using various common block
> sizes used by database. I am looking at performance with two of the most
> common block sizes - 1M and 64K. aio performance with these two block
> sizes plunged after Transparent HugePages was introduced in the kernel.
> Here are performance numbers:
>
>             pre-THP      2.6.39       3.11-rc5
> 1M read     8384 MB/s    5629 MB/s    6501 MB/s
> 64K read    7867 MB/s    4576 MB/s    4251 MB/s
>
> I have narrowed the performance impact down to the overheads introduced
> by THP in __get_page_tail() and put_compound_page() routines. perf top
> shows >40% of cycles being spent in these two routines. Every time
> direct I/O to hugetlbfs pages starts, kernel calls get_page() to grab a
> reference to the pages and calls put_page() when I/O completes to put
> the reference away. THP introduced significant amount of locking
> overhead to get_page() and put_page() when dealing with compound pages
> because hugepages can be split underneath get_page() and put_page(). It
> added this overhead irrespective of whether it is dealing with hugetlbfs
> pages or transparent hugepages. This resulted in 20%-45% drop in aio
> performance when using hugetlbfs pages.
>
> Since hugetlbfs pages can not be split, there is no reason to go through
> all the locking overhead for these pages from what I can see. I added
> code to __get_page_tail() and put_compound_page() to bypass all the
> locking code when working with hugetlbfs pages. This improved
> performance significantly. Performance numbers with this patch:
>
>             pre-THP      3.11-rc5     3.11-rc5 + Patch
> 1M read     8384 MB/s    6501 MB/s    8371 MB/s
> 64K read    7867 MB/s    4251 MB/s    6510 MB/s
>
> Performance with 64K read is still lower than what it was before THP,
> but still a 53% improvement. It does mean there is more work to be done
> but I will take a 53% improvement for now.
>
> Please take a look at the following patch and let me know if it looks
> reasonable.
>
>
> Signed-off-by: Khalid Aziz <khalid.aziz@xxxxxxxxxx>
> ---
>  mm/swap.c | 77 +++++++++++++++++++++++++++++++++++++++++--------------------
>  1 file changed, 52 insertions(+), 25 deletions(-)
>
> diff --git a/mm/swap.c b/mm/swap.c
> index 62b78a6..cc8326f 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -31,6 +31,7 @@
>  #include <linux/memcontrol.h>
>  #include <linux/gfp.h>
>  #include <linux/uio.h>
> +#include <linux/hugetlb.h>
>
>  #include "internal.h"
>
> @@ -81,6 +82,19 @@ static void __put_compound_page(struct page *page)
>
>  static void put_compound_page(struct page *page)
>  {
> +	/*
> +	 * hugetlbfs pages can not be split from under us. If this
> +	 * is a hugetlbfs page, check refcount on head page and release
> +	 * the page if refcount is zero.
> +	 */
> +	if (PageHuge(page)) {
> +		page = compound_head(page);
> +		if (put_page_testzero(page))
> +			__put_compound_page(page);
> +
> +		return;
> +	}
> +
>  	if (unlikely(PageTail(page))) {
>  		/* __split_huge_page_refcount can run under us */
>  		struct page *page_head = compound_trans_head(page);
> @@ -184,38 +198,51 @@ bool __get_page_tail(struct page *page)
>  	 * proper PT lock that already serializes against
>  	 * split_huge_page().
>  	 */
> -	unsigned long flags;
>  	bool got = false;
> -	struct page *page_head = compound_trans_head(page);
> +	struct page *page_head;
>
> -	if (likely(page != page_head && get_page_unless_zero(page_head))) {
> +	/*
> +	 * If this is a hugetlbfs page, it can not be split under
> +	 * us. Simply increment refcount for head page
> +	 */
> +	if (PageHuge(page)) {
> +		page_head = compound_head(page);
> +		atomic_inc(&page_head->_count);
> +		got = true;

Why not just return here and avoid increasing the indentation level for
the rest of the function?

-- 
 Kirill A. Shutemov
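
To make the suggestion concrete, here is a rough user-space sketch of the
early-return shape. It is not kernel code and not the patch itself: struct
toy_page and every toy_* helper are hypothetical stand-ins invented for
illustration (for struct page, PageHuge(), compound_head() and
get_page_unless_zero()), and the compound-lock handling of the real slow
path is left out entirely.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page; not the kernel's layout. */
struct toy_page {
	atomic_int count;	/* models page->_count */
	bool is_hugetlbfs;	/* models PageHuge(page) */
	struct toy_page *head;	/* head of the compound page, NULL if this is the head */
};

/* Models compound_head(): a head page is its own head. */
static struct toy_page *toy_compound_head(struct toy_page *page)
{
	return page->head ? page->head : page;
}

/* Models get_page_unless_zero(): take a reference only if one is already held. */
static bool toy_get_unless_zero(struct toy_page *page)
{
	int old = atomic_load(&page->count);

	while (old != 0) {
		/* On failure, old is reloaded with the current value and we retry. */
		if (atomic_compare_exchange_weak(&page->count, &old, old + 1))
			return true;
	}
	return false;
}

static bool toy_get_page_tail(struct toy_page *page)
{
	struct toy_page *page_head = toy_compound_head(page);

	/*
	 * hugetlbfs pages cannot be split under us: bump the head
	 * refcount and return right away, so the code below keeps
	 * its original indentation level.
	 */
	if (page->is_hugetlbfs) {
		atomic_fetch_add(&page_head->count, 1);
		return true;
	}

	/* Slow path for splittable pages; the locking is omitted in this sketch. */
	if (page != page_head && toy_get_unless_zero(page_head))
		return true;

	return false;
}
```

With the early return, the "got" flag is not needed on the hugetlbfs path
and the existing THP branch stays at its current nesting depth.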