On Tue, Mar 01, 2016 at 11:00:55AM +0000, Mel Gorman wrote:
> On Tue, Mar 01, 2016 at 11:25:41AM +0100, Jan Kara wrote:
> > Hi,
> >
> > On Tue 01-03-16 02:09:11, Matthew Wilcox wrote:
> > > There are a few issues around 1GB THP support that I've come up against
> > > while working on DAX support that I think may be interesting to discuss
> > > in person.
> > >
> > > - Do we want to add support for 1GB THP for anonymous pages? DAX support
> > > is driving the initial 1GB THP support, but would anonymous VMAs also
> > > benefit from 1GB support? I'm not volunteering to do this work, but
> > > it might make an interesting conversation if we can identify some users
> > > who think performance would be better if they had 1GB THP support.
> >
> > Some time ago I was thinking about 1GB THP and I was wondering: What is
> > the motivation for 1GB pages for persistent memory? Is it the savings in
> > memory used for page tables? Or is it about the cost of fault?
>
> If anything, the cost of the fault is going to suck as a 1G allocation
> and zeroing is required even if the application only needs 4K. It's by
> no means a universal win. The savings are in page table usage, TLB miss
> cost and TLB footprint. For anonymous memory, it's not considered to be
> worth it because the cost of allocating the page is so high even if it
> works. There is no guarantee it'll work as fragmentation avoidance only
> works on the 2M boundary.
>
> It's worse when files are involved because there is a
> write-multiplication effect when huge pages are used. Specifically, a
> fault incurs 1G of IO even if only 4K is required and then dirty
> information is only tracked at huge page granularity. This increased
> IO can offset any TLB-related benefit.

It was pointed out to me privately that the IO amplification cost is not
the same for persistent memory as it is for traditional storage, and this
is true. For example, the 1G of data does not have to be read on every
fault. The write problems are mitigated but remain if, for example, the
1G block has to be zeroed. Even for normal writeback the cache lines have
to be flushed, as the kernel does not know which lines were updated. I
know there is a proposal to defer that tracking to userspace, but that
breaks if an unaware process accesses the page and is overall very risky.

There are other issues, such as having to reserve a 1G block in case the
file is truncated in the future, or else accept an extremely large amount
of wastage. Maybe it can be worked around, but a workload that uses
persistent memory with many small files may have a bad day.

While I know some of these points can be countered and discussed further,
at the end of the day the benefits of huge page usage are reduced memory
usage for page tables, a reduction in TLB pressure and reduced TLB fill
costs. Until it is known that there are realistic workloads that cannot
fit in memory due to page table usage, or that are limited by TLB
pressure, the complexity of huge pages is unjustified and the focus
should be on getting the basic features working correctly.

If the fault overhead of a 4K page is a major concern then fault-around
should be used, on the 2M boundary at least. I expect there are
relatively few real workloads that are limited by the cost of major
faults. Applications may have a higher startup cost than desirable, but
in itself that does not justify using huge pages to work around problems
with fault speeds in the kernel.
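To put very rough numbers on the page table saving: below is a minimal
back-of-the-envelope sketch, assuming x86-64's standard 4-level layout
and a single, unshared mapping of an arbitrary 1TB region; the figures
are illustrative, not measurements. A 4K PTE page maps 2M, a 4K PMD page
maps 1G and a 4K PUD page maps 512G, so the per-mapping cost works out
to roughly 2G of page tables with 4K pages, about 4M with 2M pages and a
few KB with 1G pages.

/*
 * Back-of-the-envelope page table cost of mapping a region on x86-64,
 * assuming the usual 4-level layout: a 4K PTE page maps 2M, a 4K PMD
 * page maps 1G, a 4K PUD page maps 512G.  Illustrative only.
 */
#include <stdio.h>

#define KB 1024ULL
#define MB (1024 * KB)
#define GB (1024 * MB)
#define TB (1024 * GB)

static unsigned long long div_round_up(unsigned long long n,
				       unsigned long long d)
{
	return (n + d - 1) / d;
}

int main(void)
{
	unsigned long long region = 1 * TB;	/* size of the mapped region */

	/* 4K base pages need the PTE, PMD and PUD levels below the PGD. */
	unsigned long long pte = div_round_up(region, 2 * MB);
	unsigned long long pmd = div_round_up(region, 1 * GB);
	unsigned long long pud = div_round_up(region, 512 * GB);

	unsigned long long cost_4k = (pte + pmd + pud) * 4 * KB;
	/* 2M pages drop the PTE level entirely. */
	unsigned long long cost_2m = (pmd + pud) * 4 * KB;
	/* 1G pages leave only PUD pages below the PGD. */
	unsigned long long cost_1g = pud * 4 * KB;

	printf("%lluG mapped with 4K pages: ~%lluM of page tables\n",
	       region / GB, cost_4k / MB);
	printf("%lluG mapped with 2M pages: ~%lluK of page tables\n",
	       region / GB, cost_2m / KB);
	printf("%lluG mapped with 1G pages: ~%lluK of page tables\n",
	       region / GB, cost_1g / KB);
	return 0;
}

None of which changes the point above: the saving only matters if a
workload is actually limited by page table memory or TLB reach.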
--
Mel Gorman
SUSE Labs