Hi Christoph,

On Wed, Mar 17, 2010 at 02:05:53PM -0500, Christoph Lameter wrote:
> I am still opposed to this. The patchset results in compound pages being
> managed in 4k segments. The approach so far was that a compound page is
> simply a page struct referring to a larger linear memory segment. The
> compound state is exclusively modified in the first page struct, which
> allows an easy conversion of code to deal with compound pages since the
> concept of handling a single page struct is preserved. The main
> difference between the handling of a 4K page and a compound page's page
> struct is that the compound flag is set.
>
> Here compound pages have refcounts in each 4k segment. Critical VM paths
> can no longer rely on the page to stay intact since there is this on the
> fly conversion. The on the fly "atomic" conversion requires various forms
> of synchronization and modifications to basic VM primitives like pte
> management and page refcounting.

There is no change at all to pte management related to this. Please stick
to the facts. What you're commenting on is the "compound_lock" and a
change to only two functions: get_page and put_page (put_page has to
serialize against split_huge_page_refcount, and it uses the
"compound_lock" per-head-page bit spinlock to achieve that). The change
only applies to tail pages, adds no overhead to the fast path for
non-compound pages, and changes nothing for PageHead pages either.

The only things that try to run get_page or put_page on a tail page are
gup and gup_fast. Nothing else. And it's mostly needed to avoid altering
the gup/gup_fast API while still preventing hugepages from being split
across GUP. The reason I went to this extra length to stay backwards
compatible with gup/gup_fast is to keep the patchset from escalating from
<40 patches to maybe >100 or more. It's troublesome enough to merge this
right now in its non-intrusive, fully backwards compatible <40-patch
form; imagine how I could merge it if I went ahead and broke everything
in gup.

> I would recommend that the conversion between 2M and 4K page work with
> proper synchronization with all those handling references to the page.

It's not doable without breaking every gup user, because the gup API
allows them to run put_page on the tail page without adding any further
synchronization at all! So the only way to avoid breaking every driver
out there and most of the kernel core is to have the "proper
synchronization with all those handling references to the page" _inside_
put_page itself, and this is what compound_lock achieves in a fully
scalable way by implementing it as a bit spinlock on the PageHead.

> Codepaths handling huge pages should not rely on on the fly conversion
> but properly handle the various sizes. In most cases size does not
> matter since the page state is contained in a single page struct
> regardless of size. This patch here will cause future difficulties in
> making code handle compound pages.

This is definitely the long term plan, and in fact if we eventually
manage to remove split_huge_page _completely_, we can remove the
compound_lock and return to the current get_page/put_page implementation
of compound pages. The only point of compound_lock is to serialize
put_page against split_huge_page_refcount, so if the latter disappears,
the compound_lock will disappear too and we won't have to take refcounts
on tail pages in get_page either (which, as said above, is already
zero-cost for all regular pages and can't affect the fast path at all).
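To make the locking concrete, here is a minimal user-space model of the
idea (C11 atomics, simplified stand-in types; not the actual kernel code,
and the refcounting needed to free the compound page itself is omitted):
only the tail-page branch of put_page takes the head page's compound_lock
bit spinlock, so it can't race with split_huge_page_refcount, which
transfers the gup references back onto the 4k pages while holding that
same lock.

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdio.h>

    #define PG_COMPOUND_LOCK 0       /* flag bit used as the head-page spinlock */

    struct page {
        atomic_int count;            /* page refcount */
        atomic_ulong flags;          /* holds the compound_lock bit */
        struct page *first_page;     /* tail pages: pointer to the head page */
        bool tail;                   /* stand-in for PageTail() */
    };

    static void compound_lock(struct page *head)
    {
        /* spin until we are the ones who set the lock bit */
        while (atomic_fetch_or(&head->flags, 1UL << PG_COMPOUND_LOCK) &
               (1UL << PG_COMPOUND_LOCK))
            ;                        /* the real code would cpu_relax() here */
    }

    static void compound_unlock(struct page *head)
    {
        atomic_fetch_and(&head->flags, ~(1UL << PG_COMPOUND_LOCK));
    }

    static void put_page(struct page *page)
    {
        if (page->tail) {
            struct page *head = page->first_page;

            /*
             * Serialize against split_huge_page_refcount(), which moves
             * the gup references back onto the 4k pages while holding
             * this same per-head-page lock.  The real code also re-checks
             * PageTail here in case the split won the race.
             */
            compound_lock(head);
            atomic_fetch_sub(&page->count, 1);
            atomic_fetch_sub(&head->count, 1);
            compound_unlock(head);
            return;
        }

        /* unchanged fast path: regular 4k pages and head pages, no lock */
        if (atomic_fetch_sub(&page->count, 1) == 1)
            printf("freeing page %p\n", (void *)page);
    }

    int main(void)
    {
        struct page head = { .count = 2, .tail = false };
        struct page tail = { .count = 1, .first_page = &head, .tail = true };

        put_page(&tail);             /* gup-style release of a pinned tail page */
        printf("head refcount is now %d\n", atomic_load(&head.count));
        return 0;
    }

Nothing outside the tail-page branch changes, which is why regular pages
and head pages keep exactly the same fast path they have today.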
But this suggestion of yours misses the whole point of split_huge_page,
which is to avoid sending a patchset that is hugely bigger in size and
much harder to audit and merge without risk than the one I sent. My
approach here is to allow incremental support for hugepages, just like
when SMP was first introduced and many paths started to call
lock_kernel() instead of being properly SMP threaded with spinlocks. I
know you disagree with this comparison, but personally I think it's the
perfect comparison, it's an _identical_ scenario, and I think it's very
appropriate to repeat it here.

> Transparent huge page support better be introduced gradually starting
> f.e. with the support of 2M pages for anonymous pages.

"Introduced gradually starting with the support of 2M pages for anonymous
pages" is exactly what my patch does. What you're suggesting in the
previous part of the email is the opposite! Handling hugepages natively
everywhere and removing both compound_lock and split_huge_page would
require the swapcache to handle 2M pages. The swapcache shares 100% of
the pagecache code, so the pagecache would then need to handle 2M pages
natively too. Hence the moment we remove split_huge_page and require the
swapcache to handle 2M pages natively, without splitting the hugepage
first like my patch does (see the sketch at the end of this mail), the
whole thing escalates way beyond the anonymous pages you seem to agree
are a good place to start.

Last but not least, the design you advocate, which will escalate and
spread all over the VM (as if the 2.0 kernel had missed lock_kernel()),
will perform exactly the same for KVM. We already get 100% of the benefit
with the first, mostly self-contained <40 patches, and as I showed in the
benchmark I posted a few days ago, it's only NPT/EPT that absolutely
requires hugepages; the host gets a speedup too of course, but one orders
of magnitude smaller than virtualization gets. And things like
transparent hugepages are really one of the points where the KVM design
shines, boosting host and guest cumulatively and avoiding reinventing a
fairly complex wheel.

I think it's a no-brainer (and you also obviously agree above) that we
need to "introduce transparent hugepages gradually by starting with
anonymous memory", so it's hard to see how anyone could prefer something
that escalates non-gradually over the whole VM to what I proposed, which
remains self-contained and allows gradual extension of the VM's awareness
of hugepages, prioritizing what is most important to achieve first for
applications.

Thanks a lot for the review and I welcome different opinions of course.

Andrea
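And here is the sketch mentioned above: a minimal user-space illustration
of the "split first, then reuse the existing 4k code" pattern the
patchset relies on. The struct and the add_to_swap_cache /
add_4k_page_to_swap_cache helpers are hypothetical stand-ins, not the
real swapcache API; only the split_huge_page name matches the patchset.

    #include <stdbool.h>
    #include <stdio.h>

    #define HPAGE_NR_PAGES 512          /* 2M / 4k */

    struct page {
        bool compound;                  /* stand-in for PageCompound() */
    };

    /* Hypothetical: turn one 2M page back into 512 independent 4k pages. */
    static void split_huge_page(struct page *head)
    {
        head->compound = false;
        printf("split 2M page %p into %d 4k pages\n",
               (void *)head, HPAGE_NR_PAGES);
    }

    /* Existing 4k-only swapcache logic, left completely untouched. */
    static int add_4k_page_to_swap_cache(struct page *page)
    {
        printf("added 4k page %p to the swap cache\n", (void *)page);
        return 0;
    }

    /*
     * The only change a hugepage-unaware path needs: split on entry and
     * fall back to the proven 4k code.  If the swapcache ever learns to
     * handle 2M pages natively, this call (and eventually split_huge_page
     * itself) can simply be deleted.
     */
    static int add_to_swap_cache(struct page *page)
    {
        if (page->compound)
            split_huge_page(page);
        return add_4k_page_to_swap_cache(page);
    }

    int main(void)
    {
        struct page huge = { .compound = true };
        return add_to_swap_cache(&huge);
    }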