On Sat, 12 Mar 2016, Kirill A. Shutemov wrote: > Here is an updated version of huge pages support implementation in > tmpfs/shmem. > > All known issues has been fixed. I'll continue with validation. > I will also send follow up patch with documentation update. > > Hugh, I would be glad to hear your opinion on this patchset. At long last I've managed to spend some time getting into it. My opinion is simple: generally I liked your patches, but I'm disappointed by where they leave tmpfs functionally, so wouldn't want the patchset to go in as is. As soon as I ran it up and tried to copy a tree in there for testing, my (huge) tmpfs filled up with all these 2MB files: which is just the same as when you started out two or three years ago on ramfs. As I've said on several occasions, I am not interested in emulating the limitations of hugetlbfs inside tmpfs: there might one day be a case for such a project, but it's transparent hugepages that we want to have now - blocksize 4kB, but in 2MB extents when possible (on x86_64). I accept that supporting small files correctly is not the #1 requirement for a "huge" tmpfs; and if it were impossible or unreasonable to ask for, I'd give up on it. But since we (Google) have been using a successful implementation for a year now, I see no reason to give up on it at all: it allows for much wider and easier adoption (and testing) of huge tmpfs - without "but"s. The small files thing formed my first impression. My second impression was similar, when I tried mmap(NULL, size_of_RAM, PROT_READ|PROT_WRITE, MAP_ANONYMOUS|MAP_SHARED, -1, 0) and cycled around the arena touching all the pages (which of course has to push a little into swap): that soon OOMed. But there I think you probably just have some minor bug to be fixed: I spent a little while trying to debug it, but then decided I'd better get back to writing to you. I didn't really understand what I was seeing, but when I hacked some stats into shrink_page_list(), converting !is_page_cache_freeable(page) to page_cache_references(page) to return the difference instead of the bool, a large proportion of huge tmpfs pages seemed to have count 1 too high to be freeable at that point (and one huge tmpfs page had a count of 3477). Hmm, I just went to repeat that test, to see if MemFree goes down and down on each run, but got a VM_BUG_ON_PAGE(page_ref_count(page) == 0) in put_page_testzero() in find_get_entries() during shmem_evict_inode(). So, something not quite right with the refcounting when under pressure. My third impression was much better, when I just did a straight cp /dev/zero /tmpfs_twice_size_of_RAM: that went smoothly, pushed into swap just as it should, nice. > > The main difference with Hugh's approach[1] is that I continue with > compound pages, where Hugh invents new way couple pages: team pages. > I believe THP refcounting rework made team pages unnecessary: compound > page are flexible enough to serve needs of page cache. I have disagreed with you about the suitability of compound pages; but at some point on Sunday night, after reading your patchset, found myself looking at it differently, and now agreeing with you. They are not flexible enough yet, but I believe you have done a large part of the work (by diverting all those tail ops to the head), and it just needs a little more. I say "a little", but it might not be easy. What I realize now, is that you should be able to do almost everything I did with team pages, instead with your compound pages. Just abandon the idea that the whole compound page has to be initialized in one go: initialize (charge to memcg, add to page cache, fill, SetPageUptodate) each subpage as it is touched. (But of course, the whole extent has to be to be initialized before it can be pmd-mapped.) And the convenient thing about a compound page is that you have plenty of space to hold your metadata: where I have been squeezing everything into page->private (which makes it difficult then to extend the same design to miscellaneous filesystems), you're free to use (certain) fields of 511 more struct pages. I said above that you should be able to do almost everything: I think that "almost" involves dropping a few things that I have, that we can well do without. With team pages, it is possible for each member to be charged to a different memcg; but I don't have anyone calling out for that behaviour, it's more an awkward possibility that we have to be on guard against in a few places, and I'd be quite happy just to stop it (charging instead to whichever memcg first touched the extent). And it's been nice to entertain the idea of the individual team tails, travelling separately up and down their own lrus, subject to their own pagesize access patterns even while the whole is pmd-mapped. But in practice, we continue to throw them on the unevictable list once they get in the way, with no attempt to bring them back when accessed individually; so I think it's a nice idea that we can live without. And a compound page implementation would avoid the overhead I have, of putting all those tails on lru in the first place. Another thing I did, that I think you can safely fail to emulate: I was careful to allow sparse population, and not insist that the head page be instantiated. That might turn out to be more of a problem for compound pages, and has been an awkward edge case for me. Although in principle there are sparse files which would be doubled in charge by insisting that the head be instantiated, I doubt they will hurt anyone in practice: perhaps best treated as a bug to be revisited if it actually comes up as an issue later. (Thinking about that last paragraph later, I'm not so sure. Forcing the instantiation of the head is okay when first allocating, but when trying to recover a fragmented extent into a hugepage later, it might prove to be seriously limiting to require that the head be migrated into place before anything else is done.) > > Many ideas and some patches were stolen from Hugh's patchset. Having this > patchset around was very helpful. People are probably wondering why I didn't repost it, and whether it's vapourware. No, it's worked out very well, better than we were expecting; and several of the TODOs that I felt needed attention, have not yet caused anyone trouble - though wider adoption is sure to meet new usecases and new bugs. I didn't repost it because, as I said at the time, I felt it wasn't ready for real use until recovery of huge pages after fragmentation and swapping (what khugepaged does for anon THP) was added. After that went in (and then quite a period of fixing the bugs, I admit!), other priorities intervened until now. Your patchset also lacks recovery at present, so equally that's a reason why I don't consider yours ready for merging. I'll repost mine, including recovery and a few minor additions, against v4.6-rc2. I still haven't got around to doing any NUMA work on it, but it does appear to be useful to many people even without that. I'd have preferred to spend longer rebasing it, it may turn out to be a little raw: most of the rebases from one release to another have been easy, but to v4.5 was messier - not particularly due to your THP refcounting work going in, just various different conflicts that I feel I've not properly digested yet. But v4.6-rc2 looks about as early and as late as I can do it, with LSF/MM coming up later in April; and should be a good base for people to try it out on. (I'll work on preparing it in advance, then I guess a last-minute edit of PAGE_CACHEs before posting.) I now believe that the right way forward is to get that team page implementation in as soon as we can, so users have the functionality; and then (if you're still interested) you work on converting it over from teams to compound pages, both as an optimization, and to satisfy your understandable preference for a single way of managing huge pages. (I would say "How does that sound to you?" but I'd prefer you to think it over, maybe wait to see what the new team patchset looks like. Though aside from the addition of recovery, it is very much like the one against v3.19 from February last year: no changes to the design at all, which turned out to fit recovery well; but quite a few bug fixes folded back in. Or you may prefer to wait to discuss the way forward at LSF/MM.) > > I will continue with code validation. I would expect mlock require some > more attention. > > Please, review and test the code. > > Git tree: > > git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git hugetmpfs/v4 > > Rebased onto linux-next: > > git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git hugetmpfs/v4-next It wasn't obvious to me what tree your posted patchset was based on, so I made a patch out of the top your hugetmpfs/v4-next branch, and applied that on top of Andrew's 4.5-rc7-mm1: which appeared to be fine, just a trivial clash with mmotm's "do_shared_fault(): check that mmap_sem is held". [...] Some random thoughts on your patches FWIW: 01/25 mm: do not pass mm_struct into handle_mm_fault 02/25 mm: introduce fault_env These two inflate your patchset considerably, with seemingly unrelated changes: the first with arch files otherwise untouched, the second with lots of change to memory.c that makes it harder to compare our patchsets. I think it would be better to leave the handle_mm_fault(mm) removal to another patchset, and I'm ambivalent about the struct fault_env changes: I prefer to avoid such argument blocks (and think I saw Linus once say he detested my zap_details), which do make it harder to see what the real arguments are. I wonder what your 03/25 would look like without the fault_env, perhaps I'd agree it's too cumbersome that way (and perhaps I should look back at earlier versions of your patchset). But I do understand your desire to get some cleanups out of the way before basing the real work on the new structure: I'm going to have to reconsider the initial patches of my series in the light of this criticism of yours, not sure yet if I'll try to cut them out or not. 03/25 mm: postpone page table allocation until we have page to map I've not stopped to consider whether your prealloc_pte work (which perhaps was the trigger for fault_env) meets the case or not, but want to mention that we recently discovered a drawback to my delaying the pagetable allocation: I'm allocating the pagetable while holding the lock on the page from the filesystem (any filesystem, though the change was to suit tmpfs), which presented a deadlock to the OOM killer. The base kernel involved was not a recent one, and there have been rewrites in the OOMing area since, so I'm not sure whether it would be a deadlock today: but I am sure that it would be better to do the allocation while not holding the page lock. We have a fix in place, but I don't think it's quite the right fix, so I want to revisit it: unless I get a gift of time from somewhere, I'll post my series with the ordering here just as I had it before. You remark on yours: "This way we can postpone allocation even in faultaround case without moving do_fault_around() after __do_fault()." Was there something wrong or inefficient with my moving do_fault_around() after __do_fault()? I thought it was fine; but I did also think when I made the change, that it would be fixing a bug in VM_FAULT_MAJOR acccounting too; yet when I went to confirm that, couldn't find evidence for such a bug before; but I moved on before getting to the bottom of it. 04/25 rmap: support file thp 05/25 mm: introduce do_set_pmd() 06/25 mm, rmap: account file thp pages A firm NAK on this one, for a simple reason: we have enough times regretted mixing up the swap-backed Shmem into File stats; so although you intend to support huge file pages by the same mechanism later, the ones you support in this patchset are Shmem: ShmemPmdMapped is the name I was using. 07/25 thp, vmstats: add counters for huge file pages 08/25 thp: support file pages in zap_huge_pmd() I see there's an ongoing discussion with Aneesh about whether you need to deposit/withdraw pagetables. I blindly assumed that I would have to participate in that mechanism, so my series should be safe in that regard, though not as nice as yours. (But I did get the sequence needed for PowerPC wrong somewhere in my original posting.) 09/25 thp: handle file pages in split_huge_pmd() That's neat, just unmapping the pmd for the shared file/shmem case, I didn't think to do it that way. Or did I have some other reason for converting the pmd to ptes? I think it is incorrect not to provide ptes when splitting when it's an mlock()ed area; but you've said yourself that mlock probably needs more work. 10/25 thp: handle file COW faults We made the same decision (and Matthew for DAX too IIRC): don't bother to COW hugely, it's more code, and often not beneficial. 11/25 thp: handle file pages in mremap() Right, I missed that from my v3.19 rebase, but added it later when I woke up to your v3.15 mod: I prefer my version of this patch, which adds take_rmap_locks(vma) and drop_rmap_locks(vma) helpers. 12/25 thp: skip file huge pmd on copy_huge_pmd() Oh, that looks good, I'd forgotten about that optimization. And I think in this case you don't have to worry about VM_LOCKED or deposit/withdraw, it's just fine to skip over it. I think I'll post my one as it is, and let you scoff at it then. 13/25 thp: prepare change_huge_pmd() for file thp You're very dedicated to inserting the vma_is_anonymous(), where I just delete the BUG_ON. Good for you. 14/25 thp: run vma_adjust_trans_huge() outside i_mmap_rwsem Interesting. I haven't spent long enough looking at this, just noted it as something to come back to: I haven't decided whether this is a good safe obvious change (always better to do something outside the lock if it's safe), or not. I don't have that issue of munlock()ing when splitting, so hadn't noticed the peculiar placing of vma_adjust_trans_huge() inside the one lock but outside the other. 15/25 thp: file pages support for split_huge_page() I didn't go much beyond your very helpful commit comment on this one. I'd been worried by how you would freeze for the split, but it sounds right. Very interested to see your comment about dropping pages beyond i_size from the radix-tree: I think that's absolutely right, but never occurred to me to do - not something that tests or users notice, but needed for correctness. 16/25 thp, mlock: do not mlock PTE-mapped file huge pages I think we both wish there were no mlock(), it gave me more trouble than anything. And it wouldn't be a surprise if what I have and think works, actually turns out to be very broken. So I'm going to be careful about criticising your workarounds, they might be better. 17/25 vmscan: split file huge pages before paging them out As mentioned at the beginning, seemed to work well for the straight cp, but not when the huge pages were mapped. 18/25 page-flags: relax policy for PG_mappedtodisk and PG_reclaim 19/25 radix-tree: implement radix_tree_maybe_preload_order() 20/25 filemap: prepare find and delete operations for huge pages 21/25 truncate: handle file thp 22/25 shmem: prepare huge= mount option and sysfs knob 23/25 shmem: get_unmapped_area align huge page 24/25 shmem: add huge pages support 25/25 shmem, thp: respect MADV_{NO,}HUGEPAGE for file mappings Not much to say on these at the moment, but clearly we'll need to settle the user interfaces of 22/25 and 25/25 before anything reaches Linus. For simplicity, I've carried on with my distinct mount options and knobs, but agree that some rapprochement with anon THP conventions will probably be required before finalizing. That's all for now - thanks, Hugh -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html