Here are my notes from the LSF/MM 2016 MM track. I expect LWN.net to
have nicely readable articles on most of these discussions.

LSF/MM 2016 Memory Management track notes

Transparent Huge Pages

- Kirill & Hugh have different implementations of tmpfs transparent
  huge pages
- Kirill can split 4kB pages out of huge pages, to avoid splits
  (refcounting implementation, compound pages)
- Hugh's implementation: get it up and running quickly and
  unobtrusively (team pages)
- Kirill's implementation can dirty 4kB inside a huge page on write()
- Kirill wants to get huge pages in the page cache to work for ext4;
  this cannot be transparent to the filesystem
- Hugh: what about small files? huge pages would be wasted space
- Kirill: madvise/fadvise for THP, or a file-size-based policy; at
  write time allocate 4kB pages, and khugepaged can collapse them
- Andrea: what is the advantage of using huge pages for small files?
- Hugh: the 2MB initial allocation is shrinkable, not charged to memcg
- Kirill: for tmpfs, also need to check against the tmpfs filesystem
  size when deciding what page size to allocate
- Kirill: does not like how tmpfs is growing more and more special
  cases (radix tree exception entries, etc)
- Aneesh, Andrea: also not happy that the kernel would grow yet
  another kind of huge page
- Hugh: Kirill can probably use the same mlock logic my code uses
- Kirill: I do not mlock pages, just VMAs, and prevent pageout that way
- Hugh: Kirill has some stuff working better than I realized, maybe he
  can still use some of my code
- on splitting a huge pmd: Hugh has a split with ptes, Kirill just
  blows away the PMD and lets faults fill in the PTEs
- Hugh: what Kirill's code does is not quite correct for mlock
- Kirill: mlock does not guarantee lack of minor faults
- Aneesh: PPC64 needs deposited page tables; the hardware page table
  is hashed on the actual page size, so the huge page is only logical,
  not HW supported; the last-level page table stores slot/hash
  information
- Andrea: do not worry too much about memory consumption with THP; if
  worried, do small allocations and let khugepaged collapse them; use
  the same model for THP file cache as used for THP anonymous memory

Radix tree entries

- Andrea/Kirill/Hugh: no need to use special radix tree entries for
  huge pages in general; at hole punch time it could be useful later,
  as an optimization
- might want a way to mark 4kB pages inside a compound page dirty on
  the radix tree side (or use page flags on the tail page struct)
- Hugh: how about two radix trees? Everybody else: yuck :)
- Andrea: with the compound model, I see no benefit to multiple radix
  trees
- the first preparation series (by Hugh) already went upstream
- Kirill can use some of Hugh's code; DAX needs some of the same
  code, too
- Hugh: compound pages could be extended to offer his functionality;
  would like to integrate what he has, settle on sysfs/mount options
  before freezing, then add compound pages on top
- Hugh: current show stoppers with Kirill's code: small files, hole
  punching

khugepaged -> task_work

- advantage: concentrate THP on the tasks that use the most CPU and
  could benefit from them the most
- Hugh: having one single scanner/compactor might have advantages
- when to trigger scanning?
- Hugh: observe at page fault time?
- Vlastimil: if there are no faults because the memory is already
  present, there would be no observation event
- Johannes: wait for someone to free a THP? maybe background scanning
  is still best?

Merge plans

- Hugh would like to merge team pages now, and switch to compound
  pages later
- Kirill would like to get compound pages into shape first, then
  merge things
- Andrea: if we go with team pages, we should ensure it is the right
  solution for both anonymous memory and ext4
- Andrea: can we integrate the best parts of both code bases and
  merge that?
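As background for the THP policy points above (madvise-based opt-in, with khugepaged collapsing small pages later), a minimal userspace sketch of the existing madvise() interface; the helper name and structure are mine, not from the session:

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

#ifndef MADV_HUGEPAGE
#define MADV_HUGEPAGE 14   /* Linux value, in case the libc headers lack it */
#endif

/* Map an anonymous region and hint the kernel to back it with
 * transparent huge pages.  The hint can fail with EINVAL when THP is
 * compiled out or disabled, so it is deliberately best-effort here;
 * khugepaged may also collapse the region's 4kB pages later. */
int advise_thp(size_t len)
{
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return -1;

	/* Best-effort THP opt-in for this VMA. */
	(void)madvise(buf, len, MADV_HUGEPAGE);

	/* Touch the first byte so at least one page is populated. */
	*(volatile char *)buf = 1;

	munmap(buf, len);
	return 0;
}
```

The per-VMA opt-out counterpart is MADV_NOHUGEPAGE; both only set policy on the VMA, mirroring the point above that policy can live at the VMA level rather than per page.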
- Mel: one of my patch series is heavily colliding with team pages
  (moving accounting from zones to nodes)
- Andrew: we need a decision on team pages vs compound pages
- Hugh: if compound pages went in first, we would not replace them
  with team pages later - but the other way around might happen

Merge blockers

- compound pages issues: small-files memory waste, fast recovery for
  small files, get khugepaged into shape, maybe deposit/withdrawal,
  demonstrate recovery, demonstrate robustness (or Hugh demonstrates
  brokenness)
- team pages issues: recovery (khugepaged cannot collapse team
  pages), anonymous memory support (Hugh: pretty sure it is
  possible), API compatible to test compound, don't use
  page->private, path forward for other filesystems
- revert the team page patches from MMOTM until the blockers are
  addressed

GFP flags

- __GFP_REPEAT has fuzzy semantics: keep retrying until the
  allocation succeeds, meant for higher order allocations, but mostly
  used for order 0... (not useful there)
- it can be cleaned up to get a useful semantic for higher order
  allocations: "can fail, try hard to succeed, but could still fail
  in the end"
- __GFP_NORETRY - fail after a single attempt to reclaim something;
  not very helpful except for optimistic/opportunistic allocations
- maybe have __GFP_BEST_EFFORT, try until a certain point then give
  up? (retry until OOM, then fail?)
- remove __GFP_REPEAT from non-costly allocations; introduce a new
  flag and use it where useful
- can the allocator know compaction was deferred?
- more explicit flags? NORECLAIM, NOKSWAPD, NOCOMPACT, NO_OOM, etc;
  use explicit flags to switch stuff off
- clameter: have default definitions with all the "normal stuff"
  enabled
- the flags are inconsistent - sometimes positive, sometimes
  negative, sometimes for common things, sometimes for uncommon
  things
- THP allocation is not explicit, but inferred from certain flags
- consensus on cleaning up GFP usage

CMA

- KVM on PPC64 runs into strange hardware requirements: it needs
  contiguous memory for certain data structures
- tried to reduce fragmentation/allocation issues with ZONE_CMA
- atomic order-0 allocations fail early, due to kswapd not kicking in
  on time; taking pages out of the CMA zone first
- compaction does not move movable compound pages (eg. THP), breaking
  CMA in ZONE_CMA
- mlock and other things pinning allocated-as-movable pages also
  break CMA
- what to do instead of ZONE_CMA? how to keep things movable?
- sticky MIGRATE_MOVABLE zones? do not allow reclaimable & unmovable
  allocations in sticky MIGRATE_MOVABLE zones
- memory hotplug has similar requirements to CMA, so there is no need
  for a new name
- need something like physical memory linear reclaim: find sticky
  MIGRATE_MOVABLE zones and reclaim everything inside
- Mel: would like to see ZONE_CMA and ZONE_MOVABLE go away

FOLL_MIGRATE

- a get_user_pages flag to move pages away from a movable region when
  they are being pinned
- should be handled by core code, in get_user_pages

Compaction, higher order allocations

- compaction is not invoked from THP allocations with the delayed
  fragmentation patch set
- kcompactd daemon for background compaction
- should kcompactd do fast direct reclaim? let's see; cooperation
  with OOM
- hard to get useful feedback about compaction: "does random things,
  returns with random answer"
- no notion of "costly allocations": compaction can keep indefinitely
  deferring action, even for smaller allocations (eg. order 2)
- sometimes compaction finds too many page blocks with the skip bit
  set; the success rate of compaction skyrocketed with skip bits
  ignored (stale skip bits?)
- migrate skips over MIGRATE_UNMOVABLE page blocks found during
  order 9 compaction; such a page block may be perfectly suitable for
  smaller order compaction
- have THP skip more aggressively, while order 2 scans inside more
  page blocks
- priority for the compaction code? aggressiveness of diving into
  blocks vs skipping
- order 9 allocators: THP wants the allocation to fail quickly if no
  order 9 page is available; hugetlbfs really wants allocations to
  succeed

VM containers

- VMs imply more memory consumption than the applications that run in
  them need
- how to pressure the guest to give memory back to the host?
- adding a new shrinker did not seem to perform well
- move the page cache to the host, so it would be easier to reclaim
  memory for all guests
- move memory management from the guest kernel to the host, some kind
  of memory controller
- have the guest tell the host how to reclaim, sharing the LRU for
  instance
- mmu_notifier already shares some information with the access bit
  (young), but mmu_notifier is too coarse
- DAX (in the guest) should be fine to solve filesystem memory; if
  not DAX backed on the host, it needs a new mechanism for IO
  barriers, etc
- FUSE driver in the guest and move the filesystem to the host
- exchange memory pressure between guest and host, so the host can
  ask the guest to adjust its pressure depending on the overall
  situation of the host

Generic page-pool recycle facility

- found bottlenecks in both the page allocator and the DMA APIs
- "packet-page" / explicit data path API
- make it generic across multiple use cases; get rid of open coded
  driver approaches
- Mel: make the per-cpu allocator fast enough to act as the page
  pool; that gets NUMA locality, shrinking, etc all for free
- needs pool sizing for used pool items, too - can't keep collecting
  incoming packets without handling them; allow the page allocator to
  reclaim memory

Address Space Mirroring

- Haswell-EX allows memory mirroring, of partial or all memory
- goal: improve high availability by avoiding uncorrectable errors in
  kernel memory
- partial mirroring has higher remaining memory capacity, but is not
  software transparent
- some memory mirrored, some not; mirrored memory is set up in the
  BIOS, the amount in each NUMA node proportional to the amount of
  memory in that node
- mirror range info is in the EFI memory map
- avoid kernel allocations from non-mirrored memory ranges: avoid
  ZONE_MOVABLE allocations
- put user allocations in non-mirrored memory: avoid ZONE_NORMAL
  allocations
- MADV_MIRROR to put certain user memory in mirrored memory
- problem: to put a whole program in mirrored memory, libraries need
  to be relocated into mirrored memory
- what is the value proposition of mirroring user space memory?
- policy: when mirrored memory is requested, do not fall back to
  non-mirrored memory; Michal: is this desired?
- Aneesh: how should we represent mirrored memory? zones? something
  else?
- Michal: we are back to the highmem problem
- lesson from the highmem era: keep the ratio of kernel to non-kernel
  memory low enough, below 1:4
- how much of userspace needs to be in mirrored memory, in order to
  be able to restart applications?
- should we have opt-out for mirrored memory instead of opt-in?
- proposed interface: prctl
- the kcore mirror code is upstream already
- Mel: systems using lots of ZONE_MOVABLE have problems, and are
  often unstable
- Mel: assuming userspace can figure out the right thing to choose
  what needs to be mirrored is not safe
- Vlastimil: use non-mirrored memory as frontswap only, put all
  managed memory in mirrored memory
- dwmw2: for a workload of "guests we care about, guests we don't
  care about", we can allocate the guest memory of unimportant guests
  in non-mirrored memory
- Mel: even in that scenario a non-important guest's kernel
  allocations could exhaust mirrored memory
- Mel: partial mirroring makes a promise of reliability that it
  cannot deliver on; false hope; complex configuration makes the
  system less reliable
- Andrea: memory hotplug & other ZONE_MOVABLE users already cause the
  same problems today

Heterogeneous Memory Management

- used for GPUs, CAPI, and other kinds of offload engines
- GPUs have much faster memory than system RAM; to get performance,
  GPU offload data needs to sit in VRAM
- a shared address space creates an easier programming model
- needs the ability to migrate memory between system RAM and VRAM
- the CPU cannot access VRAM; the GPU can access system RAM ... very
  very slowly
- hardware is coming up real soon (this year)
- without HMM, GPU workloads run 10/100x slower and need to pin lots
  of system memory (16GB per device?)
- use of mmu_notifier is spreading to device drivers, instead of one
  common solution
- special swap type to handle migration
- future openCL APIs want address space sharing
- HMM has some core VM impact, but it is relatively contained
- how to get HMM upstream? does anybody have objections to anything
  in HMM?
- split it up into several series
- Andrew: put more info in the changelogs
- space for future optimizations
- dwmw2: the svm API should move to a generic API;
  intel_svm_bind_mm - bind the current process to a PASID

MM validation & debugging

- Sasha is using KASAN on locking, to trap missed locks; this
  requires annotating what memory is locked by a lock
- how to annotate what memory is protected by a lock?
- Kirill: what about a struct with a lock inside? annotate struct
  members with which lock protects them? too much work
- trying to improve hugepage testing
- split_all_huge_pages: expose the list of huge pages through
  debugfs, allow splitting arbitrarily chosen ones
- fuzzer to open, close, read & write random files in sysfs & debugfs
- how to coordinate security(?) issues with the zero-day security
  folks?

Memory cgroups

- how to figure out the memory a cgroup needs (as opposed to what it
  currently uses)?
- memory pressure is not enough to determine the needs of a cgroup
- cgroups are scanned in equal portions; that is unfair, streaming
  file IO can result in using lots of memory, even when the cgroup
  has mostly inactive file pages
- potential solution: dynamically balance the cgroups, adjusting
  limits dynamically based on their memory pressure
- problem: how to detect memory pressure? when to increase memory?
  when to decrease memory?
- real time aging of the various LRU lists; only for the active /
  anon lists, not the inactive file list
- "keep cgroup data in memory if its working set is younger than X
  seconds"
- refault info: distinguish between refaults (working set faulted
  back in) and evictions of data that is only used once
- can be used to know when to grow a cgroup, but not when to shrink
  it
- vmpressure API: does not work well on very large systems, only on
  smaller ones; it quickly reaches "critical" levels on large systems
  that are not even that busy
- Johannes: a time-based statistic to measure how much time processes
  wait for IO; not iowait, which measures how long the _system_
  waits, but per-task; add refault info in, and only count time spent
  on refaults
- wait time above the threshold? grow the cgroup
- wait time under the threshold? shrink the cgroup, but not below its
  lower limit
- Larry: the Docker people want per-cgroup vmstat info

TLB flush optimizations

- mmu_gather side of tlb flushing: collect invalidations, gather
  items to flush
- patch: increase the size of mmu_gather, and try to flush more at
  once
- Andrea - rmap length scalability issues: with too many KSM pages
  merged together, the rmap chain becomes too long
- put an upper limit on the number of shares of a KSM page (256 share
  limit)
- mmu_notifiers batch flush interface?
- limit max_page_sharing to reduce KSM rmap chain length

OOM killer

- goal: make OOM invocation more deterministic
- currently: reclaim until there is nothing left to reclaim, then
  invoke the OOM killer
- problem: sometimes reclaim gets stuck, and the OOM killer is not
  invoked when it should be
- one single freed page resets the OOM counter, causing livelock
- thrashing is not detected; on the contrary, it helps thrashing
  happen
- make things more conservative?
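On the KSM rmap item above: the sharing that makes rmap chains long is opt-in from userspace. A hedged sketch (the helper name is mine) of how memory is opted into KSM merging:

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

#ifndef MADV_MERGEABLE
#define MADV_MERGEABLE 12   /* Linux value, in case the libc headers lack it */
#endif

/* Opt an anonymous region into KSM merging.  Each identical page
 * merged into a KSM page adds an entry to that page's rmap chain,
 * which is what the proposed 256-share max_page_sharing limit bounds.
 * madvise() fails with EINVAL when CONFIG_KSM is off, so the hint is
 * treated as best-effort here. */
int mark_mergeable(size_t len)
{
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return -1;

	/* Best-effort: let ksmd scan and merge identical pages here. */
	(void)madvise(buf, len, MADV_MERGEABLE);

	munmap(buf, len);
	return 0;
}
```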
- OOM killer invoked on heavy thrashing with no progress made in the
  VM
- OOM reaper - to free resources before the OOM killed task can exit
  by itself
- a timeout based solution is not trivial; doable, but not preferred
  by Michal
- if Johannes can make a timeout scheme deterministic, Michal has no
  objections
- Michal: I think we can do better without a timer solution
- need a deterministic way to put the system into a consistent state

tmpfs vs OOM killer

- the OOM killer cannot discard tmpfs files
- with cgroups, reap a giant tmpfs file anyway in special cases at
  Google
- restart the whole container, dump the container's tmpfs contents

MM tree workflow

- most of Andrew's job: solicit feedback from people
- the -mm git tree helps many people
- Michal: would like email message IDs referenced in patches, both
  for original patches and for fixes
- the value of -fix patches is that previous reviews do not need to
  be re-done; sometimes a replacement patch is easier
- Kirill: sometimes it is difficult to get patch sets reviewed
- acked-by and reviewed-by lines are generally added by hand
- Michal: the -mm tree is the maintainer tree of last resort
- Andrew: carrying those extra patches isn't too much work

SLUB optimizations lightning talk

- bulk APIs for SLUB + SLAB: kmem_cache_{alloc,free}_bulk(),
  kfree_bulk()
- 60% speedup measured; can be used from networking, rcu free, ...
- per CPU freelist per page: nice speedup, but still suffers from a
  race condition
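On the tmpfs-vs-OOM notes above: the OOM killer cannot discard tmpfs contents, but a container manager can free a tmpfs file's pages itself by punching holes. A minimal userspace sketch, with memfd_create() standing in for a tmpfs mount (the helper name is mine):

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Create a tmpfs-backed file, dirty a page, then release the backing
 * pages with a hole punch while keeping the file size intact.  This
 * is the same FALLOC_FL_PUNCH_HOLE mechanism a management daemon
 * could use to "reap" a giant tmpfs file under memory pressure. */
int punch_tmpfs_file(size_t len)
{
	int fd = memfd_create("demo", 0);  /* anonymous tmpfs file */
	if (fd < 0)
		return -1;

	if (ftruncate(fd, (off_t)len) != 0)
		goto fail;

	/* Dirty one page so there is actually memory to free. */
	if (write(fd, "x", 1) != 1)
		goto fail;

	/* Drop the backing pages; readers now see zeroes. */
	if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		      0, (off_t)len) != 0)
		goto fail;

	close(fd);
	return 0;
fail:
	close(fd);
	return -1;
}
```

This only frees memory; unlike an OOM kill, the file (and the container using it) keeps running with its data replaced by holes, which matches the "reap the tmpfs file anyway in special cases" approach mentioned above.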