On Wed, 6 Apr 2022 21:15:26 -0600 Yu Zhao <yuzhao@xxxxxxxxxx> wrote: > Add a design doc. > > > ... > > +Design overview > +=============== > +Objectives > +---------- > +The design objectives are: > + > +* Good representation of access recency > +* Try to profit from spatial locality > +* Fast paths to make obvious choices > +* Simple self-correcting heuristics > + > +The representation of access recency is at the core of all LRU > +implementations. In the multi-gen LRU, each generation represents a > +group of pages with similar access recency. Generations establish a > +common frame of reference and therefore help make better choices, > +e.g., between different memcgs on a computer or different computers in > +a data center (for job scheduling). Does MGLRU have any special treatment for used-once pages? > +Exploiting spatial locality improves efficiency when gathering the > +accessed bit. A rmap walk targets a single page and does not try to > +profit from discovering a young PTE. A page table walk can sweep all > +the young PTEs in an address space, but the address space can be too > +large to make a profit. The key is to optimize both methods and use > +them in combination. > + > +Fast paths reduce code complexity and runtime overhead. Unmapped pages > +do not require TLB flushes; clean pages do not require writeback. > +These facts are only helpful when other conditions, e.g., access > +recency, are similar. With generations as a common frame of reference, > +additional factors stand out. But obvious choices might not be good > +choices; thus self-correction is required. > + > +The benefits of simple self-correcting heuristics are self-evident. > +Again, with generations as a common frame of reference, this becomes > +attainable. Specifically, pages in the same generation can be > +categorized based on additional factors, and a feedback loop can > +statistically compare the refault percentages across those categories > +and infer which of them are better choices. > + > +Assumptions > +----------- > +The protection of hot pages and the selection of cold pages are based > +on page access channels and patterns. There are two access channels: > + > +* Accesses through page tables > +* Accesses through file descriptors > + > +The protection of the former channel is by design stronger because: > + > +1. The uncertainty in determining the access patterns of the former > + channel is higher due to the approximation of the accessed bit. > +2. The cost of evicting the former channel is higher due to the TLB > + flushes required and the likelihood of encountering the dirty bit. > +3. The penalty of underprotecting the former channel is higher because > + applications usually do not prepare themselves for major page > + faults like they do for blocked I/O. E.g., GUI applications > + commonly use dedicated I/O threads to avoid blocking the rendering > + threads. > + > +There are also two access patterns: > + > +* Accesses exhibiting temporal locality > +* Accesses not exhibiting temporal locality > + > +For the reasons listed above, the former channel is assumed to follow > +the former pattern unless ``VM_SEQ_READ`` or ``VM_RAND_READ`` is > +present, and the latter channel is assumed to follow the latter > +pattern unless outlying refaults have been observed. What about MADV_SEQUENTIAL? Or did we propogate that into the fd? > +Workflow overview > +================= > +Evictable pages are divided into multiple generations for each > +``lruvec``. The youngest generation number is stored in > +``lrugen->max_seq`` for both anon and file types as they are aged on > +an equal footing. The oldest generation numbers are stored in > +``lrugen->min_seq[]`` separately for anon and file types as clean file > +pages can be evicted regardless of swap constraints. These three > +variables are monotonically increasing. > + > > ... > > +Summary > +------- > +The multi-gen LRU can be disassembled into the following parts: > + > +* Generations > +* Page table walks > +* Rmap walks > +* Bloom filters > +* The PID controller > + > +The aging and the eviction is a producer-consumer model; specifically, > +the latter drives the former by the sliding window over generations. > +Within the aging, rmap walks drive page table walks by inserting hot > +densely populated page tables to the Bloom filters. Within the > +eviction, the PID controller uses refaults as the feedback to select > +types to evict and tiers to protect. It's cool to see a PID controller in there. How do we know that it converges well, that it doesn't exhibit instability, etc?