On Mon, Jun 03, 2024 at 06:23:46AM -0700, Dave Hansen wrote: > On 6/3/24 02:35, Byungchul Park wrote: > ...> In luf's point of view, the points where the deferred flush should be > > performed are simply: > > > > 1. when changing the vma maps, that might be luf'ed. > > 2. when updating data of the pages, that might be luf'ed. > > It's simple, but the devil is in the details as always. > > > All we need to do is to indentify the points: > > > > 1. when changing the vma maps, that might be luf'ed. > > > > a) mmap and munmap e.i. fault handler or unmap_region(). > > b) permission to writable e.i. mprotect or fault handler. > > c) what I'm missing. > > I'd say it even more generally: anything that installs a PTE which is > inconsistent with the original PTE. That, of course, includes writes. > But it also includes crazy things that we do like uprobes. Take a look > at __replace_page(). > > I think the page_vma_mapped_walk() checks plus the ptl keep LUF at bay > there. But it needs some really thorough review. > > But the bigger concern is that, if there was a problem, I can't think of > a systematic way to find it. > > > 2. when updating data of the pages, that might be luf'ed. > > > > a) updating files through vfs e.g. file_end_write(). > > b) updating files through writable maps e.i. 1-a) or 1-b). > > c) what I'm missing. > > Filesystems or block devices that change content without a "write" from > the local system. Network filesystems and block devices come to mind. > I honestly don't know what all the rules are around these, but they > could certainly be troublesome. > > There appear to be some interactions for NFS between file locking and > page cache flushing. > > But, stepping back ... > > I'd honestly be a lot more comfortable if there was even a debugging LUF > mode that enforced a rule that said: > > 1. A LUF'd PTE can't be rewritten until after a luf_flush() occurs > 2. A LUF'd page's position in the page cache can't be replaced until > after a luf_flush() I'm thinking a debug mode doing the following *pseudo* code - check the logic only since the grammer might be wrong: 0-a) Introduce new fields in page_ext: #ifdef LUF_DEBUG struct list_head __percpu luf_node; #endif 0-b) Introduce new fields in struct address_space: #ifdef LUF_DEBUG struct list_head __percpu luf_node; #endif 0-c) Introduce new fields in struct task_struct: #ifdef LUF_DEBUG cpumask_t luf_pending_cpus; #endif 0-d) Define percpu list_head to link luf'd folios: #ifdef LUF_DEBUG DEFINE_PER_CPU(struct list_head, luf_folios); DEFINE_PER_CPU(struct list_head, luf_address_spaces); #endif 1) When skipping tlb flush in reclaim or migration for a folio: #ifdef LUF_DEBUG ext = get_page_ext_for_luf_debug(folio); as = folio_mapping(folio); for_each_cpu(cpu, skip_cpus) { list_add(per_cpu_ptr(ext->luf_node, cpu), per_cpu_ptr(luf_folios, cpu)); if (as) list_add(per_cpu_ptr(as->luf_node, cpu), per_cpu_ptr(luf_address_spaces, cpu)); } put_page_ext(ext); #endif 2) When performing tlb flush in try_to_unmap_flush(): Remind luf only works on unmapping during reclaim and migration. #ifdef LUF_DEBUG for_each_cpu(cpu, now_flushing_cpus) { for_each_node_safe(folio, per_cpu_ptr(luf_folios)) { ext = get_page_ext_for_luf_debug(folio); list_del_init(per_cpu_ptr(ext->luf_node, cpu)) put_page_ext(ext); } for_each_node_safe(as, per_cpu_ptr(luf_address_spaces)) list_del_init(per_cpu_ptr(as->luf_node, cpu)) cpumask_clear_cpu(cpu, current->luf_pending_cpus); } #endif 3) In pte_mkwrite(): #ifdef LUF_DEBUG ext = get_page_ext_for_luf_debug(folio); for_each_cpu(cpu, online_cpus) if (!list_empty(per_cpu_ptr(ext->luf_node, cpu))) cpumask_set_cpu(cpu, current->luf_pending_cpus); put_page_ext(ext); #endif 4) On returning to user: #ifdef LUF_DEBUG WARN_ON(!cpumask_empty(current->luf_pending_cpus)); #endif 5) On right after every a_ops->write_end() call: #ifdef LUF_DEBUG as = get_address_space_to_write_to(); for_each_cpu(cpu, online_cpus) if (!list_empty(per_cpu_ptr(as->luf_node, cpu))) cpumask_set_cpu(cpu, current->luf_pending_cpus); #endif luf_flush_or_its_optimized_version(); #ifdef LUF_DEBUG WARN_ON(!cpumask_empty(current->luf_pending_cpus)); #endif I will implement the debug mode this way with all serialized. Do you think it works for what we want? Byungchul > or *some* other independent set of rules that can tell us when something > goes wrong. That uprobes code, for instance, seems like it will work. > But I can also imagine writing it ten other ways where it would break > when combined with LUF.