Hello!

On Fri, Dec 19, 2008 at 03:34:20PM +0900, KOSAKI Motohiro wrote:
> I think gup_pte_range() doesn't change pte attribute.
> Could you explain why get_user_pages_fast() is evil?

It's evil because it was assumed that just relying on
local_irq_disable() to prevent the smp tlb flush IPI from running would
be enough to simulate a 'current' pagetable walk that allowed the
current task to run entirely lockless. The problem is that by being
totally lockless it prevents us from knowing whether a page is under
direct-io or not. And if a page is under direct IO that writes to
memory (for reads from memory we couldn't care less, those are always
ok) we can't merge pages in ksm and we can't mark the pte readonly in
fork etc... If we do, things break.

The entirely lockless (but atomic) pagetable walk done by the cpu is
different from gup_fast because the one done by the cpu will never end
up writing to the page through the pci bus in DMA, so the moment the
IPI runs whatever I/O was going on is interrupted. That is not the case
for gup_fast: when gup_fast returns and the IPI runs, and the page is
then available for sharing to ksm or the pte is marked readonly, the
direct DMA is still in flight.

That's why gup_fast *can't* be 100% lockless as it is today, otherwise
it's unfixable and broken, and it's not just ksm. This very O_DIRECT
bug in fork is 100% unfixable without adding some serialization to
gup_fast. So my patch fixes it fully only for kernels before the
introduction of gup_fast...

My suggestion is to reintroduce the big reader lock (br_lock) of 2.4
and have gup_fast take the read side of it, and fork/ksm take the
write side. It must not be a write-starving lock like the 2.4 one
though, or fork would hang forever on large smp. It should still be
faster than get_user_pages.

> Why rhel can't use memory barrier?

Oh it can, I just didn't implement it immediately as I wanted to ship
a simpler patch first, but given the 27% slowdown measured in a later
email, I'll definitely have to replace the TestSetPageLocked with
smp_rmb and see if the introduced overhead goes away.
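
To make the big reader lock suggestion above a bit more concrete, here is a rough userspace model of the idea, written with C11 atomics. It is only a sketch under assumptions: it is not the 2.4 br_lock code and not kernel code, and every name in it (struct brlock, BRL_SLOTS, brl_read_trylock and friends) is made up for illustration. Each reader touches only its own slot, the way gup_fast would touch only per-cpu state, while the writer (fork/ksm) first announces itself and then waits for the in-flight readers to drain. Because new readers back off as soon as the writer flag is set, the writer cannot be starved.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define BRL_SLOTS 4	/* hypothetical stand-in for the number of cpus */

struct brlock {
	atomic_int readers[BRL_SLOTS];	/* per-slot count of active readers */
	atomic_bool writer;		/* set while a writer holds or waits */
};

/* Hypothetical lock instance; static storage, so it starts zeroed. */
static struct brlock demo_lock;

/* Read side (the gup_fast role): only the caller's own slot is written. */
static bool brl_read_trylock(struct brlock *l, int slot)
{
	atomic_fetch_add(&l->readers[slot], 1);
	if (atomic_load(&l->writer)) {
		/* A writer is active or waiting: back off so it can't starve. */
		atomic_fetch_sub(&l->readers[slot], 1);
		return false;
	}
	return true;
}

static void brl_read_unlock(struct brlock *l, int slot)
{
	int prev = atomic_fetch_sub(&l->readers[slot], 1);

	assert(prev > 0);	/* catch an unbalanced unlock */
	(void)prev;
}

/* Write side, step 1 (the fork/ksm role): announce the writer. */
static bool brl_write_claim(struct brlock *l)
{
	bool expected = false;

	return atomic_compare_exchange_strong(&l->writer, &expected, true);
}

/* Write side, step 2: true once every reader slot has drained. */
static bool brl_write_ready(struct brlock *l)
{
	for (int i = 0; i < BRL_SLOTS; i++)
		if (atomic_load(&l->readers[i]) != 0)
			return false;
	return true;
}

/* Blocking write lock: claim, then spin until the readers are gone. */
static void brl_write_lock(struct brlock *l)
{
	while (!brl_write_claim(l))
		;	/* another writer holds the lock */
	while (!brl_write_ready(l))
		;	/* readers that got in earlier are still running */
}

static void brl_write_unlock(struct brlock *l)
{
	atomic_store(&l->writer, false);
}
```

The read side costs one atomic increment and one load on the uncontended path, which is where the hope that it stays cheaper than get_user_pages comes from; the write side pays for scanning every slot. A reader that fails brl_read_trylock would retry or fall back to the slow path, and that choice is left open here.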