Hi,

I'm sorry for the late and sorry state of these patches. I had some interesting offline issues, and I also had to strike while the iron was hot with the vfs patches.

But anyway, mmap_sem scalability is a topic we'll be talking about at the MM meetup next week, so I wanted to contribute something here.

I am now of the opinion that a single mmap_sem and a global rbtree is maybe not so great. It's tricky to make it more scalable on the read side without increasing write side overhead (which we don't want to do too much). We have problems with write side scalability too: writer versus readers (Google's problem) and even multiple writers.

Possibly one way to go is to move away from a global rbtree for vmas altogether. We have this beautiful, cache efficient, scalable data structure that is the page table, sitting alongside the clunky old rbtree. So why not put vma extents in the bottom level page tables, like we do with the page table lock?

We can use the page table locks to protect each tree, so the number of atomic operations should go down once mmap_sem is removed. The page tables are cache hot from the TLB handler, and we need to load the bottom level struct page anyway to get the ptl. The vma trees will also be shorter, which means a smaller cache footprint and a shorter chain of dependent cache misses. And we don't even need any new fancy lock free stuff.

Now, one big issue is that we can't sleep if we're using the ptl rather than mmap_sem. We would also prefer not to sleep while holding locks anyway (even if they are more fine grained). So if the fault handler needs to sleep, it can take a reference on the inode (or on the vma holding the inode open) and then drop the ptl. After the page is uptodate, it just walks the page tables again, takes the ptl, confirms the vma is still correct, and inserts the pte.

Another issue is free space allocation. At the moment I am just using another global extent tree for this.
It is not as bad as it sounds, because the lock hold times are much shorter, and we get the benefit that mappings are decoupled from allocations, so we don't need to hold a lock over a complete unmap/TLB flush. The free space allocator could also easily be changed to use per-cpu or per-thread caches to start getting more write side parallelism.

Thoughts? Comments?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxxx
For more info on Linux MM, see: http://www.linux-mm.org/ .
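To make the sleeping-fault idea above a bit more concrete, here is a rough userspace sketch of the "pin, drop the ptl, sleep, retake, revalidate" sequence. It is a toy model and not kernel code: every name in it (`leaf_table`, `vma_extent`, `handle_fault`, `toy_unmap_all`) is made up for illustration, a boolean flag stands in for the real page table lock, a plain counter stands in for the inode/vma reference, a small array stands in for the per-page-table extent tree, and a callback stands in for whatever the fault handler would sleep on.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: all names and sizes below are hypothetical. */
#define TOY_PAGE_SHIFT    12
#define EXTENTS_PER_TABLE 4
#define PTES_PER_TABLE    8

struct vma_extent {
    unsigned long start, end;  /* covers [start, end) */
    int refcount;              /* stands in for the inode/vma reference */
};

struct leaf_table {
    bool ptl_held;                                 /* stands in for the ptl */
    unsigned long base;                            /* first address covered */
    struct vma_extent *extents[EXTENTS_PER_TABLE]; /* per-table vma extents */
    unsigned long pte[PTES_PER_TABLE];             /* 0 means not present */
};

static void ptl_lock(struct leaf_table *t)
{
    assert(!t->ptl_held);
    t->ptl_held = true;
}

static void ptl_unlock(struct leaf_table *t)
{
    assert(t->ptl_held);
    t->ptl_held = false;
}

/* Look up the extent covering addr; the caller must hold the ptl. */
static struct vma_extent *find_extent(struct leaf_table *t, unsigned long addr)
{
    assert(t->ptl_held);
    for (size_t i = 0; i < EXTENTS_PER_TABLE; i++) {
        struct vma_extent *v = t->extents[i];
        if (v && addr >= v->start && addr < v->end)
            return v;
    }
    return NULL;
}

/* Simulates a concurrent munmap that removes every extent from the table. */
void toy_unmap_all(struct leaf_table *t)
{
    ptl_lock(t);
    for (size_t i = 0; i < EXTENTS_PER_TABLE; i++)
        t->extents[i] = NULL;
    ptl_unlock(t);
}

enum fault_result { FAULT_OK, FAULT_NO_VMA, FAULT_RETRY };

/* sleep_op models the part of the fault that may sleep; it runs unlocked. */
typedef void (*sleep_fn)(struct leaf_table *t);

enum fault_result handle_fault(struct leaf_table *t, unsigned long addr,
                               sleep_fn sleep_op)
{
    if (addr < t->base)
        return FAULT_NO_VMA;
    size_t idx = (addr - t->base) >> TOY_PAGE_SHIFT;
    if (idx >= PTES_PER_TABLE)
        return FAULT_NO_VMA;

    ptl_lock(t);
    struct vma_extent *v = find_extent(t, addr);
    if (!v) {
        ptl_unlock(t);
        return FAULT_NO_VMA;
    }
    v->refcount++;      /* pin the object so it cannot go away while we sleep */
    ptl_unlock(t);      /* never sleep with the ptl held */

    if (sleep_op)
        sleep_op(t);    /* e.g. wait for the page to become uptodate */

    ptl_lock(t);
    /* Revalidate: the same extent must still cover addr. */
    bool still_valid = (find_extent(t, addr) == v);
    if (still_valid)
        t->pte[idx] = addr | 1UL;  /* toy "present" pte */
    v->refcount--;
    ptl_unlock(t);

    return still_valid ? FAULT_OK : FAULT_RETRY;
}
```

Calling `handle_fault(&t, addr, NULL)` models a fault that never sleeps, while passing `toy_unmap_all` as the callback models an unmap racing with the sleeping part, in which case the fault returns `FAULT_RETRY` and no pte is installed. The pointer comparison in the revalidation step is only meaningful because the reference keeps the extent alive across the sleep; a real implementation would presumably need something stronger if extents can be split or merged in the meantime.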