On Fri, Oct 15, 2010 at 02:44:51PM +1100, Nick Piggin wrote:
> On Fri, Oct 15, 2010 at 02:30:17PM +1100, Nick Piggin wrote:
> > On Fri, Oct 15, 2010 at 02:13:43PM +1100, Dave Chinner wrote:
> > > You've shown it can be done, and that's great - it shows us the
> > > impact of making those changes, but they need to be analysed
> > > separately and treated on their own merits, not lumped with core
> > > locking changes necessary for store-free path walking.
> >
> > Actually I didn't see anyone else object to doing this. Everybody
> > else, it seems, acknowledges that it needs to be done, and it gets
> > done naturally as a side effect of fine grained locking.
>
> Let's just get back to this part, which seems to be the one you have
> the most issues with, maybe?

[snip per-zone lru/locking]

As far as the rest of the work goes, I much prefer to come to a basic
consensus about the overall design of the entire vfs scale work first,
and then focus on the exact implementation and patch series details.
When there is a consensus, I think it makes much more sense to merge it
in quite large chunks, ie. all of the inode locking, then all of the
dcache locking.

I do not want to cherry pick things here and there and leave the rest
out because your particular workload doesn't care about them, or
because you haven't reviewed them yet, and so on, because that just
turns my plan into a mess. I'm perfectly happy to change the design or
drop some aspects of it _if_ we decide, with reasonable arguments and
agreement among everybody, that we don't want them. On the other hand,
I prefer not to merge a few bits and leave others out simply because we
_don't_ have a consensus about one aspect or another. So if you don't
agree with something, let's work out why not and try to come to an
agreement, rather than pushing only the bits and pieces that you happen
to agree with.

You're worried about mere mortals reviewing and understanding it... I
don't really see that as a problem. If you understand inode locking
today, you can understand the inode scaling series quite easily. Ditto
for dcache. rcu-walk path walking is trickier, but it is described in
detail in the documentation and changelogs, and you can understand the
high level approach without digesting every detail at once.

The inode locking work breaks up all of the global locks:

- a single inode object is protected (to the same level as inode_lock)
  by i_lock. That makes it trivial for filesystems to lock down the
  object without taking a global lock (a rough sketch of what this
  looks like is further down).
- the inode hash is RCUified, with insertion/removal made per-bucket
- inode lru lists and their locking are made per-zone
- the inode sb list is made per-sb, per-cpu
- inode counters are made per-cpu
- inode io lists and their locking are made per-bdi

So from the highest level snapshot, this is not rocket science. And the
way I've structured the patches, you can take almost any of the above
points and go and look in the patch series to see how it is
implemented.

Is this where we want to go? My argument is yes, and I have gradually
been gathering real results and agreement from others. I've
demonstrated performance improvements, although many of them can only
actually be achieved once the dcache, vfsmount, files_lock etc. scaling
is also implemented, which is another reason why it is so important to
keep everything together. And it is not always trivial to take a single
change and document a performance improvement for it in isolation.
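To make the first of those bullet points concrete, this is roughly what
the change looks like from a filesystem's point of view. It is a
hand-written sketch rather than code lifted from the series, and the
real paths (__mark_inode_dirty, for example) obviously do more than
flip a flag:

	/* before: any change to inode state serialises every CPU in
	 * the box on the single global inode_lock */
	spin_lock(&inode_lock);
	inode->i_state |= I_DIRTY;
	spin_unlock(&inode_lock);

	/* after: the per-object i_lock gives the same protection for
	 * this inode, and only threads touching this inode contend */
	spin_lock(&inode->i_lock);
	inode->i_state |= I_DIRTY;
	spin_unlock(&inode->i_lock);

The nice property is that a filesystem only has to think about the
object it is working on, not about a global locking scheme.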
But you can use your brain with scalability work, and if you're not
convinced about a particular patch, you can now actually take the full
series and revert a single patch (or add an artificial lock in there to
demonstrate the scalability overhead).

What I have done in the series is required to get almost linear
scalability, on almost all important basic vfs operations, up to the
largest POWER7 system IBM has internally. It should scale to the
largest UV systems from SGI, and it should scale on -rt.

Put a global lock back in the inode lru creation/destruction/touch/
reclaim path and scalability is going to go to hell on these workloads
on large systems again. And "large" isn't even that large these days:
you can see these problems clear as day on 2 and 4 socket small
servers, and we know it is going to get worse for at least a few more
doublings of core count.
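If anyone wants a feel for that last point without access to a big box,
here is a toy userspace analogue of the trade-off (plain pthreads; the
thread count, iteration count and file name are arbitrary, and none of
this is taken from the kernel code): one counter under a single global
lock versus per-thread counters that are only summed when somebody
wants the total, which is roughly the shape of the per-cpu inode
counters mentioned above.

/* cc -O2 -pthread percpu-demo.c -o percpu-demo */
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define NTHREADS 8
#define ITERS    1000000L

/* "global" variant: every increment bounces one lock and one
 * cacheline around the machine */
static pthread_mutex_t global_lock = PTHREAD_MUTEX_INITIALIZER;
static long global_count;

/* "per-cpu" variant: each thread only touches its own padded slot,
 * and a reader sums the slots when it actually wants a total */
static struct { long count; char pad[64 - sizeof(long)]; } slot[NTHREADS];

static double now(void)
{
	struct timespec ts;
	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec + ts.tv_nsec / 1e9;
}

static void *global_worker(void *arg)
{
	(void)arg;
	for (long i = 0; i < ITERS; i++) {
		pthread_mutex_lock(&global_lock);
		global_count++;
		pthread_mutex_unlock(&global_lock);
	}
	return NULL;
}

static void *percpu_worker(void *arg)
{
	long id = (long)arg;

	for (long i = 0; i < ITERS; i++)
		slot[id].count++;	/* no shared state touched at all */
	return NULL;
}

static double run(void *(*fn)(void *))
{
	pthread_t t[NTHREADS];
	double start = now();
	long i;

	for (i = 0; i < NTHREADS; i++)
		pthread_create(&t[i], NULL, fn, (void *)i);
	for (i = 0; i < NTHREADS; i++)
		pthread_join(t[i], NULL);
	return now() - start;
}

int main(void)
{
	long total = 0;
	int i;

	printf("global lock: %.2fs\n", run(global_worker));
	printf("per-thread:  %.2fs\n", run(percpu_worker));

	for (i = 0; i < NTHREADS; i++)
		total += slot[i].count;
	printf("totals: %ld vs %ld\n", global_count, total);
	return 0;
}

Even on a small multicore machine the locked half should come out
visibly slower, and the gap only widens as you add sockets, which is
the same general pattern the remaining global vfs locks produce.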