On Sun, 03 Dec 2006 12:49:42 -0500 Wendy Cheng <wcheng@xxxxxxxxxx> wrote: > Andrew Morton wrote: > > >On Thu, 30 Nov 2006 11:05:32 -0500 > >Wendy Cheng <wcheng@xxxxxxxxxx> wrote: > > > > > > > >> > >>The idea is, instead of unconditionally dropping every buffer associated > >>with the particular mount point (that defeats the purpose of page > >>caching), base kernel exports the "drop_pagecache_sb()" call that allows > >>page cache to be trimmed. More importantly, it is changed to offer the > >>choice of not randomly purging any buffer but the ones that seem to be > >>unused (i_state is NULL and i_count is zero). This will encourage > >>filesystem(s) to pro actively response to vm memory shortage if they > >>choose so. > >> > >> > > > >argh. > > > > > I read this as "It is ok to give system admin(s) commands (that this > "drop_pagecache_sb() call" is all about) to drop page cache. It is, > however, not ok to give filesystem developer(s) this very same function > to trim their own page cache if the filesystems choose to do so" ? If you're referring to /proc/sys/vm/drop_pagecache then no, that isn't for administrators - it's a convenience thing for developers, to get repeatable benchmarks. Attempts to make it a per-numa-node control for admin purposes have been rejected. > >In Linux a filesystem is a dumb layer which sits between the VFS and the > >I/O layer and provides dumb services such as reading/writing inodes, > >reading/writing directory entries, mapping pagecache offsets to disk > >blocks, etc. (This model is to varying degrees incorrect for every > >post-ext2 filesystem, but that's the way it is). > > > > > Linux kernel, particularly the VFS layer, is starting to show signs of > inadequacy as the software components built upon it keep growing. I have > doubts that it can keep up and handle this complexity with a development > policy like you just described (filesystem is a dumb layer ?). Aren't > these DIO_xxx_LOCKING flags inside __blockdev_direct_IO() a perfect > example why trying to do too many things inside vfs layer for so many > filesystems is a bad idea ? That's not a very well-chosen example, but yes, the old ext2-based model has needed to be extended as new filesystems come along. > By the way, since we're on this subject, > could we discuss a little bit about vfs rename call (or I can start > another new discussion thread) ? > > Note that linux do_rename() starts with the usual lookup logic, followed > by "lock_rename", then a final round of dentry lookup, and finally comes > to filesystem's i_op->rename call. Since lock_rename() only calls for > vfs layer locks that are local to this particular machine, for a cluster > filesystem, there exists a huge window between the final lookup and > filesystem's i_op->rename calls such that the file could get deleted > from another node before fs can do anything about it. Is it possible > that we could get a new function pointer (lock_rename) in > inode_operations structure so a cluster filesystem can do proper locking ? That would need a new thread, and probably (at least pseudo-) code, and cc's to the appropriate maintainers (although that part of the kernel isn't really maintained any more - it has fallen into the patch-and-run model). > >>From our end (cluster locks are expensive - that's why we cache them), > >>one of our kernel daemons will invoke this newly exported call based on > >>a set of pre-defined tunables. It is then followed by a lock reclaim > >>logic to trim the locks by checking the page cache associated with the > >>inode (that this cluster lock is created for). If nothing is attached to > >>the inode (based on i_mapping->nrpages count), we know it is a good > >>candidate for trimming and will subsequently drop this lock (instead of > >>waiting until the end of vfs inode life cycle). > >> > >> > > > >Again, I don't understand why you're tying the lifetime of these locks to > >the VFS inode reclaim mechanisms. Seems odd. > > > > > Cluster locks are expensive because: > > 1. Every node in the cluster has to agree about it upon granting the > request (communication overhead). > 2. It involves disk flushing if bouncing between nodes. Say one node > requests a read lock after another node's write... before the read lock > can be granted, the write node needs to flush the data to the disk (disk > io overhead). > > For optimization purpose, we want to refrain the disk flush after writes > and hope (and encourage) the next person who requests the lock to be on > the very same node (to take the advantage of OS write-back logic). > That's why the locks are cached on the very same node. It will not get > removed unless necessary. > What would be better to build the lock caching on top of the existing > inode cache logic - since these are the objects that the cluster locks > are created for in the first place. hmm, I suppose that makes sense. Are there dentries associated with these locks? > >If you want to put an upper bound on the number of in-core locks, why not > >string them on a list and throw away the old ones when the upper bound is > >reached? > > > > > Don't take me wrong. DLM *has* a tunable to set the max lock counts. We > do drop the locks but to drop the right locks, we need a little bit help > from VFS layer. Latency requirement is difficult to manage. > > >Did you look at improving that lock-lookup algorithm, btw? Core kernel has > >no problem maintaining millions of cached VFS objects - is there any reason > >why your lock lookup cannot be similarly efficient? > > > > > Don't be so confident. I did see some complaints from ext3 based mail > servers in the past - when the storage size was large enough, people had > to explicitly umount the filesystem from time to time to rescue their > performance. I don't recall the details at this moment though. People have had plenty of problems with oversized inode-caches in the past, but I think they were due to memory consumption, not to lookup inefficiency. My question _still_ remains unanswered. Third time: is is possible to speed up this lock-lookup code? Perhaps others can take a look at it - where is it? - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html