Christoph and Al raised a number of major objections to the way I'd implemented the CacheFiles module and daemon, and I'd like to get advice on dealing with them. For reference, the current patch set can be downloaded from:

	http://people.redhat.com/~dhowells/nfs/

The issues are:

 (1) Security.

     The way I've written the CacheFiles module, the module goes directly to
     the appropriate inode operations to do filesystem operations:

	- lookup()
	- create()
	- mkdir()
	- unlink()
	- rename()
	- setxattr()
	- getxattr()

     Partly this is so that I can better handle the locking, avoiding races
     with other processes also wanting to do caching operations, and partly
     to avoid the security checks.

     Christoph insists that I should use the VFS wrappers for performing
     these operations (such as vfs_rename()).  This, however, introduces a
     couple of tricky security problems:

     (a) The CacheFiles operations run in the context of whichever process
	 issued an I/O request against the netfs being cached.  Consider,
	 for instance, a process opening an NFS file:

	 - the open() syscall asks NFS to open a file;
	 - NFS asks FS-Cache for a cookie to represent that file;
	 - FS-Cache asks CacheFiles to look in the cache to see if it has an
	   object to represent that cookie;
	 - CacheFiles performs various lookups, getxattrs, mkdirs, etc. on
	   the backing filesystem to achieve this; and
	 - the backing fs (eg. Ext3) actually does the operations.

	 All of this runs in the context of the process that issued the
	 open() syscall, including its fsuid and fsgid and its security
	 context (such as its SELinux security label).

	 The cache, however, mustn't be accessed under the security context
	 of anything other than the cache's own context, whether doing
	 lookups, file creation or whatever.  Doing otherwise means that the
	 objects in the cache may end up either inaccessible or unshareable.
	 Security for the original open() call should be applied at the NFS
	 level, not at the cache or backing fs level.
     (b) Accesses to the cache done by CacheFiles on behalf of the calling
	 process should very probably *not* be logged in the audit log, at
	 least not as originating with the process in whose context we're
	 running: that process didn't make the requests, and the whole
	 mechanism is supposed to be transparent.  It would not be
	 unreasonable, though, to log audit events as originating with the
	 CacheFiles module itself.

     Christoph asserts that I may *not* change the context (whether
     fsuid/fsgid or security context) of the calling process, which means I
     would instead have to run the CacheFiles operations in their own thread
     or, possibly, farm them off to userspace.

     There is a serious performance problem, however, with using a separate
     thread: it would serialise accesses to the netfs for the duration of
     the backing fs's accesses to the cache, which, apart from read/write,
     are synchronous.  I could mitigate this by using multiple threads, but
     it's still going to be serialised at some level, and allocating more
     threads uses more resources.  With what I have at the moment, each
     process can access the cache independently, so this isn't actually a
     problem.

     And it's not just a matter of each cached netfs being serialised with
     respect to itself; *all* cached netfs's would be serialised systemwide.
     If there's mostly only one process accessing NFS files at once on a
     system, then this isn't too much of a problem, but if you want to do
     something like "make -jN" on an NFS-mounted build tree, then it could
     get interesting.

 (2) I/O.

     There are a number of issues here:

     (a) Without being able to use O_DIRECT, I have to copy data around
	 between the netfs and backing fs pages.  There's not a lot I can do
	 about that at the moment.

     (b) I've got a function to copy the data from an arbitrary kernel page
	 to a particular page in a specified address space (used to store a
	 page of data from a netfs page into a page in the appropriate inode
	 of a backing device).  Christoph seems happy enough with that.
	 What he objects to is that I'm calling this function directly,
	 without jumping through the file ops table to get to it, which
	 would permit a filesystem to do something different if it wanted
	 to.  The function in effect invokes the address space operations
	 (prepare_write() and commit_write()) directly, but Christoph says
	 that isn't permitted.  However, I don't have a file struct
	 available on which to call such an operation.  I could allocate a
	 file struct, but that brings in the ENFILE problem...

	 Objections have also been raised to adding more I/O vectors,
	 especially if they're just for a facility like CacheFiles.

     (c) I'm calling readpage() with a NULL file struct pointer.  I'm not
	 sure that's necessarily a good idea, but I don't have a file struct
	 to give it.

     (d) I'm also calling readpage() directly, but that would seem not to
	 be permitted, as Christoph said I'm not allowed to call address
	 space ops directly, except through file ops.

     (e) I'm using bmap() to do hole detection in the backing files.  Holes
	 represent pages not yet retrieved from the server.  I can't use
	 readpage() for this because that'll just return a page full of
	 zeros.  Use of bmap() seems to offend people.

 (3) Communication with the daemon.

     CacheFiles has a daemon (cachefilesd) that does a number of cache
     maintenance operations in userspace, including: deletion of dead
     objects, scanning of the cache to build up tables of cullable objects,
     and the culling of objects.  The daemon also sets up the cache
     initially, tells the module where to find it, and otherwise configures
     the module.

     As it stands, the daemon opens /proc/fs/cachefiles to communicate with
     the cache.  The act of doing so starts the process of making a cache
     available.  The daemon then writes configuration commands down this
     "pipe" to configure the cache.  It can write further commands to alter
     the cache configuration, and it also writes commands to cull objects
     from the cache.
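As a purely illustrative sketch of what the daemon might write down this pipe: the exact command set isn't spelled out above, so the names and values below are assumptions modelled on the cachefilesd-style configuration syntax, not taken from the text.

```
dir /var/fscache        # where the backing cache lives (illustrative path)
tag mycache             # a name for this cache (illustrative)
brun 10%                # free space above which culling stops (assumed)
bcull 7%                # free space below which culling starts (assumed)
bstop 3%                # free space below which allocation stops (assumed)
```

Culling commands for individual objects would then be written down the same file descriptor later, on demand.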
     The daemon can read from that file to determine the current state of
     the cache, including how much space is available.

     The daemon sets up signal handling on that file so that the module can
     send it signals through the file to indicate that it should start
     culling.  This means that the module doesn't have to know which
     process(es) are on the other end of the file; it can just signal them.
     I could have used poll() for this, but that would mean calling poll()
     every time around the loop just to determine whether I should be
     culling or not.

     The daemon installs a dnotify watch on the graveyard directory into
     which the module moves things that need to be deleted.  This causes the
     daemon to be signalled when its services in that department are
     required, without the module having to do anything special.

     Note that the daemon can be busy scanning when a request comes in to
     cull or delete, and if it's busy scanning, I'd rather not also be
     polling the module at regular intervals.  The signals really aren't
     likely to be received all that often; there's hysteresis built in to
     deal with that.

     Finally, when the file descriptor is closed, it is assumed that the
     daemon has gone away or died, and the CacheFiles module shuts down the
     cache facility.

     With this mechanism, I can permit multiple caches: I just have to let
     the file be opened multiple times, once per cache, with a cachefiles
     daemon on the end of each.

     The objections raised to my approach are:

     (a) Christoph objected to my use of a proc-file.  He said I should be
	 using sysfs.  However, having talked to GregKH, sysfs doesn't allow
	 me to detect one of its files being closed.  There are several
	 possibilities as to how this can be done:

	 (j) With a proc-file, as I'm doing now.

	 (k) GregKH suggested using a chardev, and Matt Domsch suggested
	     creating one with a dynamic major and letting udev create the
	     device file.
	 (l) Al suggested an elaborate scheme involving creating a simple
	     filesystem just for CacheFiles: the daemon puts itself into its
	     own namespace, mounts this fs, binds the cache, /etc and /usr
	     under it, and kill_sb() is used to detect the cache going away.
	     There'd still be a "command file" in the cachefiles fs for the
	     daemon to control the module.

	 (m) Alter sysfs to permit release detection, and then use sysfs.

	 (n) Use sysfs without release detection: use static configuration
	     through sysfs files, and just ignore the fact that the daemon
	     has gone away.  That has the disadvantage that the daemon might
	     go away without us ever noticing.

	 (o) Use netlink.  I don't know much about how to do this.

	 The first two are definitely the simplest, as they involve just a
	 single simple channel of communication between module and daemon.

     (b) Christoph also objected to my use of signalling through the file
	 descriptor to prod the daemon.  I could use poll() instead, but see
	 above.

     (c) Al objected to my use of dnotify, on the basis that I have a
	 channel open between the module and the daemon anyway and could use
	 that.  My take on this is that dnotify works automatically, and the
	 module doesn't have to do anything special to indicate this to the
	 daemon.

Anyway, can I have your advice please?

David
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html