On Fri, Jan 30, 2009 at 03:12:07PM +0900, Akira Fujita wrote: > ext4: online defrag -- Add EXT4_IOC_DEFRAG ioctl and basic defrag functions. > > From: Akira Fujita <a-fujita@xxxxxxxxxxxxx> Thanks, this patch is much better. Eventually I suspect you'll want to add an ioctl to control the allocation of blocks to the donor inode, but that can certainly come later. I would combine the two kernel patches into one. There's no point separating them into separate patches; the maximum message size through the vger mailing lists is something like 100k, and the two patches combined are under 43k. The first patch won't function very well with two key functions stubbed out with "return 0;", anyway.... Also, I note that you are moving pages from the donor inode to the target inode one page at a time. It would be faster and involve less journal activity to move an extent at a time; after all, the donor fd should be no more fragmented than the source fd, or the userspace program shoudl have noticed and not attempted to call the defrag ioctl. Therefore, nearly all of the time, it should be possible to move a contiguous range of blocks at a time from the donor to the source without increasing the number of extents in the target inode's leaf node. You may also want to consider whether it is truly worth it to swap the block assignments between the donor and the target inode, and whether it might be easier simply to release the blocks that had previously belonged to the target inode, and simply decrement the length of the extent and/or delete the extent from the donor inode altogether. That removes the need to split any leaf blocks in the donor inode's extent tree, since it is guaranteed to always stay the same size or become smaller. > +{ > + if (!S_ISREG(org_inode->i_mode) || !S_ISREG(dest_inode->i_mode)) { > + printk(KERN_ERR "ext4 defrag: " > + "The argument files should be regular file\n"); > + return -EINVAL; > + } Random printks is probably a bad idea; this should only be done if debugging is enabled, and if the goal is debugging, it would be more useful if the code printed more information that might be useful to someone trying to debug the userspaceprogram (such as the inode numbers passed to the ioctl, in the above example). > + if (start + len > i_size_read(org_inode)) { > + printk(KERN_ERR "ext4 defrag: Adjust defrag length " > + " because it[%u] is larger than original " > + "file size\n", len); > + len = i_size_read(org_inode) - start; > + } The function ext4_defrag_check_arguments() is called before any locking is done. So you could potentially have a race where target inode gets truncated by another process between when the userspace program determined the parameters to the defrag ioctl and when the defrag ioctl starts executing. > +/** > + * ext4_defrag_lock_two - acquire two inodes' lock > + * > + * @first: inode structure to be locked first > + * @second: inode structure to be locked second > + * > + * Lock mutex and semaphore of the two inodes (@first and @second). First > + * @first is locked and then @second is locked. Note that we *must* set > + * arguments in order of small address. > + */ > +static void ext4_defrag_lock_two(struct inode *first, struct inode *second) This function only gets called once, so it doesn't matter that much, but you can reduce the stack space in the caller (ext4_defrag()) by doing the comparison in the function, and not saving the result. i.e., change this: + /* Lock the lower address inode first */ + if (ip > tip) { + ip = dest_inode; + tip = org_inode; + } + ext4_defrag_lock_two(ip, tip); to this: ext4_defrag_lock_two(org_inode, dest_inode); Note that you don't need to save ip and tip on the stack, because it is the _locking_ order which matters in order to prevent deadlocks. The order in which the two inodes are unlocked doesn't matter. So you can drop the ip and tip variables from ext4_defrag(), and just arrange to use a stable locking order in ext4_defrag_lock_two(). Personally, I'd prefer to use the inode number (target_inode->i_ino and donor_inode->i_ino) to provide the stable locking order, but using the address does work. The C language doesn't guarantee you can compare pointers that don't come from the same array, but it's not like the Linux kernel is likely to be ported to a architecture such as the Honeywell DPS8 (which ran Multics :-). - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html