On Mon 24-02-14 18:12:38, Miklos Szeredi wrote: > On Thu, Feb 13, 2014 at 09:28:50PM +0100, Miklos Szeredi wrote: > > On Thu, Feb 13, 2014 at 8:32 PM, Linus Torvalds > > <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote: > > > > (I do think we should allow creation - but for root only - for > > > management and testing purposes, but I really think it's a secondary > > > issue, and I do think we should literally use "mknod()" - either with > > > a new S_IFWHT or even just making use of existing S_IFCHR just so you > > > could use the user-space "mknod" to create it with some magic > > > major/minor combination. > > > > And IMO the magic S_IFCHR is a lot better in many respects than a new > > filetype, since now all backup tools automatically work. And I think > > that's a lot more important than looking like a nice new design. > > Sure, if S_IFWHT was there from the start, it would be wonderful. But > > as it stands, it's a lot more difficult to add support for such a > > thing to userspace than adding a hack, using the existing intefaces, > > to the kernel. > > And here's a patch to actually implement the "rename and whiteout source" > operation. > > Jan, I suspect that the "credits" calculation for the number of journal > blocks is excessive, since here we don't actually need to create the > directory entry, only create the inode and populate an already existing > entry. But I don't undestand the logic of ext4 enough to see what's the > right number to use here. Yes. It should be enough to reserve (EXT4_MAXQUOTAS_TRANS_BLOCKS(sb) + EXT4_XATTR_TRANS_BLOCKS + 4) blocks (for inode block, sb block, group summaries, and inode bitmap). Please add this summary to the comment before the estimate. > And another ext4 specific question: I added ext4_setent(old, whiteout) before > ext4_add_entry(new, old.inode) because I suspect from the comment in > ext4_rename_delete() that ext4_add_entry() invalidates the looked up entry. Is > that correct? Yes, that is correct. As soon as you add any entry into a directory, any looked up position in that directory may be invalid (the directory is implemented as a tree and addition can trigger balancing which moves things around in the directory). > Are there any other traps in this area to look out for? :-) Hum, no I cannot remember any. Honza > ---- > Subject: vfs: add RENAME_WHITEOUT > From: Miklos Szeredi <mszeredi@xxxxxxx> > > This adds a new RENAME_WHITEOUT flag. This flag makes rename() create a > whiteout of source. The whiteout creation is atomic relative to the > rename. > > As Linus' suggestion, a whiteout is represented as a dummy char device. > This patch uses the 0/0 device number, but the actual number doesn't matter > as long as it doesn't conflict with a real device. > > Signed-off-by: Miklos Szeredi <mszeredi@xxxxxxx> > --- > fs/ext4/namei.c | 69 ++++++++++++++++++++++++++++++++++++++---------- > fs/namei.c | 8 ++++- > include/linux/fs.h | 7 ++++ > include/uapi/linux/fs.h | 1 > 4 files changed, 70 insertions(+), 15 deletions(-) > > --- a/fs/ext4/namei.c > +++ b/fs/ext4/namei.c > @@ -3157,7 +3157,8 @@ static void ext4_update_dir_count(handle > * This comes from rename(const char *oldpath, const char *newpath) > */ > static int ext4_plain_rename(struct inode *old_dir, struct dentry *old_dentry, > - struct inode *new_dir, struct dentry *new_dentry) > + struct inode *new_dir, struct dentry *new_dentry, > + unsigned int flags) > { > handle_t *handle = NULL; > struct ext4_renament old = { > @@ -3171,6 +3172,9 @@ static int ext4_plain_rename(struct inod > .inode = new_dentry->d_inode, > }; > int retval; > + struct inode *whiteout = NULL; > + int credits; > + u8 old_file_type; > > dquot_initialize(old.dir); > dquot_initialize(new.dir); > @@ -3202,11 +3206,33 @@ static int ext4_plain_rename(struct inod > if (new.inode && !test_opt(new.dir->i_sb, NO_AUTO_DA_ALLOC)) > ext4_alloc_da_blocks(old.inode); > > - handle = ext4_journal_start(old.dir, EXT4_HT_DIR, > - (2 * EXT4_DATA_TRANS_BLOCKS(old.dir->i_sb) + > - EXT4_INDEX_EXTRA_TRANS_BLOCKS + 2)); > - if (IS_ERR(handle)) > - return PTR_ERR(handle); > + credits = (2 * EXT4_DATA_TRANS_BLOCKS(old.dir->i_sb) + > + EXT4_INDEX_EXTRA_TRANS_BLOCKS + 2); > + if (!(flags & RENAME_WHITEOUT)) { > + handle = ext4_journal_start(old.dir, EXT4_HT_DIR, credits); > + if (IS_ERR(handle)) > + return PTR_ERR(handle); > + } else { > + int retries = 0; > + > + credits += (EXT4_DATA_TRANS_BLOCKS(old.dir->i_sb) + > + EXT4_INDEX_EXTRA_TRANS_BLOCKS + 3); > +retry: > + whiteout = ext4_new_inode_start_handle(old.dir, S_IFCHR | WHITEOUT_MODE, &old.dentry->d_name, 0, NULL, EXT4_HT_DIR, credits); > + handle = ext4_journal_current_handle(); > + if (IS_ERR(whiteout)) { > + if (handle) > + ext4_journal_stop(handle); > + retval = PTR_ERR(whiteout); > + if (retval == -ENOSPC && > + ext4_should_retry_alloc(old.dir->i_sb, &retries)) > + goto retry; > + > + return retval; > + } > + init_special_inode(whiteout, whiteout->i_mode, WHITEOUT_DEV); > + whiteout->i_op = &ext4_special_inode_operations; > + } > > if (IS_DIRSYNC(old.dir) || IS_DIRSYNC(new.dir)) > ext4_handle_sync(handle); > @@ -3225,13 +3251,21 @@ static int ext4_plain_rename(struct inod > if (retval) > goto end_rename; > } > + old_file_type = old.de->file_type; > + if (whiteout) { > + retval = ext4_setent(handle, &old, whiteout->i_ino, > + EXT4_FT_CHRDEV); > + if (retval) > + goto end_rename; > + ext4_mark_inode_dirty(handle, whiteout); > + } > if (!new.bh) { > retval = ext4_add_entry(handle, new.dentry, old.inode); > if (retval) > goto end_rename; > } else { > retval = ext4_setent(handle, &new, > - old.inode->i_ino, old.de->file_type); > + old.inode->i_ino, old_file_type); > if (retval) > goto end_rename; > } > @@ -3243,10 +3277,12 @@ static int ext4_plain_rename(struct inod > old.inode->i_ctime = ext4_current_time(old.inode); > ext4_mark_inode_dirty(handle, old.inode); > > - /* > - * ok, that's it > - */ > - ext4_rename_delete(handle, &old); > + if (!whiteout) { > + /* > + * ok, that's it > + */ > + ext4_rename_delete(handle, &old); > + } > > if (new.inode) { > ext4_dec_count(handle, new.inode); > @@ -3282,6 +3318,12 @@ static int ext4_plain_rename(struct inod > brelse(old.dir_bh); > brelse(old.bh); > brelse(new.bh); > + if (whiteout) { > + if (retval) > + drop_nlink(whiteout); > + unlock_new_inode(whiteout); > + iput(whiteout); > + } > if (handle) > ext4_journal_stop(handle); > return retval; > @@ -3407,7 +3449,7 @@ static int ext4_rename(struct inode *old > struct inode *new_dir, struct dentry *new_dentry, > unsigned int flags) > { > - if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE)) > + if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT)) > return -EINVAL; > > if (flags & RENAME_EXCHANGE) { > @@ -3418,7 +3460,8 @@ static int ext4_rename(struct inode *old > * Existence checking was done by the VFS, otherwise "RENAME_NOREPLACE" > * is equivalent to regular rename. > */ > - return ext4_plain_rename(old_dir, old_dentry, new_dir, new_dentry); > + return ext4_plain_rename(old_dir, old_dentry, new_dir, new_dentry, > + flags); > } > > /* > --- a/fs/namei.c > +++ b/fs/namei.c > @@ -4165,12 +4165,16 @@ SYSCALL_DEFINE5(renameat2, int, olddfd, > bool should_retry = false; > int error; > > - if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE)) > + if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT)) > return -EINVAL; > > - if ((flags & RENAME_NOREPLACE) && (flags & RENAME_EXCHANGE)) > + if ((flags & (RENAME_NOREPLACE | RENAME_WHITEOUT)) && > + (flags & RENAME_EXCHANGE)) > return -EINVAL; > > + if ((flags & RENAME_WHITEOUT) && !capable(CAP_MKNOD)) > + return -EPERM; > + > retry: > from = user_path_parent(olddfd, oldname, &oldnd, lookup_flags); > if (IS_ERR(from)) { > --- a/include/uapi/linux/fs.h > +++ b/include/uapi/linux/fs.h > @@ -37,6 +37,7 @@ > > #define RENAME_NOREPLACE (1 << 0) /* Don't overwrite target */ > #define RENAME_EXCHANGE (1 << 1) /* Exchange source and dest */ > +#define RENAME_WHITEOUT (1 << 2) /* Whiteout source */ > > struct fstrim_range { > __u64 start; > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -215,6 +215,13 @@ typedef void (dio_iodone_t)(struct kiocb > #define ATTR_TIMES_SET (1 << 16) > > /* > + * Whiteout is represented by a char device. The following constants define the > + * mode and device number to use. > + */ > +#define WHITEOUT_MODE 0 > +#define WHITEOUT_DEV 0 > + > +/* > * This is the Inode Attributes structure, used for notify_change(). It > * uses the above definitions as flags, to know which values have changed. > * Also, in this manner, a Filesystem can look at only the values it cares -- Jan Kara <jack@xxxxxxx> SUSE Labs, CR -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html