Hello Dave. I agree with the idea of copy ext4 for now, and make it the right way later. However, I'd say that might be better to push overlay people to write down some documentation before we actually implement it, so we can have something to base the code on. Give me until the end of the week so I can test this patch and take a more deep look at the patch (considering overlay whiteout implementation), and also test it with overlay filesystem. On Wed, Feb 25, 2015 at 05:39:07PM +1100, Dave Chinner wrote: > ping^2? > > On Thu, Feb 19, 2015 at 03:31:56PM +1100, Dave Chinner wrote: > > ping? > > > > On Wed, Feb 11, 2015 at 10:15:30PM +1100, Dave Chinner wrote: > > > From: Dave Chinner <dchinner@xxxxxxxxxx> > > > > > > Add a basic implementation of RENAME_WHITEOUT to the XFS rename > > > code. The implementation options considered are documented in the > > > code comments; the method chose was "copy ext4" because we are then > > > bug-for-bug compatible with the implementation done by the > > > overlayfs developers. > > > > > > I have a hacky renameat2 test for whiteouts copied from the existing > > > renameat2 tests in xfstests, and this code behaves the same as ext4 > > > in that rename test. I haven't done any testing with overlayfs, so > > > who knows whether that explodes or not. > > > > > > The rename code is getting pretty spaghetti now - I'll end up > > > spliting this patching whiteout support and cleanup, and I'll set > > > what possible cleanups I can make that will help make the code a > > > little more understandable.... > > > > > > Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx> > > > --- > > > fs/xfs/xfs_inode.c | 261 +++++++++++++++++++++++++++++++++++++++++------------ > > > fs/xfs/xfs_iops.c | 2 +- > > > 2 files changed, 205 insertions(+), 58 deletions(-) > > > > > > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c > > > index bf2d2c7..eef5db7 100644 > > > --- a/fs/xfs/xfs_inode.c > > > +++ b/fs/xfs/xfs_inode.c > > > @@ -2683,17 +2683,20 @@ xfs_remove( > > > */ > > > STATIC void > > > xfs_sort_for_rename( > > > - xfs_inode_t *dp1, /* in: old (source) directory inode */ > > > - xfs_inode_t *dp2, /* in: new (target) directory inode */ > > > - xfs_inode_t *ip1, /* in: inode of old entry */ > > > - xfs_inode_t *ip2, /* in: inode of new entry, if it > > > - already exists, NULL otherwise. */ > > > - xfs_inode_t **i_tab,/* out: array of inode returned, sorted */ > > > - int *num_inodes) /* out: number of inodes in array */ > > > + struct xfs_inode *dp1, /* in: old (source) directory inode */ > > > + struct xfs_inode *dp2, /* in: new (target) directory inode */ > > > + struct xfs_inode *ip1, /* in: inode of old entry */ > > > + struct xfs_inode *ip2, /* in: inode of new entry */ > > > + struct xfs_inode *wino, /* in: whiteout inode */ > > > + struct xfs_inode **i_tab,/* out: sorted array of inodes */ > > > + int *num_inodes) /* out: inodes in array */ > > > { > > > xfs_inode_t *temp; > > > int i, j; > > > > > > + ASSERT(*num_inodes == 5); > > > + memset(i_tab, 0, *num_inodes * sizeof(struct xfs_inode *)); > > > + > > > /* > > > * i_tab contains a list of pointers to inodes. We initialize > > > * the table here & we'll sort it. We will then use it to > > > @@ -2701,20 +2704,19 @@ xfs_sort_for_rename( > > > * > > > * Note that the table may contain duplicates. e.g., dp1 == dp2. > > > */ > > > - i_tab[0] = dp1; > > > - i_tab[1] = dp2; > > > - i_tab[2] = ip1; > > > - if (ip2) { > > > - *num_inodes = 4; > > > - i_tab[3] = ip2; > > > - } else { > > > - *num_inodes = 3; > > > - i_tab[3] = NULL; > > > - } > > > + i = 0; > > > + i_tab[i++] = dp1; > > > + i_tab[i++] = dp2; > > > + i_tab[i++] = ip1; > > > + if (ip2) > > > + i_tab[i++] = ip2; > > > + if (wino) > > > + i_tab[i++] = wino; > > > + *num_inodes = i; > > > > > > /* > > > * Sort the elements via bubble sort. (Remember, there are at > > > - * most 4 elements to sort, so this is adequate.) > > > + * most 5 elements to sort, so this is adequate.) > > > */ > > > for (i = 0; i < *num_inodes; i++) { > > > for (j = 1; j < *num_inodes; j++) { > > > @@ -2846,6 +2848,101 @@ out: > > > } > > > > > > /* > > > + * RENAME_WHITEOUT support. > > > + * > > > + * Whiteouts are used by overlayfs - it has a convention that a whiteout is a > > > + * character device inode with a major:minor of 0:0. Somebody has to be in an > > > + * altered state of mind to think this up, so whiteout inodes from this point at > > > + * called "wino"s. > > > + * > > > + * Now, because it's not documented anywhere, here's what RENAME_WHITEOUT does > > > + * on ext4: > > > + > > > +# echo bar > /mnt/scratch/bar > > > +# ls -l /mnt/scratch > > > +total 24 > > > +-rw-r--r-- 1 root root 4 Feb 11 20:22 bar > > > +-rw-r--r-- 1 root root 4 Feb 11 20:22 foo > > > +drwx------ 2 root root 16384 Feb 11 20:18 lost+found > > > +# src/renameat2 -w /mnt/scratch/foo /mnt/scratch/bar > > > +# ls -l /mnt/scratch > > > +total 20 > > > +-rw-r--r-- 1 root root 4 Feb 11 20:22 bar > > > +c--------- 1 root root 0, 0 Feb 11 20:23 foo > > > +drwx------ 2 root root 16384 Feb 11 20:18 lost+found > > > +# cat /mnt/scratch/bar > > > +foo > > > +# > > > + > > > + * In XFS rename terms, the operation that has been done is that source (foo) > > > + * has been moved to the target (bar), which is like a nomal rename operation, > > > + * but rather than the source being removed, it have been replaced with a wino. > > > + * > > > + * We can't allocate winos within the rename transaction due to allocation > > > + * being a multi-commit transaction, and rename needs to be a single, atomic > > > + * commit. Hence we have several options here, form most efficient to least > > > + * efficient: > > > + * > > > + * - use DT_WHT in the target dirent and do no wino allocation. > > > + * The main issue with this approach is that we need hooks in > > > + * lookup to create a virtual chardev inode to present to userspace > > > + * and in places where we might need to modify the dirent e.g. unlink. > > > + * Overlayfs also needs to be taught about DT_WHT. Most invasive change, > > > + * lowest overhead. > > > + * > > > + * - create a special wino in the root directory (e.g. a ".wino" dirent and > > > + * then hardlink every new whiteout to it. This means we only need to > > > + * create a single wino, and rename simply creates a hardlink to it. We > > > + * can use DT_WHT for these, though using DT_CHR means we won't have to > > > + * modify overlayfs, nor anything in userspace. Downside is we have to > > > + * look up the wino up on every operation and create it if it doesn't > > > + * exist. > > > + * > > > + * - copy ext4: create a special whiteout chardev inode for every whiteout. > > > + * This is more complex than the above options because of the lack of > > > + * atomicity between inode creation and the rename operation, requiring > > > + * us to create a tmpfile inode and then linking it into the directory > > > + * structure during the rename. At least with a tmpfile inode crashes > > > + * between the create and rename doesn't leave unreferenced inodes or > > > + * directory pollution around. > > > + * > > > + * By far the simplest thing to do is copy ext4. It's also the most > > > + * inefficient way of supporting whiteouts, but as an initial implementation we > > > + * can simply reuse existing functions and add a small amount of extra code the > > > + * the rename operation to handle the *fifth* inode in the transaction. > > > + * > > > + * Hence that is what is implemented first. When we have time or need we can > > > + * come back and implement one of the more efficient whiteout methods, but it's > > > + * not necessary for the first implementation. > > > + */ > > > + > > > +/* > > > + * xfs_rename_get_wino() > > > + * > > > + * Return a referenced, unlinked, unlocked inode that that can be used as a > > > + * whiteout in a rename transaction. > > > + */ > > > +static int > > > +xfs_rename_get_wino( > > > + struct xfs_inode *dp, > > > + struct xfs_inode **wino) > > > +{ > > > + struct xfs_inode *tmpfile; > > > + int error; > > > + > > > + error = xfs_create_tmpfile(dp, NULL, S_IFCHR | WHITEOUT_MODE, &tmpfile); > > > + if (error) > > > + return error; > > > + > > > + /* Satisfy xfs_bumplink that this is a real tmpfile */ > > > + xfs_finish_inode_setup(tmpfile); > > > + VFS_I(tmpfile)->i_state |= I_LINKABLE; > > > + > > > + *wino = tmpfile; > > > + return 0; > > > +} > > > + > > > +/* > > > * xfs_rename > > > */ > > > int > > > @@ -2867,40 +2964,52 @@ xfs_rename( > > > xfs_fsblock_t first_block; > > > int cancel_flags; > > > int committed; > > > - xfs_inode_t *inodes[4]; > > > + xfs_inode_t *inodes[5]; > > > + int num_inodes = 5; > > > int spaceres; > > > - int num_inodes; > > > + struct xfs_inode *wino = NULL; > > > > > > trace_xfs_rename(src_dp, target_dp, src_name, target_name); > > > > > > + /* > > > + * If we are doing a whiteout operation, get us the wino we will be > > > + * placing at the target. > > > + */ > > > + if (flags & RENAME_WHITEOUT) { > > > + ASSERT(!(flags & (RENAME_NOREPLACE | RENAME_EXCHANGE))); > > > + error = xfs_rename_get_wino(target_dp, &wino); > > > + if (error) > > > + return error; > > > + > > > + /* setup target dirent info as whiteout */ > > > + src_name->type = XFS_DIR3_FT_CHRDEV; > > > + } > > > + > > > new_parent = (src_dp != target_dp); > > > src_is_directory = S_ISDIR(src_ip->i_d.di_mode); > > > > > > - xfs_sort_for_rename(src_dp, target_dp, src_ip, target_ip, > > > + xfs_sort_for_rename(src_dp, target_dp, src_ip, target_ip, wino, > > > inodes, &num_inodes); > > > > > > + cancel_flags = 0; > > > xfs_bmap_init(&free_list, &first_block); > > > tp = xfs_trans_alloc(mp, XFS_TRANS_RENAME); > > > - cancel_flags = XFS_TRANS_RELEASE_LOG_RES; > > > spaceres = XFS_RENAME_SPACE_RES(mp, target_name->len); > > > error = xfs_trans_reserve(tp, &M_RES(mp)->tr_rename, spaceres, 0); > > > if (error == -ENOSPC) { > > > spaceres = 0; > > > error = xfs_trans_reserve(tp, &M_RES(mp)->tr_rename, 0, 0); > > > } > > > - if (error) { > > > - xfs_trans_cancel(tp, 0); > > > - goto std_return; > > > - } > > > + if (error) > > > + goto error_trans_cancel; > > > + cancel_flags = XFS_TRANS_RELEASE_LOG_RES; > > > > > > /* > > > * Attach the dquots to the inodes > > > */ > > > error = xfs_qm_vop_rename_dqattach(inodes); > > > - if (error) { > > > - xfs_trans_cancel(tp, cancel_flags); > > > - goto std_return; > > > - } > > > + if (error) > > > + goto error_trans_cancel; > > > > > > /* > > > * Lock all the participating inodes. Depending upon whether > > > @@ -2921,6 +3030,8 @@ xfs_rename( > > > xfs_trans_ijoin(tp, src_ip, XFS_ILOCK_EXCL); > > > if (target_ip) > > > xfs_trans_ijoin(tp, target_ip, XFS_ILOCK_EXCL); > > > + if (wino) > > > + xfs_trans_ijoin(tp, wino, XFS_ILOCK_EXCL); > > > > > > /* > > > * If we are using project inheritance, we only allow renames > > > @@ -2930,18 +3041,19 @@ xfs_rename( > > > if (unlikely((target_dp->i_d.di_flags & XFS_DIFLAG_PROJINHERIT) && > > > (xfs_get_projid(target_dp) != xfs_get_projid(src_ip)))) { > > > error = -EXDEV; > > > - goto error_return; > > > + goto error_trans_cancel; > > > } > > > > > > /* > > > * Handle RENAME_EXCHANGE flags > > > */ > > > if (flags & RENAME_EXCHANGE) { > > > + ASSERT(!wino); > > > error = xfs_cross_rename(tp, src_dp, src_name, src_ip, > > > target_dp, target_name, target_ip, > > > &free_list, &first_block, spaceres); > > > if (error) > > > - goto abort_return; > > > + goto error_trans_abort; > > > goto finish_rename; > > > } > > > > > > @@ -2956,7 +3068,7 @@ xfs_rename( > > > if (!spaceres) { > > > error = xfs_dir_canenter(tp, target_dp, target_name); > > > if (error) > > > - goto error_return; > > > + goto error_trans_cancel; > > > } > > > /* > > > * If target does not exist and the rename crosses > > > @@ -2967,9 +3079,9 @@ xfs_rename( > > > src_ip->i_ino, &first_block, > > > &free_list, spaceres); > > > if (error == -ENOSPC) > > > - goto error_return; > > > + goto error_trans_cancel; > > > if (error) > > > - goto abort_return; > > > + goto error_trans_abort; > > > > > > xfs_trans_ichgtime(tp, target_dp, > > > XFS_ICHGTIME_MOD | XFS_ICHGTIME_CHG); > > > @@ -2977,7 +3089,7 @@ xfs_rename( > > > if (new_parent && src_is_directory) { > > > error = xfs_bumplink(tp, target_dp); > > > if (error) > > > - goto abort_return; > > > + goto error_trans_abort; > > > } > > > } else { /* target_ip != NULL */ > > > /* > > > @@ -2992,7 +3104,7 @@ xfs_rename( > > > if (!(xfs_dir_isempty(target_ip)) || > > > (target_ip->i_d.di_nlink > 2)) { > > > error = -EEXIST; > > > - goto error_return; > > > + goto error_trans_cancel; > > > } > > > } > > > > > > @@ -3009,7 +3121,7 @@ xfs_rename( > > > src_ip->i_ino, > > > &first_block, &free_list, spaceres); > > > if (error) > > > - goto abort_return; > > > + goto error_trans_abort; > > > > > > xfs_trans_ichgtime(tp, target_dp, > > > XFS_ICHGTIME_MOD | XFS_ICHGTIME_CHG); > > > @@ -3020,7 +3132,7 @@ xfs_rename( > > > */ > > > error = xfs_droplink(tp, target_ip); > > > if (error) > > > - goto abort_return; > > > + goto error_trans_abort; > > > > > > if (src_is_directory) { > > > /* > > > @@ -3028,9 +3140,9 @@ xfs_rename( > > > */ > > > error = xfs_droplink(tp, target_ip); > > > if (error) > > > - goto abort_return; > > > + goto error_trans_abort; > > > } > > > - } /* target_ip != NULL */ > > > + } > > > > > > /* > > > * Remove the source. > > > @@ -3045,7 +3157,7 @@ xfs_rename( > > > &first_block, &free_list, spaceres); > > > ASSERT(error != -EEXIST); > > > if (error) > > > - goto abort_return; > > > + goto error_trans_abort; > > > } > > > > > > /* > > > @@ -3071,13 +3183,21 @@ xfs_rename( > > > */ > > > error = xfs_droplink(tp, src_dp); > > > if (error) > > > - goto abort_return; > > > + goto error_trans_abort; > > > } > > > > > > - error = xfs_dir_removename(tp, src_dp, src_name, src_ip->i_ino, > > > + /* > > > + * On a whiteout, we only update the source dirent with the wino, > > > + * otherwise we are removing it. > > > + */ > > > + if (wino) { > > > + error = xfs_dir_replace(tp, src_dp, src_name, wino->i_ino, > > > + &first_block, &free_list, spaceres); > > > + } else > > > + error = xfs_dir_removename(tp, src_dp, src_name, src_ip->i_ino, > > > &first_block, &free_list, spaceres); > > > if (error) > > > - goto abort_return; > > > + goto error_trans_abort; > > > > > > xfs_trans_ichgtime(tp, src_dp, XFS_ICHGTIME_MOD | XFS_ICHGTIME_CHG); > > > xfs_trans_log_inode(tp, src_dp, XFS_ILOG_CORE); > > > @@ -3090,31 +3210,58 @@ finish_rename: > > > * rename transaction goes to disk before returning to > > > * the user. > > > */ > > > - if (mp->m_flags & (XFS_MOUNT_WSYNC|XFS_MOUNT_DIRSYNC)) { > > > + if (mp->m_flags & (XFS_MOUNT_WSYNC|XFS_MOUNT_DIRSYNC)) > > > xfs_trans_set_sync(tp); > > > - } > > > > > > error = xfs_bmap_finish(&tp, &free_list, &committed); > > > - if (error) { > > > - xfs_bmap_cancel(&free_list); > > > - xfs_trans_cancel(tp, (XFS_TRANS_RELEASE_LOG_RES | > > > - XFS_TRANS_ABORT)); > > > - goto std_return; > > > + if (error) > > > + goto error_trans_abort; > > > + > > > + /* > > > + * Last thing we do is bump the link count on the wino. This means that > > > + * failures all the way up to this point leave the wino on the unlinked > > > + * list and so cleanup is a simple matter of dropping the remaining > > > + * reference to it. If we fail here after bumping the link count, we're > > > + * shutting down the filesystem so we'll never see the intermediate > > > + * state on disk. > > > + */ > > > + if (wino) { > > > + ASSERT(wino->i_d.di_nlink == 0); > > > + error = xfs_bumplink(tp, wino); > > > + if (error) > > > + goto error_trans_abort; > > > + error = xfs_iunlink_remove(tp, wino); > > > + if (error) > > > + goto error_trans_abort; > > > + xfs_trans_log_inode(tp, wino, XFS_ILOG_CORE); > > > + > > > + /* > > > + * now we have a real link, clear the "I'm a tmpfile" state > > > + * flag from the inode so it doesn't accidentally get misused in > > > + * future. > > > + */ > > > + VFS_I(wino)->i_state &= ~I_LINKABLE; > > > } > > > > > > /* > > > * trans_commit will unlock src_ip, target_ip & decrement > > > * the vnode references. > > > */ > > > - return xfs_trans_commit(tp, XFS_TRANS_RELEASE_LOG_RES); > > > + error = xfs_trans_commit(tp, XFS_TRANS_RELEASE_LOG_RES); > > > +out_release_wino: > > > + if (wino) > > > + IRELE(wino); > > > + return error; > > > > > > - abort_return: > > > + > > > +error_trans_abort: > > > cancel_flags |= XFS_TRANS_ABORT; > > > - error_return: > > > xfs_bmap_cancel(&free_list); > > > +error_trans_cancel: > > > xfs_trans_cancel(tp, cancel_flags); > > > - std_return: > > > - return error; > > > + > > > + /* Dropping the last reference on a tmpfile does the cleanup for us! */ > > > + goto out_release_wino; > > > } > > > > > > STATIC int > > > diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c > > > index 6a77ea9..d4442d1 100644 > > > --- a/fs/xfs/xfs_iops.c > > > +++ b/fs/xfs/xfs_iops.c > > > @@ -393,7 +393,7 @@ xfs_vn_rename( > > > struct xfs_name oname; > > > struct xfs_name nname; > > > > > > - if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE)) > > > + if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT)) > > > return -EINVAL; > > > > > > /* if we are exchanging files, we need to set i_mode of both files */ > > > -- > > > 2.0.0 > > > > > > _______________________________________________ > > > xfs mailing list > > > xfs@xxxxxxxxxxx > > > http://oss.sgi.com/mailman/listinfo/xfs > > > > > > > -- > > Dave Chinner > > david@xxxxxxxxxxxxx > > > > _______________________________________________ > > xfs mailing list > > xfs@xxxxxxxxxxx > > http://oss.sgi.com/mailman/listinfo/xfs > > > > -- > Dave Chinner > david@xxxxxxxxxxxxx > > _______________________________________________ > xfs mailing list > xfs@xxxxxxxxxxx > http://oss.sgi.com/mailman/listinfo/xfs -- Carlos _______________________________________________ xfs mailing list xfs@xxxxxxxxxxx http://oss.sgi.com/mailman/listinfo/xfs