Thanks for the review Andreas. On Sat, Mar 28, 2020 at 5:01 PM Andreas Dilger <adilger@xxxxxxxxx> wrote: > > On Mar 25, 2020, at 3:37 AM, Harshad Shirwadkar <harshadshirwadkar@xxxxxxxxx> wrote: > > > > This patch adds shrinking support for htree based directories. The > > high level algorithm is as follows: > > > > * If after dentry removal the dirent block (let's call it B) becomes > > empty, then remove its references in its dx parent. > > * Swap its contents with that of the last block (L) in directory. > > * Update L's parents to point to B instead. > > * Remove L > > * Repeat this for all the ancestors of B. > > > > We add variants of dx_probe that allow us perform reverse lookups from > > a logical block to its dx parents. > > > > Ran kvm-xfstests smoke and verified that no new failures are > > introduced. Ran shrinking for directories with following number of > > files and then deleted files one by one: > > * 1000 (size before deletion 36K, after deletion 4K) > > * 10000 (size before deletion 196K, after deletion 4K) > > * 100000 (size before deletion 2.1M, after deletion 4K) > > * 200000 (size before deletion 4.2M, after deletion 4K) > > > > In all cases directory shrunk significantly. We fallback to linear > > directories if the directory becomes empty. > > > > But note that most of the shrinking happens during last 1-2% deletions > > in an average case. Therefore, the next step here is to merge dx nodes > > when possible. That can be achieved by storing the fullness index in > > htree nodes. But that's an on-disk format change. We can instead build > > on tooling added by this patch to perform reverse lookup on a dx > > node and then reading adjacent nodes to check their fullness. > > > > This patch supersedes the other directory shrinking patch sent in Aug > > 2019 ("ext4: attempt to shrink directory on dentry removal"). > > > > Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@xxxxxxxxx> > > --- > > fs/ext4/ext4_jbd2.h | 7 + > > fs/ext4/namei.c | 355 ++++++++++++++++++++++++++++++++++++++++++-- > > 2 files changed, 353 insertions(+), 9 deletions(-) > > > > diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c > > index d567b9589875..b78c6f9a6eba 100644 > > --- a/fs/ext4/namei.c > > +++ b/fs/ext4/namei.c > > > > +/* > > + * dx_probe with relaxed checks. This function is used in the directory > > + * shrinking code since we can run into intermediate states where we have > > + * internal dx nodes with count = 0. > > + */ > > +static inline struct dx_frame * > > +dx_probe_relaxed(struct ext4_filename *fname, struct inode *dir, > > + struct dx_hash_info *hinfo, struct dx_frame *frame_in) > > +{ > > + return __dx_probe(fname, dir, hinfo, frame_in, 0, false); > > +} > > + > > +/* > > + * Perform only a parttial dx_probe until we find block end_lblk. > > (typo) "partial" Ack > > > +static inline struct dx_frame * > > +dx_probe_partial(struct ext4_filename *fname, struct inode *dir, > > + struct dx_hash_info *hinfo, struct dx_frame *frame_in, > > + ext4_lblk_t end_lblk) > > +{ > > + return __dx_probe(fname, dir, hinfo, frame_in, end_lblk, false); > > +} > > + > [snip] Ack > > +/* > > + * This function tries to remove the entry of a dirent block (which was just > > + * emptied by the caller) from the dx frame. It does so by reducing the count by > > + * 1 and left shifting all the entries after the deleted entry. > > + */ > > +int > > +ext4_remove_dx_entry(handle_t *handle, struct inode *dir, > > + struct dx_frame *dx_frame) > > +{ > > + err = ext4_journal_get_write_access(handle, dx_frame->bh); > > + if (err) { > > + ext4_std_error(dir->i_sb, err); > > + return -EINVAL; > > + } > > + > > + for (; i < count - 1; i++) > > + entries[i] = entries[i + 1]; > > It would be more efficient to do this with "memmove()" rather than copying > each entry separately. Ack, implemented in V2. > > > + /* > > + * If i was 0 when we began above loop, we would have overwritten count > > + * and limit values since those values live in dx_entry->hash of the > > + * first entry. We need to update count but we should set limit as well. > > + */ > > + dx_set_count(entries, count - 1); > > + dx_set_limit(entries, limit); > > How hard is it to avoid clobbering these fields in the first place? > I'm just thinking that "clobber + fixup" is subject to race conditions > at various times in the past, and may become an issue in the future > (e.g. with parallel directory operations). > Ack, I agree better to not clobber in the first place. Implemented in V2. > > static inline bool is_empty_dirent_block(struct inode *dir, > > + struct buffer_head *bh) > > +{ > > This should be combined with ext4_empty_dir() to avoid code duplication. Thanks, this also simplifies ext4_empty_dir(). > > > + struct ext4_dir_entry_2 *de = (struct ext4_dir_entry_2 *)bh->b_data; > > + int csum_size = 0; > > + > > + if (ext4_has_metadata_csum(dir->i_sb) && is_dx(dir)) > > + csum_size = sizeof(struct ext4_dir_entry_tail); > > + > > + return ext4_rec_len_from_disk(de->rec_len, dir->i_sb->s_blocksize) == > > + dir->i_sb->s_blocksize - csum_size && de->inode == 0; > > +} > > This looks like a low cost way to determine the leaf block is empty, > but checking this for every unlink likely has a non-zero cost. Ack > > > @@ -2530,6 +2864,9 @@ static int ext4_delete_entry(handle_t *handle, > > if (unlikely(err)) > > goto out; > > > > + if (is_dx(dir)) > > + ext4_try_dir_shrink(handle, dir, lblk, bh); > > + > > return 0; > > It would be useful to run a comparison benchmark between the patched ext4 > and unpatched when deleting a large number of entries that checks both CPU > usage and performance. That will give us an idea of how much this costs > to be checked for every entry. CPU performance wasn't that different fo both the cases, but there are some differences in overall unlink performance. I added a performance evaluation section in V2. > > Also, rather than calling ext4_try_dir_shrink() and is_empty_dirent_block() > for every entry, couldn't this be returned from ext4_generic_delete_entry(), > since it has that information already. Thanks, implemented this in V2. Thanks, Harshad > > Cheers, Andreas > > > > >