On Mar 25, 2020, at 3:37 AM, Harshad Shirwadkar <harshadshirwadkar@xxxxxxxxx> wrote:
>
> This patch adds shrinking support for htree based directories. The
> high level algorithm is as follows:
>
> * If after dentry removal the dirent block (let's call it B) becomes
>   empty, then remove its references in its dx parent.
> * Swap its contents with that of the last block (L) in the directory.
> * Update L's parents to point to B instead.
> * Remove L.
> * Repeat this for all the ancestors of B.
>
> We add variants of dx_probe that allow us to perform reverse lookups
> from a logical block to its dx parents.
>
> Ran kvm-xfstests smoke and verified that no new failures are
> introduced. Ran shrinking for directories with the following numbers of
> files, deleting the files one by one:
> * 1000 (size before deletion 36K, after deletion 4K)
> * 10000 (size before deletion 196K, after deletion 4K)
> * 100000 (size before deletion 2.1M, after deletion 4K)
> * 200000 (size before deletion 4.2M, after deletion 4K)
>
> In all cases the directory shrank significantly. We fall back to linear
> directories if the directory becomes empty.
>
> Note, however, that in the average case most of the shrinking happens
> during the last 1-2% of deletions. Therefore, the next step here is to
> merge dx nodes when possible. That can be achieved by storing a
> fullness index in htree nodes, but that is an on-disk format change.
> We can instead build on the tooling added by this patch to perform a
> reverse lookup on a dx node and then read adjacent nodes to check
> their fullness.
>
> This patch supersedes the other directory shrinking patch sent in Aug
> 2019 ("ext4: attempt to shrink directory on dentry removal").
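The swap-with-last-block step described above can be sketched with a toy in-memory model. This is only an illustration under assumed names (`dir_model`, `shrink_swap_last`, plain arrays standing in for dirent blocks and dx-parent pointers); it is not ext4 code:

```c
#include <assert.h>
#include <string.h>

#define NBLK  8   /* capacity of the toy directory, in blocks */
#define BLKSZ 16  /* toy "block size" */

/*
 * Toy model: blocks[] holds dirent-block contents, parent[] records
 * which dx-parent entry points at each block. Not ext4 structures.
 */
struct dir_model {
	char blocks[NBLK][BLKSZ];
	int  parent[NBLK];
	int  nblocks;
};

/*
 * Block b just became empty: move the last block L into b's slot,
 * repoint L's parent at b, and shrink the directory by one block.
 * This mirrors the B/L swap in the algorithm description above.
 */
static void shrink_swap_last(struct dir_model *d, int b)
{
	int last = d->nblocks - 1;

	if (b != last) {
		memcpy(d->blocks[b], d->blocks[last], BLKSZ);
		d->parent[b] = d->parent[last];
	}
	d->nblocks--;
}
```

The real patch additionally has to walk the dx ancestors of B and repeat the step, which this sketch omits.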
>
> Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@xxxxxxxxx>
> ---
>  fs/ext4/ext4_jbd2.h |   7 +
>  fs/ext4/namei.c     | 355 ++++++++++++++++++++++++++++++++++++++++++--
>  2 files changed, 353 insertions(+), 9 deletions(-)
>
> diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
> index d567b9589875..b78c6f9a6eba 100644
> --- a/fs/ext4/namei.c
> +++ b/fs/ext4/namei.c
>
> +/*
> + * dx_probe with relaxed checks. This function is used in the directory
> + * shrinking code since we can run into intermediate states where we have
> + * internal dx nodes with count = 0.
> + */
> +static inline struct dx_frame *
> +dx_probe_relaxed(struct ext4_filename *fname, struct inode *dir,
> +		 struct dx_hash_info *hinfo, struct dx_frame *frame_in)
> +{
> +	return __dx_probe(fname, dir, hinfo, frame_in, 0, false);
> +}
> +
> +/*
> + * Perform only a parttial dx_probe until we find block end_lblk.
> + */

(typo) "partial"

> +static inline struct dx_frame *
> +dx_probe_partial(struct ext4_filename *fname, struct inode *dir,
> +		 struct dx_hash_info *hinfo, struct dx_frame *frame_in,
> +		 ext4_lblk_t end_lblk)
> +{
> +	return __dx_probe(fname, dir, hinfo, frame_in, end_lblk, false);
> +}
> +

[snip]

> +/*
> + * This function tries to remove the entry of a dirent block (which was just
> + * emptied by the caller) from the dx frame. It does so by reducing the count by
> + * 1 and left shifting all the entries after the deleted entry.
> + */
> +int
> +ext4_remove_dx_entry(handle_t *handle, struct inode *dir,
> +		     struct dx_frame *dx_frame)
> +{
> +	err = ext4_journal_get_write_access(handle, dx_frame->bh);
> +	if (err) {
> +		ext4_std_error(dir->i_sb, err);
> +		return -EINVAL;
> +	}
> +
> +	for (; i < count - 1; i++)
> +		entries[i] = entries[i + 1];

It would be more efficient to do this with memmove() rather than copying
each entry separately.
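For illustration, the memmove()-based removal suggested above could look like the following. The struct and function names here are stand-ins for the example, not the on-disk `dx_entry` layout:

```c
#include <assert.h>
#include <string.h>

/* Toy stand-in for a dx entry: a (hash, block) pair. */
struct dx_entry_model {
	unsigned int hash;
	unsigned int block;
};

/*
 * Remove entries[i] by shifting the whole tail left in a single
 * memmove() call instead of copying each entry in a loop.
 * memmove() is required (rather than memcpy()) because the source
 * and destination ranges overlap.
 */
static void remove_entry(struct dx_entry_model *entries, int count, int i)
{
	memmove(&entries[i], &entries[i + 1],
		(size_t)(count - 1 - i) * sizeof(entries[0]));
}
```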
> +	/*
> +	 * If i was 0 when we began above loop, we would have overwritten count
> +	 * and limit values since those values live in dx_entry->hash of the
> +	 * first entry. We need to update count but we should set limit as well.
> +	 */
> +	dx_set_count(entries, count - 1);
> +	dx_set_limit(entries, limit);

How hard is it to avoid clobbering these fields in the first place? I'm
just thinking that "clobber + fixup" has been subject to race conditions
at various times in the past, and may become an issue in the future
(e.g. with parallel directory operations).

> +static inline bool is_empty_dirent_block(struct inode *dir,
> +					 struct buffer_head *bh)
> +{

This should be combined with ext4_empty_dir() to avoid code duplication.

> +	struct ext4_dir_entry_2 *de = (struct ext4_dir_entry_2 *)bh->b_data;
> +	int csum_size = 0;
> +
> +	if (ext4_has_metadata_csum(dir->i_sb) && is_dx(dir))
> +		csum_size = sizeof(struct ext4_dir_entry_tail);
> +
> +	return ext4_rec_len_from_disk(de->rec_len, dir->i_sb->s_blocksize) ==
> +		dir->i_sb->s_blocksize - csum_size && de->inode == 0;
> +}

This looks like a low-cost way to determine that the leaf block is empty,
but checking this for every unlink likely has a non-zero cost.

> @@ -2530,6 +2864,9 @@ static int ext4_delete_entry(handle_t *handle,
> 	if (unlikely(err))
> 		goto out;
>
> +	if (is_dx(dir))
> +		ext4_try_dir_shrink(handle, dir, lblk, bh);
> +
> 	return 0;

It would be useful to run a comparison benchmark between the patched and
unpatched ext4 when deleting a large number of entries, measuring both
CPU usage and performance. That would give us an idea of how much this
per-entry check costs.

Also, rather than calling ext4_try_dir_shrink() and
is_empty_dirent_block() for every entry, couldn't this be returned from
ext4_generic_delete_entry(), since it has that information already?

Cheers, Andreas