On Fri 07-10-22 02:16:18, Ojaswin Mujoo wrote:
> Currently, the kernel uses i_prealloc_list to hold all the inode
> preallocations. This is known to cause degradation in performance in
> workloads which perform a large number of sparse writes on a single
> file. This is mainly because functions like ext4_mb_normalize_request()
> and ext4_mb_use_preallocated() iterate over this complete list,
> resulting in slowdowns when a large number of PAs are present.
>
> Patch 27bc446e2 partially fixed this by enforcing a limit of 512 for
> the inode preallocation list and adding logic to continually trim the
> list if it grows above the threshold; however, our testing revealed
> that a hardcoded value is not suitable for all kinds of workloads.
>
> To optimize this, add an rbtree to the inode and hold the inode
> preallocations in this rbtree. This will make iterating over inode PAs
> faster and scale much better than a linked list. Additionally, we also
> had to remove the LRU logic that was added during trimming of the list
> (in ext4_mb_release_context()) as it would add extra overhead to the
> rbtree. The discards now happen in lowest-logical-offset-first order.
>
> ** Locking notes **
>
> With the introduction of the rbtree to maintain inode PAs, we can't use
> RCU to walk the tree for searching, since that can result in partial
> traversals which might miss some nodes (or entire subtrees) while
> discards happen in parallel (which happens under a lock). Hence this
> patch converts the ei->i_prealloc_lock spin_lock to a rw_lock.
>
> Almost all the codepaths that read/modify the PA rbtrees are protected
> by the higher level inode->i_data_sem (except
> ext4_mb_discard_group_preallocations() and ext4_clear_inode()). IIUC,
> the only place we need lock protection is when one thread is reading
> ("searching") the PA rbtree (earlier protected under rcu_read_lock())
> and another is "deleting" the PAs in
> ext4_mb_discard_group_preallocations() (which iterates over all the PAs
> using the grp->bb_prealloc_list and deletes PAs from the tree without
> taking any inode lock (i_data_sem)).
>
> So, this patch converts all rcu_read_lock/unlock() paths for the inode
> PA list to use read_lock(), and all places where we were using the
> ei->i_prealloc_lock spinlock now use write_lock().
>
> Note that this makes the fast path (searching for the right PA, e.g. in
> ext4_mb_use_preallocated() or ext4_mb_normalize_request()) use
> read_lock() instead of rcu_read_lock/unlock(). The fast path can now
> also block on the slow discard path
> (ext4_mb_discard_group_preallocations()), which uses write_lock().
>
> But this is not as bad as it looks. This is because:
>
> 1. The slow path only occurs when the normal allocation fails and we
>    can say that we are low on disk space. One can argue this scenario
>    won't be very frequent.
>
> 2. ext4_mb_discard_group_preallocations() locks and unlocks the rwlock
>    for deleting every individual PA. This gives enough opportunity for
>    the fast path to acquire the read_lock for searching the PA inode
>    list.
>
> Signed-off-by: Ojaswin Mujoo <ojaswin@xxxxxxxxxxxxx>
> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@xxxxxxxxx>

Looks mostly good to me now. Just three nits below. With those fixes
feel free to add:

Reviewed-by: Jan Kara <jack@xxxxxxx>
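To make the locking change described in the commit message concrete, the
new scheme boils down to roughly the pattern below. This is only an
illustrative sketch, not the patch itself: pa_sketch, pa_tree, pa_lookup()
and pa_discard_one() are invented stand-ins, and the plain integer types
replace ext4_lblk_t and the EXT4_C2B() conversion used by the real code.
Only the read_lock()/write_lock() usage mirrors what the patch does.

/*
 * Simplified stand-in for the new scheme (not the actual patch code):
 * inode PAs live in an rbtree keyed by their logical start, readers
 * take the per-inode lock as a reader (previously rcu_read_lock()),
 * and the discard path takes it as a writer per PA.
 */
#include <linux/rbtree.h>
#include <linux/spinlock.h>	/* rwlock_t, read_lock(), write_lock() */

struct pa_sketch {			/* stands in for ext4_prealloc_space */
	struct rb_node	node;		/* pa_node.inode_node in the patch */
	unsigned int	lstart;		/* pa_lstart (ext4_lblk_t in reality) */
	unsigned int	len;		/* blocks covered (EXT4_C2B() elided) */
	int		deleted;	/* pa_deleted */
};

struct pa_tree {			/* stands in for the ext4_inode_info bits */
	struct rb_root	root;		/* ei->i_prealloc_node */
	rwlock_t	lock;		/* ei->i_prealloc_lock, now an rwlock */
};

/* Fast path: search for a PA covering @lblk under read_lock(). */
static struct pa_sketch *pa_lookup(struct pa_tree *t, unsigned int lblk)
{
	struct rb_node *iter;
	struct pa_sketch *pa, *found = NULL;

	read_lock(&t->lock);
	for (iter = t->root.rb_node; iter; ) {
		pa = rb_entry(iter, struct pa_sketch, node);
		if (lblk < pa->lstart) {
			iter = iter->rb_left;
		} else if (lblk >= pa->lstart + pa->len) {
			iter = iter->rb_right;
		} else {
			/* the real code keeps walking past deleted PAs */
			if (!pa->deleted)
				found = pa;
			break;
		}
	}
	read_unlock(&t->lock);
	return found;
}

/*
 * Slow path: the group discard takes and drops the write lock for each
 * PA it removes, so readers can get in between individual deletions.
 */
static void pa_discard_one(struct pa_tree *t, struct pa_sketch *pa)
{
	write_lock(&t->lock);
	rb_erase(&pa->node, &t->root);
	write_unlock(&t->lock);
}

The quoted hunks below show how the patch applies this pattern in
ext4_mb_pa_adjust_overlap() and ext4_mb_use_preallocated().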
> @@ -4031,19 +4054,27 @@ ext4_mb_pa_adjust_overlap(struct ext4_allocation_context *ac,
>  	new_end = *end;
>
>  	/* check we don't cross already preallocated blocks */
> -	rcu_read_lock();
> -	list_for_each_entry_rcu(tmp_pa, &ei->i_prealloc_list, pa_node.inode_list) {
> -		if (tmp_pa->pa_deleted)
> +	read_lock(&ei->i_prealloc_lock);
> +	for (iter = ei->i_prealloc_node.rb_node; iter;
> +	     iter = ext4_mb_pa_rb_next_iter(new_start, tmp_pa_start, iter)) {
> +		tmp_pa = rb_entry(iter, struct ext4_prealloc_space,
> +				  pa_node.inode_node);
> +		tmp_pa_start = tmp_pa->pa_lstart;
> +		tmp_pa_end = tmp_pa->pa_lstart + EXT4_C2B(sbi, tmp_pa->pa_len);
> +
> +		/*
> +		 * If pa is deleted, ignore overlaps and just iterate in rbtree
> +		 * based on tmp_pa_start
> +		 */
> +		if (tmp_pa->pa_deleted) {
>  			continue;
> +		}

Curly braces here are pointless.

> @@ -4408,17 +4439,21 @@ ext4_mb_use_preallocated(struct ext4_allocation_context *ac)
>  		return false;
>
>  	/* first, try per-file preallocation */
> -	rcu_read_lock();
> -	list_for_each_entry_rcu(tmp_pa, &ei->i_prealloc_list, pa_node.inode_list) {
> +	read_lock(&ei->i_prealloc_lock);
> +	for (iter = ei->i_prealloc_node.rb_node; iter;
> +	     iter = ext4_mb_pa_rb_next_iter(ac->ac_o_ex.fe_logical, tmp_pa_start, iter)) {
> +		tmp_pa = rb_entry(iter, struct ext4_prealloc_space, pa_node.inode_node);
>
>  		/* all fields in this condition don't change,
>  		 * so we can skip locking for them */
>  		tmp_pa_start = tmp_pa->pa_lstart;
>  		tmp_pa_end = tmp_pa->pa_lstart + EXT4_C2B(sbi, tmp_pa->pa_len);
>
> +		/* original request start doesn't lie in this PA */
>  		if (ac->ac_o_ex.fe_logical < tmp_pa_start ||
> -		    ac->ac_o_ex.fe_logical >= tmp_pa_end)
> +		    ac->ac_o_ex.fe_logical >= tmp_pa_end) {
>  			continue;
> +		}

Again, curly braces here are pointless.

> +static void ext4_mb_rb_insert(struct rb_root *root, struct rb_node *new,
> +			      int (*cmp)(struct rb_node *, struct rb_node *))
> +{
> +	struct rb_node **iter = &root->rb_node, *parent = NULL;
> +
> +	while (*iter) {
> +		parent = *iter;
> +		if (cmp(new, *iter) < 0)
> +			iter = &((*iter)->rb_left);
> +		else
> +			iter = &((*iter)->rb_right);
> +	}
> +
> +	rb_link_node(new, parent, iter);
> +	rb_insert_color(new, root);
> +}

I think I wrote it already last time: ext4_mb_rb_insert() is always
called with ext4_mb_pa_cmp() as the comparison function. Furthermore,
ext4_mb_pa_cmp() is used nowhere else. So I'd just opencode
ext4_mb_pa_cmp() in ext4_mb_rb_insert() and get rid of the indirect
call. Better for speed as well as readability.

								Honza
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR
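For reference, opencoding the comparison as suggested could look roughly
like the sketch below. It assumes the quoted patch's context in
fs/ext4/mballoc.c (struct ext4_prealloc_space with the pa_node.inode_node
member) and assumes ext4_mb_pa_cmp() orders PAs by pa_lstart, as the rest
of the quoted code suggests; it is an illustration, not a tested
replacement.

static void ext4_mb_rb_insert(struct rb_root *root, struct rb_node *new)
{
	struct ext4_prealloc_space *iter_pa, *new_pa;
	struct rb_node **iter = &root->rb_node, *parent = NULL;

	new_pa = rb_entry(new, struct ext4_prealloc_space,
			  pa_node.inode_node);

	while (*iter) {
		iter_pa = rb_entry(*iter, struct ext4_prealloc_space,
				   pa_node.inode_node);
		parent = *iter;
		/* was ext4_mb_pa_cmp(new, *iter): order by pa_lstart */
		if (new_pa->pa_lstart < iter_pa->pa_lstart)
			iter = &((*iter)->rb_left);
		else
			iter = &((*iter)->rb_right);
	}

	rb_link_node(new, parent, iter);
	rb_insert_color(new, root);
}

With the cmp argument gone, the single caller would simply pass the root
and the new node, and ext4_mb_pa_cmp() could be dropped entirely.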