Hi Ojaswin! I'm sorry for the somewhat delayed reply...

On Thu 19-01-23 11:57:25, Ojaswin Mujoo wrote:
> On Tue, Jan 17, 2023 at 12:03:35PM +0100, Jan Kara wrote:
> > On Tue 17-01-23 16:00:47, Ojaswin Mujoo wrote:
> > > On Mon, Jan 16, 2023 at 01:23:34PM +0100, Jan Kara wrote:
> > > > > Since this covers the special case we discussed above, we will always
> > > > > un-delete the PA when we encounter the special case and we can then
> > > > > adjust for overlap and traverse the PA rbtree without any issues.
> > > > >
> > > > > Signed-off-by: Ojaswin Mujoo <ojaswin@xxxxxxxxxxxxx>
> > > > > Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@xxxxxxxxx>
> > >
> > > Hi Jan,
> > > Thanks for the review, sharing some of my thoughts below.
> > >
> > > > So I find this putting back of already deleted inode PA very fragile.
> > > > For example, in the current code I suspect you've missed a case in
> > > > ext4_mb_put_pa() which can mark an inode PA as deleted (so it can
> > > > then be spotted by ext4_mb_pa_adjust_overlap() and marked as in use
> > > > again) but ext4_mb_put_pa() still goes on and destroys the PA.
> > >
> > > The 2 code paths that clash here are:
> > >
> > > ext4_mb_new_blocks() -> ext4_mb_release_context() -> ext4_mb_put_pa()
> > > ext4_mb_new_blocks() -> ext4_mb_normalize_request() -> ext4_mb_pa_adjust_overlap()
> > >
> > > Since these are the only code paths from which these 2 functions are
> > > called, access for a given inode will always be serialized by the
> > > upper-level ei->i_data_sem, which is always taken when writing data
> > > blocks using ext4_mb_new_blocks().
> >
> > Indeed, inode->i_data_sem prevents the race I was afraid of.
> >
> > > From my understanding of the code, I feel only
> > > ext4_mb_discard_group_preallocations() can race against other functions
> > > that modify the PA rbtree, since it does not take any inode locks.
> > >
> > > That being said, I do understand your concerns regarding the solution,
> > > and I'm willing to work with the community to ensure our implementation
> > > of this undelete feature is as robust as possible. Along with fixing
> > > the bug reported here [1], I believe it is also a good optimization to
> > > have, especially when the disk is near full and we are seeing a lot of
> > > group discards going on.
> > >
> > > Also, in case the deleted PA completely lies inside our new range, it
> > > is much better to just undelete and use it rather than deleting the
> > > existing PA and reallocating the range again. I think the advantage
> > > would be even bigger in the ext4_mb_use_preallocated() function, where
> > > we can just undelete and use the PA and skip the entire allocation in
> > > case the original range lies in a deleted PA.
> >
> > Thanks for the explanation. However, I think you're optimizing the wrong
> > thing. We are running out of space (to run
> > ext4_mb_discard_group_preallocations() at all) and we are allocating from
> > an area covered by a PA that we've just decided to discard - if anything
> > relies on the performance of the filesystem in ENOSPC conditions, it has
> > serious problems no matter what. Sure, we should deliver the result
> > (either ENOSPC or some block allocation) in a reasonable time, but the
> > performance does not really matter much because all the scanning and
> > flushing is going to slow everything down a lot anyway. One additional
> > scan of the rbtree is really negligible in this case.
> > So what we should rather optimize for in this case is the code
> > simplicity and maintainability of this rare corner case, which will also
> > likely get only a small amount of testing. And in terms of code
> > simplicity the delete & restart solution seems to be much better (at
> > least as far as I'm imagining it - maybe the code will prove me
> > wrong ;)).

> Hi Jan,
>
> So I did try out the 'rb_erase from ext4_mb_adjust_overlap() and retry'
> method, with an extra pa_removed flag, but the locking is getting pretty
> messy. I'm not sure such a design is possible with the locking we
> currently have.
>
> Basically, the issue I'm facing is that we have to drop the read locks and
> acquire the write locks in ext4_mb_adjust_overlap(), which looks something
> like this:
>
> 	spin_unlock(&tmp_pa->pa_lock);
> 	read_unlock(&ei->i_prealloc_lock);
>
> 	write_lock(&ei->i_prealloc_lock);
> 	spin_lock(&tmp_pa->pa_lock);
>
> We have to preserve the order and drop both the tree and PA locks to avoid
> deadlocks. With this approach, the issue is that in between dropping and
> reacquiring these locks, the group discard path can go ahead and free the
> PA memory after calling rb_erase() on it, which can result in a
> use-after-free in the adjust-overlap path. This is because the PA is freed
> without any locks in the discard path, as it assumes no other thread still
> holds a reference to it. That assumption was true earlier, since our
> allocation path never gave up the rbtree lock, but it no longer holds with
> this approach. Essentially, the concept of having two different places
> where a PA can be deleted brings in additional challenges and complexity,
> which might make things worse from a maintainer's/reviewer's point of
> view.

Right, I didn't realize that. That is nasty.

> After brainstorming a bit, I think there might be a few alternatives here:
>
> 1. Instead of deleting the PA in the adjust-overlap thread, make it sleep
> until the group discard path goes ahead and deletes/frees it. At that
> point we can wake it up and retry the allocation.
>
> * Pros: We can be sure the PA will have been removed by the time we retry,
>   so we don't waste extra retries.
> * Cons: Extra complexity in the code.
>
> 2. Just go for a retry in adjust overlap without doing anything. In the
> ideal case, by the time we retry the PA might already be removed. Worst
> case: we keep looping again and again because the discard path has not
> deleted it yet.
>
> * Pros: Simplest approach, the code remains straightforward.
> * Cons: We can end up uselessly retrying if the discard path doesn't
>   delete the PA fast enough.

Well, I think cond_resched() + goto retry would be OK here. We could also
cycle the corresponding group lock, which would wait for
ext4_mb_discard_group_preallocations() to finish, but that is going to burn
the CPU even more than the cond_resched() + retry as we'd just be spinning
on the spinlock. Sleeping is IMHO not warranted since the whole of
ext4_mb_discard_group_preallocations() runs under a spinlock anyway, so any
sleep would be a very short one.

Or actually I have one more possible solution: what the adjusting function
does is look up the PA before and after ac->ac_o_ex.fe_logical and trim the
start & end so that they do not overlap those PAs. So we could just look up
these two PAs (ignoring the deleted state) and then iterate from them with
rb_prev() & rb_next() until we find not-deleted ones. What do you think?
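
Roughly, I have something like the below in mind - completely untested, and
I'm making up the names here (pa_node for the rb_node member, pa_deleted for
the flag, pa_skip_deleted, left_pa/right_pa), so please read it as a sketch
of the idea rather than working code:

	/*
	 * Untested sketch. Assumes struct ext4_prealloc_space has an
	 * rb_node member called pa_node and the usual pa_deleted flag -
	 * substitute whatever names the series actually uses. Must be
	 * called with ei->i_prealloc_lock held (read is enough, we only
	 * look at pa_deleted and never modify the tree).
	 */
	static struct ext4_prealloc_space *
	pa_skip_deleted(struct rb_node *node, bool forward)
	{
		while (node) {
			struct ext4_prealloc_space *pa =
				rb_entry(node, struct ext4_prealloc_space,
					 pa_node);

			if (!pa->pa_deleted)
				return pa;
			/* Step over deleted PAs in the requested direction. */
			node = forward ? rb_next(node) : rb_prev(node);
		}
		return NULL;
	}

and then in the adjusting function, with tmp_pa being whatever the lookup
around ac->ac_o_ex.fe_logical returned:

	left_pa  = pa_skip_deleted(&tmp_pa->pa_node, false);
	right_pa = pa_skip_deleted(rb_next(&tmp_pa->pa_node), true);

That way the adjusting path never has to drop i_prealloc_lock in the middle,
so the use-after-free you describe shouldn't be possible, and the deleted
PAs just sit in the tree until ext4_mb_discard_group_preallocations() gets
around to removing them.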

> 3. The approach of undeleting the PA (proposed in this patchset) that
> we've already discussed.
>
> Now, to be honest, I still prefer the undelete PA approach as it makes
> more sense to me, and I think the code is simple enough since there are
> not many paths that might race: mostly just adjust_overlap vs. group
> discard, or use_preallocated vs. group discard.

Yeah, I'm still not too keen on this, but I'm willing to reconsider if the
above approach proves to be too expensive under ENOSPC conditions...

								Honza
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR