On Thu, Sep 29, 2022 at 02:53:11PM +0200, Jan Kara wrote: > On Tue 27-09-22 14:46:48, Ojaswin Mujoo wrote: > > Earlier, inode PAs were stored in a linked list. This caused a need to > > periodically trim the list down inorder to avoid growing it to a very > > large size, as this would severly affect performance during list > > iteration. > > > > Recent patches changed this list to an rbtree, and since the tree scales > > up much better, we no longer need to have the trim functionality, hence > > remove it. > > > > Signed-off-by: Ojaswin Mujoo <ojaswin@xxxxxxxxxxxxx> > > Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@xxxxxxxxx> > > I'm kind of wondering: Now there won't be performance issues with much > more inode PAs but probably we don't want to let them grow completely out > of control? E.g. I can imagine that if we'd have 1 billion of inode PAs > attached to an inode, things would get wonky both in terms of memory > consumption and also in terms of CPU time spent for the cases where we > still do iterate all of the PAs... Is there anything which keeps inode PAs > reasonably bounded? > > Honza > Hi Jan, Sorry for the delay in response, I was on leave for the last few days. So as per my understanding, after this patch, the only path where we would need to traverse all the PAs is the ext4_discard_preallocations() call where we discard all the PAs of an inode one by one (eg when closing the file etc). Such a discard is a colder path as we don't usually expect to do it as often as say allocating blocks to an inode. Originally, the limit was added in this patch [1] because of the time lost in O(N) traversal in the allocation path (ext4_mb_use_preallocated and ext4_mb_normalize_request). Since the rbtree addressed this scalability issue we had decided to remove the trim logic in this patchset. [1] https://lore.kernel.org/all/d7a98178-056b-6db5-6bce-4ead23f4a257@xxxxxxxxx/ That being said, I do agree that there should be some way to limit the PAs from taking up an unreasonable amount of buddy space, memory and CPU cycles in use cases like database files and disk files of long running VMs. Previously the limit was 512 PAs per inode and trim was happening in an LRU fashion, which is not very straightforward to implement in trees. Another approach is rather than having a hard limit, we can throttle the PAs based on some parameter like total active PAs in FS or FSUtil% of the PAs but we might need to take care of fairness so one inode is not holding all the PAs while others get throttled. Anyways, I think the trimming part would need some brainstorming to get right so just wondering if we could keep that as part of a separate patchset and remove the trimming logic for now since rbtree has addressed the scalability concerns in allocation path. Do let me know your thoughts on this. Regards, Ojaswin