On Mon, Nov 29, 2021 at 09:39:16AM +0100, Michal Hocko wrote:
> On Fri 26-11-21 16:26:23, Hao Lee wrote:
> [...]
> > I will try Matthew's idea to use semaphore or mutex to limit the number of BE
> > jobs that are in the exiting path. This sounds like a feasible approach for
> > our scenario...
>
> I am not really sure this is something that would be acceptable. Your
> problem is resource partitioning. Papering that over by a lock is not
> the right way to go. Besides that you will likely hit a hard question on
> how many tasks to allow to run concurrently. Whatever the value, some
> workload will very likely suffer. We cannot assume the admin will
> choose the right value because there is no clear answer for that. Not to
> mention other potential problems - e.g. even more priority inversions
> etc.

I don't see how we get priority inversions. These tasks are exiting; at
the point they take the semaphore, they should not be holding any locks.
They're holding a resource (memory) that needs to be released, but a task
wanting to acquire memory must already be prepared to sleep.

I see this as being a thundering herd problem. We have dozens, maybe
hundreds of tasks all trying to free their memory at once. If we force
the herd to go through a narrow gap, they arrive at the spinlock in an
orderly manner.
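
To make the "narrow gap" concrete, here is a rough user-space sketch in
Python. It is only an analogy, not kernel code and not a proposed patch.
Threads stand in for the exiting tasks, a counting semaphore is the gap,
and a plain lock stands in for the contended spinlock. All names are
hypothetical, and the limit of 4 is arbitrary; picking that number is
exactly the open question raised above.

```python
import threading

# Hypothetical limit on how many tasks may be in the exit path at once.
# Nothing in this discussion settles what the value should be.
MAX_CONCURRENT_EXITS = 4


def run_exiting_tasks(num_tasks, max_concurrent=MAX_CONCURRENT_EXITS):
    """Run num_tasks 'exiting' threads through a semaphore-limited gap.

    Returns (freed, peak_in_gap): how many tasks released their memory,
    and the largest number of tasks that were inside the gap at once.
    """
    gate = threading.Semaphore(max_concurrent)  # the narrow gap
    contended_lock = threading.Lock()           # stand-in for the spinlock
    stats_lock = threading.Lock()               # protects the counters below
    in_gap = 0
    peak_in_gap = 0
    freed = 0

    def exit_task():
        nonlocal in_gap, peak_in_gap, freed
        # The task holds no other lock when it waits here, so sleeping
        # on the semaphore cannot block anyone who needs a lock from it.
        with gate:
            with stats_lock:
                in_gap += 1
                peak_in_gap = max(peak_in_gap, in_gap)
            # At most max_concurrent tasks ever contend for this lock.
            with contended_lock:
                freed += 1  # "release the memory"
            with stats_lock:
                in_gap -= 1

    threads = [threading.Thread(target=exit_task) for _ in range(num_tasks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return freed, peak_in_gap


if __name__ == "__main__":
    freed, peak = run_exiting_tasks(200)
    print(f"freed={freed} peak_in_gap={peak}")
```

Every task still frees its memory, but no more than max_concurrent of
them are ever queued on the lock at the same moment. The rest sleep on
the semaphore instead of spinning.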