On Fri, Jun 02, 2023 at 11:13:09AM +0800, Qi Zheng wrote:
> Hi Dave,
> 
> On 2023/6/2 07:06, Dave Chinner wrote:
> > On Thu, Jun 01, 2023 at 04:43:32PM +0800, Qi Zheng wrote:
> > > Hi Dave,
> > > On 2023/6/1 07:48, Dave Chinner wrote:
> > > > On Wed, May 31, 2023 at 09:57:40AM +0000, Qi Zheng wrote:
> > > > > From: Kirill Tkhai <tkhai@xxxxx>
> > > > 
> > > > I don't really like this ->destroy_super() callback, especially as
> > > > it's completely undocumented as to why it exists. This is purely a
> > > > work-around for handling extended filesystem superblock shrinker
> > > > functionality, yet there's nothing that tells the reader this.
> > > > 
> > > > It also seems to imply that the superblock shrinker can continue to
> > > > run after the existing unregister_shrinker() call before ->kill_sb()
> > > > is called. This violates the assumption made in filesystems that the
> > > > superblock shrinkers have been stopped and will never run again
> > > > before ->kill_sb() is called. Hence ->kill_sb() implementations
> > > > assume there is nothing else accessing filesystem-owned structures
> > > > and they can tear down internal structures safely.
> > > > 
> > > > Realistically, the days of XFS using this superblock shrinker
> > > > extension are numbered. We've got a lot of the infrastructure we
> > > > need in place to get rid of the background inode reclaim
> > > > infrastructure that requires this shrinker extension, and it's on my
> > > > list of things that need to be addressed in the near future.
> > > > 
> > > > In fact, now that I look at it, I think the shmem usage of this
> > > > superblock shrinker interface is broken - it returns SHRINK_STOP to
> > > > ->free_cached_objects(), but the only valid return value is the
> > > > number of objects freed (i.e. 0 is nothing freed). These special
> > > > superblock extension interfaces do not work like a normal
> > > > shrinker....
> > > > 
> > > > Hence I think the shmem usage should be replaced with a separate
> > > > internal shmem shrinker that is managed by the filesystem itself
> > > > (similar to how XFS has multiple internal shrinkers).
> > > > 
> > > > At this point, then the only user of this interface is (again) XFS.
> > > > Given this, adding new VFS methods for a single filesystem
> > > > for functionality that is planned to be removed is probably not the
> > > > best approach to solving the problem.
> > > 
> > > Thanks for such a detailed analysis. Kirill Tkhai just proposed a
> > > new method[1], I cc'd you on the email.
> > 
> > I've just read through that thread, and I've looked at the original
> > patch that caused the regression.
> > 
> > I'm a bit annoyed right now. Nobody cc'd me on the original patches,
> > nor were any of the subsystems that use shrinkers cc'd on the
> > patches that changed shrinker behaviour. I only find out about this
> 
> Sorry about that, my mistake. I followed the results of
> scripts/get_maintainer.pl before.

Sometimes I wonder if people who contribute a lot to a subsystem should
be more aggressive about listing themselves explicitly in MAINTAINERS,
but then I look at the ~600 emails that came in while I was on vacation
for 6 days over a long weekend and ... shut up. :P

> > because someone tries to fix something they broke by *breaking more
> > stuff* and not even realising how broken what they are proposing is.
> 
> Yes, this slows down the speed of umount. But the benefit is that
> slab shrink becomes lockless, the mount operation and slab shrink no
> longer affect each other, and the IPC no longer drops significantly,
> etc.
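(For anyone who hasn't looked at the patch in question: the scheme Qi
describes boils down to something like the sketch below. The names and
details here are illustrative, not the actual code -- the point is just
that reclaim walks the shrinker list inside an SRCU read-side section
instead of taking a lock it shares with (un)mount, and
unregister_shrinker() pays for that with a synchronize_srcu() wait.)

```
#include <linux/rculist.h>
#include <linux/shrinker.h>
#include <linux/srcu.h>

static LIST_HEAD(shrinker_list);	/* stand-in for the real global list */
DEFINE_STATIC_SRCU(shrinker_srcu);	/* stand-in name */

/* Reclaim side: a lockless walk that never blocks on mount/umount. */
static void shrink_slab_sketch(struct shrink_control *sc)
{
	struct shrinker *shrinker;
	int idx = srcu_read_lock(&shrinker_srcu);

	list_for_each_entry_srcu(shrinker, &shrinker_list, list,
				 srcu_read_lock_held(&shrinker_srcu))
		shrinker->scan_objects(shrinker, sc);	/* real code does far more */

	srcu_read_unlock(&shrinker_srcu, idx);
}

/* Unmount side: the shrinker cannot be freed until all readers are done. */
static void unregister_shrinker_sketch(struct shrinker *shrinker)
{
	/* write-side serialisation against other updaters omitted */
	list_del_rcu(&shrinker->list);
	synchronize_srcu(&shrinker_srcu);	/* the wait measured below */
}
```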
The lockless shrink seems like a good thing to have, but ... is it
really true that the superblock shrinker can still be running after
->kill_sb?  /That/ is surprising to me.

--D

> And I used bpftrace to measure the time consumption of
> unregister_shrinker():
> 
> ```
> And I just tested it on a physical machine (Intel(R) Xeon(R) Platinum
> 8260 CPU @ 2.40GHz) and the results are as follows:
> 
> 1) use synchronize_srcu():
> 
> @ns[umount]:
> [8K, 16K)     83 |@@@@@@@                                             |
> [16K, 32K)   578 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [32K, 64K)    78 |@@@@@@@                                             |
> [64K, 128K)    6 |                                                    |
> [128K, 256K)   7 |                                                    |
> [256K, 512K)  29 |@@                                                  |
> [512K, 1M)    51 |@@@@                                                |
> [1M, 2M)      90 |@@@@@@@@                                            |
> [2M, 4M)      70 |@@@@@@                                              |
> [4M, 8M)       8 |                                                    |
> 
> 2) use synchronize_srcu_expedited():
> 
> @ns[umount]:
> [8K, 16K)     31 |@@                                                  |
> [16K, 32K)   803 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [32K, 64K)   158 |@@@@@@@@@@                                          |
> [64K, 128K)    4 |                                                    |
> [128K, 256K)   2 |                                                    |
> [256K, 512K)   2 |                                                    |
> ```
> 
> With synchronize_srcu(), most of the time consumption is between 16us
> and 32us, and the worst case is between 4ms and 8ms. Is this totally
> unacceptable?
> 
> This performance regression report comes from a stress test. Will the
> umount action be executed so frequently under real workloads?
> 
> If that is really unacceptable, then after applying the newly proposed
> method, umount will be as fast as before (or even faster).
> 
> Thanks,
> Qi
> 
> > 
> > The previous code was not broken and it provided specific guarantees
> > to subsystems via unregister_shrinker(). From the above discussion,
> > it appears that the original authors of these changes either did not
> > know about or did not understand them, so that casts doubt in my
> > mind about the attempted solution and all the proposed fixes for it.
> > 
> > I don't have the time right now to unravel this mess and fully
> > understand the original problem, the changes, or the band-aids that
> > are being thrown around. We are also getting quite late in the cycle
> > to be doing major surgery to critical infrastructure, especially as it
> > gives so little time to review and regression test whatever new
> > solution is proposed.
> > 
> > Given this appears to be a change introduced in 6.4-rc1, I think the
> > right thing to do is to revert the change rather than make things
> > worse by trying to shove some "quick fix" into the kernel to address
> > it.
> > 
> > Andrew, could you please sort out a series to revert this shrinker
> > infrastructure change and all the dependent hacks that have been
> > added to try to fix it so far?
> > 
> > -Dave.
> 
> -- 
> Thanks,
> Qi
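For reference, the superblock shrinker extension Dave is talking about
above is the ->nr_cached_objects()/->free_cached_objects() pair in
struct super_operations. Here is a minimal sketch of the contract as he
describes it -- illustrative code only, not shmem's or XFS's actual
implementation:

```
#include <linux/fs.h>
#include <linux/shrinker.h>

static long example_nr_cached_objects(struct super_block *sb,
				      struct shrink_control *sc)
{
	return 0;	/* count of reclaimable fs-private objects */
}

static long example_free_cached_objects(struct super_block *sb,
					struct shrink_control *sc)
{
	long freed = 0;

	/* free up to sc->nr_to_scan fs-private objects, counting each one */

	return freed;	/* a count; never SHRINK_STOP */
}

static const struct super_operations example_sops = {
	.nr_cached_objects	= example_nr_cached_objects,
	.free_cached_objects	= example_free_cached_objects,
};
```

SHRINK_STOP only means something to a regular shrinker's
->scan_objects(); returned from ->free_cached_objects() it would just be
misread by the VFS as a (huge) count of freed objects.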