On Wed, Mar 21, 2018 at 07:15:14PM +0300, Kirill Tkhai wrote:
> On 20.03.2018 17:34, Dave Chinner wrote:
> > On Tue, Mar 20, 2018 at 04:15:16PM +0300, Kirill Tkhai wrote:
> >> On 20.03.2018 03:18, Dave Chinner wrote:
> >>> On Mon, Mar 19, 2018 at 02:06:01PM +0300, Kirill Tkhai wrote:
> >>>> On 17.03.2018 00:39, Dave Chinner wrote:
> >>> Actually, it is fair, because:
> >>>
> >>> 	/* proportion the scan between the caches */
> >>> 	dentries = mult_frac(sc->nr_to_scan, dentries, total_objects);
> >>> 	inodes = mult_frac(sc->nr_to_scan, inodes, total_objects);
> >>> 	fs_objects = mult_frac(sc->nr_to_scan, fs_objects, total_objects);
> >>>
> >>> This means that if the number of objects in the memcg-aware VFS
> >>> caches is tiny compared to the global XFS inode cache, then they
> >>> only get a tiny amount of the total scanning done by the shrinker.
> >>
> >> This is just wrong. If you have two memcgs, where the first is
> >> charged with xfs dentries and inodes while the second is not, the
> >> second will respond to xfs shrink just as much as the first.
> >
> > That makes no sense to me. Shrinkers are per-sb, so they only
> > shrink per-filesystem dentries and inodes. If your memcgs are on
> > different filesystems, then they'll behave according to the size of
> > that filesystem's caches, not the cache of some other filesystem.
>
> But this breaks the isolation purpose of memcg. When someone places
> different services in different pairs of memcg and cpu cgroup, they
> don't want one service doing another's work.

Too bad. Filesystems *break memcg isolation all the time*.

Think about it. The filesystem journal is owned by the filesystem,
not the memcg that is writing a transaction to it. If one memcg does
lots of metadata modifications, it can use all the journal space and
starve itself and all other memcgs of journal space while the
filesystem does writeback to clear it out.

Another example: if one memcg is manipulating free space in a
particular allocation group, other memcgs that need to manipulate
space in that allocation group will block and be prevented from
operating until the first memcg finishes its work.

Another example: inode IO in XFS is done in clusters of 32. Read or
write any inode in the cluster, and we read/write all the other
inodes in that cluster, too. Hence if we need to read an inode and
the cluster is busy because the filesystem journal needed flushing,
or because another inode in the cluster is being unlinked (e.g. by a
process in a different memcg), then the read in the first memcg will
block until whatever is being done on the buffer is complete. IOWs,
even at the physical on-disk inode level we violate memcg resource
isolation principles.

I can go on, but I think you get the picture: filesystems are full
of global structures whose access arbitration mechanisms
fundamentally break the concept of operational memcg isolation.

With that in mind, we really don't care that the shrinker does
global work and violates memcg isolation principles, because we
violate them everywhere. IOWs, if we try to enforce memcg isolation
in the shrinker, then we can't guarantee forwards progress in memory
reclaim, because we end up with multi-dimensional memcg dependencies
at the physical layers of the filesystem structure....

I don't expect people who know nothing about XFS or filesystems to
understand the complexity of the interactions we are dealing with
here. Everything is a compromise when it comes to the memory reclaim
code, as there are so many corner cases we have to handle.
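FWIW, to put some made-up numbers on the proportioning code quoted
at the top, here's a quick userspace sketch. The mult_frac() below
just mirrors the kernel macro's overflow-avoiding x * n / d
behaviour, and the cache sizes are invented for illustration:

	#include <stdio.h>

	/* Same semantics as the kernel's mult_frac(): x * n / d
	 * without overflowing the intermediate multiplication. */
	static unsigned long mult_frac(unsigned long x, unsigned long n,
				       unsigned long d)
	{
		return (x / d) * n + ((x % d) * n) / d;
	}

	int main(void)
	{
		unsigned long dentries = 100;	   /* tiny memcg-aware caches */
		unsigned long inodes = 200;
		unsigned long fs_objects = 100000; /* big global XFS inode cache */
		unsigned long total_objects = dentries + inodes + fs_objects;
		unsigned long nr_to_scan = 1024;   /* batch from the shrinker core */

		/* proportion the scan between the caches */
		dentries = mult_frac(nr_to_scan, dentries, total_objects);
		inodes = mult_frac(nr_to_scan, inodes, total_objects);
		fs_objects = mult_frac(nr_to_scan, fs_objects, total_objects);

		/* prints "scan: dentries 1, inodes 2, fs_objects 1020" */
		printf("scan: dentries %lu, inodes %lu, fs_objects %lu\n",
		       dentries, inodes, fs_objects);
		return 0;
	}

The memcg-aware caches see almost none of the scan batch when the
global cache dominates, which is exactly the behaviour I described
above.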
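Similarly, on the inode clustering: the plain ino / 32 mapping below
is a deliberate simplification for illustration (the real XFS
inode-number-to-cluster-buffer calculation is more involved), but it
shows why two memcgs touching nearby inodes end up contending on the
same cluster buffer:

	#include <stdio.h>
	#include <stdbool.h>

	#define INODES_PER_CLUSTER	32UL

	/* Two inodes share a cluster buffer if they land in the same
	 * group of 32 (simplified mapping, see note above). */
	static bool same_cluster(unsigned long a, unsigned long b)
	{
		return a / INODES_PER_CLUSTER == b / INODES_PER_CLUSTER;
	}

	int main(void)
	{
		/* memcg A is unlinking inode 100 while memcg B wants to
		 * read inode 97: same cluster, so B blocks behind A. */
		printf("share a cluster buffer: %s\n",
		       same_cluster(100, 97) ? "yes" : "no");	/* yes */
		return 0;
	}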
In this situation, perfect is the enemy of good....

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx