On Wed, Oct 20, 2021 at 11:09:07AM +0200, Michal Hocko wrote:
> > 3. We would need to extend this functionality to other file systems of
> > persistent disk, then mount that file system with 'memcg=<dedicated
> > shared library memcg>'. Jobs can then use the shared library and any
> > memory allocated due to loading the shared library is charged to a
> > dedicated memcg, and not charged to the job using the shared library.
>
> This is more of a question for fs people. My understanding is rather
> limited so I cannot even imagine all the possible setups but just from
> a very high level understanding bind mounts can get really interesting.
> Can those disagree on the memcg?
>
> I am pretty sure I didn't get to think through this very deeply, my gut
> feeling tells me that this will open many interesting questions and I am
> not sure whether it solves more problems than it introduces at this
> moment.  I would be really curious what others think about this.

My understanding of the proposal is that the mount option would be on
the superblock, and it would not be a per-bind-mount option, a la the
ro mount option.  In other words, the designation of the target memcg
to which all tmpfs files would be charged is something that would be
stored in the struct super.

I'm also going to assume that the only thing that gets charged is
memory for files that are backed on the tmpfs.  So for example, if
there is a MAP_PRIVATE mapping, the base page would have been charged
to the target memcg when the file was originally created.  However,
if the process tries to modify a private mapping, the page allocated
on the copy-on-write would get charged to the process's memcg, and
not to the tmpfs's target memcg.

If we make these simplifying assumptions, then it should be fairly
simple.  Essentially, the model is that whenever we do the file
system equivalent of "block allocation" on the tmpfs, all of the
pages associated with that file system are charged to the target
memcg.  That's pretty straightforward, and is pretty easy to model
and anticipate.

In fact, if the only use case was #3 (shared libraries and language
runtimes), this workload could be accommodated without needing any
kernel changes.  This could be done by simply having the setup
process run in the "target memcg", and having it copy all of the
shared libraries and runtime files into the tmpfs at setup time.
Those pages would then get charged to the memcg which first allocated
the file, and that would be the setup memcg.  And when all of the
Kubernetes containers that use these shared libraries and language
runtimes map those pages read-only into their task processes, since
the tmpfs pages were already charged to the setup memcg, they won't
get charged to the task containers.

And I *do* believe that it's much easier to anticipate how much
memory will be used by these shared files.  That way we don't need to
give each task container enough memory quota that, if it happens to
be the first container to start running, it gets charged with all of
the shared memory while all of the other containers freeload off of
it --- which we would otherwise have to do for *every* container,
since any one of them might happen to be the first one to get
launched.

Cheers,

					- Ted
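
P.S.  Here is a minimal sketch of the no-kernel-changes approach
described above, assuming cgroup v2, a dedicated memcg at
/sys/fs/cgroup/shared-libs, a tmpfs already mounted at
/mnt/shared-libs, and an image directory /opt/runtime-image (all of
these paths are made up for illustration).  The setup task moves
itself into the memcg and then populates the tmpfs, so the tmpfs page
allocations are charged to the setup memcg rather than to whichever
container touches the files first:

/*
 * Sketch only: join a dedicated memcg, then populate the shared tmpfs
 * so its pages are charged to that memcg.
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static void join_memcg(const char *cgroup_procs)
{
	FILE *f = fopen(cgroup_procs, "w");

	if (!f) {
		perror(cgroup_procs);
		exit(1);
	}
	/* Writing our pid to cgroup.procs moves this task into the memcg. */
	fprintf(f, "%d\n", (int) getpid());
	fclose(f);
}

int main(void)
{
	/* Hypothetical cgroup v2 path for the dedicated "setup" memcg. */
	join_memcg("/sys/fs/cgroup/shared-libs/cgroup.procs");

	/*
	 * Every tmpfs page allocated by this copy is charged to the
	 * shared-libs memcg, so containers that later map these files
	 * read-only are not charged for them.
	 */
	return system("cp -a /opt/runtime-image/. /mnt/shared-libs/");
}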