On Thu, May 4, 2023 at 2:07 AM Michal Hocko <mhocko@xxxxxxxx> wrote:
>
> On Wed 03-05-23 08:09:28, Suren Baghdasaryan wrote:
> > On Wed, May 3, 2023 at 12:25 AM Michal Hocko <mhocko@xxxxxxxx> wrote:
> [...]
> > Thanks for summarizing!
> >
> > > At least those I find the most important:
> > > - This is a big change and it adds a significant maintenance burden
> > >   because each allocation entry point needs to be handled specifically.
> > >   The cost will grow with the intended coverage, especially when the
> > >   allocation is hidden in library code.
> >
> > Do you mean that with more allocations in the codebase more codetags
> > will be generated? Is that the concern?
>
> No. I am mostly concerned about the _maintenance_ overhead. For the
> bare tracking (without profiling and thus stack traces) only those
> allocations that are directly inlined into the consumer are really
> of any use. That increases the code impact of the tracing because any
> relevant allocation location has to go through the micro surgery.
>
> E.g. is it really interesting to know that there is a likely memory
> leak in seq_file proper doing an allocation? No, as it is most likely
> the specific implementation using seq_file that is leaking. There are
> other examples like that. See?

Yes, I see that. One-level tracking does not provide all the
information needed to track such issues. Something more informative
would cost more. That's why our proposal is to have a light-weight
mechanism to get a high-level picture and then be able to zoom into a
specific area using context capture. If you have ideas to improve
this, I'm open to suggestions.

>
> > Or maybe, as you commented in another patch, the concern is that the
> > context capturing feature does not limit how many stacks will be
> > captured?
>
> That is a memory overhead which can be really huge and it would be nice
> to be more explicit about that in the cover letter. It is a downside for
> sure but not something that has a code maintenance impact, and it is an
> opt-in so it can be enabled only when necessary.

You are right, I'll add that to the cover letter.

>
> Quite honestly, though, the more I look into the context capturing part,
> the more it seems to me that there is much more to be reconsidered there,
> and if you really want to move forward with the code tagging part then
> you should drop that for now. It would make the whole series smaller and
> easier to digest.

Sure, I don't see an issue with removing that for now and refining the
mechanism before posting again.

>
> > > - It has been brought up that this is duplicating functionality already
> > >   available via existing tracing infrastructure. You should make it very
> > >   clear why that is not suitable for the job.
> >
> > I experimented with using tracing with _RET_IP_ to implement this
> > accounting. The major issue is the _RET_IP_-to-codetag lookup runtime
> > overhead, which is orders of magnitude higher than in the proposed code
> > tagging approach. With the code tagging proposal, that link is resolved
> > at compile time. Since we want this mechanism deployed in production,
> > we want to keep the overhead to the absolute minimum.
> > You asked me before how much overhead would be tolerable and the
> > answer will always be "as small as possible". This is especially true
> > for slab allocators, which are ridiculously fast, so regressing them
> > would be very noticeable (due to the frequent use).
>
> It would have been more convincing if you had some numbers at hand.
> E.g. this is a typical workload we are dealing with. With the compile
> time tags we are able to learn this with that much of cost. With dynamic
> tracing we are able to learn this much with that cost. See? "As small as
> possible" is a rather vague term that different people will have a very
> different idea about.

I'm rerunning my tests with the latest kernel to collect the
comparison data. I profiled these solutions before but the kernel has
changed since then, so I need to update the numbers.
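In the meantime, to make the nature of the difference concrete, the core
of the code tagging approach boils down to something like the sketch
below (a very stripped-down illustration; the names here are made up and
are not the actual API in the series). The tag is a static object
emitted at the allocation site at compile time:

struct alloc_tag {
	const char *file;
	int line;
	atomic_long_t bytes;
};

/* one static tag per expansion site, created by the compiler */
#define ALLOC_TAG_HERE()					\
({								\
	static struct alloc_tag _tag = {			\
		.file = __FILE__,				\
		.line = __LINE__,				\
	};							\
	&_tag;							\
})

/* hypothetical wrapper, shown for illustration only */
#define my_kmalloc(size, gfp)					\
({								\
	atomic_long_add(size, &ALLOC_TAG_HERE()->bytes);	\
	kmalloc(size, gfp);					\
})

The important property is that the counter address is known at compile
time, so the hot path is a single atomic add. A tracing-based
implementation has to resolve _RET_IP_ to an equivalent dynamically
allocated counter at runtime, and that lookup is where the extra cost
comes from.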
>
> > There is another issue, which I think can be solved in a smart way but
> > will either affect performance or require more memory. With the
> > tracing approach we don't know beforehand how many individual
> > allocation sites exist, so we have to allocate code tags (or similar
> > structures for counting) at runtime vs compile time. We can be smart
> > about it and allocate in batches or even preallocate more than we need
> > beforehand but, as I said, it will require some kind of compromise.
>
> I have tried our usual distribution config (only vmlinux without modules,
> so the real impact will be larger as we build a lot of stuff into
> modules) just to get an idea:
>       text      data       bss       dec      hex filename
>   28755345  17040322  19845124  65640791  3e99957 vmlinux.before
>   28867168  17571838  19386372  65825378  3ec6a62 vmlinux.after
>
> Less than 1% for text, 3% for data. This is not all that terrible
> for an initial submission, and a more dynamic approach could be added
> later, e.g. with a smaller pre-allocated hash table that could be
> expanded lazily. Anyway, not something I would be losing sleep over. This
> can always be improved later on.

Ah, right. I should have mentioned this overhead too. Thanks for
keeping me honest.

> > I understand that code tagging creates additional maintenance burdens
> > but I hope it also produces enough benefits that people will want
> > this. The cost is also hopefully amortized when additional
> > applications like the ones we presented in the RFC [1] are built using
> > the same framework.
>
> TBH I am much more concerned about the maintenance burden on the MM side
> than the actual code tagging itself, which is much more self-contained. I
> haven't seen the other potential applications of the same infrastructure;
> maybe their code impact would be much smaller than in the MM proper. Our
> allocator API is really hairy and convoluted.

Yes, the other applications are much smaller and cleaner. MM allocation
code is quite complex indeed.

>
> > > - We already have page_owner infrastructure that provides allocation
> > >   tracking data. Why can it not be used/extended?
> >
> > 1. The overhead.
>
> Do you have any numbers?

Will post once my tests are completed.

>
> > 2. Covers only page allocators.
>
> Yes, this sucks.
>
> > I didn't think about extending the page_owner approach to slab
> > allocators but I suspect it would not be trivial. I don't see
> > attaching an owner to every slab object as a scalable solution. The
> > overhead would again be of concern here.
>
> This would have been a nice argument to mention in the changelog so that
> we know that you have considered that option at least. Why should I (as
> a reviewer) have to wild-guess that?

Sorry, it's hard to remember all the decisions, discussions and
conclusions when working on a feature over a long time period. I'll
include more information about that.

>
> > I should point out that there was one important technical concern
> > about the lack of a kill switch for this feature, which was an issue
> > for distributions that can't disable the CONFIG flag. In this series
> > we addressed that concern.
>
> Thanks, that is certainly appreciated. I haven't looked deeper into that
> part but from the cover letter I have understood that CONFIG_MEM_ALLOC_PROFILING
> implies unconditional page_ext and therefore the memory overhead
> associated with that. There seems to be a kill switch, nomem_profiling,
> but from a quick look it doesn't seem to disable page_ext allocations. I
> might be missing something there of course. Having a high-level
> description for that would be really nice as well.

Right, I will add a description of that as well. We eliminate the
runtime overhead but not the memory one. However, I believe the latter
is also doable using the page_ext_operations.need callback. Will look
into it.
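Something along these lines is what I have in mind (a rough sketch only;
the mem_profiling_enabled flag and the per-page entry size are
placeholders, not code from the series). page_ext reserves per-page
space only for clients whose ->need() callback returns true at boot, so
with the kill switch active the callback would return false and no
page_ext space would be set aside for profiling:

static bool need_mem_profiling(void)
{
	/* hypothetical flag set from the boot-time kill switch */
	return mem_profiling_enabled;
}

struct page_ext_operations mem_profiling_ops = {
	/* space for the per-page tag reference */
	.size = sizeof(void *),
	.need = need_mem_profiling,
};

If that works out, the memory overhead would only be paid when the
profiling is actually enabled.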
Thanks,
Suren.

>
> > [1] https://lore.kernel.org/all/20220830214919.53220-1-surenb@xxxxxxxxxx/
>
> --
> Michal Hocko
> SUSE Labs