Hello Shakeel, thank you for reviewing my patch!

On Fri, Nov 8, 2024 at 5:43 PM Shakeel Butt <shakeel.butt@xxxxxxxxx> wrote:
>
> On Fri, Nov 08, 2024 at 01:29:45PM -0800, Joshua Hahn wrote:
> > This patch introduces mem_cgroup_charge_hugetlb, which combines the
> > logic of mem_cgroup{try,commit}_hugetlb. This reduces the footprint of
> > memcg in hugetlb code, and also consolidates the error path that memcg
> > can take into just one point.
> >
> > Signed-off-by: Joshua Hahn <joshua.hahnjy@xxxxxxxxx>
> > -	if (!memcg_charge_ret)
> > -		mem_cgroup_commit_charge(folio, memcg);
> > -	lruvec_stat_mod_folio(folio, NR_HUGETLB, pages_per_huge_page(h));
> > -	mem_cgroup_put(memcg);
> > +	ret = mem_cgroup_charge_hugetlb(folio, gfp);
> > +	if (ret == -ENOMEM) {
> > +		spin_unlock_irq(&hugetlb_lock);
>
> spin_unlock_irq??

Thank you for the catch. I completely missed this after I swapped the
position of mem_cgroup_charge_hugetlb to be called without the lock.
I will fix this.

> > +		free_huge_folio(folio);
>
> free_huge_folio() will call lruvec_stat_mod_folio() unconditionally but
> you are only calling it on success. This may underflow the metric.

I was actually thinking about this too. I was wondering what would make
sense -- in the original draft of this patch, I had the charge increment
be called unconditionally as well. The idea was that even though it would
not make sense to have the stat incremented when there is an error, it
would eventually be corrected by free_huge_folio's decrement. However,
because there is nothing stopping the user from checking the stat in this
period, they may temporarily see that the value is incremented in
memory.stat, even though they were not able to obtain this page.

With that said, maybe it makes sense to increment unconditionally, if
free also decrements unconditionally. This race condition is not
something that will cause a huge problem for the user, although users
relying on userspace monitors for memory.stat to handle memory management
may see some problems.

Maybe what would make the most sense is to do both incrementing &
decrementing conditionally as well. Thank you for your feedback, I will
iterate on this for the next version!

> > +int mem_cgroup_charge_hugetlb(struct folio *folio, gfp_t gfp)
> > +{
> > +	struct mem_cgroup *memcg = get_mem_cgroup_from_current();
> > +	int ret = 0;
> > +
> > +	if (mem_cgroup_disabled() || !memcg_accounts_hugetlb() ||
> > +		!memcg || !cgroup_subsys_on_dfl(memory_cgrp_subsys)) {
> > +		ret = -EOPNOTSUPP;
>
> why EOPNOTSUPP? You need to return 0 here. We do want
> lruvec_stat_mod_folio() to be called.

In this case, I was just preserving the original code's return
statements. That is, in mem_cgroup_hugetlb_try_charge, the intended
behavior was to return -EOPNOTSUPP if any of these conditions were met.
If I understand the code correctly, calling lruvec_stat_mod_folio() on
this failure will be a noop, since either memcg doesn't account hugetlb
folios / there is no memcg / memcg is disabled.

With all of this said, I think your feedback makes the most sense here,
given the new semantics of the function: if there is no memcg or memcg
doesn't account hugetlb, then there is no way that the limit can be
reached! I will go forward with returning 0, and calling
lruvec_stat_mod_folio (which will be a noop).

Thank you for your detailed feedback. I wish I had caught these errors
myself, thank you for your time in reviewing my patch. I hope you have a
great rest of your weekend!

Joshua