> On Oct 20, 2022, at 1:06 AM, Huang, Ying <ying.huang@xxxxxxxxx> wrote:
> 
> "Alex Zhu (Kernel)" <alexlzhu@xxxxxxxx> writes:
> 
>>> On Oct 19, 2022, at 6:24 PM, Huang, Ying <ying.huang@xxxxxxxxx> wrote:
>>> 
>>> "Alex Zhu (Kernel)" <alexlzhu@xxxxxxxx> writes:
>>> 
>>>>> On Oct 19, 2022, at 12:04 AM, Huang, Ying <ying.huang@xxxxxxxxx> wrote:
>>>>> 
>>>>> <alexlzhu@xxxxxx> writes:
>>>>> 
> 
> [snip]
> 
>>>>>> diff --git a/mm/list_lru.c b/mm/list_lru.c
>>>>>> index a05e5bef3b40..8cc56a84b554 100644
>>>>>> --- a/mm/list_lru.c
>>>>>> +++ b/mm/list_lru.c
>>>>>> @@ -140,6 +140,32 @@ bool list_lru_add(struct list_lru *lru, struct list_head *item)
>>>>>>  }
>>>>>>  EXPORT_SYMBOL_GPL(list_lru_add);
>>>>>>  
>>>>>> +bool list_lru_add_page(struct list_lru *lru, struct page *page, struct list_head *item)
>>>>>> +{
>>>>>> +	int nid = page_to_nid(page);
>>>>>> +	struct list_lru_node *nlru = &lru->node[nid];
>>>>>> +	struct list_lru_one *l;
>>>>>> +	struct mem_cgroup *memcg;
>>>>>> +	unsigned long flags;
>>>>>> +
>>>>>> +	spin_lock_irqsave(&nlru->lock, flags);
>>>>>> +	if (list_empty(item)) {
>>>>>> +		memcg = page_memcg(page);
>>>>>> +		l = list_lru_from_memcg_idx(lru, nid, memcg_kmem_id(memcg));
>>>>>> +		list_add_tail(item, &l->list);
>>>>>> +		/* Set shrinker bit if the first element was added */
>>>>>> +		if (!l->nr_items++)
>>>>>> +			set_shrinker_bit(memcg, nid,
>>>>>> +					 lru_shrinker_id(lru));
>>>>>> +		nlru->nr_items++;
>>>>>> +		spin_unlock_irqrestore(&nlru->lock, flags);
>>>>>> +		return true;
>>>>>> +	}
>>>>>> +	spin_unlock_irqrestore(&nlru->lock, flags);
>>>>>> +	return false;
>>>>>> +}
>>>>>> +EXPORT_SYMBOL_GPL(list_lru_add_page);
>>>>>> +
>>>>> 
>>>>> It appears that only 2 lines are different from list_lru_add().  Is it
>>>>> possible for us to share code?  For example, add another flag for
>>>>> page_memcg() case?
>>>> 
>>>> I believe there are 4 lines: the page_to_nid(page) and the
>>>> spin_lock_irqsave/restore.
>>>> It was implemented this way as we found we needed to take a page as a
>>>> parameter and obtain the node id from the page. This is because the
>>>> THP is not necessarily a slab object, as the list_lru_add/delete code
>>>> assumes.
>>>> 
>>>> Also, there is a potential deadlock when split_huge_page is called
>>>> from reclaim and when split_huge_page is called by the THP shrinker,
>>>> which is why we need irqsave/restore.
>>> 
>>> Can you provide more details on this?  If it's necessary, I think it
>>> should be OK to use irqsave/restore.
>> 
>> There is an ABBA deadlock between reclaim and the THP shrinker where
>> one waits for the page lock and the other waits for the list_lru lock.
> 
> Per my understanding, when we hold list_lru lock, we only use
> trylock_page() and backoff if we fail to acquire the page lock.  So it
> appears that we are safe here?  If not, can you provide the detailed
> race condition information, for example actions on different CPUs?

The ABBA deadlock was the reason we added this trylock_page(). It is
safe now; it was not safe in v3.

> 
>> There is another deadlock where free_transhuge_page interrupts the THP
>> shrinker while it is splitting THPs (holding the list_lru lock) and
>> then acquires the same list_lru lock (self deadlock). These deadlocks
>> occurred in our production experiments. I do believe irqsave/restore
>> is necessary.
> 
> Got it.  We may use list_lru lock inside IRQ handler when free pages.  I
> think this information is good for code comments or patch description.

Sounds good! I’ll add some comments to make that more clear.

> 
>>> 
>>>> I thought this would be cleaner than attempting to share code with
>>>> list_lru_add/delete. Only the shrinker makes use of this. All other
>>>> use cases assume slab objects.
>>> 
>>> Not only for slab objects now.  The original code works for slab and
>>> normal page, which is controlled by list_lru->memcg_aware.  You can add
>>> another flag for your new use case, or refactor it to use a function
>>> pointer.
>>> 
>>> Best Regards,
>>> Huang, Ying
>> 
>> I’ll take a look at this tomorrow and get back to you. Thanks for the
>> suggestion!
>>> 
>>>>> 
> 
> [snip]
> 
> Best Regards,
> Huang, Ying