On 19/03/2024 02:20, Yin Fengwei wrote:
>
>
> On 3/18/24 23:35, Ryan Roberts wrote:
>> On 18/03/2024 10:05, David Hildenbrand wrote:
>>> On 18.03.24 11:00, Yin, Fengwei wrote:
>>>>
>>>>
>>>> On 3/18/2024 10:16 AM, Huang, Ying wrote:
>>>>> Ryan Roberts <ryan.roberts@xxxxxxx> writes:
>>>>>
>>>>>> Hi Yin Fengwei,
>>>>>>
>>>>>> On 15/03/2024 11:12, David Hildenbrand wrote:
>>>>>>> On 15.03.24 11:49, Ryan Roberts wrote:
>>>>>>>> On 15/03/2024 10:43, David Hildenbrand wrote:
>>>>>>>>> On 11.03.24 16:00, Ryan Roberts wrote:
>>>>>>>>>> Now that swap supports storing all mTHP sizes, avoid splitting large
>>>>>>>>>> folios before swap-out. This benefits performance of the swap-out path
>>>>>>>>>> by eliding split_folio_to_list(), which is expensive, and also sets us
>>>>>>>>>> up for swapping in large folios in a future series.
>>>>>>>>>>
>>>>>>>>>> If the folio is partially mapped, we continue to split it since we want
>>>>>>>>>> to avoid the extra IO overhead and storage of writing out pages
>>>>>>>>>> unnecessarily.
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Ryan Roberts <ryan.roberts@xxxxxxx>
>>>>>>>>>> ---
>>>>>>>>>> mm/vmscan.c | 9 +++++----
>>>>>>>>>> 1 file changed, 5 insertions(+), 4 deletions(-)
>>>>>>>>>>
>>>>>>>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>>>>>>>> index cf7d4cf47f1a..0ebec99e04c6 100644
>>>>>>>>>> --- a/mm/vmscan.c
>>>>>>>>>> +++ b/mm/vmscan.c
>>>>>>>>>> @@ -1222,11 +1222,12 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>>>>>>>>>> if (!can_split_folio(folio, NULL))
>>>>>>>>>> goto activate_locked;
>>>>>>>>>> /*
>>>>>>>>>> - * Split folios without a PMD map right
>>>>>>>>>> - * away. Chances are some or all of the
>>>>>>>>>> - * tail pages can be freed without IO.
>>>>>>>>>> + * Split partially mapped folios map
>>>>>>>>>> + * right away. Chances are some or all
>>>>>>>>>> + * of the tail pages can be freed
>>>>>>>>>> + * without IO.
>>>>>>>>>> */
>>>>>>>>>> - if (!folio_entire_mapcount(folio) &&
>>>>>>>>>> + if (!list_empty(&folio->_deferred_list) &&
>>>>>>>>>> split_folio_to_list(folio,
>>>>>>>>>> folio_list))
>>>>>>>>>> goto activate_locked;
>>>>>>>>>
>>>>>>>>> Not sure if we might have to annotate that with data_race().
>>>>>>>>
>>>>>>>> I asked that exact question to Matthew in another context but didn't get a
>>>>>>>> response. There are examples of checking if the deferred list is empty
>>>>>>>> with and without data_race() in the code base. But list_empty() is
>>>>>>>> implemented like this:
>>>>>>>>
>>>>>>>> static inline int list_empty(const struct list_head *head)
>>>>>>>> {
>>>>>>>> return READ_ONCE(head->next) == head;
>>>>>>>> }
>>>>>>>>
>>>>>>>> So I assumed the READ_ONCE() makes everything safe without a lock? Perhaps
>>>>>>>> not sufficient for KCSAN?
>>>> I don't think READ_ONCE() can replace the lock.
>>
>> But it doesn't ensure we get a consistent value and that the compiler orders the
>> load correctly. There are lots of patterns in the kernel that use READ_ONCE()
>> without a lock and they don't use data_race() - e.g. ptep_get_lockless().
> They (ptep_get_lockless() and deferred_list) have different access pattern
> (or race pattern) here. I don't think they are comparable.
>
>>
>> It sounds like none of us really understand what data_race() is for, so I guess
>> I'll just do a KCSAN build and invoke the code path to see if it complains.
> READ_ONCE() in list_empty will shutdown the KCSAN also.

OK, I found some time to run the test with KCSAN; nothing fires.

But then I read the docs and looked at the code a bit.

Documentation/dev-tools/kcsan.rst states:

    In an execution, two memory accesses form a *data race* if they *conflict*,
    they happen concurrently in different threads, and at least one of them is a
    *plain access*; they *conflict* if both access the same memory location, and
    at least one is a write.

It also clarifies that READ_ONCE() is a "marked access". So we would have a
data race if there was a concurrent, *plain* write to
folio->_deferred_list.next. This can occur in a couple of places I believe, for
example:

deferred_split_folio()
  list_add_tail()
    __list_add()
      new->next = next;

deferred_split_scan()
  list_move()
    list_add()
      __list_add()
        new->next = next;

So if either deferred_split_folio() or deferred_split_scan() can run
concurrently with shrink_folio_list() for the same folio (I believe both can),
then we have a race, and this list_empty() check needs to be protected with
data_race(). The race is safe/by design, but it does need to be marked.

I'll fix this in my next version.

Thanks,
Ryan

>
>>
>>
>>>>
>>>>>>>
>>>>>>> Yeah, there is only one use of data_race with that list.
>>>>>>>
>>>>>>> It was added in f3ebdf042df4 ("THP: avoid lock when check whether THP is in
>>>>>>> deferred list").
>>>>>>>
>>>>>>> Looks like that was added right in v1 of that change [1], so my best guess is
>>>>>>> that it is not actually required.
>>>>>>>
>>>>>>> If not required, likely we should just cleanup the single user.
>>>>>>>
>>>>>>> [1]
>>>>>>> https://lore.kernel.org/linux-mm/20230417075643.3287513-2-fengwei.yin@xxxxxxxxx/
>>>>>>
>>>>>> Do you have any recollection of why you added the data_race() markup?
>>>>>
>>>>> Per my understanding, this is used to mark that the code accesses
>>>>> folio->_deferred_list without lock intentionally, while
>>>>> folio->_deferred_list may be changed in parallel. IIUC, this is what
>>>>> data_race() is used for. Or, my understanding is wrong?
>>>> Yes. This is my understanding also.
>>>
>>> Why don't we have a data_race() in deferred_split_folio() then, before taking
>>> the lock?
>>>
>>> It's used a bit inconsistently here.
>>>
>>
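
For reference, a rough userspace sketch of the access pattern discussed above: a marked read in list_empty() versus the plain write to ->next on the list-add side, with the lockless check wrapped in data_race(). This is only an illustration under stated assumptions. READ_ONCE() and data_race() here are simplified stand-ins for the kernel macros (as I understand it, the real data_race() additionally tells KCSAN to ignore the accesses in the expression), and struct fake_folio and folio_on_deferred_list() are hypothetical names, not anything in mm/.

```c
#include <stdbool.h>

/* Simplified stand-ins for the kernel macros (assumptions, not the real
 * definitions). __typeof__ is a GCC/Clang extension. */
#define READ_ONCE(x)	(*(const volatile __typeof__(x) *)&(x))
#define data_race(expr)	(expr)	/* here it only documents the intentional race */

struct list_head {
	struct list_head *next, *prev;
};

static inline void INIT_LIST_HEAD(struct list_head *list)
{
	list->next = list;
	list->prev = list;
}

/* Marked read of head->next, same shape as the kernel's list_empty(). */
static inline int list_empty(const struct list_head *head)
{
	return READ_ONCE(head->next) == head;
}

/* Plain (unmarked) writes, including new->next: this is the write side
 * that would conflict with a concurrent lockless list_empty() on 'new'. */
static inline void list_add_tail(struct list_head *new, struct list_head *head)
{
	struct list_head *prev = head->prev;

	new->next = head;	/* plain write */
	new->prev = prev;
	prev->next = new;
	head->prev = new;
}

/* Hypothetical stand-in for struct folio and its _deferred_list member. */
struct fake_folio {
	struct list_head deferred_list;
};

/* Lockless check, explicitly marked as an intentional data race. */
static bool folio_on_deferred_list(struct fake_folio *folio)
{
	return data_race(!list_empty(&folio->deferred_list));
}
```

The sketch is single-threaded, so it shows where the marking goes and says nothing about what KCSAN would report on a real kernel build.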