On Wed, Jul 26, 2017 at 06:13:15PM -0700, Nadav Amit wrote: > Nadav Amit <nadav.amit@xxxxxxxxx> wrote: > > > Minchan Kim <minchan@xxxxxxxxxx> wrote: > > > >> On Wed, Jul 26, 2017 at 05:09:09PM -0700, Nadav Amit wrote: > >>> Minchan Kim <minchan@xxxxxxxxxx> wrote: > >>> > >>>> Hello Nadav, > >>>> > >>>> On Wed, Jul 26, 2017 at 12:18:37PM -0700, Nadav Amit wrote: > >>>>> Mel Gorman <mgorman@xxxxxxx> wrote: > >>>>> > >>>>>> On Wed, Jul 26, 2017 at 02:43:06PM +0900, Minchan Kim wrote: > >>>>>>>> I'm relying on the fact you are the madv_free author to determine if > >>>>>>>> it's really necessary. The race in question is CPU 0 running madv_free > >>>>>>>> and updating some PTEs while CPU 1 is also running madv_free and looking > >>>>>>>> at the same PTEs. CPU 1 may have writable TLB entries for a page but fail > >>>>>>>> the pte_dirty check (because CPU 0 has updated it already) and potentially > >>>>>>>> fail to flush. Hence, when madv_free on CPU 1 returns, there are still > >>>>>>>> potentially writable TLB entries and the underlying PTE is still present > >>>>>>>> so that a subsequent write does not necessarily propagate the dirty bit > >>>>>>>> to the underlying PTE any more. Reclaim at some unknown time at the future > >>>>>>>> may then see that the PTE is still clean and discard the page even though > >>>>>>>> a write has happened in the meantime. I think this is possible but I could > >>>>>>>> have missed some protection in madv_free that prevents it happening. > >>>>>>> > >>>>>>> Thanks for the detail. You didn't miss anything. It can happen and then > >>>>>>> it's really bug. IOW, if application does write something after madv_free, > >>>>>>> it must see the written value, not zero. > >>>>>>> > >>>>>>> How about adding [set|clear]_tlb_flush_pending in tlb batchin interface? > >>>>>>> With it, when tlb_finish_mmu is called, we can know we skip the flush > >>>>>>> but there is pending flush, so flush focefully to avoid madv_dontneed > >>>>>>> as well as madv_free scenario. > >>>>>> > >>>>>> I *think* this is ok as it's simply more expensive on the KSM side in > >>>>>> the event of a race but no other harmful change is made assuming that > >>>>>> KSM is the only race-prone. The check for mm_tlb_flush_pending also > >>>>>> happens under the PTL so there should be sufficient protection from the > >>>>>> mm struct update being visible at teh right time. > >>>>>> > >>>>>> Check using the test program from "mm: Always flush VMA ranges affected > >>>>>> by zap_page_range v2" if it handles the madvise case as well as that > >>>>>> would give some degree of safety. Make sure it's tested against 4.13-rc2 > >>>>>> instead of mmotm which already includes the madv_dontneed fix. If yours > >>>>>> works for both then it supersedes the mmotm patch. > >>>>>> > >>>>>> It would also be interesting if Nadav would use his slowdown hack to see > >>>>>> if he can still force the corruption. > >>>>> > >>>>> The proposed fix for the KSM side is likely to work (I will try later), but > >>>>> on the tlb_finish_mmu() side, I think there is a problem, since if any TLB > >>>>> flush is performed by tlb_flush_mmu(), flush_tlb_mm_range() will not be > >>>>> executed. This means that tlb_finish_mmu() may flush one TLB entry, leave > >>>>> another one stale and not flush it. > >>>> > >>>> Okay, I will change that part like this to avoid partial flush problem. > >>>> > >>>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h > >>>> index 1c42d69490e4..87d0ebac6605 100644 > >>>> --- a/include/linux/mm_types.h > >>>> +++ b/include/linux/mm_types.h > >>>> @@ -529,10 +529,13 @@ static inline cpumask_t *mm_cpumask(struct mm_struct *mm) > >>>> * The barriers below prevent the compiler from re-ordering the instructions > >>>> * around the memory barriers that are already present in the code. > >>>> */ > >>>> -static inline bool mm_tlb_flush_pending(struct mm_struct *mm) > >>>> +static inline int mm_tlb_flush_pending(struct mm_struct *mm) > >>>> { > >>>> + int nr_pending; > >>>> + > >>>> barrier(); > >>>> - return atomic_read(&mm->tlb_flush_pending) > 0; > >>>> + nr_pending = atomic_read(&mm->tlb_flush_pending); > >>>> + return nr_pending; > >>>> } > >>>> static inline void set_tlb_flush_pending(struct mm_struct *mm) > >>>> { > >>>> diff --git a/mm/memory.c b/mm/memory.c > >>>> index d5c5e6497c70..b5320e96ec51 100644 > >>>> --- a/mm/memory.c > >>>> +++ b/mm/memory.c > >>>> @@ -286,11 +286,15 @@ bool tlb_flush_mmu(struct mmu_gather *tlb) > >>>> void tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end) > >>>> { > >>>> struct mmu_gather_batch *batch, *next; > >>>> - bool flushed = tlb_flush_mmu(tlb); > >>>> > >>>> + if (!tlb->fullmm && !tlb->need_flush_all && > >>>> + mm_tlb_flush_pending(tlb->mm) > 1) { > >>> > >>> I saw you noticed my comment about the access of the flag without a lock. I > >>> must say it feels strange that a memory barrier would be needed here, but > >>> that what I understood from the documentation. > >> > >> I saw your recent barriers fix patch, too. > >> [PATCH v2 2/2] mm: migrate: fix barriers around tlb_flush_pending > >> > >> As I commented out in there, I hope to use below here without being > >> aware of complex barrier stuff. Instead, mm_tlb_flush_pending should > >> call the right barrier inside. > >> > >> mm_tlb_flush_pending(tlb->mm, false:no-pte-locked) > 1 > > > > I will address it in v3. > > > > > >>>> + tlb->start = min(start, tlb->start); > >>>> + tlb->end = max(end, tlb->end); > >>> > >>> Err… You open-code mmu_gather which is arch-specific. It appears that all of > >>> them have start and end members, but not need_flush_all. Besides, I am not > >> > >> When I see tlb_gather_mmu which is not arch-specific, it intializes > >> need_flush_all to zero so it would be no harmful although some of > >> architecture doesn't set the flag. > >> Please correct me if I miss something. > > > > Oh.. my bad. I missed the fact that this code is under “#ifdef > > HAVE_GENERIC_MMU_GATHER”. But that means that arch-specific tlb_finish_mmu() > > implementations (s390, arm) may need to be modified as well. > > > >>> sure whether they regard start and end the same way. > >> > >> I understand your worry but my patch takes longer range by min/max > >> so I cannot imagine how it breaks. During looking the code, I found > >> __tlb_adjust_range so better to use it rather than open-code. > >> > >> > >> diff --git a/mm/memory.c b/mm/memory.c > >> index b5320e96ec51..b23188daa396 100644 > >> --- a/mm/memory.c > >> +++ b/mm/memory.c > >> @@ -288,10 +288,8 @@ void tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long e > >> struct mmu_gather_batch *batch, *next; > >> > >> if (!tlb->fullmm && !tlb->need_flush_all && > >> - mm_tlb_flush_pending(tlb->mm) > 1) { > >> - tlb->start = min(start, tlb->start); > >> - tlb->end = max(end, tlb->end); > >> - } > >> + mm_tlb_flush_pending(tlb->mm) > 1) > >> + __tlb_adjust_range(tlb->mm, start, end - start); > >> > >> tlb_flush_mmu(tlb); > >> clear_tlb_flush_pending(tlb->mm); > > > > This one is better, especially as I now understand it is only for the > > generic MMU gather (which I missed before). > > There is one issue I forgot: pte_accessible() on x86 regards > mm_tlb_flush_pending() as an indication for NUMA migration. But now the code > does not make too much sense: > > if ((pte_flags(a) & _PAGE_PROTNONE) && > mm_tlb_flush_pending(mm)) > > Either we remove the _PAGE_PROTNONE check or we need to use the atomic field > to count separately pending flushes due to migration and due to other > reasons. The first option is safer, but Mel objected to it, because of the > performance implications. The second one requires some thought on how to > build a single counter for multiple reasons and avoid a potential overflow. > > Thoughts? > I'm really new for the autoNUMA so not sure I understand your concern If your concern is that increasing places where add up pending count, autoNUMA performance might be hurt. Right? If so, above _PAGE_PROTNONE check will filter out most of cases? Maybe, Mel could answer. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>