Re: [PATCH] mm: Add new vma flag VM_LOCAL_CPU

Boaz Harrosh <boazh@xxxxxxxxxx> · Tue, 15 May 2018 16:24:05 +0300

On 15/05/18 14:54, Boaz Harrosh wrote:
> On 15/05/18 03:44, Matthew Wilcox wrote:
>> On Mon, May 14, 2018 at 02:49:01PM -0700, Andrew Morton wrote:
>>> On Mon, 14 May 2018 20:28:01 +0300 Boaz Harrosh <boazh@xxxxxxxxxx> wrote:
>>>> In this project we utilize a per-core server thread so everything
>>>> is kept local. If we use the regular zap_ptes() API, all CPUs
>>>> are scheduled for the unmap, though in our case we know that we
>>>> have only used a single core. The regular zap_ptes adds a very large
>>>> latency to every operation and mostly kills the concurrency of the
>>>> overall system, because it imposes serialization between all cores.
>>>
>>> I'd have thought that in this situation, only the local CPU's bit is
>>> set in the vma's mm_cpumask() and the remote invalidations are not
>>> performed.  Is that a misunderstanding, or is all that stuff not working
>>> correctly?
>>
>> I think you misunderstand Boaz's architecture.  He has one thread per CPU,
>> so every bit will be set in the mm's (not vma's) mm_cpumask.
>>
> 
> Hi Andrew, Matthew
> 
> Yes, I have been trying to investigate and trace this for days.
> Please see the code below:
> 
>> #define flush_tlb_range(vma, start, end)	\
>> 		flush_tlb_mm_range(vma->vm_mm, start, end, vma->vm_flags)
> 
> The mm_struct @mm below comes from here: vma->vm_mm
> 
>> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
>> index e055d1a..1d398a0 100644
>> --- a/arch/x86/mm/tlb.c
>> +++ b/arch/x86/mm/tlb.c
>> @@ -611,39 +611,40 @@ static unsigned long tlb_single_page_flush_ceiling __read_mostly = 33;
>>  void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
>>  				unsigned long end, unsigned long vmflag)
>>  {
>>  	int cpu;
>>  
>>  	struct flush_tlb_info info __aligned(SMP_CACHE_BYTES) = {
>>  		.mm = mm,
>>  	};
>>  
>>  	cpu = get_cpu();
>>  
>>  	/* This is also a barrier that synchronizes with switch_mm(). */
>>  	info.new_tlb_gen = inc_mm_tlb_gen(mm);
>>  
>>  	/* Should we flush just the requested range? */
>>  	if ((end != TLB_FLUSH_ALL) &&
>>  	    !(vmflag & VM_HUGETLB) &&
>>  	    ((end - start) >> PAGE_SHIFT) <= tlb_single_page_flush_ceiling) {
>>  		info.start = start;
>>  		info.end = end;
>>  	} else {
>>  		info.start = 0UL;
>>  		info.end = TLB_FLUSH_ALL;
>>  	}
>>  
>>  	if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
>>  		VM_WARN_ON(irqs_disabled());
>>  		local_irq_disable();
>>  		flush_tlb_func_local(&info, TLB_LOCAL_MM_SHOOTDOWN);
>>  		local_irq_enable();
>>  	}
>>  
>> -	if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
>> +	if (!(vmflag & VM_LOCAL_CPU) &&
>> +	    cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
>>  		flush_tlb_others(mm_cpumask(mm), &info);
>>  
> 
> I have been tracing mm_cpumask(vma->vm_mm) in my driver at
> different points: at vma creation (file_operations->mmap()),
> and before the call to insert_pfn (which ends up here).
> 
> At the beginning I was hoping that mm_cpumask(vma->vm_mm)
> would have a single bit set, matching the affinity of the thread at
> its creation. But then I saw that 80% of the time some
> other random bits are also set.
> 
> Yes, random. The thread affinity (single) bit was always set, but
> then zero, one or two more bits were set as well. I have never seen more
> than two though, which baffles me a lot.
> 
> If it were as Matthew said, i.e. the cpumask of the whole process,
> then I would expect all the bits to be set, because I have a thread
> on each core. I would even expect all vma->vm_mm,
> or maybe mm_cpumask(vma->vm_mm), to point to the same global object.
> But it was not so. It was pointing to some thread-unique object, but
> still those phantom bits were set all over. (And I am almost sure the
> same vma had those bits change over time.)
> 
> So I would love some mm guy to explain where those bits are collected.
> But I do not think this is a kernel bug because, as Matthew showed,
> the vma above can, and is allowed to, leak memory addresses to other
> threads / cores in the same process. So it appears that the kernel
> has some correct logic behind its madness.
> 
Hi Mark

So you see, 20% of the time mm_cpumask(vma->vm_mm) has only a single
bit set, and a natural call to flush_tlb_range will only invalidate
the local CPU. Are you familiar with this logic?
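
To make the condition concrete, here is a minimal userspace model of the
remote-flush decision in flush_tlb_mm_range() with the proposed
VM_LOCAL_CPU check applied. It is only a sketch: cpumask_model_t,
any_cpu_but() and needs_remote_flush() are made-up names, and a 64-bit
integer stands in for the real cpumask.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for mm_cpumask(mm): bit N set means CPU N may
 * hold TLB entries for this mm. Limited to 64 CPUs for the sketch.
 */
typedef uint64_t cpumask_model_t;

/* Models "cpumask_any_but(mask, cpu) < nr_cpu_ids": is any other CPU set? */
static bool any_cpu_but(cpumask_model_t mask, int cpu)
{
	assert(cpu >= 0 && cpu < 64);
	return (mask & ~(UINT64_C(1) << cpu)) != 0;
}

/*
 * Returns true when flush_tlb_others() would be called, i.e. when remote
 * CPUs would be interrupted. vm_local_cpu models (vmflag & VM_LOCAL_CPU).
 */
static bool needs_remote_flush(cpumask_model_t mask, int cpu, bool vm_local_cpu)
{
	if (vm_local_cpu)
		return false;	/* the patch skips remote CPUs unconditionally */
	return any_cpu_but(mask, cpu);
}
```

With only the local bit set, needs_remote_flush() is false even without
the flag, which matches the 20% case above. With one stray bit set it
becomes true unless the flag is passed.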

> Which brings me to another question: how can I find out from
> within a thread, say at the file_operations->mmap() call, that the thread
> is indeed core-pinned? Which mm_cpumask should I inspect?
> 

Mark, Peter, do you know how I can check the above?
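
For comparison, this is roughly what the check looks like from userspace,
using sched_getaffinity(2) and the glibc CPU_* macros. It is a sketch
only: thread_is_core_pinned() and pin_to_cpu() are hypothetical helper
names. It also says nothing about mm_cpumask, since affinity is a
per-thread property. My assumption is that the in-kernel counterpart
would look at the task (e.g. current->nr_cpus_allowed == 1) rather than
at the mm, but I have not verified that this is the right check.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <stdbool.h>

/*
 * Returns true if the calling thread's affinity mask contains exactly
 * one CPU. Also returns false if the mask cannot be read.
 */
static bool thread_is_core_pinned(void)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	/* pid 0 means "the calling thread" */
	if (sched_getaffinity(0, sizeof(set), &set) != 0)
		return false;
	return CPU_COUNT(&set) == 1;
}

/* Pins the calling thread to 'cpu'. Returns 0 on success, -1 on error. */
static int pin_to_cpu(int cpu)
{
	cpu_set_t set;

	assert(cpu >= 0 && cpu < CPU_SETSIZE);
	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	return sched_setaffinity(0, sizeof(set), &set);
}
```

A per-core server thread would call pin_to_cpu() once at startup, and
thread_is_core_pinned() should then return true for it.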

Thanks
Boaz

>>  	put_cpu();
>>  }
> 
> Thanks
> Boaz
>