Re: Deferring work in the page fault handler

"Vegard Nossum" <vegard.nossum@xxxxxxxxx> · Thu, 22 May 2008 19:59:56 +0200

On Thu, May 22, 2008 at 7:25 PM, Peter Teoh <htmldeveloper@xxxxxxxxx> wrote:
> Thanks for the reply.   I would appreciate if someone can help to
> clear just a few more doubts....
>

Hi, no problem :-)

> On Thu, May 22, 2008 at 7:31 PM, Vegard Nossum <vegard.nossum@xxxxxxxxx> wrote:
>> On 5/22/08, Peter Teoh <htmldeveloper@xxxxxxxxx> wrote:
>>>  d.   any problem with multi-CPU, PAE scenario?
>>>
>>
>> We will disable all but one CPU at run-time if the kernel was compiled
>> with CONFIG_SMP=y. This is because there is a race between CPUs if one
>> of them is modifying the page tables and the page table change "leaks"
>> into other TLBs.
>>
>
> sorry i don't understand this.
>
> just to confirm this:   In linux kernel, there is only one kernel
> pagetable, shared by all the different processes, and all the
> different CPUs right?

Correct.

>
> So the current kernel is definitely able to handle concurrent
> modification of the pagetable, right? (Either via locks or a lockless
> algorithm.) I mean, for example, suppose the PT has multiple locks,
> one per region of memory (split either by GFP or at node level). If
> one CPU is modifying the PT, then another CPU will block if it tries
> to lock the same region of memory, but otherwise it can just go ahead
> and read/write another region of memory, owned by a different set of
> locks... I may not be right... so in the context of kmemcheck, how
> does the race arise?
>

Okay, so the main problem is this: we can take a lock before changing
the page table itself, but we cannot take a lock on the memory location
before it is modified, because it can be modified from anywhere, on
any CPU!

So imagine this scenario: We have two tasks A and B on different CPUs.

Task A accesses some memory location which is being tracked by
kmemcheck. This access triggers a page fault and in the page fault
handler, we lock the page (where the lock is doesn't really matter).
Then we mark the PTE present.

Now task B comes along and accesses the very same memory location.
Since task B's CPU doesn't have a translation for this page cached in
its TLB, it looks the PTE up from RAM. Ah -- the PTE is present; the
CPU can happily access this memory location, and no page fault is
generated, so task B never even attempts to take the lock.

(Now task A restarts the faulting instruction, marks the PTE
non-present and unlocks the page lock.)
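
To make the interleaving concrete, here is a minimal user-space sketch
of the scenario above. It is only a model, not kmemcheck code: the
struct, the function names and the counters are all made up for
illustration, and the "CPUs" are just two calls replayed in the
problematic order.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: one PTE shared by every CPU, plus a page lock
 * that is only ever taken from inside the page fault handler. */
struct shared_page {
    bool pte_present;  /* false = every access should fault */
    bool locked;       /* the page lock, taken only in the fault path */
    int  tracked;      /* accesses that went through the fault handler */
    int  untracked;    /* accesses that slipped past unnoticed */
};

/* Fault path, first half: lock the page and mark the PTE present. */
static void fault_enter(struct shared_page *p)
{
    assert(!p->locked);
    p->locked = true;
    p->pte_present = true;
    p->tracked++;
}

/* Fault path, second half: after the faulting instruction has been
 * restarted, mark the PTE non-present again and drop the lock. */
static void fault_exit(struct shared_page *p)
{
    assert(p->locked);
    p->pte_present = false;
    p->locked = false;
}

/* One memory access from some CPU. Returns true if it faulted. */
static bool access_page(struct shared_page *p)
{
    if (p->pte_present) {
        /* No fault, so the lock is never even looked at. */
        p->untracked++;
        return false;
    }
    fault_enter(p);
    return true;
}

/* Replays the A/B interleaving; returns the number of accesses
 * that were never seen by the fault handler. */
static int replay_race(void)
{
    struct shared_page p = { false, false, 0, 0 };

    bool a_faulted = access_page(&p);  /* task A: faults, PTE now present */
    bool b_faulted = access_page(&p);  /* task B: lands inside A's window */
    fault_exit(&p);                    /* task A: hides the page again */

    assert(a_faulted && !b_faulted);
    return p.untracked;
}
```

In this model replay_race() reports one untracked access: task B's.
Making the lock finer-grained doesn't help, because B never reaches any
code that would take it.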

Do you see a way around this? The race window is admittedly incredibly
small. But it's a race :-)

This is why we need to duplicate the page tables. Then one CPU can
change the PTE to present without affecting any of the other CPUs in
the system. If you can think of another way to do this... :-)

(Note: It may not be necessary to duplicate the _whole_ page-table
structure. I haven't pursued this thought yet.)
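
As a rough illustration of the duplicated-page-table idea, here is the
same kind of toy model with one private copy of the PTE per CPU. Again,
this is a sketch under assumptions, not how kmemcheck is implemented:
NCPUS, the struct and the function names are hypothetical, and the
model says nothing about how the per-CPU copies would be kept in sync
with the real page tables.

```c
#include <assert.h>
#include <stdbool.h>

#define NCPUS 2  /* hypothetical CPU count for this model */

/* Hypothetical model: every CPU has its own copy of the PTE. */
struct percpu_page {
    bool pte_present[NCPUS];  /* private present bit per CPU */
    int  tracked;             /* accesses seen by the fault handler */
    int  untracked;           /* accesses that slipped past unnoticed */
};

/* One memory access from the given CPU. Returns true if it faulted. */
static bool access_on_cpu(struct percpu_page *p, int cpu)
{
    assert(cpu >= 0 && cpu < NCPUS);
    if (p->pte_present[cpu]) {
        p->untracked++;
        return false;
    }
    /* Fault: unhide the page, but only in this CPU's page table. */
    p->pte_present[cpu] = true;
    p->tracked++;
    return true;
}

/* After the faulting instruction has been restarted on this CPU,
 * hide the page again in this CPU's page table only. */
static void finish_fault_on_cpu(struct percpu_page *p, int cpu)
{
    assert(cpu >= 0 && cpu < NCPUS);
    p->pte_present[cpu] = false;
}

/* Replays the same A/B interleaving with private page tables;
 * returns the number of accesses that were never tracked. */
static int replay_with_private_tables(void)
{
    struct percpu_page p = { { false, false }, 0, 0 };

    access_on_cpu(&p, 0);        /* task A on CPU 0: faults */
    access_on_cpu(&p, 1);        /* task B on CPU 1: faults too */
    finish_fault_on_cpu(&p, 0);
    finish_fault_on_cpu(&p, 1);

    return p.untracked;
}
```

Here replay_with_private_tables() reports zero untracked accesses,
since CPU 0 marking its own copy present is invisible to CPU 1. Both
fault handlers now run at the same time, so whatever they share would
still need a lock of its own.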

Vegard

-- 
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
	-- E. W. Dijkstra, EWD1036
