Re: Deferring work in the page fault handler

"Peter Teoh" <htmldeveloper@xxxxxxxxx> · Fri, 23 May 2008 12:27:56 +0800

Wow, I understand this better now, but still not well enough to answer some questions.

On Fri, May 23, 2008 at 1:59 AM, Vegard Nossum <vegard.nossum@xxxxxxxxx> wrote:
> On Thu, May 22, 2008 at 7:25 PM, Peter Teoh <htmldeveloper@xxxxxxxxx> wrote:
>> Thanks for the reply.  I would appreciate it if someone could help
>> clear up just a few more doubts.
>>
>
> Hi, no problem :-)
>
>> On Thu, May 22, 2008 at 7:31 PM, Vegard Nossum <vegard.nossum@xxxxxxxxx> wrote:
>>> On 5/22/08, Peter Teoh <htmldeveloper@xxxxxxxxx> wrote:
>>>>  d.   any problem with multi-CPU, PAE scenario?
>>>>
>>>
>>> We will disable all but one CPU at run-time if the kernel was compiled
>>> with CONFIG_SMP=y. This is because there is a race between CPUs if one
>>> of them is modifying the page tables and the page table change "leaks"
>>> into other TLBs.
>>>
>>
>> Sorry, I don't understand this.
>>
>> Just to confirm: in the Linux kernel, there is only one kernel
>> pagetable, shared by all the different processes and all the
>> different CPUs, right?
>
> Correct.
>
>>
>> So the current kernel is definitely able to handle concurrent
>> modification of the pagetable, right? (Either via locks or a lockless
>> algorithm.) I mean, for example, suppose the PT has multiple locks for
>> different regions of memory (either per GFP type or at node level). If
>> one CPU is modifying the PT, then another CPU will block if it tries
>> to lock the same region of memory, but otherwise it can just go ahead
>> and read/write another region of memory, owned by a different set of
>> locks. I may not be right. So in the context of kmemcheck, how does
>> the race arise?
>>
>
> Okay, so the main problem is -- we can lock before changing the page
> table itself, but we cannot lock the memory location before it is
> modified -- because it can be modified from anywhere on any cpu!
>
> So imagine this scenario: We have two tasks A and B on different CPUs.
>
> Task A accesses some memory location which is being tracked by
> kmemcheck. This access triggers a page fault and in the page fault
> handler, we lock the page (where the lock is doesn't really matter).
> Then we mark the PTE present.
>
> Now task B comes along and accesses the very same memory location.
> Since task B didn't have this page in the cache, it looks it up from
> RAM. Ah -- the PTE is present; the CPU can happily access this memory
> location, and no page fault is generated, so the lock is never even
> attempted to be taken.
>
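The interleaving in the quoted scenario can be replayed step by step in a small user-space model. This is only a sketch under stated assumptions: `struct shared_pte`, `access_faults`, `fault_enter`, `fault_exit` and `race_window_hit` are hypothetical names invented for illustration, not kmemcheck or kernel APIs, and the two "CPUs" are sequential calls rather than real concurrency.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of ONE page-table entry shared by every CPU.
 * Not real kernel code; the names are made up for this sketch. */
struct shared_pte {
    bool present;   /* models the hardware "present" bit */
    bool locked;    /* lock that is only ever taken inside the fault handler */
    int  faults;    /* number of page faults taken on this page */
};

/* Model of a memory access: it faults only if the PTE is not present. */
bool access_faults(struct shared_pte *pte)
{
    if (pte->present)
        return false;       /* hardware proceeds; the handler never runs */
    pte->faults++;
    return true;
}

/* First half of task A's fault handling: lock, then mark present so the
 * faulting instruction can be restarted. */
void fault_enter(struct shared_pte *pte)
{
    assert(!pte->locked);
    pte->locked = true;
    pte->present = true;
}

/* Second half: mark non-present again and drop the lock. */
void fault_exit(struct shared_pte *pte)
{
    pte->present = false;
    pte->locked = false;
}

/* Replays the scenario; returns true if task B's access went untracked. */
bool race_window_hit(void)
{
    struct shared_pte pte = { false, false, 0 };
    bool b_faulted;

    if (access_faults(&pte))            /* task A faults ... */
        fault_enter(&pte);              /* ... and opens the window */
    b_faulted = access_faults(&pte);    /* task B accesses inside the window */
    fault_exit(&pte);                   /* task A closes the window */

    return !b_faulted && pte.faults == 1;
}
```

In this model `race_window_hit()` returns true: task B never faults, so it never reaches the code that would try to take the lock, which is the point of the scenario above.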

That example fully explains what the key problem is. To rephrase it:
the problem is to track ALL reads of the memory and, for each of these
reads, check whether a write has happened before it or not. Correct?
And you need to mark the memory as NP (not present) so that the
hardware faulting mechanism kicks in. But I suspect that may not be
necessary, and is overkill.

OK. Now, the kernel part of the address space is a common pool among
the CPUs, so every CPU can see and modify what the others do. The
bottom line is that you want to identify ALL reads of unwritten memory,
and each one should generate a fault, correct? So this is the key:
identify every read of unwritten memory. As long as the memory is
unwritten, it should remain NP (not present), i.e. an attempted read
should generate a fault, regardless of whether the memory is allocated
or not (currently the NP flag is used to distinguish allocated from
non-allocated state, right?)

Similarly, other scenarios are possible: as long as there has been one
write, no CPU should fault, and all reads should succeed. So in all
those other scenarios we don't need the NP flag to be set; reads will
always go ahead, because there has already been at least one write.
Which CPU did it, we don't really need to know. Am I right?

Does this sound logical?
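
The rule proposed above (a read of never-written memory is reported, and once a location has been written every later read succeeds) can be sketched as a tiny user-space model. This is an illustration only, assuming the written/unwritten state is kept per byte in a separate shadow array; `struct tracked_region`, `tracked_write` and `tracked_read` are hypothetical names, not kmemcheck functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TRACKED_BYTES 16

/* Hypothetical tracked region: the data plus one "written" flag per byte. */
struct tracked_region {
    unsigned char data[TRACKED_BYTES];
    bool written[TRACKED_BYTES];    /* shadow state: ever written? */
    int bad_reads;                  /* reads of never-written bytes */
};

/* A write stores the value and marks the byte as written. */
void tracked_write(struct tracked_region *r, size_t off, unsigned char v)
{
    assert(off < TRACKED_BYTES);
    r->data[off] = v;
    r->written[off] = true;
}

/* A read always returns the data, but counts the access as an error if
 * the byte was never written. Returns true if the byte was written. */
bool tracked_read(struct tracked_region *r, size_t off, unsigned char *out)
{
    assert(off < TRACKED_BYTES);
    if (!r->written[off])
        r->bad_reads++;
    *out = r->data[off];
    return r->written[off];
}
```

Note that the model keeps its state per byte, while the present bit in a PTE covers a whole page, so "written at least once" for one location says nothing about its neighbours in the same page.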

Now rephrasing your problem statement:

> Task A accesses some memory location which is being tracked by
> kmemcheck. This access triggers a page fault and in the page fault
> handler, we lock the page (where the lock is doesn't really matter).
> Then we mark the PTE present.

Let me make a guess: the reason you mark it present is to keep
execution on the normal flow, right? (I.e. you don't want to trigger
the exception path.) So even if the memory was never written before and
task A is reading it for the first time, you just want it to execute as
normal, i.e. with no exception. That requires the memory to be marked
present, and then task A can continue, task B can continue, etc.

But personally I think otherwise: as long as the memory has not been
written before, just let it remain not-present and continue down the
exception path of execution. This design will not identify all errors
in one go, but one by one, until there are no more errors.

So this is the tradeoff in my suggestion. Not sure if you agree with me
or not :-)

> (Now task A restarts the faulting instruction, marks the PTE
> non-present and unlocks the page lock.)
>
> Do you see a way around this? The race window is admittedly incredibly
> small. But it's a race :-)
>
> This is why we need to duplicate the page tables. Then one CPU can
> change the PTE to present without affecting any of the other CPUs in
> the system. If you can think of another way to do this... :-)
>
> (Note: It may not be necessary to duplicate the _whole_ page-table
> structure. I didn't pursue this thought yet.)
>
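The effect of duplicating the page tables can be sketched in the same style: if each CPU has its own copy of the entry, marking it present on one CPU leaves the other CPU's copy non-present, so the second access still faults. This is a hedged user-space model, not kernel code; `struct percpu_pte`, `percpu_access_faults`, `percpu_set_present` and `percpu_b_still_faults` are hypothetical names, and two CPUs are assumed.

```c
#include <assert.h>
#include <stdbool.h>

#define NCPUS 2

/* Hypothetical model: one private copy of the PTE per CPU. */
struct percpu_pte {
    bool present[NCPUS];
    int  faults[NCPUS];
};

/* An access on `cpu` consults only that CPU's copy of the entry. */
bool percpu_access_faults(struct percpu_pte *pte, int cpu)
{
    assert(cpu >= 0 && cpu < NCPUS);
    if (pte->present[cpu])
        return false;
    pte->faults[cpu]++;
    return true;
}

/* Changing the present bit touches only the given CPU's copy. */
void percpu_set_present(struct percpu_pte *pte, int cpu, bool present)
{
    assert(cpu >= 0 && cpu < NCPUS);
    pte->present[cpu] = present;
}

/* Replays the A/B scenario; returns true if task B (CPU 1) still faults
 * while task A (CPU 0) has its own copy marked present. */
bool percpu_b_still_faults(void)
{
    struct percpu_pte pte = { { false, false }, { 0, 0 } };
    bool b_faulted;

    if (percpu_access_faults(&pte, 0))      /* task A faults on CPU 0 */
        percpu_set_present(&pte, 0, true);  /* present for CPU 0 only */
    b_faulted = percpu_access_faults(&pte, 1);  /* task B on CPU 1 */
    percpu_set_present(&pte, 0, false);     /* task A marks it non-present */

    return b_faulted;
}
```

Here `percpu_b_still_faults()` returns true, so task B's access is seen by the fault handler instead of slipping through the window.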

-- 
Regards,
Peter Teoh
