Re: Deferring work in the page fault handler

"Peter Teoh" <htmldeveloper@xxxxxxxxx> · Fri, 23 May 2008 01:25:32 +0800

Thanks for the reply. I would appreciate it if someone could help
clear up just a few more doubts.

On Thu, May 22, 2008 at 7:31 PM, Vegard Nossum <vegard.nossum@xxxxxxxxx> wrote:
> On 5/22/08, Peter Teoh <htmldeveloper@xxxxxxxxx> wrote:
>> On Wed, May 21, 2008 at 6:04 PM, Vegard Nossum <vegard.nossum@xxxxxxxxx> wrote:
>>  > In the kmemcheck code I take a lot of page faults from any kernel
>>  > context (with interrupts enabled or disabled). This means that there
>>  > are a lot of things I can't do. Taking locks is dangerous while
>>  > handling a page fault occurring in interrupt context. In addition to
>>  > this, I must _not_ access any memory allocated by kmalloc(), as this
>>  > may generate a new (recursive) page fault.
>>  >
>>  > Currently, I am deferring work to be done later by using a timer that
>>  > triggers every HZ. This allows me to do what I want in the right
>>  > context, e.g. interrupts enabled and no locks taken.
>>  >
>>  > However, the timer triggers even when I don't need it, and once a
>>  > second is usually too slow when I actually do need it. So I am looking
>>  > for a way to schedule my deferred work as soon as interrupts are
>>  > enabled again in the context that caused a page fault.
>>  >
>>  > I was reading Matthew Wilcox's paper on softirqs, tasklets, bottom
>>  > halves, task queues, work queues, and timers. But I am still a little
>>  > unsure of the best way to proceed. My requirement of not accessing
>>  > dynamically allocated memory seems unprecedented in the kernel. E.g.,
>>  > one of my earliest attempts included using a kernel thread and waking
>>  > it up from the page fault handler, but this did not work because
>>  > adding the kthread to a runqueue would access dynamically allocated
>>  > memory.
>>
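The constraint described above (no locks, and no kmalloc()'d memory touched from the fault handler) can be illustrated with a statically allocated ring buffer that the handler fills and a safer context drains later. This is only a minimal user-space sketch, not the kmemcheck code: the names defer_ring, defer_push, defer_drain and DEFER_SLOTS are hypothetical, and it assumes a single CPU with no nested faults in the middle of a push.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEFER_SLOTS 8 /* hypothetical capacity, chosen only for illustration */

/* One record per fault whose handling is postponed. */
struct deferred_fault {
    unsigned long address;
};

/* Static storage: nothing here comes from a dynamic allocator. */
static struct deferred_fault defer_ring[DEFER_SLOTS];
static size_t defer_head; /* total number of records pushed so far */
static size_t defer_tail; /* total number of records drained so far */

/* Fault-handler side: takes no locks and allocates nothing.
 * Returns false (the record is dropped) when the ring is full. */
static bool defer_push(unsigned long address)
{
    if (defer_head - defer_tail == DEFER_SLOTS)
        return false;
    defer_ring[defer_head % DEFER_SLOTS].address = address;
    defer_head++;
    return true;
}

/* Safe-context side: copies up to max pending records into out
 * and returns how many were copied. */
static size_t defer_drain(struct deferred_fault *out, size_t max)
{
    size_t n = 0;

    assert(out != NULL);
    while (defer_tail != defer_head && n < max) {
        out[n++] = defer_ring[defer_tail % DEFER_SLOTS];
        defer_tail++;
    }
    return n;
}
```

The sketch only covers where the records live. What should trigger the drain promptly, without waking a kthread or polling from a timer, is exactly the open question in the mail above.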
>>
>> I have not read the patch yet, but this concept interests me very much:
>>
>>  a.   If you track every read before it is written, how do you know
>>  whether it has been written or not?   I.e., for each write, do you
>>  have to set a bit to indicate that the byte of memory has been
>>  written?   Or is it done at the word/page level?
>
> Yep, we catch all accesses, both reads and writes. So on write, we set
> a bit, and on read, we check that the bit is set. (We actually have a
> few more states, but that's the basic idea, yeah.)
>
> The granularity of initialized/uninitialized is on the byte level. It
> would be too hard to do this for bit level granularity since we are
> not emulating the code (like valgrind does).
>
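To check my understanding of the byte-level scheme described above (set a state on write, check it on read), here is a minimal user-space sketch. It is not the kmemcheck implementation: the struct tracked_region layout, the function names and REGION_SIZE are hypothetical, it keeps only two states rather than the "few more" mentioned, and it calls the tracking functions explicitly where kmemcheck relies on page faults to intercept the access.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define REGION_SIZE 64 /* hypothetical size of one tracked allocation */

/* One shadow byte per data byte: 0 = uninitialized, 1 = initialized. */
struct tracked_region {
    unsigned char data[REGION_SIZE];
    unsigned char shadow[REGION_SIZE];
};

/* Mark the whole region as uninitialized, as for a fresh allocation. */
static void region_reset(struct tracked_region *r)
{
    memset(r->shadow, 0, sizeof r->shadow);
}

/* A write stores the data and marks exactly the written bytes. */
static void tracked_write(struct tracked_region *r, size_t off,
                          const void *src, size_t len)
{
    assert(off <= REGION_SIZE && len <= REGION_SIZE - off);
    memcpy(r->data + off, src, len);
    memset(r->shadow + off, 1, len);
}

/* A read returns true only if every byte it touches was written before. */
static bool tracked_read(const struct tracked_region *r, size_t off,
                         void *dst, size_t len)
{
    bool initialized = true;

    assert(off <= REGION_SIZE && len <= REGION_SIZE - off);
    for (size_t i = 0; i < len; i++) {
        if (!r->shadow[off + i])
            initialized = false;
    }
    memcpy(dst, r->data + off, len);
    return initialized;
}
```

With this layout, a read that straddles written and unwritten bytes is reported as uninitialized, which is what byte-level granularity implies.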
>>
>>  b.   it is only for kernel memory - right?  process memory may be
>>  swapped out, a huge performance tradeoff to make to do that.
>
> Yes, only for kernel memory allocated using kmalloc() or kmem_cache_alloc().
>
>>
>>  c.   How about DMA memory?   (Hardware devices will write to it,
>>  which will not trigger the normal page table mechanism, so it is
>>  not possible to capture writes to this memory?)
>
> Yep, this is entirely correct. We do have this exact problem; the
> solution is to annotate these memory areas by allocating them using
> the __GFP_NOTRACK flag. This item is discussed in the
> Documentation/kmemcheck.txt file of the patch.
>
>>
>>  d.   any problem with multi-CPU, PAE scenario?
>>
>
> We will disable all but one CPU at run-time if the kernel was compiled
> with CONFIG_SMP=y. This is because there is a race between CPUs if one
> of them is modifying the page tables and the page table change "leaks"
> into other TLBs.
>

Sorry, I don't understand this.

Just to confirm: in the Linux kernel, there is only one kernel
page table, shared by all the different processes and all the
different CPUs, right?

So the current kernel is definitely able to handle concurrent
modification of the page table, right (either via locks or a lockless
algorithm)? I mean, for example, suppose the PT has multiple locks for
different regions of memory (either at the GFP or node level). If one
CPU is modifying the PT, then another CPU will block if it attempts to
lock the same region of memory, but otherwise it can just go ahead and
read/write another region of memory covered by a different set of
locks. I may not be right. So in the context of kmemcheck, how does
the race arise?

> A proposed solution here is to make a copy of all the page tables for
> each CPU in the system. This is a rather heavy and difficult change to
> make, so I am not doing it for now :-) This item is also discussed in
> the Documentation/kmemcheck.txt file.
>
> PAE/PSE is fine; when a page is being tracked, we split it to 4k
> physical pages. This used to be a big problem but now I think we are
> finally there :-)
>
> The current tree can be found at:
> http://git.kernel.org/?p=linux/kernel/git/vegard/kmemcheck.git;a=shortlog;h=current
>
> I won't get angry if you decide to try it out ;-)

Thank you very much.....have a nice day :-).

-- 
Regards,
Peter Teoh
