Re: Deferring work in the page fault handler

"Vegard Nossum" <vegard.nossum@xxxxxxxxx> · Thu, 22 May 2008 13:31:30 +0200

On 5/22/08, Peter Teoh <htmldeveloper@xxxxxxxxx> wrote:
> On Wed, May 21, 2008 at 6:04 PM, Vegard Nossum <vegard.nossum@xxxxxxxxx> wrote:
>  > In the kmemcheck code I take a lot of page faults from any kernel
>  > context (with interrupts enabled or disabled). This means that there
>  > are a lot of things I can't do. Taking locks is dangerous while
>  > handling a page fault occurring in interrupt context. In addition to
>  > this, I must _not_ access any memory allocated by kmalloc(), as this
>  > may generate a new (recursive) page fault.
>  >
>  > Currently, I am deferring work to be done later by using a timer that
>  > triggers every HZ jiffies. This allows me to do what I want in the right
>  > context, i.e. interrupts enabled and no locks taken.
>  >
>  > However, the timer triggers even when I don't need it, and once a
>  > second is usually too slow when I actually do need it. So I am looking
>  > for a way to schedule my deferred work as soon as interrupts are
>  > disabled in the context that caused a page fault.
>  >
>  > I was reading Matthew Wilcox's paper on softirqs, tasklets, bottom
>  > halves, task queues, work queues, and timers. But I am still a little
>  > unsure of the best way to proceed. My requirement of not accessing
>  > dynamically allocated memory seems unprecedented in the kernel. E.g.,
>  > one of my earliest attempts included using a kernel thread and waking
>  > it up from the page fault handler, but this did not work because
>  > adding the kthread to a runqueue would access dynamically allocated
>  > memory.
>
>
> I have not read the patch yet, but this concept interests me very much:
>
>  a.   If you track every read before it is written, how do you know
>  whether it has been written or not?  I.e., for each write, do you have
>  to set a bit to indicate that the byte of memory has been written?  Or
>  is it done at the word/page level?

Yep, we catch all accesses, both reads and writes. So on write, we set
a bit, and on read, we check that the bit is set. (We actually have a
few more states, but that's the basic idea, yeah.)

The granularity of initialized/uninitialized tracking is at the byte
level. It would be too hard to do this at bit-level granularity since
we are not emulating the code (like valgrind does).
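
As a rough illustration of the "set a bit on write, check the bit on
read" idea at byte granularity, here is a small user-space sketch. It is
not kmemcheck's code: the names region, shadow, tracked_write() and
tracked_read_ok() are made up for the example, and it keeps only the
two basic states (written / not written), not the extra states
mentioned above. In kmemcheck the accesses are caught via page faults;
here the caller simply goes through the helper functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define REGION_SIZE 64  /* hypothetical size of the tracked region, in bytes */

static uint8_t region[REGION_SIZE];            /* the tracked memory */
static uint8_t shadow[(REGION_SIZE + 7) / 8];  /* one "written" bit per byte */

/* Write len bytes at offset off and mark each of them as written. */
static void tracked_write(size_t off, const void *src, size_t len)
{
    assert(off + len <= REGION_SIZE);
    memcpy(region + off, src, len);
    for (size_t i = off; i < off + len; i++)
        shadow[i / 8] |= (uint8_t)(1u << (i % 8));
}

/* Return 1 if every byte in [off, off + len) has been written, else 0. */
static int tracked_read_ok(size_t off, size_t len)
{
    assert(off + len <= REGION_SIZE);
    for (size_t i = off; i < off + len; i++)
        if (!(shadow[i / 8] & (1u << (i % 8))))
            return 0;
    return 1;
}
```

After writing 4 bytes at offset 4, tracked_read_ok(4, 4) reports the
read as fine, while a read that overlaps byte 3 or byte 8 is flagged,
because those bytes were never written.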

>
>  b.   It is only for kernel memory, right?  Process memory may be
>  swapped out, so doing that would be a huge performance tradeoff.

Yes, only for kernel memory allocated using kmalloc() or kmem_cache_alloc().

>
>  c.   How about DMA memory?  (Hardware devices will write to it,
>  which will not trigger the normal page table mechanism, so it is
>  not possible to capture writes to this memory?)

Yep, this is entirely correct. We do have this exact problem; the
solution is to annotate these memory areas by allocating them using
the __GFP_NOTRACK flag. This item is discussed in the
Documentation/kmemcheck.txt file of the patch.

>
>  d.   any problem with multi-CPU, PAE scenario?
>

We will disable all but one CPU at run-time if the kernel was compiled
with CONFIG_SMP=y. This is because there is a race between CPUs if one
of them is modifying the page tables and the page table change "leaks"
into other TLBs.

A proposed solution here is to make a copy of all the page tables for
each CPU in the system. This is a rather heavy and difficult change to
make, so I am not doing it for now :-) This item is also discussed in
the Documentation/kmemcheck.txt file.

PAE/PSE is fine; when a page is being tracked, we split it into 4k
physical pages. This used to be a big problem but now I think we are
finally there :-)
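
To give a feel for the numbers involved in such a split: on x86 a large
page is 4 MiB with PSE (no PAE) and 2 MiB with PAE, so one large page
becomes 1024 or 512 4k pages respectively. The sketch below only does
that arithmetic; split_count() and small_page_index() are hypothetical
helper names for the example and do not touch any page tables.

```c
#include <assert.h>

#define SMALL_PAGE_SIZE 4096UL

/* Number of 4k pages that one large page of large_size bytes splits into. */
static unsigned long split_count(unsigned long large_size)
{
    /* large_size must be a power of two and at least one 4k page */
    assert(large_size >= SMALL_PAGE_SIZE);
    assert((large_size & (large_size - 1)) == 0);
    return large_size / SMALL_PAGE_SIZE;
}

/* Index of the 4k page, within the split large page, that covers addr. */
static unsigned long small_page_index(unsigned long addr,
                                      unsigned long large_size)
{
    assert((large_size & (large_size - 1)) == 0);
    return (addr & (large_size - 1)) / SMALL_PAGE_SIZE;
}
```

For example, split_count(2UL << 20) gives 512 for a 2 MiB page, and an
address 0x1000 bytes into a large page falls in 4k page number 1.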

The current tree can be found at:
http://git.kernel.org/?p=linux/kernel/git/vegard/kmemcheck.git;a=shortlog;h=current

I won't get angry if you decide to try it out ;-)

Vegard

-- 
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
	-- E. W. Dijkstra, EWD1036
