syscall nesting

"Fernando Apesteguía" <fernando.apesteguia@xxxxxxxxx> · Sun, 4 Jun 2006 20:23:54 +0200

Thanks for your reply and sorry for my insistence :) 

This code is from oprofile-0.9.1\oprofile-0.9.1\module\x86\op_syscalls.c

void op_save_syscalls(void)
{
    old_sys_fork = sys_call_table[__NR_fork];
    old_sys_vfork = sys_call_table[__NR_vfork];
    old_sys_clone = sys_call_table[__NR_clone];
    old_sys_execve = sys_call_table[__NR_execve];
    old_old_mmap = sys_call_table[__NR_mmap];
#ifdef HAVE_MMAP2
    old_sys_mmap2 = sys_call_table[__NR_mmap2];
#endif
    old_sys_init_module = sys_call_table[__NR_init_module];
    old_sys_exit = sys_call_table[__NR_exit];
}

void op_intercept_syscalls(void)
{
    sys_call_table[__NR_fork] = my_sys_fork;
    sys_call_table[__NR_vfork] = my_sys_vfork;
    sys_call_table[__NR_clone] = my_sys_clone;
    sys_call_table[__NR_execve] = my_sys_execve;
    sys_call_table[__NR_mmap] = my_old_mmap;
#ifdef HAVE_MMAP2
    sys_call_table[__NR_mmap2] = my_sys_mmap2;
#endif
    sys_call_table[__NR_init_module] = my_sys_init_module;
    sys_call_table[__NR_exit] = my_sys_exit;
}

So I think that they "fake" the system (not sure if this is the proper
word, sorry), because they replace the original syscalls with their
own.
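
For readers following along, the save/intercept/restore pattern above can be tried outside the kernel. The following is only a user-space sketch, not OProfile code: call_table, real_call, my_call, dispatch and hits are hypothetical names standing in for sys_call_table, the original syscall, the my_sys_* wrapper, the syscall entry path and the profiler's bookkeeping.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for sys_call_table: a table with a single slot. */
typedef long (*call_fn)(long);

enum { NR_DEMO = 0, NR_CALLS = 1 };

/* Plays the role of the original syscall. */
static long real_call(long x) { return x + 1; }

static call_fn call_table[NR_CALLS] = { real_call };
static call_fn old_call;   /* saved original, like old_sys_mmap2 */
static int hits;           /* number of calls the wrapper has observed */

/* Wrapper: forward to the saved original, then record the event. */
static long my_call(long x)
{
    long ret;

    assert(old_call != NULL);   /* save_calls() must have run first */
    ret = old_call(x);
    hits++;
    return ret;
}

static void save_calls(void)      { old_call = call_table[NR_DEMO]; }
static void intercept_calls(void) { call_table[NR_DEMO] = my_call; }
static void restore_calls(void)   { call_table[NR_DEMO] = old_call; }

/* Callers always go through the table, so they see whatever is in the slot. */
static long dispatch(long x) { return call_table[NR_DEMO](x); }
```

Because every caller indexes the table at call time, swapping one pointer is enough to redirect all later calls; the caller still gets the original's return value and cannot tell the difference.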

Regardless of the approach and the "ethical" questions... could you tell me why this works?

asmlinkage static int my_sys_mmap2(ulong addr, ulong len,
    ulong prot, ulong flags, ulong fd, ulong pgoff)
{
    int ret;

    MOD_INC_USE_COUNT;
    ret = old_sys_mmap2(addr, len, prot, flags, fd, pgoff);
    if ((prot & PROT_EXEC) && ret >= 0)
        out_mmap(ret, len, prot, flags, fd, pgoff << PAGE_SHIFT);
    MOD_DEC_USE_COUNT;
    return ret;
}

Well, the module use counter is incremented and decremented to keep
track of the syscalls in progress, but is the code that follows
safe?

void op_restore_syscalls(void)
{
    sys_call_table[__NR_fork] = old_sys_fork;
    sys_call_table[__NR_vfork] = old_sys_vfork;
    sys_call_table[__NR_clone] = old_sys_clone;
    sys_call_table[__NR_execve] = old_sys_execve;
    sys_call_table[__NR_mmap] = old_old_mmap;
#ifdef HAVE_MMAP2
    sys_call_table[__NR_mmap2] = old_sys_mmap2;
#endif
    sys_call_table[__NR_init_module] = old_sys_init_module;
    sys_call_table[__NR_exit] = old_sys_exit;
}

I suppose it is, since Oprofile is in fact widely used... but I can't imagine why.
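
One concrete way a blind restore can go wrong is when two modules hook the same slot and unload in the order they loaded. The following is a user-space sketch under that assumption, not OProfile code; table_slot, real_call and the two "modules" A and B are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef long (*call_fn)(long);

/* Plays the role of the original syscall. */
static long real_call(long x) { return x; }

/* One hypothetical table entry, initially pointing at the original. */
static call_fn table_slot = real_call;

/* Two independent hypothetical "modules", each saving what it found. */
static call_fn a_old, b_old;

static long a_wrap(long x) { assert(a_old != NULL); return a_old(x) + 1; }
static long b_wrap(long x) { assert(b_old != NULL); return b_old(x) + 10; }

static void a_load(void)   { a_old = table_slot; table_slot = a_wrap; }
static void a_unload(void) { table_slot = a_old; }   /* blind restore */
static void b_load(void)   { b_old = table_slot; table_slot = b_wrap; }
static void b_unload(void) { table_slot = b_old; }   /* blind restore */

static int slot_is_original(void)   { return table_slot == real_call; }
static int slot_points_into_a(void) { return table_slot == a_wrap; }
```

If A loads, B loads, and then A unloads, A's restore silently removes B's hook. When B unloads afterwards it writes back the pointer it saved, which is a_wrap: the slot now points into a module that is already gone. In the kernel that would be a jump into freed module memory. The use counter in the wrapper does not help here, since it only counts calls that have already entered the wrapper.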

Best regards

(And many thanks again for your fast, really fast replies)

---------- Forwarded message ----------
From: Arjan van de Ven <arjan@xxxxxxxxxxxxx>
Date: Jun 4, 2006 8:06 PM
Subject: Re: syscall nesting
To: Fernando Apesteguía <fernando.apesteguia@xxxxxxxxx>
Cc: kernelnewbies@xxxxxxxxxxxx

On Sun, 2006-06-04 at 19:50 +0200, Fernando Apesteguía wrote:
> Maybe it is not as simple.
>
> I asked a technical question about a real problem

replacing/inserting system calls is a real problem. Trust me, I don't
underestimate the extent of that problem. These problems are the reason
the system call table is not exported to modules; you simply cannot do
it correctly. That's not just my opinion, but people far smarter than me
(say Linus) also agree there.

> I know that this can be used for malicious software (viruses,
> trojans...)

yup.

> but it is not the case. I know for example, that the use of _syscall*
> is not recommended.... well but what if I want to use it to learn more
> about that? And it is widely known that the use of "goto" is not a
> good programming practice and the linux kernel uses it (for
> performance reasons, I think).
>
> I only want to play with those profilers and try to make my own one in
> the same way (although maybe this is not the best approach).

doing it by overriding system calls is the wrong way for sure.

Oprofile doesn't need to do this for example; it really depends on how
you want to profile and what you want to profile. If you only want to
track system calls, the audit subsystem has the infrastructure for this
already, all you'd need to do is write the layer on top to interpret the
events. If you want to use performance counters.. why not build on top
of the oprofile infrastructure? I'm not saying "be oprofile", but
oprofile is multiple layers, and I suspect you should be able to reuse
the lower layers of it as is (or with really small changes) and still
make a profiler that is both your own and does what you want...