Re: Compat 32-bit syscall entry from 64-bit task!? [was: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF]

Roland McGrath <mcgrathr@xxxxxxxxxx> · Tue, 17 Jan 2012 17:07:04 -0800

On Tue, Jan 17, 2012 at 4:56 PM, Indan Zupancic <indan@xxxxxx> wrote:
> Wait: If a task is set to 64 bit mode, but calls into the kernel via
> int 0x80 it's changed to 32 bit mode for that system call and back to
> 64 bit mode when the system call is finished!?

Well, saying it like that suggests that there is more of a "mode change"
than really exists.  It's simply that any task, 32-bit or 64-bit, can use
int $0x80, and this always means using the 32-bit syscall table with
TS_COMPAT set.
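
To make that concrete: the same syscall number means different things in
the two tables, so a tracer that guesses the wrong table misreads the call.
A minimal sketch in C, using a handful of numbers from the x86_64 and i386
tables.  The enum, struct, and function names are made up for illustration;
they are not kernel or ptrace API.

```c
#include <stddef.h>
#include <string.h>

/* Which syscall table the kernel used for this entry (illustrative names). */
enum syscall_table { TABLE_X86_64, TABLE_I386 };

struct nr_name {
    long nr;
    const char *name;
};

/* Small excerpts of the two tables; most numbers are omitted. */
static const struct nr_name x86_64_names[] = {
    {1, "write"}, {3, "close"}, {11, "munmap"},
    {20, "writev"}, {39, "getpid"}, {59, "execve"},
};
static const struct nr_name i386_names[] = {
    {1, "exit"}, {3, "read"}, {11, "execve"},
    {20, "getpid"}, {39, "mkdir"},
};

/* Look up nr in the chosen excerpt; returns NULL if it is not listed here. */
static const char *syscall_name(enum syscall_table table, long nr)
{
    const struct nr_name *names;
    size_t count, i;

    if (table == TABLE_I386) {
        names = i386_names;
        count = sizeof(i386_names) / sizeof(i386_names[0]);
    } else {
        names = x86_64_names;
        count = sizeof(x86_64_names) / sizeof(x86_64_names[0]);
    }
    for (i = 0; i < count; i++) {
        if (names[i].nr == nr)
            return names[i].name;
    }
    return NULL;
}

/* The kind of question a jailer asks: is this call an execve? */
static int is_execve(enum syscall_table table, long nr)
{
    const char *name = syscall_name(table, nr);
    return name != NULL && strcmp(name, "execve") == 0;
}
```

With 11 in the syscall-number register, a tracer that picks TABLE_X86_64
sees munmap, while an int $0x80 entry actually runs execve.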

> Our ptrace jailer is checking cs to figure out if a task is a compat task
> or not, if the kernel can change that behind our back it means our jailer
> isn't secure for x86_64 with compat enabled. Or is cs changed before the
> ptrace stuff and ptrace sees the "right" cs value? If not, we have to add
> an expensive PTRACE_PEEKTEXT to check if it's an int 0x80 or not. Or is
> there another way?

I don't think there's another way.  hpa and I once discussed adding a field
to the extractable "register state" that would say which method the syscall
in progress had taken to enter the kernel.  That would tell you which
flavor of syscall instruction was used (or none, i.e. a trap/interrupt).
But nobody ever had a real need for it, and we didn't pursue it further.
(We originally talked about it in the context of distinguishing whether a
32-bit task had used sysenter or syscall or int $0x80, I think.)
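
The PTRACE_PEEKTEXT check could look roughly like the sketch below.  It
assumes that at the syscall-entry stop the reported instruction pointer
sits just past a two-byte entry instruction, so the tracer reads one word
at ip - 2 and looks at its two low bytes (x86 is little-endian).  The enum
and function names are hypothetical.  The sketch also ignores cases where
the bytes at ip - 2 may not be the instruction that actually entered the
kernel, such as entries made through the vDSO or text rewritten by another
thread.

```c
#include <stdint.h>

/* How the task appears to have entered the kernel (illustrative names). */
enum entry_insn {
    ENTRY_UNKNOWN,
    ENTRY_INT80,     /* int $0x80 : bytes CD 80 */
    ENTRY_SYSCALL,   /* syscall   : bytes 0F 05 */
    ENTRY_SYSENTER   /* sysenter  : bytes 0F 34 */
};

/* Classify the two instruction bytes that precede the reported ip. */
static enum entry_insn classify_entry(uint8_t first, uint8_t second)
{
    if (first == 0xCD && second == 0x80)
        return ENTRY_INT80;
    if (first == 0x0F && second == 0x05)
        return ENTRY_SYSCALL;
    if (first == 0x0F && second == 0x34)
        return ENTRY_SYSENTER;
    return ENTRY_UNKNOWN;
}

/*
 * Classify a word as returned by something like
 *     ptrace(PTRACE_PEEKTEXT, pid, (void *)(ip - 2), NULL)
 * The lowest-addressed byte is the low byte of the word.  The caller
 * still has to check errno, since -1 is also a valid word value.
 */
static enum entry_insn classify_word(unsigned long word)
{
    uint8_t first = (uint8_t)(word & 0xFF);
    uint8_t second = (uint8_t)((word >> 8) & 0xFF);
    return classify_entry(first, second);
}
```

A jailer would then pick the 32-bit table whenever this reports
ENTRY_INT80, whatever cs says.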

> I think this behaviour is so unexpected that it can only cause security
> problems in the long run. Is anyone counting on this? Where is this
> behaviour documented?

It's documented the same place the entire Linux machine-level ABI is
documented, which is nowhere.  Someone somewhere may once have been
counting on it.  (The story I heard was about an implementation of valgrind
for 32-bit code that ran in 64-bit tasks, but I don't know for sure that it
was really done.)  The general rule is that if it ever worked before in a
coherent way, we don't break binary compatibility.

In the implementation, it would require a special check to make it barf.
It's really just something that falls out of how the hardware and the
kernel implementation work.  I suppose you could add such a check under a
new kconfig option that's marked as being potentially incompatible with
some old applications.  Good luck with that.