Re: [PATCH] syscalls: Document OCI seccomp filter interactions & workaround

On Tue, Nov 24, 2020 at 06:30:28PM +0100, Jann Horn wrote:
> On Tue, Nov 24, 2020 at 6:15 PM Greg KH <gregkh@xxxxxxxxxxxxxxxxxxx> wrote:
> > On Tue, Nov 24, 2020 at 06:06:38PM +0100, Jann Horn wrote:
> > > +seccomp maintainers/reviewers
> > > [thread context is at
> > > https://lore.kernel.org/linux-api/87lfer2c0b.fsf@xxxxxxxxxxxxxxxxxxxxxxxxx/
> > > ]
> > >
> > > On Tue, Nov 24, 2020 at 5:49 PM Christoph Hellwig <hch@xxxxxxxxxxxxx> wrote:
> > > > On Tue, Nov 24, 2020 at 03:08:05PM +0100, Mark Wielaard wrote:
> > > > > For valgrind the issue is statx which we try to use before falling back
> > > > > to stat64, fstatat or stat (depending on architecture, not all define
> > > > > all of these). The problem with these fallbacks is that under some
> > > > > containers (libseccomp versions) they might return EPERM instead of
> > > > > ENOSYS. This causes really obscure errors that are really hard to
> > > > > diagnose.
> > > >
> > > > So find a way to detect these completely broken container runtimes
> > > > and refuse to run under them at all.  After all they've decided to
> > > > deliberately break the syscall ABI.  (and yes, we gave them the rope
> > > > to do that with seccomp :().
> > >
> > > FWIW, if the consensus is that seccomp filters that return -EPERM by
> > > default are categorically wrong, I think it should be fairly easy to
> > > add a check to the seccomp core that detects whether the installed
> > > filter returns EPERM for some fixed unused syscall number and, if so,
> > > prints a warning to dmesg or something along those lines...
> >
> > Why?  seccomp is saying "this syscall is not permitted", so -EPERM seems
> > like the correct error to provide here.  It's not -ENOSYS as the syscall
> > is present.
> >
> > As everyone knows, there are other ways to have -EPERM be returned from
> > a syscall if you don't have the correct permissions to do something.
> > Why is seccomp being singled out here?  It's doing the correct thing.
> 
> AFAIU from what the others have said, it's being singled out because
> it means that for two semantically equivalent operations (e.g.
> openat() vs open()), one can fail while the other works because the
> filter doesn't know about one of the syscalls. Normally semantically
> equivalent syscalls are supposed to be subject to the same checks, and
> if one of them fails, trying the other one won't help.

They aren't being subject to the same checks: if the seccomp permissions
are different for the two of them, they will get different answers.

Trying to use the returned error code to determine whether the syscall
is present is not ok, and as Christian just said, that needs to be fixed
in userspace.  We can't change the kernel ABI now; odds are someone else
relies on the API we have had in place, so it cannot be changed :)
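
For illustration only, here is a minimal sketch of what such a userspace
fix might look like for the statx case Mark described.  It is not taken
from valgrind or from any patch in this thread.  The helper name
file_size() is hypothetical, and treating EPERM from statx the same as
ENOSYS is an assumption, based on statx(2) not documenting EPERM as one
of its own errors.  The raw syscall() is used on purpose so that no libc
fallback hides the error code.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>        /* AT_FDCWD */
#include <sys/stat.h>     /* fstatat(); struct statx with a recent libc */
#include <sys/syscall.h>  /* SYS_statx, if the headers know about it */
#include <unistd.h>       /* syscall() */

/*
 * Hypothetical helper: store the size of 'path' in *size_out.
 * Returns 0 on success, -1 on failure with errno set.
 */
static int file_size(const char *path, long long *size_out)
{
    assert(path != NULL && size_out != NULL);

#if defined(SYS_statx) && defined(STATX_SIZE)
    struct statx stx;

    if (syscall(SYS_statx, AT_FDCWD, path, 0, STATX_SIZE, &stx) == 0) {
        *size_out = (long long)stx.stx_size;
        return 0;
    }

    /*
     * ENOSYS: the kernel does not implement statx.
     * EPERM:  assumed here to mean "a seccomp filter rejected statx",
     *         so the older syscall is tried as well.
     * Any other error (ENOENT, EACCES, ...) is a real failure.
     */
    if (errno != ENOSYS && errno != EPERM)
        return -1;
#endif

    struct stat st;

    if (fstatat(AT_FDCWD, path, &st, 0) != 0)
        return -1;

    *size_out = (long long)st.st_size;
    return 0;
}
```

If the filter also denies the fallback syscall, the caller still gets
EPERM back, but only after both paths have been tried.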

thanks,

greg k-h


