On Fri, Jan 20, 2023 at 7:50 PM Mateusz Guzik <mjguzik@xxxxxxxxx> wrote: > On 1/20/23, Paul Moore <paul@xxxxxxxxxxxxxx> wrote: > > On Mon, Jan 16, 2023 at 4:21 PM Mateusz Guzik <mjguzik@xxxxxxxxx> wrote: > >> > >> access(2) remains commonly used, for example on exec: > >> access("/etc/ld.so.preload", R_OK) > >> > >> or when running gcc: strace -c gcc empty.c > >> % time seconds usecs/call calls errors syscall > >> ------ ----------- ----------- --------- --------- ---------------- > >> 0.00 0.000000 0 42 26 access > >> > >> It falls down to do_faccessat without the AT_EACCESS flag, which in turn > >> results in allocation of new creds in order to modify fsuid/fsgid and > >> caps. This is a very expensive process single-threaded and most notably > >> multi-threaded, with numerous structures getting refed and unrefed on > >> imminent new cred destruction. > >> > >> Turns out for typical consumers the resulting creds would be identical > >> and this can be checked upfront, avoiding the hard work. > >> > >> An access benchmark plugged into will-it-scale running on Cascade Lake > >> shows: > >> test proc before after > >> access1 1 1310582 2908735 (+121%) # distinct files > >> access1 24 4716491 63822173 (+1353%) # distinct files > >> access2 24 2378041 5370335 (+125%) # same file > > > > Out of curiosity, do you have any measurements of the impact this > > patch has on the AT_EACCESS case when the creds do need to be > > modified? > > I could not be arsed to bench that. I'm not saying there is literally 0 > impact, but it should not be high and the massive win in the case I > patched imho justifies it. That's one way to respond to an honest question asking if you've done any tests on the other side of the change. I agree the impact should be less than the advantage you've shown, but sometimes it's nice to see these things. > Last week I got a private reply from Linus suggesting the new checks > only happen once at commit_cred() time, which would mean there would be > at most one extra branch for the case you are concerned with. However, > this quickly turn out to be rather hairy as there are games being > played for example in copy_creds() which assigns them *without* calling > commit_creds(). I was not comfortable pre-computing without sorting out > the mess first and he conceded the new branchfest is not necessarily a > big deal. > > That said, if you want some performance recovered for this case, here > is an easy one: > > static const struct cred *access_override_creds(void) > [..] > old_cred = override_creds(override_cred); > > /* override_cred() gets its own ref */ > put_cred(override_cred); > > As in the new creds get refed only to get unrefed immediately after. > Whacking the 2 atomics should make up for it no problem on x86-64. > > Also see below. > > >> The above benchmarks are not integrated into will-it-scale, but can be > >> found in a pull request: > >> https://github.com/antonblanchard/will-it-scale/pull/36/files > >> > >> Signed-off-by: Mateusz Guzik <mjguzik@xxxxxxxxx> > >> > >> v2: > >> - fix current->cred usage warn reported by the kernel test robot > >> Link: https://lore.kernel.org/all/202301150709.9EC6UKBT-lkp@xxxxxxxxx/ > >> --- > >> fs/open.c | 32 +++++++++++++++++++++++++++++++- > >> 1 file changed, 31 insertions(+), 1 deletion(-) > >> > >> diff --git a/fs/open.c b/fs/open.c > >> index 82c1a28b3308..3c068a38044c 100644 > >> --- a/fs/open.c > >> +++ b/fs/open.c > >> @@ -367,7 +367,37 @@ COMPAT_SYSCALL_DEFINE6(fallocate, int, fd, int, mode, > >> compat_arg_u64_dual(offset > >> * access() needs to use the real uid/gid, not the effective uid/gid. > >> * We do this by temporarily clearing all FS-related capabilities and > >> * switching the fsuid/fsgid around to the real ones. > >> + * > >> + * Creating new credentials is expensive, so we try to skip doing it, > >> + * which we can if the result would match what we already got. > >> */ > >> +static bool access_need_override_creds(int flags) > >> +{ > >> + const struct cred *cred; > >> + > >> + if (flags & AT_EACCESS) > >> + return false; > >> + > >> + cred = current_cred(); > >> + if (!uid_eq(cred->fsuid, cred->uid) || > >> + !gid_eq(cred->fsgid, cred->gid)) > >> + return true; > >> + > >> + if (!issecure(SECURE_NO_SETUID_FIXUP)) { > >> + kuid_t root_uid = make_kuid(cred->user_ns, 0); > >> + if (!uid_eq(cred->uid, root_uid)) { > >> + if (!cap_isclear(cred->cap_effective)) > >> + return true; > >> + } else { > >> + if (!cap_isidentical(cred->cap_effective, > >> + cred->cap_permitted)) > >> + return true; > >> + } > >> + } > >> + > >> + return false; > >> +} > > > > I worry a little that with nothing connecting > > access_need_override_creds() to access_override_creds() there is a bug > > waiting to happen if/when only one of the functions is updated. > > These funcs are literally next to each other, I don't think that is easy > to miss. I concede a comment in access_override_creds to take a look at > access_need_override_creds would not hurt, but I don't know if a resend > to add it is justified. Perhaps it's because I have to deal with a fair amount of code getting changed in one place and not another, but I would think that a comment would be the least one could do here and would justify a respin. > > Given the limited credential changes in access_override_creds(), I > > wonder if a better solution would be to see if we could create a > > light(er)weight prepare_creds()/override_creds() that would avoid some > > of the prepare_creds() hotspots (I'm assuming that is where most of > > the time is being spent). It's possible this could help improve the > > performance of other, similar operations that need to modify task > > creds for a brief, and synchronous, period of time. ... > For a Real Solution(tm) for a general case I think has to start with an > observartion creds either persist for a long time *OR* keep getting > recreated. This would suggest holding on to them and looking them up > instead just allocating, but all this opens another can of worms and > I don't believe is worth the effort at this stage. But maybe someone > has a better idea. > > That said, for the case of access(), I had the following in mind but > once more considered it not justified at this stage. > > pseudocode-wise: > struct cred *prepare_shallow_creds(void) > new = kmem_cache_alloc(cred_jar, GFP_KERNEL); > old = task->cred; > memcpy(new, old, sizeof(struct cred)); > > here new creds have all the same pointers as old, but the target objs > are only kept alive by the old creds still refing them. So by API > contract you are required to keep them around. > > after you temporarily assign them you call revert_shallow_creds(): > if (tempcred->usage == 1) > /* nobody refed them, do the non_rcu check */ > ... > else > /* somebody grabbed them, legitimize creds by > * grabbing the missing refs > */ > get_uid(new->user); > get_user_ns(new->user_ns); > get_group_info(new->group_info); > /* and so on */ > > So this would shave work from the case you are concerned with probably > for all calls. > > I do think this is an ok idea overall, but I felt like overengineering for the > problem at hand *at the time*. In my opinion a generalized shallow copy approach has more value than a one-off solution that has the potential to fall out of sync and cause a problem in the future (I recognize that you disagree on the likelihood of this happening). > For some context, I'm looking at performance of certain VFS stuff and > there is some serious fish to fry in the fstat department. I assumed it was part of some larger perf work, but I'm not sure the context is that important for this particular discussion. > The patch I > posted is definitely worthwhile perf-wise and easy enough to reason > about that I did no expect much opposition to it. If anything I expected > opposition to the idea outlined above. Ultimately it's a call for the VFS folks as they are responsible for the access() code. -- paul-moore.com