On 1/21/23, Mateusz Guzik <mjguzik@xxxxxxxxx> wrote: > On 1/20/23, Paul Moore <paul@xxxxxxxxxxxxxx> wrote: >> On Mon, Jan 16, 2023 at 4:21 PM Mateusz Guzik <mjguzik@xxxxxxxxx> wrote: >>> >>> access(2) remains commonly used, for example on exec: >>> access("/etc/ld.so.preload", R_OK) >>> >>> or when running gcc: strace -c gcc empty.c >>> % time seconds usecs/call calls errors syscall >>> ------ ----------- ----------- --------- --------- ---------------- >>> 0.00 0.000000 0 42 26 access >>> >>> It falls down to do_faccessat without the AT_EACCESS flag, which in turn >>> results in allocation of new creds in order to modify fsuid/fsgid and >>> caps. This is a very expensive process single-threaded and most notably >>> multi-threaded, with numerous structures getting refed and unrefed on >>> imminent new cred destruction. >>> >>> Turns out for typical consumers the resulting creds would be identical >>> and this can be checked upfront, avoiding the hard work. >>> >>> An access benchmark plugged into will-it-scale running on Cascade Lake >>> shows: >>> test proc before after >>> access1 1 1310582 2908735 (+121%) # distinct files >>> access1 24 4716491 63822173 (+1353%) # distinct files >>> access2 24 2378041 5370335 (+125%) # same file >> >> Out of curiosity, do you have any measurements of the impact this >> patch has on the AT_EACCESS case when the creds do need to be >> modified? >> > > I could not be arsed to bench that. I'm not saying there is literally 0 > impact, but it should not be high and the massive win in the case I > patched imho justifies it. > > Last week I got a private reply from Linus suggesting the new checks > only happen once at commit_cred() time, which would mean there would be > at most one extra branch for the case you are concerned with. However, > this quickly turn out to be rather hairy as there are games being > played for example in copy_creds() which assigns them *without* calling > commit_creds(). I was not comfortable pre-computing without sorting out > the mess first and he conceded the new branchfest is not necessarily a > big deal. > > That said, if you want some performance recovered for this case, here > is an easy one: > > static const struct cred *access_override_creds(void) > [..] > old_cred = override_creds(override_cred); > > /* override_cred() gets its own ref */ > put_cred(override_cred); > > As in the new creds get refed only to get unrefed immediately after. > Whacking the 2 atomics should make up for it no problem on x86-64. > > Also see below. > >>> The above benchmarks are not integrated into will-it-scale, but can be >>> found in a pull request: >>> https://github.com/antonblanchard/will-it-scale/pull/36/files >>> >>> Signed-off-by: Mateusz Guzik <mjguzik@xxxxxxxxx> >>> >>> v2: >>> - fix current->cred usage warn reported by the kernel test robot >>> Link: https://lore.kernel.org/all/202301150709.9EC6UKBT-lkp@xxxxxxxxx/ >>> --- >>> fs/open.c | 32 +++++++++++++++++++++++++++++++- >>> 1 file changed, 31 insertions(+), 1 deletion(-) >>> >>> diff --git a/fs/open.c b/fs/open.c >>> index 82c1a28b3308..3c068a38044c 100644 >>> --- a/fs/open.c >>> +++ b/fs/open.c >>> @@ -367,7 +367,37 @@ COMPAT_SYSCALL_DEFINE6(fallocate, int, fd, int, >>> mode, >>> compat_arg_u64_dual(offset >>> * access() needs to use the real uid/gid, not the effective uid/gid. >>> * We do this by temporarily clearing all FS-related capabilities and >>> * switching the fsuid/fsgid around to the real ones. >>> + * >>> + * Creating new credentials is expensive, so we try to skip doing it, >>> + * which we can if the result would match what we already got. >>> */ >>> +static bool access_need_override_creds(int flags) >>> +{ >>> + const struct cred *cred; >>> + >>> + if (flags & AT_EACCESS) >>> + return false; >>> + >>> + cred = current_cred(); >>> + if (!uid_eq(cred->fsuid, cred->uid) || >>> + !gid_eq(cred->fsgid, cred->gid)) >>> + return true; >>> + >>> + if (!issecure(SECURE_NO_SETUID_FIXUP)) { >>> + kuid_t root_uid = make_kuid(cred->user_ns, 0); >>> + if (!uid_eq(cred->uid, root_uid)) { >>> + if (!cap_isclear(cred->cap_effective)) >>> + return true; >>> + } else { >>> + if (!cap_isidentical(cred->cap_effective, >>> + cred->cap_permitted)) >>> + return true; >>> + } >>> + } >>> + >>> + return false; >>> +} >> >> I worry a little that with nothing connecting >> access_need_override_creds() to access_override_creds() there is a bug >> waiting to happen if/when only one of the functions is updated. >> > > These funcs are literally next to each other, I don't think that is easy > to miss. I concede a comment in access_override_creds to take a look at > access_need_override_creds would not hurt, but I don't know if a resend > to add it is justified. > >> Given the limited credential changes in access_override_creds(), I >> wonder if a better solution would be to see if we could create a >> light(er)weight prepare_creds()/override_creds() that would avoid some >> of the prepare_creds() hotspots (I'm assuming that is where most of >> the time is being spent). It's possible this could help improve the >> performance of other, similar operations that need to modify task >> creds for a brief, and synchronous, period of time. >> > > So the fundamental problem here is that several refs need to be grabbed > to make fully-fledged credentials. Then, as you are done, you have to > undo them. This clearly sucks single-threaded. As other threads copying > the same creds do the same atomics on the same vars, this sucks even > more multithreaded. > > As a cop out one may notice several of these are probably always the > same for creds derived from given base, so perhaps there can be an obj > which wraps them and then you only have to ref *that* obj to implicitly > hold on to the entire thing. > > As in this (and more): > get_uid(new->user); > get_user_ns(new->user_ns); > get_group_info(new->group_info); > > would get replaced with: > new->fancy_obj = getref_fancy_obj(new->fancy_obj); > /* populate pointers here */ > > ... conceptually at least. > > But even then, while less shafted both multi and single-threaded, there > is still a bottleneck. > > For a Real Solution(tm) for a general case I think has to start with an > observartion creds either persist for a long time *OR* keep getting > recreated. This would suggest holding on to them and looking them up > instead just allocating, but all this opens another can of worms and > I don't believe is worth the effort at this stage. But maybe someone > has a better idea. > > That said, for the case of access(), I had the following in mind but > once more considered it not justified at this stage. > > pseudocode-wise: > struct cred *prepare_shallow_creds(void) > new = kmem_cache_alloc(cred_jar, GFP_KERNEL); > old = task->cred; > memcpy(new, old, sizeof(struct cred)); > > here new creds have all the same pointers as old, but the target objs > are only kept alive by the old creds still refing them. So by API > contract you are required to keep them around. > > after you temporarily assign them you call revert_shallow_creds(): > if (tempcred->usage == 1) > /* nobody refed them, do the non_rcu check */ > ... > else > /* somebody grabbed them, legitimize creds by > * grabbing the missing refs > */ > get_uid(new->user); > get_user_ns(new->user_ns); > get_group_info(new->group_info); > /* and so on */ > > So this would shave work from the case you are concerned with probably > for all calls. > > I do think this is an ok idea overall, but I felt like overengineering for > the > problem at hand *at the time*. > > For some context, I'm looking at performance of certain VFS stuff and > there is some serious fish to fry in the fstat department. The patch I > posted is definitely worthwhile perf-wise and easy enough to reason > about that I did no expect much opposition to it. If anything I expected > opposition to the idea outlined above. > So just in case, if ultimately the consensus is that shallow copy or similar is the way to go, I can take a stab some time next week. >> Have you done any profiling inside of access_override_creds() to see >> where the most time is spent? Looking at the code I have some gut >> feelings on the hotspots, but it would be good to see some proper data >> before jumping to any conclusions. >> > > See above. It's the atomics all the way down. > > -- > Mateusz Guzik <mjguzik gmail.com> > -- Mateusz Guzik <mjguzik gmail.com>