Dan Williams <dan.j.williams@xxxxxxxxx> writes:

> On Wed, Jan 3, 2018 at 9:01 PM, Eric W. Biederman <ebiederm@xxxxxxxxxxxx> wrote:
>> "Williams, Dan J" <dan.j.williams@xxxxxxxxx> writes:
>>
>>> Note that these are "a human looked at static analysis reports and
>>> could not rationalize that these are false positives".  Specific
>>> domain knowledge about these paths may find that some of them are
>>> indeed false positives.
>>>
>>> The change to m_start in kernel/user_namespace.c is interesting
>>> because that's an example where the nospec_load() approach by itself
>>> would need to barrier speculation twice whereas if_nospec can do it
>>> once for the whole block.
>>
>> This user_namespace.c change is very convoluted for what it is trying
>> to do.
>
> Sorry, this was my rebase on top of commit d5e7b3c5f51f "userns: Don't
> read extents twice in m_start"; the original change from Elena was
> simpler.  Part of the complexity arises from converting the common
> kernel pattern of
>
> 	if (<invalid condition>)
> 		return NULL;
> 	do_stuff;
>
> ...to:
>
> 	if (<valid condition>) {
> 		barrier();
> 		do_stuff;
> 	}
>
>> It simplifies to a one-liner that just adds osb() after the pos >=
>> extents check.  AKA:
>>
>> 	if (pos >= extents)
>> 		return NULL;
>> +	osb();
>>
>> Is the intent to hide which branch we take based on extents, after
>> the pos check?
>
> The intent is to prevent speculative execution from triggering any
> reads when 'pos' is invalid.

If that is the intent, I think the patch you posted is woefully
inadequate.  We have many, many more seq files in proc than just
/proc/<pid>/uid_map.

>> I suspect this implies that using a user namespace and a crafted uid
>> map you can hit this in stat, on the fast path.
>>
>> At which point I suspect we will be better off extending struct
>> user_namespace by a few pointers, so there is no union and remove the
>> need for blocking speculation entirely.
>
> How does this help prevent a speculative read with an invalid 'pos'
> reading arbitrary kernel addresses?

I thought the concern was extents.

I am now convinced that collectively we need a much better description
of the problem than currently exists.  Either the patch you presented
missed a whole lot, like 90%+ of the user/kernel interface, or there is
some mitigating factor that I am not seeing.  Either way, until
reasonable people can read the code and agree on its potential
exploitability, I will be nacking these patches.
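For concreteness, the pattern the patches apply looks roughly like this
(a minimal sketch; osb() is the speculation barrier proposed in this
series, and the table and bound here are purely illustrative):

	#define TABLE_ENTRIES 64

	static int table[TABLE_ENTRIES];

	/* 'idx' is attacker controlled, e.g. copied in from userspace */
	int lookup(unsigned long idx)
	{
		if (idx < TABLE_ENTRIES) {
			/*
			 * Barrier inside the valid branch: the dependent
			 * load is not issued until the bounds check has
			 * resolved.  One osb() covers every access in the
			 * block, which is why the m_start() change below
			 * restructures the function around a single
			 * 'if (pos < extents)' block.
			 */
			osb();
			return table[idx];
		}
		return -EINVAL;
	}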
>>> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
>>> index 246d4d4ce5c7..aa0be8cef2d4 100644
>>> --- a/kernel/user_namespace.c
>>> +++ b/kernel/user_namespace.c
>>> @@ -648,15 +648,18 @@ static void *m_start(struct seq_file *seq, loff_t *ppos,
>>>  {
>>>  	loff_t pos = *ppos;
>>>  	unsigned extents = map->nr_extents;
>>> -	smp_rmb();
>>>
>>> -	if (pos >= extents)
>>> -		return NULL;
>>> +	/* paired with smp_wmb in map_write */
>>> +	smp_rmb();
>>>
>>> -	if (extents <= UID_GID_MAP_MAX_BASE_EXTENTS)
>>> -		return &map->extent[pos];
>>> +	if (pos < extents) {
>>> +		osb();
>>> +		if (extents <= UID_GID_MAP_MAX_BASE_EXTENTS)
>>> +			return &map->extent[pos];
>>> +		return &map->forward[pos];
>>> +	}
>>>
>>> -	return &map->forward[pos];
>>> +	return NULL;
>>>  }
>>>
>>>  static void *uid_m_start(struct seq_file *seq, loff_t *ppos)

>>> diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
>>> index 8ca9915befc8..7f83abdea255 100644
>>> --- a/net/mpls/af_mpls.c
>>> +++ b/net/mpls/af_mpls.c
>>> @@ -81,6 +81,8 @@ static struct mpls_route *mpls_route_input_rcu(struct net *net, unsigned index)
>>>  	if (index < net->mpls.platform_labels) {
>>>  		struct mpls_route __rcu **platform_label =
>>>  			rcu_dereference(net->mpls.platform_label);
>>> +
>>> +		osb();
>>>  		rt = rcu_dereference(platform_label[index]);
>>>  	}
>>>  	return rt;
>>
>> Ouch!  This adds a barrier in the middle of an rcu lookup, on the
>> fast path for routing mpls packets.  Which, if memory serves, will
>> noticeably slow down software processing of mpls packets.
>>
>> Why does osb() fall after the branch for validity?  So that we allow
>> speculation up until then?
>
> It falls there so that the cpu only issues reads with known good
> 'index' values.

>> I suspect it would be better to have those barriers in the tun/tap
>> interfaces where userspace can inject packets and thus time them.
>> Then the code could still speculate and go fast for remote packets.
>>
>> Or does the speculation stomping have to be immediately at the place
>> where we use data from userspace to perform a table lookup?
>
> The speculation stomping barrier has to be between where we validate
> the input and when we may speculate on invalid input.

So a serializing instruction at the kernel/user boundary (like, say,
loading cr3) is not enough?  That would seem to break any chance of a
controlled timing.

> So, yes, moving the user controllable input validation earlier and
> out of the fast path would be preferred.  Think of this patch purely
> as a static analysis warning that something might need to be done to
> resolve the report.

That isn't what I was suggesting.  I was just suggesting a
serialization instruction earlier in the pipeline.

Given what I have seen in other parts of the thread, I think an 'and'
instruction that just limits the index to a sane range is generally
applicable, and should be cheap enough not to care about.  Further, it
seems to apply to the pattern the static checkers were catching, so I
suspect that is the pattern we want to stress for limiting speculation.

Assuming, of course, the compiler won't just optimize the 'and' of the
index out.

Eric
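P.S.  To make the 'and' suggestion concrete, a minimal sketch (the
names are illustrative, and it assumes a power-of-two table size so
that size - 1 is an exact mask):

	#define TABLE_ENTRIES 64	/* power of two, so size - 1 is a mask */

	static unsigned long table[TABLE_ENTRIES];

	unsigned long lookup(unsigned long idx)
	{
		if (idx >= TABLE_ENTRIES)
			return 0;
		/*
		 * Hide the range of idx from the optimizer so it cannot
		 * prove the mask redundant and delete it -- the concern
		 * raised above.  OPTIMIZER_HIDE_VAR() does the same thing.
		 */
		asm("" : "+r" (idx));
		/*
		 * Clamp the index instead of barriering: even a
		 * misspeculated load can only touch table[0..63].
		 */
		idx &= TABLE_ENTRIES - 1;
		return table[idx];
	}

For table sizes that are not powers of two, the mask would have to be
derived from the comparison itself rather than from size - 1.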