On Fri, Feb 28, 2020 at 11:43 AM Will Deacon <will@xxxxxxxxxx> wrote:
> On Wed, Feb 26, 2020 at 05:10:46AM +0100, Jann Horn wrote:
> > On Thu, Nov 21, 2019 at 12:58 PM Will Deacon <will@xxxxxxxxxx> wrote:
> > > + * If another thread also performs a refcount_inc() operation between the two
> > > + * atomic operations, then the count will continue to edge closer to 0. If it
> > > + * reaches a value of 1 before /any/ of the threads reset it to the saturated
> > > + * value, then a concurrent refcount_dec_and_test() may erroneously free the
> > > + * underlying object. Given the precise timing details involved with the
> > > + * round-robin scheduling of each thread manipulating the refcount and the need
> > > + * to hit the race multiple times in succession, there doesn't appear to be a
> > > + * practical avenue of attack even if using refcount_add() operations with
> > > + * larger increments.
> >
> > On top of that, the number of threads that can actually be running at
> > a given time is capped. See include/linux/threads.h, where it is
> > capped to pow(2, 22):
> >
> > /*
> >  * A maximum of 4 million PIDs should be enough for a while.
> >  * [NOTE: PID/TIDs are limited to 2^29 ~= 500+ million, see futex.h.]
> >  */
> > #define PID_MAX_LIMIT (CONFIG_BASE_SMALL ? PAGE_SIZE * 8 : \
> >         (sizeof(long) > 4 ? 4 * 1024 * 1024 : PID_MAX_DEFAULT))
> >
> > And in the futex UAPI header, we have this, baking a TID limit into
> > the userspace API (note that this is pow(2,30), not pow(2,29) as the
> > comment in threads.h claims - I'm not sure where that difference comes
> > from):
> >
> > /*
> >  * The rest of the robust-futex field is for the TID:
> >  */
> > #define FUTEX_TID_MASK 0x3fffffff
> >
> > So AFAICS, with the current PID_MAX_LIMIT, if you assume that all
> > participating refcount operations are non-batched (delta 1) and the
> > attacker can't cause the threads to oops in the middle of the refcount
> > operation (maybe that would be possible if you managed to find
> > something like a NULL pointer dereference in perf software event code
> > and had perf paranoia at <=1, or something like that? - I'm not
> > sure), then even in a theoretical scenario where an attacker spawns
> > the maximum number of tasks possible and manages to get all of them to
> > sequentially preempt while being in the middle of increment operations
> > in several nested contexts (I'm not sure whether that can even happen
> > - you're not going to take typical sleeping exceptions like page
> > faults in the middle of a refcount op), the attacker will stay
> > comfortably inside the saturated range. Even if the PID_MAX_LIMIT is
> > at some point raised to the maximum permitted by the futex UAPI, this
> > still holds as long as you assume no nesting. Hm, should I send a
> > patch to add something like this to the explanatory comment?
>
> Sure, I'd be happy to improve the document by adding this -- please send
> out a patch for review. It's probably also worth mentioning the batching
> use-cases, although I struggle to reason about the window between the
> {under,over}flow occurring and saturation.

In the overflow case, it's fine, right? If you briefly crossed into the
saturation range and then went back down via some tasks with
half-completed refcounting operations, then the refcount is still
behaving as a correct, non-saturating refcount. (And it can't cross over
to the other end of the saturation range, because that's twice as much
distance as you'd need to unpin a saturated refcount.)
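To put rough numbers on that distance (purely back-of-the-envelope, as a
standalone userspace toy rather than kernel code - the constants just
mirror REFCOUNT_SATURATED == INT_MIN/2 from include/linux/refcount.h and
PID_MAX_LIMIT from threads.h on a 64-bit build):

#include <stdio.h>

/*
 * Back-of-the-envelope only: REFCOUNT_SATURATED is INT_MIN / 2, i.e.
 * 0xC0000000 when viewed as a u32, so it sits 2^30 away from both the
 * 0 boundary and the 0x80000000 boundary of the "negative" range.
 */
int main(void)
{
        unsigned int margin = 0x40000000u;             /* 2^30 */
        unsigned int max_in_flight = 4 * 1024 * 1024;  /* PID_MAX_LIMIT, 2^22 */

        /* Worst case: every task in the system is preempted with one
         * delta-1 increment half-finished at the same time. */
        printf("unpin distance: %u\n", margin);
        printf("max drift:      %u\n", max_in_flight);
        printf("headroom:       %ux\n", margin / max_in_flight); /* 256x */
        return 0;
}

So even with every task in the system parked mid-increment, the counter
only drifts ~2^22 from the saturation point, against a 2^30 margin.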
And in the underflow case, we can't deterministically protect without
some external mechanism to protect the object's lifetime while someone
is already freeing it - so that's pretty much just a best-effort thing
anyway.

> > Of course, if someone uses refcount batching with sufficiently large
> > values, those guarantees go out of the window - if we wanted to be
> > perfectionist about this, we could make the batched operations do slow
> > cmpxchg stuff while letting the much more performance-critical
> > single-reference case continue to use the fast saturation scheme.
> > OTOH, the networking folks would probably hate that, since they're
> > using the batched ops for ->sk_wmem_alloc stuff, where they count
> > bytes as references? So I guess maybe we should leave it as-is.
>
> Agreed. If we hamper the performance here, then people will either look
> to disable the checking or they will switch to atomic_t, which puts us
> back to square one. Perfection is the enemy of the good and all that.
>
> Having said that, I'd still be keen to learn about any practical attacks
> on this in case there is something smart we can do to mitigate them
> without cmpxchg(). For example, one silly approach might be to bound the
> maximum increment and split larger ones up using a loop.

I guess that'd do the job. I don't know whether it's worth the trouble
in practice, though.
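If it ever did seem worth it, I imagine it would look something like
this (completely untested sketch; the helper name refcount_add_capped()
and the cap of 64 are made up, the only point being that each individual
atomic add stays small):

#include <linux/refcount.h>

/*
 * Hypothetical illustration of the "bound the increment" idea: split a
 * large delta into small chunks so that a task preempted between the
 * atomic add and the saturation fixup can only have nudged the counter
 * by at most the cap, rather than by the full (attacker-chosen) delta.
 */
static inline void refcount_add_capped(int i, refcount_t *r)
{
        while (i > 64) {
                refcount_add(64, r);
                i -= 64;
        }
        if (i > 0)
                refcount_add(i, r);
}

The obvious downside is the extra atomics for the big-delta users like
->sk_wmem_alloc.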