Re: Next round: revised futex(2) man page for review

Darren Hart <dvhart@xxxxxxxxxxxxx> · Mon, 24 Aug 2015 14:47:16 -0700

On Sat, Aug 08, 2015 at 08:57:35AM +0200, Michael Kerrisk (man-pages) wrote:

...

> >> .\" FIXME ===== End of adapted Hart/Guniguntala text =====
> >>
> >>
> >>
> >> .\" FIXME We need some explanation in the following paragraph of *why*
> >> .\"       it is important to note that "the kernel will update the
> >> .\"       futex word's value prior
> >>        It is important to note to returning to user space" . Can someone
> >>        explain?   that  the  kernel  will  update the futex word's value
> >>        prior to returning to user space.  Unlike the other futex  opera‐
> >>        tions  described  above, the PI futex operations are designed for
> >>        the implementation of very specific IPC mechanisms.
> > 
> > If the kernel didn't perform the update prior to returning to userspace,
> > we could end up in an invalid state. Such as having an owner, but the
> > value being 0. Or having waiters, but not having FUTEX_WAITERS set.
> 
> So I've now reworked this passage to read:
> 
>        It  is  important  to  note that the kernel will update the futex
>        word's value prior to returning to user  space.   (This  prevents
>        the possibility of the futex word's value ending up in an invalid
>        state, such as having an owner but the value being 0,  or  having
>        waiters but not having the FUTEX_WAITERS bit set.)
> 
> Okay?

Yes.

> 
> >> .\"
> >> .\" FIXME XXX In discussing errors for FUTEX_CMP_REQUEUE_PI, Darren Hart
> >> .\"       made the observation that "EINVAL is returned if the non-pi 
> >> .\"       to pi or op pairing semantics are violated."
> >> .\"       Probably there needs to be a general statement about this
> >> .\"       requirement, probably located at about this point in the page.
> >> .\"       Darren (or someone else), care to take a shot at this?
> > 
> > We can probably borrow from either the futex.c comments or the
> > futex-requeue-pi.txt in Documentation. Also, it is important to note
> > that the PI requeue operations require two distinct uadders (although
> > that is implied by requiring "non-pi to pi" as a futex cannot be both.
> > 
> > Or... perhaps something like:
> > 
> > 	Due to the kernel imposed futex word value policy, PI futex
> > 	operations have additional usage requirements:
> > 	
> > 	FUTEX_WAIT_REQUEUE_PI must be paired with FUTEX_CMP_REQUEUE_PI
> > 	and be performed from a non-pi futex to a distinct pi futex.
> > 	Failing to do so will return EINVAL. 
> 
> For which operation does the EINVAL occur: FUTEX_WAIT_REQUEUE_PI or 
> FUTEX_CMP_REQUEUE_PI?

FUTEX_WAIT_REQUEUE_PI can return -EINVAL if called with invalid parameters, such
as uaddr==uaddr2, or (in the case of SHARED futexes), the associated keys match
(meaning it's the same futex word - shared memory, inode, etc.). This can't
happen if the stated policy of requeueing from non-pi to pi is followed as the
same word cannot be both non-pi and pi at the same time, requiring them to be
unique futex words.

FUTEX_CMP_REQUEUE_PI will fail similarly if uaddr and uaddr2 are the same futex
word. Also, if nr_wake != 1.

But, to the point I was making above, FUTEX_CMP_REQUEUE_PI must reque uaddr to
same uaddr2 specified in the previous FUTEX_WAIT_REQUEUE_PI call.
FUTEX_WAIT_REQUEUE_PI sets up the operation, FUTEX_CMP_REQUEUE_PI completes it,
and they must agree on uaddr and uaddr2.

...

> > And their PRIVATE counterparts of course (which is assumed as it is a
> > flag to the opcode).
> 
> Yes. But I don't think that needs to be called out explicitly here (?).

Agreed.

> 
> >> .\" FIXME XXX ===== Start of adapted Hart/Guniguntala text =====
> >> .\"       The following text is drawn from the Hart/Guniguntala paper
> >> .\"       (listed in SEE ALSO), but I have reworded some pieces
> >> .\"       significantly. Please check it.
> >>
> >>        The PI futex operations described below  differ  from  the  other
> >>        futex  operations  in  that  they impose policy on the use of the
> >>        value of the futex word:
> >>
> >>        *  If the lock is not acquired, the futex word's value  shall  be
> >>           0.
> >>
> >>        *  If  the  lock is acquired, the futex word's value shall be the
> >>           thread ID (TID; see gettid(2)) of the owning thread.
> >>
> >>        *  If the lock is owned and there are threads contending for  the
> >>           lock,  then  the  FUTEX_WAITERS  bit shall be set in the futex
> >>           word's value; in other words, this value is:
> >>
> >>               FUTEX_WAITERS | TID
> >>
> >>
> >>        Note that a PI futex word never just has the value FUTEX_WAITERS,
> >>        which is a permissible state for non-PI futexes.
> > 
> > The second clause is inappropriate. I don't know if that was yours or
> > mine, but non-PI futexes do not have a kernel defined value policy, so
> > ==FUTEX_WAITERS cannot be a "permissible state" as any value is
> > permissible for non-PI futexes, and none have a kernel defined state.
> > 
> > Perhaps include a Note under the third bullet as:
> > 
> > 	  Note: It is invalid for a PI futex word to have no owner and
> > 	        FUTEX_WAITERS set.
> 
> Done.
> 
> >>        With this policy in place, a user-space application can acquire a
> >>        not-acquired lock or release a lock that no other threads try  to
> > 
> > "that no other threads try to acquire" seems out of place. I think
> > "atomic instructions" is sufficient to express how contention is
> > handled.
> 
> Yup, changed.
> 
> >>        acquire using atomic instructions executed in user space (e.g., a
> >>        compare-and-swap operation such as cmpxchg on the  x86  architec‐
> >>        ture).   Acquiring  a  lock simply consists of using compare-and-
> >>        swap to atomically set the futex word's value to the caller's TID
> >>        if  its  previous  value  was 0.  Releasing a lock requires using
> >>        compare-and-swap to set the futex word's value to 0 if the previ‐
> >>        ous value was the expected TID.
> >>
> >>        If a futex is already acquired (i.e., has a nonzero value), wait‐
> >>        ers must employ the FUTEX_LOCK_PI operation to acquire the  lock.
> >>        If other threads are waiting for the lock, then the FUTEX_WAITERS
> >>        bit is set in the futex value; in this case, the lock owner  must
> >>        employ the FUTEX_UNLOCK_PI operation to release the lock.
> >>
> >>        In  the  cases  where  callers  are forced into the kernel (i.e.,
> >>        required to perform a futex() call), they then deal directly with
> >>        a so-called RT-mutex, a kernel locking mechanism which implements
> >>        the required priority-inheritance semantics.  After the  RT-mutex
> >>        is  acquired,  the futex value is updated accordingly, before the
> >>        calling thread returns to user space.
> > 
> > This last paragraph relies on kernel implementation rather than
> > behavior. If the RT-mutex is renamed or the mechanism is implemented
> > differently in futexes, this section will require updating. Is that
> > appropriate for a user-space man page?
> 
> In the end, I'm not sure. This is (so far) my best attempt at trying
> to convey an explanation of the behavior provided by the API.
> 
>           results).
> 
> But see my question just above. I'll tweak the first bullet point a
> little after I hear back from you.

Arg, lost context. Which question?

In my humble opinion, the paragraph about RT-mutex above should perhaps instead
read something like:

	    In  the  cases  where  callers  are forced into the kernel (i.e.,
	    required to perform a futex() call), they then deal directly with
	    Linux kernel internal mechanisms which implement the required
	    priority-inheritance semantics.  Once the internal locking structure
	    is  acquired,  the futex value is updated accordingly, before the
	    calling thread returns to user space.

I'm not terribly particular about this, but in my opinion, we should not refer
to internal-only kernel structures in the man page. Feel free to ignore, or to
defer to a differing opinion from Thomas or others.

Thanks for all your work on this!

-- 
Darren Hart
Intel Open Source Technology Center
--
To unsubscribe from this list: send the line "unsubscribe linux-api" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html