Re: Next round: revised futex(2) man page for review

"Michael Kerrisk (man-pages)" <mtk.manpages@xxxxxxxxx> · Sat, 08 Aug 2015 08:57:35 +0200

Hi Darren,

Some of my comments below will refer to the reply I just sent
to tglx (and the list) a few minutes ago.

On 08/06/2015 12:21 AM, Darren Hart wrote:
> On Mon, Jul 27, 2015 at 02:07:15PM +0200, Michael Kerrisk (man-pages) wrote:
>> Hello all,
>>
> 
> Michael, thank you for your diligence in following up and collecting
> reviews. I've attempted to respond to what I was specifically called out
> in or where I had something specific to add in addition to other
> replies.

Thanks!

> After this, will you send another version (numbered for reference
> maybe?) with any remaining FIXMEs that haven't yet been addressed
> according to your accounting?

Yes, I'll be sending out another draft (probably after a short delay,
while I see what further responses come back on the mails I just sent.)
In any case, the latest version of the page can be found at
http://git.kernel.org/cgit/docs/man-pages/man-pages.git/log/?h=draft_futex

>>    Priority-inheritance futexes
>>        Linux supports priority-inheritance (PI) futexes in order to han‐
>>        dle priority-inversion problems that can be encountered with nor‐
>>        mal  futex  locks.  Priority inversion is the problem that occurs
>>        when a high-priority task is blocked waiting to  acquire  a  lock
>>        held  by a low-priority task, while tasks at an intermediate pri‐
>>        ority continuously preempt the low-priority task  from  the  CPU.
>>        Consequently,  the  low-priority  task  makes  no progress toward
>>        releasing the lock, and the high-priority task remains blocked.
>>
>>        Priority inheritance is a mechanism for dealing with  the  prior‐
>>        ity-inversion problem.  With this mechanism, when a high-priority
>>        task becomes blocked by a lock held by a low-priority  task,  the
>>        latter's priority is temporarily raised to that of the former, so
>>        that it is not preempted by any intermediate level tasks, and can
>>        thus  make  progress toward releasing the lock.  To be effective,
>>        priority inheritance must be transitive, meaning that if a  high-
>>        priority task blocks on a lock held by a lower-priority task that
>>        is itself blocked by lock held by  another  intermediate-priority
>>        task  (and  so  on, for chains of arbitrary length), then both of
>>        those task (or more generally, all of the tasks in a lock  chain)
>>        have  their priorities raised to be the same as the high-priority
>>        task.
>>
>> .\" FIXME XXX The following is my attempt at a definition of PI futexes,
>> .\"       based on mail discussions with Darren Hart. Does it seem okay?
>>
>>        From a user-space perspective, what makes a futex PI-aware  is  a
>>        policy  agreement  between  user  space  and the kernel about the
>>        value of the futex word (described in a moment), coupled with the
>>        use  of  the  PI futex operations described below (in particular,
>>        FUTEX_LOCK_PI, FUTEX_TRYLOCK_PI, and FUTEX_CMP_REQUEUE_PI).
> 
> Yes. Was this intended to be a complete opcode list? 

No. I'll remove that list, in case its misunderstood that way.

> PI operations must
> use paired operations.
> 
> (FUTEX_LOCK_PI | FUTEX_TRYLOCK_PI) : FUTEX_UNLOCK_PI
> FUTEX_WAIT_REQUEUE_PI : FUTEX_CMP_REQUEUE_PI

And now I've made that point explicitly in the page. See my comment 
lower down.

> And their PRIVATE counterparts of course (which is assumed as it is a
> flag to the opcode).

Yes. But I don't think that needs to be called out explicitly here (?).

>> .\" FIXME XXX ===== Start of adapted Hart/Guniguntala text =====
>> .\"       The following text is drawn from the Hart/Guniguntala paper
>> .\"       (listed in SEE ALSO), but I have reworded some pieces
>> .\"       significantly. Please check it.
>>
>>        The PI futex operations described below  differ  from  the  other
>>        futex  operations  in  that  they impose policy on the use of the
>>        value of the futex word:
>>
>>        *  If the lock is not acquired, the futex word's value  shall  be
>>           0.
>>
>>        *  If  the  lock is acquired, the futex word's value shall be the
>>           thread ID (TID; see gettid(2)) of the owning thread.
>>
>>        *  If the lock is owned and there are threads contending for  the
>>           lock,  then  the  FUTEX_WAITERS  bit shall be set in the futex
>>           word's value; in other words, this value is:
>>
>>               FUTEX_WAITERS | TID
>>
>>
>>        Note that a PI futex word never just has the value FUTEX_WAITERS,
>>        which is a permissible state for non-PI futexes.
> 
> The second clause is inappropriate. I don't know if that was yours or
> mine, but non-PI futexes do not have a kernel defined value policy, so
> ==FUTEX_WAITERS cannot be a "permissible state" as any value is
> permissible for non-PI futexes, and none have a kernel defined state.
> 
> Perhaps include a Note under the third bullet as:
> 
> 	  Note: It is invalid for a PI futex word to have no owner and
> 	        FUTEX_WAITERS set.

Done.

>>        With this policy in place, a user-space application can acquire a
>>        not-acquired lock or release a lock that no other threads try  to
> 
> "that no other threads try to acquire" seems out of place. I think
> "atomic instructions" is sufficient to express how contention is
> handled.

Yup, changed.

>>        acquire using atomic instructions executed in user space (e.g., a
>>        compare-and-swap operation such as cmpxchg on the  x86  architec‐
>>        ture).   Acquiring  a  lock simply consists of using compare-and-
>>        swap to atomically set the futex word's value to the caller's TID
>>        if  its  previous  value  was 0.  Releasing a lock requires using
>>        compare-and-swap to set the futex word's value to 0 if the previ‐
>>        ous value was the expected TID.
>>
>>        If a futex is already acquired (i.e., has a nonzero value), wait‐
>>        ers must employ the FUTEX_LOCK_PI operation to acquire the  lock.
>>        If other threads are waiting for the lock, then the FUTEX_WAITERS
>>        bit is set in the futex value; in this case, the lock owner  must
>>        employ the FUTEX_UNLOCK_PI operation to release the lock.
>>
>>        In  the  cases  where  callers  are forced into the kernel (i.e.,
>>        required to perform a futex() call), they then deal directly with
>>        a so-called RT-mutex, a kernel locking mechanism which implements
>>        the required priority-inheritance semantics.  After the  RT-mutex
>>        is  acquired,  the futex value is updated accordingly, before the
>>        calling thread returns to user space.
> 
> This last paragraph relies on kernel implementation rather than
> behavior. If the RT-mutex is renamed or the mechanism is implemented
> differently in futexes, this section will require updating. Is that
> appropriate for a user-space man page?

In the end, I'm not sure. This is (so far) my best attempt at trying
to convey an explanation of the behavior provided by the API.

>> .\" FIXME ===== End of adapted Hart/Guniguntala text =====
>>
>>
>>
>> .\" FIXME We need some explanation in the following paragraph of *why*
>> .\"       it is important to note that "the kernel will update the
>> .\"       futex word's value prior
>>        It is important to note to returning to user space" . Can someone
>>        explain?   that  the  kernel  will  update the futex word's value
>>        prior to returning to user space.  Unlike the other futex  opera‐
>>        tions  described  above, the PI futex operations are designed for
>>        the implementation of very specific IPC mechanisms.
> 
> If the kernel didn't perform the update prior to returning to userspace,
> we could end up in an invalid state. Such as having an owner, but the
> value being 0. Or having waiters, but not having FUTEX_WAITERS set.

So I've now reworked this passage to read:

       It  is  important  to  note that the kernel will update the futex
       word's value prior to returning to user  space.   (This  prevents
       the possibility of the futex word's value ending up in an invalid
       state, such as having an owner but the value being 0,  or  having
       waiters but not having the FUTEX_WAITERS bit set.)

Okay?

>> .\"
>> .\" FIXME XXX In discussing errors for FUTEX_CMP_REQUEUE_PI, Darren Hart
>> .\"       made the observation that "EINVAL is returned if the non-pi 
>> .\"       to pi or op pairing semantics are violated."
>> .\"       Probably there needs to be a general statement about this
>> .\"       requirement, probably located at about this point in the page.
>> .\"       Darren (or someone else), care to take a shot at this?
> 
> We can probably borrow from either the futex.c comments or the
> futex-requeue-pi.txt in Documentation. Also, it is important to note
> that the PI requeue operations require two distinct uadders (although
> that is implied by requiring "non-pi to pi" as a futex cannot be both.
> 
> Or... perhaps something like:
> 
> 	Due to the kernel imposed futex word value policy, PI futex
> 	operations have additional usage requirements:
> 	
> 	FUTEX_WAIT_REQUEUE_PI must be paired with FUTEX_CMP_REQUEUE_PI
> 	and be performed from a non-pi futex to a distinct pi futex.
> 	Failing to do so will return EINVAL. 

For which operation does the EINVAL occur: FUTEX_WAIT_REQUEUE_PI or 
FUTEX_CMP_REQUEUE_PI?

>       Additionally,
> 	FUTEX_CMP_REQUEUE_PI requires that nr_wake=1. [We state in the
> 	docs that nr_requeue should be INT_MAX for broadcast and 0 for
> 	signal... but that may be overly specific to libc for this
> 	manual]
> 	
> 	Similarly, FUTEX_UNLOCK_PI must only be called on a futex owned
> 	by the calling thread as defined by the value policy, otherwise
> 	EPERM is returned.
> 
> Were you looking for something like that - or were you looking for
> justification for these requirements?

That's good. I've reworked a piece of the page to be:

       PI futexes are operated on by specifying one of the values listed
       below  in  futex_op.   Note  that the PI futex operations must be
       used as paired operations and  are  subject  to  some  additional
       requirements:

       *  FUTEX_LOCK_PI  and FUTEX_TRYLOCK_PI pair with FUTEX_UNLOCK_PI.
          FUTEX_UNLOCK_PI must be called only on a futex  owned  by  the
          calling  thread, as defined by the value policy, otherwise the
          error EPERM results.

       *  FUTEX_WAIT_REQUEUE_PI pairs with  FUTEX_CMP_REQUEUE_PI.   This
          must  be  performed from a non-PI futex to a distinct PI futex
          (or the error EINVAL results).  Additionally, val (the  number
          of  waiters  to  be  woken)  must  be  1  (or the error EINVAL
          results).

But see my question just above. I'll tweak the first bullet point a
little after I hear back from you.

> ...
> 
>>        FUTEX_LOCK_PI (since Linux 2.6.18)
>> .\" FIXME I did some significant rewording of tglx's text to create
>> .\"       the text below.
>> .\"       Please check the following paragraph, in case I injected
>> .\"       errors.
>>               This  operation  is used after after an attempt to acquire
>>               the lock  via  an  atomic  user-space  instruction  failed
>>               because  the  futex word has a nonzero value—specifically,
>>               because it contained the  namespace-specific  TID  of  the
>>               lock owner.
> 
> Acked.

Thanks.

>>        FUTEX_TRYLOCK_PI (since Linux 2.6.18)
>> .\" FIXME I think it would be helpful here to say a few more words about
>> .\"       the difference(s) between FUTEX_LOCK_PI and FUTEX_TRYLOCK_PI.
>> .\"       Can someone propose something?
>>               This operation tries to acquire the futex  at  uaddr.   It
>>               deals  with  the situation where the TID value at uaddr is
>>               0, but the FUTEX_WAITERS bit is set.   User  space  cannot
>>               handle this condition in a race-free manner
>> .\" FIXME How does the situation in the previous sentence come about?
>> .\"       Probably it would be helpful to say something about that in
>> .\"       the man page.
>> .\" FIXME And *how* does FUTEX_TRYLOCK_PI deal with this situation?
> 
> I guess I wouldn't expect to see this detail in the manual. That state
> should never exist in userspace as far as I understand it, which makes
> it a kernel implementation detail and not relevant to a usage manual.

See my recent reply to Thomas Gleixner.

>>
>>        FUTEX_CMP_REQUEUE_PI (since Linux 2.6.31)
>>               This operation is a PI-aware variant of FUTEX_CMP_REQUEUE.
>>               It    requeues    waiters    that    are    blocked    via
>>               FUTEX_WAIT_REQUEUE_PI on uaddr from a non-PI source  futex
>>               (uaddr) to a PI target futex (uaddr2).
>>
>>               As with FUTEX_CMP_REQUEUE, this operation wakes up a maxi‐
>>               mum of val waiters that are waiting on the futex at uaddr.
>>               However, for FUTEX_CMP_REQUEUE_PI, val is required to be 1
>>               (since the main point is to avoid a thundering herd).  The
>>               remaining  waiters  are removed from the wait queue of the
>>               source futex at uaddr and added to the wait queue  of  the
>>               target futex at uaddr2.
>>
>>               The val2 and val3 arguments serve the same purposes as for
>>               FUTEX_CMP_REQUEUE.
>> .\" FIXME The page at http://locklessinc.com/articles/futex_cheat_sheet/
>> .\"       notes that "priority-inheritance Futex to priority-inheritance
>> .\"       Futex requeues are currently unsupported". Do we need to say
>> .\"       something in the man page about that?
>>
> 
> I noted this above in response to your request for detail about the
> *REQUEUE_PI semantics and error codes. Was that sufficient?

Between input from you and tglx, I think we're okay.

>>        FUTEX_WAIT_REQUEUE_PI (since Linux 2.6.31)
>>
>> .\" FIXME I find the next sentence (from tglx) pretty hard to grok.
>> .\"       Could someone explain it a bit more?
>>
>>               Wait operation to wait on a  non-PI  futex  at  uaddr  and
>>               potentially  be  requeued  onto a PI futex at uaddr2.  The
>>               wait operation on uaddr is the same  as  FUTEX_WAIT.   
> 
> The point tglx is making here is that you must know ahead of time and
> tell the kernel that you intend to use this futex in a REQUEUE_PI
> operation, and not a regular REQUEUE. This is determined by the both the
> op codes as well as the required arguments, which I also documented
> above. Is more detail required?

I think I've got it now. See my revised version of this text in my reply
to tglx, and let me know if anything is amiss.

>> .\" FIXME I'm not quite clear on the meaning of the following sentence.
>> .\"       Is this trying to say that while blocked in a
>> .\"       FUTEX_WAIT_REQUEUE_PI, it could happen that another
>> .\"       task does a FUTEX_WAKE on uaddr that simply causes
>> .\"       a normal wake, with the result that the FUTEX_WAIT_REQUEUE_PI
>> .\"       does not complete? What happens then to the FUTEX_WAIT_REQUEUE_PI
>> .\"       opertion? Does it remain blocked, or does it unblock
>> .\"       In which case, what does user space see?
>>
>>               The
>>               waiter   can  be  removed  from  the  wait  on  uaddr  via
>>               FUTEX_WAKE without requeueing on uaddr2.
> 
> Userspace should see the task wake and continue executing. This would
> effectively be a cancelation operation - which I didn't think was
> supported. Thomas?

Thomas responded on this point, and I revised my text. Please see
my reply to tglx and let me know if anything is amiss.

>> .\" FIXME XXX The following is a reworded version of Darren Hart's text.
>> .\"       Please check that I did not introduce any errors.
>>        EINVAL (FUTEX_CMP_REQUEUE_PI) An attempt was made  to  requeue  a
>>               waiter  to a futex other than that specified by the match‐
>>               ing FUTEX_WAIT_REQUEUE_PI call for that waiter.
> 
> Acked.

Thanks!

Thanks for the help, Darren!

Cheers,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
--
To unsubscribe from this list: send the line "unsubscribe linux-man" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html