On Thu, 2014-02-27 at 09:01 -0800, Linus Torvalds wrote:
> On Thu, Feb 27, 2014 at 7:37 AM, Torvald Riegel <triegel@xxxxxxxxxx> wrote:
> > Regarding the latter, we make a fresh start at each mo_consume load (ie,
> > we assume we know nothing -- L could have returned any possible value);
> > I believe this is easier to reason about than other scopes like function
> > granularities (what happens on inlining?), or translation units.  It
> > should also be simple to implement for compilers, and would hopefully
> > not constrain optimization too much.
> >
> > [...]
> >
> > Paul's litmus test would work, because we guarantee to the programmer
> > that it can assume that the mo_consume load would return any value
> > allowed by the type; effectively, this forbids the compiler analysis
> > Paul thought about:
>
> So realistically, since with the new wording we can ignore the silly
> cases (ie "p-p") and we can ignore the trivial-to-optimize compiler
> cases ("if (p == &variable) .. use p"), and you would forbid the
> "global value range optimization case" that Paul brought up, what
> remains would seem to be just really subtle compiler transformations
> of data dependencies to control dependencies.
>
> And the only such thing I can think of is basically compiler-initiated
> value-prediction, presumably directed by PGO (since now if the value
> prediction is in the source code, it's considered to break the value
> chain).

The other example that comes to mind would be feedback-directed JIT
compilation.  I don't think that's widely used today, and it might never
be for the kernel -- but *in the standard*, we at least have to consider
what the future might bring.

> The good thing is that, afaik, value-prediction is largely not used in
> real life.  There are lots of papers on it, but I don't think anybody
> actually does it (although I can easily see some specint-specific
> optimization pattern that is built up around it).
>
> And even value prediction is actually fine, as long as the compiler
> can see the memory *source* of the value prediction (and it isn't a
> mo_consume).  So it really ends up limiting your value prediction in
> very simple ways: you cannot do it to function arguments if they are
> registers.  But you can still do value prediction on values you loaded
> from memory, if you can actually *see* that memory op.

I think one would need to show that the source is *not even indirectly*
a mo_consume load.  With the wording I proposed, value dependencies
don't break when storing to / loading from memory locations.  Thus, if a
compiler ends up at a memory load after walking SSA, it needs to prove
that the load cannot read a value that (1) was produced by a store
sequenced-before the load and (2) might carry a value dependency (e.g.,
because it came from a mo_consume load) that the value prediction in
question would break.  This, in general, requires alias analysis.
Deciding whether a prediction would break a value dependency also has to
consider what later stages in the compiler will be doing, including LTO
or further rounds of inlining/optimizations.

OTOH, if the compiler can treat an mo_consume load as returning all
possible values (eg, by ignoring all knowledge about it), then it can
certainly do so with other memory loads too.

So, I think that the constraints due to value dependencies can matter in
practice.  However, the impact on optimizations of
non-mo_consume-related code is hard to estimate -- I don't see a huge
amount of impact right now, but I also wouldn't want to predict that
this can't change in the future.

> Of course, on more strongly ordered CPU's, even that "register
> argument" limitation goes away.
>
> So I agree that there is basically no real optimization constraint.
> Value-prediction is of dubious value to begin with, and the actual
> constraint on its use if some compiler writer really wants to is not
> onerous.
>
> > What I have in mind is roughly the following (totally made-up syntax --
> > suggestions for how to do this properly are very welcome):
> > * Have a type modifier (eg, like restrict), that specifies that
> >   operations on data of this type are preserving value dependencies:
>
> So I'm not violently opposed, but I think the upsides are not great.
> Note that my earlier suggestion to use "restrict" wasn't because I
> believed the annotation itself would be visible, but basically just as
> a legalistic promise to the compiler that *if* it found an alias, then
> it didn't need to worry about ordering.  So to me, that type modifier
> was about conceptual guarantees, not about actual value chains.
>
> Anyway, the reason I don't believe any type modifier (and
> "[[carries_dependency]]" is basically just that) is worth it is simply
> that it adds a real burden on the programmer, without actually giving
> the programmer any real upside:
>
> Within a single function, the compiler already sees that mo_consume
> source, and so doing a type-based restriction doesn't really help.  The
> information is already there, without any burden on the programmer.

I think it's not just a question of whether we're talking about a single
function or across functions, but of the extent to which other code can
detect whether it might have to consider value dependencies.  The
store/load case above is an example that complicates this detection for
a compiler.

In cases in which the mo_consume load is used directly, we don't need to
use any annotations on the type:

  int val = atomic_load_explicit(ptr, mo_consume)->value;

However, if we need to use the load's result more than once (which I
think will happen often), then we do need the type annotation:

  s value_dep_preserving *p = atomic_load_explicit(ptr, mo_consume);
  if (p != 0)
    int val = p->value;

If we want to avoid the annotation in this case, and still want to avoid
the store/load vs.
alias analysis problem mentioned above, we'd need to require that p
isn't a variable that's visible to other code not related to this
mo_consume load.  But I believe that such a requirement would be
awkward, and also hard to specify.

I hope that Paul's look at rcu_dereference() usage can provide some
indication of how much annotation overhead there actually would be for a
programmer.

>
> And across functions, the compiler has already - by definition -
> mostly lost sight of all the things it could use to reduce the value
> space.

I don't think that I agree here.  Assume we have two separate functions
bar and foo, and one temporary variable t of a type int012 that holds
only the values 0, 1, and 2 (excuse the somewhat artificial example):

  int012 t;
  int arr[20];

  int bar(int a)
  {
    bar_stuff(a);  // compiler knows this is a noop with arguments 0 or 1
                   // and that this will *never* touch t nor arr
    return a;
  }

  int foo(int a)
  {
    foo_stuff(a);  // compiler knows this is a noop with arguments 1 or 2
                   // and that this will *never* touch t nor arr
    return a;
  }

  void main()
  {
    t = atomic_load_explicit(&source, mo_consume);
    x = arr[bar(foo(t))];  // value-dependent?
  }

If a compiler looks at foo() and bar() separately, I think it might want
to optimize them to the following:

  int bar_opt(int a)
  {
    if (a != 2) return a;
    bar_stuff(a);
    return a;
  }

  int foo_opt(int a)
  {
    if (a != 0) return a;
    foo_stuff(a);
    return a;
  }

I think that those could be valuable optimizations for general-purpose
code.  What happens if the compiler does LTO afterwards and combines the
foo and bar calls?:

  int bar_opt_foo_opt(int a)
  {
    if (a == 1) return a;
    if (a == 0)
      foo_stuff(a);
    else
      bar_stuff(a);
    return a;
  }

This still looks like a good thing to do for general-purpose code, and
it doesn't do any value prediction.
If we inline this into main, it becomes kind of difficult for the
compiler, because it cannot just weave in bar_opt_foo_opt, or it might
get:

  t = atomic_load_explicit(&source, mo_consume);
  if (t == 1) goto access;
  if (t == 0)
    foo_stuff(t);
  else
    bar_stuff(t);
access:
  x = arr[t];  // value-dependent?

Would this still be value-dependent for the hardware, or would branch
prediction interfere?

Even if this is still okay from the hardware POV, other compiler
transformations now need to pay attention to where the value comes from.
In particular, we can't specialize this into the following (which
doesn't predict any values):

  t = atomic_load_explicit(&source, mo_consume);
  if (t == 1)
    x = arr[1];
  else {
    if (t == 0)
      foo_stuff(t);
    else
      bar_stuff(t);
    x = arr[t];
  }

We could argue that this wouldn't be allowed because t comes from an
mo_consume load, but then we also need to say that this can in fact
affect compiler transformations other than just value prediction.

At least in this example, introducing the value_dep_preserving type
modifier would have made this easier, because it would have allowed the
compiler to avoid the initial value prediction.  However, I think that
we might get a few interesting issues even with value_dep_preserving, so
this needs further investigation.  Nonetheless, my gut feeling is that
having value_dep_preserving makes this all a lot easier, because if
unsure, the compiler can just use a bigger hammer for
value_dep_preserving code without having to worry about preventing
optimizations on unrelated code.

> Even Paul's example doesn't really work if the use of the
> "mo_consume" value has been passed to another function, because inside
> a separate function, the compiler couldn't see that the value it uses
> comes from only two possible values.
>
> And as mentioned, even *if* the compiler wants to do value prediction
> that turns a data dependency into a control dependency, the limitation
> to say "no, you can't do it unless you saw where the value got loaded"
> really isn't that onerous.
>
> I bet that if you ask actual production compiler people (as opposed to
> perhaps academia), none of them actually really believe in value
> prediction to begin with.

What about keeping the question of whether we really need
value_dep_preserving open for now, and trying to get more feedback from
compiler implementers on what the consequences of not having it would
be?  This should help us assess the current implementation perspective.
We would still need to extrapolate what future compilers might want to
do, though; thus, deciding not to use it would remain somewhat risky for
future optimizations.

--
To unsubscribe from this list: send the line "unsubscribe linux-arch" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html