Re: [PATCH v5 3.1.0-rc4-tip 8/26] x86: analyze instruction and determine fixups.

Josh Stone <jistone@xxxxxxxxxx> · Mon, 26 Sep 2011 12:59:46 -0700

On 09/23/2011 04:53 AM, Masami Hiramatsu wrote:
>> Masami looked at this and found that SystemTap sdt.h currently requires
>> an extra userspace memory store in order to activate probes.  Each probe
>> has a "semaphore" 16-bit counter which applications may test before
>> hitting the probe itself.  This is used to avoid overhead in
>> applications that do expensive argument processing (e.g. creating
>> strings) for probes.
> Indeed, originally, those semaphores designed for such use cases.
> However, some applications *always* use it (e.g. qemu-kvm).

I found that qemu-kvm generates its tracepoints like this:

  static inline void trace_$name($args) {
      if (QEMU_${nameupper}_ENABLED()) {
          QEMU_${nameupper}($argnames);
      }
  }

In that case, the $args are always computed to call the inline, so
you'll basically just get a memory read, jump, NOP.  There's no benefit
from checking ENABLED() here, and removing it would leave only the NOP.
 Even if you invent an improved mechanism for ENABLED(), that doesn't
change the fact that it's doing useless work here.

So in this case, it may be better to patch qemu, assuming my statements
hold for DTrace's implementation on other platforms too.  The ENABLED()
guard still does have other genuine uses though, as with the string
preparation in Python's probes.

On 09/23/2011 09:51 AM, Stefan Hajnoczi wrote:
>> I'm not sure that we should stick on the current implementation
>> of the sdt.h. I think we'd better modify the sdt.h to replace
>> such semaphores with checking whether the tracepoint is changed from nop.
> 
> I like this option.  The only implication is that all userspace tracing
> needs to go through uprobes if we want to support multiple consumers
> tracing the same address.

This limitation is practically true already, since sharing consumers
have to negotiate the breakpoint anyway.

If we can find a better way to handle semaphores, we at systemtap will
welcome sdt.h improvements.  On the face of it, checking one's own NOP
for modification sounds pretty elegant, but I'm not convinced that it's
possible in practice.

For one, it requires arch specific knowledge in sdt.h of what the NOP or
breakpoint looks like, whereas sdt.h currently only knows whether to use
NOP or NOP 0, without knowledge of how that's encoded.  And this gets
trickier with archs like IA64 where you're part of a bundle.  So this
much is hard, but not impossible.

Another issue is that there's not an easy compile-time correlation
between semaphore checks and probe locations, nor is it necessarily a
1:1 mapping.  The FOO_ENABLED() and PROBE_FOO() code blocks are
distinct, and the compiler can do many tricks with them, loop unrolling,
function specialization, etc.  And if we start placing constraints to
prevent this, then I think we'll be impacting code-gen of the
application more than we'd like.

So, I invite sdt.h prototypes of the nop check, but I'm skeptical...

>> Or, we can introduce an add-hoc ptrace code to perftools for modifying
>> those semaphores. However, this means that user always has to use
>> perf to trace applications, and it's hard to trace multiple applications
>> at a time (can we attach all of them?)...
> 
> I don't think perf needs to stay attached to the processes.  It just
> needs to increment the semaphores on startup and decrement them on
> shutdown.

You're still relying on getting in there twice, but ptrace could be busy
either or both times.  Plus you could inadvertently block any of the
other legitimate ptrace apps, especially if doing systemwide probing.

FWIW, the counter-semaphore is really only useful for the case where the
breakpoint is placed centrally (by uprobes), but the semaphore is
managed by each separate consumer.  In that case each consumer can
inc/dec their presence.  But if uprobes were to manage this itself, it
basically becomes a simple flag.  So, it would do the trick to have
uprobes take an extra inode offset as a flag to write in a 1, which is
admittedly a bit gross, but IMO the most workable.

Josh

PS - context for the CCed systemtap list:
https://lkml.org/lkml/2011/9/23/93

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>