* Ying Han <yinghan@xxxxxxxxxx> wrote:

> After studying perf a bit, it is not feasible in this case. The cpu &
> memory overhead of perf is overwhelming: each page fault will generate a
> record in the buffer, and there are limits to how much data we can record
> in the buffer and how much can be processed later. Most of the data that
> is recorded by the general perf framework is not needed here.
>
> On the other hand, the memory consumption is very little in this patch.
> We only need to keep a counter for each bucket, and the recording can go
> on as long as the machine is up. As also measured, there is no overhead
> in the data collection :)
>
> So, perf is not an option for this purpose.

It's not a fundamental limitation in perf though. The way i always thought
perf could be extended to support heavy-duty profiling such as your patch
does would be along the following lines.

Right now perf supports three output methods:

  'full detail':          per sample records, recorded in the ring-buffer

  'filtered full detail': per sample records, filtered, recorded in the
                          ring-buffer

  'full summary':         the count of all samples (a simple counter), no
                          recording

What i think would make sense is to introduce a fourth variant, which is a
natural intermediate of the above output methods:

  'partial summary':      partially summarized samples, recorded in an array
                          in the ring-buffer - an extended, multi-dimensional
                          'count'

A histogram like yours would be one (small) sub-case of this new model.

Now, to keep things maximally flexible we really do not want to hard-code
histogram summary functions: i.e. we do not want to limit ourselves to
'latency histograms' or 'frequency histograms'. To achieve that flexibility
we could define the histogram function as a simple extension to filters:
filters that evaluate to an integer value.

For example, if we defined the following tracepoint in arch/x86/mm/fault.c:

  TRACE_EVENT(mm_pagefault,

          TP_PROTO(u64 time_start, u64 time_end, unsigned long address,
                   unsigned long error_code, unsigned long ip),

          TP_ARGS(time_start, time_end, address, error_code, ip),

          TP_STRUCT__entry(
                  __field(u64, time_start)
                  __field(u64, time_end)
                  __field(unsigned long, address)
                  __field(unsigned long, error_code)
                  __field(unsigned long, ip)
          ),

          TP_fast_assign(
                  __entry->time_start = time_start;
                  __entry->time_end   = time_end;
                  __entry->address    = address;
                  __entry->error_code = error_code;
                  __entry->ip         = ip;
          ),

          TP_printk("time_start=%llu time_end=%llu address=%lx error_code=%lx ip=%lx",
                    (unsigned long long)__entry->time_start,
                    (unsigned long long)__entry->time_end,
                    __entry->address, __entry->error_code, __entry->ip)
  );

then the following filter expressions could be used to calculate the
histogram index and the per-bucket value:

     index: "(time_end - time_start)/1000"
  iterator: "curr + 1"

The /1000 in the index expression means that there is one separate bucket
per microsecond of delay. The "curr + 1" iterator expression means that
every event adds +1 to the value of the bucket it is indexed into.

Today our filter expressions evaluate to a very small subset of the
integers: 0 or 1 :-) Extending them to full integer calculations is possible
and would be desirable for other purposes as well, not just histograms.
Adding integer operators in addition to the logical and bitwise operators
the filter engine supports today would be useful too. (See
kernel/trace/trace_events_filter.c for the current filter engine.)
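To make the arithmetic concrete, here is a minimal user-space sketch of what
such a 'partial summary' would compute for the two expressions above. This is
not kernel code and not an existing perf/ftrace interface - the bucket count,
the sample values and all names are made up for illustration. The index
expression picks a bucket, the iterator expression produces the new value of
that bucket:

  /* user-space illustration only - not an existing kernel API */
  #include <stdio.h>
  #include <stdint.h>

  #define NR_BUCKETS 1024

  static uint64_t buckets[NR_BUCKETS];

  /* index expression: "(time_end - time_start)/1000", one bucket per usec */
  static uint64_t hist_index(uint64_t time_start, uint64_t time_end)
  {
          uint64_t idx = (time_end - time_start) / 1000;

          return idx < NR_BUCKETS ? idx : NR_BUCKETS - 1; /* clamp overflow */
  }

  /* iterator expression: "curr + 1", a plain per-bucket event count */
  static uint64_t hist_iter(uint64_t curr)
  {
          return curr + 1;
  }

  int main(void)
  {
          /* made-up samples: (time_start, time_end) pairs, in nanoseconds */
          uint64_t samples[][2] = { { 0, 1500 }, { 0, 1800 }, { 10000, 260000 } };
          unsigned int i;

          for (i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
                  uint64_t idx = hist_index(samples[i][0], samples[i][1]);

                  buckets[idx] = hist_iter(buckets[idx]);
          }

          for (i = 0; i < NR_BUCKETS; i++)
                  if (buckets[i])
                          printf("bucket %u usec: %llu events\n",
                                 i, (unsigned long long)buckets[i]);

          return 0;
  }

In the kernel these expressions would of course be evaluated by the
(extended) filter engine rather than hard-coded as C functions; swapping in a
different index or iterator expression changes the kind of histogram without
touching the recording path - some examples of that below.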
This way we would have the equivalent functionality and performance of your
histogram patch - and it would also open up many, *many* other nice
possibilities as well:

 - this could be used with any event, anywhere - it could even be used with
   hardware events. We could sample with an NMI every 100 usecs and profile
   with relatively small overhead.

 - arbitrarily large histograms could be created: need a 10 GB histogram on
   a really large system? No problem, create a ring-buffer that big.

 - many different types of summaries are possible as well:

    - we could create a histogram over *which* code pagefaults, by using
      the "ip" (faulting instruction) address as the index and a
      sufficiently large ring-buffer.

    - a histogram over the address space (which vmas are the hottest ones),
      by changing the index expression to "address/1000000" to get
      per-megabyte buckets.

    - weighted histograms: for example, if the iterator expression is
      "curr + (time_end - time_start)/1000" and the index expression is
      "address/1000000", then we get an address-indexed histogram weighted
      by latency: the more latency a given area of memory causes, the
      hotter its bucket.

 - the existing event filter code can be used to filter the incoming events
   to begin with: for example, an "error_code == 1" filter would limit the
   histogram to write faults (page dirtying).

So instead of adding just one hardcoded histogram type, it would be really
nice to work on a more generic solution!

	Thanks,

		Ingo