On Sat, May 28, 2011 at 3:17 AM, Ingo Molnar <mingo@xxxxxxx> wrote:
>
> * Ying Han <yinghan@xxxxxxxxxx> wrote:
>
>> After studying perf a bit, it is not feasible in this case. The
>> CPU & memory overhead of perf is overwhelming: each page fault
>> generates a record in the buffer, which limits how much data we
>> can record in the buffer and how much can be processed later.
>> Most of the data recorded by the general perf framework is not
>> needed here.
>>
>> On the other hand, the memory consumption of this patch is very
>> small: we only need to keep a counter per bucket, and recording
>> can go on for as long as the machine is up. As measured, there is
>> no overhead in the data collection :)
>>
>> So perf is not an option for this purpose.
>
> It's not a fundamental limitation in perf though.
>
> The way i always thought perf could be extended to support heavy-duty
> profiling such as your patch does would be along the following lines:
>
> Right now perf supports three output methods:
>
>   'full detail':          per-sample records, recorded in the ring-buffer
>   'filtered full detail': per-sample records, filtered, recorded in the ring-buffer
>   'full summary':         the count of all samples (a simple counter), no recording
>
> What i think would make sense is to introduce a fourth variant, which
> is a natural intermediate of the above output methods:
>
>   'partial summary':      partially summarized samples, recorded in an
>                           array in the ring-buffer - an extended,
>                           multi-dimensional 'count'.
>
> A histogram like yours would be one (small) sub-case of this new
> model.
>
> Now, to keep things maximally flexible we really do not want to
> hard-code histogram summary functions: i.e. we do not want to
> hard-code ourselves to 'latency histograms' or 'frequency
> histograms'.
>
> To achieve that flexibility we could define the histogram function as
> a simple extension to filters: filters that evaluate to an integer
> value.
>
> For example, if we defined the following tracepoint in
> arch/x86/mm/fault.c:
>
> TRACE_EVENT(mm_pagefault,
>
>         TP_PROTO(u64 time_start, u64 time_end, unsigned long address,
>                  unsigned long error_code, unsigned long ip),
>
>         TP_ARGS(time_start, time_end, address, error_code, ip),
>
>         TP_STRUCT__entry(
>                 __field(u64,            time_start)
>                 __field(u64,            time_end)
>                 __field(unsigned long,  address)
>                 __field(unsigned long,  error_code)
>                 __field(unsigned long,  ip)
>         ),
>
>         TP_fast_assign(
>                 __entry->time_start  = time_start;
>                 __entry->time_end    = time_end;
>                 __entry->address     = address;
>                 __entry->error_code  = error_code;
>                 __entry->ip          = ip;
>         ),
>
>         TP_printk("time_start=%llu time_end=%llu address=%lx error_code=%lx ip=%lx",
>                 __entry->time_start, __entry->time_end,
>                 __entry->address, __entry->error_code, __entry->ip)
> );
>
> Then the following filter expressions could be used to calculate the
> histogram index and value:
>
>   index:    "(time_end - time_start)/1000"
>   iterator: "curr + 1"
>
> The /1000 in the index expression means that there is one separate
> bucket per microsecond of delay.
>
> The "curr + 1" iterator expression means that for every event we add
> +1 to the current value of the bucket the event falls into.
>
> Today our filter expressions evaluate to a small subset of the
> integers: 0 or 1 :-)
>
> Extending them to full integer calculations is possible and would be
> desirable for other purposes as well, not just histograms. Adding
> integer operators in addition to the logical and bitwise operators
> the filter engine supports today would be useful as well.
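>
> To make that concrete, here is a rough C sketch of what the per-event
> histogram update would boil down to, with those two expressions
> hard-coded instead of interpreted. (This is hypothetical illustration
> code, not the actual filter engine: the 'struct hist' and
> 'hist_update' names are made up, and the timestamps are assumed to
> be in nanoseconds.)
>
>         struct hist {
>                 u64     *buckets;
>                 u64     nr_buckets;
>         };
>
>         /*
>          * Per-event update for:
>          *
>          *   index:    "(time_end - time_start)/1000"
>          *   iterator: "curr + 1"
>          */
>         static void hist_update(struct hist *h, u64 time_start, u64 time_end)
>         {
>                 /* index expression: one bucket per microsecond of delay */
>                 u64 idx = (time_end - time_start) / 1000;
>                 u64 curr;
>
>                 /* out-of-range events land in the last bucket */
>                 if (idx >= h->nr_buckets)
>                         idx = h->nr_buckets - 1;
>
>                 /* iterator expression: "curr + 1" */
>                 curr = h->buckets[idx];
>                 h->buckets[idx] = curr + 1;
>         }
>
> Everything perf-specific (ring-buffer placement, expression
> evaluation) is left out here - the point is that the per-event cost
> is a couple of arithmetic operations and one memory update.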
> (See kernel/trace/trace_events_filter.c for the current filter
> engine.)
>
> This way we would have the equivalent functionality and performance
> of your histogram patch - and it would also open up many, *many*
> other nice possibilities as well:
>
>  - this could be used with any event, anywhere - it could even be
>    used with hardware events. We could sample with an NMI every 100
>    usecs and profile with relatively small overhead.
>
>  - arbitrarily large histograms could be created: need a 10 GB
>    histogram on a really large system? No problem, create a
>    sufficiently big ring-buffer.
>
>  - many different types of summaries are possible as well:
>
>     - we could create a histogram over *which* code pagefaults, by
>       using the "ip" (faulting instruction) address as the index and
>       a sufficiently large ring-buffer.
>
>     - a histogram over the address space (which vmas are the hottest
>       ones), by changing the index expression to "address/1000000"
>       to get per-megabyte buckets.
>
>     - weighted histograms: for example, if the iterator expression
>       is "curr + (time_end - time_start)/1000" and the index
>       expression is "address/1000000", then we get an address-indexed
>       histogram weighted by the length of the latency: the more
>       latency a given area of memory causes, the hotter its bucket.
>
>  - the existing event filter code can be used to filter the incoming
>    events to begin with: for example, an "error_code == 1" filter
>    would limit the histogram to write faults (page dirtying).
>
> So instead of adding just one hardcoded histogram type, it would be
> really nice to work on a more generic solution!
>
> Thanks,
>
>         Ingo

Hi Ingo,

Thank you for the detailed information. This patch is being used to
evaluate the memcg reclaim patch, and I have gotten some interesting
results with it. I will post the next version of the patch, which
makes a couple of improvements based on the comments from this
thread. In the meantime, I will need to study your suggestion some
more :)

Thanks

--Ying