* Ying Han <yinghan@xxxxxxxxxx> wrote:

> After studying perf a bit, it is not feasible in this case. The cpu &
> memory overhead of perf is overwhelming: each page fault will generate a
> record in the buffer, and there are limits to how much data we can record
> in the buffer and how much can be processed later. Most of the data that
> is recorded by the general perf framework is not needed here.
>
> On the other hand, the memory consumption is very little in this patch.
> We only need to keep a counter for each bucket, and the recording can go
> on as long as the machine is up. As also measured, there is no overhead
> in the data collection :)
>
> So, perf is not an option for this purpose.

It's not a fundamental limitation in perf though. The way i always thought
perf could be extended to support heavy-duty profiling such as your patch
does would be along the following lines.

Right now perf supports three output methods:

  'full detail':          per sample records, recorded in the ring-buffer

  'filtered full detail': per sample records, filtered, recorded in the
                          ring-buffer

  'full summary':         the count of all samples (a simple counter), no
                          recording

What i think would make sense is to introduce a fourth variant, which is a
natural intermediate of the above output methods:

  'partial summary':      partially summarized samples, recorded in an array
                          in the ring-buffer - an extended, multi-dimensional
                          'count'

A histogram like yours would be one (small) sub-case of this new model.

Now, to keep things maximally flexible we really do not want to hard-code
histogram summary functions: i.e. we do not want to limit ourselves to
'latency histograms' or 'frequency histograms'. To achieve that flexibility
we could define the histogram function as a simple extension to filters:
filters that evaluate to an integer value.

For example, if we defined the following tracepoint in arch/x86/mm/fault.c:

  TRACE_EVENT(mm_pagefault,

          TP_PROTO(u64 time_start, u64 time_end, unsigned long address,
                   unsigned long error_code, unsigned long ip),

          TP_ARGS(time_start, time_end, address, error_code, ip),

          TP_STRUCT__entry(
                  __field(u64, time_start)
                  __field(u64, time_end)
                  __field(unsigned long, address)
                  __field(unsigned long, error_code)
                  __field(unsigned long, ip)
          ),

          TP_fast_assign(
                  __entry->time_start = time_start;
                  __entry->time_end   = time_end;
                  __entry->address    = address;
                  __entry->error_code = error_code;
                  __entry->ip         = ip;
          ),

          TP_printk("time_start=%llu time_end=%llu address=%lx error_code=%lx ip=%lx",
                    (unsigned long long)__entry->time_start,
                    (unsigned long long)__entry->time_end,
                    __entry->address, __entry->error_code, __entry->ip)
  );

then the following filter expressions could be used to calculate the
histogram index and the per-bucket value:

     index: "(time_end - time_start)/1000"
  iterator: "curr + 1"

The /1000 in the index expression means that there is one separate bucket
per microsecond of delay. The "curr + 1" iterator expression means that
every event adds +1 to the value of the bucket it is indexed into.

Today our filter expressions evaluate to a very small subset of the
integers: 0 or 1 :-) Extending them to full integer calculations is possible
and would be desirable for other purposes as well, not just histograms.
Adding integer operators in addition to the logical and bitwise operators
the filter engine supports today would be useful too. (See
kernel/trace/trace_events_filter.c for the current filter engine.)
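To make the arithmetic concrete, here is a minimal user-space sketch of what
such a 'partial summary' would compute for the two expressions above. This is
not kernel code and not an existing perf/ftrace interface - the bucket count,
the sample values and all names are made up for illustration. The index
expression picks a bucket, the iterator expression produces the new value of
that bucket:

  /* user-space illustration only - not an existing kernel API */
  #include <stdio.h>
  #include <stdint.h>

  #define NR_BUCKETS 1024

  static uint64_t buckets[NR_BUCKETS];

  /* index expression: "(time_end - time_start)/1000", one bucket per usec */
  static uint64_t hist_index(uint64_t time_start, uint64_t time_end)
  {
          uint64_t idx = (time_end - time_start) / 1000;

          return idx < NR_BUCKETS ? idx : NR_BUCKETS - 1; /* clamp overflow */
  }

  /* iterator expression: "curr + 1", a plain per-bucket event count */
  static uint64_t hist_iter(uint64_t curr)
  {
          return curr + 1;
  }

  int main(void)
  {
          /* made-up samples: (time_start, time_end) pairs, in nanoseconds */
          uint64_t samples[][2] = { { 0, 1500 }, { 0, 1800 }, { 10000, 260000 } };
          unsigned int i;

          for (i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
                  uint64_t idx = hist_index(samples[i][0], samples[i][1]);

                  buckets[idx] = hist_iter(buckets[idx]);
          }

          for (i = 0; i < NR_BUCKETS; i++)
                  if (buckets[i])
                          printf("bucket %u usec: %llu events\n",
                                 i, (unsigned long long)buckets[i]);

          return 0;
  }

In the kernel these expressions would of course be evaluated by the
(extended) filter engine rather than hard-coded as C functions; swapping in a
different index or iterator expression changes the kind of histogram without
touching the recording path - some examples of that below.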
This way we would have the equivalent functionality and performance of your
histogram patch - and it would also open up many, *many* other nice
possibilities as well:

 - this could be used with any event, anywhere - it could even be used with
   hardware events. We could sample with an NMI every 100 usecs and profile
   with relatively small overhead.

 - arbitrarily large histograms could be created: need a 10 GB histogram on
   a really large system? No problem, create a ring-buffer that big.

 - many different types of summaries are possible as well:

    - we could create a histogram over *which* code pagefaults, by using
      the "ip" (faulting instruction) address as the index and a
      sufficiently large ring-buffer.

    - a histogram over the address space (which vmas are the hottest ones),
      by changing the index expression to "address/1000000" to get
      per-megabyte buckets.

    - weighted histograms: for example, if the iterator expression is
      "curr + (time_end - time_start)/1000" and the index expression is
      "address/1000000", then we get an address-indexed histogram weighted
      by latency: the more latency a given area of memory causes, the
      hotter its bucket.

 - the existing event filter code can be used to filter the incoming events
   to begin with: for example, an "error_code == 1" filter would limit the
   histogram to write faults (page dirtying).

So instead of adding just one hardcoded histogram type, it would be really
nice to work on a more generic solution!

	Thanks,

		Ingo