The goal of the present document is to propose a trace format that suits the needs of the embedded, telecom, high-performance and kernel communities. It starts with an overview of the trace format, tracer, and trace analyzer requirements to consider for a Common Trace Format proposal.

This document includes requirements from:

Steven Rostedt <rostedt@xxxxxxxxxxx>
Dominique Toupin <dominique.toupin@xxxxxxxxxxxx>
Aaron Spear <aaron_spear@xxxxxxxxxx>
Philippe Maisonneuve <Philippe.Maisonneuve@xxxxxxxxxxxxx>
Burton, Felix <Felix.Burton@xxxxxxxxxxxxx>
Andrew McDermott <Andrew.McDermott@xxxxxxxxxxxxx>

* Trace Format Requirements

These are requirements on the trace format itself. This section discusses the layout of data in the trace and explains the rationale behind the choices. The rationale for the trace format choices may refer to the tracer and trace analyzer requirements stated below.

This section starts by presenting the common trace model, and then specifies the requirements of an instance of this model specifically tailored to efficient kernel- and user-space tracing. Given that we initially target Linux kernel and user-space tracing use-cases, requirements that might be deferred to a later version of the trace format are tagged with "(v2)". The "(v1)" tag indicates requirements for v1 that might be changed in v2, replaced by more specific requirements.

1) Common Model

This high-level model is meant to be an industry-wide, common model fulfilling the tracing requirements. It is meant to be application-, architecture-, and language-agnostic.

- Event
  - All events recorded in a trace ought to be orderable
    - Mandatory ordering identifier
      - Either timestamp-based or based on unique sequence numbers
      - Optional timestamp in addition to sequence number (v2)
  - Event type (numeric identifier: maps to metadata)
    - Unique ID assigned within a section.
  - Optional context (thread id, virtual cpu id, execution mode (irq/bh/thread), CPU/board/node id) (v2) (context can be encoded in the event payload for v1)
  - Payload (event specific)
    - Variable event size
      - Size limitations: maximum event size should be configurable.
      - Size information available through metadata (and optionally in the event header)
    - Support various data alignments for architectures, standards, and languages:
      - Natural alignment of data for architectures with slow non-aligned writes.
      - Packed layout of headers for architectures with efficient non-aligned writes.
- Section (similar to ELF sections)
  - Contains a subset of event types
  - Optional context applying to all events contained in that section (thread id, virtual cpu id, execution mode (irq/bh/thread), CPU/board/node id) (v2)
- Metadata
  - Description of event context fields (per section)
  - Types available: integers, strings, arrays, sequences, floats, structures, maps (aka enumerations), bitfields, ...
  - Describe type alignment.
  - Mapping to event types with: (section identifier, event numerical identifier)
  - Architecture-agnostic (ascii)
  - Ought to be parsable with a regular grammar
  - Can be streamed along with the trace files
  - Support dynamic addition of events while the trace is active (module loading)
  - Report target bitness
  - Report target byte order
  - Contain a unique address-space identifier (kernel, process ID and timestamp, hypervisor)

2) Efficient Serialization File Format Designed to be Optimally Generated by Systems Written in C

Instance of the model specifically tailored to the Linux kernel and C program/library requirements. Allows either packed events, or events aligned following the ISO C standard.

- Event
  - Payload
    - Initially support ISO C naturally aligned and packed type layouts.
- Each section is represented as a group of trace files (typically one trace file per cpu per section) to allow the tracer to easily append to these sections.
- Trace file
  - Should have no hard-coded limit on file size (64-bit file position is fine)
  - Event-lost counts should be localized: each should apply to a limited time interval and to a trace file, hence to a specific section, so the trace analyzer can provide basic information about what kind of events were lost and where they were lost in the trace.
  - Should be optionally compressible piece-wise.
- Compact representation
  - Minimize the overhead in terms of disk/network/serial port/memory bandwidth.
  - A compact representation can keep more information in smaller buffers, and thus needs less memory to keep the same amount of information around. Also useful to improve cache locality in flight recorder mode.
- Natural alignment of headers for architectures with slow non-aligned writes.
- Packed layout of headers for architectures with efficient non-aligned writes.
- Should have a 1-to-1 mapping between the memory buffers and the generated trace files: allows zero-copy with splice().
- Use target endianness
- Target-OS independent
- Portable across different target (tracer)/host (analyzer) architectures
- Optionally compressible
- Optional checksum on the sub-buffer content (except the sub-buffer header), with a selection of checksum algorithms.
- It should be possible to generate metadata from descriptions written in header files (extraction with C preprocessor macros is one solution).

* Tracer Requirements

Higher-level tracer requirements that seem appropriate to support some of the trace format requirements stated above.

*Fast*
- Low overhead
- Handle large trace throughput (multiple GB per minute)
- Scalable to a high number of cores
  - Per-cpu memory buffers
  - Scalability- and performance-aware synchronization

*Compact*
- Environments without a filesystem
  - Buffered in target RAM, then sent to a host for analysis
  - Ability to tune the size of buffers and transmission medium to minimize the impact on the traced system.
- Streaming (live monitoring)
  - Through sockets (USB, network)
  - Through serial ports
  - There must be a related protocol for streaming this event data.
- Availability of flight recorder (synonym: overwrite) mode
  - Exclusive ownership of reader data.
  - Buffer size should be configurable per group of events.
- Output trace to disk
- Trace buffers available in a crash dump to allow post-mortem analysis
- Fine-grained timestamps
- Lockless (lock-free, ideally wait-free; aka starvation-free)
- Buffer introspection: event written, read and lost counts.
- Ability to iteratively narrow the level of detail and the traced time window, following an initial high-level "state" overview provided by an initial trace collecting everything.
- Support kernel module instrumentation
- Standard way(s) for a host to upload/access trace log data from a target/JTAG device/simulator/etc.
- Conditional tracing in kernel space.
- Compatibility with the power management subsystem (trace collection shall not be a reason for waking up a device)
- Well-defined and stable trace configuration and control API across kernel versions.
- Ability to create and run more than one trace session in parallel at the same time, e.g.:
  - monitoring by system administrators
  - a field engineer troubleshooting a specific problem

* Trace Analyzer Requirements

- Ability to cope with huge traces (> 10 GB)
- It should be possible to do a binary search on the file to find events at least by time (perhaps combined with smart indexing/summary data).
- The file format should be as dense as possible, but not at the expense of analysis performance (faster is more important than smaller, since disks are getting cheaper).
- Must not be required to scan through all events in order to start analyzing (by time, anyway)
- Support live viewing of trace streams
- Standard description of a trace event context.
  (PERI-XML calls it "Dimensions")
- Manage system-wide event scoping with the following hierarchy: (address space identifier, section name, event name)

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com