The goal of the present document is to propose a trace format that suits the needs of the embedded, telecom, high-performance and kernel communities. It starts with an overview of the trace format, tracer, and trace analyzer requirements to consider for a Common Trace Format proposal.

This document includes requirements from:

Steven Rostedt <rostedt@xxxxxxxxxxx>
Dominique Toupin <dominique.toupin@xxxxxxxxxxxx>
Aaron Spear <aaron_spear@xxxxxxxxxx>
Philippe Maisonneuve <Philippe.Maisonneuve@xxxxxxxxxxxxx>
Burton, Felix <Felix.Burton@xxxxxxxxxxxxx>
Andrew McDermott <Andrew.McDermott@xxxxxxxxxxxxx>

* Trace Format Requirements

These are requirements on the trace format itself. This section discusses the layout of data in the trace and explains the rationale behind the choices. The rationale for the trace format choices may refer to the tracer and trace analyzer requirements stated below.

This section starts by presenting the common trace model, and then specifies the requirements of an instance of this model specifically tailored to efficient kernel- and user-space tracing. Given that we initially target Linux kernel and user-space tracing use-cases, requirements that might be deferred to a later version of the trace format are tagged with "(v2)". The "(v1)" tag indicates requirements for v1 that might be changed in v2, replaced by more specific requirements.

1) Common Model

This high-level model is meant to be an industry-wide, common model fulfilling the tracing requirements. It is meant to be application-, architecture-, and language-agnostic.

- Event
  - All events recorded in a trace ought to be orderable
    - Mandatory ordering identifier
      - Either timestamp-based or based on unique sequence numbers
      - Optional timestamp in addition to sequence number (v2)
  - Event type (numeric identifier: maps to metadata)
    - Unique ID assigned within a section.
  - Optional context (thread id, virtual cpu id, execution mode (irq/bh/thread), CPU/board/node id) (v2) (context can be encoded in the event payload for v1)
  - Payload (event specific)
    - Variable event size
      - Size limitations: maximum event size should be configurable.
      - Size information available through metadata (and optionally in the event header)
    - Support various data alignments for architectures, standards, and languages:
      - Natural alignment of data for architectures with slow non-aligned writes.
      - Packed layout of headers for architectures with efficient non-aligned writes.
- Section (similar to ELF sections)
  - Contains a subset of event types
  - Optional context applying to all events contained in that section (thread id, virtual cpu id, execution mode (irq/bh/thread), CPU/board/node id) (v2)
- Metadata
  - Description of event context fields (per section)
  - Types available: integers, strings, arrays, sequences, floats, structures, maps (aka enumerations), bitfields, ...
  - Describe type alignment.
  - Mapping to event types with: (section identifier, event numerical identifier)
  - Architecture-agnostic (ascii)
  - Ought to be parsable with a regular grammar
  - Can be streamed along with the trace files
  - Support dynamic addition of events while the trace is active (module loading)
  - Report target bitness
  - Report target byte order
  - Contain a unique address-space identifier (kernel, process ID and timestamp, hypervisor)

2) Efficient Serialization File Format Designed to be Optimally Generated by Systems Written in C

Instance of the model specifically tailored to the Linux kernel and C program/library requirements. Allows either packed events, or events aligned following the ISO C standard.

- Event
  - Payload
    - Initially support ISO C naturally aligned and packed type layouts.
- Each section is represented as a group of trace files (typically one trace file per cpu per section) to allow the tracer to easily append to these sections.
- Trace file
  - Should have no hard-coded limit on file size (64-bit file position is fine)
  - Event-lost counts should be localized: each should apply to a limited time interval and to a trace file, hence to a specific section, so the trace analyzer can provide basic information about what kind of events were lost and where they were lost in the trace.
  - Should be optionally compressible piece-wise.
- Compact representation
  - Minimize the overhead in terms of disk/network/serial port/memory bandwidth.
  - A compact representation can keep more information in smaller buffers, and thus needs less memory to keep the same amount of information around. Also useful to improve cache locality in flight recorder mode.
- Natural alignment of headers for architectures with slow non-aligned writes.
- Packed layout of headers for architectures with efficient non-aligned writes.
- Should have a 1-to-1 mapping between the memory buffers and the generated trace files: allows zero-copy with splice().
- Use target endianness
- Target-OS independent
- Portable across different target (tracer)/host (analyzer) architectures
- Optionally compressible
- Optional checksum on the sub-buffer content (except the sub-buffer header), with a selection of checksum algorithms.
- It should be possible to generate metadata from descriptions written in header files (extraction with C preprocessor macros is one solution).

* Tracer Requirements

Higher-level tracer requirements that seem appropriate to support some of the trace format requirements stated above.

*Fast*
- Low overhead
- Handle large trace throughput (multiple GB per minute)
- Scalable to a high number of cores
  - Per-cpu memory buffers
  - Scalability- and performance-aware synchronization

*Compact*
- Environments without a filesystem
  - Buffered in target RAM, then sent to a host for analysis
  - Ability to tune the size of buffers and transmission medium to minimize the impact on the traced system.
- Streaming (live monitoring)
  - Through sockets (USB, network)
  - Through serial ports
  - There must be a related protocol for streaming this event data.
- Availability of flight recorder (synonym: overwrite) mode
  - Exclusive ownership of reader data.
  - Buffer size should be configurable per group of events.
- Output trace to disk
- Trace buffers available in a crash dump to allow post-mortem analysis
- Fine-grained timestamps
- Lockless (lock-free, ideally wait-free; aka starvation-free)
- Buffer introspection: event written, read and lost counts.
- Ability to iteratively narrow the level of detail and the traced time window, following an initial high-level "state" overview provided by an initial trace collecting everything.
- Support kernel module instrumentation
- Standard way(s) for a host to upload/access trace log data from a target/JTAG device/simulator/etc.
- Conditional tracing in kernel space.
- Compatibility with the power management subsystem (trace collection shall not be a reason for waking up a device)
- Well-defined and stable trace configuration and control API across kernel versions.
- Ability to create and run more than one trace session in parallel at the same time, e.g.:
  - monitoring by system administrators
  - a field engineer troubleshooting a specific problem

* Trace Analyzer Requirements

- Ability to cope with huge traces (> 10 GB)
- It should be possible to do a binary search on the file to find events at least by time (perhaps combined with smart indexing/summary data).
- The file format should be as dense as possible, but not at the expense of analysis performance (faster is more important than smaller, since disks are getting cheaper).
- Must not be required to scan through all events in order to start analyzing (by time, anyway)
- Support live viewing of trace streams
- Standard description of a trace event context.
  (PERI-XML calls it "Dimensions")
- Manage system-wide event scoping with the following hierarchy: (address space identifier, section name, event name)

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com