From: Jeff Hostetler <jeffhost@xxxxxxxxxxxxx>

Create design documentation to describe the telemetry feature.

Signed-off-by: Jeff Hostetler <jeffhost@xxxxxxxxxxxxx>
---
 Documentation/technical/telemetry.txt | 475 ++++++++++++++++++++++++++++++++++
 1 file changed, 475 insertions(+)
 create mode 100644 Documentation/technical/telemetry.txt

diff --git a/Documentation/technical/telemetry.txt b/Documentation/technical/telemetry.txt
new file mode 100644
index 0000000..0a708ad
--- /dev/null
+++ b/Documentation/technical/telemetry.txt
@@ -0,0 +1,475 @@
+Telemetry Design Notes
+======================
+
+The telemetry feature allows Git to generate structured telemetry data
+for executed commands. Data includes command line arguments, execution
+times, error codes and messages, and information about child processes.
+
+Structured data is produced in a JSON-like format. (See the UTF-8 related
+"limitations" described in json-writer.h.)
+
+Telemetry data can be written to a local file or sent to a dynamically
+loaded shared library via a plugin API.
+
+The telemetry feature is similar to the existing trace API (defined in
+Documentation/technical/api-trace.txt). Telemetry events are generated
+throughout the life of a Git command just like trace messages. But
+whereas trace messages are essentially developer debug messages, telemetry
+events are intended for logging and automated analysis.
+
+The goal of the telemetry feature is to be able to gather usage data across
+a group of production users to identify real-world performance problems in
+production. Additionally, it might help identify common user errors and
+guide future user training.
+
+By default, telemetry is disabled. Telemetry is controlled using config
+settings (see "telemetry.*" in Documentation/config.txt).
+
+
+Telemetry Events
+================
+
+Telemetry data is generated as a series of events. Each event is written
+as a self-describing JSON object.
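As an illustration only (not part of this patch), a consumer of such a stream might split it into events roughly as follows. This Python sketch assumes the events have already been read into a string and that they are strict JSON, which the "JSON-like" caveat above does not guarantee. `iter_events` is a hypothetical helper name.

```python
import json


def iter_events(text):
    """Yield each top-level JSON object from a string of concatenated objects."""
    decoder = json.JSONDecoder()
    pos = 0
    end = len(text)
    while True:
        # raw_decode() does not skip leading whitespace, so do it here.
        while pos < end and text[pos].isspace():
            pos += 1
        if pos >= end:
            return
        event, pos = decoder.raw_decode(text, pos)
        yield event


# Two abbreviated, made-up events; one compact and one spread over lines.
sample = '''
{ "event_name": "cmd_start", "pid": 25460 }
{
  "event_name": "cmd_exit",
  "pid": 25460,
  "exit_code": 0
}
'''

names = [event["event_name"] for event in iter_events(sample)]
print(names)
```

Decoding object by object, rather than line by line, means the same reader should cope with events written on one line or across several.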
+
+Events: cmd_start and cmd_exit
+------------------------------
+
+The `cmd_start` event is emitted at the very beginning of the git.exe process
+in cmd_main() and the `cmd_exit` event is emitted at the end of the process in
+the atexit cleanup routine.
+
+For example, running "git version" produces:
+
+{
+  "event_name": "cmd_start",
+  "argv": [
+    "C:\\work\\gfw\\git.exe",
+    "version"
+  ],
+  "clock": 1525978509976086000,
+  "pid": 25460,
+  "git_version": "2.17.0.windows.1",
+  "telemetry_version": "1",
+  "session_id": "1525978509976086000-25460"
+}
+{
+  "event_name": "cmd_exit",
+  "argv": [
+    "C:\\work\\gfw\\git.exe",
+    "version"
+  ],
+  "clock": 1525978509980903391,
+  "pid": 25460,
+  "git_version": "2.17.0.windows.1",
+  "telemetry_version": "1",
+  "session_id": "1525978509976086000-25460",
+  "is_interactive": false,
+  "exit_code": 0,
+  "elapsed_time_core": 0.004814,
+  "elapsed_time_total": 0.004817,
+  "builtin": {
+    "name": "version"
+  }
+}
+
+Fields common to all events:
+ * `event_name` is the name of the event.
+ * `argv` is the array of command line arguments.
+ * `clock` is the time of the event in nanoseconds since the epoch.
+ * `pid` is the process id.
+ * `git_version` is the git version string.
+ * `telemetry_version` is the version of the telemetry format.
+ * `session_id` is described in a later section.
+
+Additional fields in cmd_exit:
+ * `is_interactive` is true if git.exe spawned an interactive child process,
+   such as a pager, editor, prompt, or gui tool.
+ * `exit_code` is the value passed to exit() from main().
+ * `error_message` (not shown) is the array of error messages.
+ * `elapsed_time_core` measures the time in seconds until exit() was called.
+ * `elapsed_time_total` measures the time until the atexit() routine starts
+   (which will include time spent in other atexit() routines cleaning up
+   child processes, etc.).
+ * `alias` (not shown) is the updated argv after alias expansion.
+ * `builtin.name` is the canonical command name (from the cmd_struct[]
+   table) of a builtin command.
+ * `builtin.mode` (not shown) is reported for some commands that have different
+   major modes and performance times. For example, checkout can switch
+   branches or repair a single file.
+ * `child_summary` (not shown) is described in a later section.
+ * `timers` (not shown) is described in a later section.
+ * `aux` (not shown) is described in a later section.
+
+
+Events: child_start and child_exit
+----------------------------------
+
+The `child_start` event is emitted just before a child process is started.
+It includes a unique child number and the child's command line arguments.
+
+The `child_exit` event is emitted after a child process exits and has
+been reaped. This event extends the start event with the child's process
+id, exit status, and elapsed time.
+
+For example, during a "git fetch origin", git.exe runs gc in the background
+and these events are emitted by the fetch process before and after the
+child gc process:
+
+{
+  "event_name": "child_start",
+  "argv": [
+    "C:\\work\\gfw\\git.exe",
+    "fetch",
+    "origin"
+  ],
+  "clock": 1525979478738132887,
+  "pid": 18332,
+  "git_version": "2.17.0.windows.1",
+  "telemetry_version": "1",
+  "session_id": "1525979470792747000-18332",
+  "child_detail": {
+    "number": 3,
+    "class": "gc",
+    "argv": [
+      "gc",
+      "--auto"
+    ]
+  }
+}
+{
+  "event_name": "child_exit",
+  "argv": [
+    "C:\\work\\gfw\\git.exe",
+    "fetch",
+    "origin"
+  ],
+  "clock": 1525979479024707085,
+  "pid": 18332,
+  "git_version": "2.17.0.windows.1",
+  "telemetry_version": "1",
+  "session_id": "1525979470792747000-18332",
+  "child_detail": {
+    "number": 3,
+    "class": "gc",
+    "argv": [
+      "gc",
+      "--auto"
+    ],
+    "pid": 19608,
+    "exit_code": 0,
+    "elapsed_time": 0.286574
+  }
+}
+
+The common fields (`event_name` through `session_id`) are the same as
+in the `cmd_start` and `cmd_exit` events and refer to the parent process.
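As a rough sketch of how an analysis tool might use these two events (this is not part of the patch), the Python below pairs each `child_exit` with its `child_start` and derives the wall-clock gap from the `clock` values. Keying on `session_id` plus `child_detail.number` is an assumption made for illustration, and `pair_child_events` is a hypothetical helper name. The sample events are abbreviated copies of the two shown above.

```python
def pair_child_events(events):
    """Match child_start and child_exit events; return (key, seconds) pairs."""
    starts = {}
    pairs = []
    for event in events:
        name = event.get("event_name")
        if name not in ("child_start", "child_exit"):
            continue
        # Assumed key: the parent's session id plus the child counter.
        key = (event["session_id"], event["child_detail"]["number"])
        if name == "child_start":
            starts[key] = event
        elif key in starts:
            start = starts.pop(key)
            # clock is in nanoseconds since the epoch.
            seconds = (event["clock"] - start["clock"]) / 1e9
            pairs.append((key, seconds))
    return pairs


events = [
    {
        "event_name": "child_start",
        "clock": 1525979478738132887,
        "session_id": "1525979470792747000-18332",
        "child_detail": {"number": 3, "class": "gc"},
    },
    {
        "event_name": "child_exit",
        "clock": 1525979479024707085,
        "session_id": "1525979470792747000-18332",
        "child_detail": {"number": 3, "class": "gc", "elapsed_time": 0.286574},
    },
]

pairs = pair_child_events(events)
```

For the sample pair, the gap between the two `clock` values is 286574198 ns, which agrees with the reported `elapsed_time` of 0.286574 seconds.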
+
+The `child_detail` structure describes the child process:
+ * `number` is a simple counter incremented for each child process.
+ * `class` is a rough characterization of the type of child process. Child
+   class is described in a later section.
+ * `argv` is the child's command line.
+ * `pid` is the child's process id.
+ * `exit_code` is the exit code of the child process.
+ * `elapsed_time` measures the time in seconds observed by the parent process
+   between the child_start and child_exit events. This will be greater
+   than the elapsed time that the child internally observes because of
+   process startup and shutdown overhead. For synchronous child processes,
+   this is the time that the parent spent waiting for the child.
+
+
+Event: perf
+-----------
+
+Perf events are a debugging aid to report on suspected hot spots in the
+code and collect data from production users. This is intended to be a
+generic message with context-specific data. New messages may be added
+in the future as the need arises to help with debugging.
+
+Perf events are organized by category, much like the various GIT_TRACE_*
+environment variables. The "telemetry.perf" config setting can be set to
+true or to a string of the perf categories that should be enabled.
+
+Currently, the categories "index" and "status" are defined. Others may
+be added later.
+
+For example, a perf event could be used to instrument read_index_from():
+
+{
+  "event_name": "perf",
+  "argv": [
+    "C:\\work\\gfw\\git.exe",
+    "fetch",
+    "origin"
+  ],
+  "clock": 1525979478735438090,
+  "pid": 18332,
+  "git_version": "2.17.0.windows.1",
+  "telemetry_version": "1",
+  "session_id": "1525979470792747000-18332",
+  "category": "index",
+  "label": "read_index_from",
+  "elapsed_time": 0.001536,
+  "data": {
+    "path": ".git/index",
+    "cache_nr": 3311
+  }
+}
+
+The common fields (`event_name` through `session_id`) are the same as
+in the `cmd_start` and `cmd_exit` events.
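One way an analysis tool might roll up perf events (again a sketch, not part of this patch) is to total `elapsed_time` per `category` and `label`. `summarize_perf` is a hypothetical helper name. The first sample event is abbreviated from the example above. The second is made up for illustration.

```python
from collections import defaultdict


def summarize_perf(events):
    """Return {(category, label): {"count": n, "elapsed_time": seconds}}."""
    totals = defaultdict(lambda: {"count": 0, "elapsed_time": 0.0})
    for event in events:
        if event.get("event_name") != "perf":
            continue
        entry = totals[(event["category"], event["label"])]
        entry["count"] += 1
        entry["elapsed_time"] += event["elapsed_time"]
    return dict(totals)


events = [
    # Abbreviated from the example above.
    {"event_name": "perf", "category": "index",
     "label": "read_index_from", "elapsed_time": 0.001536},
    # Invented second sample, for illustration only.
    {"event_name": "perf", "category": "index",
     "label": "read_index_from", "elapsed_time": 0.002},
    # Non-perf events are ignored.
    {"event_name": "cmd_exit", "exit_code": 0},
]

summary = summarize_perf(events)
```

Grouping on both fields keeps labels from different categories apart, mirroring how the categories themselves are enabled separately.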
+
+All `perf` events also have:
+ * `category` is a descriptive category, used like the various GIT_TRACE_*
+   variables.
+ * `label` is the name of a function or region of interest.
+ * `elapsed_time` measures the time in seconds spent in the function or
+   region.
+ * `data` is an optional structure of context-specific (debug) data.
+
+
+More Details for Event Fields
+=============================
+
+Field: session_id
+-----------------
+
+A session_id (SID) is a cheap, unique-enough string to associate all of
+the events generated by a single process. A SID incorporates the SID
+inherited from the parent process.
+
+SIDs should be considered opaque data, but are constructed as:
+
+    [<parent_sid>/]<start_time>-<pid>
+
+This scheme is used rather than a simple PID or {PPID, PID} because PIDs
+are recycled by the OS (after sufficient time). Also, if telemetry data
+is aggregated from multiple systems, PIDs are not sufficient.
+
+This also has the advantage of allowing telemetry analysis to associate
+Git child processes with their Git parent process even if there are
+intermediate shell processes.
+
+Note: we could use UUIDs or GUIDs for this, but that seemed overkill at
+this point. It also required platform-specific code to generate, which
+muddied up the code.
+
+
+Field: child_detail.class
+-------------------------
+
+enum telemetry_class contains a set of classification values. These attempt
+to roughly classify a child process from the point of view of the parent
+process.
+ * unclass: unclassified
+ * unclass-async: unclassified asynchronous child (see sub-process.c)
+ * alias: an alias expansion using a child process
+ * hook: a hook process that may do anything (including prompting, scanning,
+   and network operations) and wildly affect command run times.
+ * pager: a pager (indicating an interactive command)
+ * editor: an editor (indicating an interactive command)
+ * prompt: a prompt or credential or askpass process (also interactive)
+ * network: a command that might do network operations
+ * convert: an attribute filter process such as LFS or CRLF
+ * tool: a tool, such as difftool or mergetool, that may be interactive
+ * gc: an auto gc process
+
+struct child_process has been extended to have a telemetry_class field. Some
+callers of start_command() and/or run_command() have been updated to suggest
+a classification when appropriate. For example, child processes created by
+launch_editor() are marked with TELEMETRY_CLASS__EDITOR.
+
+The primary intent is to identify which child processes are likely to block
+on the user or network. For example, "git commit" and "git commit -m <msg>"
+will have different performance characteristics because the former has to
+launch an editor and wait for the user to compose a message. The former will
+have a child event with child_detail.class=editor and its exit event will
+have child_summary.editor.count=1 and child_summary.editor.elapsed_time=<t>.
+Analysis tools can choose to report average commit time for non-interactive
+commands or subtract the editor elapsed time from the commit elapsed time.
+
+For example, fetch runs rev-list, ssh, index-pack, and maybe (auto) gc. The
+ssh child is marked as TELEMETRY_CLASS__NETWORK and the gc child is marked
+as TELEMETRY_CLASS__GC (since it is optional and possibly time consuming).
+The others are left unclassified (TELEMETRY_CLASS__UNCLASS) since we don't
+expect blocking operations.
+
+
+Field: child_summary
+--------------------
+
+The `child_summary` structure within the `cmd_exit` event summarizes the
+child processes created by the parent process.
+
+For example, "git fetch origin" spawns 4 child processes:
+
+{
+  "event_name": "cmd_exit",
+  "argv": [
+    "C:\\work\\gfw\\git.exe",
+    "fetch",
+    "origin"
+  ],
+  ...
+  "child_summary": {
+    "unclass": {
+      "count": 2,
+      "elapsed_time": 0.496387
+    },
+    "network": {
+      "count": 1,
+      "elapsed_time": 7.712466
+    },
+    "gc": {
+      "count": 1,
+      "elapsed_time": 0.286574
+    }
+  },
+  "exit_code": 0,
+  "elapsed_time_core": 8.232965,
+  "elapsed_time_total": 8.232968,
+  "builtin": {
+    "name": "fetch"
+  }
+}
+
+Within each `child_summary[<class>]` is a count of the number of child
+processes and the cumulative elapsed time.
+
+Analysis tools interested in a net-elapsed-time of the parent process may
+want to subtract the elapsed time of the child processes. This approach is
+mostly valid, since most child processes are run synchronously. However,
+some processes are run asynchronously, such as the pager and processes in
+the unclass-async pool, so care should be taken.
+
+
+Field: timers
+-------------
+
+A "telemetry timer" is a stopwatch-like timer with a counter. It can be
+used to time a specific region of code, such as an expensive computation
+within the body of a larger loop. It defines a generic way to collect
+perf data without causing a telemetry perf event to be fired on each
+iteration. Instead, a timer is registered with the telemetry layer and
+the data will be included in a "timers" sub-section in the `cmd_exit` event.
+
+For example, timers were added to do_read_index() and do_write_index()
+to measure the time spent reading and writing the index.
+
+{
+  "event_name": "cmd_exit",
+  "argv": [
+    "C:\\work\\gfw\\git.exe",
+    "status"
+  ],
+  ...
+  "timers": {
+    "do_read_index": {
+      "count": 1,
+      "total": 0.000740,
+      "min": 0.000740,
+      "max": 0.000740,
+      "avg": 0.000740
+    },
+    "do_write_index": {
+      "count": 1,
+      "total": 0.004724,
+      "min": 0.004724,
+      "max": 0.004724,
+      "avg": 0.004724
+    }
+  },
+  "exit_code": 0,
+  "elapsed_time_core": 0.049865,
+  "elapsed_time_total": 0.049867,
+  "builtin": {
+    "name": "status"
+  }
+}
+
+The `timers` structure contains a named member for each defined timer.
+Within each individual timer, we have:
+ * `count` is the number of times it was started/stopped.
+ * `total` is the total time the timer was running.
+ * `min` is the shortest interval.
+ * `max` is the longest interval.
+ * `avg` is the average interval.
+
+
+Field: aux
+----------
+
+The `aux` structure within the `cmd_exit` event contains additional
+information about the process. This is intended as a generic container for
+various fields, such as important config settings or repo data shape that
+may affect performance or help identify the repository for aggregation
+purposes.
+
+{
+  "event_name": "cmd_exit",
+  ...
+  "aux": {
+    "remote_origin_url": "git@xxxxxxxxxx:git/git.git",
+    "index_count": 3311,
+    "sparse_checkout_count": 3
+  },
+  ...
+}
+
+Other fields (and even sub-structures) can be added to this container
+as needed.
+
+
+Telemetry Destination
+=====================
+
+Telemetry events are sent to a "destination". This can be a file or a
+plugin. Telemetry is disabled if a destination is not set.
+
+telemetry.path
+--------------
+
+If the config setting "telemetry.path" contains a pathname, telemetry
+events will be appended to that file using the builtin destination
+handler. (File rotation is beyond the scope of this design.)
+
+Events are written as a series of JSON records. When "telemetry.pretty"
+is false, each event record will be written on one line.
+
+(All of the examples in this document were prepared with "telemetry.pretty"
+set to true.)
+
+telemetry.plugin
+----------------
+
+If the config setting "telemetry.plugin" contains the pathname to a shared
+library, the library will be dynamically loaded during startup and events
+will be sent to it using the plugin API.
+
+This plugin model allows an organization to define a custom or private
+telemetry solution while using a stock version of Git.
+
+For example, on Windows, it allows telemetry events to go directly to the
+kernel via the plugin using the high performance Event Tracing for Windows
+(ETW) facility.
+
+The contrib/telemetry-plugin-examples directory contains two example
+plugins:
+ * A trivial log to stderr
+ * A trivial ETW writer
+
+
+GDPR and Privacy
+================
+
+The telemetry feature can log possibly sensitive user information (such as
+command line arguments, which may contain URLs, user names, and file names).
+
+The base telemetry feature can write telemetry data to a file on the system.
+
+The plugin facility can be used to publish the telemetry data to more general
+destinations (such as ETW or the network).
+
+In both cases, it is up to the user or system administrator to decide what
+is appropriate and sanitize the data accordingly before broadcasting it.
-- 
2.9.3