On Fri, May 21 2021, Emily Shaffer wrote: > On Fri, May 21, 2021 at 03:15:16PM -0400, Jeff Hostetler wrote: >> On 5/20/21 5:05 PM, Emily Shaffer wrote: >> > - I took a look at Jeff H's advice on using a "data_json" event to log >> > this and decided it would be a little more flexible to add a new event >> > instead. If we want, it'd be feasible to then shoehorn the GfW parent >> > tree stuff into this new event too. Doing it this way is definitely >> > easier to parse for Google's trace analysis system (which for now >> > completely skips "data_json" as it's polymorphic), and also - I think >> > - means that we can add more fields later on if we need to (thread >> > info, different fields than just /proc/n/comm like exec path, argv, >> > whatever). >> >> I could argue both sides of this, so I guess it is fine either way. >> >> In GFW I log a array of argv[0] strings in a generic "data_json" event. >> I could also log additional "data_json" events with more structured >> data if needed. >> >> On the other hand, you're proposing a "cmd_ancestry" event with a >> single array of strings. You would have to expand the call signature >> of the trace2_cmd_ancestry() API to add additional data and inside >> tr2_tgt_event.c add additional fields to the JSON being composed. >> >> So both are about equal. >> >> (I'll avoid the temptation to make a snarky comment about fixing >> your post processing. :-) :-) :-) ) > > ;P > > (I don't have much to add - this is an accurate summary of what I > thought about, too. Thanks for writing it out.) > >> >> It really doesn't matter one way or the other. >> >> > - Jonathan N also pointed out to me that /proc/n/comm exists, and logs >> > the "command name" - excluding argv, excluding path, etc. It seems >> >> So you're trying to log argv[0] of the process and not the full >> command line. That's what I'm doing. > > It's close to argv[0], yeah. POSIX docs indicate it might be truncated > in a way that argv[0] hasn't been, but it also doesn't include the > leading path (as far as I've seen). For example, a long-running helper > script I use with mutt, right now (gaffing on line length in email to > help with argv clarity, sorry): > > $ ps aux | grep mutt > emilysh+ 4119883 0.0 0.0 6892 3600 pts/6 S+ 12:44 0:00 /bin/bash > /usr/local/google/home/emilyshaffer/dotfiles/open-vim-in-new-split.sh > /var/tmp/mutt-podkayne-413244-1263002-7433772284891386689 > # comm is truncated to 15ch, except apparently in the cases of some > # kernel worker processes I saw with much longer names? > $ cat /proc/4119883/comm > open-vim-in-new > # exe is a link to the executable, which means bash as this is a > # script > $ ls -lha /proc/4119883/exe > lrwxrwxrwx 1 emilyshaffer primarygroup 0 May 21 12:44 > /proc/4119883/exe -> /usr/bin/bash > # cmdline has the whole argv, separated on NUL so it runs together in > # editor > $ cat /proc/4119883/cmdline > /bin/bash/usr/local/google/home/emilyshaffer/dotfiles/open-vim-in-new-split.sh/var/tmp/mutt-podkayne-413244-1263002-7433772284891386689 > > Jonathan N pointed out that the process name (the thing in 'comm') can > also be manually manipulated by the process itself, and 'man procfs' > also talks about 'PR_SET_NAME' and 'PR_GET_NAME' operations in > 'prctl()', so that tracks. (It doesn't look like we can use prctl() to > find out the names of processes besides the current process, though, so > the procfs stuff is still needed. Dang.) > >> >> > like this is a little more safe about excluding personal information >> > from the traces which take the form of "myscript.sh >> > --password=hunter2", but would still be worrisome for something like >> > "mysupersecretproject.sh". I'm not sure whether that means we still >> > want to guard it with a config flag, though. >> >> You might check whether you get the name of the script or just get >> a lot of entries with just "/usr/bin/bash". > > See above :) > >> There's lots of PII in the data stream to worry about. >> The name of the command is just one aspect, but I digress. > > Yes, that's what we've noticed too, so a process name isn't worrying us > that much more. > >> >> > - I also added a lot to the commit message; hopefully it's not too >> > rambly, but I hoped to explain why just setting GIT_TRACE2_PARENT_SID >> > wasn't going to cut it. >> > - As for testing, I followed the lead of GfW's parentage info - "this >> > isn't portable so writing tests for it will suck, just scrub it from >> > the tests". Maybe it makes sense to do some more >> > platform-specific-ness in the test suite instead? I wasn't sure. >> >> yeah, that's probably best. Unless you can tokenize it properly >> so that you can predict the results in a HEREDOC in the test source. >> >> For example, you might try to test tracing a command (where a top-level >> "git foo" (SPACE form) spawns a "git-foo" (DASHED form) and check the >> output for the child. > > Yeah, I had trouble with even deciding when to attempt such a check or > not. >> > + if (reason == TRACE2_PROCESS_INFO_STARTUP) >> > + { >> > + /* >> > + * NEEDSWORK: we could do the entire ptree in an array instead, >> > + * see compat/win32/trace2_win32_process_info.c. >> > + */ >> > + char *names[2]; >> > + names[0] = get_process_name(getppid()); >> > + names[1] = NULL; >> >> You're only logging 1 parent. That's fine to get started. >> >> I'm logging IIRC 10 parents on GFW. That might seem overkill, >> but there are lots of intermediate parents that hide what is >> happening. For example, a "git push" might spawn "git remote-https" >> which spawns "git-remote-https" which spawn "git send-pack" which >> spawns "git pack-objects". >> >> And that doesn't include who called push. >> >> And it's not uncommon to see 2 or 3 "bash" entries in the array >> because of the bash scripts being run. > > Agree. But it's expensive - I didn't find a handy library call to find > "parent ID of given process ID", so I think we'd have to manipulate > procfs; and so far I only see parent ID in summary infos like > /proc/n/status or /proc/n/stat, which contain lots of other info too and > would need parsing. It sounds a bit like you're fumbling your way towards (re?)discovering: pstree -s <pid> You can look at its implementation (or strace it) to see what it does, and yes, on Linux there's no handy C library for this, iterating over procfs is the library. Aside from the privacy, PII, usefulness of this data etc. discussions in this & related threads I don't think that per-se should be an issue on a modern Linux system. After all we'd just need to do it once on startup. For any sub-process we spawn we'd carry it forward after the initial /usr/bin/git invocation.