Re: [PATCH v2] tr2: log parent process name

Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> · Tue, 25 May 2021 01:33:27 +0200

On Fri, May 21 2021, Emily Shaffer wrote:

> On Fri, May 21, 2021 at 03:15:16PM -0400, Jeff Hostetler wrote:
>> On 5/20/21 5:05 PM, Emily Shaffer wrote:
>> > - I took a look at Jeff H's advice on using a "data_json" event to log
>> >    this and decided it would be a little more flexible to add a new event
>> >    instead. If we want, it'd be feasible to then shoehorn the GfW parent
>> >    tree stuff into this new event too. Doing it this way is definitely
>> >    easier to parse for Google's trace analysis system (which for now
>> >    completely skips "data_json" as it's polymorphic), and also - I think
>> >    - means that we can add more fields later on if we need to (thread
>> >    info, different fields than just /proc/n/comm like exec path, argv,
>> >    whatever).
>> 
>> I could argue both sides of this, so I guess it is fine either way.
>> 
>> In GFW I log a array of argv[0] strings in a generic "data_json" event.
>> I could also log additional "data_json" events with more structured
>> data if needed.
>> 
>> On the other hand, you're proposing a "cmd_ancestry" event with a
>> single array of strings.  You would have to expand the call signature
>> of the trace2_cmd_ancestry() API to add additional data and inside
>> tr2_tgt_event.c add additional fields to the JSON being composed.
>> 
>> So both are about equal.
>> 
>> (I'll avoid the temptation to make a snarky comment about fixing
>> your post processing. :-) :-) :-) )
>
> ;P
>
> (I don't have much to add - this is an accurate summary of what I
> thought about, too. Thanks for writing it out.)
>
>> 
>> It really doesn't matter one way or the other.
>> 
>> > - Jonathan N also pointed out to me that /proc/n/comm exists, and logs
>> >    the "command name" - excluding argv, excluding path, etc. It seems
>> 
>> So you're trying to log argv[0] of the process and not the full
>> command line.  That's what I'm doing.
>
> It's close to argv[0], yeah. POSIX docs indicate it might be truncated
> in a way that argv[0] hasn't been, but it also doesn't include the
> leading path (as far as I've seen). For example, a long-running helper
> script I use with mutt, right now (gaffing on line length in email to
> help with argv clarity, sorry):
>
>   $ ps aux | grep mutt
>   emilysh+ 4119883 0.0 0.0 6892 3600 pts/6 S+ 12:44 0:00 /bin/bash
> /usr/local/google/home/emilyshaffer/dotfiles/open-vim-in-new-split.sh
> /var/tmp/mutt-podkayne-413244-1263002-7433772284891386689
>   # comm is truncated to 15ch, except apparently in the cases of some
>   # kernel worker processes I saw with much longer names?
>   $ cat /proc/4119883/comm
>   open-vim-in-new
>   # exe is a link to the executable, which means bash as this is a
>   # script
>   $ ls -lha /proc/4119883/exe
>   lrwxrwxrwx 1 emilyshaffer primarygroup 0 May 21 12:44
>   /proc/4119883/exe -> /usr/bin/bash
>   # cmdline has the whole argv, separated on NUL so it runs together in
>   # editor
>   $ cat /proc/4119883/cmdline
>   /bin/bash/usr/local/google/home/emilyshaffer/dotfiles/open-vim-in-new-split.sh/var/tmp/mutt-podkayne-413244-1263002-7433772284891386689
>
> Jonathan N pointed out that the process name (the thing in 'comm') can
> also be manually manipulated by the process itself, and 'man procfs'
> also talks about 'PR_SET_NAME' and 'PR_GET_NAME' operations in
> 'prctl()', so that tracks. (It doesn't look like we can use prctl() to
> find out the names of processes besides the current process, though, so
> the procfs stuff is still needed. Dang.)
>
>> 
>> >    like this is a little more safe about excluding personal information
>> >    from the traces which take the form of "myscript.sh
>> >    --password=hunter2", but would still be worrisome for something like
>> >    "mysupersecretproject.sh". I'm not sure whether that means we still
>> >    want to guard it with a config flag, though.
>> 
>> You might check whether you get the name of the script or just get
>> a lot of entries with just "/usr/bin/bash".
>
> See above :)
>
>> There's lots of PII in the data stream to worry about.
>> The name of the command is just one aspect, but I digress.
>
> Yes, that's what we've noticed too, so a process name isn't worrying us
> that much more.
>
>> 
>> > - I also added a lot to the commit message; hopefully it's not too
>> >    rambly, but I hoped to explain why just setting GIT_TRACE2_PARENT_SID
>> >    wasn't going to cut it.
>> > - As for testing, I followed the lead of GfW's parentage info - "this
>> >    isn't portable so writing tests for it will suck, just scrub it from
>> >    the tests". Maybe it makes sense to do some more
>> >    platform-specific-ness in the test suite instead? I wasn't sure.
>> 
>> yeah, that's probably best.  Unless you can tokenize it properly
>> so that you can predict the results in a HEREDOC in the test source.
>> 
>> For example, you might try to test tracing a command (where a top-level
>> "git foo" (SPACE form) spawns a "git-foo" (DASHED form) and check the
>> output for the child.
>
> Yeah, I had trouble with even deciding when to attempt such a check or
> not.
>> > +	if (reason == TRACE2_PROCESS_INFO_STARTUP)
>> > +	{
>> > +		/*
>> > +		 * NEEDSWORK: we could do the entire ptree in an array instead,
>> > +		 * see compat/win32/trace2_win32_process_info.c.
>> > +		 */
>> > +		char *names[2];
>> > +		names[0] = get_process_name(getppid());
>> > +		names[1] = NULL;
>> 
>> You're only logging 1 parent.  That's fine to get started.
>> 
>> I'm logging IIRC 10 parents on GFW.  That might seem overkill,
>> but there are lots of intermediate parents that hide what is
>> happening.  For example, a "git push" might spawn "git remote-https"
>> which spawns "git-remote-https" which spawn "git send-pack" which
>> spawns "git pack-objects".
>> 
>> And that doesn't include who called push.
>> 
>> And it's not uncommon to see 2 or 3 "bash" entries in the array
>> because of the bash scripts being run.
>
> Agree. But it's expensive - I didn't find a handy library call to find
> "parent ID of given process ID", so I think we'd have to manipulate
> procfs; and so far I only see parent ID in summary infos like
> /proc/n/status or /proc/n/stat, which contain lots of other info too and
> would need parsing.

It sounds a bit like you're fumbling your way towards (re?)discovering:

    pstree -s <pid>

You can look at its implementation (or strace it) to see what it does,
and yes, on Linux there's no handy C library for this, iterating over
procfs is the library.

Aside from the privacy, PII, usefulness of this data etc. discussions in
this & related threads I don't think that per-se should be an issue on a
modern Linux system. After all we'd just need to do it once on
startup. For any sub-process we spawn we'd carry it forward after the
initial /usr/bin/git invocation.