Re: [RFC PATCH] grep: provide sane default to grep_source struct

Jonathan Nieder <jrnieder@xxxxxxxxx> · Wed, 15 May 2019 20:11:52 -0700

Hi,

Emily Shaffer wrote:

> grep_buffer creates a struct grep_source gs and calls grep_source()
> with it. However, gs.name is null, which eventually produces a
> segmentation fault in
> grep_source()->grep_source_1()->show_line() when grep_opt.status_only is
> not set.

Thanks for catching it.  Taking a step back, I think the problem is in
the definition of "struct grep_source":

	struct grep_source {
		char *name;

		enum grep_source_type {
			GREP_SOURCE_OID,
			GREP_SOURCE_FILE,
			GREP_SOURCE_BUF,
		} type;
		void *identifier;

		...
	};

What is the difference between a 'name' and an 'identifier'?  Who is
responsible for free()ing them?  Can they be NULL?  This is pretty
underdocumented for a public type.

If we take the point of view that 'name' should always be non-NULL,
then I wonder:

- can we document that?
- can grep_source_init enforce that?
- can we take advantage of that in grep_source as well, as a sanity
  check that the grep_source has been initialized?
- while we're here, can we describe what the field is used for
  (prefixing output with context before a ":", I believe)?

[...]
> This seems to be unreachable from existing commands but is reachable in
> the event that someone rolls their own revision walk and neglects to set
> rev_info->grep_filter->status_only.

init_revisions sets that field, so a caller would have to replace
grep_filter or unset status_only to trigger this.

>                                     Conceivably, someone might want to
> print all changed lines of all commits which match some regex.

Hm, does that work well?  The caller of grep_buffer in revision.c
doesn't seem to be careful about where in the output hits would be
printed, but I might be missing something.

[...]
> I ran into this issue while I was working on a tutorial on rolling your
> own revision walk. I didn't want to print all the lines associated with
> a matching change, but it didn't seem good that a seemingly-sane
> grep_filter config was segfaulting.

Good find, and thanks for taking the time to make it easier to debug for
the next person.

[...]
> Jonathan Nieder proposed alternatively adding some check to grep_source()
> to ensure that if opt->status_only is unset, gs->name must be non-NULL
> (and yell about it if not), as well as some extra comments indicating
> what assumptions are made about the data coming into functions like
> grep_source(). I'm fine with that as well (although I'm not sure it
> makes sense semantically to require a name which the user probably can't
> easily set, or else ban the user from printing LOC during grep). Mostly
> I'm happy with any solution besides a segfault with no error logging :)

Let's compare the two possibilities.

The advantage of "(in memory)" is that it Just Works, which should
make a nicer development experience with getting a new caller mostly
working on the way to getting them working just the way you want.

The disadvantage is that if we start outputting that in production, we
and static analyzers are less likely to notice.  In other words,
printing "(in memory)" is leaking details to the end user that do not
match what the end user asked for.  NULL would instead produce a
crash, prompting the author of the caller to fix it.

What was particularly pernicious about this example is that the NULL
dereference only occurs if the grep has a match.  So I suppose I'm
leaning toward (in addition to adding some comments to document the
struct) adding a check like

	if (!gs->name && !opt->status_only)
		BUG("grep calls that could print name require name");

to grep_source.

That would also sidestep the question of whether this debugging aid
should be translated. :)

Sensible?

Thanks,
Jonathan