Re: [PATCH] grep: Add option --max-line-len

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



[second try, now with text format]

Thanks a lot for the reviews. Replying to both.

If I send a follow up, I'll fix the commit description and the help
string, remove the shorthand -M, write a more sensible test.


2017-11-23 14:24 GMT-05:00 Eric Sunshine <sunshine@xxxxxxxxxxxxxx>:
>> diff --git a/Documentation/git-grep.txt b/Documentation/git-grep.txt
>> @@ -127,6 +128,10 @@ OPTIONS
>> +-M<num>::
>> +--max-line-len<num>::
>> +       Match the pattern only for line shorter or equal to this length.
>
> This documentation doesn't explain what it means by a line's length.
> This implementation seems to take into consideration only the line's
> byte count, however, given that displayed lines are normally
> tab-expanded, people might intuitively expect that expansion to count
> toward the length. A similar question arises for Unicode characters.
>
> Should this option take tab-expansion and Unicode into account?
> Regardless of the answer to that question, the documentation should
> make it clear what "line length" means.

Excellent question. I don't have an immediate answer.


>> diff --git a/grep.c b/grep.c
>> @@ -1273,6 +1275,8 @@ static int match_line(struct grep_opt *opt, char *bol, char *eol,
>> +       if (opt->max_line_length > 0 && eol-bol > opt->max_line_length)
>> +               return 0;
>
> If the user specifies "-M0", should that error out or at least warn
> the user that the value is non-sensical? What about -1, etc.? (These
> are UX-related questions; the implementation obviously doesn't care
> one way or the other.)

Precedent with -A is to ignore the negative value. I don't have a
strong opinion.


2017-11-23 20:44 GMT-05:00 Junio C Hamano <gitster@xxxxxxxxx>:
>
> Marc-Antoine Ruel <maruel@xxxxxxxxxxxx> writes:
>
> > This tells git grep to skip files longer than a specified length,
> > which is often the result of generators and not actual source files.
> >
> > ...
> > +-M<num>::
> > +--max-line-len<num>::
> > +     Match the pattern only for line shorter or equal to this length.
> > +
>
> All the excellent review comments from Eric I agree with.
>
> With the name of the option and the above end-user facing
> description, it is very clear that the only thing this feature does
> is to declare that an overlong line does _not_ match when trying to
> check against any pattern.
>
> That is a much clearer definition and description than random new
> features people propose here (and kicked back by reviewers, telling
> them to make the specification clearer), and I'd commend you for that.
>
> But it still leaves at least one thing unclear.  How should it
> interact with "-v"?  If we consider an overlong line never matches,
> would "git grep -v <pattern>" should include the line in its output?

Ah! No idea. :/

> Speaking of the output, it also makes me wonder if the feature
> really wants to include an overlong line as a context line when
> showing a near-by line that matches the pattern when -A/-B/-C/-W
> options are in use. Even though it is clear that it does from the
> above description, is it really the best thing the feature can do to
> help the end users?
>
> Which leads me to suspect that this "feature" might not be the ideal
> you wanted to achive, but is an approximate substitution that you
> found is "good enough" to simulate what the real thing you wanted to
> do, especially when I go back and read the justfication in the
> proposed log message that talks about "result of generators".
>
> Isn't it a property of the entire file, not individual lines, if you
> find it uninteresting to see reported by "git grep"?  I cannot shake
> the suspicion that this feature happened to have ended up in this
> shape, instead of "ignore a file with a line this long", only
> because your starting point was to use "has overlong lines" as the
> heuristic for "not interesting", and because "git grep" code is not
> structured to first scan the entire file to decide if it is worth
> working on it, and it is extra work to restructure the codeflow to
> make it so (which you avoided).
>
> If your real motivation was either
>
>  (1) whether the file has or does not have the pattern for certain
>      class of files are uninteresting; do not even run "grep"
>      processing for them; or
>
>  (2) hits or no-hits may be intereseting but output of overlong
>      lines from certain class of files I do not wish to see;
>
> then I can think of two alternatives.
>
> For (1), can't we tell "result of generators" and other files with
> pathspec?  If so, perhaps a negative pathspec can rescue.  e.g.
>
>     git grep <pattern> -- '*.cc' ':!*-autogen.cc'
>
> For (2), can't we model this after how users can tell "git diff"
> that certain paths are not worth computing and showing textual
> patches for, which is to Unset the 'diff' attribute?  When you have
>
>     *-autogen.cc        -diff
>
> in your .gitattributes, "git diff" would say "Binary files A and B
> differ" instead of explaining line-by-line differences in the patch
> form.  Perhaps we can also have a 'grep' attribute and squelch the
> output if it is Unset?
>
> It is debatable but one could propose extending the use of existing
> 'diff' attribute to cover 'grep' too, with an argument that anything
> not worth showing patch (i.e. 'diff' attribute is Unset) is not
> worth showing grep hits from.

Thanks for the thoughtful analysis. My main motivation was (1), thus
filtering with a pathspec is much better than trying to work around
the issue. The issues raised in the review are significant enough that
committing this patch could cause significant issues; I don't know how
to resolve handling with -v and how to handle tabs.

After further thinking, what I'd like is a smarter version of the
git-gs shortcut wrapper that limits the search space on well known
extensions but I'd like it to also limit itself to "source-like"
files, similar in some ways to the -I flag. So in some ways this could
be better served as a git config but I'm not even sure what kind of
heuristic would be generic enough to be valuable to a large number of
users.

As such I'll drop the patch as I don't see a clear path forward with
the current one.

Thanks,

M-A



[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux