Re: [PATCH 0/7] Generalize 'scalar diagnose' into 'git bugreport --diagnose'

Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> · Wed, 03 Aug 2022 14:34:04 +0200

On Tue, Aug 02 2022, Victoria Dye wrote:

> Ævar Arnfjörð Bjarmason wrote:
>> 
>> On Mon, Aug 01 2022, Victoria Dye via GitGitGadget wrote:
>> 
>>> [...] I didn't see a major UX benefit of 'git diagnose' vs 'git
>>> bugreport --diagnose', so I went with the latter, simpler approach.
>> 
>> I really wanted to like this, but I find the end result here really
>> confusing from a UX perspective.
>> 
>> You can now run "git bugreport --diagnose", which creates a giant *.zip
>> file to go along with your *.txt, but your *.txt makes no reference to
>> it.
>> 
>> Should you ... attach it to your bug report to this mailing list, do
>> something else?
>> 
>> The documentation doesn't offer much in the way of hints, other than
>> suggesting (with --no-report) that this --diagnose is for something
>> entirely different (and that's how "scalar" uses it).
>> 
>> I know what it's really for after reading this series, but for "git
>> bugreport" in particular we should be really careful about not making
>> the UX confusing.
>> 
>> The generated *.zip contains some really deep info about your repo (and
>> not just metadata, e.g. copies of the index, various logs etc.), someone
>> e.g. in a proprietary setting really doesn't want to be sharing that
>> info.
>> 
>> So I would like to see real integration into "git bugreport", i.e. for
>> us to smartly report more repository metrics, e.g. approx number of
>> loose objects, the sort of state "__git_ps1" might report, etc.
>> 
>> But I think the end-state here makes things much more confusing for
>> users.
>> 
>
> The "confusing UX" you describe here doesn't seem to be an inherent issue
> with the implementation (nor is it as insurmountable as you're implying),

I'm not implying or saying that it's insurmountable.

I think in principle having such a mode in "git bugreport" would make
sense.

But the UX here seems to be an afterthought. So I wonder if we shouldn't
hold off on it and just have a new *--helper or something instead.

> it
> largely appears to be an issue of under-documentation. I'll improve that in
> V2 [1], but I want clarify what I was/am going for here as well.
>
> In the context of a bug report, the diagnostics are intended to be used as
> supplemental information to aid in debugging (i.e., attached with the report
> in the email to the list).

Per https://lore.kernel.org/git/?q=n%3Azip we don't block *.zip
attachments, but we have fairly low size limits (to the point of
blocking some large patches).

E.g. e448263716f (po/git.pot: this is now a generated file, 2022-05-26)
(https://lore.kernel.org/git/20220526145035.18958-7-worldhello.net@xxxxxxxxx/)
was blocked, and it's just over 500k.

Aside from the data-sharing issues that seems like a good addition to
git-bugreport, i.e. tell the user if the attachment would be blocked due
to its size...

> They're especially valuable when a bug reporter
> isn't very familiar with Git internals and they can't reproduce the issue. A
> lot of bugs can be investigated without those diagnostics, though, which is
> why '--diagnose' isn't "on" by default.
>
> There are also valid use-cases (beyond 'scalar diagnose') for '--no-report':
> someone requests more information after looking into an already-generated
> report, or a developer wants to investigate a bug on their own and use the
> diagnostics as a "starting point" to guide more in-depth debugging. 

Yes, it's useful in a lot of cases. I'm just saying that we really need
to bridge the gap of telling the user what they should be doing with
this new file....

> As for the proprietary data issue, I'd be open to having an option to
> configure which diagnostics a user wants (either something like '--diagnose
> <level>' or a separate option entirely). I'm pretty indifferent on the UI,
> though, so I'll defer to other contributors on 1) if they want that feature,
> and 2) what they think that should look like.

I think it's really important that a bug report feature doesn't submit
private data by default. The name of a "--diagnose" option heavily
implies some aggregate metrics etc.

We then attach what to the user are opaque binary files, but which
contain even the file contants of the repository (index files).

I think scalar was developer for "internal" use-cases where such sharing
wasn't an issue for anyone, but I don't want anyone to get in trouble
because they shared their proprietary source code on the ML while trying
to file a bug report, so defaults really matter.

Especially because if there's one thing we've learned from bug reports &
questions to this & git-users is that the one thing you can rely on is
that users routinely don't carefully scrutinize documentation before
trying out various features...

> [1] https://lore.kernel.org/git/f3235afe-25cc-21a4-fc35-56e35d6be0ce@xxxxxxxxxx/
>
>>> An alternative implementation considered was creating a new 'git diagnose'
>>> builtin, but the new command would end up duplicating much of
>>> 'builtin/bugreport.c'.
>> 
>> It seems we always "return" from cmd_bugreport() quite quickly, and we
>> basically only share the code to create the output directory. Just
>> duplicating or sharing that seems like a much better approach for now
>> than creating the above UX confusion.
>> 
>> Note that you can also share code between multiple built-ins, even in
>> the same file (see e.g. builtin/{checkout,log}.c). So we could even
>> share something like the safe_create_leading_directories() calling code
>> in bugreport.c without libifying it.
>> 
>
> You deleted the part where I addressed this suggestion directly:

Yes, the "Although that..." sentence, but I commented on the UX
trade-off elsewhere.

>> Although that issue could be overcome with refactoring, I didn't see a
>> major UX benefit of 'git diagnose' vs 'git bugreport --diagnose', so I
>> went with the latter, simpler approach.
>
> And, in the process of writing down my thoughts on the UX above, I've become
> more convinced that including '--diagnose' in 'git bugreport' is the better
> way to present this functionality to users.

We're on the same page there, we're just discussing if/how to make the
end-to-end process clearer to users.

I don't think it's clear in v1, and is sorely missing something like the
discussion we have around "ANONYMIZING" in git-fast-export(1), and we
really should have a "safe by default". Everything I'm noting here would
be addressed by e.g.:

 * The *.txt output would say "you can additionally attach a diagnostic
   *.zip" file etc. etc., noting if/when the user would do that.

 * We'd have that --diagnostics be safe by default in the way
   --anonymize is, e.g. including stats about number of refs & the like,
   not their contents.

 * We could also have e.g. "--diagnostics --no-anonymize", or
   "--diagnostics=full" or whatever, which would be some approximation
   of the current output.

>>> Finally, despite 'scalar diagnose' now being nothing more than a wrapper for
>>> 'git bugreport --diagnose', it is not being deprecated in this series.
>>> Although deprecation -> removal could be a future cleanup effort, 'scalar
>>> diagnose' is kept around for now as an alias for users already accustomed to
>>> using it in 'scalar'.
>> 
>> We don't have a "make install" to get a "scalar" onto user's systems
>> yet, do we really need to worry about those users?
>> 
>> Or is this a reference to the out-of-tree version of "scalar", not
>> git.git's?
>> 
>
> In practice, it's the "out-of-tree Scalar" users that would care the most.
> That said, with Scalar in the Git tree (albeit 'contrib/' and not built by
> default), I think it's reasonable to want to avoid breaking changes if
> possible. The continued existence of 'scalar diagnose' wouldn't really be
> hurting anyone anyway, so there's no pressing need to deprecate it now.

I'm fine with keeping it, but just found the juxtaposition of doing that
& previous discussions about scalar to be surprising.

When it was submitted (from my recollection, I see
Documentation/technical/scalar.txt is rather light on the topic) the
idea was (and this is from my own recollection, I haven't dug up ML
references) that scalar would be "going away", and that the reason to
have it in-tree was to inform the design of the main "git" tooling for
the use-cases scalar is addressing.

So, a bit similar to Cogito (although that was never in-tree).

I'm fine with just keeping it as an alias for "git bugreport
--whatever", but think it would make even more sense to change the
interface a bit as we integrate it to the rest of git, and then "die()"
saying "'scalar xyz' is now 'git abc', mostly....".

Existing scalar users used to 'scalar xyz' would then try the new 'git
abc', and report if/how that fits their use-case.

Then at some point if we do that with all of scalar's features we'd be
at the point of 'git rm'-ing it, as it would have served its stated
purpose.

Just my 0.02, I think in any case having the scalar.txt doc's "Roadmap"
updated to address this would be very useful.