[Summit topic] Crazy (and not so crazy) ideas

Johannes Schindelin <Johannes.Schindelin@xxxxxx> · Thu, 21 Oct 2021 13:55:21 +0200 (CEST)

This session was led by Elijah Newren. Supporting cast: Johannes "Dscho"
Schindelin, Jonathan Tan, Jonathan "jrnieder" Nieder, brian m. carlson,
Jeff "Peff" King, Ævar Arnfjörð Bjarmason, Emily Shaffer, CB Bailey,
Taylor Blau, and Philip Oakley.

Notes:

* sent my idea for rebase merges on-list

* Test suite is slow. Shell scripts and process forking.

   * What if we had a special shell that interpreted the commands in a
     single process?

   * Even Git commands like rev-parse and hash-object, as long as that’s
     not the command you’re trying to test

   * Dscho wants to slip in a C-based solution

   * Jonathan tan commented: going back to your custom shell for tests
     idea, one thing we could do is have a custom command that generates
     the repo commits that we want (and that saves process spawns and
     might make the tests simpler too)

      * We could replace several “setup repo” steps with “git fast-import”
        instead.

   * Dscho measured: 0.5 sec - 30 sec in setup steps. Can use fast-import,
     or can make a new format that helps us set up the test scenario

   * Elijah: test-lib-functions helpers could be built ins

* Biggest idea: there are a lot of people who version control things via
  tarballs or .zip files per version. This prevents history from
  compressing well. Some people check in those compressed files into Git
  for purposes of history.

   * In particular, .jar files or npm packages. Initial testing showed
     that you can expand .jar files in a way that creates source-like
     files.

   * Jonathan Nieder points out that “pristine-tar” exists to do similar
     ideas: https://joeyh.name/code/pristine-tar/

   * Others use “git archive” for this purpose to mixed success.

   * jars and npm packages compress better if you store them in expanded
     form instead of compressed form

   * So many tools are used to using the end-archive, so while it’s
     tempting to have the build system be responsible for this, being able
     to “git add” the archive and have the right thing happen behind the
     scenes would be nice for ease of use

   * Goal here isn’t bit-for-bit reproducibility, just semantic
     reproducibility

   * What about other file formats that use zips, such as LibreOffice?

   * Git Merge 2018: Designers Git-It; A unified design system workflow
     did something similar, except made the tool understand the “exploded”
     file view.

   * Jonathan Tan mentions that smudge/clean filters can help, except this
     is about tree<->blob instead of blob<->blob

   * brian m. carlson mentions “git archive” output isn’t stable across
     Git versions. Should we have a canonical tar format that provides
     reproducibility?

   * Peff: tree<->blob filters can get confusing in the
     tree<->index<->worktree mapping. Possible, but requires careful
     thought about the details about when each spot

   * Old suggestion of a “blob-tree” type that allows storing a single
     index entry that corresponds to multiple trees and blobs in the
     background, possibly.

   * One long-term dream (inspired by Avery Pennarun’s “bup” tool) is to
     store large binary files in a tree-structured way that can store
     common regions as deltas, improve random access, parallelized
     hashing. Involves a consistent way to split the file into stable
     pieces, like --rsyncable uses (based on a rolling hash being zero).

   * Peff: you can do that at the object model layer or at the storage
     layer. The latter is less invasive.

   * jrnieder: The benefits of blobtree are greater at the object model
     layer --- e.g. not having to transmit chunks over the wire that you
     already have. I think the main obstacle has been that the benefits
     haven’t been enough to be worth the complexity. If that changes, we
     can imagine bundling it with some other object format changes, e.g.
     putting blob sizes in tree objects, and rolling it out as a new
     object-format.

   * Ævar: can we do this in a simpler manner, without deep technical
     changes? (Context: was thinking about this in the context of some
     $id$ questions.) Clean/smudge filters have some significant UX
     drawbacks. Has experience helping users trying to commit .jar files.
     Some simple advice saying “maybe you don’t want to commit this file
     type, here are some ways to expand it to a committable format…” based
     on patterns such as .gitignore or .gitattributes. We don’t have ways
     to indicate “this repo uses Git LFS, but you don’t have the plugin.”

* Emily: If I could rewrite the commit object format, I would change some
  things

   * Allow multiple authors

   * Add a layer of indirection to author name

      * brian has thought about this too: replace name with email address
        + some ssh key or something and use something mailmap-like to map
        it. Could be a backward-compatible approach

   * CB has been thinking about these problems in the background. Could
     randomly generate an identifier when you commit your first patch, an
     @example.com address to avoid conflicting with any real address.
     Mailmap can be a blob maintained by the project

      * In the process can get first-class multiple authors

      * If I have this id representing this particular pair of authors,
        can update what the id points to

      * Cool stuff but gets complicated

   * Just getting mailmap applied to trailers in “git log” would be huge

      * CB: main reason I don’t put myself in mailmap is that it’s not
        worth bothering without that feature

      * Ævar: “git log --author” would want the mapping, too. (and ‘git
        shortlog --group’) Do we do this only at the presentation layer or
        if we do it at a lower layer do we get such things for free?

      * If anyone’s interested, I might know where the dragons are hiding,
        happy to give advice

      * Peff: “git shortlog” already knows how to parse it out so this
        seems very possible

      * Taylor:
        https://lore.kernel.org/git/YW8A5FznqLYs7MqH@xxxxxxxxxxxxxxxxxxxxxxx/T/

   * Generation number was discussed ~2011(?)

   * Ævar: does this really need a format change? Two “author” fields
     would break things, but could have “author” and “x-author” header

   * General principle when changing formats: teasing apart where it’s
     possible to achieve what you want backward compatibility

* Philip Oakley would like a commit id referring to an unborn branch as a
  proper id

   * brian: empty tree works for what you’re talking about when you want a
     diff

   * Philip: motivating example was “first parent is going nowhere, but
     you have a second parent”

   * jrnieder: I see, you want the --first-parent history of your
     published branch to match the reflog. As a workaround, you’re able to
     use an empty initial commit and use --no-ff merges whenever you pull
     things in, but you’re referring to wishing you didn’t have to make
     that empty initial commit

   * Ævar: reminds me of the discussion in
     https://www.fossil-scm.org/home/doc/trunk/www/fossil-v-git.wiki of
     commit/branch relationships