[PATCH v4 0/4] Warn about git-filter-branch usage and avoid it

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Here's a series warns about git-filter-branch usage and avoids it
ourselves.

Changes since v3
  * Incorporated Eric's detailed feedback on the git-filter-branch
    manpage, some notes:
      * s/filter-branch/git-&/ (and similar for ls-files)
      * Multiple sections removed (and existing sections had a
        number of sentences removed)
      * I ended up not linking to the original html, but just added
        a small "Side Note" in a sub-bullet to address how fixing the
	written-in-shell attribute of git-filter-branch would do less
	than proponents expect.
      * ...and lots of other miscellaneous wording fixes and cleanups
  * The full range-diff is below, but it's kinda hard to read due to
    line wrapping and such.

Elijah Newren (4):
  t6006: simplify and optimize empty message test
  t3427: accelerate this test by using fast-export and fast-import
  Recommend git-filter-repo instead of git-filter-branch
  t9902: use a non-deprecated command for testing

 Documentation/git-fast-export.txt   |   6 +-
 Documentation/git-filter-branch.txt | 272 +++++++++++++++++++++++++---
 Documentation/git-gc.txt            |  17 +-
 Documentation/git-rebase.txt        |   3 +-
 Documentation/git-replace.txt       |  10 +-
 Documentation/git-svn.txt           |  10 +-
 Documentation/githooks.txt          |  10 +-
 contrib/svn-fe/svn-fe.txt           |   4 +-
 git-filter-branch.sh                |  13 ++
 t/t3427-rebase-subtree.sh           |  24 ++-
 t/t6006-rev-list-format.sh          |   5 +-
 t/t9902-completion.sh               |  12 +-
 12 files changed, 309 insertions(+), 77 deletions(-)

Range-diff:
1:  7ddbeea2ca = 1:  7ddbeea2ca t6006: simplify and optimize empty message test
2:  e1e63189c1 = 2:  e1e63189c1 t3427: accelerate this test by using fast-export and fast-import
3:  59c7446927 ! 3:  ed6505584f Recommend git-filter-repo instead of git-filter-branch
    @@ Documentation/git-filter-branch.txt: SYNOPSIS
     +carefully read <<SAFETY>> (and <<PERFORMANCE>>) to learn about the land
     +mines of filter-branch, and then vigilantly avoid as many of the hazards
     +listed there as reasonably possible.
    -+
    -+https://public-inbox.org/git/CABPp-BEDOH-row-hxY4u_cP30ptqOpcCvPibwyZ2wBu142qUbA@xxxxxxxxxxxxxx/[detailing
    -+the land mines of filter-branch]
     +
      DESCRIPTION
      -----------
    @@ Documentation/git-filter-branch.txt: warned.
     +PERFORMANCE
     +-----------
     +
    -+The performance of filter-branch is glacially slow; its design makes it
    ++The performance of git-filter-branch is glacially slow; its design makes it
     +impossible for a backward-compatible implementation to ever be fast:
     +
     +* In editing files, git-filter-branch by design checks out each and
    @@ Documentation/git-filter-branch.txt: warned.
     +git-filter-branch will make you do 10\^10 modifications, despite only
     +having (at most) 5*10^5 unique blobs.
     +
    -+* If you try and cheat and try to make filter-branch only work on
    ++* If you try and cheat and try to make git-filter-branch only work on
     +files modified in a commit, then two things happen
     +
    -+  . you run into problems with deletions whenever the user is simply
    -+    trying to rename files (because attempting to delete files that
    -+    don't exist looks like a no-op; it takes some chicanery to remap
    -+    deletes across file renames when the renames happen via arbitrary
    -+    user-provided shell)
    ++  ** you run into problems with deletions whenever the user is simply
    ++     trying to rename files (because attempting to delete files that
    ++     don't exist looks like a no-op; it takes some chicanery to remap
    ++     deletes across file renames when the renames happen via arbitrary
    ++     user-provided shell)
     +
    -+  . even if you succeed at the map-deletes-for-renames chicanery, you
    -+    still technically violate backward compatibility because users are
    -+    allowed to filter files in ways that depend upon topology of commits
    -+    instead of filtering solely based on file contents or names (though
    -+    I have never seen any user ever do this).
    ++  ** even if you succeed at the map-deletes-for-renames chicanery, you
    ++     still technically violate backward compatibility because users are
    ++     allowed to filter files in ways that depend upon topology of
    ++     commits instead of filtering solely based on file contents or names
    ++     (though this has not been observed in the wild).
     +
     +* Even if you don't need to edit files but only want to e.g. rename or
     +remove some and thus can avoid checking out each file (i.e. you can use
     +--index-filter), you still are passing shell snippets for your filters.
     +This means that for every commit, you have to have a prepared git repo
    -+where users can run git commands.  That's a lot of setup.  It also means
    -+you have to fork at least one process to run the user-provided shell
    -+snippet, and odds are that the user's shell snippet invokes lots of
    -+commands in some long pipeline, so you will have lots and lots of forks.
    -+For every. single. commit.  That's a massive amount of overhead to
    -+rename a few files.
    -+
    -+* filter-branch is written in shell, which is kind of slow.  Naturally,
    -+it makes sense to want to rewrite that in some other language.  However,
    -+filter-branch documentation states that several additional shell
    -+functions are provided for users to call, e.g. 'map', 'skip_commit',
    -+'git_commit_non_empty_tree'.  If filter-branch itself isn't a shell
    -+script, then in order to make those shell functions available to the
    -+users' shell snippets you have to prepend the shell definitions of these
    -+functions to every one of the users' shell snippets and thus make these
    -+special shell functions be parsed with each and every commit.
    -+
    -+* filter-branch provides a --setup option which is a shell snippet that
    -+can be sourced to make shell functions and variables available to all
    -+other filters.  If filter-branch is a shell script, it can simply eval
    -+this shell snippet once at the beginning.  If you try to fix performance
    -+by making filter-branch not be a shell script, then you have to prepend
    -+the setup shell snippet to all other filters and parse it with every
    -+single commit.
    -+
    -+* filter-branch writes lots of files to $workdir/../map/ to keep a
    -+mapping of commits, which it uses pruning commits and remapping to
    -+ancestors and the map() command more generally.  Other files like
    -+$tempdir/backup-refs, $tempdir/raw-refs, $tempdir/heads,
    -+$tempdir/tree-state are all created internally too.  It is possible
    -+(though strongly discouraged) that users could have accessed any of
    -+these directly.  Users even had a pointer to follow in the form of
    -+Documentation that the 'map' command existed, which naturally uses the
    -+$workdir/../map/* files.  So, even if you don't have to edit files, for
    -+strict backward compatibility you need to still write a bunch of files
    -+to disk somewhere and keep them updated for every commit.  You can claim
    -+it was an implementation detail that users should not have depended
    -+upon, but the truth is they've had a decade where they could so.  So, if
    -+you want full compatibility, it has to be there.  Besides, the
    -+regression tests depend on at least one of these details, specifying an
    -+--index-filter that reaches down and grabs backup-refs from $tempdir,
    -+and thus provides resourceful users who do google searches an example
    -+that there are files there for them to read and grab and use.  (And if
    -+you want to pass the existing regression tests, you have to at least put
    -+the backup-refs file there even if it's irrelevant to your
    -+implementation otherwise.)
    -+
    -+All of that said, performance of filter-branch could be improved by
    -+reimplementing it in a non-shell language and taking a couple small
    -+liberties with backward compatibility (such as having it only run
    -+filters on files changed within each commit).  filter-repo provides a
    -+demo script named
    -+https://github.com/newren/git-filter-repo/blob/master/contrib/filter-repo-demos/filter-lamely[filter-lamely]
    -+which does exactly that and which passes all the git-filter-branch
    -+regression tests.  It's much faster than git-filter-branch, though it
    -+suffers from all the same safety issues as git-filter-branch, and is
    -+still glacially slow compared to
    -+https://github.com/newren/git-filter-repo/[git filter-repo].
    ++where those filters can be run.  That's a significant setup.
    ++
    ++* Further, several additional files are created or updated per commit by
    ++git-filter-branch.  Some of these are for supporting the convenience
    ++functions provided by git-filter-branch (such as map()), while others
    ++are for keeping track of internal state (but could have also been
    ++accessed by user filters; one of git-filter-branch's regression tests
    ++does so).  This essentially amounts to using the filesystem as an IPC
    ++mechanism between git-filter-branch and the user-provided filters.
    ++Disks tend to be a slow IPC mechanism, and writing these files also
    ++effectively represents a forced synchronization point between separate
    ++processes that we hit with every commit.
    ++
    ++* The user-provided shell commands will likely involve a pipeline of
    ++commands, resulting in the creation of many processes per commit.
    ++Creating and running another process takes a widely varying amount of
    ++time between operating systems, but on any platform it is very slow
    ++relative to invoking a function.
    ++
    ++* git-filter-branch itself is written in shell, which is kind of slow.
    ++This is the one performance issue that could be backward-compatibly
    ++fixed, but compared to the above problems that are intrinsic to the
    ++design of git-filter-branch, the language of the tool itself is a
    ++relatively minor issue.
    ++
    ++  ** Side note: Unfortunately, people tend to fixate on the
    ++     written-in-shell aspect and periodically ask if git-filter-branch
    ++     could be rewritten in another language to fix the performance
    ++     issues.  Not only does that ignore the bigger intrinsic problems
    ++     with the design, it'd help less than you'd expect: if
    ++     git-filter-branch itself were not shell, then the convenience
    ++     functions (map(), skip_commit(), etc) and the `--setup` argument
    ++     could no longer be executed once at the beginning of the program
    ++     but would instead need to be prepended to every user filter (and
    ++     thus re-executed with every commit).
    ++
    ++The https://github.com/newren/git-filter-repo/[git filter-repo] tool is
    ++an alternative to git-filter-branch which does not suffer from these
    ++performance problems or the safety problems (mentioned below). For those
    ++with existing tooling which relies upon git-filter-branch, 'git
    ++repo-filter' also provides
    ++https://github.com/newren/git-filter-repo/blob/master/contrib/filter-repo-demos/filter-lamely[filter-lamely],
    ++a drop-in git-filter-branch replacement (with a few caveats).  While
    ++filter-lamely suffers from all the same safety issues as
    ++git-filter-branch, it at least ameloriates the performance issues a
    ++little.
     +
     +[[SAFETY]]
     +SAFETY
     +------
     +
    -+filter-branch is riddled with gotchas resulting in various ways to
    ++git-filter-branch is riddled with gotchas resulting in various ways to
     +easily corrupt repos or end up with a mess worse than what you started
     +with:
     +
    @@ Documentation/git-filter-branch.txt: warned.
     +history is in use for quite a while, at which point it's really hard to
     +justify another flag-day for another rewrite.)
     +
    -+* Filenames with spaces (which are rare) are often mishandled by shell
    -+snippets since they cause problems for shell pipelines.  Not everyone is
    -+familiar with find -print0, xargs -0, ls-files -z, etc.  Even people who
    -+are familiar with these may assume such needs are not relevant because
    ++* Filenames with spaces are often mishandled by shell snippets since
    ++they cause problems for shell pipelines.  Not everyone is familiar with
    ++find -print0, xargs -0, git-ls-files -z, etc.  Even people who are
    ++familiar with these may assume such needs are not relevant because
     +someone else renamed any such files in their repo back before the person
     +doing the filtering joined the project.  And, often, even those familiar
     +with handling arguments with spaces my not do so just because they
     +aren't in the mindset of thinking about everything that could possibly
     +go wrong.
     +
    -+* Non-ascii filenames (which are rare) can be silently removed despite
    -+being in a desired directory.  The desire to select paths to keep often
    -+use pipelines like `git ls-files | grep -v ^WANTED_DIR/ | xargs git rm`.
    -+ls-files will only quote filenames if needed so folks may not notice
    -+that one of the files didn't match the regex, again until it's much too
    -+late.  Yes, someone who knows about core.quotePath can avoid this
    -+(unless they have other special characters like \t, \n, or "), and
    -+people who use ls-files -z with something other than grep can avoid
    -+this, but that doesn't mean they will.
    ++* Non-ascii filenames can be silently removed despite being in a desired
    ++directory.  The desire to select paths to keep often use pipelines like
    ++`git ls-files | grep -v ^WANTED_DIR/ | xargs git rm`.  ls-files will
    ++only quote filenames if needed so folks may not notice that one of the
    ++files didn't match the regex, again until it's much too late.  Yes,
    ++someone who knows about core.quotePath can avoid this (unless they have
    ++other special characters like \t, \n, or "), and people who use ls-files
    ++-z with something other than grep can avoid this, but that doesn't mean
    ++they will.
     +
     +* Similarly, when moving files around, one can find that filenames with
     +non-ascii or special characters end up in a different directory, one
    @@ Documentation/git-filter-branch.txt: warned.
     +that it can and has manifested as a problem.)
     +
     +* It's far too easy to accidentally mix up old and new history.  It's
    -+still possible with any tool, but filter-branch almost invites it.  If
    -+we're lucky, the only downside is users getting frustrated that they
    -+don't know how to shrink their repo and remove the old stuff.  If we're
    -+unlucky, they merge old and new history and end up with multiple
    -+"copies" of each commit, some of which have unwanted or sensitive files
    -+and others which don't.  This comes about in multiple different ways:
    ++still possible with any tool, but git-filter-branch almost invites it.
    ++If lucky, the only downside is users getting frustrated that they don't
    ++know how to shrink their repo and remove the old stuff.  If unlucky,
    ++they merge old and new history and end up with multiple "copies" of each
    ++commit, some of which have unwanted or sensitive files and others which
    ++don't.  This comes about in multiple different ways:
     +
     +  ** the default to only doing a partial history rewrite ('--all' is not
    -+     the default and over 80% of the examples in the manpage don't use
    -+     it)
    ++     the default and few examples show it)
     +
     +  ** the fact that there's no automatic post-run cleanup
     +
     +  ** the fact that --tag-name-filter (when used to rename tags) doesn't
    -+     remove the old tags but just adds new ones with the new name (this
    -+     manpage has documented this for a long time so it's presumably not
    -+     a "bug" even though it feels like it)
    ++     remove the old tags but just adds new ones with the new name
     +
     +  ** the fact that little educational information is provided to inform
     +     users of the ramifications of a rewrite and how to avoid mixing old
    @@ Documentation/git-filter-branch.txt: warned.
     +* Annotated tags can be accidentally converted to lightweight tags, due
     +to either of two issues:
     +
    -+  . Someone can do a history rewrite, realize they messed up, restore
    -+    from the backups in refs/original/, and then redo their
    -+    filter-branch command.  (The backup in refs/original/ is not a real
    -+    backup; it dereferences tags first.)
    ++  ** Someone can do a history rewrite, realize they messed up, restore
    ++     from the backups in refs/original/, and then redo their
    ++     git-filter-branch command.  (The backup in refs/original/ is not a
    ++     real backup; it dereferences tags first.)
     +
    -+  . Running filter-branch with either --tags or --all in your <rev-list
    -+    options>.  In order to retain annotated tags as annotated, you must
    -+    use --tag-name-filter (and must not have restored from
    -+    refs/original/ in a previously botched rewrite).
    ++  ** Running git-filter-branch with either --tags or --all in your
    ++     <rev-list options>.  In order to retain annotated tags as
    ++     annotated, you must use --tag-name-filter (and must not have
    ++     restored from refs/original/ in a previously botched rewrite).
     +
     +* Any commit messages that specify an encoding will become corrupted
    -+by the rewrite; filter-branch ignores the encoding, takes the original
    ++by the rewrite; git-filter-branch ignores the encoding, takes the original
     +bytes, and feeds it to commit-tree without telling it the proper
    -+encoding.  (This happens whether or not --msg-filter is used, though I
    -+suspect --msg-filter provides additional ways to really mess things
    -+up).
    ++encoding.  (This happens whether or not --msg-filter is used.)
     +
     +* Commit messages (even if they are all UTF-8) by default become
     +corrupted due to not being updated -- any references to other commit
    @@ Documentation/git-filter-branch.txt: warned.
     +authors and committers, missing taggers.
     +
     +* If the user provides a --tag-name-filter that maps multiple tags to
    -+the same name, no warning or error is provided; filter-branch simply
    ++the same name, no warning or error is provided; git-filter-branch simply
     +overwrites each tag in some undocumented pre-defined order resulting in
    -+only one tag at the end.  If you try to "fix" this bug in filter-branch
    -+and make it error out and warn the user instead, one of the
    -+filter-branch regression tests will fail.  (So, if you are trying to
    -+make a backward compatible reimplementation you have to add extra code
    -+to detect collisions and make sure that only the lexicographically last
    -+one is rewritten to avoid fast-import from seeing both since fast-import
    -+will naturally do the sane thing and error out if told to write the same
    -+tag more than once.)
    ++only one tag at the end.  (A git-filter-branch regression test requires
    ++this.)
     +
    -+Also, the poor performance of filter-branch often leads to safety issues:
    ++Also, the poor performance of git-filter-branch often leads to safety issues:
     +
     +* Coming up with the correct shell snippet to do the filtering you want
     +is sometimes difficult unless you're just doing a trivial modification
    -+such as deleting a couple files.  People have often come to me for help,
    -+so I should be practiced and an expert, but even for fairly simple cases
    -+I still sometimes take over 10 minutes and several iterations to get
    -+the right commands -- and that's assuming they are working on a tiny
    -+repository.  Unfortunately, people often learn if the snippet is right
    -+or wrong by trying it out, but the rightness or wrongness can vary
    -+depending on special circumstances (spaces in filenames, non-ascii
    -+filenames, funny author names or emails, invalid timezones, presence of
    -+grafts or replace objects, etc.), meaning they may have to wait a long
    -+time, hit an error, then restart.  The performance of filter-branch is
    -+so bad that this cycle is painful, reducing the time available to
    -+carefully re-check (to say nothing about what it does to the patience of
    -+the person doing the rewrite even if they do technically have more time
    -+available).  This problem is extra compounded because errors from broken
    -+filters may not be shown for a long time and/or get lost in a sea of
    -+output.  Even worse, broken filters often just result in silent
    -+incorrect rewrites.
    ++such as deleting a couple files.  Unfortunately, people often learn if
    ++the snippet is right or wrong by trying it out, but the rightness or
    ++wrongness can vary depending on special circumstances (spaces in
    ++filenames, non-ascii filenames, funny author names or emails, invalid
    ++timezones, presence of grafts or replace objects, etc.), meaning they
    ++may have to wait a long time, hit an error, then restart.  The
    ++performance of git-filter-branch is so bad that this cycle is painful,
    ++reducing the time available to carefully re-check (to say nothing about
    ++what it does to the patience of the person doing the rewrite even if
    ++they do technically have more time available).  This problem is extra
    ++compounded because errors from broken filters may not be shown for a
    ++long time and/or get lost in a sea of output.  Even worse, broken
    ++filters often just result in silent incorrect rewrites.
     +
     +* To top it all off, even when users finally find working commands, they
     +naturally want to share them.  But they may be unaware that their repo
4:  1dbca82408 = 4:  ca8e124cb3 t9902: use a non-deprecated command for testing
5:  762d63d6a5 < -:  ---------- Remove git-filter-branch, it is now external to git.git
-- 
2.23.0.38.g892688c90e




[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux