Re: GSoC 2016: applications open, deadline = Fri, 19/2

Thomas Gummerer <t.gummerer@xxxxxxxxx> · Fri, 19 Feb 2016 12:46:57 +0100

On 02/18, Lars Schneider wrote:
>
> On 17 Feb 2016, at 19:58, Matthieu Moy <Matthieu.Moy@xxxxxxxxxxxxxxx> wrote:
>
> > Lars Schneider <larsxschneider@xxxxxxxxx> writes:
> >
> >> Coincidentally I started working on similar thing already (1) and I have
> >> lots of ideas around it.
> >
> > I guess it's time to start sharing these ideas then ;-).
> >
> > I think there's a lot to do. If we want to push this idea as a GSoC
> > project, we need:
> >
> > * A rough plan. We can't expect students to read a vague text like
> >  "let's make Git safer" and write a real proposal out of it.
> >
> > * A way to start this rough plan incrementally (i.e. first step should
> >  be easy and mergeable without waiting for next steps).
> >
> > Feel free to start writing an idea for
> > http://git.github.io/SoC-2016-Ideas/. It'd be nice to have a few more
> > ideas before Friday. We can polish them later if needed.
>
> I published my ideas here:
> https://github.com/git/git.github.io/pull/125/files

Sorry for posting my idea so late, but it took me a while to write
this all up, and life has a habit of getting in the way.  My idea goes
in a different direction than yours.

I do like the remote whitelist/blacklist project.

Junio pointed out to me off list that this is too complicated for a
GSoC project.  I kind of agree with that, but I wanted to see how this
could be split up, to completely convince myself as well.  And indeed,
the more I think about it the more risky it seems.

Below are some thoughts on a potential design, in case someone is
interested.  There is no code to back any of this up, sorry.

Everything proposed below should be hidden behind some configuration
variable, potentially one per command (?)
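
A minimal sketch of how such a switch could be resolved, with one
global setting and an optional per-command override.  The key names
`restore.enabled` and `restore.<command>` are made up for this
example; they are not existing Git configuration variables, and the
configuration is modelled as a plain dict rather than read from Git.

```python
def safety_enabled(config, command):
    """Return True if the hypothetical safety net is on for `command`.

    `config` is a plain dict standing in for Git configuration.  The
    keys 'restore.enabled' and 'restore.<command>' are assumptions.
    """
    truthy = {"true", "yes", "on", "1"}
    per_command = config.get("restore." + command)
    if per_command is not None:
        # A per-command setting overrides the global switch.
        return per_command.lower() in truthy
    # Default to off, so nothing changes unless the user opts in.
    return config.get("restore.enabled", "false").lower() in truthy
```

With this shape a user could turn everything on globally and still opt
out for a single command such as clean.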

- start with git-clean.  It's well defined which files are cleaned
  from a repository when running the command.  Add them to a commit on
  the tip of the current branch.

  Start a new branch (or use the existing one if applicable) in
  refs/restore/history, and add a commit including a notes file.  The
  commit message contains the operation that was executed (clean in
  this case), and the hash of the commit we created which includes the
  cleaned files.

  Add a note to the commit, detailing which command we came from and
  which files we added (not strictly necessary, as we can infer it
  from the parent commit).

  Useful in itself as the user can recover the files manually if
  needed, and can be sent as separate patch series.

  Potential problems:  Git has no way to track directories.  This can
  be mitigated by keeping the list of directories in the attached
  note.

- add a git recover command.  This would look like `git recover
  <commit>`, where commit is the hash of the commit we saved before.

  This works by reading the note attached to the commit, figuring out
  which command was run before, and restoring the state we were in
  before.

  Potential problems: conflicts, but I think this can be solved by
  simply erroring out, at least in the first iteration.

- the next commands could be git mv -f, git reset --hard and friends.  It
  gets more tricky here, as we'll have to deal with the state of the
  files in the index.

  Analogous to git clean, the changes in the working tree are all
  staged and added to a new commit on the tip of the current branch.

  The note on this commit needs to contain the necessary data to
  rebuild the state in the index.  The format is specified in more
  detail below.  We also need the corresponding changes in the
  git recover command.

  Restored files will be written to disk as racily smudged, so that
  git checks their contents, since we lost the meta-data anyway.  This
  comes at a slight performance cost, but I think that's okay as we
  potentially saved the user a lot of time re-doing all the changes.

- git branch/tag --force.  Store the name and the old location of the
  branch in refs/restore/history.  There are no files lost with this
  operation, so no additional commits like the ones for git clean or
  git reset etc. are needed.  The format of the commit depends on the
  exact operation that was forced; see below for the exact format.

This treatment can't make all operations safe.  Any operation that
touches the remote is hard to undo, as some users might already have
fetched the new state of the remote (e.g. git push -f).  Others such
as git-gc will inevitably delete information from the disk, but
changing that

There's more, but I don't think just writing up all commands without
any code would make any sense.

Formats:
- commits in refs/restore/history:
empty commits with the following commit message format for git-clean
and git-reset and friends:
$versionnumber\n
$command\n
$branchname\n
$sha1ofreferencedcommit\n

empty commits with the following commit message format for git branch
and friends:
$versionnumber\n
$command\n (this includes the exact operation that was forced,
e.g. move, delete etc.)
$branchname\n
$sha1ThatWasReferencedByTheBranch\n
$overwrittenbranchname\n (this and the sha1 below are only used for
--move)
$sha1ReferencedByOverwrittenBranch\n
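
A minimal Python sketch of writing and reading these commit messages,
assuming the fields appear one per line in the order listed above.
The names `HistoryRecord`, `format_record` and `parse_record` are
hypothetical, and treating the version number as an integer is an
assumption as well.

```python
from collections import namedtuple

# The last two fields stay None unless a branch was overwritten by a move.
HistoryRecord = namedtuple(
    "HistoryRecord",
    "version command branch sha1 overwritten_branch overwritten_sha1",
)


def format_record(rec):
    """Serialize a record as newline-terminated fields."""
    fields = [str(rec.version), rec.command, rec.branch, rec.sha1]
    if rec.overwritten_branch is not None:
        fields += [rec.overwritten_branch, rec.overwritten_sha1]
    return "".join(field + "\n" for field in fields)


def parse_record(message):
    """Parse a commit message produced by format_record."""
    lines = message.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # drop the empty string after the final newline
    if len(lines) not in (4, 6):
        raise ValueError("expected 4 or 6 fields, got %d" % len(lines))
    version, command, branch, sha1 = lines[:4]
    extra = lines[4:] if len(lines) == 6 else [None, None]
    return HistoryRecord(int(version), command, branch, sha1, extra[0], extra[1])
```

Keeping the version number first means a later format change can be
detected before any other field is interpreted.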

- notes file: The format can be different for different commands, as
  they all have different needs

  - git clean:
    list of affected files and directories separated by '\0'.
    I think we could get away with only the directories, but adding
    the filenames as well might make the recovery part simpler.

  - git reset, etc.:
    the following info is stored for each file that is modified by the
    original command.

    32-bit signature
    32-bit number of index entries
    32-bit mode (object type + unix permissions)
    160-bit SHA-1
    16-bit flags (extra careful here what we want to do with the
                  assume valid flag)
    path name (variable length)

    resolve-undo extension (same format as in the index)
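
A rough Python sketch of both note formats, under several assumptions
that are not part of the proposal: the signature bytes `RSTR` are made
up, integers are stored big-endian (as in the index), a trailing '/'
marks a directory in the clean note, and each path in the reset note
is NUL-terminated.  The resolve-undo extension is left out.

```python
import struct

SIGNATURE = b"RSTR"      # hypothetical 4-byte signature
ENTRY_FORMAT = ">I20sH"  # 32-bit mode, 160-bit SHA-1, 16-bit flags


def encode_clean_note(paths):
    """Join paths with NUL bytes; a trailing '/' marks a directory."""
    return b"\0".join(p.encode("utf-8") for p in paths)


def decode_clean_note(note):
    """Split a clean note back into its list of paths."""
    return [p.decode("utf-8") for p in note.split(b"\0")] if note else []


def encode_reset_note(entries):
    """entries: list of (mode, sha1_bytes, flags, path) tuples."""
    out = [SIGNATURE, struct.pack(">I", len(entries))]
    for mode, sha1, flags, path in entries:
        if len(sha1) != 20:
            raise ValueError("SHA-1 must be 20 bytes (160 bits)")
        out.append(struct.pack(ENTRY_FORMAT, mode, sha1, flags))
        out.append(path.encode("utf-8") + b"\0")  # NUL end is an assumption
    return b"".join(out)


def decode_reset_note(data):
    """Inverse of encode_reset_note."""
    if data[:4] != SIGNATURE:
        raise ValueError("bad signature")
    (count,) = struct.unpack_from(">I", data, 4)
    offset, entries = 8, []
    for _ in range(count):
        mode, sha1, flags = struct.unpack_from(ENTRY_FORMAT, data, offset)
        offset += struct.calcsize(ENTRY_FORMAT)
        end = data.index(b"\0", offset)
        entries.append((mode, sha1, flags, data[offset:end].decode("utf-8")))
        offset = end + 1
    return entries
```

Storing only the changed entries keeps the note small in the normal
case, where just a few files in the repository are touched.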

Alternatives:
- Have a history for each branch in refs/restore/$branchname.
  * Advantages:
    Each branch has its own history, which can lead to fewer conflicts
    when restoring (e.g. the user runs `git reset --hard` on one
    branch, switches to another branch and works there (potentially
    adding more stuff to that branch), later goes back to the old
    branch, discovers `git reset --hard` was actually the wrong thing
    to do and would like the data back).
  * Disadvantages:
    It is harder for the user to intuitively know what git recover
    will do exactly.
    It's much more limited when we want to extend it to branch
    removals, etc.

- Storing additional information in the refs/restore/history ref
  * Advantages:
    No need for extra notes
  * Disadvantages:
    Data doesn't get garbage collected without user interaction,
    potentially blowing up the repository size.  Especially using `git
    clean`, where binary files might be involved.

- Store the whole index in the note
  * Advantages:
    Simpler way of restoring the index (including all of the
    extensions)
  * Disadvantages:
    Need to take care of both the index and the split index.
    Will consume a lot more disk space in the normal case (only a few
    of the files in the repository are changed, while the majority
    remains unchanged).

- Store the changed files in refs/restore/history instead of a new
  commit on the tip of the current branch.
  * Advantages:
    All the information is in one place.
    Data will not be garbage collected.
  * Disadvantages:
    Data will not be garbage collected. (Repository size is probably
    going to blow up after a while)
    It takes more effort to find the parent and diff against it.