Re: [PATCH v3] teach fast-export an --anonymize option

Jeff King <peff@xxxxxxxx> writes:

> Subject: docs/fast-export: explain --anonymize more completely
>
> The original commit made mention of this option, but not why
> one might want it or how they might use it. Let's try to be
> a little more thorough, and also explain how to confirm that
> the output really is anonymous.
>
> Signed-off-by: Jeff King <peff@xxxxxxxx>
> ---
>  Documentation/git-fast-export.txt | 63 ++++++++++++++++++++++++++++++++++++---
>  1 file changed, 59 insertions(+), 4 deletions(-)
>
> diff --git a/Documentation/git-fast-export.txt b/Documentation/git-fast-export.txt
> index 52831fa..dbe9a46 100644
> --- a/Documentation/git-fast-export.txt
> +++ b/Documentation/git-fast-export.txt
> @@ -106,10 +106,9 @@ marks the same across runs.
>  	different from the commit's first parent).
>  
>  --anonymize::
> -	Replace all refnames, paths, blob contents, commit and tag
> -	messages, names, and email addresses in the output with
> -	anonymized data, while still retaining the shape of history and
> -	of the stored tree.
> +	Anonymize the contents of the repository while still retaining
> +	the shape of the history and stored tree.  See the section on
> +	`ANONYMIZING` below.

Technically s/tree/trees/, I would think.  For a repository with
multiple branches, perhaps s/history/histories/, too, but I would
not insist on that ;-).

> +ANONYMIZING
> +-----------
> +
> +If the `--anonymize` option is given, git will attempt to remove all
> +identifying information from the repository while still retaining enough
> +of the original tree and history patterns to reproduce some bugs. The
> +goal is that a git bug which is found on a private repository will
> +persist in the anonymized repository, and the latter can be shared with
> +git developers to help solve the bug.
> +
> +With this option, git will replace all refnames, paths, blob contents,
> +commit and tag messages, names, and email addresses in the output with
> +anonymized data.  Two instances of the same string will be replaced
> +equivalently (e.g., two commits with the same author will have the same
> +anonymized author in the output, but bear no resemblance to the original
> +author string). The relationship between commits, branches, and tags is
> +retained, as well as the commit timestamps (but the commit messages and
> +refnames bear no resemblance to the originals). The relative makeup of
> +the tree is retained (e.g., if you have a root tree with 10 files and 3
> +trees, so will the output), but their names and the contents of the
> +files will be replaced.

While I do not think I or anybody who would ask other people to use
this option would be confused, the phrase "the same string" may
cause unnecessary worry among those who are asked to trust this
option.

I am not yet convinced that it is unlikely for the reader to read
the above and imagine that the anonymiser may go word by word,
replacing "the same string" with the same anonymised gibberish
(which would be susceptible to old-school cryptanalysis
techniques).

Among the ones listed, refnames, blob contents, commit messages
and tag messages are each converted as a single "string", and I wish
I could think of phrasing to stress that point somehow.
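
To make the worry concrete, here is a rough Python sketch of the two
readings.  It is not the fast-export code; the helper names and the
"w0"/"msg0" token format are made up for illustration.  It only shows
that a word-by-word mapping leaks which words repeat across messages,
while a whole-string mapping does not.

```python
# Hypothetical sketch, not the fast-export implementation.
def make_anonymizer(prefix):
    """Return a function mapping each distinct input to a stable token."""
    seen = {}

    def anonymize(s):
        if s not in seen:
            seen[s] = f"{prefix}{len(seen)}"
        return seen[s]

    return anonymize


def by_word(anonymize, message):
    # The reading a worried user might imagine: each word is a unit.
    return " ".join(anonymize(w) for w in message.split())


messages = ["fix the parser", "fix the lexer"]

word_anon = make_anonymizer("w")
whole_anon = make_anonymizer("msg")

word_level = [by_word(word_anon, m) for m in messages]
whole_level = [whole_anon(m) for m in messages]

print(word_level)   # shared words still line up across messages
print(whole_level)  # each message is one opaque token
```

With the word-level mapping the two outputs share their first two
tokens, which is the kind of pattern frequency analysis feeds on; with
the whole-string mapping nothing about the wording survives.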

Each path component in paths is converted as a single "string", so
we can tell from two anonymised paths whether they refer to blobs in
the same directory in the original.  This is a good thing, of course,
but it shows that among those listed in "refnames, paths, blob
contents, ..." in a flat sentence, some are treated as a single
token for replacement but not others, and it is hard to tell for a
reader which one is which, unless the reader knows the internals of
Git, i.e. what kind of things we as the debuggers-of-Git would want
to preserve.
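
As an illustration only, per-component replacement could look like
the Python sketch below.  The "pathN" token format and the function
names are my assumptions, not something taken from the patch; the
point is just that the directory structure stays readable in the
output while the names do not.

```python
# Hypothetical sketch, not the fast-export implementation.
def make_anonymizer(prefix):
    """Return a function mapping each distinct input to a stable token."""
    seen = {}

    def anonymize(s):
        if s not in seen:
            seen[s] = f"{prefix}{len(seen)}"
        return seen[s]

    return anonymize


anon_component = make_anonymizer("path")


def anonymize_path(path):
    # Each component between slashes is replaced as one unit.
    return "/".join(anon_component(c) for c in path.split("/"))


a = anonymize_path("src/secret/foo.c")
b = anonymize_path("src/secret/bar.c")
c = anonymize_path("docs/foo.c")

print(a)
print(b)
print(c)
```

Here a and b share their leading components, so a reader can tell
the two blobs lived in the same directory, and c reuses the token for
"foo.c" because the same component string appeared before.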

Isn't the unit for human identity anonymisation even more coarse?
If it is not, should it be?

In other words, do "Junio C Hamano <junio@xxxxxxxxx>" and "Junio C
Hamano <gitster@xxxxxxxxx>" map to one gibberish human readable name
with two gibberish e-mail addresses, or to two "User$n <user$n>"?  Is the
fact that this organization seems to allocate two e-mails to each
developer something this organization may want to hide from the
public (and something we as the Git debuggers would not benefit from
knowing)?
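
The two possibilities can be sketched in Python as follows.  Neither
is claimed to be what the patch does (that is the question above);
the identities, helper names and output formats are invented for the
example.

```python
# Hypothetical sketch, not the fast-export implementation.
def make_anonymizer(fmt):
    """Return a function mapping each distinct input to a stable token."""
    seen = {}

    def anonymize(s):
        if s not in seen:
            seen[s] = fmt.format(n=len(seen))
        return seen[s]

    return anonymize


# One person using two e-mail addresses (made-up identities).
idents = [
    ("A U Thor", "author@example.com"),
    ("A U Thor", "thor@example.org"),
]

# Possibility 1: name and e-mail are anonymised separately.
anon_name = make_anonymizer("Name{n}")
anon_email = make_anonymizer("email{n}@example.invalid")
separate = [f"{anon_name(n)} <{anon_email(e)}>" for n, e in idents]

# Possibility 2: the whole "Name <email>" string is one unit.
anon_ident = make_anonymizer("User{n} <user{n}>")
combined = [anon_ident(f"{n} <{e}>") for n, e in idents]

print(separate)
print(combined)
```

With separate mapping the output still shows one name attached to
two addresses; with the whole ident as the unit, the two lines become
unrelated users and that association is hidden.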

