Re: GDPR compliance best practices?

Peter Backes <rtc@xxxxxxxxxxxxxxxxxxx> · Sun, 3 Jun 2018 22:24:39 +0200

On Sun, Jun 03, 2018 at 09:48:16PM +0200, Ævar Arnfjörð Bjarmason wrote:
> Sure, but what I'm pointing out is a) you can't focus on git as the
> technology because it tells you nothing about what's being done with it
> (e.g. the log file case I mentioned), b) nobody who came up with the GDPR
> was concerned with some free software projects or the SCM used by
> companies, so this is very unlikely to be enforced.

As I already said, the GDPR refers to the state of the art in 
technology, without defining it.

The GDPR provides a generic framework. It covers everyone, from a 
single person running a small blog to an S&P 500 enterprise. It also 
covers non-profits and state authorities. That includes whatever SCM 
they use.

The GDPR will be enforced against SCMs. The question is just who will 
be the first to be affected. I suspect it will be a mega-corporation 
that fired one of its developers, who then fights back by exercising 
his right to be forgotten against the company's public git repos.

> So nobody can be GDPR compliant in the face of archive.org and the like?

The GDPR has special exceptions for archives and the like.

> It does if you've got the ref. Maybe I just don't get your proposal,
> quote:
> 
>     Do not hash anything directly to obtain the commit ID. Instead, hash a
>     list of hashes of [$random_number, $information] pairs. $information
>     could be an author id, a commit date, a comment, or anything else. Then
>     store the commit id, the list of hashes, and the list of pairs to form
>     the commit.
> 
> You're just proposing (if I've read this correctly) that the commit
> object should have some list of headers pointing to other SHA1s, and
> that fsck and the like be OK with these going away. Right?

Certainly not SHA1. SHA1 is completely broken. I know Linus has a bit 
of a different opinion. But there's really no defense for SHA1. It's an 
utterly broken algorithm and should not be used at all anymore.

> How is this intrinsically different from referring to something in the
> ref namespace that may be deleted in the future?

I guess I am partly repeating myself, but:

1. Having fsck be OK with erasure is not enough. It tells you nothing 
about anonymization. If the hash is the same in 5000 instances, that's 
pseudonymization, not anonymization. You need to ensure a different 
hash in each instance, and you need to ensure there's no easy way to 
reconstruct the data from its hash. Hence $random_number (or let's call 
it $huge_random_number; it should have x bits if the hash has x bits). 
Without it, a short, guessable value can be recovered by hashing a 
list of candidates and comparing. If you have the SHA1 
64ca93f83bb29b51d8cbd6f3e6a8daff2e08d3ec it's too easy to figure out 
the plaintext (it's "Peter" BTW).

2. If you use a random UUID you cannot reconstruct the data from its 
hash, but you have the same issue with UUID reuse. Plus, you lose the 
ability to verify the author's name as part of the commit.
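
A minimal sketch of the scheme described in the two points above, in 
Python. It is only an illustration, not git's object format: SHA-256 
as the hash, the 4-byte length prefix on the salt, and the names 
field_hash, make_commit, verify and erase are all assumptions made for 
this example.

```python
import hashlib
import secrets

# Assumption: SHA-256 stands in for "some unbroken hash".
HASH = hashlib.sha256
SALT_BYTES = HASH().digest_size  # x-bit salt for an x-bit hash


def field_hash(salt: bytes, info: bytes) -> bytes:
    """Hash one [$random_number, $information] pair."""
    # Length-prefix the salt so the salt/info boundary is unambiguous.
    return HASH(len(salt).to_bytes(4, "big") + salt + info).digest()


def make_commit(fields):
    """Build a commit from a list of byte strings.

    Returns (commit_id, hashes, pairs). The commit ID depends only on
    the list of per-field hashes, never directly on the field contents.
    """
    pairs = [(secrets.token_bytes(SALT_BYTES), info) for info in fields]
    hashes = [field_hash(salt, info) for salt, info in pairs]
    commit_id = HASH(b"".join(hashes)).hexdigest()
    return commit_id, hashes, pairs


def erase(pairs, index):
    """Drop one pair (salt and plaintext); its hash stays in the commit."""
    return [None if i == index else p for i, p in enumerate(pairs)]


def verify(commit_id, hashes, pairs):
    """Check the commit ID and every pair that has not been erased."""
    if len(hashes) != len(pairs):
        return False
    if HASH(b"".join(hashes)).hexdigest() != commit_id:
        return False
    for h, pair in zip(hashes, pairs):
        if pair is not None and field_hash(*pair) != h:
            return False
    return True


if __name__ == "__main__":
    cid, hashes, pairs = make_commit([b"Peter", b"2018-06-03", b"fix typo"])
    print(verify(cid, hashes, pairs))            # all fields present
    print(verify(cid, hashes, erase(pairs, 0)))  # author erased
```

Because every pair gets a fresh salt, the same author name produces a 
different hash in every commit, and once a pair is erased the 
remaining hash cannot be matched against a list of candidate names. 
The commit ID itself is unaffected by the erasure, and any pair that 
is still present can be checked against its hash.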

> Okay, so you're not reading the GDPR in some literal sense, but you're
> coming to a conclusion that's supported by ... what? To echo Theodore
> Y. Ts'o's e-mail: have you consulted with someone who's an actual lawyer
> on this subject?

I'm replying to this one in private. It's not relevant to this 
discussion.

Best wishes
Peter

-- 
Peter Backes, rtc@xxxxxxxxxxxxxxxxxxx