Re: F40 Change: Privacy-preserving Telemetry for Fedora Workstation (System-Wide)

Michal Domonkos <mdomonko@xxxxxxxxxx> · Fri, 7 Jul 2023 23:47:57 +0200

On Thu, Jul 06, 2023 at 08:08:05PM -0500, Michael Catanzaro wrote:
> But remember we do not want to keep information about individuals in the
> data set in the first place. It's easier to dodge privacy concerns if we
> just don't store such associations at all.

Sure, but the data still needs to leave a user's system at some point, and
that's where you have to trust the aggregator (the Fedora project in this case,
I suppose) not to store it verbatim.

Or, apply a DP technique locally, before it leaves the system.  Randomized
response, which you mentioned, is actually one such technique.
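
To make the local variant concrete, here's a minimal sketch of randomized
response for a single yes/no value.  The function names and the p_truth
parameter are made up for illustration, and this isn't what Endless or anyone
else actually ships.  Each client tells the truth with probability p_truth and
otherwise reports a fair coin flip; the aggregator can still recover the
overall rate because it knows how much noise was mixed in.

```python
import math
import random


def randomized_response(truth, p_truth=0.5, rng=random):
    """Report the real value with probability p_truth, else a fair coin flip."""
    if rng.random() < p_truth:
        return truth
    return rng.random() < 0.5


def estimate_true_rate(reports, p_truth=0.5):
    """Undo the noise in aggregate.

    observed = p_truth * true_rate + (1 - p_truth) * 0.5, solved for true_rate.
    """
    observed = sum(reports) / len(reports)
    return (observed - (1 - p_truth) / 2) / p_truth


def epsilon(p_truth=0.5):
    """Privacy loss of this mechanism: ln of P(yes | yes) / P(yes | no)."""
    return math.log((1 + p_truth) / (1 - p_truth))
```

With p_truth = 0.5, a reported "yes" is three times as likely to come from a
real "yes" as from a real "no", so epsilon is ln(3).  No single report proves
anything about its sender, yet the estimate over many reports converges on the
real rate.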

In a way, you already trust the distribution by its very nature, e.g. through
the signatures on the packages you install.  DP just provides a framework in
which you can formally quantify the risk of re-identifying an individual user
from a given data set, plus concrete strategies to minimize that risk.

Actually this exact problem is discussed in the blog post series I shared,
specifically in this part:

https://desfontain.es/privacy/local-global-differential-privacy.html

> As for differential privacy, I'm quite unfamiliar with this topic so I don't
> know to what extent it could be useful, but Endless is interested in adding
> randomized response [1], where say 50% of the data sent is fake and the
> other half is accurate. This only works for boolean and possibly integer
> data, but it would make it even harder to deanonymize reported data. But
> that is not supported yet.

Indeed, randomized response is one of the DP-aware techniques (it's also
mentioned in that blog series) :)  And RAPPOR is basically just randomized
response but generalized to arbitrary strings (using this fancy thing called
Bloom filters [1]).
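
Very roughly, and heavily simplified, the idea looks like the sketch below.
This is NOT the actual RAPPOR algorithm: the real thing uses two layers of
randomization plus a statistical decoding step on the server, all of which I'm
leaving out.  The names, the sizes (m, k) and the noise parameter f are
assumptions picked for illustration.  The string is hashed into a small Bloom
filter, and then each bit gets the randomized response treatment before the
report leaves the system.

```python
import hashlib
import random


def bloom_bits(value, m=32, k=2):
    """Encode a string as an m-bit Bloom filter using k salted SHA-256 hashes."""
    bits = [0] * m
    for i in range(k):
        digest = hashlib.sha256(f"{i}:{value}".encode()).digest()
        bits[int.from_bytes(digest[:8], "big") % m] = 1
    return bits


def noisy_report(value, f=0.5, m=32, k=2, rng=random):
    """With probability f, replace each Bloom filter bit with a random bit."""
    return [
        (1 if rng.random() < 0.5 else 0) if rng.random() < f else bit
        for bit in bloom_bits(value, m, k)
    ]
```

Since every client uses the same hash functions, the aggregator can compare
per-bit counts across many noisy reports against the Bloom filters of candidate
strings, without any single report revealing which string was sent.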

> I will add that to my reading list. Certainly it seems a lot less
> intimidating than the Wikipedia article. ;)

Yup, the Wikipedia article isn't very helpful.  There are much better
resources, including a bunch of talks on YouTube from the researchers
themselves (e.g. Cynthia Dwork).

> Wow. I'll add this to my reading list too, although remains to be seen
> whether I'll be able to understand it. :D

Yeah, the RAPPOR paper is an interesting read but pretty dense and math-heavy
(although not as much as it might seem at first glance).  I did *try* to read
it at some point and actually managed to understand the key concepts which
aren't *that* complicated.  But I can't blame anybody for not wanting to go
down that path after they skim through it and see those formulas and charts,
really :D

I went down this DP rabbit hole myself when I was working on the DNF Countme
[2] implementation a few years back.  Even if it wasn't directly applicable in
the end, it did inspire me to add a form of "randomized response" there: the
countme events from a single system are spread randomly across a week's time
window.  That way, no usage patterns of that particular system (e.g. the
typical uptime hours) could emerge if someone were to inspect the HTTP requests
carrying the countme flag from that system, aggregated over a long period of
time.  Pretty theoretical and, in retrospect, rather unlikely and paranoid, but
the logic was easy to add so I did, just for peace of mind :)
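
The general idea can be sketched like this.  To be clear, this is not the
actual DNF code, just an illustration with made-up names: pick a uniformly
random instant inside the current week-long window, and only attach the flag
to the first request made at or after that instant.

```python
import random

WEEK = 7 * 24 * 3600  # window length in seconds


def window_start(now, epoch=0):
    """Start of the week-long window containing `now` (seconds since epoch)."""
    return now - (now - epoch) % WEEK


def pick_send_time(now, rng=random):
    """Choose a uniformly random instant within the current window."""
    return window_start(now) + rng.randrange(WEEK)


def should_send(now, send_time, already_sent):
    """Attach the flag once per window, at the first request past send_time."""
    return not already_sent and now >= send_time
```

A client would call pick_send_time() once per window, remember the result, and
check should_send() on every request.  The flagged request then lands at a
different point in the week each time, rather than always on the first request
after boot.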

I haven't kept up with the latest developments in DP since then, though, and
have blissfully forgotten most of it, too.  But it sparked my interest back
then and I certainly thought that if Fedora ever decides that it wants some
kind of "telemetry", *this* is the (only acceptable) way to do it.

Which doesn't mean there aren't other ways, or that the approach taken by
Endless (which you'd like to adopt) is wrong, of course.  These were just my 2
cents :)

FWIW, it seems like various tech companies and software projects make use of DP
(at least that's what the Wikipedia article claims).  Google Chrome and MS
Windows are among those, amusingly, despite their reputation.

[1] https://en.wikipedia.org/wiki/Bloom_filter
[2] https://fedoraproject.org/wiki/Changes/DNF_Better_Counting

-- 
Michal Domonkos / RPM dev team / Red Hat, Inc.