Collecting data from Fedora user community

We had several discussions recently that could use some real-world data on e.g.:

- x86_64-v2 prevalence

- GUILE usage in make/gdb

- count of systems with UEFI/GPT vs BIOS/MBR

- debugd server usage

- etc

The common thread is that some sort of measurement would help us figure out the best technical solution for the Fedora community, but such measurement would require transmitting data to, and collecting it in, the Fedora infrastructure. Such telemetry is usually criticized from two related angles: a general objection to online telemetry, and a practical argument about the complicated legal ramifications of online data collection in multiple jurisdictions. Stephen recently responded with an eloquent argument for why it's very hard to come up with an acceptable collection scheme (see below).

This problem is of course not unique to Fedora---everyone is in the same boat of finding a middle ground between anonymity and indiscriminate data collection. I think most people would agree that data-driven decisions are better than gut feeling-based ones, so it is to our benefit to control and possibly allow _some_ data to be used for that purpose.

A few days ago I attended a talk by a practicing data scientist working with social data, where privacy issues are even more important: some of the data can literally have life-altering consequences for the people covered by the collection. I asked him about best practices and guidelines for responsible data collection, and he directed me to this presentation:

https://the-engine-room.github.io/responsible-data-handbook/pages/slides.html

My personal take-aways from reading this material are:

- we're not alone in this---other people have thought through these issues and come up with workable ideas

- it helps to keep things in perspective---data about people's computers is less consequential than, e.g., data about their ethnicity or politics

- there are legal requirements for Consent (notice/disclosure): they are workable, though

- it matters how the data is used: publishing full logs vs. using the aggregated data for internal improvement

Anyway, this is my personal, uneducated take on it. Hope it is helpful.


On 6/18/21 9:39 AM, Stephen John Smoogen wrote:
On Fri, 18 Jun 2021 at 01:51, Gerd Hoffmann <kraxel@xxxxxxxxxx> wrote:
   Hi,

The problem with this is that we are taking a fairly fuzzy data set
and making it much easier to track individual users in ways seen as
problematic by various laws and regulations.

Well, depends on how you store the data.  You can store one record per
machine (with all properties in there), or you can store one record per
property per machine.

With the latter you basically kill queries on subgroups (like "how many
x86_64-v3 machines use UEFI?") because that grouping information is gone
if you store each and every little piece of information in its own
record.  But it'll also be much harder to do fingerprinting on such a
database ...
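
[Editor's illustration] The trade-off between the two storage schemes described above can be sketched in a few lines of Python. This is only a rough sketch under stated assumptions: the sample reports, the field names (isa_level, firmware), and the helper functions are all hypothetical and do not come from any Fedora tool or proposal.

```python
from collections import Counter

# Hypothetical sample reports; field names and values are invented
# purely for illustration.
reports = [
    {"isa_level": "x86_64-v3", "firmware": "UEFI"},
    {"isa_level": "x86_64-v3", "firmware": "BIOS"},
    {"isa_level": "x86_64-v2", "firmware": "UEFI"},
    {"isa_level": "x86_64-v3", "firmware": "UEFI"},
]


def store_per_machine(reports):
    """Scheme A: one record per machine, all properties kept together."""
    return [dict(r) for r in reports]


def store_per_property(reports):
    """Scheme B: one unlinked (property, value) record per property.

    Nothing ties two records back to the same machine.
    """
    records = []
    for r in reports:
        for key, value in r.items():
            records.append((key, value))
    return records


per_machine = store_per_machine(reports)
per_property = store_per_property(reports)

# Scheme A can answer the subgroup query "how many x86_64-v3 machines use UEFI?"
v3_uefi = sum(
    1
    for r in per_machine
    if r["isa_level"] == "x86_64-v3" and r["firmware"] == "UEFI"
)

# Scheme B can only produce independent totals per property value;
# the link between isa_level and firmware is gone.
totals = Counter(per_property)
v3_total = totals[("isa_level", "x86_64-v3")]
uefi_total = totals[("firmware", "UEFI")]

print(v3_uefi, v3_total, uefi_total)
```

With scheme B the totals for x86_64-v3 and for UEFI are both available, but their overlap cannot be recovered, which is the lost grouping information mentioned above. Whether that layout is enough from a legal point of view is a separate question.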

Standard disclaimer: IANAL.

The problem with IANAL, is that we all come up with great solutions
which seem to match the single document we read. However the law is an
interpreted language where every court is a slightly different
architecture and has different libraries which have to be slowly
interpreted and patched at a top level. This means that you end up
with finding out that the document and 2500 years of law rulings have
to be interpreted together.

The way things are interpreted currently, it doesn't matter that you
stored it differently.. it matters that you collected it... mainly
because there is a long history of people finding ways to de-anonymize
data, people lying about anonymizing it, and people somehow collecting
the data in the middle. Because of that you end up having to delete
all the data when someone asks to be deleted because you can't prove
this record/count was their system or not.

In general we computer people like to dive in and just collect data
and go about doing analysis. The various privacy laws are written to
make us do a LOT of hard work before we start doing that. You end up
spending a lot of time with lawyers versed in European, Brazilian, and
various other countries' laws/regulations/past history to figure out
what you can collect, how you can collect it, how you are going to
delete it, how you are going to inform people that things are
happening, and having clear processes that are followed. Then you can
start writing the code.. while doing that you have to review the code
to make sure it is still meeting current rulings.  [Doing it another
way ends up with you writing code and either finding you have to
delete it all or waiting months for an approval before rolling it
out.]



