Re: Collecting data from Fedora user community

Stephen John Smoogen <smooge@xxxxxxxxx> · Wed, 7 Jul 2021 17:58:38 -0400

On Wed, 7 Jul 2021 at 16:26, przemek klosowski via devel
<devel@xxxxxxxxxxxxxxxxxxxxxxx> wrote:
>
> We had several discussions recently that could use some real-world data
> on e.g.:
>
> - x86_64-v2 prevalence
>
> - GUILE usage in make/gdb
>
> - count of systems with UEFI/GPT vs BIOS/MBR
>
> - debugd server usage
>
> - etc
>
> The common thread is that some sort of measurement would help figuring
> out the best technical solution for the Fedora community, but such
> measurement would require transmitting and collecting data in the Fedora
> infrastructure.  Such telemetry is usually criticized from two related
> angles: a general objection to online telemetry, and a practical
> argument about the complicated legal ramifications of online data
> collection in multiple jurisdictions. Stephen recently responded with an
> eloquent argument why it's very hard to come up with an acceptable
> collection scheme (see below).
>
> This problem is of course not unique to Fedora---everyone is in the same
> boat of finding a middle ground between anonymity and indiscriminate
> data collection. I think most people would agree that data-driven
> decisions are better than gut feeling-based ones, so it is to our
> benefit to control and possibly allow _some_ data to be used for that
> purpose.
>
> A few days ago I attended a talk by a practicing data scientist working
> with social data where the privacy issues are even more important: some
> of the data can literally have life-altering consequences to people
> covered by the collection. I asked him about the best practices and
> guidelines for responsible data collection, and he directed me to this
> presentation:
>
> https://the-engine-room.github.io/responsible-data-handbook/pages/slides.html
>
> My personal take-aways from reading this material are:

Thank you for doing this research. One of the first steps in any
sort of 'data analysis' is working through what is currently done in
the field.

>
> - we're not alone in this---other people have thought through these
> issues and come up with workable ideas
>
> - it helps to keep things in perspective---the data about people's
> computers is less consequential than e.g. data about their ethnicity or
> politics
>

People choose things based on the social groups they are 'bound' with,
so you will find that people with XYZ brand computers trend more
towards a particular ethnicity or politics. It isn't an absolute rule,
but the tendency is strong enough that 'possessions' are considered
PII in some places. [There was a strong link between people having
systems with multiple nvidia cards and libertarian politics because a
lot of people in both groups mined bitcoin. There is a correlation
between languages installed on a system and the ethnicity/demographics
of the person.]

It is also important to plan the research with the following in mind:
1. No matter how hard you try it is hard to anonymize data.
2. No matter how hard you try, you will be collecting items which can
'tell' about people.
3. People have the right to have their data deleted anytime/anywhere.
[There are exceptions to this but when you are getting volunteer data
and not part of maintaining legal records for business transactions,
it will need to be deleted.]
4. You need to design the collection to 'mine' gross items from the
data and then destroy that data as quickly as possible.
5. You need to design the collection with the expectation that it will
be something people want to steal and so needs to be kept secure.
6. When you are collecting data, there is a segment of the population
who will lie/cheat/steal as much as they can for various personal and
ideological reasons. You need to design the experiment to expect that
and try to determine when it is happening. In either case, the data you
have will be noisy.
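
Item 4 above (with items 1 and 2 in mind) could be sketched roughly as
follows. This is only a minimal illustration in Python, not anything
Fedora runs: the function name, the field names, and the min_count
threshold are all assumptions made for the example. It reduces raw
per-machine reports to gross per-field counts, drops rare values that
could single out a machine, and throws the raw reports away.

```python
from collections import Counter


def aggregate_and_discard(raw_reports, fields, min_count=5):
    """Reduce raw per-machine reports to coarse per-field counts.

    raw_reports: list of dicts, one per reporting machine (hypothetical format).
    fields:      names of the properties we actually need counts for.
    min_count:   assumed threshold; values seen on fewer machines are dropped.
    """
    totals = {field: Counter() for field in fields}
    for report in raw_reports:
        for field in fields:
            if field in report:
                totals[field][report[field]] += 1

    # Drop the raw reports as soon as the counts exist. Clearing a list only
    # removes the in-memory copy; real storage would need real deletion.
    raw_reports.clear()

    # Suppress rare values: a bucket holding only a few machines can still
    # 'tell' about the people behind them.
    return {
        field: {value: n for value, n in counts.items() if n >= min_count}
        for field, counts in totals.items()
    }
```

Only the per-field totals survive, so there is less to steal and less to
delete later. It does nothing about item 6: the counts are only as
honest as the reports that went in.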

> - there are legal requirements for Consent (notice/disclosure): they are
> workable, though
>
> - it matters how the data is used: publishing full logs vs. using the
> aggregated data for internal improvement
>
> Anyway, this is my personal, uneducated take on it. Hope it is helpful.
>
>
> On 6/18/21 9:39 AM, Stephen John Smoogen wrote:
> > On Fri, 18 Jun 2021 at 01:51, Gerd Hoffmann <kraxel@xxxxxxxxxx> wrote:
> >>    Hi,
> >>
> >>> The problems with this is that we are taking a fairly fuzzy data set
> >>> and making it much easier to track individual users in ways seen as
> >>> problematic by various laws and regulations.
> >> Well, depends on how you store the data.  You can store one record per
> >> machine (with all properties in there), or you can store one record per
> >> property per machine.
> >>
> >> With the latter you basically kill queries on subgroups (like "how many
> >> x86_64-v3 machines use UEFI?") because that grouping information is gone
> >> if you store each and every little piece of information in its own
> >> record.  But it'll also be much harder to do fingerprinting on such a
> >> database ...
> >>
> >> Standard disclaimer: IANAL.
> >>
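
The two storage layouts described above can be sketched like this. It
is a toy illustration in Python under stated assumptions: the sample
machines, property names, and function names are all hypothetical, and
it only shows the query trade-off, not whether either layout is legally
acceptable.

```python
from collections import Counter

# Hypothetical sample reports, one dict per machine.
machines = [
    {"isa": "x86_64-v3", "firmware": "UEFI"},
    {"isa": "x86_64-v3", "firmware": "BIOS"},
    {"isa": "x86_64-v2", "firmware": "UEFI"},
]


def per_machine_store(reports):
    """One record per machine: the properties of a machine stay linked."""
    return [dict(report) for report in reports]


def per_property_store(reports):
    """One record per property per machine: the link between properties is dropped.

    Records are sorted so that arrival order cannot be used to re-link them.
    """
    return sorted((name, value) for report in reports for name, value in report.items())


def count_subgroup(store, **wanted):
    """Count machines matching every wanted property (needs linked records)."""
    return sum(all(rec.get(k) == v for k, v in wanted.items()) for rec in store)


linked = per_machine_store(machines)
unlinked = Counter(per_property_store(machines))

print(count_subgroup(linked, isa="x86_64-v3", firmware="UEFI"))  # 1
print(unlinked[("isa", "x86_64-v3")], unlinked[("firmware", "UEFI")])  # 2 2
```

With the unlinked records you know there are two x86_64-v3 machines and
two UEFI machines, but not how many are both: the answer could be one or
two. That lost linkage is also what makes fingerprinting harder.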
> > The problem with IANAL is that we all come up with great solutions
> > which seem to match the single document we read. However the law is an
> > interpreted language where every court is a slightly different
> > architecture and has different libraries which have to be slowly
> > interpreted and patched at a top level. This means that you end up
> > finding out that the document and 2500 years of law rulings have
> > to be interpreted together.
> >
> > The way things are interpreted currently, it doesn't matter that you
> > stored it differently.. it matters that you collected it... mainly
> > because there is a long history of people finding ways to de-anonymize
> > data, people lying about anonymizing it, and people somehow collecting
> > the data in the middle. Because of that you end up having to delete
> > all the data when someone asks to be deleted because you can't prove
> > whether a given record/count came from their system or not.
> >
> > In general we computer people like to dive in and just collect data
> > and go about doing analysis. The various privacy laws are written to
> > make us do a LOT of hard work before we start doing that. You end up
> > spending a lot of time with lawyers versed in European, Brazilian, and
> > various other countries' laws/regulations/past history to figure out
> > what you can collect, how you can collect it, how you are going to
> > delete it, how you are going to inform people that things are
> > happening, and having clear processes that are followed. Then you can
> > start writing the code.. while doing that you have to review the code
> > to make sure it is still meeting current rulings.  [Doing it another
> > way ends up with you writing code and either finding you have to
> > delete it all or waiting months for an approval before rolling it
> > out.]
> >
> _______________________________________________
> devel mailing list -- devel@xxxxxxxxxxxxxxxxxxxxxxx
> To unsubscribe send an email to devel-leave@xxxxxxxxxxxxxxxxxxxxxxx
> Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
> List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
> List Archives: https://lists.fedoraproject.org/archives/list/devel@xxxxxxxxxxxxxxxxxxxxxxx
> Do not reply to spam on the list, report it: https://pagure.io/fedora-infrastructure

-- 
Stephen J Smoogen.
I've seen things you people wouldn't believe. Flame wars in
sci.astro.orion. I have seen SPAM filters overload because of Godwin's
Law. All those moments will be lost in time... like posts on  BBS...
time to reboot.