Re: Reviving the hardware census

Nathaniel McCallum <npmccallum@xxxxxxxxxx> · Thu, 9 Nov 2017 16:15:53 -0500

On Thu, Nov 9, 2017 at 3:26 PM, Chris Murphy <lists@xxxxxxxxxxxxxxxxx> wrote:
> On Thu, Nov 9, 2017 at 1:11 PM, Nathaniel McCallum
> <npmccallum@xxxxxxxxxx> wrote:
>> Turning it into a hash doesn't solve the tracking problem. It only
>> prevents the attacker from knowing a list of serial numbers. I suspect
>> keeping hashes of identifying information will likely cause
>> controversy.
>
> What is the nature of the tracking problem? A single entry for a
> single machine is not tracking to me. Tracking requires at least two
> points in space-time. What's being stored by the Fedora Project? IP,
> Geolocation, date and time? Those are the things I associate with
> tracking more than a serial number or a hash of a serial number.

I'm an attacker. I observe the serial number on someone's laptop
("Hey! Nice laptop! I'm thinking about getting one of those. How heavy
is it?" <flip it over, look at the bottom>). I hash it. I search the
Fedora database. I now have a list of every package he has ever
installed and all the hardware he's ever plugged in. The opportunity
for compromise just grew exponentially.

We must never collect these things. That is part of communicating
trust to our users.
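
To make the attack above concrete, here is a minimal Python sketch. The
serial numbers, the record layout, and the use of unsalted SHA-256 are
all assumptions made up for illustration; none of it reflects an
actual census schema.

```python
import hashlib


def hash_serial(serial: str) -> str:
    """Unsalted SHA-256 of a serial number (hypothetical storage format)."""
    return hashlib.sha256(serial.encode("utf-8")).hexdigest()


# Hypothetical server-side records, keyed by the hash rather than the serial.
database = {
    hash_serial("PF0ABC12"): {"packages": ["vim", "openssh"], "usb": ["046d:c52b"]},
    hash_serial("PF0XYZ99"): {"packages": ["emacs"], "usb": []},
}


def lookup_by_observed_serial(serial: str):
    """Anyone who has read the serial off the case can repeat the same hash."""
    return database.get(hash_serial(serial))


record = lookup_by_observed_serial("PF0ABC12")
print(record)
```

The hash hides the list of serial numbers from someone who only has
the database, but it does nothing against someone who already knows
one serial number and wants the matching record.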

> Let's say you don't store serial number or a hash, but you do store
> model information, date/time, and an IP address.

We won't collect or store IP addresses. We will, of necessity, have
the IP address of the connection. But we should avoid logging this.

> If there's no
> mechanism to avoid duplicate entries, you've got a bigger tracking
> problem the less common that particular model is.

Less common models are still relatively common. I don't understand the
"duplicate entries" problem you are posing. Individual hardware
devices should be deduplicated (via a UNIQUE constraint). Everything
will be tied via a join table to a master checkin table where the
installation is uniquely identified by a UUID. This way we can see the
hardware associated with a particular checkin (and view changes over
time).
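
As a rough sketch of what deduplication via a UNIQUE constraint could
look like, here is a small Python/SQLite example. The table name
`device`, its columns, and the helper `device_id` are hypothetical;
the hardware side of the database has not been designed.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    """
    CREATE TABLE device (
        id     INTEGER PRIMARY KEY,
        vendor TEXT NOT NULL,
        model  TEXT NOT NULL,
        UNIQUE (vendor, model)   -- one row per distinct device, for all users
    )
    """
)


def device_id(db, vendor, model):
    """Return the row id for a device, inserting it only if it is new."""
    db.execute(
        "INSERT OR IGNORE INTO device (vendor, model) VALUES (?, ?)",
        (vendor, model),
    )
    row = db.execute(
        "SELECT id FROM device WHERE vendor = ? AND model = ?",
        (vendor, model),
    ).fetchone()
    return row[0]


first = device_id(db, "8086", "1502")   # first sighting: row is created
again = device_id(db, "8086", "1502")   # second sighting: same row is reused
other = device_id(db, "10de", "1c82")   # a different device gets its own row
device_count = db.execute("SELECT COUNT(*) FROM device").fetchone()[0]
print(first == again, device_count)
```

Because the device row carries nothing user-specific, reusing it
across installations does not by itself link those installations.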

> More models will
> make the data noisy. But if it's a sufficiently rare model, the
> duplicate entries can be assumed to be representing just a few
> distinct machines or even just one machine, and now you can track a
> person even if you don't have any serial number or hashing.

Yes, the problem of highly distinctive hardware combinations is known.
It remains unsolved even for browsers. Users concerned about this
should disable reporting.

> So I think necessarily you need a way to eliminate duplicates from
> entering the data set. Some way of anonymizing the entry in the Fedora
> Project's data, but also a way to track duplicates.
>
> How about two different data sets stored by the Fedora Project?
> Dataset 1 contains only the hash of the serial number of the device.
> If that hash is not present in dataset 1, then sanitized device data
> is added to dataset 2. If the hash is found in dataset 1, then it's
> not added to dataset 2. But there is no correlation between dataset 1
> and dataset 2?

It looks to me like you're trying to correlate a single hardware
configuration across checkins, and you call this process
"deduplication." Census already provides that functionality. Census is
not designed just for hardware. It is designed for gathering general
Linux distro statistics, of which hardware is one component. So we
solve this problem once for all reporting modules.

Every checkin reports a UUID, which is inserted into the master
checkin table. The checkin entry is UNIQUE(uuid, time). All other data
points are correlated to this checkin entry. Let's walk through an
example.

A user reports a single PCI device with census. This results in:
1. a new row in the checkin table
2. a possibly new row in the pci table (if this is the first time
   we've ever seen this device for all users)
3. a new row in the checkin_pci table joining the checkin to the pci device

If storage becomes an issue, we can periodically purge old checkins
and their joins. But once a PCI device is seen for the first time, it
is never deleted. And we will never have duplicate entries for a
single PCI device.
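
The three steps and the purge can be sketched in Python with SQLite.
The table names follow the ones used above (checkin, pci,
checkin_pci); the column names, the integer timestamps, and the
helpers `report`, `purge`, and `counts` are assumptions for
illustration only.

```python
import sqlite3

SCHEMA = """
CREATE TABLE checkin (
    id   INTEGER PRIMARY KEY,
    uuid TEXT    NOT NULL,
    time INTEGER NOT NULL,
    UNIQUE (uuid, time)
);
CREATE TABLE pci (
    id     INTEGER PRIMARY KEY,
    vendor TEXT NOT NULL,
    device TEXT NOT NULL,
    UNIQUE (vendor, device)
);
CREATE TABLE checkin_pci (
    checkin_id INTEGER NOT NULL REFERENCES checkin (id),
    pci_id     INTEGER NOT NULL REFERENCES pci (id),
    PRIMARY KEY (checkin_id, pci_id)
);
"""


def report(db, uuid, time, vendor, device):
    """Record one checkin that reports a single PCI device."""
    # 1. a new row in the checkin table
    checkin_id = db.execute(
        "INSERT INTO checkin (uuid, time) VALUES (?, ?)", (uuid, time)
    ).lastrowid
    # 2. a possibly new row in the pci table
    db.execute(
        "INSERT OR IGNORE INTO pci (vendor, device) VALUES (?, ?)",
        (vendor, device),
    )
    pci_id = db.execute(
        "SELECT id FROM pci WHERE vendor = ? AND device = ?", (vendor, device)
    ).fetchone()[0]
    # 3. a new row in checkin_pci joining the checkin to the pci device
    db.execute(
        "INSERT INTO checkin_pci (checkin_id, pci_id) VALUES (?, ?)",
        (checkin_id, pci_id),
    )


def purge(db, before):
    """Delete checkins older than `before` and their joins; keep pci rows."""
    db.execute(
        "DELETE FROM checkin_pci WHERE checkin_id IN "
        "(SELECT id FROM checkin WHERE time < ?)",
        (before,),
    )
    db.execute("DELETE FROM checkin WHERE time < ?", (before,))


def counts(db):
    """Row counts for (checkin, pci, checkin_pci)."""
    tables = ("checkin", "pci", "checkin_pci")
    return tuple(
        db.execute(f"SELECT COUNT(*) FROM {t}").fetchone()[0] for t in tables
    )


db = sqlite3.connect(":memory:")
db.executescript(SCHEMA)
report(db, "uuid-a", 100, "8086", "1502")
report(db, "uuid-b", 200, "8086", "1502")  # same device, different install
after_reports = counts(db)
purge(db, before=150)
after_purge = counts(db)
print(after_reports, after_purge)
```

Two installations reporting the same device share one pci row, and
purging the older checkin removes only that checkin and its join.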

This is just theoretical, because we haven't actually designed the
hardware side of the database. But it is an example.
_______________________________________________
kernel mailing list -- kernel@xxxxxxxxxxxxxxxxxxxxxxx