I've spent quite a bit of time over the last week fixing up the scripts that generate Fedora's worldwide user maps [1], including the worldwide map for all Fedora versions currently in use [2], as determined by yum requests for mirrorlists.

One thing that's painfully obvious is that the "Unique IP addresses" method of counting installations [3] woefully under-counts the actual number of installs. Looking at a single day's worth of checkins (over 3 million), we see ~40k unique IP addresses checking in twice a day, another 40k checking in between 4x/day and roughly 20x/day, and then a long tail, fairly evenly distributed, where a small number of single IPs are checking in up to 2000x/day. It takes quite a bit of effort to cause yum to make that many mirrorlist requests from a single machine with a single IP address, so it's highly likely that there are 1000-2000 machines behind a NAT making those requests.

This just shows that we currently have no way to know, within even a 2-4x margin of error, how many current installs of Fedora there are. But this number, and its growth (positive or negative), would be interesting to know, if only it were more accurate. [4]

To this end, I would like to see yum enhanced to provide information which we can use to more accurately count the number of installed Fedora systems. This has been discussed before, and documented on the wiki [5], but for various reasons has never been acted upon. While I'll leave the implementation details to the appropriate teams, I think including some form of UUID in yum mirrorlist queries would be both appropriate and safe.

The biggest concern people have with using any UUID in any form is the "trackability" that comes inherent with it. Given enough log data that includes UUIDs, one could potentially use it to understand something about a user that they otherwise wouldn't want you to know.
For example: if I have the public IP address and UUID for a system, and if I have the HTTP/FTP logfiles from _all_ our mirrors, including public IP addresses (which I don't have today), I could potentially guess which RPMs a system at that IP address has installed. If there is only one system at that IP address, I'd have even more certainty as to what they have installed.

Personally, I don't think this is a big problem. Maybe it is. If it were, the entire industry that uses cookies for exactly this kind of tracking (and even more so) would face huge security, privacy, and legal concerns, which I just don't hear about. Whatever we do will have to run past Legal.

For implementation details, I suggest yum create and persist a single UUID for each installed system. This UUID would be separate from any smolt UUID. Yum would include this UUID in HTTP requests, but only when making mirrorlist requests, not when downloading content (from mirrors or elsewhere). All yumlib-using applications such as PackageKit would then inherit this capability.

On the back end, Fedora Infrastructure would add the capability to log this UUID for each request, just as it logs mirrorlist requests today. FI scripts would then use this UUID to count the number of installed instances over time. Systems can get re-installed (and thus get new yum UUIDs), so the count will not be exact, but over time it can provide more accurate trending than we can get today.

I'd like to hear your thoughts.

Thanks,
Matt

[1] http://fedoraproject.org/maps/
[2] http://fedoraproject.org/maps/all.png
[3] http://fedoraproject.org/wiki/Statistics
[4] http://fedoraproject.org/wiki/Infrastructure/Metrics#Metrics_are_actually_important
[5] http://fedoraproject.org/wiki/Infrastructure/Metrics#Unique_Identifiers
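
P.S. To make the implementation suggestion above a bit more concrete, here is a minimal Python sketch of the two halves: a client that creates and persists one UUID per install, and a back-end script that counts distinct UUIDs rather than distinct IP addresses. This is only an illustration under stated assumptions, not how yum or the Fedora Infrastructure scripts actually work. The function names, the file path argument, and the (ip, uuid) log record format are all hypothetical.

```python
import os
import tempfile
import uuid


def get_or_create_uuid(path):
    """Return the UUID stored at `path`, creating and persisting one if needed.

    `path` is a hypothetical location chosen by the caller; nothing here
    reflects where yum really keeps its state.
    """
    try:
        with open(path) as f:
            # Parse and normalize; a corrupt or empty file raises ValueError.
            return str(uuid.UUID(f.read().strip()))
    except (OSError, ValueError):
        pass  # No usable UUID yet, fall through and generate one.
    new_id = str(uuid.uuid4())  # Random UUID, carries no hardware information.
    with open(path, "w") as f:
        f.write(new_id + "\n")
    return new_id


def count_installs(log_records):
    """Count distinct IPs and distinct UUIDs in one pass.

    `log_records` is an iterable of (ip, uuid) tuples, an assumed log format
    used only for this sketch.
    """
    ips, uuids = set(), set()
    for ip, install_id in log_records:
        ips.add(ip)
        uuids.add(install_id)
    return len(ips), len(uuids)


if __name__ == "__main__":
    # The same install reports the same UUID on every mirrorlist request.
    state_file = os.path.join(tempfile.mkdtemp(), "uuid")
    first = get_or_create_uuid(state_file)
    assert first == get_or_create_uuid(state_file)

    # Three machines behind one NAT address, each checking in twice.
    records = [("192.0.2.1", "a"), ("192.0.2.1", "b"), ("192.0.2.1", "c")] * 2
    unique_ips, unique_uuids = count_installs(records)
    print(unique_ips, unique_uuids)
```

In the NAT case from the example, counting by IP address sees one install while counting by UUID sees three, which is exactly the under-counting described above. A re-installed system would show up as a new UUID, so a real script would still need to look at trends over time rather than raw totals.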