As part of the statistics++ project [1] it is Infrastructure's plan to make data about visits to Fedora Project web servers public, in order to automate the information made available on the Statistics wiki page.
The httpd logs currently contain personally-identifiable information: the IP address the request originated from and the user agent header.
We think that at an absolute minimum we need to hash the IP address (with a seed, obviously) and leave the user agent header as is. But we wanted to make sure we got legal's opinion on this.
[1]: https://fedoraproject.org/wiki/User:Ianweller/statistics_plus_plus
On 04/17/2012 01:15 PM, Ian Weller wrote:
> As part of the statistics++ project [1] it is Infrastructure's plan to make data about visits to Fedora Project web servers public, in order to automate the information made available on the Statistics wiki page.
> The httpd logs currently contain personally-identifiable information: the IP address the request originated from and the user agent header.
> We think that at an absolute minimum we need to hash the IP address (with a seed, obviously) and leave the user agent header as is. But we wanted to make sure we got legal's opinion on this.
Can you show me a hypothetical "before" and "after" example?
~tom
== Fedora Project
On Tue, Apr 17, 2012 at 01:36:50PM -0400, Tom Callaway wrote:
> On 04/17/2012 01:15 PM, Ian Weller wrote:
>> As part of the statistics++ project [1] it is Infrastructure's plan to make data about visits to Fedora Project web servers public, in order to automate the information made available on the Statistics wiki page.
>> The httpd logs currently contain personally-identifiable information: the IP address the request originated from and the user agent header.
>> We think that at an absolute minimum we need to hash the IP address (with a seed, obviously) and leave the user agent header as is. But we wanted to make sure we got legal's opinion on this.
> Can you show me a hypothetical "before" and "after" example?
Assuming we treat the logs as described above:
Before example (private, on Fedora log servers):
66.391.22.111 - - [15/Apr/2012:04:02:56 +0000] "GET /static/css/fedora960-lang.css HTTP/1.0" 200 233 "http://start.fedoraproject.org/index.html.en" "Mozilla/5.0 (X11; Linux x86_64; rv:11.0) Gecko/20100101 Firefox/11.0"
After example (available to public via statistics++ project):
c9326fa15a1d8a773386ddcdc16132f8 - - [15/Apr/2012:04:02:56 +0000] "GET /static/css/fedora960-lang.css HTTP/1.0" 200 233 "http://start.fedoraproject.org/index.html.en" "Mozilla/5.0 (X11; Linux x86_64; rv:11.0) Gecko/20100101 Firefox/11.0"
Requests from the same IP address would have the same hash so that we could run scripts that count unique visitors.
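A minimal sketch of the treatment described above, assuming the "seed" is a secret key held only on the log servers and that the hash is a keyed HMAC-SHA256 truncated to 32 hex characters. The thread does not say which hash function is intended, so the function names, the seed value, and the choice of HMAC here are all hypothetical:

```python
import hashlib
import hmac
import re

# Hypothetical secret; in practice it would be kept private on the log servers.
SECRET_SEED = b"example-seed-not-the-real-one"

# The client address is the first whitespace-delimited field of each log line.
FIRST_FIELD = re.compile(r"^(\S+)")


def hash_ip(ip, seed=SECRET_SEED):
    """Return a keyed hash of the address, truncated to 32 hex characters."""
    return hmac.new(seed, ip.encode("utf-8"), hashlib.sha256).hexdigest()[:32]


def anonymize_line(line, seed=SECRET_SEED):
    """Replace the leading client address with its hash; leave the rest as is."""
    return FIRST_FIELD.sub(lambda m: hash_ip(m.group(1), seed), line, count=1)


def count_unique_visitors(anonymized_lines):
    """Count distinct hashed addresses across already-anonymized lines."""
    return len({line.split(" ", 1)[0] for line in anonymized_lines if line.strip()})
```

Because the same address and seed always yield the same hash, passing the anonymized lines to `count_unique_visitors` gives the same count as counting distinct addresses in the private logs. The seed has to stay secret: the IPv4 address space is small enough that an unkeyed or known-seed hash could be reversed by hashing every possible address.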
On 04/17/2012 01:56 PM, Ian Weller wrote:
> On Tue, Apr 17, 2012 at 01:36:50PM -0400, Tom Callaway wrote:
>> On 04/17/2012 01:15 PM, Ian Weller wrote:
>>> As part of the statistics++ project [1] it is Infrastructure's plan to make data about visits to Fedora Project web servers public, in order to automate the information made available on the Statistics wiki page.
>>> The httpd logs currently contain personally-identifiable information: the IP address the request originated from and the user agent header.
>>> We think that at an absolute minimum we need to hash the IP address (with a seed, obviously) and leave the user agent header as is. But we wanted to make sure we got legal's opinion on this.
>> Can you show me a hypothetical "before" and "after" example?
> Assuming we treat the logs as described above:
> Before example (private, on Fedora log servers):
> 66.391.22.111 - - [15/Apr/2012:04:02:56 +0000] "GET /static/css/fedora960-lang.css HTTP/1.0" 200 233 "http://start.fedoraproject.org/index.html.en" "Mozilla/5.0 (X11; Linux x86_64; rv:11.0) Gecko/20100101 Firefox/11.0"
> After example (available to public via statistics++ project):
> c9326fa15a1d8a773386ddcdc16132f8 - - [15/Apr/2012:04:02:56 +0000] "GET /static/css/fedora960-lang.css HTTP/1.0" 200 233 "http://start.fedoraproject.org/index.html.en" "Mozilla/5.0 (X11; Linux x86_64; rv:11.0) Gecko/20100101 Firefox/11.0"
> Requests from the same IP address would have the same hash so that we could run scripts that count unique visitors.
As long as we're not somehow embedding any geo-ip information in those hashes, I think we're okay.
~tom
== Fedora Project