Hello Legal,
My name is Michael Winters, typically known here as @mwinters. I have some questions about Fedora's data privacy policies, but first I'll provide a bit of context.
There has been a long-standing desire within Fedora for better tools with which to analyze our user data and understand our community so that we can improve it. To this end, I have recently created a "Data Lakehouse" proof of concept known as "Hatlas", available at https://hatlas.mwinters.net . This technology consolidates data from existing public Fedora datasets and provides simplified tools to facilitate public access and analysis.
Since these datasets were previously quite difficult to access, I believe that most people are unaware of what data exists about them within Fedora and/or the fact that it's being published publicly. I expect that the announcement of easy access to this data will raise some community concerns about data privacy, so this email is in anticipation of those concerns. I wish to have clear resources to refer people to, and current resources such as https://docs.fedoraproject.org/en-US/legal/privacy/ have left some questions open.
In particular, many of these datasets include usernames and records of user activity tied to those usernames, e.g. the contents and exact timing of forum posts, git commits, group membership changes, etc. My current questions are:
1) Does an arbitrary username (not necessarily tied to a real name) constitute PII which must be protected / anonymized? It is not currently anonymized in Fedora datasets.
2) Do current Fedora policies permit collecting user activity tied to usernames? This is not explicitly stated under "Information We Collect", though it is mentioned later under "Using (Processing) Your Personal Data."
3) Do current Fedora policies permit publishing user activity tied to usernames? Section "Sharing Your Personal Data" does mention "For research activities", but it does not specify that data must be shared *only* in aggregate.
4) How does GDPR view downstream users of public data sources, i.e. Hatlas? Is Hatlas a "data processor"? Must Hatlas integrate with Fedora's Personal Data Removal process? We intend to do so, but there seems to be no obligation for either party.
5) Are there any data licenses applicable to downstream users such as Hatlas? I intend to apply one restricting the use of Hatlas data to non-commercial purposes, but there seem to be no restrictions coming from Fedora.
Thanks in advance!
Michael Winters
Hi Michael,
Thanks for raising this. Having looked at your site, I'm a bit unclear as to what dataset you are referring to. What is datanommer: is that an existing set of data, or a name you made up? And what exactly is published publicly already?
It sounds like you are potentially pulling data from a variety of different sources, is that right? If so, what are these sources and what is the intent of using this consolidated data?
Thanks, Jilayne
Datanommer is the backend for this site: https://apps.fedoraproject.org/datagrepper/
-- _______________________________________________ legal mailing list -- legal@lists.fedoraproject.org To unsubscribe send an email to legal-leave@lists.fedoraproject.org Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/ List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines List Archives: https://lists.fedoraproject.org/archives/list/legal@lists.fedoraproject.org Do not reply to spam, report it: https://pagure.io/fedora-infrastructure/new_issue
Thanks for the quick response Jilayne, and apologies for my verbosity here. I don't know what you know.
what dataset are you referring to
There are actually many different datasets published by Fedora, and much of the task ahead is to bring them under one roof where they can be cross-referenced. But most of them have very similar data, and the same general privacy questions apply.
To answer your question, "Datanommer" is the biggest of these. This is a record of all the messages sent between various Fedora systems.
For example, let's say that system A stores code. When user "bob" changes code for project "foo" which is stored in system A, system A sends a message to all other Fedora systems saying, "user bob made this change to foo's code." Depending on the message, any number of other systems might respond. For example, system B might rebuild the code for project "foo" and release a new version of package "foo" to Fedora end users. Meanwhile, system C might be tracking how many code changes were made by bob this month for the purpose of issuing recognition rewards to incentivize contributions.
These systems might then also issue their own additional messages. Going back to system B, once it has released the new version of package "foo", it might send a message saying that it has done so successfully, and another system responds by announcing the new version somewhere. In general, this is the "event driven" fabric by which most of our applications communicate with each other.
Every message ever sent on this fabric back to 2012 is currently stored in the Datanommer database and is published by Fedora in several places, including on Datagrepper which Neil shared with you. As I've illustrated, these messages include records of user activity tied to a username and a timestamp. Depending on the type of message, additional information is also included, such as "this is the code that bob changed" in my example.
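To make the flow above concrete, here is a minimal in-memory sketch of such an event-driven fabric. It is an illustration only, not the actual Fedora messaging implementation: the `Bus` class, the topic names `git.commit` and `build.completed`, and the message fields are all hypothetical, and the `archive` list merely stands in for the Datanommer database.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Message:
    topic: str       # hypothetical topic name, e.g. "git.commit"
    username: str    # the user whose activity triggered the message
    body: dict       # topic-specific payload
    timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


class Bus:
    """Toy message bus: every published message is also archived."""

    def __init__(self):
        self.subscribers = defaultdict(list)
        self.archive = []  # stands in for the Datanommer database

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, message):
        self.archive.append(message)
        # Iterate over a copy so handlers may publish further messages.
        for handler in list(self.subscribers[message.topic]):
            handler(message)


bus = Bus()
commit_counts = defaultdict(int)


def rebuild_package(msg):
    # "System B": rebuild the project, then announce the result.
    bus.publish(
        Message("build.completed", msg.username, {"package": msg.body["project"]})
    )


def count_commit(msg):
    # "System C": track how many code changes each user made.
    commit_counts[msg.username] += 1


bus.subscribe("git.commit", rebuild_package)
bus.subscribe("git.commit", count_commit)

# "System A": user bob changes code for project foo.
bus.publish(Message("git.commit", "bob", {"project": "foo"}))
```

After the single publish call, the archive holds two messages, both carrying the username "bob" and a timestamp, which is the property that raises the privacy questions in this thread.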
The dataset is enormous and difficult to review. There are over 28,000 different types of messages in the system currently, which we refer to as "topics". The consensus seems to be that this data does not include the IP addresses of users or any other data that could be used to track a user's location or discover their identity beyond username. However, it's worth noting that there also seems to be no single party responsible for ensuring that this is true and that it remains true. In fact, given the vast quantity and variety of data here, I'm a bit concerned that some accidental PII may be discovered once the ease of access that Hatlas provides brings the elevated community scrutiny I'm anticipating.
If I'm wrong and a responsible party has in fact been assigned, publishing that point of contact somewhere in our documentation would ease some concerns.
what is the intent of using this consolidated data?
Our previous Fedora Project Leader established a goal of doubling our number of contributors by 2028, aka "Strategy 2028". However, today we have no way of measuring how many contributors we have, let alone whether that number is growing, or by how much, and which efforts of ours are most successful in growing this number, and where we ought to invest.
This very simple question of how many contributors we have is complex to answer because there are many types of contributions. For example, someone who hosts a regional Fedora conference is certainly a contributor by Strategy 2028 standards, but it's not likely that those activities will require new code, so measuring contributors as "those who contribute code" will miss this type of person and many others.
Additionally, there is a lot of nuance in many cases. Someone who makes a single code contribution of 10,000 lines of code representing hundreds of hours of work is likely more engaged with the project than someone who corrected 2 separate typos in our documentation. We need to be able to combine these many sources of data and all of the nuances within them to ascertain the size and "health" of our community, for the purpose of improving it.
I've developed Hatlas as a downstream product of Fedora data mainly because that was the most expedient way for me to facilitate these analyses. My long-term plan is to bring this project fully into Fedora, running under Fedora's governance and on Fedora's servers.
However, the fact is that anyone could build a similar system with no special access or permissions required, and use this data for very different purposes such as publishing what time of day a specific user is active in order to encourage harassment, or to see which admins change their password the least frequently. (To be clear, the password change example is hyperbole, but data of similar sensitivity could be inadvertently present in such a vast dataset. In fact, the task of discovering what data we have is the first step of building our analyses, and will likely be an ongoing activity.)
I've already received comments of concern along these lines and I anticipate many more after publicly announcing Hatlas, which is why I'm asking for clarity -- to be able to definitively say, "the Fedora data policies do permit collecting and publishing this data, and it is compliant with all applicable laws." (Which may result in the community asking for a policy change.) Or, alternatively, to figure out what might need to change from the current state to get us to that point.
Thanks again,
Michael Winters
I should expand my request here to ask: are there any technical measures / policies / licenses / etc which ought to be in place for Fedorans working on these datasets? (This also brings up the question of "Who *is* Fedora vs. who is *downstream of* Fedora?" Where do we draw the line in an open community?)
I ask this because we are discussing these privacy concerns internally and trying to find the best way forward. A few points here:
- It's fairly straightforward to "pseudonymize" user activity, meaning, we replace their usernames with a number (or similar).
  - However, *somebody* needs to perform this work. So we need to know under what conditions access can be granted (etc) to the original data.
- Even with pseudonymization, it may be possible to identify individuals by their activity. The only way to truly anonymize these datasets is to aggregate them.
  - However, we end up in the same position: *somebody* has to perform the aggregation. And this needs to be done very carefully (ideally, collaboratively) so that we can still extract the insights necessary to guide our community management decisions.
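As a rough illustration of the pseudonymization step described above, here is a minimal sketch that replaces each username with a keyed hash. This is one possible approach, not something Fedora or Hatlas is known to use: the function names, the record layout, and the example key are all hypothetical. Whoever holds the key can reproduce the mapping, which is exactly the "somebody needs access to the original data" problem.

```python
import hashlib
import hmac


def pseudonymize(username: str, secret_key: bytes) -> str:
    """Map a username to a stable pseudonym using HMAC-SHA256."""
    digest = hmac.new(
        secret_key, username.encode("utf-8"), digestmod=hashlib.sha256
    ).hexdigest()
    return "user-" + digest[:16]


def pseudonymize_records(records, secret_key):
    """Return copies of the records with the username field replaced."""
    return [
        {**record, "username": pseudonymize(record["username"], secret_key)}
        for record in records
    ]


# Hypothetical activity records; the field names are assumptions.
records = [
    {"username": "bob", "topic": "git.commit", "hour": 3},
    {"username": "alice", "topic": "forum.post", "hour": 14},
    {"username": "bob", "topic": "forum.post", "hour": 4},
]

key = b"example-key-held-by-a-trusted-party"  # placeholder, not a real secret
masked = pseudonymize_records(records, key)
```

Note that both of bob's records still share one pseudonym, so his activity pattern (for example, consistently posting in the early hours) remains linkable. That is why pseudonymization alone may not prevent identification by activity.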
Thanks,
Michael Winters
On the topic of pseudonymization, we have been discussing strategies in the Data Working Group room on Matrix for identifying cohort groups of contributors that do not single out any individual contributor, but still allow us to draw useful insights about the cohorts.
One example I had is about event engagement at an event such as FOSDEM. For people who scanned a Fedora Badge at FOSDEM, were they already contributors or were they encountering Fedora for the first time? After the event, did people generally continue to contribute to the project or did they disappear and we never saw them again? I don't need any individual name or identity of a person, but knowing the general trends of the cohort would be useful and interesting for me.
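A cohort question like this one could be answered with aggregate counts only. The sketch below is a hypothetical illustration: the function name, the input dictionaries, the example dates, and the `min_cell` suppression threshold are all assumptions, not an existing Fedora or Hatlas API. Counts below the threshold are reported as `None` so that very small groups cannot single out a person.

```python
from datetime import date


def cohort_summary(scanners, first_activity, last_activity, event_day, min_cell=5):
    """Count badge scanners by prior and later activity, hiding small cells.

    scanners:       set of user identifiers who scanned the badge
    first_activity: dict of user -> date of first recorded contribution
    last_activity:  dict of user -> date of most recent contribution
    """
    counts = {
        "existing_retained": 0,
        "existing_lapsed": 0,
        "new_retained": 0,
        "new_lapsed": 0,
    }
    for user in scanners:
        first = first_activity.get(user)
        last = last_activity.get(user)
        was_contributor = first is not None and first < event_day
        retained = last is not None and last > event_day
        key = ("existing" if was_contributor else "new") + "_" + (
            "retained" if retained else "lapsed"
        )
        counts[key] += 1
    # Suppress any cell smaller than min_cell.
    return {k: (v if v >= min_cell else None) for k, v in counts.items()}


# Tiny made-up example; real use would want a larger min_cell.
scanners = {"a", "b", "c", "d", "e"}
first_activity = {
    "a": date(2024, 1, 1),
    "b": date(2024, 6, 1),
    "c": date(2025, 2, 10),
    "d": date(2025, 2, 5),
}
last_activity = {
    "a": date(2025, 6, 1),
    "b": date(2025, 5, 1),
    "c": date(2025, 2, 10),
    "d": date(2025, 3, 1),
}
summary = cohort_summary(
    scanners, first_activity, last_activity, date(2025, 2, 2), min_cell=2
)
```

Only the four cohort counts leave the function; no usernames appear in the result, though whoever runs it still needs access to the underlying per-user data.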
NB, I'm answering simply as a Fedora contributor.
On Tue, Nov 11, 2025 at 11:27:59AM -0600, Michael Winters via legal wrote:
Hello Legal,
My name is Michael Winters, typically known here as @mwinters. I have some questions about Fedora's data privacy policies, which I'll provide a bit of context to first.
There has been a long-standing desire within Fedora for better tools with which to analyze our user data and understand our community so that we can improve it. To this end, I have recently created a "Data Lakehouse" proof of concept known as "Hatlas", available at https://hatlas.mwinters.net . This technology consolidates data from existing public Fedora datasets and provides simplified tools to facilitate public access and analysis.
Looking at the FAQ
https://hatlas.mwinters.net/docs/faq/
one item stands out to me
"If you feel strongly that you want to be erased from Fedora datasets, please work through the existing Fedora Personal Data Removal request process. If you still see your data here after a reasonable amount of time, feel free to contact me."
AFAICT, this is essentially saying that if you don't want your information to be processed by this Hatlas service, you need to cease all participation in the Fedora project, then request removal of your data, so that future Fedora data sources consumed by Hatlas no longer have your info. Urgh :-(
With Hatlas run as a 3rd party service, as opposed to an official Fedora service, I expect it could run into GDPR compliance problems with this attempt to outsource data removal requirements to Fedora.
In particular, many of these datasets include usernames and records of user activity tied to those usernames, e.g. the contents and exact timing of forum posts, git commits, group membership changes, etc. My current questions are:
- Does an arbitrary username (not necessarily tied to a real name)
constitute PII which must be protected / anonymized? It is not currently anonymized in Fedora datasets.
FWIW, the question of ties to a real name is explicitly mentioned in GDPR guidance in the UK[1]
"An individual’s social media ‘handle’ or username, which may seem anonymous or nonsensical, is still sufficient to identify them as it uniquely identifies that individual. The username is personal data if it distinguishes one individual from another regardless of whether it is possible to link the ‘online’ identity with a ‘real world’ named individual."
- How does GDPR view downstream users of public data sources, i.e. Hatlas?
Is Hatlas a "data processor"? Must Hatlas integrate with Fedora's Personal Data Removal process? We intend to do so, but there seems to be no obligation for either party.
If Hatlas is run independently of the Fedora project, my expectation would be that it must directly provide a data removal process, and cannot rely on outsourcing it to "upstream" data sources (Fedora).
With regards, Daniel
[1] https://ico.org.uk/for-organisations/uk-gdpr-guidance-and-resources/personal...
Thanks Daniel. Your concern is entirely valid and I share it, which is why I've started this thread :)
As far as I can tell, there is no license on this data. I'm not a lawyer, and certainly not one skilled in US + EU law, so I don't know what rights are granted by default. But since this data is currently "published" by Fedora, I believe that any entity is at minimum allowed to "read" this information, and that no obligations exist thereafter regarding what they've "learned". Meaning - any evil entity (especially one outside of GDPR jurisdiction) can currently ingest this data and do whatever they want with it within their own system, and would be under zero obligation to execute PDRs. Ironically, it's the re-publishing that Hatlas does which is most obviously protected by default copyright etc, to my understanding. It's easier to be evil than open, as it stands today.
This is exactly the sort of concern that I'd like clarification on.
I also want people to understand that if they see something in Hatlas they don't like, deleting it from Hatlas does nothing to protect it -- it has to get deleted "upstream". I'll make that more explicit in the FAQ.
Thanks again for raising your concern here. I believe it's helpful for others to see the sort of conversations that Hatlas is spurring.
Michael Winters
On Wed, Nov 12, 2025 at 08:49:29AM -0600, Michael Winters wrote:
But since this data is currently "published" by Fedora, I believe that any entity is at minimum allowed to "read" this information, and that no obligations exist thereafter regarding what they've "learned". Meaning - any evil entity (especially one outside of GDPR jurisdiction) can currently ingest this data and do whatever they want with it within their own system, and would be under zero obligation to execute PDRs. Ironically, it's the re-publishing that Hatlas does which is most obviously protected by default copyright etc, to my understanding. It's easier to be evil than open, as it stands today.
NB, wrt jurisdiction, the important criterion is the location of the person whose data is being processed, not the location of the entity doing the processing.
IOW, if the data processor is in the US, but is handling PII related to a person in the EU, the GDPR applies. How violations can be enforced is more questionable, but the rules are nonetheless intended to apply. IIUC the GDPR would even apply to any data about non-EU citizens for periods when they are travelling in the EU.
I also want people to understand that if they see something in Hatlas they don't like, deleting it from Hatlas does nothing to protect it -- it has to get deleted "upstream". I'll make that more explicit in the FAQ.
That is certainly true, but at the same time, I don't find that to be a particularly compelling rationale to put forward to justify Hatlas continuing to hold the data. It comes across badly as a message IMHO.
Even when all the source data is publicly available, there is a material difference between that data being spread around 100s of individual systems, vs a system which proactively collects & aggregates the data from 100s of systems into 1 place, and provides a data mining frontend.
In the former case one has privacy-through-obscurity. Not perfect & vulnerable to malicious exploitation, but nonetheless a meaningful level of privacy for many people, much of the time.
In the latter case one potentially has a form of dragnet surveillance in the extreme case. NB I'm not saying that's what Hatlas is, just talking in general terms about data aggregation & mining systems that process public data.
People can quite reasonably be ok with the former situation, but be unhappy with the latter situation.
There is data privacy precedent here with search engines. They can be required to remove results that are personally related to individuals, even if the article(s) indexed by the search engine were all public & continue to remain public & could in theory be indexed by a different search engine.
With regards, Daniel
I plead complete ignorance of the international legalities and look forward to Red Hat's guidance, but as a former infosec professional I find negative comfort in security / privacy through security. It is a *false* sense of safety, making people perceive safety where there is none and avoid action where it is warranted. In other words, it is a dangerous lie that we tell ourselves.
If the reality of the situation makes people uncomfortable then they should change that reality, rather than pretend that it is something else. Deleting Hatlas would be the equivalent of choosing anaesthesia without actually healing the wound. (And inviting others to create more unfelt wounds.) The wound would only fester, and the harm would spread.
I apologize that this is deeply unsatisfying. Discomfort is motivating -- that is why it exists, and it's motivating me to focus on solving the root of the issue. I hope that others are able to see that I'm asking the legal experts here to help with that diagnosis, and I ask for patience as we work through it.
Michael Winters
On 11/12/25 11:47 AM, Michael Winters via legal wrote:
I find negative comfort in security / privacy through security.
Clarification: I find negative comfort in security / privacy through *obscurity*.
Don't juggle legal threads and toddlers before your coffee is empty, folks.
(Thanks Joe for pointing this out.)
MW
Hi all,
I'm wading in with my Fedora Council hat on. I don't have much to add, but I want to acknowledge the work that Michael is doing as part of the Community Ops Team[1] / Data Working Group. It extends from work that began with me and Robert Wright in 2024[2] to organize a community analytics team to help us better understand our contributor community and better measure whether we are growing our community or not through investments like events that we fund or campaigns that we run to onboard new contributors.
The work that Michael is doing is strategically important to Fedora. While it is true that Michael has built a new way to access this data, anyone who has a Fedora Account System account and is familiar with it, has knowledge of data concepts and Python or R experience, and has a lot of patience could also do the things he is already doing. I believe there is also some historical context from ten-ish years ago, when GDPR first came into effect, about how the data we collect to build an operating system in public is essential. However, I am not a lawyer and cannot give legal advice here. Mostly, I want to acknowledge this as useful, helpful work, and if we can provide a sanctioned pathway for Michael to move forward, it would take some anxiety and stress off his and others' shoulders.
Thanks!
[1] https://docs.fedoraproject.org/en-US/commops/
[2] https://fedoraproject.org/wiki/Initiatives/Community_Ops_2024_Reboot
Sorry for the delayed reply here, I have been swamped with other work and life items, as well as pondering on what I can and should say here about all this.
First, some disclaimers: I am definitely not a lawyer, although I was present when we set up our GDPR handling process 7+ years ago. This process was created in consultation with legal folks as well as infrastructure folks for implementation. 7 years is a long while, so I could be misremembering or not being as detailed as I might wish.
First, the data we are talking about here is indeed completely public. You can subscribe to a mailing list and get all the posts locally, you can pull a git repo and have all the commits. You can subscribe to the message bus and see messages going across it. I understand the point downthread about different aggregations of data, but just wanted to make clear that this data is publicly available and anyone can get it.
My recollection of things is that we determined that the Fedora Project had a legitimate business interest in maintaining the integrity of this data. It's part of our core mission to create a community of open source developers that collaborate, organize, discuss and make changes in releasing a collection of open source software. If we remove chunks of data, our mission is compromised: we can no longer see how something was proposed, discussed, decided and then how the changes were made.
So, except in exceptional cases, we do not remove this public data. (Mailing list posts have been deleted in exceptional cases.) For applications that allow us to anonymize users, we do that on request. The only application I know of that handles this is Discourse.
...snip...
In particular, many of these datasets include usernames and records of user activity tied to those usernames, e.g. the contents and exact timing of forum posts, git commits, group membership changes, etc. My current questions are:
- Does an arbitrary username (not necessarily tied to a real name) constitute PII which must be protected / anonymized? It is not currently anonymized in Fedora datasets.
My understanding: no. The username is public. Other information attached to an account may be PII and can be removed on request, leaving the username as part of our legitimate business needs.
- Do current Fedora policies permit collecting user activity tied to usernames? This is not explicitly stated under "Information We Collect", though it is mentioned later under "Using (Processing) Your Personal Data."
Yes. This could be more clear/much more explicit.
- Do current Fedora policies permit publishing user activity tied to usernames? Section "Sharing Your Personal Data" does mention "For research activities", but it does not specify that data must be shared *only* in aggregate.
IMHO, yes. Should be more clear/explicit.
- How does GDPR view downstream users of public data sources, i.e. Hatlas? Is Hatlas a "data processor"? Must Hatlas integrate with Fedora's Personal Data Removal process? We intend to do so, but there seems to be no obligation for either party.
I don't know the answer here.
- Are there any data licenses applicable to downstream users such as Hatlas? I intend to apply one restricting the use of Hatlas data to non-commercial purposes, but there seem to be no restrictions coming from Fedora.
Or here.
However, there's some semantics here: Is this a separate project? You are working on this in the context of Fedora with Fedora resources (once the POC is done), so a good argument could be made that it's just another Fedora application run by Fedora. Probably still doesn't answer your questions above, but thought I would mention that.
Thanks for opening this discussion.
I think we could definitely clarify things in our privacy policy and confirm other things. Unfortunately, I don't think that work can happen here, we will need to discuss it with internal legal folks.
Thanks for the response Kevin. The implicit points about "legitimate business purposes" are exactly the sort of thing that lay people like myself need to understand. And yes, everything should be made much more explicit.
One request if you have time (hah!): Any old emails or other materials you can dig up and share on the topic would be super helpful.
However, there's some semantics here: Is this a separate project? You are working on this in the context of Fedora with Fedora resources (once the POC is done), so a good argument could be made that it's just another Fedora application run by Fedora.
That's a great question, and I don't know the answer from a legal standpoint. (Who is "in" Fedora vs. "collaborating with" vs "in communication with"? If Meta makes a FAS account, are they "in" Fedora?)
There are probably only two answers: this is either a separate project, or it isn't. Judging from the concerned comments I've been getting, we need to provide answers for both scenarios.
FYI I have generally portrayed Hatlas as "downstream" and separate mainly because I don't want people to feel a false sense of security with, "it's ok, it's an official Fedora project". I would feel disingenuous allaying those concerns when the reality is that anyone else could do whatever they want with this data today. (At least as far as: there are no technical measures preventing it, and seemingly none legal either.)
Of course, I'd *really* like to not get sued if the obscure legalities shake out against me :). So let me say again: my intention here *is* for this to be a POC for a Fedora project. (With presumably a new non-"hat" name when it moves.) But I think it's in the casual visitors' best interest to conceive of it as separate and treat it accordingly.
I don't think that work can happen here, we will need to discuss it with internal legal folks.
That's frustrating, but totally understandable.
Hatlas has generated just as much excitement in our CommOps community as it has concern outside of it. We're able to make real progress on our goals, far more quickly than ever before. And we are able to invite into the work those who have previously offered to help but been stuck on the sidelines.
I've been holding off on adding more data to Hatlas in the hopes of quick answers here, but it seems clearer by the day that there are none.
In lieu of an authoritative legal response, yourself and others in Fedora leadership have all stated that you believe this data to be freely available for any purpose, and especially those serving the Fedora community. As such, I plan to thread that needle of excitement and concern and try to add to Hatlas the highest-value subsets of data that I can, while still not providing the full historical dataset. I've shared that plan and thought process elsewhere, but thought I'd share it here too for visibility.
Thanks again for your leadership Kevin. I hope you get to sleep someday.
On Fri, Nov 14, 2025 at 03:23:25PM -0600, Michael Winters via legal wrote:
Thanks for the response Kevin. The implicit points about "legitimate business purposes" are exactly the sort of thing that lay people like myself need to understand. And yes, everything should be made much more explicit.
One request if you have time (hah!): Any old emails or other materials you can dig up and share on the topic would be super helpful.
I don't think I have much. We worked on all this at a face-to-face meeting (I think it was before a Flock/FUDCon) where we had a bunch of people in person. There might be some notes, but I don't know that I have them. I can ask around.
However, there's some semantics here: Is this a separate project? You are working on this in the context of Fedora with Fedora resources (once the POC is done), so a good argument could be made that it's just another Fedora application run by Fedora.
That's a great question, and I don't know the answer from a legal standpoint. (Who is "in" Fedora vs. "collaborating with" vs "in communication with"? If Meta makes a FAS account, are they "in" Fedora?)
There are probably only two answers: this is either a separate project, or it isn't. Judging from the concerned comments I've been getting, we need to provide answers for both scenarios.
FYI I have generally portrayed Hatlas as "downstream" and separate mainly because I don't want people to feel a false sense of security with, "it's ok, it's an official Fedora project". I would feel disingenuous allaying those concerns when the reality is that anyone else could do whatever they want with this data today. (At least as far as: there are no technical measures preventing it, and seemingly none legal either.)
I was sort of thinking of it as a "prove the concept out" effort first, and then proposing that it be added as a "Fedora" thing, but yeah, best to be careful in the early days for sure.
Of course, I'd *really* like to not get sued if the obscure legalities shake out against me :). So let me say again: my intention here *is* for this to be a POC for a Fedora project. (With presumably a new non-"hat" name when it moves.) But I think it's in the casual visitors' best interest to conceive of it as separate and treat it accordingly.
Sure.
I don't think that work can happen here, we will need to discuss it with internal legal folks.
That's frustrating, but totally understandable.
Hatlas has generated just as much excitement in our CommOps community as it has concern outside of it. We're able to make real progress on our goals, far more quickly than ever before. And we are able to invite into the work those who have previously offered to help but been stuck on the sidelines.
Yes! Thanks for driving this forward!
I've been holding off on adding more data to Hatlas in the hopes of quick answers here, but it seems clearer by the day that there are none.
In lieu of an authoritative legal response, yourself and others in Fedora leadership have all stated that you believe this data to be freely available for any purpose, and especially those serving the Fedora community. As such, I plan to thread that needle of excitement and concern and try to add to Hatlas the highest-value subsets of data that I can, while still not providing the full historical dataset. I've shared that plan and thought process elsewhere, but thought I'd share it here too for visibility.
OK. I will try to ask questions as I can, and will let you know the answers as fully and as soon as I can.
Thanks again for your leadership Kevin. I hope you get to sleep someday.
I have finally. Things are looking up!
kevin