On Tue, Jan 18, 2022 at 10:42 PM Jilayne Lovejoy <jlovejoy@redhat.com> wrote:
Part of this is figuring out what we want the identifiers to signify and how much deviation from reality we want to tolerate. We really ought to start out figuring out this question and only then develop practical guidelines around it.
For example, existing Fedora guidance indicates that the License: tag should reflect the license of the appropriate binary (a very interesting convention which I've often thought is beneficial), which among other things implies that merely scanning source code will sometimes give "incorrect" results even if the scanner is somehow perfect.
The Fedora policy of license-reflects-binary is indeed a monkey wrench in using a license scanner to scan the source code. Would it be possible to change the question or analysis instead to: the License: tag should reflect the license for the code (whatever format, source or binary) that is actually distributed in Fedora?
To be clear, Fedora has "binary RPMs" and corresponding "source RPMs". The binary RPM, the thing you install on your system so that you can run some executable or library or whatever, is built from the source code that ends up being packaged in the source RPM. The binary RPM doesn't necessarily contain any object code, though usually it does, and might have source files. So that's one thing. I think the policy doesn't refer to binary (object) code per se but rather attempts to describe the licensing of all the code distributed in a binary/installable RPM, which implies some extra degree of legal analysis you wouldn't need to undertake if you are just looking at, say, a source repository or the equivalent in a package format. Fedora distributes both things (though normally when you install a package you are not also getting a copy of the source RPM).
A couple of issues here:
One is that the corresponding source tarball will (depending on the technology) very often have files under a larger set of licenses than what's in the binary. The upshot is that lots of packages in Fedora that might just say "MIT" today (and let's assume for the sake of argument that Callaway™ MIT is SPDX MIT in all these cases) would have to say something like "MIT AND GPL-2.0-or-later AND GPL-2.0-or-later WITH Autoconf-exception-2.0 AND FSFAP AND FSFULLR". That's certainly going to be in the set of stuff FOSSology and ScanCode will tell you is there. So one question is whether it is useful to have that rather complex SPDX expression, particularly if some of it concerns stuff that the user will never have on their system anyway. And these are cases where we can say the software is normally thought of as being "not GPL".
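To make the source-versus-binary gap concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the file names, the per-file license findings, and the set of files that feed the shipped binary are hypothetical, not taken from any real Fedora package or scanner output. It also assumes no individual finding contains an OR, so a flat AND join is valid without parentheses.

```python
def spdx_and(licenses):
    """Join distinct license identifiers into a flat SPDX AND expression."""
    # dict.fromkeys de-duplicates while preserving first-seen order.
    unique = list(dict.fromkeys(licenses))
    return " AND ".join(unique)


# Hypothetical per-file findings, as a scanner might report for a tarball.
scan = {
    "src/main.c": "MIT",
    "src/util.c": "MIT",
    "build-aux/helper-a": "GPL-2.0-or-later",
    "build-aux/helper-b": "GPL-2.0-or-later WITH Autoconf-exception-2.0",
    "m4/macro.m4": "FSFAP",
    "configure": "FSFULLR",
}

# Hypothetical result of a human deciding which files end up in the binary RPM.
shipped = {"src/main.c", "src/util.c"}

# Source-oriented description: every license found anywhere in the tarball.
source_expr = spdx_and(scan.values())

# Binary-oriented description: only licenses of files that feed the binary.
binary_expr = spdx_and(lic for path, lic in scan.items() if path in shipped)

print("source:", source_expr)
print("binary:", binary_expr)
```

The scanner can only produce the first expression; the second depends on the `shipped` set, which is exactly the extra analysis the binary-oriented policy asks for.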
The other thing is that the binary rule means that you can have spec files that have different License: tags for different subpackages. A common case in Linux distros is to have a (source) package associated with multiple binary packages that might be under different licenses (based on analysis of compilation of the binaries), so for example one subpackage might be built from GPL version 2 code and another might be built from LGPL version 2.1 code. And that distinction might actually be useful for some users. (In reality, though, I think the fidelity to this approach to license description varies widely across Fedora packages, which might signify that it's actually too complex to be consistently workable.) If you switch to a pure source code-oriented license description standard, you necessarily lose that type of information, since the LGPL library and the GPL daemon or utility or whatever are built from the same source code, which will be a mix of GPL and LGPL. So you'd end up with a single License tag, say "GPL-2.0-or-later AND LGPL-2.1-or-later" (or, actually, "GPL-2.0-or-later AND LGPL-2.1-or-later AND GPL-2.0-or-later WITH Autoconf-exception-2.0 ..." etc.) instead of two License tags, where the main package might be GPL-2.0-or-later and the library subpackage might be LGPL-2.1-or-later.
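The loss of per-subpackage information can be sketched the same way. This is a simplified Python illustration under stated assumptions: the package name `foo`, the subpackage `foo-libs`, the file paths, and the mapping from subpackages to the source files they are built from are all hypothetical, and in practice that mapping comes from human analysis of the build rather than from a scanner.

```python
def spdx_and(licenses):
    """Join distinct license identifiers into a flat SPDX AND expression."""
    # dict.fromkeys de-duplicates while preserving first-seen order.
    return " AND ".join(dict.fromkeys(licenses))


# Hypothetical license findings for one source package.
file_license = {
    "daemon/food.c": "GPL-2.0-or-later",
    "lib/foo.c": "LGPL-2.1-or-later",
    "lib/foo.h": "LGPL-2.1-or-later",
}

# Hypothetical analysis of which sources each binary subpackage is built from.
subpackage_inputs = {
    "foo": ["daemon/food.c"],
    "foo-libs": ["lib/foo.c", "lib/foo.h"],
}

# Binary-oriented: one License tag per subpackage.
per_subpackage = {
    name: spdx_and(file_license[path] for path in paths)
    for name, paths in subpackage_inputs.items()
}

# Source-oriented: a single tag covering the whole source package.
source_wide = spdx_and(file_license.values())

for name, expr in per_subpackage.items():
    print(f"{name}: {expr}")
print("source-wide:", source_wide)
```

The source-wide tag is correct for the tarball, but a user who installs only `foo-libs` can no longer tell from it that the library itself is LGPL.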
(BTW I am not a Fedora packager nor do I play one on TV so I welcome any corrections to misstatements in the above. :-)
I have not really used ScanCode and have more familiarity (even if a bit outdated) with FOSSology. It is true that many scanner results require some amount of "reconciliation" as I call it - that is, manually inspecting results that are ambiguous in some way. Often, there is an easily human-identifiable "answer", but it still requires some looking. That being said, if some/most package maintainers are actually looking at all the files in a more manual way, using a scanner would be a big improvement over that. FOSSology and, I believe, ScanCode both have the capability to output scan results in various formats, including an SPDX document.
A dilemma here is that using scanners, at least the kind that give the best results, is kind of a burden. It is worth bearing in mind that scanners, including the open source ones, have been developed mainly for internal use by specialized personnel within companies, and the use of scanners by open source projects is pretty limited. I don't think we can expect Fedora packagers to try to get a FOSSology instance running and to figure out how to use it once they do (basing this on personal experience). Even ScanCode might be a bit challenging to work with. Simpler tools, however, might give suboptimal results. I can imagine Fedora running a scanner-as-a-service (do any other community distros do this?) -- maybe even something like an instance of ClearlyDefined -- but that is probably far-fetched.
Richard