On Friday, 08 March 2019 at 18:01 -0800, Matthew Dempsky wrote:
On Fri, Mar 8, 2019 at 4:52 PM 'Nicolas Mailhot' via golang-dev
<golang-dev@googlegroups.com> wrote:
> > It would be nice if it were true. Unfortunately even a simple
> > command like "tell me what you know about the local code tree":
> > go list -json -mod=readonly ./...
> >
> > will abort if the CI/CD system cuts network access to make sure
> > builds do not depend on Internet state.
> The behavior you described sounds like an issue to me. However, I
> wasn't able to easily reproduce it locally: I tried running "strace -f
> go list -json -mod=readonly ./... |& grep connect" under a few
> scenarios (e.g., missing go.sum, or incomplete go.mod file), and none
> of them seemed to result in network access.
> Can you provide more detailed reproduction steps?
That's because upstream Go workflows (from a distribution POV) cheat by
blindly downloading things from the internet before they have been
checked, silently filling the go cache with them, and grandfathering
directly downloaded third-party code into a build as original source
code.
We do not allow that. We promise our users that we only make available
things we created from direct upstream source code, and third-party
things that have already passed distribution QA checks in their own
separate build.
And even then, only if those third-party things were identified as
necessary for the build¹ and installed (for Go modules) into the GOPROXY
directory used for this specific build, via system components.
At the very first steps of a build recipe, when we ask the copy of the
codebase being built "what are you, what are you composed of, what do
you need for building?", there is literally only the codebase being
inspected and the go compiler available, and GOPROXY is empty. We need
the answer to this question to translate it into system component ids,
so the build container can be populated with those system components,
making the module files available inside GOPROXY.
But in module mode, go assumes everything is available all the time, and
that anything missing can be downloaded directly from the internet; so
instead of answering the question from direct first-level elements (the
codebase and nothing else), it tries to work transitively and aborts
because second-level code is not available yet.
> > > I would have expected that if Fedora modified a library, they
> > > would give it a different version number, so that for example
> > > modifying v1.2.2 would produce v1.2.3-fedora.1
> >
> > And that's not the case for Fedora, nor RHEL, nor Debian, nor
> > pretty much any large-scale integrator,
> When I run "gcc --version" on my work Debian machine, I see:
> gcc (Debian 7.3.0-5) 7.3.0
And when you run
$ ls -l /usr/lib
you won't see any libfoo.so.x.y.z-debian or libfoo.so.x.y.z-5, because
the only libfoo.so.x.y.z allowed on the system at any given time is the
one built from the (Debian-patched) foo x.y.z source code, no matter
whether it's release 1, 2, 3 or 5 of the Debian build, and no matter the
level of Debian patching and massaging each build carries.
So giving the underlying build command the "you're building 7.3.0,
Debian build release 5" info, so it appears in the version info of the
binary, is OK and encouraged (we'll even pass it as ldflags info to make
sure it is recorded in each binary).
But having the underlying build command try to discriminate between
different build releases, or between the build releases of the
third-party artefacts brought into the build root (as would happen if
the build release were exposed to the golang semver resolution engine),
is not OK.
The underlying build command has no business doing any processing on
release info: it is system component metadata, the system component
engine will give it the release it needs to work with at any given time,
and if the build command tries to second-guess the system component
engine, things start breaking fast.
> If Debian is okay naming their version of GCC "Debian 7.3.0-5", I'd
> think they're okay releasing a patched Go package as v1.2.3-debian.1.
> You even seem to suggest this later when talking about
> "mymodule@v1.2.2 release 1" and "mymodule@v1.2.2 release 2." Russ is
> just talking about a different way of encoding those version strings.
> > because when you integrate masses of third-party code you will
> > eventually hit bugs in pretty much every component, so having to
> > run patched code at every layer is the norm, not the exception.
> The need to make changes to open source packages is familiar to
> Google. E.g., Google's internal build system incorporates a lot of
> open source packages, and many need to be modified to accommodate
> Google's internal development idiosyncrasies. Even the Go toolchain
> and standard library themselves are patched internally.
> I think it would help if you could highlight the specific technical
> hurdles you're running into trying to bundle Go packages into RPMs. I
> expect the mechanisms are in place to do what you need, but I can
> believe the tooling could use improvements to handle
> Linux-distribution-scale integration efforts better, and I think the
> Go project is interested in supporting those efforts.
Ok, thanks a lot, that would make things a lot simpler.
I'll list here the various commands we need implemented to plug Go
modules into our system. "Implemented" means by upstream Go or by
ourselves, though we would obviously prefer it to be on the upstream Go
side, because that ensures the commands are not broken in the future by
upstream Go tooling changes, that other distros use the same commands as
us, and that we can limit divergence and share the QA and fixing burden.
I believe the commands needed apt-side for Debian and Ubuntu are pretty
much the same, because apt/deb and dnf/yum/rpm have been on a convergent
evolution track for a long time, and at this point their high-level
architecture is the same, only implementation details differ (we
customarily scrounge Debian patches and fixes and they scrounge ours,
and we me-too each other's problem reports in upstream issue trackers).
Like I already wrote, our build process starts with a bare, raw,
unpacked copy of the upstream sources in a specific directory. This copy
will already have been modified to remove things we do not like and to
patch out problems². Obviously the patching out is an iterative process;
that's why each release of a build may carry slightly different patches.
This copy typically does not include any VCS info³. Since Go expects to
find the module version information there, and does not record it in the
project go.mod like other languages do, life already sucks for us,
because we need to carry a variable with this info and pass it manually
to later commands.
[available resources:
— minimal system,
— go compiler,
— empty system GOPROXY,
— prepared unpacked project sources in a specific directory]
A. We need a command to identify all the Go module trees present in the
source directory. A variation of
$ find . -type f -name go.mod
would probably work, but that lacks all the sanity checks upstream Go
may want to apply, now or in the future, to make sure a "go.mod" file is
a legitimate Go module descriptor.
[available resources:
— minimal system,
— go compiler,
— empty system GOPROXY,
— prepared unpacked project sources in a specific directory]
B. We need another command (or the same) that lists:
1. the go module name of each of those trees,
2. their consolidated first-level requirements (not the list of their
individual first-level requirements).
Probably not both at once, or if both, in a structured format we can
reprocess. The most convenient form is
<command> <directory> --provides → one module name per line
<command> <directory> --requires → one constraint
(module + associated version constraint) per line
The --requires command needs to filter out the module names already
present in the source tree, because in a multi-module X project that
contains X/foo and X/bar modules, we want to build X/foo against the
local version of X/bar, not some other one
(and we will have some auditing work on the --provides output, to make
sure every provides line actually belongs to the project being built and
is not a copy of someone else's code).
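To make the --provides/--requires split concrete, here is a minimal
sketch of the underlying logic. It is deliberately naive (no comments,
no replace/exclude directives, no malformed-file handling); a real
implementation would use the go toolchain's own go.mod parser, and the
type and function names here are invented for illustration:

```go
package main

import (
	"fmt"
	"strings"
)

// Requirement is one first-level requirement parsed out of a go.mod.
type Requirement struct {
	Module  string
	Version string
}

// ParseRequires extracts the module path and the require entries from
// go.mod content. Naive sketch only: it understands the plain
// "module", "require x vN" and "require ( ... )" forms.
func ParseRequires(gomod string) (module string, reqs []Requirement) {
	inBlock := false
	for _, line := range strings.Split(gomod, "\n") {
		line = strings.TrimSpace(line)
		switch {
		case strings.HasPrefix(line, "module "):
			module = strings.TrimSpace(strings.TrimPrefix(line, "module "))
		case line == "require (":
			inBlock = true
		case inBlock && line == ")":
			inBlock = false
		case inBlock || strings.HasPrefix(line, "require "):
			f := strings.Fields(strings.TrimPrefix(line, "require "))
			if len(f) >= 2 {
				reqs = append(reqs, Requirement{Module: f[0], Version: f[1]})
			}
		}
	}
	return module, reqs
}

// FilterLocal drops requirements whose module is provided by the
// source tree itself (the multi-module X/foo vs X/bar case).
func FilterLocal(reqs []Requirement, local map[string]bool) []Requirement {
	var out []Requirement
	for _, r := range reqs {
		if !local[r.Module] {
			out = append(out, r)
		}
	}
	return out
}

func main() {
	m, reqs := ParseRequires("module example.com/x\n\nrequire golang.org/x/text v0.3.0\n")
	fmt.Println(m, reqs)
}
```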
I lack the necessary perspective today to say how replaces should be
handled.
Local replaces clearly need ignoring: if some code needs the X
third-party module, we want it to use our curated X module, not some
project-local directory containing a fork of X.
Non-local replaces? That's less clear to me.
If a non-local replace of B by C in module A means that every downstream
user of module A will then use C instead of B when processing A's code,
that's fine with us; the fact that B exists is internal go compiler
information, we only need to know that C is needed in the A build
container and in any container that uses the A module.
But if a non-local replace of B by C in module A only affects direct A
builds, and indirect A builds will still use B, we'll just cry and yell
and curse in a quiet corner, and then either remove the replace in the A
module file, or make it transitive by patching our version of A's code
import statements to use C everywhere instead of B.
The result of the listing command will then be translated into rpm
system identifiers, and used to populate the build GOPROXY.
Translation basically means:
1. applying some form of namespacing:
“I need the go module named X” becomes “I need golang-module(X)”
(sadly the golang() namespace is already taken by GOPATH sources, and
can not be shared because go keeps GOPATH and module mode separate)
2. translating the golang version constraint syntax into rpm version
constraint syntax. That's pretty easy for semver:
“I need X@v1.2.3” becomes
“I need golang-module(X) ≥ 1.2.3 and < 2”
For non-semver constraints, like a specific hash-identified snapshot,
that's less nice: we do not allow hashes as valid version IDs rpm
side (because hash ordering can not be deduced from the hashes
themselves), so anything that absolutely requires a specific snapshot
of X will be translated into “I need golang-module(X)(commit=hash)”,
and golang-module(X)(commit=hash) is a separate ID from
golang-module(X).
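The semver half of that translation can be sketched in a few lines. The
"golang-module(...)" namespace and the ">= x.y.z, < major+1" bounds
follow the scheme described above; the function name and exact output
format are invented for illustration:

```go
package main

import (
	"fmt"
	"strings"
)

// RPMConstraints translates a Go module requirement such as
// "golang.org/x/text@v1.2.3" into rpm-style dependency constraints.
// Under semantic import versioning, anything within the same major
// version is assumed compatible, so the upper bound is major+1.
func RPMConstraints(req string) ([]string, error) {
	at := strings.LastIndex(req, "@")
	if at < 0 {
		return nil, fmt.Errorf("missing @version in %q", req)
	}
	mod, ver := req[:at], req[at+1:]
	if !strings.HasPrefix(ver, "v") {
		return nil, fmt.Errorf("not a semver version: %q", ver)
	}
	ver = strings.TrimPrefix(ver, "v")
	var major int
	if _, err := fmt.Sscanf(strings.SplitN(ver, ".", 2)[0], "%d", &major); err != nil {
		return nil, fmt.Errorf("bad major version in %q: %v", ver, err)
	}
	return []string{
		fmt.Sprintf("golang-module(%s) >= %s", mod, ver),
		fmt.Sprintf("golang-module(%s) < %d", mod, major+1),
	}, nil
}

func main() {
	cs, err := RPMConstraints("golang.org/x/text@v1.2.3")
	if err != nil {
		panic(err)
	}
	fmt.Println(cs)
}
```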
Because dependency cycles exist in the real world, and because we are
extra careful to build only from things we have already vetted, there
will be cases where we won't be able to satisfy upstream project
requirements.
If A needs B, which needs C, which needs A, we will have to build one of
A, B, C in reduced mode at some point to break the cycle (we call that
bootstrapping). No idea whether it's better to inform the listing
command that one of the module needs it reports will be ignored, or to
ignore it silently.
[available resources:
— minimal system,
— go compiler,
— system GOPROXY populated with distro-produced go modules corresponding
to identified project requirements,
— prepared unpacked project sources in a specific directory]
C. We need a go mod pack command that creates packed module files from
the unpacked project sources into a staging GOPROXY directory.
We can pass it the project version as an argument (since it can not read
it from the upstream go.mod files).
Since VCS info is not available, the time component of the info file can
not be populated with it. And we definitely do not want the command to
record the current time (because then the file content would depend on
when it was built, and we have processes that check that arch-agnostic
files produced by different architecture builders are bit-for-bit
identical, and we have security auditors that replay builds later in
their own systems and are alarmed when the replay produces different
results). And, frankly, the time info in VCSes is messy and unreliable,
and mtimes can not be used (the source preparation process can apply
patches, and that will change mtimes, even if you apply the very same
patch in two different runs).
Thus, we'd prefer this time info not to exist in our own info files, or
to be set to zero, or, if go really needs it, to pass it as a parameter
to go mod pack (but that's more manual work our side to decide what to
pass).
This command needs to operate in bulk on all the module trees previously
identified (so if it found 3 module trees, create 3 packed modules in
the staging GOPROXY), or on a subset (explicitly provided module list).
This command needs to output the list of created module files (the zip,
ziphash, info and mod files). The list file is a special case: we would
consider its file envelope as belonging to each of the versions of a
module, and the file content as belonging to no one in particular (ghost
files in rpm terms). So it can be listed or not; either way we will
apply special processing to it. Probably better to list it, for
cleanliness.
The "files created" list can be either written to stdout (in which case
the command outputs nothing else to stdout) or to a list file specified
as an argument. No particular preference, we can handle both.
This list is used to ventilate the produced files into specific system
components: upstream project A may contain sources for A/B/C, A/B/D and
A/E modules, and we can choose to put all of those in a single system
component, split them in two (A/B/C+A/B/D and A/E), or go full granular
with three system components. The level of splitting depends on the
consequences in the system component dependency graph.
This go mod pack will need to be more discriminating than just “bulk
pack every file under the go.mod root directory, .gitignore included”,
like I see go doing right now.
If a project includes font files (for example the Go font), we'll want
to expose those in /usr/share/fonts so they are not restricted to the go
compiler. If a project includes documentation files, we will deploy them
in /usr/share/doc, not in the module zip file. And so on (protobuf files
come to mind).
So, typically, we'd want go mod pack to pack *only* the module tree's Go
source code (things that can be used by the go compiler), ideally only
the source code that can be used on our platform (GOOS=linux), and to
vet the packing of any other file.
So a basic go mod pack invocation only packs Go source files and
testdata.
There should be an info or dry-run mode that says "default packing
ignored/will ignore all those files in the origin tree".
You can pass lists of regexes to the go mod pack command to either
include other non-source-code files, or to exclude already-selected
files from the packing (--include regex and --exclude regex flags).
The regexes are obviously very module-specific; they depend on the
module's resource needs and the distribution packaging policies. The
generic go tooling need not worry about those policies, just apply the
provided regexes.
Ideally, we could pass go mod pack a --without module[@version]
argument, that causes it not to pack any of the code that requires
module[@version]. That's necessary while bootstrapping, and to cull very
expensive unit or integration tests that require hundreds of third-party
modules to run. Otherwise we can approximate it via --exclude (but that
will require more manual Fedora work to identify the required excludes).
Please understand, we are not doing all this filtering and culling in
opposition to upstream Go projects. We have no special wish to deviate
from them. Our ideal world is upstream code releases that can be used
directly, as-is, without any patching or filtering.
But a lot of upstream projects lack discipline.
They will include files they have no legal right to distribute (and we
will remove those before step A).
They will include files that can be distributed, but not modified (and
entities like Fedora and Debian will think very hard about whether they
really want to relay those, because they are contrary to our free
software policies, and our users like to know that everything we ship
can be modified without legal problems).
They will reference third-party modules with incompatible (or completely
missing) licensing.
They will ship integration code that makes no sense outside of their own
information system.
They will ship broken and bitrotten project/test/example code for years.
For them, continuing to include this code within their project is free:
their go get will happily download hundreds of unvetted and unnecessary
Go modules from the internet, and no one is likely to sue them for small
legal mistakes.
But shipping this code is *not* free for us. A distribution is big
enough that it can be sued. Any failing unit test costs human time to
check that it is a harmless failure. Any module dep pulled in by this
unnecessary code is yet another software project that needs to be
checked, audited, integrated and maintained at the system component
level before it is available in the CI system.
So we need to aggressively cull anything not necessary, to keep our
integration costs and risks down.
And we really, really would appreciate it if project unit tests (+
testdata) and project production code were split into separate zip
files, with separate build requirements, so the unit test dependency
costs were optional (we'd typically pay project A's unit test costs when
integrating project A, not when integrating something that uses
project A).
For all those reasons, the go module files produced at this stage will
definitely *not* match the hashes produced by the notary, even if we do
not change a single line of Go code.
[available resources:
– minimal system,
– go compiler,
– system GOPROXY populated with distro-produced go modules corresponding
to identified project requirements
– prepared unpacked project sources in a specific directory,
— staging GOPROXY containing candidate go modules corresponding to the
built project]
D. We need a command to build the project binaries from the files in the
system GOPROXY and the unpacked project sources,
OR from the system GOPROXY and the staging GOPROXY contents
(ideally the second option, to make sure the candidate module files in
the staging directory are complete).
After this step we've hopefully finished producing files from the
prepared unpacked project sources. At this stage rpm deploys every
produced candidate file in a location that mirrors the target deployment
paths, under a specific prefix.
So:
– a file in /usr/bin is a binary installed from existing system
components
– a file in /prefix/usr/bin is a candidate binary produced from the
ongoing build
— a file in GOPROXY is a Go module installed from existing system
components
— a file in /prefix/GOPROXY is a candidate Go module file produced from
the ongoing build
[available resources:
– minimal system,
– go compiler,
– system GOPROXY populated with distro-produced go modules corresponding
to identified project requirements
– prepared unpacked project sources in a specific directory,
— candidate tree under /prefix containing all the files that will end up
in new system modules. Some of those will replace existing files in /.
]
E. We need a command to run the unit tests of every produced go proxy
module (in the staging GOPROXY). Any failure (non-zero return code)
aborts the build process.
We'd really like this command to only take into account the candidate
tree, and the existing / tree (for files that do not belong to the
target system components).
[available resources:
– minimal system,
– go compiler,
– system GOPROXY populated with distro-produced go modules corresponding
to identified project requirements
– prepared unpacked project sources in a specific directory,
— candidate tree under /prefix containing all the files that will end up
in new system modules. Some of those will replace existing files in /.
]
F. At this point rpm starts ventilating the files contained in the
staging tree into new system components. Therefore, it needs to compute
the metadata of each of those system components.
So we have a new run of the "what are you and what do you need" command,
this time on every mod/zip/ziphash/info fileset in /prefix/GOPROXY,
instead of on the content of the prepared unpacked project sources. So,
probably some variation of the command in B.
Practically, those inspection runs can not be triggered by complex
objects like filesets, only by individual files, so we'll probably
trigger them on the mod files, and assume that if the mod file is
present, the rest is likely to be there too.
rpm keeps track of where each file will end up, and attributes the
result of the query to the corresponding system component (it does not
tell the queried fileset where it will end up).
So we now have brand-new shiny system components that contain clean,
audited GOPROXY module filesets, and declare that they provide
golang-module(foo) = x.y.z, and that they need
golang-module(bar) ≥ a.b.c and < a+1.
Installing one of those system components makes the module fileset exist
in GOPROXY.
But that is not sufficient for go, because of the list files.
So we need a final command, that rpm can invoke when it adds or removes
files in GOPROXY, to recompute the list version indexes.
It could be approximated by a simple, stupid shell script, but the
upstream go mod cache code performs more sanity checks than that, and
we'd like those sanity checks to be replicated in the "update list
files" command.
And that's pretty much all. We create clean new module files, we put
them in system components, installing those components causes the module
files to exist in GOPROXY, and we'd like the compiler to make use of
those module files without bothering us with notary checks (which will
fail pretty much all the time, due to how the whole integration process
is structured). The system components are digitally signed, but the
build process has no access to the signing key (signing is done on a
separate system with an HSM, for security reasons), so the build process
can not generate detached signature files.
> I would expect the amount of manual integration work needed would
> scale with the amount of local changes required, not with the amount
> of dependencies. E.g., if you have to modify a core Go module to work
> better on Fedora, I would think you make that one change, and your
> build system and tooling would handle automatically rebuilding
> dependencies as appropriate.
Yes, that's how it works, but only because the underlying build commands
are not supposed to discriminate between Fedora build ids (releases).
We build mymodule@v1.2.2 with some level of fixing and patching
(release 1). That creates a mod/zip/ziphash/info fileset.
We build all the things that need mymodule@v1.2.2.
Some time later, a new issue causes us to adjust the fixing done for
mymodule@v1.2.2.
So we build mymodule@v1.2.2 release 2 with the new fixes. For a lot of
languages that would be enough, and would produce a new dll,
transparently used by all downstream users.
With Go static building, that creates a new mod/zip/ziphash/info fileset
for mymodule@v1.2.2 (with new hashes), and we have to perform the
additional step of rebuilding everything that uses mymodule@v1.2.2,
recursively.
We don't want the compiler to complain about the hash change, or to
compare it to some internet notary.
We can certainly pass the release id info to the pack command in C, so
it is recorded somewhere in the produced mod/zip/ziphash/info fileset
and each release fileset is unique.
That is, as long as it is only recorded for information purposes, and
the go command does not try to treat the result as different from
mymodule@v1.2.2, to discriminate between releases, or to request a
specific release.
Best regards,
¹ Because first, sometimes the on-disk representation of two components
will clash. That's not supposed to happen with Go modules, but our
system is not Go specific, so it was not specced around Go module
particularities, and who knows at this point if the module no-clash
design will be robust WRT the weird things upstreams like to invent in
the real world.
And second, we want to be sure that what we know about component
relationships is the truth, so security teams can rely on the system
component dependency graph when analysing incidents. The best way to
make sure it is the truth is to disallow anything not identified in the
CI container.
² (vendored copies of third-party code, things we can not distribute
legally, like for example arial.ttf when upstream needs a "free" font
and does not understand copyright law, etc).
³ For historical reasons, because our CI infrastructure design predates
modern VCSes, and because we absolutely do not want the build commands
to start pulling things from VCS history that do not match the version
we're attempting to build.
--
Nicolas Mailhot