Software Management call for RFEs
nkadel at gmail.com
Mon May 27 12:53:53 UTC 2013
On Mon, May 27, 2013 at 6:17 AM, Florian Weimer <fweimer at redhat.com> wrote:
> On 05/27/2013 11:48 AM, Zdenek Pavlas wrote:
>>> And there are package diffs, which are ed-style diffs of the
>>> Packages file I mentioned above. This approach would work quite well
>>> for primary.xml because it doesn't contain cross-references between
>>> packages using non-natural keys. It doesn't work for the SQLite
>>> database, either in binary or SQL dump format, because of the reliance
>>> on artificial primary keys (such as package IDs).
>> I once tried this. With about 10k packages in fedora-updates, the delta
>> over 2-3 days was +491 -479. Assuming deletions are cheap, the delta
>> would ideally be about 5%. As expected, binary bsdiff yields much bigger (~29%) deltas.
> A line-wise diff is much smaller because dependencies and package
> descriptions mostly stay the same. (This assumes consistent sorting of the
> primary.xml file.)
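Florian's point can be illustrated with a toy sketch using Python's stdlib difflib. The package names and the one-entry-per-line "listing" format here are made up for illustration; real primary.xml entries are multi-line XML, but the principle is the same: with consistent sorting, only changed entries show up in a line-wise delta.

```python
import difflib

# Hypothetical miniature package listings, kept sorted: one package's
# version bumps, every other entry is unchanged.
old = ["bash-5.1-1", "coreutils-9.0-2", "kernel-6.1-7", "zsh-5.9-1"]
new = ["bash-5.1-1", "coreutils-9.0-3", "kernel-6.1-7", "zsh-5.9-1"]

diff = list(difflib.unified_diff(old, new, lineterm=""))
changed = [l for l in diff
           if l.startswith(("+", "-")) and not l.startswith(("+++", "---"))]

# Only the bumped entry contributes to the delta; descriptions and
# dependencies of untouched packages cost nothing.
print(changed)  # ['-coreutils-9.0-2', '+coreutils-9.0-3']
```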
The "diffs" are not the problem. The problem is the excessively
frequent downloads of the repodata, which are compressed binaries with
checksums, not published as deltas or diffs. The result is a grossly
inefficient and far-too-frequent download of upstream repository
information. It's not the local SQLite database operations in the yum
cache that are killing me, at least: it's the short "metadata_expire" setting.
Very few of us really need our metadata expired between our first cup
of coffee in the morning and lunchtime. And very few of us need
yum-updatesd and other auto-magic update tools grinding our hosts' and
our proxies' bandwidth for several 20-megabyte files every few hours,
nor grinding our local disk with the uncompressed *80-megabyte*
primary.xml sitting in /var/cache/yum/.
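For reference, that expiry is tunable globally or per-repo; a sketch of a longer interval in /etc/yum.conf (the 12h value is purely illustrative, not a recommendation):

```ini
# /etc/yum.conf -- illustrative values only
[main]
# Treat cached repodata as fresh for 12 hours, so routine yum runs
# reuse the local cache instead of re-downloading the repodata.
metadata_expire=12h
```

Per-repo `metadata_expire=` lines in /etc/yum.repos.d/*.repo override the global value, so a vendor-shipped short expiry can undo this.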
This is an inherent problem with the "let's store more, and more, and
more data in the database" approach. The yum databases have gotten
bulky and cumbersome, and the automatic churn involved in refreshing
them with new repodata has become quite large.
> Can you point me to the primary.xml -> SQLite translation in yum? I've got
> a fairly efficient primary.xml parser. It might be interesting to see if
> it's possible to reduce the latency introduced by the SQLite conversion to
> close to zero. (Decompression and INSERTs can be interleaved with
> downloading, and maybe the index creation improvements in SQLite are
> sufficient these days.)
Good luck with that! It's not what I, personally, was just looking
for, but improving that would be nice. The SQLite files, though, keep
getting larger and larger: at 80 MB and counting for the Fedora
primary.sqlite, it's getting out of hand.
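The interleaving Florian suggests can be sketched with the stdlib. The one-"name|version"-per-line wire format below is a stand-in for parsed primary.xml entries, not yum's real format; the point is that decompression and INSERTs proceed as the stream arrives, inside a single transaction, instead of materializing the whole file first.

```python
import gzip
import io
import sqlite3

# Stand-in for a compressed metadata download (in reality this would
# be streamed from the network, not built in memory).
compressed = gzip.compress(b"bash|5.1\ncoreutils|9.0\nkernel|6.1\n")

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE packages (name TEXT, version TEXT)")

# Decompress line by line and INSERT each row as soon as it is
# available; the connection context manager wraps it in one commit.
with gzip.open(io.BytesIO(compressed), "rt") as stream, db:
    for line in stream:
        name, version = line.rstrip("\n").split("|")
        db.execute("INSERT INTO packages VALUES (?, ?)", (name, version))

print(db.execute("SELECT COUNT(*) FROM packages").fetchone()[0])  # 3
```

Deferring index creation until after the bulk INSERTs, as Florian hints, is the usual complementary trick.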
>>> However, for many users that follow unstable or testing, package diffs
>>> are currently slower than downloading the full Packages file because the
>>> diffs are incremental (i.e., they contain the changes from file version
>>> N to N+1, and you have to apply all of them to get to the current
>>> version) and apt-get can easily write 100 MB or more because the
>>> Packages file is rewritten locally multiple times.
>> Yes, patch chaining should be avoided. I'd like to use N => 1 deltas
>> that could be applied to many recent snapshots.
And it has little to do with the yum issue I've raised, which is
entirely about the metadata. In my personal experience, any bandwidth
or resource cost of "deltarpms" is completely overshadowed by the
constant grinding of disk and bandwidth for the metadata.
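Zdenek's "N => 1 deltas" idea can be sketched with difflib: instead of a chain of incremental patches (v1 -> v2 -> v3 -> ...), the server would publish one direct delta from each recent snapshot to the current version, so any reasonably fresh client applies exactly one patch. The snapshot names and package lists below are invented for illustration.

```python
import difflib

# Two hypothetical recent snapshots of a (sorted) package list, plus
# the current version the server wants everyone to reach.
snapshots = {
    "v1": ["pkg-a-1", "pkg-b-1"],
    "v2": ["pkg-a-1", "pkg-b-2"],
}
current = ["pkg-a-2", "pkg-b-2"]

# One delta per recent snapshot, each targeting "current" directly.
deltas = {v: list(difflib.ndiff(old, current))
          for v, old in snapshots.items()}

def apply_delta(delta):
    # difflib.restore(delta, 2) rebuilds the "new" side of an ndiff.
    return list(difflib.restore(delta, 2))

# A client on either snapshot reaches the current list in one step,
# with no patch chaining.
print(apply_delta(deltas["v1"]) == current)  # True
print(apply_delta(deltas["v2"]) == current)  # True
```

The trade-off is server-side: the mirror stores N deltas per release instead of one, in exchange for single-step client updates.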