On Jan 30, 2005, Jeff Johnson <n3npq@xxxxxxxxx> wrote:

> Alexandre Oliva wrote:
>> Err... By the time yum or any other depsolver decides to download a
>> package, it's already got all the headers for all packages. And I

> Yep, "already got" so let's go get the header again.

I see. Still, it's probably unwise to have the transaction verification procedure wait for all the big packages to download, or compete with them for limited bandwidth, when the headers alone would have sufficed.

An idea to overcome this without throwing away possible web-caching benefits would be to start a download of the entire rpm and, by the time you get to the end of the header, stop reading from that connection until you've completed the transaction verification. If you have a web proxy, it will likely keep downloading the entire package, and you'll end up getting the rest of it very quickly, but the downloads will be competing for bandwidth. If you don't have a web proxy, however, things may get messy: not only will you get competition for bandwidth, you'll also get competition for any limit on open connections imposed on you upstream (ISP, download server, etc). (My DSL provider, for example, won't let me establish more than 30 TCP connections simultaneously.)

>> hope you're not suggesting that yum get rpm to download *all* packages
>> just because it needs headers. *That* would be a waste of bandwidth

> Depends on how yum implements, but we agree that "all" is stupid,
> even if we appear to disagree whether headers being downloaded and
> then downloaded again is stupid.

We do agree on both counts.

>>> into /var/cache/yum/repo/packages since you already know the header
>>> byte range you are interested in from the xml metadata, thereby
>>> saving the bandwidth used by reading the header twice.

>> Hmm... I hope you're not saying yum actually fetches the header
>> portion out of the rpm files for purposes of dep resolution. Although
>> I realize the information in the .xml file makes it perfectly
>> possible, it also makes it (mostly?) redundant. Having to download
>> not only the big xml files but also all of the headers would suck in a
>> big way!

> The rpmlib API requires a header for a ride. So yes, that is exactly what
> is happening, yum is using byte ranges to pull headers from discovered
> packages where (if discovered packages are needed) both header+payload
> could be pulled together and asynchronously.

I hope you're really not saying that, if I request to install package foo, which depends on bar, it will also download the header for baz, a totally unrelated package. I can see that we'd need headers for foo and bar, but not for baz. I thought the point of the xml files and the info on provides, filelists, etc, was precisely to enable the depsolver to avoid having to download the headers for every package.

I'm wondering whether it would be possible for a depsolver to create a (smaller) .hdr file out of the info in the .xml files, and feed that to rpmlib for transaction-verification purposes. That would enable it to skip the download-header step before downloading the entire package.

> The repo data is a win over previous incarnations of yum because
> it's one, not hundreds, of files that needs to be downloaded.

It's clear to me that it's a win for a one-shot download.
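Just to make the byte-range approach concrete, here's a rough sketch of pulling only the header region of a remote package, given the header byte range the xml metadata advertises. This is not yum's actual code; the URL, offsets and cache path are made up, and the exact meaning of the start/end offsets is an assumption on my part:

    # Sketch only: fetch just the header region of a remote .rpm with an
    # HTTP Range request, using a byte range taken from the xml metadata.
    # Hypothetical URL, offsets and cache path; Python 2 / urllib2.
    import urllib2

    def fetch_header_bytes(pkg_url, hdr_start, hdr_end):
        req = urllib2.Request(pkg_url)
        # HTTP byte ranges are inclusive at both ends, hence the -1; whether
        # the metadata's end offset is inclusive or exclusive is assumed here.
        req.add_header('Range', 'bytes=%d-%d' % (hdr_start, hdr_end - 1))
        resp = urllib2.urlopen(req)
        return resp.read()

    # made-up package URL and offsets; the real ones come from the metadata
    hdr = fetch_header_bytes('http://example.com/repo/foo-1.0-1.i386.rpm',
                             440, 5468)
    open('/var/cache/yum/repo/packages/foo-1.0-1.i386.hdr', 'wb').write(hdr)

Of course this only helps if the server (or proxy) honors Range and replies with 206 Partial Content; if you get a 200 instead, you've just started downloading the whole package, which takes us back to the bandwidth competition described above.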
What's not clear is whether it's still a win once you factor in repeated downloads: fetching the entire .xml files 2-3 times a day, or every time they're updated (which rhn-applet would presumably do, although a simple listing of package N-V-Rs would be enough for it), may end up wasting more bandwidth than downloading the .hdr files once and for all.

> filtering the data (i.e. headers have changelogs and more that is
> useless baggage) is also a win.

Definitely. But couldn't we perhaps do it by intelligently filtering information out of the rpm header and, say, generating a single archive containing all of the info needed for depsolving and for rpmlib's transaction verification?

It might still be useful to have something like header.info, but compressed, and listing not only N-V-Rs but also the header byte ranges, so that one could download headers for individual packages instead of re-downloading the xml files (or equivalent) when a previously-downloaded version is still locally available and has undergone only very slight changes.

> So the suggestion was to download the package, not the header, and then
> extract the header from local, not remote storage.

I see. Good one, if we can't avoid downloading the package anyway (the download is wasted if, e.g., rpmlib ends up deciding it can't install the packages).

>> I'd be very surprised if yum 2.1 actually worked this way. I expect
>> far better from Seth, and from what I read during the design period
>> of the metadata format, I understood that the point of the xml files
>> was precisely to avoid having to download the hdr files in the first
>> place. So why would they be needed? To get rpmlib to verify the
>> transaction, perhaps?

> What do you expect? There is no way to create a transaction using rpmlib
> without a header, a header is wired in the rpmlib API. So honk at failed
> rpmlib design, not otherwise.

I was expecting depsolving not to require all the headers. And from what I gather from your reply, it indeed doesn't.

> Look, repos currently change daily, perhaps twice daily.

They actually change more often than that. rawhide changes daily, yes. FC updates sometimes change several times in a single day, and then sometimes stay put for a few days; once or twice a day, indeed. Other repos, such as dag and freshrpms, seem to me to change more often than that; at least once a day would be an accurate description for them.

> Trying to optimize incremental updates for something that changes
> perhaps twice a day is fluff.

Let me try some back-of-the-envelope calculations here. Consider an FC install that remains in place for 40 weeks (~9 months), has a user permanently running rhn-applet, and whose administrator runs up2date once a day on average. Further consider that updates are released, on average, once a day, and that, on average, only two of the 7 weekly update runs actually have new packages to install (i.e., updates are generally published in batches).

Let's consider two scenarios: 1) using up2date with yum-2.0 (headers/) repos (whoever claimed up2date supported rpmmd repodata/ misled me :-), and 2) using yum-2.1 (repodata/) repos.

1) yum 2.0

16MiB) initial download, distro's and empty updates' hdrs

8MiB) daily (on average) downloads of header.info for updates, downloaded by rhn-applet, considering an average size of almost 30KiB, for 40 weeks (both FC2 and FC3 updates for i386 have a header.info this big right now)

16MiB) .hdr files for updates, downloaded by the update installer.
Current FC2 i386 headers/ holds 9832KiB, whereas FC3 i386 headers/ holds 8528KiB, but that doesn't count superseded updates, whose .hdr files are removed. The assumption is that each header is downloaded once; 16MiB is a guesstimate that I believe to be inflated. It doesn't take into account duplicate downloads of header.info for updates, under the assumption that a web proxy would avoid downloading again what rhn-applet has already downloaded.

----
40MiB) just in metadata over a period of 9 months, total

2) yum 2.1

2.7MiB) initial download, distro's and empty updates' primary.xml.gz and filelists.xml.gz

68MiB) daily (on average) downloads of primary.xml.gz, downloaded by rhn-applet, considering an average size of 250KiB (FC2 updates' is 240KiB, whereas FC3's is 257KiB, plus about 1KiB for repomd.xml)

16MiB) .hdr files for updates, downloaded by the update installer (same as in case 1)

192MiB) filelists.xml.gz for updates, downloaded twice a week on average by the update installer, to solve filename deps.

----
278.7MiB) just in metadata over a period of 9 months, total

Looks like a waste of at least 238.7MiB per user per 9-month install. Sure, it's not a lot, only 26.5MiB a month, but it's almost 6 times as much data being transferred for the very same purpose. How is that a win? Multiply that by the number of users pounding on your mirrors and it adds up to hundreds of GiB a month. (A trivial script reproducing these figures, under the assumptions above, is appended after my signature.)

Of course there are some factors that can help minimize the wastage. For example, a web proxy serving multiple machines, one of which is updated before the others, will be able to serve the headers for yum 2.1 out of the cached .rpm files, so you transfer the headers by themselves only once for all machines, instead of once per machine. But then, yum 2.0 enables the web proxy to cache headers anyway, so this would be a win for both, and less so for yum 2.1 if you update multiple boxes in parallel.

Another factor is that you probably won't need filelists.xml.gz for every update. Maybe I don't quite understand how often it is needed, but even if I had to download it only once a month, that's still 64MiB over 9 months, more than the 40MiB total metadata downloaded over 9 months by yum 2.0.

> The rpm-metadata is already a huge win, as the previous incarnation
> checked time stamps on hundreds and thousands of headers, not one
> primary file.

I don't know how yum 2.0 did it, but up2date surely won't even try to download a .hdr file it already has in /var/spool/up2date, so this is not an issue.

> Sure there are further improvements, but busting up repo metadata
> ain't gonna be where the win is, there's little gold left in that
> mine.

repodata helps the initial download, granted, but it loses terribly in the long run.

-- 
Alexandre Oliva             http://www.ic.unicamp.br/~oliva/
Red Hat Compiler Engineer   aoliva@{redhat.com, gcc.gnu.org}
Free Software Evangelist  oliva@{lsd.ic.unicamp.br, gnu.org}
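The trivial script mentioned above, in case anyone wants to plug in their own numbers. The per-file sizes and frequencies are just the assumptions stated in the message; in particular, the ~2.4MiB size for updates' filelists.xml.gz is inferred from the 192MiB figure rather than measured, so this reproduces the arithmetic and nothing more:

    # Back-of-the-envelope metadata totals for the two scenarios above.
    # All sizes are the assumptions from the message; the 2.4MiB
    # filelists.xml.gz size is inferred from the 192MiB figure.
    WEEKS = 40
    DAYS = WEEKS * 7              # roughly 9 months
    KiB = 1.0
    MiB = 1024 * KiB              # everything below is tallied in KiB

    yum20 = {
        'initial distro+updates hdrs': 16 * MiB,
        'header.info, daily (rhn-applet)': 30 * KiB * DAYS,
        'update .hdr files': 16 * MiB,
    }

    yum21 = {
        'initial primary+filelists': 2.7 * MiB,
        'primary.xml.gz + repomd.xml, daily (rhn-applet)': 251 * KiB * DAYS,
        'update .hdr files': 16 * MiB,
        'filelists.xml.gz, twice a week': 2.4 * MiB * 2 * WEEKS,
    }

    for name, costs in (('yum 2.0', yum20), ('yum 2.1', yum21)):
        total = sum(costs.values())
        print('%s: %.1fMiB of metadata over %d weeks' % (name, total / MiB, WEEKS))

That prints roughly 40MiB for yum 2.0 and 279MiB for yum 2.1, matching the totals above give or take rounding.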