On Fri, Jun 26, 2015 at 06:11:44PM +0200, Pierre-Yves Chibon wrote:
> Yesterday and today I spent a little time going over the UMDL script of
> MirrorManager2.
> Going through it, I ended up with a few questions regarding it.
>
> * Repository name
> UMDL's code clearly says:
>     # historically, Repository.name was a longer string with
>     # product and category deliniations. But we were getting
>     # unique constraint conflicts once we started introducing
>     # repositories under repositories. And .name isn't used for
>     # anything meaningful. So simply have it match dir.name,
>     # which can't conflict.
> And quickly grepping through MM2's sources, I could not find a reference to
> this, we always rely on the repository's prefix, not its name.
>
> Question: Should we drop this?
> It makes things confusing and is basically noise since we do not use it
> anywhere.

It was a helpful column for fixing errors with the repos. But as the
database is so huge, everything we can drop should be dropped.

[...]

> * The directory table
> So looking at the database and more precisely the directory table in that
> database, it seems we store all the directories of the tree, i.e.:
>     /pub/alt/
>     /pub/alt/anaconda/
>     /pub/alt/bfo/
>     /pub/alt/bfo/gpxe-20120514
>     ...
> This makes me ponder a little. What is the benefit of keeping the whole
> list of directories in the DB?
> After all, as far as I understand, the UMDL finds the repos in the tree (a repo
> being defined by the presence of a 'repodata' folder containing the repomd.xml
> or by the presence of a 'summary' file and an 'objects' folder).
> For these repos, we look for the most recent files, store this info in the DB
> and later use it to check if the mirrors are up to date.
>
> But do we need to check that ``pub/fedora/linux`` exists when we later check
> that ``pub/fedora/linux/updates/testing/21/x86_64/`` exists and is up to date?
>
> I am under the impression currently that dropping unnecessary directories
> would save DB space (the directories being then linked in the
> host_category_dir table listing, for each host, in each category, which dirs
> are present) as well as crawling time (both in the UMDL and in the crawler).

Again, dropping unnecessary information from the database sounds good.
This one sounds a bit more complex, though, as you always have to delete
directories when subdirectories appear and add directories when
subdirectories disappear.

> * Non-directory based support in UMDL.
>
> So the UMDL script currently supports three ways of crawling the tree:
> * file
> * rsync
> * directory
>
> We, in Fedora, are only using the last one. I believe the `rsync` mode was
> added to support Ubuntu and the file mode is basically a simplified version
> of the directory mode, but one that we do not use at the moment.
>
> I would like to propose that we drop support for rsync. I feel that it may be
> simpler and easier to create a UMDL and a crawler for each distro that would
> like to use MirrorManager than maintaining a one-script-fits-all UMDL that is
> in fact tested for only one of the scenarios.
> That being said, if we ever have interest from Ubuntu, CentOS or any other
> communities, we should definitely look into making the UMDL and crawler as
> re-usable as possible for them, but keeping the distro-specific bits
> separated.

As already mentioned, RPM Fusion uses the rsync mode as the master
mirror is 'far' away from the MirrorManager installation. It is still
using MM1 on CentOS 5 and currently I am not planning on upgrading to
MM2 immediately. So it could be removed and I should be able to write
the necessary umdl rsync crawler once I need it.

Another thought I had about umdl concerns the file mode. For the
categories 'Fedora EPEL' and 'Fedora Linux' we have files called
'fullfilelist'. Maybe using those would be an option for umdl to reduce
I/O on the NFS mounts.
umdl would then only actually read the files and metadata from NFS when
it is necessary.
Just one of those ideas.

		Adrian
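
To make the 'fullfilelist' idea a bit more concrete, here is a minimal
Python sketch. It is not MirrorManager code: the function names are made
up for illustration, and it assumes the file contains one file path per
line, relative to the top of the category (the real format may differ).
It derives the directories and the repositories (a 'repodata' folder
containing repomd.xml) from the listing alone, without walking the tree
on NFS.

```python
import posixpath


def dirs_from_filelist(lines):
    """Return every directory implied by a list of file paths.

    Assumption: one relative file path per line, e.g.
    "./updates/21/x86_64/repodata/repomd.xml".
    """
    dirs = set()
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue  # skip blank lines
        parent = posixpath.dirname(posixpath.normpath(stripped))
        # Record the parent directory and all of its ancestors.
        while parent and parent not in dirs:
            dirs.add(parent)
            parent = posixpath.dirname(parent)
    return dirs


def repos_from_filelist(lines):
    """Return directories that contain 'repodata/repomd.xml'."""
    repos = set()
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue
        head, tail = posixpath.split(posixpath.normpath(stripped))
        if tail == "repomd.xml" and posixpath.basename(head) == "repodata":
            # "." stands for the top of the category itself.
            repos.add(posixpath.dirname(head) or ".")
    return repos
```

With something like this, umdl would only need to touch NFS for the
directories returned by repos_from_filelist(), for example to read the
repomd.xml itself.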
_______________________________________________
infrastructure mailing list
infrastructure@xxxxxxxxxxxxxxxxxxxxxxx
https://admin.fedoraproject.org/mailman/listinfo/infrastructure