On Mon, Jun 29, 2015 at 02:01:46PM +0000, Matt_Domsch@xxxxxxxx wrote:
> > I am under the impression currently that dropping un-necessary
> > directories would save DB space (the directories being then linked in
> > the host_category_dir table listing for each host, in each category
> > which dir are present) as well as crawling time (both in the UMDL and in
> the crawler).
> >
> >
> > == MD == You need non-repo directories for ISOs at least; there was a
> time when we were able to mirror the entire Fedora static web content too;
> able only because MM tracked all directories, not just repository
> directories. MM1 also tried to be a "generic" mirror manager, not just a
> Fedora-specific mirror manager, so I intentionally tracked everything, not
> just Yum repos.
>
> Idea: what if we were tracking only the folders that have files in them,
> so for example http://dl.fedoraproject.org/pub/epel/5/ would not end-up in
> the database.
>
> In addition, we could add a sort of blacklist to avoid storing
> http://dl.fedoraproject.org/pub/ just due to the presence of the
> DIRECTORY_SIZES.txt file
>
> This would reduce the number of directories we store for the Atomic tree.
>
> == MD == I didn't optimize for a few non-file-containing directories.
> You're welcome to if you see a need. But it's saving just a few entries
> out of hundreds/thousands.

I got curious to see how it looks in reality, so I wrote a quick Python
script that goes through the entire tree and counts the number of folders,
files, and folders with no files in them. This is what the results look
like:

fedora
  1814562 files found
  4460 folders found
  293 folders w/o files
  Ran in 11.111 min

epel
  222697 files found
  492 folders found
  19 folders w/o files
  Ran in 1.400 min

alt
  1309830 files found
  13692 folders found
  1614 folders w/o files
  Ran in 4.633 min

fedora-secondary
  3774530 files found
  5576 folders found
  651 folders w/o files
  Ran in 26.701 min

archive
  2705931 files found
  3095 folders found
  351 folders w/o files
  Ran in 22.042 min

Total time: 65.887 min

So it would only save a few hundred entries in the directory table, but it
should still save some space in the host_category_dir table, which holds
one row per directory for each host in each category.

Also, seeing this, it feels to me that we should be more flexible about
which part of the tree we run against; it could even be a sub-part (i.e. a
specific secondary arch or so).
I would also like to see if we can parallelize the browsing of the tree.

Pierre
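
The counting script itself was not posted to the list. For reference, here is a minimal sketch of a walk that would produce counts in the same shape. It is not the original script. It assumes a locally mounted copy of the tree and takes "folders w/o files" to mean directories that directly contain no files (they may still contain sub-directories). The names count_tree and format_report are hypothetical.

```python
import os


def count_tree(root):
    """Walk the tree under root and return a tuple
    (files, folders, folders_without_files).

    The root directory itself is counted as a folder. A folder counts as
    "without files" when it directly contains no files, even if it has
    sub-directories. Symlinked directories are not followed (os.walk
    default), which is an assumption about the desired behaviour.
    """
    n_files = 0
    n_folders = 0
    n_without_files = 0
    for _dirpath, _dirnames, filenames in os.walk(root):
        n_folders += 1
        n_files += len(filenames)
        if not filenames:
            n_without_files += 1
    return n_files, n_folders, n_without_files


def format_report(name, counts, minutes):
    """Format one result block in the layout used in this mail."""
    n_files, n_folders, n_without_files = counts
    return (
        f"{name}\n"
        f"  {n_files} files found\n"
        f"  {n_folders} folders found\n"
        f"  {n_without_files} folders w/o files\n"
        f"  Ran in {minutes:.3f} min"
    )
```

Because count_tree takes any directory as its root, the same function can be pointed at a sub-part of the tree, for example a single secondary arch, rather than a whole category.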