This is great stuff! I think there is a huge need for this. It's amazing how much faster basic coreutils operations can be, even when just using one client as you show.
I've had a lot of success using dftw[1,2] to speed up these recursive operations on distributed filesystems (conditional finds, lustre OST retirement, etc). It's MPI-based (hybrid, actually) and many such tasks really scale well, and keep scaling when you spread the work over multiple client nodes (especially with gluster it seems). It's really fantastic. I've done things like combine it with mrmpi[3] to create a general mapreduce[4] for these situations.
Take for example du: standard /usr/bin/du (or just a find that prints size) on our 10-node distributed gluster filesystem takes well over a couple of hours for trees with tens of thousands of files. We've cut that down over an order of magnitude, to about 10 minutes, with a simple parallel du calculation using the above[5] (running with 16 procs across 4 nodes, ~85% strong scaling efficiency).
Like you, I have hopes of making a package of such utilities. Probably your threaded model will make this much more approachable, though, and elastic, too. I'll happily try out your tools when you're ready to post them, and a bet a lot of others will, too.
Best,
John
On Wed, Apr 16, 2014 at 10:31 AM, Joe Julian <joe@xxxxxxxxxxxxxxxx> wrote:
Excellent! I've been toying with the same concept in the back of my mind for a long while now. I'm sure there is an unrealized desire for such tools.
When your ready, please put such a toolset on forge.gluster.org.
On April 16, 2014 6:50:48 AM PDT, Michael Peek <peek@xxxxxxxxxxx> wrote:Hi guys,
(I'm new to this, so pardon me if my shenanigans turns out to be a waste of your time.)
I have been experimenting with Gluster by copying and deleting large numbers of files of all sizes. What I found was that when deleting a large number of small files, the deletion process seems to take a good chunk of my time -- in some cases it seemed to take a significant percentage of the time that it took to copy the files to the cluster to begin with. I'm guessing that the reason is a combination of find and rm -fr processing files serially and having to wait on the packets to travel back and forth over the network. But with a clustering filesystem, the bottleneck is processing files serially and waiting for network packets when you don't have to.
So I decided to try an experiment. Instead of using /bin/rm to delete files serially, I wrote my own quick-and-dirty recursive rm (and recursive ls) that uses pthreads (listed as "cluster-rm" and "cluster-ls" in the table below):
Methods:
1) This was done on a Linux system. I suspect that Linux (or any modern OS) caches filesystem information. For example, after setting up a directory, when running rm -fr on that directory, the time for rm to complete is lessened if I first run find on the same directory. So to avoid this caching effect, each command was run on it's own test directory. (I.e. find was never run on the same directory as rm -fr or cluster-rm.) This approach seemed to prevent inconsistencies resulting from any caching behavior, resulting in run times that were more consistent.
2) Each test directory contained the exact same data for each of the four commands tested (find, cluster-ls, rm, cluster-rm) for each test run.
3) All commands were run on a client machine and not one of the cluster nodes.
Results:
Data Size
Command
Test #1
Test #2
Test #3
Test #4
49GB
find -print
real 6m45.066s
user 0m0.172s
sys 0m0.748s
real 6m18.524s
user 0m0.140s
sys 0m0.508s
real 5m45.301s
user 0m0.156s
sys 0m0.484s
real 5m58.577s
user 0m0.132s
sys 0m0.480s
cluster-ls
real 2m32.770s
user 0m0.208s
sys 0m1.876s
real 2m21.376s
user 0m0.164s
sys 0m1.568s
real 2m40.511s
user 0m0.184s
sys 0m1.488s
real 2m36.202s
user 0m0.172s
sys 0m1.412s
49GB
rm -fr
real 16m36.264s
user 0m0.232s
sys 0m1.724s
real 16m16.795s
user 0m0.248s
sys 0m1.528s
real 15m54.503s
user 0m0.204s
sys 0m1.396s
real 16m10.037s
user 0m0.168s
sys 0m1.448s
cluster-rm
real 1m50.717s
user 0m0.236s
sys 0m1.820s
real 1m44.803s
user 0m0.192s
sys 0m2.100s
real 2m6.250s
user 0m0.224s
sys 0m2.200s
real 2m6.367s
user 0m0.224s
sys 0m2.316s
97GB
find -print
real 11m39.990s
user 0m0.380s
sys 0m1.428s
real 11m21.018s
user 0m0.380s
sys 0m1.224s
real 11m33.257s
user 0m0.288s
sys 0m0.924s
real 11m4.867s
user 0m0.332s
sys 0m1.244s
cluster-ls
real 4m46.829s
user 0m0.504s
sys 0m3.228s
real 5m15.538s
user 0m0.408s
sys 0m3.736s
real 4m52.075s
user 0m0.364s
sys 0m3.004s
real 4m43.134s
user 0m0.452s
sys 0m3.140s
97GB
rm -fr
real 29m34.138s
user 0m0.520s
sys 0m3.908s
real 28m11.000s
user 0m0.556s
sys 0m3.480s
real 28m37.154s
user 0m0.412s
sys 0m2.756s
real 28m41.724s
user 0m0.380s
sys 0m4.184s
cluster-rm
real 3m30.750s
user 0m0.524s
sys 0m4.932s
real 4m20.195s
user 0m0.456s
sys 0m5.316s
real 4m45.206s
user 0m0.444s
sys 0m4.584s
real 4m26.894s
user 0m0.436s
sys 0m4.732s
145GB
find -print
real 16m26.498s
user 0m0.520s
sys 0m2.244s
real 16m53.047s
user 0m0.596s
sys 0m1.740s
real 15m10.704s
user 0m0.364s
sys 0m1.748s
real 15m53.943s
user 0m0.456s
sys 0m1.764s
cluster-ls
real 6m52.006s
user 0m0.644s
sys 0m5.664s
real 7m7.361s
user 0m0.804s
sys 0m5.432s
real 7m4.109s
user 0m0.652s
sys 0m4.800s
real 6m37.229s
user 0m0.656s
sys 0m4.652s
145GB
rm -fr
real 40m10.396s
user 0m0.624s
sys 0m5.492s
real 42m17.851s
user 0m0.844s
sys 0m4.872s
real 39m6.493s
user 0m0.484s
sys 0m4.868s
real 39m52.047s
user 0m0.496s
sys 0m4.980s
cluster-rm
real 6m49.769s
user 0m0.708s
sys 0m6.440s
real 8m34.644s
user 0m0.852s
sys 0m8.345s
real 6m3.563s
user 0m0.636s
sys 0m5.844s
real 6m31.808s
user 0m0.664s
sys 0m5.996s
1.1TB
find -print real 62m4.043s
user 0m1.300s
sys 0m5.448s
real 61m11.584s
user 0m1.204s
sys 0m5.172s
real 65m37.389s
user 0m1.708s
sys 0m4.276s
real 63m51.822s
user 0m3.096s
sys 0m9.869s
cluster-ls
real 73m12.463s
user 0m2.472s
sys 0m19.289s
real 68m37.846s
user 0m2.080s
sys 0m18.625s
real 72m56.417s
user 0m2.516s
sys 0m18.601s
real 69m3.575s
user 0m4.316s
sys 0m35.986s
1.1TB
rm -fr
real 188m1.925s
user 0m2.240s
sys 0m21.705s
real 190m21.850s
user 0m2.372s
sys 0m18.885s
real 200m25.712s
user 0m5.840s
sys 0m46.363s
real 196m12.686s
user 0m4.916s
sys 0m41.519s
cluster-rm
real 85m46.463s
user 0m2.512s
sys 0m30.478s
real 90m29.055s
user 0m2.600s
sys 0m30.382s
real 88m16.063s
user 0m4.456s
sys 0m51.667
real 77m42.096s
user 0m2.464s
sys 0m31.638s
Conclusions:
1) Once I had a threaded version of rm, a threaded version of ls was easy to make, so I included it in the tests (listed above as cluster-ls). Performance looked spiffy up until the 1.1TB range, when cluster-ls started taking more time than find. Right now I can't explain why. 1.1TB takes a long time to set up and process (about a day for each set of four commands), it could be that regular nightly backups might be interfering with performance. If that's the case, then it calls into question the usefulness of my threaded approach. Also, naturally the output from cluster-ls is out of order, so grep and sed would most likely be used in conjunction with something like that, and I haven't yet time-tested 'cluster-ls | some-other-command' against using plain old find by itself.
2) Results from cluster-rm look pretty good to me across the board. Again, performance seems to fall off in the 1.1TB tests, and the reasons are not clear to me at this time, but performance is still half that of rm -fr. Run times fluctuate more than in the previous tests, but I suppose that's to be expected. But since performance does drop, it makes me wonder how well this approach scales on larger sets of data.
3) My threaded cluster-rm/ls commands are not clever. While traversing directories, any subdirectories found would result in a new thread to process it, up until some hard-coded limit is reached (for the above results, 100 threads were used). After the thread count limit is reached, directories are processed using plain old recursion until a thread exits, freeing up a thread to process another subdirectory.
Further Research:
A) I would like to test further with larger data sets.
B) I would like to implement a smarter algorithm for determining how many threads to use to maximize performance. Rather than a hard-coded maximum, a better approach might be to use some metric for measuring number of inodes processed per second, and use that to determine the effectiveness of adding more threads until a local maxima is reached.
C) How do these numbers change if the commands are run on one of the cluster nodes instead of a client?
I have some ideas of smarter things to try, but I am at best an inexperienced (if enthusiastic) dabbler in the programming arts. A professional would likely do a much better job.
But if this data looks at all interesting or useful, then maybe there would be a call for a handful of cluster-specific filesystem tools?
Michael Peek
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://supercolony.gluster.org/mailman/listinfo/gluster-users
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://supercolony.gluster.org/mailman/listinfo/gluster-users
_______________________________________________ Gluster-users mailing list Gluster-users@xxxxxxxxxxx http://supercolony.gluster.org/mailman/listinfo/gluster-users