Hello,

On Tue, 6 Jan 2015 15:29:50 +0000 Shain Miley wrote:

> Hello,
>
> We currently have a 12 node (3 monitor + 9 OSD) ceph cluster, made up of
> 107 x 4TB drives formatted with xfs. The cluster is running ceph version
> 0.80.7:
>
I assume the journals are on the same HDDs then. How much memory per node?

[snip]

> A while back I created an 80 TB rbd image to be used as an archive
> repository for some of our audio and video files. We are still seeing
> good rados and rbd read and write throughput performance, however we
> seem to be having quite a long delay in response times when we try to
> list out the files in directories with a large number of folders, files,
> etc.
>
> Subsequent directory listing times seem to run a lot faster (but I am
> not sure for how long that is the case before we see another instance of
> slowness), however the initial directory listings can take 20 to 45
> seconds.
>
Basically the same thing(s) that Robert said.

How big is "large"? How much memory is on the machine you're mounting this
image on? Ah, never mind, just saw your follow-up. Definitely add memory to
that machine if you can.

The initial listing is always going to be somewhat slow, depending on a
number of things in the cluster. For example, how busy is it (IOPS)?
With journals on disk your HDDs are going to be sluggish individually, and
your directory information might reside mostly in one object (on one OSD),
thus limiting you to the speed of that particular disk.

This is also where the memory of your storage nodes comes in: if it is
large enough, your "hot" objects will get cached there as well.

To see if that's the case (at least temporarily), drop the caches on all of
your storage nodes (echo 3 > /proc/sys/vm/drop_caches), mount your image,
do the "ls -l" until it's "fast", umount it, mount it again and do the
listing again (a rough command sketch follows below my signature).

In theory, unless your cluster is extremely busy or your storage nodes have
very little pagecache, the re-mounted image should get all the info it
needs from said pagecache on your storage nodes, never having to go to the
actual OSD disks, and thus be faster than the initial test.

Finally, to potentially improve the initial scan that obviously has to come
from the disks, see how fragmented your OSDs are and defragment them
depending on the results.

Christian
--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Fusion Communications
http://www.gol.com/
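
P.S. For reference, a rough sketch of the cache test described above. The
device, mount point and directory (/dev/rbd0, /mnt/archive, big_dir) are
placeholders for whatever you actually use:

  # on every storage node, as root: flush dirty data, then drop the pagecache
  sync; echo 3 > /proc/sys/vm/drop_caches

  # on the client machine:
  mount /dev/rbd0 /mnt/archive
  time ls -l /mnt/archive/big_dir   # cold run, hits the OSD disks; repeat until fast
  umount /mnt/archive               # throws away the client-side cache
  mount /dev/rbd0 /mnt/archive
  time ls -l /mnt/archive/big_dir   # should now be served from the OSD pagecache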
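
And for the fragmentation check, assuming XFS on the OSDs; the device and
OSD mount point below are again just examples:

  # report the fragmentation factor of an OSD filesystem
  xfs_db -c frag -r /dev/sdX1

  # defragment it in place if the factor is high (runs on the mounted fs)
  xfs_fsr -v /var/lib/ceph/osd/ceph-0

xfs_fsr will generate a fair amount of I/O on that OSD, so best run it
during a quiet period.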