please try with 'option self-heal off' in the unify volume and check if
that improves your performance.

avati

> File layout is such that each host has its own directory; for example
> the GlusterFS website would be located in:
>   <fs_root>/db/org/g/www.glusterfs.org/
> and each directory will have a small number of potentially large data
> files. A similar setup on local disks (without gluster) has proven its
> capabilities over the years.
>
> We use a distributed computing model: each node in the archive runs one
> or more processes to update the archive. We use the nufa scheduler to
> favor local files and we use a distributed hashing algorithm to prevent
> data from moving around nodes (unless the configuration changes, of
> course).
>
> I've included the GlusterFS configuration at the bottom of this e-mail.
>
> Data access and throughput are pretty good (good enough), but calling
> stat() on a directory can take extraordinarily long. Here is for example
> a listing of the .nl top-level domain:
>
> vagabond@spider2:~/archive/db/nl$ time ls
> 0/  2/  4/  6/  8/  a/  c/  e/  g/  i/  k/  m/  o/  q/  s/  u/  w/  y/
> 1/  3/  5/  7/  9/  b/  d/  f/  h/  j/  l/  n/  p/  r/  t/  v/  x/  z/
>
> real    4m28.373s
> user    0m0.004s
> sys     0m0.000s
>
> The same operation performed directly on the local filesystem of the
> namespace node returns almost instantly (also for large directories):
>
> time ls /local.mnt/md0/glfs-namespace/db/nl/a | wc -l
> 17506
>
> real    0m0.043s
> user    0m0.032s
> sys     0m0.012s
>
> A trace of the namespace gluster daemon shows that it is performing an
> lstat() on all the subdirectories (nl/0/*, nl/1/*, etc.), information
> that IMO is not needed at this time. In our case the total number of
> directories on the filesystem goes into the many millions, so this
> behaviour is hurting performance.
>
> Now for our questions:
>
> * is this expected to scale to tens of millions of directories?
>
> * is this behaviour a necessity for GlusterFS to operate correctly, or
>   is it some form of performance optimisation? Is it tunable?
>
> * what exactly is the sequence of events to handle a directory listing?
>   Is this request handled by the namespace node only?
>
> * is there anything we can tune or change to speed up directory access?
>
> Thanks for your time,
>
> Arend-Jan
>
>
> **** Hardware config ****
>
> data nodes
>   - 1 x Xeon quad core 2.5 GHz
>   - 4 x Barracuda ES.2 SATA 3.0-Gb/s 1-TB hard drives
>   Disks configured in RAID0, 128k chunks
>   Filesystem: XFS
>   Network: gigabit LAN
>
> namespace node
>   - 2 x Xeon quad core 2.5 GHz
>   - 4 x Cheetah(R) 15K.5 U320 SCSI hard drives
>   Disks configured in RAID1 (1 mirror, 1 spare)
>   Filesystem: XFS
>   Network: gigabit LAN
>
> GlusterFS version: 1.3.11 with Fuse fuse-2.7.3glfs10
> GlusterFS version: 1.4-pre5 with Fuse fuse-2.7.3glfs10
>
>
> **** GlusterFS data node config ****
>
> volume brick-posix0
>   type storage/posix
>   option directory /local.mnt/md0/glfs-data
> end-volume
>
> volume brick-lock0
>   type features/posix-locks
>   subvolumes brick-posix0
> end-volume
>
> volume brick-fixed0
>   type features/fixed-id
>   option fixed-uid 2224
>   option fixed-gid 224
>   subvolumes brick-lock0
> end-volume
>
> volume brick-iothreads0
>   type performance/io-threads
>   option thread-count 4
>   subvolumes brick-fixed0
> end-volume
>
> volume brick0
>   type performance/read-ahead
>   subvolumes brick-iothreads0
> end-volume
>
> volume server
>   type protocol/server
>   option transport-type tcp/server
>   subvolumes brick0
>   option auth.ip.brick0.allow 10.1.0.*
> end-volume
>
>
> **** GlusterFS namespace config ****
>
> volume brick-posix
>   type storage/posix
>   option directory /local.mnt/md0/glfs-namespace
> end-volume
>
> volume brick-namespace
>   type features/fixed-id
>   option fixed-uid 2224
>   option fixed-gid 224
>   subvolumes brick-posix
> end-volume
>
> volume server
>   type protocol/server
>   option transport-type tcp/server
>   subvolumes brick-namespace
>   option auth.ip.brick-namespace.allow 10.1.0.*
> end-volume
>
>
> **** GlusterFS client config ****
>
> volume brick-0-0
>   type protocol/client
>   option transport-type tcp/client
>   option remote-host archive0
>   option remote-subvolume brick0
> end-volume
>
> volume brick-1-0
>   type protocol/client
>   option transport-type tcp/client
>   option remote-host archive1
>   option remote-subvolume brick0
> end-volume
>
> volume brick-2-0
>   type protocol/client
>   option transport-type tcp/client
>   option remote-host archive2
>   option remote-subvolume brick0
> end-volume
>
> volume ns0
>   type protocol/client
>   option transport-type tcp/client
>   option remote-host archivens0
>   option remote-subvolume brick-namespace
> end-volume
>
> volume unify
>   type cluster/unify
>   option namespace ns0
>   option scheduler nufa
>   option nufa.local-volume-name brick-2-0   # depends on data node of course
>   option nufa.limits.min-free-disk 10%
>   subvolumes brick-0-0 brick-1-0 brick-2-0
> end-volume
>
> --
> Arend-Jan Wijtzes -- Wiseguys -- www.wise-guys.nl

--
If I traveled to the end of the rainbow
As Dame Fortune did intend,
Murphy would be there to tell me
The pot's at the other end.
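For reference, a minimal sketch of where the suggested option would sit,
based on the client-side unify volume posted above. The option name is
taken verbatim from avati's reply at the top of this message; whether the
unify translator in the 1.3.11 / 1.4-pre5 builds accepts it in exactly
this form is an assumption to verify against the release notes for your
version.

volume unify
  type cluster/unify
  option namespace ns0
  option scheduler nufa
  option nufa.local-volume-name brick-2-0   # depends on data node, as above
  option nufa.limits.min-free-disk 10%
  option self-heal off                      # avati's suggestion; verify the option name for your build
  subvolumes brick-0-0 brick-1-0 brick-2-0
end-volume

The client reads its volume file at mount time, so the unify mount would
need to be unmounted and remounted on each client for the change to take
effect.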