We at Wiseguys are looking into GlusterFS to run our Internet Archive.
The archive stores webpages collected by our spiders.

The test setup consists of three data machines, each exporting a volume
of about 3.7TB, and one nameserver machine.

The file layout is such that each host has its own directory. For example,
the GlusterFS website would be located in:

  <fs_root>/db/org/g/www.glusterfd.org/

Each directory will have a small number of potentially large data files.
A similar setup on local disks (without gluster) has proven its
capabilities over the years.

We use a distributed computing model: each node in the archive runs one
or more processes to update the archive. We use the nufa scheduler to
favor local files, and we use a distributed hashing algorithm to prevent
data from moving around nodes (unless the configuration changes, of course).

I've included the GlusterFS configuration at the bottom of this e-mail.

Data access and throughput are pretty good (good enough), but calling
stat() on a directory can take extraordinarily long. Here is, for example,
a listing of the .nl top level domain:

  vagabond@spider2:~/archive/db/nl$ time ls
  0/  2/  4/  6/  8/  a/  c/  e/  g/  i/  k/  m/  o/  q/  s/  u/  w/  y/
  1/  3/  5/  7/  9/  b/  d/  f/  h/  j/  l/  n/  p/  r/  t/  v/  x/  z/

  real    4m28.373s
  user    0m0.004s
  sys     0m0.000s

The same operation performed directly on the local filesystem of the
namespace node returns almost instantly (also for large directories):

  time ls /local.mnt/md0/glfs-namespace/db/nl/a | wc -l
  17506

  real    0m0.043s
  user    0m0.032s
  sys     0m0.012s

A trace of the namespace gluster daemon shows that it is performing an
lstat() on all the subdirectories (nl/0/*, nl/1/* etc.), information that
IMO is not needed at this time. In our case the total number of
directories on the filesystem goes into the many millions, so this
behaviour is hurting performance.

Now for our questions:

* is this expected to scale to tens of millions of directories?
* is this behaviour a necessity for GlusterFS to operate correctly or is
  it some form of performance optimisation? Is it tunable?
* what exactly is the sequence of events to handle a directory listing?
  Is this request handled by the namespace node only?
* is there anything we can tune or change to speed up directory access?

Thanks for your time,
Arend-Jan


**** Hardware config ****

data nodes
- 1 x Xeon quad core 2.5 GHz
- 4 x Barracuda ES.2 SATA 3.0-Gb/s 1-TB Hard Drive
Disks configured in RAID0, 128k chunks
Filesystem XFS
Network: gigabit LAN

namespace node
- 2 x Xeon quad core 2.5 GHz
- 4 x Cheetah 15K.5 U320 SCSI Hard Drives
Disks configured in RAID1 (1 mirror, 1 spare)
Filesystem XFS
Network: gigabit LAN

Glusterfs Version: 1.3.11 with Fuse fuse-2.7.3glfs10
Glusterfs Version: 1.4-pre5 with Fuse fuse-2.7.3glfs10


**** GlusterFS data node config ****

volume brick-posix0
  type storage/posix
  option directory /local.mnt/md0/glfs-data
end-volume

volume brick-lock0
  type features/posix-locks
  subvolumes brick-posix0
end-volume

volume brick-fixed0
  type features/fixed-id
  option fixed-uid 2224
  option fixed-gid 224
  subvolumes brick-lock0
end-volume

volume brick-iothreads0
  type performance/io-threads
  option thread-count 4
  subvolumes brick-fixed0
end-volume

volume brick0
  type performance/read-ahead
  subvolumes brick-iothreads0
end-volume

volume server
  type protocol/server
  option transport-type tcp/server
  subvolumes brick0
  option auth.ip.brick0.allow 10.1.0.*
end-volume


**** GlusterFS namespace config ****

volume brick-posix
  type storage/posix
  option directory /local.mnt/md0/glfs-namespace
end-volume

volume brick-namespace
  type features/fixed-id
  option fixed-uid 2224
  option fixed-gid 224
  subvolumes brick-posix
end-volume

volume server
  type protocol/server
  option transport-type tcp/server
  subvolumes brick-namespace
  option auth.ip.brick-namespace.allow 10.1.0.*
end-volume


**** GlusterFS client config ****

volume brick-0-0
  type protocol/client
  option transport-type tcp/client
  option remote-host archive0
  option remote-subvolume brick0
end-volume

volume brick-1-0
  type protocol/client
  option transport-type tcp/client
  option remote-host archive1
  option remote-subvolume brick0
end-volume

volume brick-2-0
  type protocol/client
  option transport-type tcp/client
  option remote-host archive2
  option remote-subvolume brick0
end-volume

volume ns0
  type protocol/client
  option transport-type tcp/client
  option remote-host archivens0
  option remote-subvolume brick-namespace
end-volume

volume unify
  type cluster/unify
  option namespace ns0
  option scheduler nufa
  option nufa.local-volume-name brick-2-0   # depends on data node of course
  option nufa.limits.min-free-disk 10%
  subvolumes brick-0-0 brick-1-0 brick-2-0
end-volume

-- 
Arend-Jan Wijtzes -- Wiseguys -- www.wise-guys.nl
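
For anyone who wants to reproduce the measurement, here is a minimal Python sketch that separates the cost of a bare directory listing from the cost of calling lstat() on every entry. It is not part of our archive code; the function names are made up for illustration, and it only approximates what the trace shows, since it stats the entries of the directory it is given rather than those of its subdirectories.

```python
import os
import sys
import time


def time_listing(path):
    """Time a bare directory listing (names only, no per-entry stat)."""
    start = time.perf_counter()
    names = os.listdir(path)
    return len(names), time.perf_counter() - start


def time_listing_with_lstat(path):
    """Time a listing that also calls lstat() on every entry."""
    start = time.perf_counter()
    names = os.listdir(path)
    for name in names:
        os.lstat(os.path.join(path, name))
    return len(names), time.perf_counter() - start


if __name__ == "__main__":
    # Directory to measure; defaults to the current directory.
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    count, elapsed = time_listing(target)
    print(f"listdir only:    {count} entries in {elapsed:.3f}s")
    count, elapsed = time_listing_with_lstat(target)
    print(f"listdir + lstat: {count} entries in {elapsed:.3f}s")
```

Running it once against a directory on the GlusterFS mount and once against the same directory on the namespace node's local filesystem should show whether the time goes into the listing itself or into the per-entry lstat() calls. Caching will affect repeated runs, so the first run is the one to compare.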