Glusterfs performance with large directories

Please try setting 'option self-heal off' in the unify volume and check
whether that improves your performance.
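
A minimal sketch of where that option would go, assuming the unify volume
from the client config quoted further down in this message. Everything
except the 'option self-heal off' line is copied from that config; the
placement of the new line within the volume is an assumption, and the
change has not been tested here:

```
volume unify
 type cluster/unify
 option namespace ns0
 option scheduler nufa
 option nufa.local-volume-name brick-2-0   # depends on the data node
 option nufa.limits.min-free-disk 10%
 option self-heal off                      # suggested change, untested
 subvolumes brick-0-0 brick-1-0 brick-2-0
end-volume
```

After remounting the client with the changed config, re-running the same
'time ls' on the slow directory should show whether the listing time
changes.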

avati

> File layout is such that each host has its own directory, for example the
> GlusterFS website would be located in:
> <fs_root>/db/org/g/www.glusterfd.org/
> and each directory will have a small number of potentially large data
> files.
> A similar setup on local disks (without gluster) has proven its
> capabilities over the years.
>
> We use a distributed computing model: each node in the archive runs one
> or more processes to update the archive. We use the nufa scheduler to favor
> local files and we use a distributed hashing algorithm to prevent data from
> moving around nodes (unless the configuration changes of course).
>
> I've included the GlusterFS configuration at the bottom of this e-mail.
>
> Data access and throughput are pretty good (good enough), but calling
> stat() on a directory can take extraordinarily long. Here is, for example,
> a listing of the .nl top-level domain:
>
> vagabond@spider2:~/archive/db/nl$ time ls
> 0/  2/  4/  6/  8/  a/  c/  e/  g/  i/  k/  m/  o/  q/  s/  u/  w/  y/
> 1/  3/  5/  7/  9/  b/  d/  f/  h/  j/  l/  n/  p/  r/  t/  v/  x/  z/
>
> real    4m28.373s
> user    0m0.004s
> sys     0m0.000s
>
>
> The same operation performed directly on the local filesystem of the
> namespace node returns almost instantly (also for large directories):
>
> time ls /local.mnt/md0/glfs-namespace/db/nl/a | wc -l
>  17506
>
> real    0m0.043s
> user    0m0.032s
> sys     0m0.012s
>
>
> A trace of the namespace gluster daemon shows that it is performing an
> lstat() on all the subdirectories (nl/0/*, nl/1/* etc). Information that
> IMO is not needed at this time. In our case the total number of directories
> on the filesystem goes into the many millions so this behaviour is hurting
> performance.
>
> Now for our questions:
>
> * is this expected to scale to tens of millions of directories?
>
> * is this behaviour a necessity for GlusterFS to operate correctly or is
> it some form of performance optimisation? Is it tunable?
>
> * what exactly is the sequence of events to handle a directory listing?
> Is this request handled by the namespace node only?
>
> * is there anything we can tune or change to speed up directory access?
>
> Thanks for your time,
>
> Arend-Jan
>
>
> **** Hardware config ****
> data nodes
> - 1 x Xeon quad core 2.5 GHz
> - 4 x Barracuda ES.2 SATA 3.0-Gb/s 1-TB Hard Drive
> Disks configured in RAID0, 128k chunks
> Filesystem XFS
> Network: gigabit LAN
>
> namespace node
> - 2 x Xeon quad core 2.5 GHz
> - 4 x Cheetah(R) 15K.5 U320 SCSI Hard Drives
> Disks configured in RAID1 (1 mirror, 1 spare)
> Filesystem XFS
> Network: gigabit LAN
>
>
> Glusterfs Version: 1.3.11 with Fuse fuse-2.7.3glfs10
> Glusterfs Version: 1.4-pre5 with Fuse fuse-2.7.3glfs10
>
>
> **** GlusterFS data node config ****
>
> volume brick-posix0
>  type storage/posix
>  option directory /local.mnt/md0/glfs-data
> end-volume
>
> volume brick-lock0
>  type features/posix-locks
>  subvolumes brick-posix0
> end-volume
>
> volume brick-fixed0
>  type features/fixed-id
>  option fixed-uid 2224
>  option fixed-gid 224
>  subvolumes brick-lock0
> end-volume
>
> volume brick-iothreads0
>  type performance/io-threads
>  option thread-count 4
>  subvolumes brick-fixed0
> end-volume
>
> volume brick0
>  type performance/read-ahead
>  subvolumes brick-iothreads0
> end-volume
>
> volume server
>  type protocol/server
>  option transport-type tcp/server
>  subvolumes brick0
>  option auth.ip.brick0.allow 10.1.0.*
> end-volume
>
>
> **** GlusterFS namespace config ****
>
> volume brick-posix
>  type storage/posix
>  option directory /local.mnt/md0/glfs-namespace
> end-volume
>
> volume brick-namespace
>  type features/fixed-id
>  option fixed-uid 2224
>  option fixed-gid 224
>  subvolumes brick-posix
> end-volume
>
> volume server
>  type protocol/server
>  option transport-type tcp/server
>  subvolumes brick-namespace
>  option auth.ip.brick-namespace.allow 10.1.0.*
> end-volume
>
>
> **** GlusterFS client config ****
>
> volume brick-0-0
>  type protocol/client
>  option transport-type tcp/client
>  option remote-host archive0
>  option remote-subvolume brick0
> end-volume
>
> volume brick-1-0
>  type protocol/client
>  option transport-type tcp/client
>  option remote-host archive1
>  option remote-subvolume brick0
> end-volume
>
> volume brick-2-0
>  type protocol/client
>  option transport-type tcp/client
>  option remote-host archive2
>  option remote-subvolume brick0
> end-volume
>
> volume ns0
>  type protocol/client
>  option transport-type tcp/client
>  option remote-host archivens0
>  option remote-subvolume brick-namespace
> end-volume
>
> volume unify
>  type cluster/unify
>  option namespace ns0
>  option scheduler nufa
>  option nufa.local-volume-name brick-2-0        # depends on data node of course
>  option nufa.limits.min-free-disk 10%
>  subvolumes brick-0-0 brick-1-0 brick-2-0
> end-volume
>
> --
> Arend-Jan Wijtzes -- Wiseguys -- www.wise-guys.nl

