> -----Original Message-----
> From: linux-cluster-bounces@xxxxxxxxxx [mailto:linux-cluster-bounces@xxxxxxxxxx]
> On Behalf Of Alan Brown
> Sent: Friday, March 11, 2011 7:07 PM
>
> Personal observation: GFS and GFS2 currently have utterly rotten performance for
> activities involving many small files, such as NFS exporting /home via NFS sync
> mounts. They also fails miserably if there are a lot of files in a single directory (more
> than 5-700, with things getting unusable beyond about 1500 files)

While I certainly agree there are common scenarios in which GFS performs slowly (backup by rsync is one), your characterization of GFS performance within large directories isn't completely fair. Here's a test I just ran on a cluster node, immediately after rebooting, joining the cluster and mounting a GFS filesystem:

[root@cluster1 76]# time ls
00076985.ts  28d80a9c.ts  52b778d2.ts  7f50762b.ts  a9c5f908.ts  d39d0032.ts
00917c3e.ts  28de643b.ts  532d3fd7.ts  7f5dea46.ts  a9e0328b.ts  d3bcc9fb.ts
...
289d2764.ts  527b6f37.ts  7f3e5c9a.ts  a989df77.ts  d36c57fc.ts
28c3aa38.ts  52ab865f.ts  7f3e9278.ts  a9aa3dba.ts  d392d793.ts

real    0m0.034s
user    0m0.008s
sys     0m0.004s

[root@cluster1 76]# ls | wc -l
1970

The key is that only a few locks are needed to list the directory:

[root@cluster1 76]# gfs_tool counters /tb2
                                  locks 32
                             locks held 25

Running "ls -l" on the same directory takes a bit longer (by a factor of about 20):

[root@cluster1 76]# time ls -l
total 1970
-rw-r----- 1 root root 42 Mar 2 12:01 00076985.ts
-rw-r----- 1 root root 42 Mar 2 12:01 00917c3e.ts
-rw-r----- 1 root root 42 Mar 2 12:01 00b60c66.ts
...
-rw-r----- 1 root root 42 Mar 2 12:01 ffc02edd.ts
-rw-r----- 1 root root 42 Mar 2 12:01 ffefd00a.ts
-rw-r----- 1 root root 42 Mar 2 12:01 fff80ff6.ts

real    0m0.641s
user    0m0.032s
sys     0m0.032s

presumably because it has to acquire quite a few additional locks (roughly two per file, going by the counters; there's a sketch of the system call pattern behind this after the list below):

[root@cluster1 76]# gfs_tool counters /tb2
                                  locks 3972
                             locks held 3965

For better or worse, "ls -l" (or equivalently, the aliased "ls --color=tty" for Red Hat users) is a very common operation for interactive users, and such users often have an immediate negative reaction to using GFS as a consequence.

In my personal opinion:

- Decades of work on Linux have optimized local filesystem performance and system call performance to the point that system call overhead is often treated as negligible for most applications. Running "ls -l" within a large directory is a slow, expensive operation on any system, but if it "feels" fast enough (in terms of wall clock time, not compute cycles), there's little incentive to optimize it further. I find this is true of software applications as well: it's shocking to me how many unnecessary system calls our own applications make, often as a result of libraries such as glibc.

- Cluster filesystems require a lot of network communication to maintain perfect consistency. The network protocols used by (e.g.) DLM to maintain this consistency are probably several orders of magnitude slower than the mechanisms that maintain memory cache consistency on an SMP system. It follows that assumptions about stat() performance on a local filesystem do not necessarily hold on a clustered filesystem, and application performance can suffer as a result.

- Overcoming this may involve significant changes to the Linux system call interface (assuming there won't be a hardware solution anytime soon). For example, relying on the traditional stat() interface for file metadata limits us to one file per system call. In the case of a clustered filesystem, stat() often triggers a synchronous network round trip via the locking protocol. A theoretical stat() interface that supports looking up multiple files at once would be an improvement, but it is relatively difficult to implement because it would entail changing the kernel, libraries, and application software. (A purely hypothetical sketch of such an interface appears after this list.)

- Ethernet is a terrible medium for a distributed locking protocol. Ethernet is well suited to applications that need high bandwidth and are not particularly sensitive to latency; DLM doesn't need lots of bandwidth, but it is very sensitive to latency. Better hardware exists for this than Ethernet (e.g. http://www.dolphinics.com/products/pemb-sci-d352.html), but alas Ethernet is ubiquitous, and as far as I am aware little work has been done in the cluster community to support alternative hardware.
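Coming back to the "ls" versus "ls -l" measurements above, here is a minimal sketch of the difference as I understand it. Plain "ls" only has to read the directory itself (one readdir()/getdents() loop), while "ls -l" additionally calls lstat() on every entry, and on a cluster filesystem each of those calls has to obtain that inode's locks over the network. This is an illustration of the system call pattern only, not GFS code; the list_names() and list_attrs() names are mine.

/*
 * Illustration only: the system call pattern behind "ls" vs. "ls -l".
 * The function names (list_names, list_attrs) are invented for this
 * sketch; they are not part of any real tool or library.
 */
#include <dirent.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/* "ls": a single pass over the directory, no per-file metadata.
 * On GFS this needs only a handful of locks (for the directory
 * itself), which matches the "locks 32" counter above. */
static void list_names(const char *dir)
{
    DIR *d = opendir(dir);
    struct dirent *e;

    if (!d)
        return;
    while ((e = readdir(d)) != NULL)
        printf("%s\n", e->d_name);
    closedir(d);
}

/* "ls -l": the same pass, plus one lstat() per entry.  Each lstat()
 * must return up-to-date attributes, which on a cluster filesystem
 * means acquiring locks for that inode; going by the counters above,
 * GFS ends up with roughly two locks per file (~1970 files, ~3972
 * locks). */
static void list_attrs(const char *dir)
{
    DIR *d = opendir(dir);
    struct dirent *e;
    struct stat st;
    char path[4096];

    if (!d)
        return;
    while ((e = readdir(d)) != NULL) {
        snprintf(path, sizeof(path), "%s/%s", dir, e->d_name);
        if (lstat(path, &st) == 0)
            printf("%-20s %8lld bytes\n", e->d_name,
                   (long long)st.st_size);
    }
    closedir(d);
}

int main(int argc, char **argv)
{
    const char *dir = (argc > 1) ? argv[1] : ".";

    list_names(dir);   /* cheap, even on GFS                   */
    list_attrs(dir);   /* roughly one lock round trip per file */
    return 0;
}

Running "strace -c ls" and "strace -c ls -l" in the big directory should show the same pattern without writing any code: a handful of getdents() calls in the first case, plus one lstat()-family call per file in the second.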
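As for the "theoretical stat() interface" in the third point above: nothing like this exists in Linux, so the following is a purely hypothetical sketch. The statmany() name, the stat_request structure, and their semantics are invented for illustration, and the fallback body is just a loop over lstat(), which is exactly the one-synchronous-round-trip-per-file behavior we have today. The point is only that if the kernel, and the filesystem behind it, were handed all of the names at once, a clustered filesystem could batch its lock requests into far fewer network round trips.

/*
 * HYPOTHETICAL sketch: there is no batched stat system call in Linux.
 * statmany() is invented here purely to illustrate the idea of
 * amortizing lock/network round trips across many files.
 */
#include <errno.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

struct stat_request {
    const char  *path;   /* file to look up                         */
    struct stat  st;     /* attributes, filled in on success        */
    int          err;    /* 0 on success, otherwise an errno value  */
};

/* Imaginary interface: resolve 'count' paths in a single call.  In a
 * real kernel implementation, a clustered filesystem could gather the
 * lock requests for all of these inodes and issue them in a few
 * batched network operations, instead of one synchronous round trip
 * per stat() as happens now.  In userspace today the best we can do
 * is the loop below, which is no better than calling lstat() directly. */
static int statmany(struct stat_request *reqs, size_t count)
{
    size_t i, failures = 0;

    for (i = 0; i < count; i++) {
        if (lstat(reqs[i].path, &reqs[i].st) == 0) {
            reqs[i].err = 0;
        } else {
            reqs[i].err = errno;
            failures++;
        }
    }
    return failures ? -1 : 0;
}

int main(void)
{
    struct stat_request reqs[2];
    size_t i;

    memset(reqs, 0, sizeof(reqs));
    reqs[0].path = "/etc/hosts";
    reqs[1].path = "/etc/fstab";

    /* What "ls -l" could do with such an interface: one call for the
     * whole directory instead of thousands of separate stat() calls. */
    statmany(reqs, 2);

    for (i = 0; i < 2; i++) {
        if (reqs[i].err == 0)
            printf("%-12s %lld bytes\n", reqs[i].path,
                   (long long)reqs[i].st.st_size);
        else
            printf("%-12s error: %s\n", reqs[i].path,
                   strerror(reqs[i].err));
    }
    return 0;
}

Even then, nothing improves until the kernel, glibc, and the applications themselves all learn to use it, which is why I called it relatively difficult.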
As an example of the Ethernet point above: while running a "du" command on my GFS mount point, I observed Ethernet traffic peaking as follows:

12:20:33 PM     IFACE   rxpck/s   txpck/s    rxbyt/s    txbyt/s   rxcmp/s   txcmp/s  rxmcst/s
12:20:38 PM      eth0   3517.60   3520.60  545194.80  631191.20      0.00      0.00      0.00

So a few thousand packets per second is the best this cluster node could muster, with average packet sizes of less than 200 bytes each way. I'm sure I could bring in my network experts and improve these results somewhat, maybe with hardware that supports TCP offloading, but you'd never improve this by more than perhaps an order of magnitude, because you're hitting the limits of what Ethernet hardware can do.

In summary, the state of the art in Linux clustered filesystems is unlikely to change much until we change the way we write software applications to optimize system call usage, or redesign the system call interface to take better advantage of distributed locking protocols, or start using new hardware that provides distributed shared memory much more efficiently than Ethernet can. Until one of those things happens, many users are bound to be unimpressed with GFS and similar clustered filesystems, and they will remain a niche technology.

-Jeff

-- 
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster