Hi.

I'm looking for a clustered filesystem for a very simple scenario. I've set up Gluster, but my tests have shown quite a performance penalty compared to using a local XFS filesystem. This no doubt reflects the reality of moving to a proper distributed filesystem, but I'd like to quickly check that I haven't missed something obvious that might improve performance.

I plan to have two Amazon AWS EC2 instances (virtual machines) both accessing the same filesystem for reads and writes. Access will be almost entirely reads, with the occasional modification, deletion or creation of files. Ideally I wanted all those reads going straight to the local XFS filesystem and just the writes incurring a distributed performance penalty. :-)

So I've set up two VMs with CentOS 7.2 and Gluster 3.8.8, each machine running as a combined Gluster server and client. One brick on each machine, one volume in a 1 x 2 replica configuration. Everything works, it's just the performance penalty which is a surprise. :-)

My test directory has 9,066 files and directories; 7,987 actual files. Total size is 63MB data, 85MB allocated; an average size of 8KB data per file. The brick's files have a total of 117MB allocated, with the extra 32MB working out to be almost exactly the sum of the extra 4KB extents that would have been allocated for the XFS attributes per file - the VMs were installed with the default 256 byte inode size for the local filesystem, and from what I've read Gluster will force the filesystem to allocate an extent for its attributes. 'xfs_bmap' on a few files shows this is the case.

A simple 'cat' of every file when laid out in 'native' directories on the XFS filesystem takes about 3 seconds.
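For anyone who wants to check the allocation arithmetic above, here is a minimal sketch. It assumes one extra 4KB extent per file and treats 1MB as 1024KB; the function name is just something I made up for illustration.

```python
# Rough check of the "extra 32MB" figure: one extra 4KB attribute extent per file.
# Assumptions: 4KB per extent, 1MB = 1024KB, and the counts quoted in this post.

def extra_extent_mb(entries, extent_kb=4):
    """Return the MB allocated if every entry gets one extra extent."""
    return entries * extent_kb / 1024

files_only = 7987          # actual files in the test directory
files_and_dirs = 9066      # files plus directories
observed_extra_mb = 117 - 85  # brick allocation minus native allocation

print(f"observed extra:          {observed_extra_mb} MB")
print(f"files only:              {extra_extent_mb(files_only):.1f} MB")
print(f"files and directories:   {extra_extent_mb(files_and_dirs):.1f} MB")
```

The observed figure sits between the files-only and the files-plus-directories estimates, which is consistent with (but does not prove) the one-extent-per-file explanation.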
A cat of all the files in the brick's directory on the same filesystem takes about 6.4 seconds, which I figure is due to the extra I/O for the inode metadata extents (although I'm not quite certain; the additional extents added about 40% extra to the disk block allocation, so I'm unsure as to why the time increase was roughly 100%).

Doing the same test through the glusterfs mount takes about 25 seconds; roughly four times longer than reading those same files directly from the brick itself.

It took 30 seconds until I applied the 'md-cache' settings (for those variables that still exist in 3.8.8) mentioned in this very helpful article:

http://blog.gluster.org/category/performance/

So use of the md-cache in a 'cold run' shaved off 5 seconds - due to common directory LOOKUP operations being cached, I guess.

Output of a 'volume info' is as follows:

Volume Name: g1
Type: Replicate
Volume ID: bac6cd70-ca0d-4173-9122-644051444fe5
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: serverA:/data/brick1
Brick2: serverC:/data/brick1
Options Reconfigured:
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
cluster.self-heal-daemon: enable
features.cache-invalidation: on
features.cache-invalidation-timeout: 600
performance.stat-prefetch: on
performance.md-cache-timeout: 60
network.inode-lru-limit: 90000

The article suggests a value of 600 for performance.md-cache-timeout but my Gluster version only permits a maximum value of 60.

Network speed between the two VMs is about 120 MBytes/sec - the two VMs inhabit the same Amazon Virtual Private Cloud - so I don't think bandwidth is a factor.

The roughly fourfold slowdown is no doubt the penalty incurred in moving to a proper distributed filesystem. That article and other web pages I've read all say that each open of a file results in synchronous LOOKUP operations on all the replicas, so I'm guessing it just takes that much time for everything to happen before a file can be opened.
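In case anyone wants to repeat the comparison, the 'cat every file' test can be approximated with a short script along these lines. This is only a sketch of the kind of test I ran, not the exact commands; the function name is invented, and the directories to time are whatever you pass on the command line (for example the native copy, the brick directory and the glusterfs mount point).

```python
import os
import sys
import time

def read_tree(root, chunk_size=1 << 20):
    """Read every regular file under root, like running 'cat' on each one.

    Returns (file_count, total_bytes, elapsed_seconds).
    """
    file_count = 0
    total_bytes = 0
    start = time.perf_counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.isfile(path):
                continue  # skip sockets, dangling symlinks and the like
            with open(path, "rb") as handle:
                while True:
                    chunk = handle.read(chunk_size)
                    if not chunk:
                        break
                    total_bytes += len(chunk)
            file_count += 1
    return file_count, total_bytes, time.perf_counter() - start

if __name__ == "__main__":
    # Each command-line argument is a directory to time; no arguments does nothing.
    for root in sys.argv[1:]:
        count, size, seconds = read_tree(root)
        print(f"{root}: {count} files, {size} bytes, {seconds:.2f} s")
```

For a cold run the Linux page cache has to be dropped first, otherwise the second directory tested benefits from whatever the first one pulled in.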
Gluster profiling shows that there are 11,198 LOOKUP operations on the test cat of the 7,987 files.

As a Gluster newbie I'd appreciate some quick advice if possible -

1. Is this sort of performance hit - on directories of small files - typical for such a simple Gluster configuration?

2. Is there anything I can do to speed things up? :-)

3. Repeating the 'cat' test immediately after the first test run saw the time dive from 25 seconds down to 4 seconds. Before I'd set those md-cache variables it had taken 17 seconds, which I assume was due to the actual file data being cached in the Linux buffer cache. So those md-cache settings really did make a change - taking off another 13 seconds - once everything was cached.

Flushing/invalidating the Linux memory cache made the next test go back to the 25 seconds. So it seems to me that the md-cache must hold its contents in the Linux buffer cache ... which surprised me, because I thought a user-space system like Gluster would have the cache within the daemons or maybe a shared memory segment, nothing that would be affected by clearing the Linux buffer cache. I was expecting a run after invalidating the Linux cache would take something between 4 seconds and 25 seconds, with the md-cache still primed but the file data expired.

Just out of curiosity about how the md-cache is implemented ... why does clearing the Linux buffers seem to affect it?

4. The documentation says that Gluster geo-replication does 'asynchronous replication', which is something that would really help, but that it's 'master/slave', so I'm assuming that geo-replication won't fulfill my requirement of both servers being able to occasionally write/modify/delete files?

5. In my brick directory I have a '.trashcan' subdirectory - which is documented - but also a '.glusterfs' directory, which seems to have lots of magical files in some sort of housekeeping structure.
Surprisingly the total amount of data under .glusterfs is greater than the total size of the actual files in my test directory. I haven't seen a description of what .glusterfs is used for ... are its contents vital to the operation of Gluster, or can they be deleted? Just curious. At one stage I had 1.1GB of files in my volume, which expanded to be 1.5GB in the brick (due to the metadata extents) and a whopping 1.6GB of extra data materialized under the .glusterfs directory!

6. Since I'm using CentOS I try to stick with things that are available through the Red Hat repository channel ... so in my looking for distributed filesystems I saw mention of Ceph. Because I wanted only a simple replicated filesystem it seemed to me that Ceph - being based/focused on 'object' storage? - wouldn't be as good a fit as Gluster. Evil question to a Gluster mailing list - will Ceph give me any significantly better performance in reading small files?

I've tried to investigate and find out what I can but I could be missing something really obvious in my ignorance, so I would appreciate any quick tips/answers from the experts.

Thanks!
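P.S. On question 5, one way to check whether the data under .glusterfs is genuinely extra, or is the same inodes being counted a second time, is to total the allocated blocks per unique inode. The sketch below is generic and not Gluster-specific; that hard links might be involved is purely my assumption, the function name is invented, and /data/brick1 in the usage note is the brick path from the volume info above.

```python
import os

def allocated_bytes(root, seen=None):
    """Sum allocated bytes under root, counting each inode only once.

    Pass the same 'seen' set to successive calls to avoid re-counting
    inodes that were already reached through another directory.
    """
    seen = set() if seen is None else seen
    total = 0
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            info = os.lstat(os.path.join(dirpath, name))  # do not follow symlinks
            key = (info.st_dev, info.st_ino)
            if key in seen:
                continue  # another name for an inode we already counted
            seen.add(key)
            total += info.st_blocks * 512  # st_blocks is in 512-byte units on Linux
    return total
```

Calling it first on the data directories and then on /data/brick1/.glusterfs with the same 'seen' set would show how much allocation under .glusterfs is not already accounted for by the files themselves.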