Re: Quick performance check?

Cedric Lemarchand <yipikai7@xxxxxxxxx> · Fri, 3 Feb 2017 14:02:59 +0100

On 3 Feb 2017, at 13:48, Gambit15 <dougti+gluster@xxxxxxxxx> wrote:

Hi Alex,
 I don't use Gluster for storing large amounts of small files; however, from what I've read, that does appear to be its big Achilles heel.

I am not an expert, but I agree: because of Gluster's distributed nature, the per-file access latency it induces plays a big role when you have to deal with lots of small files. There do seem to be some tuning options available, though; a good place to start could be: https://access.redhat.com/documentation/en-US/Red_Hat_Storage/3/html/Administration_Guide/Small_File_Performance_Enhancements.html

Personally, if you're not looking to scale out to a lot more servers, I'd go with Ceph or DRBD. Gluster's best features are in its scalability.

AFAIK Ceph needs at least 3 monitors (i.e. a quorum) to be fully “highly available”, so the entry ticket is pretty high and, from my point of view, overkill for such needs, unless you plan to scale out too. DRBD seems a more reasonable approach.

Cheers 

Also, it's worth pointing out that in any setup, you've got to be careful with 2 node configurations as they're highly vulnerable to split-brain scenarios.

Given the relatively small size of your data, caching tweaks & an arbiter may well save you here, however I don't use enough of its caching features to be able to give advice on it.

D

On 3 February 2017 at 08:28, Alex Sudakar <alex.sudakar@xxxxxxxxx> wrote:
Hi.  I'm looking for a clustered filesystem for a very simple
scenario.  I've set up Gluster but my tests have shown quite a
performance penalty when compared to using a local XFS filesystem.
This no doubt reflects the reality of moving to a proper distributed
filesystem, but I'd like to quickly check that I haven't missed
something obvious that might improve performance.

I plan to have two Amazon AWS EC2 instances (virtual machines) both
accessing the same filesystem for read/writes.  Access will be almost
entirely reads, with the occasional modification, deletion or creation
of files.  Ideally I wanted all those reads going straight to the
local XFS filesystem and just the writes incurring a distributed
performance penalty.  :-)

So I've set up two VMs with CentOS 7.2 and Gluster 3.8.8, each machine
running as a combined Gluster server and client.  One brick on each
machine, one volume in a 1 x 2 replica configuration.

Everything works, it's just the performance penalty which is a surprise.  :-)

My test directory has 9,066 files and directories; 7,987 actual files.
Total size is 63MB data, 85MB allocated; an average size of 8KB data
per file.  The brick's files have a total of 117MB allocated, with the
extra 32MB working out pretty much to be exactly the sum of the extra
4KB extents that would have been allocated for the XFS attributes per
file - the VMs were installed with the default 256 byte inode size for
the local filesystem, and from what I've read Gluster will force the
filesystem to allocate an extent for its attributes.  'xfs_bmap' on a
few files shows this is the case.
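
The allocation figures above can be sanity-checked with a few lines of arithmetic. This is only a sketch: it assumes exactly one extra 4KB extent per regular file (as the message suggests) and takes "MB" to mean 1024KB; the constant and function names are made up for illustration.

```python
# Figures quoted in the message above.
FILES = 7_987               # regular files in the test directory
ALLOCATED_NATIVE_MB = 85    # allocated on plain XFS
ALLOCATED_BRICK_MB = 117    # allocated inside the brick
EXTENT_KB = 4               # assumption: one extra 4KB extent per file


def xattr_overhead_mb(n_files, extent_kb=EXTENT_KB):
    """Estimated extra allocation if every file gains one extent."""
    return n_files * extent_kb / 1024


observed = ALLOCATED_BRICK_MB - ALLOCATED_NATIVE_MB
estimated = xattr_overhead_mb(FILES)
print(f"observed extra allocation: {observed} MB")
print(f"estimated from extents:    {estimated:.1f} MB")
```

Under those assumptions the estimate lands within about 1MB of the observed difference, which is consistent with the per-file extent explanation.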

A simple 'cat' of every file when laid out in 'native' directories on
the XFS filesystem takes about 3 seconds.  A cat of all the files in
the brick's directory on the same filesystem takes about 6.4 seconds,
which I figure is due to the extra I/O for the inode metadata extents
(although not quite certain; the additional extents added about 40%
extra to the disk block allocation, so I'm unsure as to why the time
increase was 100%).

Doing the same test through the glusterfs mount takes about 25
seconds; roughly four times longer than reading those same files
directly from the brick itself.
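
For anyone wanting to repeat this kind of comparison, here is a minimal sketch of a "read every file and time it" test. It is not the script used for the numbers above (those came from 'cat'); the function names are hypothetical, and it only measures what a single-threaded reader sees.

```python
import os
import time


def read_all_files(root):
    """Read every regular file under root; return (file_count, total_bytes)."""
    count = 0
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.isfile(path):
                continue  # skip sockets, broken symlinks, etc.
            with open(path, "rb") as fh:
                total += len(fh.read())
            count += 1
    return count, total


def timed_read(root):
    """Return (file_count, total_bytes, elapsed_seconds) for one full pass."""
    start = time.perf_counter()
    count, total = read_all_files(root)
    return count, total, time.perf_counter() - start
```

Calling timed_read() once on the native directory, once on the brick directory and once on the glusterfs mount point gives three directly comparable timings. The first run after a cache flush and an immediate second run will differ, as described further down.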

It took 30 seconds until I applied the 'md-cache' settings (for those
variables that still exist in 3.8.8) mentioned in this very helpful
article:

  http://blog.gluster.org/category/performance/

So use of the md-cache in a 'cold run' shaved off 5 seconds - due to
common directory LOOKUP operations being cached I guess.

Output of a 'volume info' is as follows:

Volume Name: g1
Type: Replicate
Volume ID: bac6cd70-ca0d-4173-9122-644051444fe5
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: serverA:/data/brick1
Brick2: serverC:/data/brick1
Options Reconfigured:
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
cluster.self-heal-daemon: enable
features.cache-invalidation: on
features.cache-invalidation-timeout: 600
performance.stat-prefetch: on
performance.md-cache-timeout: 60
network.inode-lru-limit: 90000
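
For reference, the cache-related options in that listing could be applied with the usual 'gluster volume set VOLNAME OPTION VALUE' command. The sketch below only builds and prints the command lines; it does not run them. The volume name and values are copied from the listing, but treating exactly these five options as "the md-cache settings" is an assumption, and the helper name is made up.

```python
# Volume name and option values are copied from the 'volume info' listing;
# selecting these five as "the md-cache settings" is an assumption.
VOLUME = "g1"
MD_CACHE_OPTIONS = {
    "features.cache-invalidation": "on",
    "features.cache-invalidation-timeout": "600",
    "performance.stat-prefetch": "on",
    "performance.md-cache-timeout": "60",
    "network.inode-lru-limit": "90000",
}


def volume_set_commands(volume, options):
    """Build one 'gluster volume set' command line per option (not executed)."""
    return [f"gluster volume set {volume} {key} {value}"
            for key, value in options.items()]


for command in volume_set_commands(VOLUME, MD_CACHE_OPTIONS):
    print(command)
```

Printing rather than executing keeps the sketch safe to run anywhere; the output can be reviewed and pasted into a shell on one of the servers.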

The article suggests a value of 600 for performance.md-cache-timeout
but my Gluster version only permits a maximum value of 60.

Network speed between the two VMs is about 120 MBytes/sec - the two
VMs inhabit the same Amazon Virtual Private Cloud - so I don't think
bandwidth is a factor.

The roughly fourfold slowdown (25 seconds through the mount against
6.4 seconds from the brick) is no doubt the penalty incurred in moving
to a proper distributed filesystem.  That article and other web pages
I've read all say that each open of a file results in synchronous
LOOKUP operations on all the replicas, so I'm guessing it just takes
that much time for everything to happen before a file can be opened.

Gluster profiling shows that there are 11,198 LOOKUP operations on the
test cat of the 7,987 files.
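
To put those totals on a per-file footing, the following sketch divides the timings and the LOOKUP count quoted in this message by the number of files. Only the figures come from the message; the names are illustrative, and the per-file times are simple averages that say nothing about how the time is distributed.

```python
# Figures quoted in the message.
FILES = 7_987
LOOKUPS = 11_198
TIMINGS_S = {
    "native XFS": 3.0,
    "brick directory": 6.4,
    "glusterfs mount": 25.0,
}


def per_file_ms(seconds, n_files=FILES):
    """Average wall-clock time per file, in milliseconds."""
    return seconds * 1000 / n_files


for label, seconds in TIMINGS_S.items():
    print(f"{label:16s} {per_file_ms(seconds):.2f} ms/file")

lookups_per_file = LOOKUPS / FILES
mount_vs_brick = TIMINGS_S["glusterfs mount"] / TIMINGS_S["brick directory"]
print(f"LOOKUPs per file: {lookups_per_file:.2f}")
print(f"mount vs brick:   {mount_vs_brick:.1f}x")
```

On these numbers the mount costs a little over 3 ms per file against well under 1 ms on the brick, with about 1.4 LOOKUPs per file, which fits the picture of per-file round trips rather than bandwidth being the limit.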

As a Gluster newbie I'd appreciate some quick advice if possible -

1.  Is this sort of performance hit - on directories of small files -
typical for such a simple Gluster configuration?

2.  Is there anything I can do to speed things up?  :-)

3.  Repeating the 'cat' test immediately after the first test run saw
the time dive from 25 seconds down to 4 seconds.  Before I'd set those
md-cache variables it had taken 17 seconds, due, I assume, to the
actual file data being cached in the Linux buffer cache.  So those
md-cache settings really did make a change - taking off another 13
seconds - once everything was cached.

Flushing/invalidating the Linux memory cache made the next test go
back to the 25 seconds.  So it seems to me that the md-cache must hold
its contents in the Linux memory buffers cache ... which surprised me,
because I thought a user-space system like Gluster would have the
cache within the daemons or maybe a shared memory segment, nothing
that would be affected by clearing the Linux buffer cache.  I was
expecting a run after invalidating the Linux cache would take
something between 4 seconds and 25 seconds, with the md-cache still
primed but the file data expired.

Just out of curiosity in how the md-cache is implemented ... why does
clearing the Linux buffers seem to affect it?

4.  The documentation says that Geo Gluster does 'asynchronous
replication', which is something that would really help, but that it's
'master/slave', so I'm assuming that Geo Gluster won't fulfill my
requirements of both servers being able to occasionally
write/modify/delete files?

5.  In my brick directory I have a '.trashcan' subdirectory - which is
documented - but also a '.glusterfs' directory, which seems to have
lots of magical files in some sort of housekeeping structure.
Surprisingly the total amount of data under .glusterfs is greater than
the total size of the actual files in my test directory.  I haven't
seen a description of what .glusterfs is used for ... are they vital
to the operation of Gluster, or can they be deleted?  Just curious.

At one stage I had 1.1GB of files in my volume, which expanded to be
1.5GB in the brick (due to the metadata extents) and a whopping 1.6GB
of extra data materialized under the .glusterfs directory!

6.  Since I'm using CentOS I try to stick with things that are
available through the Red Hat repository channel ... so in my looking
for distributed filesystems I saw mention of Ceph.  Because I wanted
only a simple replicated filesystem it seemed to me that Ceph - being
based/focused on 'object' storage? - wouldn't be as good a fit as
Gluster.  Evil question to a Gluster mailing list - will Ceph give me
any significantly better performance in reading small files?

I've tried to investigate and find out what I can but I could be
missing something really obvious in my ignorance, so I would
appreciate any quick tips/answers from the experts.  Thanks!

_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://lists.gluster.org/mailman/listinfo/gluster-users