Hi David,
It's difficult to find anything structured (but it's the same for Linux and other tech). I use Red Hat's documentation, guides found online (cross-checking the options against the official documentation), and experience shared on the mailing list.
I don't see anything (in /var/lib/gluster/groups) that matches your profile, but I think you should try with performance.read-ahead and performance.readdir-ahead set to 'off' (a quick sketch of the commands is below the link). I have also found a bug (didn't read the whole report) that might be interesting for you:
https://bugzilla.redhat.com/show_bug.cgi?id=1601166
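
A sketch of turning those two options off (please cross-check the option names with 'gluster volume get gvol0 all' on your version before applying; gvol0 is the volume name from your info output below):

# gluster volume set gvol0 performance.read-ahead off
# gluster volume set gvol0 performance.readdir-ahead off
# gluster volume get gvol0 performance.read-ahead     <- confirm the new value took effect

As far as I remember these take effect online without a remount, but please verify that in the docs.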
Also, an arbiter is very important in order to avoid split-brain situations (though based on my experience, issues can still occur), and it's best for the arbiter brick to be on an SSD, as it needs to process the metadata as fast as possible. With v7 there is an option for the client to have an arbiter even in the cloud (remote arbiter) that is used only when one data brick is down.
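If you later bring the arbiter back into the volume, the conversion from your current replica 2 back to replica 3 arbiter 1 is roughly like this (a sketch only - 'gfs3' and the brick path are placeholders for your arbiter node, and the add-brick/arbiter documentation for your version should be checked first):

# gluster volume add-brick gvol0 replica 3 arbiter 1 gfs3:/nodirectwritedata/gluster/gvol0
# gluster volume heal gvol0 info     <- watch the entries drain as the arbiter gets its metadata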
Please report the issue with the cache - that should not be like that.
Are you using Jumbo frames (MTU 9000)?
What is your bricks' I/O scheduler?
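
To check both quickly on each brick node (a sketch - replace 'eth0' and 'sdb' with your actual NIC and brick block device):

# ip link show eth0 | grep -o 'mtu [0-9]*'
# cat /sys/block/sdb/queue/scheduler     <- the scheduler shown in [brackets] is the active one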
Best Regards,
Strahil Nikolov
Hi Strahil,

We may have had a heal, since the GFS arbiter node wasn't accessible from the GFS clients, only from the other GFS servers. Unfortunately we haven't been able to reproduce the problem seen in production while testing, so we are unsure whether making the GFS arbiter node directly available to clients has fixed the issue.

The load on GFS is mainly:
1. There are a small number of files around 5MB in size which are read often and change infrequently.
2. There are a large number of directories which are frequently opened to read their list of contents.
3. There are a large number of new files around 5MB in size written frequently and read infrequently.

We haven't touched the tuning options as we don't really feel qualified to tell what needs changing from the defaults. Do you know of any suitable guides to get started?

For some reason performance.cache-size is reported as both 32MB and 128MB. Is it worth reporting even for version 5.6?

Here is the "gluster volume info" taken on the first node. Note that the third node (the arbiter) is currently taken out of the cluster:

Volume Name: gvol0
Type: Replicate
Volume ID: fb5af69e-1c3e-4164-8b23-c1d7bec9b1b6
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: gfs1:/nodirectwritedata/gluster/gvol0
Brick2: gfs2:/nodirectwritedata/gluster/gvol0
Options Reconfigured:
diagnostics.client-log-level: INFO
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet

Thanks for your help and advice.

On Sat, 28 Dec 2019 at 17:46, Strahil <hunter86_bg@yahoo.com> wrote:

Hi David,
It seems that I have misread your quorum options, so just ignore that from my previous e-mail.
Best Regards,
Strahil Nikolov

On Dec 27, 2019 15:38, Strahil <hunter86_bg@yahoo.com> wrote:

Hi David,
Gluster supports live rolling upgrade, so there is no need to redeploy at all - but the migration notes should be checked as some features must be disabled first.
Also, the gluster clients should remount in order to pick up the bumped gluster op-version.

What kind of workload do you have? I'm asking as there are predefined (and recommended) groups of settings located in /var/lib/gluster/groups. You can check the options in each group and cross-check their meaning in the docs before activating a setting.

I still have a vague feeling that, during that high peak of network bandwidth, there was a heal going on. Have you checked that? (A quick way to check is sketched just below.)
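
Something like this, run on one of the gluster servers (plain 'heal info' exists on all recent versions; 'statistics heal-count' should too, but double-check on 5.6):

# gluster volume heal gvol0 info
# gluster volume heal gvol0 statistics heal-count     <- non-zero counts while the clients are busy would point to ongoing heals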
Also, sharding is very useful when you work with large files, as a heal is then reduced to the size of the shard.
N.B.: Once sharding is enabled, DO NOT DISABLE it - you will lose your data.
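
For reference, it is just a volume option (a sketch - and per the warning above, only set it on a volume where you intend to keep it, ideally before the large files are written, since files already on the volume are not sharded retroactively):

# gluster volume set gvol0 features.shard on
# gluster volume set gvol0 features.shard-block-size 64MB     <- 64MB is the default shown in your option dump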
Using Gluster v7.1 (soon available on CentOS & Debian) lets you use the latest features and optimizations, and support from the Gluster dev community is quite active.
P.S: I'm wondering how 'performance.cache-size' can both be 32 MB and 128 MB. Please double-check this (maybe I'm reading it wrong on my smartphone) and if needed raise a bug on bugzilla.redhat.com
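
You can confirm what is actually set on the volume with:

# gluster volume get gvol0 performance.cache-size

(My guess is that the two lines in the full 'get all' output are the io-cache and quick-read translators each reporting their own default, 32MB and 128MB respectively, but it is still worth confirming and reporting if it looks wrong.)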
P.S2: Please provide 'gluster volume info', as 'cluster.quorum-type' -> 'none' is not normal for replicated volumes (arbiters are used in replica volumes).
According to the documentation (https://docs.gluster.org/en/latest/Administrator%20Guide/arbiter-volumes-and-quorum/):
Note: Enabling the arbiter feature automatically configures client-quorum to 'auto'. This setting is not to be changed.
Here is my output (Hyperconverged Virtualization Cluster -> oVirt):
# gluster volume info engine | grep quorum
cluster.quorum-type: auto
cluster.server-quorum-type: server

Changing quorum is riskier than other options, so you need to take the necessary precautions. I think we all know what will happen if the cluster is out of quorum and you change the quorum settings to more stringent ones :D
P.S3: If you decide to reset your gluster volume to the defaults, you can create a new volume (same type as the current one), then get the options for that volume, put them in a file, and bulk deploy via 'gluster volume set <Original Volume> group custom-group', where the file is located on every gluster server in the '/var/lib/gluster/groups' directory. A rough sketch follows.
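
Roughly (a sketch with illustrative names - 'defaults-vol' and 'custom-defaults' are placeholders, the groups directory is usually /var/lib/glusterd/groups on packaged installs, and you would want to prune empty or (null) values from the generated file before using it):

# gluster volume create defaults-vol replica 2 gfs1:/bricks/scratch1 gfs2:/bricks/scratch2     <- throw-away volume of the same type
# gluster volume get defaults-vol all | awk 'NR>2 {print $1"="$2}' > /var/lib/glusterd/groups/custom-defaults
(copy custom-defaults to the same directory on every gluster server)
# gluster volume set gvol0 group custom-defaults
# gluster volume stop defaults-vol && gluster volume delete defaults-vol     <- get rid of the sample volume, as mentioned below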
Last, get rid of the sample volume.

Best Regards,
Strahil Nikolov

On Dec 27, 2019 03:22, David Cunningham <dcunningham@voisonics.com> wrote:

Hi Strahil,

Our volume options are as below. Thanks for the suggestion to upgrade to version 6 or 7. We could do that by simply removing the current installation and installing the new one (since it's not live right now). We might have to convince the customer that it's likely to succeed though, as at the moment I think they believe that GFS is not going to work for them.

Option Value
------ -----
cluster.lookup-unhashed on
cluster.lookup-optimize on
cluster.min-free-disk 10%
cluster.min-free-inodes 5%
cluster.rebalance-stats off
cluster.subvols-per-directory (null)
cluster.readdir-optimize off
cluster.rsync-hash-regex (null)
cluster.extra-hash-regex (null)
cluster.dht-xattr-name trusted.glusterfs.dht
cluster.randomize-hash-range-by-gfid off
cluster.rebal-throttle normal
cluster.lock-migration off
cluster.force-migration off
cluster.local-volume-name (null)
cluster.weighted-rebalance on
cluster.switch-pattern (null)
cluster.entry-change-log on
cluster.read-subvolume (null)
cluster.read-subvolume-index -1
cluster.read-hash-mode 1
cluster.background-self-heal-count 8
cluster.metadata-self-heal on
cluster.data-self-heal on
cluster.entry-self-heal on
cluster.self-heal-daemon on
cluster.heal-timeout 600
cluster.self-heal-window-size 1
cluster.data-change-log on
cluster.metadata-change-log on
cluster.data-self-heal-algorithm (null)
cluster.eager-lock on
disperse.eager-lock on
disperse.other-eager-lock on
disperse.eager-lock-timeout 1
disperse.other-eager-lock-timeout 1
cluster.quorum-type none
cluster.quorum-count (null)
cluster.choose-local true
cluster.self-heal-readdir-size 1KB
cluster.post-op-delay-secs 1
cluster.ensure-durability on
cluster.consistent-metadata no
cluster.heal-wait-queue-length 128
cluster.favorite-child-policy none
cluster.full-lock yes
cluster.stripe-block-size 128KB
cluster.stripe-coalesce true
diagnostics.latency-measurement off
diagnostics.dump-fd-stats off
diagnostics.count-fop-hits off
diagnostics.brick-log-level INFO
diagnostics.client-log-level INFO
diagnostics.brick-sys-log-level CRITICAL
diagnostics.client-sys-log-level CRITICAL
diagnostics.brick-logger (null)
diagnostics.client-logger (null)
diagnostics.brick-log-format (null)
diagnostics.client-log-format (null)
diagnostics.brick-log-buf-size 5
diagnostics.client-log-buf-size 5
diagnostics.brick-log-flush-timeout 120
diagnostics.client-log-flush-timeout 120
diagnostics.stats-dump-interval 0
diagnostics.fop-sample-interval 0
diagnostics.stats-dump-format json
diagnostics.fop-sample-buf-size 65535
diagnostics.stats-dnscache-ttl-sec 86400
performance.cache-max-file-size 0
performance.cache-min-file-size 0
performance.cache-refresh-timeout 1
performance.cache-priority
performance.cache-size 32MB
performance.io-thread-count 16
performance.high-prio-threads 16
performance.normal-prio-threads 16
performance.low-prio-threads 16
performance.least-prio-threads 1
performance.enable-least-priority on
performance.iot-watchdog-secs (null)
performance.iot-cleanup-disconnected-reqs off
performance.iot-pass-through false
performance.io-cache-pass-through false
performance.cache-size 128MB
performance.qr-cache-timeout 1
performance.cache-invalidation false
performance.ctime-invalidation false
performance.flush-behind on
performance.nfs.flush-behind on
performance.write-behind-window-size 1MB
performance.resync-failed-syncs-after-fsync off
performance.nfs.write-behind-window-size 1MB
performance.strict-o-direct off
performance.nfs.strict-o-direct off
performance.strict-write-ordering off
performance.nfs.strict-write-ordering off
performance.write-behind-trickling-writes on
performance.aggregate-size 128KB
performance.nfs.write-behind-trickling-writes on
performance.lazy-open yes
performance.read-after-open yes
performance.open-behind-pass-through false
performance.read-ahead-page-count 4
performance.read-ahead-pass-through false
performance.readdir-ahead-pass-through false
performance.md-cache-pass-through false
performance.md-cache-timeout 1
performance.cache-swift-metadata true
performance.cache-samba-metadata false
performance.cache-capability-xattrs true
performance.cache-ima-xattrs true
performance.md-cache-statfs off
performance.xattr-cache-list
performance.nl-cache-pass-through false
features.encryption off
encryption.master-key (null)
encryption.data-key-size 256
encryption.block-size 4096
network.frame-timeout 1800
network.ping-timeout 42
network.tcp-window-size (null)
network.remote-dio disable
client.event-threads 2
client.tcp-user-timeout 0
client.keepalive-time 20
client.keepalive-interval 2
client.keepalive-count 9
network.tcp-window-size (null)
network.inode-lru-limit 16384
auth.allow *
auth.reject (null)
transport.keepalive 1
server.allow-insecure on
server.root-squash off
server.anonuid 65534
server.anongid 65534
server.statedump-path /var/run/gluster
server.outstanding-rpc-limit 64
server.ssl (null)
auth.ssl-allow *
server.manage-gids off
server.dynamic-auth on
client.send-gids on
server.gid-timeout 300
server.own-thread (null)
server.event-threads 1
server.tcp-user-timeout 0
server.keepalive-time 20
server.keepalive-interval 2
server.keepalive-count 9
transport.listen-backlog 1024
ssl.own-cert (null)
ssl.private-key (null)
ssl.ca-list (null)
ssl.crl-path (null)
ssl.certificate-depth (null)
ssl.cipher-list (null)
ssl.dh-param (null)
ssl.ec-curve (null)
transport.address-family inet
performance.write-behind on
performance.read-ahead on
performance.readdir-ahead on
performance.io-cache on
performance.quick-read on
performance.open-behind on
performance.nl-cache off
performance.stat-prefetch on
performance.client-io-threads off
performance.nfs.write-behind on
performance.nfs.read-ahead off
performance.nfs.io-cache off
performance.nfs.quick-read off
performance.nfs.stat-prefetch off
performance.nfs.io-threads off
performance.force-readdirp true
performance.cache-invalidation false
features.uss off
features.snapshot-directory .snaps
features.show-snapshot-directory off
features.tag-namespaces off
network.compression off
network.compression.window-size -15
network.compression.mem-level 8
network.compression.min-size 0
network.compression.compression-level -1
network.compression.debug false
features.default-soft-limit 80%
features.soft-timeout 60
features.hard-timeout 5
features.alert-time 86400
features.quota-deem-statfs off
geo-replication.indexing off
geo-replication.indexing off
geo-replication.ignore-pid-check off
geo-replication.ignore-pid-check off
features.quota off
features.inode-quota off
features.bitrot disable
debug.trace off
debug.log-history no
debug.log-file no
debug.exclude-ops (null)
debug.include-ops (null)
debug.error-gen off
debug.error-failure (null)
debug.error-number (null)
debug.random-failure off
debug.error-fops (null)
nfs.disable on
features.read-only off
features.worm off
features.worm-file-level off
features.worm-files-deletable on
features.default-retention-period 120
features.retention-mode relax
features.auto-commit-period 180
storage.linux-aio off
storage.batch-fsync-mode reverse-fsync
storage.batch-fsync-delay-usec 0
storage.owner-uid -1
storage.owner-gid -1
storage.node-uuid-pathinfo off
storage.health-check-interval 30
storage.build-pgfid off
storage.gfid2path on
storage.gfid2path-separator :
storage.reserve 1
storage.health-check-timeout 10
storage.fips-mode-rchecksum off
storage.force-create-mode 0000
storage.force-directory-mode 0000
storage.create-mask 0777
storage.create-directory-mask 0777
storage.max-hardlinks 100
storage.ctime off
storage.bd-aio off
config.gfproxyd off
cluster.server-quorum-type off
cluster.server-quorum-ratio 0
changelog.changelog off
changelog.changelog-dir {{ brick.path }}/.glusterfs/changelogs
changelog.encoding ascii
changelog.rollover-time 15
changelog.fsync-interval 5
changelog.changelog-barrier-timeout 120
changelog.capture-del-path off
features.barrier disable
features.barrier-timeout 120
features.trash off
features.trash-dir .trashcan
features.trash-eliminate-path (null)
features.trash-max-filesize 5MB
features.trash-internal-op off
cluster.enable-shared-storage disable
cluster.write-freq-threshold 0
cluster.read-freq-threshold 0
cluster.tier-pause off
cluster.tier-promote-frequency 120
cluster.tier-demote-frequency 3600
cluster.watermark-hi 90
cluster.watermark-low 75
cluster.tier-mode cache
cluster.tier-max-promote-file-size 0
cluster.tier-max-mb 4000
cluster.tier-max-files 10000
cluster.tier-query-limit 100
cluster.tier-compact on
cluster.tier-hot-compact-frequency 604800
cluster.tier-cold-compact-frequency 604800
features.ctr-enabled off
features.record-counters off
features.ctr-record-metadata-heat off
features.ctr_link_consistency off
features.ctr_lookupheal_link_timeout 300
features.ctr_lookupheal_inode_timeout 300
features.ctr-sql-db-cachesize 12500
features.ctr-sql-db-wal-autocheckpoint 25000
features.selinux on
locks.trace off
locks.mandatory-locking off
cluster.disperse-self-heal-daemon enable
cluster.quorum-reads no
client.bind-insecure (null)
features.shard off
features.shard-block-size 64MB
features.shard-lru-limit 16384
features.shard-deletion-rate 100
features.scrub-throttle lazy
features.scrub-freq biweekly
features.scrub false
features.expiry-time 120
features.cache-invalidation off
features.cache-invalidation-timeout 60
features.leases off
features.lease-lock-recall-timeout 60
disperse.background-heals 8
disperse.heal-wait-qlength 128
cluster.heal-timeout 600
dht.force-readdirp on
disperse.read-policy gfid-hash
cluster.shd-max-threads 1
cluster.shd-wait-qlength 1024
cluster.locking-scheme full
cluster.granular-entry-heal no
features.locks-revocation-secs 0
features.locks-revocation-clear-all false
features.locks-revocation-max-blocked 0
features.locks-monkey-unlocking false
features.locks-notify-contention no
features.locks-notify-contention-delay 5
disperse.shd-max-threads 1
disperse.shd-wait-qlength 1024
disperse.cpu-extensions auto
disperse.self-heal-window-size 1
cluster.use-compound-fops off
performance.parallel-readdir off
performance.rda-request-size 131072
performance.rda-low-wmark 4096
performance.rda-high-wmark 128KB
performance.rda-cache-limit 10MB
performance.nl-cache-positive-entry false
performance.nl-cache-limit 10MB
performance.nl-cache-timeout 60
cluster.brick-multiplex off
cluster.max-bricks-per-process 0
disperse.optimistic-change-log on
disperse.stripe-cache 4
cluster.halo-enabled False
cluster.halo-shd-max-latency 99999
cluster.halo-nfsd-max-latency 5
cluster.halo-max-latency 5
cluster.halo-max-replicas 99999
cluster.halo-min-replicas 2
cluster.daemon-log-level INFO
debug.delay-gen off
delay-gen.delay-percentage 10%
delay-gen.delay-duration 100000
delay-gen.enable
disperse.parallel-writes on
features.sdfs on
features.cloudsync off
features.utime off
ctime.noatime on
feature.cloudsync-storetype (null)

Thanks again.

On Wed, 25 Dec 2019 at 05:51, Strahil <hunter86_bg@yahoo.com> wrote:

Hi David,
On Dec 24, 2019 02:47, David Cunningham <dcunningham@voisonics.com> wrote:
>
> Hello,
>
> In testing we found that actually the GFS client having access to all 3 nodes made no difference to performance. Perhaps that's because the 3rd node that wasn't accessible from the client before was the arbiter node?
It makes sense, as no data is being generated towards the arbiter.
> Presumably we shouldn't have an arbiter node listed under backupvolfile-server when mounting the filesystem? Since it doesn't store all the data surely it can't be used to serve the data.

I have my arbiter defined as the last backup and no issues so far. At least the admin can easily identify the bricks from the mount options.
> We did have direct-io-mode=disable already as well, so that wasn't a factor in the performance problems.
Have you checked that the client version is not too old?
Also, you can check the cluster's operation version:
# gluster volume get all cluster.max-op-version
# gluster volume get all cluster.op-version

The cluster's op-version should be at the max-op-version.
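If it is lower and all servers and clients are already on the newer version, bumping it is a single command (a sketch - substitute the number that max-op-version reports):

# gluster volume set all cluster.op-version <max-op-version-number>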
Two options come to mind:
A) Upgrade to the latest Gluster v6 or even v7 (I know it won't be easy) and then set the op-version to the highest possible.
# gluster volume get all cluster.max-op-version
# gluster volume get all cluster.op-version

B) Deploy an NFS Ganesha server and connect the client over NFS v4.2 (and control the parallel connections from Ganesha).
Can you provide your Gluster volume's options?
'gluster volume get <VOLNAME> all'

> Thanks again for any advice.
>
>
>
> On Mon, 23 Dec 2019 at 13:09, David Cunningham <dcunningham@voisonics.com> wrote:
>>
>> Hi Strahil,
>>
>> Thanks for that. We do have one backup server specified, but will add the second backup as well.
>>
>>
>> On Sat, 21 Dec 2019 at 11:26, Strahil <hunter86_bg@yahoo.com> wrote:
>>>
>>> Hi David,
>>>
>>> Also consider using the mount option to specify backup servers via 'backupvolfile-server=server2:server3' (you can define more, but I don't think replica volumes greater than 3 are useful, except maybe in some special cases).
>>>
>>> In such way, when the primary is lost, your client can reach a backup one without disruption.
>>>
>>> P.S.: The client may 'hang' - if the primary server got rebooted ungracefully - as the communication must time out before FUSE addresses the next server. There is a special script for killing gluster processes in '/usr/share/gluster/scripts' which can be used for setting up a systemd service to do that for you on shutdown.
>>>
>>> Best Regards,
>>> Strahil Nikolov
>>>
>>> On Dec 20, 2019 23:49, David Cunningham <dcunningham@voisonics.com> wrote:
>>>>
>>>> Hi Strahil,
>>>>
>>>> Ah, that is an important point. One of the nodes is not accessible from the client, and we assumed that it only needed to reach the GFS node that was mounted so didn't think anything of it.
>>>>
>>>> We will try making all nodes accessible, as well as "direct-io-mode=disable".
>>>>
>>>> Thank you.
>>>>
>>>>
>>>> On Sat, 21 Dec 2019 at 10:29, Strahil Nikolov <hunter86_bg@yahoo.com> wrote:
>>>>>
>>>>> Actually, I hadn't made myself clear.
>>>>> FUSE mounts on the client side connect directly to all of the bricks that make up the volume.
>>>>> If for some reason (bad routing, a firewall block) the client can only reach 2 out of 3 bricks, this can constantly cause healing to happen (as one of the bricks is never updated), which will degrade performance and cause excessive network usage.
>>>>> As your attachment is from one of the gluster nodes, this could be the case.
>>>>>
>>>>> Best Regards,
>>>>> Strahil Nikolov
>>>>>
>>>>> On Friday, 20 December 2019, 01:49:56 GMT+2, David Cunningham <dcunningham@voisonics.com> wrote:
>>>>>
>>>>>
>>>>> Hi Strahil,
>>>>>
>>>>> The chart attached to my original email is taken from the GFS server.
>>>>>
>>>>> I'm not sure what you mean by accessing all bricks simultaneously. We've mounted it from the client like this:
>>>>> gfs1:/gvol0 /mnt/glusterfs/ glusterfs defaults,direct-io-mode=disable,_netdev,backupvolfile-server=gfs2,fetch-attempts=10 0 0
>>>>>
>>>>> Should we do something different to access all bricks simultaneously?
>>>>>
>>>>> Thanks for your help!
>>>>>
>>>>>
>>>>> On Fri, 20 Dec 2019 at 11:47, Strahil Nikolov <hunter86_bg@yahoo.com> wrote:
>>>>>>
>>>>>> I'm not sure if you measured the traffic from the client side (tcpdump on a client machine) or from the server side.
>>>>>>
>>>>>> In both cases, please verify that the client accesses all bricks simultaneously, as failing to do so can cause unnecessary heals.
>>>>>>
>>>>>> Have you thought about upgrading to v6? There are some enhancements in v6 which could be beneficial.
>>>>>>
>>>>>> Yet, it is indeed strange that so much traffic is generated with FUSE.
>>>>>>
>>>>>> Another approach is to test with NFS Ganesha, which supports pNFS and can natively speak with Gluster; this can bring you closer to the previous setup and also provide some extra performance.
>>>>>>
>>>>>>
>>>>>> Best Regards,
>>>>>> Strahil Nikolov
>>>>>>
>>>>>>
>>>>>>
>>
>>
>> --
>> David Cunningham, Voisonics Limited
>> http://voisonics.com/
>> USA: +1 213 221 1092
>> New Zealand: +64 (0)28 2558 3782
>
>
>
> --
> David Cunningham, Voisonics Limited
> http://voisonics.com/
> USA: +1 213 221 1092
> New Zealand: +64 (0)28 2558 3782

Best Regards,
Strahil Nikolov
--
David Cunningham, Voisonics Limited
http://voisonics.com/
USA: +1 213 221 1092
New Zealand: +64 (0)28 2558 3782
--
David Cunningham, Voisonics Limited
http://voisonics.com/
USA: +1 213 221 1092
New Zealand: +64 (0)28 2558 3782