Thanks for the details. Response inline -
On 4/1/19 9:45 PM, Jim Kinney wrote:
On Sun, 2019-03-31 at 23:01 +0530, Soumya Koduri wrote:
On 3/29/19 10:39 PM, Poornima Gurusiddaiah wrote:
On Fri, Mar 29, 2019, 10:03 PM Jim Kinney <jim.kinney@xxxxxxxxx> wrote:
Currently running 3.12 on CentOS 7.6. Doing cleanups on split-brain and
out-of-sync files that need healing.
We need to migrate the three replica servers to gluster v5 or v6. We
will also need to upgrade about 80 clients. Given that a complete
removal of gluster will not touch the 200+TB of data on 12 volumes, we
are considering that process: stop all clients, stop all glusterd
services, remove all of it, install the new version, set up new volumes
from the old bricks, install new clients, and mount everything.
We would like to get better performance from nfs-ganesha mounts, but
that doesn't look like an option (we have not done any parameter tweaks
in testing yet). At a bare minimum, we would like to minimize the total
downtime of all systems.
Could you please be more specific here? Are you looking for better
performance during the upgrade process, or in general? Compared to 3.12,
there are a lot of performance improvements in both the glusterfs and,
especially, the nfs-ganesha (latest stable - V2.7.x) stack. If you could
provide more information about your workloads (e.g., large-file,
small-file, metadata-intensive), we can make some recommendations with
respect to configuration.
Sure. More details:
We are (soon to be) running a three-node replica-only gluster service (2
nodes now; the third is racked, ready for sync, and being added to the
gluster cluster). Each node has 2 external drive arrays plus one
internal. Each node has 40G IB plus 40G IP connections (with plans to
upgrade to 100G). We currently have 9 volumes, each between 7TB and 50TB
of space. Each volume is a mix of thousands of large files (>1GB), tens
of thousands of small files (~100KB), plus thousands in between.
Currently we have a 13-node computational cluster with varying GPU
abilities that mounts all of these volumes using gluster-fuse. Writes
are slow, and reads perform as if coming from a single server. I have
data from a test setup (not anywhere near the capacity of the production
system - just for testing commands and recoveries) that indicates raw
NFS without gluster is much faster, while gluster-fuse is much slower.
We have mmap issues with Python and fuse-mounted locations; converting
to NFS solves this. We have tinkered with kernel settings to handle the
oom-killer so it will no longer drop glusterfs when an errant job eats
all the RAM (we set oom_score_adj to -1000 for all glusterfs PIDs).
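For reference, the oom_score_adj tweak is just a loop like the following
(the pgrep pattern is ours and may need adjusting for your daemons):

   # protect the gluster daemons from the oom-killer (run as root on each node)
   for pid in $(pgrep -f 'glusterd|glusterfs'); do
       echo -1000 > /proc/$pid/oom_score_adj
   done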
Have you tried tuning any perf parameters? From the volume options you
have shared below, I see that there is scope to improve performance (for
example, by enabling the md-cache parameters and parallel-readdir, the
latency of metadata-related operations can be improved). I request
Poornima, Xavi or Du to comment on recommended values for better I/O
throughput for your workload.
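Just as an illustration (exact values depend on your workload and
available memory; these are the relevant options visible in your output
below, with commonly suggested values - please confirm with the perf
folks before applying them on the production volume):

   gluster volume set home performance.md-cache-timeout 600
   gluster volume set home performance.cache-invalidation on
   gluster volume set home features.cache-invalidation on
   gluster volume set home features.cache-invalidation-timeout 600
   gluster volume set home network.inode-lru-limit 200000
   gluster volume set home performance.parallel-readdir on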
We would like to transition (smoothly!!) to gluster 5 or 6 with
nfs-ganesha 2.7 and see some performance improvements. We will be using
corosync and pacemaker for NFS failover. It would be fantastic to be
able to saturate a 10G IPoIB (or 40G IB!) connection to each compute
node in the current computational cluster. Right now we absolutely can't
get much write speed (copying a 6.2GB file from a host to gluster
storage took 1m 21s; cp from local disk to /dev/null is 7s). cp from
gluster to /dev/null takes 1.0m for the same 6.2GB file. That's a 10Gbps
IPoIB connection running at only about 800Mbps.
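(For what it's worth: 6.2GB in 81s is roughly 77MB/s, about 0.6Gbps for
the write, and 6.2GB in 60s is roughly 103MB/s, about 0.8Gbps for the
read. Those numbers come from a simple timed cp; a repeatable version
that avoids the client page cache would be something like the following,
where the mount point and file name are placeholders:)

   # write ~6.2GB of zeros and force it to disk, then drop caches and read it back
   dd if=/dev/zero of=/mnt/home/ddtest bs=1M count=6200 conv=fsync
   echo 3 > /proc/sys/vm/drop_caches
   dd if=/mnt/home/ddtest of=/dev/null bs=1M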
A few things to note here -
* The volume option "nfs.disable" refers to the GlusterNFS service,
which is deprecated and not enabled by default in the latest gluster
versions (such as gluster 5 & 6). We recommend NFS-Ganesha, and hence
this option needs to be turned on (to disable GlusterNFS); see the
example at the end of these notes.
* Starting from Gluster 3.11, the HA configuration bits for NFS-Ganesha
have been removed from the gluster codebase. So you would need to either
manually configure an HA service on top of the NFS-Ganesha servers
(e.g., with pacemaker/corosync, as you already plan to - a minimal
example is included in the sketch below) or use storhaug [1] to
configure the same.
* Coming to the technical aspects: by switching to NFS, you could
benefit from the heavy caching done by the NFS client and a few other
optimizations it does. The NFS-Ganesha server also does metadata caching
and resides on the same nodes as the glusterfs servers. Apart from that,
NFS-Ganesha acts like any other glusterfs client (but using libgfapi
instead of a fuse mount). It would be interesting to check if and how
much improvement you get with NFS compared to the fuse protocol for your
workload. Please let us know when you have the test environment ready;
we will make recommendations on a few settings for the NFS-Ganesha
server and client.
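To make these notes concrete, a rough sketch follows (the volume name
'home', hostnames, VIP and paths are placeholders; please treat it as
illustrative rather than a tested recipe):

   # disable GlusterNFS on the volume so it does not conflict with Ganesha on port 2049
   gluster volume set home nfs.disable on

   # /etc/ganesha/ganesha.conf - export the volume through the GLUSTER FSAL (libgfapi)
   EXPORT {
       Export_Id = 1;
       Path = "/home";
       Pseudo = "/home";
       Access_Type = RW;
       Squash = No_root_squash;
       FSAL {
           Name = GLUSTER;
           Hostname = "localhost";   # gluster server this Ganesha instance talks to
           Volume = "home";
       }
   }

   # minimal manual HA with pacemaker: a floating IP that clients mount, plus the ganesha service
   pcs resource create ganesha_vip ocf:heartbeat:IPaddr2 ip=192.168.1.100 cidr_netmask=24 op monitor interval=10s
   pcs resource create nfs_ganesha systemd:nfs-ganesha op monitor interval=30s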
Thanks,
Soumya
[1] https://github.com/linux-ha-storage/storhaug
We would like to do things like enable SSL encryption of all data flows
(we deal with PHI data in a HIPAA-regulated setting) but are concerned
about performance. We are running dual Intel Xeon E5-2630L (12 physical
cores each @ 2.4GHz) and 128GB RAM in each server node. We have 170
users. About 20 are active at any time.
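(For the SSL piece, this is roughly the shape of gluster's TLS support
as we understand it, assuming certificates and keys have already been
distributed to every server and client as /etc/ssl/glusterfs.pem,
/etc/ssl/glusterfs.key and /etc/ssl/glusterfs.ca; the CN list is
illustrative:)

   # enable TLS on the management path (all servers and clients, before restarting glusterd)
   touch /var/lib/glusterd/secure-access
   # enable TLS on the I/O path, per volume
   gluster volume set home client.ssl on
   gluster volume set home server.ssl on
   gluster volume set home auth.ssl-allow 'server1,server2,server3,client1'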
The current settings on /home (others are similar if not identical;
maybe nfs.disable is true for others):
gluster volume get home all
Option Value
------ -----
cluster.lookup-unhashed on
cluster.lookup-optimize off
cluster.min-free-disk 10%
cluster.min-free-inodes 5%
cluster.rebalance-stats off
cluster.subvols-per-directory (null)
cluster.readdir-optimize off
cluster.rsync-hash-regex (null)
cluster.extra-hash-regex (null)
cluster.dht-xattr-name trusted.glusterfs.dht
cluster.randomize-hash-range-by-gfid off
cluster.rebal-throttle normal
cluster.lock-migration off
cluster.local-volume-name (null)
cluster.weighted-rebalance on
cluster.switch-pattern (null)
cluster.entry-change-log on
cluster.read-subvolume (null)
cluster.read-subvolume-index -1
cluster.read-hash-mode 1
cluster.background-self-heal-count 8
cluster.metadata-self-heal on
cluster.data-self-heal on
cluster.entry-self-heal on
cluster.self-heal-daemon enable
cluster.heal-timeout 600
cluster.self-heal-window-size 1
cluster.data-change-log on
cluster.metadata-change-log on
cluster.data-self-heal-algorithm (null)
cluster.eager-lock on
disperse.eager-lock on
cluster.quorum-type none
cluster.quorum-count (null)
cluster.choose-local true
cluster.self-heal-readdir-size 1KB
cluster.post-op-delay-secs 1
cluster.ensure-durability on
cluster.consistent-metadata no
cluster.heal-wait-queue-length 128
cluster.favorite-child-policy none
cluster.stripe-block-size 128KB
cluster.stripe-coalesce true
diagnostics.latency-measurement off
diagnostics.dump-fd-stats off
diagnostics.count-fop-hits off
diagnostics.brick-log-level INFO
diagnostics.client-log-level INFO
diagnostics.brick-sys-log-level CRITICAL
diagnostics.client-sys-log-level CRITICAL
diagnostics.brick-logger (null)
diagnostics.client-logger (null)
diagnostics.brick-log-format (null)
diagnostics.client-log-format (null)
diagnostics.brick-log-buf-size 5
diagnostics.client-log-buf-size 5
diagnostics.brick-log-flush-timeout 120
diagnostics.client-log-flush-timeout 120
diagnostics.stats-dump-interval 0
diagnostics.fop-sample-interval 0
diagnostics.stats-dump-format json
diagnostics.fop-sample-buf-size 65535
diagnostics.stats-dnscache-ttl-sec 86400
performance.cache-max-file-size 0
performance.cache-min-file-size 0
performance.cache-refresh-timeout 1
performance.cache-priority
performance.cache-size 32MB
performance.io-thread-count 16
performance.high-prio-threads 16
performance.normal-prio-threads 16
performance.low-prio-threads 16
performance.least-prio-threads 1
performance.enable-least-priority on
performance.cache-size 128MB
performance.flush-behind on
performance.nfs.flush-behind on
performance.write-behind-window-size 1MB
performance.resync-failed-syncs-after-fsync off
performance.nfs.write-behind-window-size 1MB
performance.strict-o-direct off
performance.nfs.strict-o-direct off
performance.strict-write-ordering off
performance.nfs.strict-write-ordering off
performance.lazy-open yes
performance.read-after-open no
performance.read-ahead-page-count 4
performance.md-cache-timeout 1
performance.cache-swift-metadata true
performance.cache-samba-metadata false
performance.cache-capability-xattrs true
performance.cache-ima-xattrs true
features.encryption off
encryption.master-key (null)
encryption.data-key-size 256
encryption.block-size 4096
network.frame-timeout 1800
network.ping-timeout 42
network.tcp-window-size (null)
features.lock-heal off
features.grace-timeout 10
network.remote-dio disable
client.event-threads 2
client.tcp-user-timeout 0
client.keepalive-time 20
client.keepalive-interval 2
client.keepalive-count 9
network.tcp-window-size (null)
network.inode-lru-limit 16384
auth.allow *
auth.reject (null)
transport.keepalive 1
server.allow-insecure (null)
server.root-squash off
server.anonuid 65534
server.anongid 65534
server.statedump-path /var/run/gluster
server.outstanding-rpc-limit 64
features.lock-heal off
features.grace-timeout 10
server.ssl (null)
auth.ssl-allow *
server.manage-gids off
server.dynamic-auth on
client.send-gids on
server.gid-timeout 300
server.own-thread (null)
server.event-threads 1
server.tcp-user-timeout 0
server.keepalive-time 20
server.keepalive-interval 2
server.keepalive-count 9
transport.listen-backlog 10
ssl.own-cert (null)
ssl.private-key (null)
ssl.ca-list (null)
ssl.crl-path (null)
ssl.certificate-depth (null)
ssl.cipher-list (null)
ssl.dh-param (null)
ssl.ec-curve (null)
performance.write-behind on
performance.read-ahead on
performance.readdir-ahead off
performance.io-cache on
performance.quick-read on
performance.open-behind on
performance.nl-cache off
performance.stat-prefetch on
performance.client-io-threads off
performance.nfs.write-behind on
performance.nfs.read-ahead off
performance.nfs.io-cache off
performance.nfs.quick-read off
performance.nfs.stat-prefetch off
performance.nfs.io-threads off
performance.force-readdirp true
performance.cache-invalidation false
features.uss off
features.snapshot-directory .snaps
features.show-snapshot-directory off
network.compression off
network.compression.window-size -15
network.compression.mem-level 8
network.compression.min-size 0
network.compression.compression-level -1
network.compression.debug false
features.limit-usage (null)
features.default-soft-limit 80%
features.soft-timeout 60
features.hard-timeout 5
features.alert-time 86400
features.quota-deem-statfs off
geo-replication.indexing off
geo-replication.indexing off
geo-replication.ignore-pid-check off
geo-replication.ignore-pid-check off
features.quota off
features.inode-quota off
features.bitrot disable
debug.trace off
debug.log-history no
debug.log-file no
debug.exclude-ops (null)
debug.include-ops (null)
debug.error-gen off
debug.error-failure (null)
debug.error-number (null)
debug.random-failure off
debug.error-fops (null)
nfs.enable-ino32 no
nfs.mem-factor 15
nfs.export-dirs on
nfs.export-volumes on
nfs.addr-namelookup off
nfs.dynamic-volumes off
nfs.register-with-portmap on
nfs.outstanding-rpc-limit 16
nfs.port 2049
nfs.rpc-auth-unix on
nfs.rpc-auth-null on
nfs.rpc-auth-allow all
nfs.rpc-auth-reject none
nfs.ports-insecure off
nfs.trusted-sync off
nfs.trusted-write off
nfs.volume-access read-write
nfs.export-dir
nfs.disable off
nfs.nlm on
nfs.acl on
nfs.mount-udp off
nfs.mount-rmtab /var/lib/glusterd/nfs/rmtab
nfs.rpc-statd /sbin/rpc.statd
nfs.server-aux-gids off
nfs.drc off
nfs.drc-size 0x20000
nfs.read-size (1 * 1048576ULL)
nfs.write-size (1 * 1048576ULL)
nfs.readdir-size (1 * 1048576ULL)
nfs.rdirplus on
nfs.exports-auth-enable (null)
nfs.auth-refresh-interval-sec (null)
nfs.auth-cache-ttl-sec (null)
features.read-only off
features.worm off
features.worm-file-level off
features.default-retention-period 120
features.retention-mode relax
features.auto-commit-period 180
storage.linux-aio off
storage.batch-fsync-mode reverse-fsync
storage.batch-fsync-delay-usec 0
storage.owner-uid -1
storage.owner-gid -1
storage.node-uuid-pathinfo off
storage.health-check-interval 30
storage.build-pgfid on
storage.gfid2path on
storage.gfid2path-separator :
storage.bd-aio off
cluster.server-quorum-type off
cluster.server-quorum-ratio 0
changelog.changelog off
changelog.changelog-dir (null)
changelog.encoding ascii
changelog.rollover-time 15
changelog.fsync-interval 5
changelog.changelog-barrier-timeout 120
changelog.capture-del-path off
features.barrier disable
features.barrier-timeout 120
features.trash off
features.trash-dir .trashcan
features.trash-eliminate-path (null)
features.trash-max-filesize 5MB
features.trash-internal-op off
cluster.enable-shared-storage disable
cluster.write-freq-threshold 0
cluster.read-freq-threshold 0
cluster.tier-pause off
cluster.tier-promote-frequency 120
cluster.tier-demote-frequency 3600
cluster.watermark-hi 90
cluster.watermark-low 75
cluster.tier-mode cache
cluster.tier-max-promote-file-size 0
cluster.tier-max-mb 4000
cluster.tier-max-files 10000
cluster.tier-query-limit 100
cluster.tier-compact on
cluster.tier-hot-compact-frequency 604800
cluster.tier-cold-compact-frequency 604800
features.ctr-enabled off
features.record-counters off
features.ctr-record-metadata-heat off
features.ctr_link_consistency off
features.ctr_lookupheal_link_timeout 300
features.ctr_lookupheal_inode_timeout 300
features.ctr-sql-db-cachesize 12500
features.ctr-sql-db-wal-autocheckpoint 25000
features.selinux on
locks.trace off
locks.mandatory-locking off
cluster.disperse-self-heal-daemon enable
cluster.quorum-reads no
client.bind-insecure (null)
features.shard off
features.shard-block-size 64MB
features.scrub-throttle lazy
features.scrub-freq biweekly
features.scrub false
features.expiry-time 120
features.cache-invalidation off
features.cache-invalidation-timeout 60
features.leases off
features.lease-lock-recall-timeout 60
disperse.background-heals 8
disperse.heal-wait-qlength 128
cluster.heal-timeout 600
dht.force-readdirp on
disperse.read-policy gfid-hash
cluster.shd-max-threads 1
cluster.shd-wait-qlength 1024
cluster.locking-scheme full
cluster.granular-entry-heal no
features.locks-revocation-secs 0
features.locks-revocation-clear-all false
features.locks-revocation-max-blocked 0
features.locks-monkey-unlocking false
disperse.shd-max-threads 1
disperse.shd-wait-qlength 1024
disperse.cpu-extensions auto
disperse.self-heal-window-size 1
cluster.use-compound-fops off
performance.parallel-readdir off
performance.rda-request-size 131072
performance.rda-low-wmark 4096
performance.rda-high-wmark 128KB
performance.rda-cache-limit 10MB
performance.nl-cache-positive-entry false
performance.nl-cache-limit 10MB
performance.nl-cache-timeout 60
cluster.brick-multiplex off
cluster.max-bricks-per-process 0
disperse.optimistic-change-log on
cluster.halo-enabled False
cluster.halo-shd-max-latency 99999
cluster.halo-nfsd-max-latency 5
cluster.halo-max-latency 5
cluster.halo-max-replicas
Thanks,
Soumya
Does this process make more sense than a version upgrade process to
4.1, then 5, then 6? What gotchas do I need to be ready for? I have
until late May to prep and test on old, slow hardware with a small
number of files and volumes.
You can upgrade directly from 3.12 to 6.x. I would suggest that rather
than deleting and recreating the Gluster volumes. +Hari and +Sanju for
further guidelines on the upgrade, as they recently did upgrade tests.
+Soumya to add to the nfs-ganesha aspect.
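Roughly, the per-server flow looks like the sketch below (it assumes the
CentOS Storage SIG packages and uses the 'home' volume as an example;
please verify against the official upgrade guide and wait for Hari/Sanju
to confirm the details):

   # one server at a time, after confirming there are no pending heals
   systemctl stop glusterd
   pkill glusterfs; pkill glusterfsd           # stop client and brick processes on this node
   yum install centos-release-gluster6         # Storage SIG repo providing gluster 6 packages
   yum update glusterfs-server
   systemctl start glusterd
   gluster volume heal home info               # wait until pending heals reach zero before the next server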
Regards,
Poornima
--
James P. Kinney III
Every time you stop a school, you will have to build a jail. What you
gain at one end you lose at the other. It's like feeding a dog on his
own tail. It won't fatten the dog.
- Speech 11/23/1900 Mark Twain
http://heretothereideas.blogspot.com/
_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-users