Re: upgrade best practices

Thanks for the details. Response inline -

On 4/1/19 9:45 PM, Jim Kinney wrote:
On Sun, 2019-03-31 at 23:01 +0530, Soumya Koduri wrote:

On 3/29/19 10:39 PM, Poornima Gurusiddaiah wrote:


On Fri, Mar 29, 2019, 10:03 PM Jim Kinney <jim.kinney@xxxxxxxxx> wrote:

     Currently running 3.12 on CentOS 7.6. Doing cleanups on split-brain
     and out-of-sync files that need healing.

     We need to migrate the three replica servers to gluster v. 5 or 6.
     We will also need to upgrade about 80 clients. Given that a
     complete removal of gluster will not touch the 200+TB of data on 12
     volumes, we are looking at doing that process: stop all clients,
     stop all glusterd services, remove all of it, install the new
     version, set up new volumes from the old bricks, install new
     clients, and mount everything.

     We would like to get better performance from nfs-ganesha mounts,
     but that doesn't look like an option yet (we haven't done any
     parameter tweaks in testing so far). At a bare minimum, we would
     like to minimize the total downtime of all systems.

Could you please be more specific here? As in, are you looking for better
performance during the upgrade process or in general? Compared to 3.12,
there are a lot of perf improvements in both the glusterfs and especially the
nfs-ganesha (latest stable - V2.7.x) stack. If you could provide more
information about your workloads (e.g., large-file, small-file,
metadata-intensive), we can make some recommendations with respect to configuration.

Sure. More details:

We are (soon to be) running a three-node, replica-only gluster service (2 nodes now; the third is racked and ready to sync once it is added to the cluster). Each node has 2 external drive arrays plus one internal. Each node has 40G IB plus 40G IP connections (with plans to upgrade to 100G). We currently have 9 volumes, each ranging from 7TB up to 50TB of space. Each volume is a mix of thousands of large files (>1GB), tens of thousands of small files (~100KB), plus thousands in between.

Currently we have a 13-node computational cluster with varying GPU abilities that mounts all of these volumes using gluster-fuse. Writes are slow, and reads also behave as if they come from a single server. I have data from a test setup (nowhere near the capacity of the production system - just for testing commands and recoveries) that indicates raw NFS without gluster is much faster, while gluster-fuse is much slower. We have mmap issues with Python on fuse-mounted locations; converting to NFS solves this. We have tinkered with kernel settings to handle the oom-killer so it will no longer drop glusterfs when an errant job eats all the RAM (set oom_score_adj to -1000 for all glusterfs pids).
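
(For reference, a minimal sketch of that oom_score_adj tweak, assuming the gluster processes are matched by name; to make it persist across restarts, a systemd drop-in with OOMScoreAdjust=-1000 on the gluster units would do the same thing:)

  # protect every running gluster process (glusterd/glusterfs/glusterfsd) from the oom-killer; run as root
  for pid in $(pgrep -f gluster); do
      echo -1000 > /proc/${pid}/oom_score_adj
  done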

Have you tried tuning any perf parameters? From the volume options you have shared below, I see that there is scope to improve performance (e.g., by enabling md-cache parameters and parallel-readdir, the latency of metadata-related operations can be improved). Requesting Poornima, Xavi or Du to comment on recommended values for better I/O throughput for your workload.
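
(For illustration only - these are not tuned recommendations - the md-cache and readdir knobs being referred to would be set along these lines on your 'home' volume; the actual values should come from the perf folks:)

  # example md-cache and readdir tuning (values are placeholders, not recommendations)
  gluster volume set home features.cache-invalidation on
  gluster volume set home features.cache-invalidation-timeout 600
  gluster volume set home performance.cache-invalidation on
  gluster volume set home performance.md-cache-timeout 600
  gluster volume set home network.inode-lru-limit 200000
  gluster volume set home performance.readdir-ahead on
  gluster volume set home performance.parallel-readdir on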



We would like to transition (smoothly!!) to gluster 5 or 6 with nfs-ganesha 2.7 and see some performance improvements. We will be using corosync and pacemaker for NFS failover. It would be fantastic to be able to saturate a 10G IPoIB (or 40G IB!) connection to each compute node in the current computational cluster. Right now we absolutely can't get much write speed (copying a 6.2GB file from a host to gluster storage took 1m 21s; cp from local disk to /dev/null takes 7s). cp from gluster to /dev/null takes 1.0m for the same 6.2GB file. That's a 10Gbps IPoIB connection running at only about 800Mbps.
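
(As a sanity check on that number: 6.2GB read in ~60s is roughly 6.2*8/60 ≈ 0.8Gbps, so the 800Mbps figure lines up. A crude way to repeat the measurement, with placeholder paths, dropping the page cache between runs so local caching doesn't flatter the results:)

  # placeholder paths; run each step with a cold cache (as root)
  sync; echo 3 > /proc/sys/vm/drop_caches
  time cp /scratch/bigfile-6.2g /dev/null          # local-disk read baseline
  sync; echo 3 > /proc/sys/vm/drop_caches
  time cp /scratch/bigfile-6.2g /gluster/home/     # write to the gluster mount
  sync; echo 3 > /proc/sys/vm/drop_caches
  time cp /gluster/home/bigfile-6.2g /dev/null     # read back from gluster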

A few things to note here -
* The volume option "nfs.disable" refers to the GlusterNFS service, which is deprecated and no longer enabled by default in the latest gluster versions (such as gluster 5 & 6). We recommend NFS-Ganesha, and hence this option needs to be turned on (to disable GlusterNFS).
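
(Concretely, something like the following per volume - 'home' is just the example name here:)

  # disable the deprecated GlusterNFS server so NFS-Ganesha can take over port 2049
  gluster volume set home nfs.disable on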

* Starting with Gluster 3.11, the HA configuration bits for NFS-Ganesha have been removed from the gluster codebase. So you would need to either manually configure an HA service on top of the NFS-Ganesha servers or use storhaug [1] to configure the same.
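
(If you go the manual route, a very rough pacemaker sketch of the idea, assuming pcs is in use; the VIP and group name are placeholders. A real setup also needs NFS grace-period handling across failover, which is the part storhaug takes care of:)

  # floating IP that follows the active NFS-Ganesha head (address is a placeholder)
  pcs resource create nfs_vip ocf:heartbeat:IPaddr2 ip=192.0.2.100 cidr_netmask=24 --group ganesha_grp
  # run nfs-ganesha itself under pacemaker control
  pcs resource create nfs_server systemd:nfs-ganesha --group ganesha_grp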

* Coming to the technical aspects: by switching to NFS, you could benefit from the heavy caching done by the NFS client and a few other optimizations it performs. The NFS-Ganesha server also does metadata caching and resides on the same nodes as the glusterfs servers. Apart from that, NFS-Ganesha acts like any other glusterfs client (but uses libgfapi rather than a fuse mount). It would be interesting to check if, and by how much, NFS improves things compared to the fuse protocol for your workload. Please let us know when you have the test environment ready; we will then make recommendations with respect to a few settings for the NFS-Ganesha server and client.
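
(For when the test environment is ready - a minimal example of what a ganesha.conf export over libgfapi looks like for the 'home' volume; the export id, pseudo path and hostname are only illustrative:)

  # /etc/ganesha/ganesha.conf -- minimal libgfapi-backed export (values illustrative)
  EXPORT {
      Export_Id = 1;
      Path = "/home";
      Pseudo = "/home";
      Access_Type = RW;
      Squash = No_root_squash;
      SecType = "sys";
      FSAL {
          Name = GLUSTER;
          Hostname = "localhost";   # gluster server reached via libgfapi
          Volume = "home";
      }
  }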


Thanks,
Soumya

[1] https://github.com/linux-ha-storage/storhaug

We would like to do things like enable SSL encryption of all data flows (we deal with PHI data in a HIPAA-regulated setting) but are concerned about performance. We are running dual Intel Xeon E5-2630L CPUs (12 physical cores each @ 2.4GHz) and 128GB RAM in each server node. We have 170 users; about 20 are active at any time.
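
(For the SSL piece, the gluster-level TLS knobs look roughly like this; the certificate paths are gluster's defaults, 'home' is again just the example volume, and the CN list is a placeholder. Whether the E5-2630L nodes can keep the 10G/40G links busy with TLS on is exactly what's worth benchmarking first.)

  # on every server and client, keys/certs go in the default locations gluster checks:
  #   /etc/ssl/glusterfs.key  /etc/ssl/glusterfs.pem  /etc/ssl/glusterfs.ca
  touch /var/lib/glusterd/secure-access        # TLS on the management path (all nodes)
  gluster volume set home client.ssl on        # TLS on the I/O path, per volume
  gluster volume set home server.ssl on
  gluster volume set home auth.ssl-allow 'node1-cn,node2-cn,node3-cn'   # placeholder CNs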

The current settings on /home (the others are similar if not identical; nfs.disable may be true for some of them):

gluster volume get home all
Option                                  Value
------                                  -----
cluster.lookup-unhashed                 on
cluster.lookup-optimize                 off
cluster.min-free-disk                   10%
cluster.min-free-inodes                 5%
cluster.rebalance-stats                 off
cluster.subvols-per-directory           (null)
cluster.readdir-optimize                off
cluster.rsync-hash-regex                (null)
cluster.extra-hash-regex                (null)
cluster.dht-xattr-name                  trusted.glusterfs.dht
cluster.randomize-hash-range-by-gfid    off
cluster.rebal-throttle                  normal
cluster.lock-migration                  off
cluster.local-volume-name               (null)
cluster.weighted-rebalance              on
cluster.switch-pattern                  (null)
cluster.entry-change-log                on
cluster.read-subvolume                  (null)
cluster.read-subvolume-index            -1
cluster.read-hash-mode                  1
cluster.background-self-heal-count      8
cluster.metadata-self-heal              on
cluster.data-self-heal                  on
cluster.entry-self-heal                 on
cluster.self-heal-daemon                enable
cluster.heal-timeout                    600
cluster.self-heal-window-size           1
cluster.data-change-log                 on
cluster.metadata-change-log             on
cluster.data-self-heal-algorithm        (null)
cluster.eager-lock                      on
disperse.eager-lock                     on
cluster.quorum-type                     none
cluster.quorum-count                    (null)
cluster.choose-local                    true
cluster.self-heal-readdir-size          1KB
cluster.post-op-delay-secs              1
cluster.ensure-durability               on
cluster.consistent-metadata             no
cluster.heal-wait-queue-length          128
cluster.favorite-child-policy           none
cluster.stripe-block-size               128KB
cluster.stripe-coalesce                 true
diagnostics.latency-measurement         off
diagnostics.dump-fd-stats               off
diagnostics.count-fop-hits              off
diagnostics.brick-log-level             INFO
diagnostics.client-log-level            INFO
diagnostics.brick-sys-log-level         CRITICAL
diagnostics.client-sys-log-level        CRITICAL
diagnostics.brick-logger                (null)
diagnostics.client-logger               (null)
diagnostics.brick-log-format            (null)
diagnostics.client-log-format           (null)
diagnostics.brick-log-buf-size          5
diagnostics.client-log-buf-size         5
diagnostics.brick-log-flush-timeout     120
diagnostics.client-log-flush-timeout    120
diagnostics.stats-dump-interval         0
diagnostics.fop-sample-interval         0
diagnostics.stats-dump-format           json
diagnostics.fop-sample-buf-size         65535
diagnostics.stats-dnscache-ttl-sec      86400
performance.cache-max-file-size         0
performance.cache-min-file-size         0
performance.cache-refresh-timeout       1
performance.cache-priority
performance.cache-size                  32MB
performance.io-thread-count             16
performance.high-prio-threads           16
performance.normal-prio-threads         16
performance.low-prio-threads            16
performance.least-prio-threads          1
performance.enable-least-priority       on
performance.cache-size                  128MB
performance.flush-behind                on
performance.nfs.flush-behind            on
performance.write-behind-window-size    1MB
performance.resync-failed-syncs-after-fsync off
performance.nfs.write-behind-window-size 1MB
performance.strict-o-direct             off
performance.nfs.strict-o-direct         off
performance.strict-write-ordering       off
performance.nfs.strict-write-ordering   off
performance.lazy-open                   yes
performance.read-after-open             no
performance.read-ahead-page-count       4
performance.md-cache-timeout            1
performance.cache-swift-metadata        true
performance.cache-samba-metadata        false
performance.cache-capability-xattrs     true
performance.cache-ima-xattrs            true
features.encryption                     off
encryption.master-key                   (null)
encryption.data-key-size                256
encryption.block-size                   4096
network.frame-timeout                   1800
network.ping-timeout                    42
network.tcp-window-size                 (null)
features.lock-heal                      off
features.grace-timeout                  10
network.remote-dio                      disable
client.event-threads                    2
client.tcp-user-timeout                 0
client.keepalive-time                   20
client.keepalive-interval               2
client.keepalive-count                  9
network.tcp-window-size                 (null)
network.inode-lru-limit                 16384
auth.allow                              *
auth.reject                             (null)
transport.keepalive                     1
server.allow-insecure                   (null)
server.root-squash                      off
server.anonuid                          65534
server.anongid                          65534
server.statedump-path                   /var/run/gluster
server.outstanding-rpc-limit            64
features.lock-heal                      off
features.grace-timeout                  10
server.ssl                              (null)
auth.ssl-allow                          *
server.manage-gids                      off
server.dynamic-auth                     on
client.send-gids                        on
server.gid-timeout                      300
server.own-thread                       (null)
server.event-threads                    1
server.tcp-user-timeout                 0
server.keepalive-time                   20
server.keepalive-interval               2
server.keepalive-count                  9
transport.listen-backlog                10
ssl.own-cert                            (null)
ssl.private-key                         (null)
ssl.ca-list                             (null)
ssl.crl-path                            (null)
ssl.certificate-depth                   (null)
ssl.cipher-list                         (null)
ssl.dh-param                            (null)
ssl.ec-curve                            (null)
performance.write-behind                on
performance.read-ahead                  on
performance.readdir-ahead               off
performance.io-cache                    on
performance.quick-read                  on
performance.open-behind                 on
performance.nl-cache                    off
performance.stat-prefetch               on
performance.client-io-threads           off
performance.nfs.write-behind            on
performance.nfs.read-ahead              off
performance.nfs.io-cache                off
performance.nfs.quick-read              off
performance.nfs.stat-prefetch           off
performance.nfs.io-threads              off
performance.force-readdirp              true
performance.cache-invalidation          false
features.uss                            off
features.snapshot-directory             .snaps
features.show-snapshot-directory        off
network.compression                     off
network.compression.window-size         -15
network.compression.mem-level           8
network.compression.min-size            0
network.compression.compression-level   -1
network.compression.debug               false
features.limit-usage                    (null)
features.default-soft-limit             80%
features.soft-timeout                   60
features.hard-timeout                   5
features.alert-time                     86400
features.quota-deem-statfs              off
geo-replication.indexing                off
geo-replication.indexing                off
geo-replication.ignore-pid-check        off
geo-replication.ignore-pid-check        off
features.quota                          off
features.inode-quota                    off
features.bitrot                         disable
debug.trace                             off
debug.log-history                       no
debug.log-file                          no
debug.exclude-ops                       (null)
debug.include-ops                       (null)
debug.error-gen                         off
debug.error-failure                     (null)
debug.error-number                      (null)
debug.random-failure                    off
debug.error-fops                        (null)
nfs.enable-ino32                        no
nfs.mem-factor                          15
nfs.export-dirs                         on
nfs.export-volumes                      on
nfs.addr-namelookup                     off
nfs.dynamic-volumes                     off
nfs.register-with-portmap               on
nfs.outstanding-rpc-limit               16
nfs.port                                2049
nfs.rpc-auth-unix                       on
nfs.rpc-auth-null                       on
nfs.rpc-auth-allow                      all
nfs.rpc-auth-reject                     none
nfs.ports-insecure                      off
nfs.trusted-sync                        off
nfs.trusted-write                       off
nfs.volume-access                       read-write
nfs.export-dir
nfs.disable                             off
nfs.nlm                                 on
nfs.acl                                 on
nfs.mount-udp                           off
nfs.mount-rmtab                         /var/lib/glusterd/nfs/rmtab
nfs.rpc-statd                           /sbin/rpc.statd
nfs.server-aux-gids                     off
nfs.drc                                 off
nfs.drc-size                            0x20000
nfs.read-size                           (1 * 1048576ULL)
nfs.write-size                          (1 * 1048576ULL)
nfs.readdir-size                        (1 * 1048576ULL)
nfs.rdirplus                            on
nfs.exports-auth-enable                 (null)
nfs.auth-refresh-interval-sec           (null)
nfs.auth-cache-ttl-sec                  (null)
features.read-only                      off
features.worm                           off
features.worm-file-level                off
features.default-retention-period       120
features.retention-mode                 relax
features.auto-commit-period             180
storage.linux-aio                       off
storage.batch-fsync-mode                reverse-fsync
storage.batch-fsync-delay-usec          0
storage.owner-uid                       -1
storage.owner-gid                       -1
storage.node-uuid-pathinfo              off
storage.health-check-interval           30
storage.build-pgfid                     on
storage.gfid2path                       on
storage.gfid2path-separator             :
storage.bd-aio                          off
cluster.server-quorum-type              off
cluster.server-quorum-ratio             0
changelog.changelog                     off
changelog.changelog-dir                 (null)
changelog.encoding                      ascii
changelog.rollover-time                 15
changelog.fsync-interval                5
changelog.changelog-barrier-timeout     120
changelog.capture-del-path              off
features.barrier                        disable
features.barrier-timeout                120
features.trash                          off
features.trash-dir                      .trashcan
features.trash-eliminate-path           (null)
features.trash-max-filesize             5MB
features.trash-internal-op              off
cluster.enable-shared-storage           disable
cluster.write-freq-threshold            0
cluster.read-freq-threshold             0
cluster.tier-pause                      off
cluster.tier-promote-frequency          120
cluster.tier-demote-frequency           3600
cluster.watermark-hi                    90
cluster.watermark-low                   75
cluster.tier-mode                       cache
cluster.tier-max-promote-file-size      0
cluster.tier-max-mb                     4000
cluster.tier-max-files                  10000
cluster.tier-query-limit                100
cluster.tier-compact                    on
cluster.tier-hot-compact-frequency      604800
cluster.tier-cold-compact-frequency     604800
features.ctr-enabled                    off
features.record-counters                off
features.ctr-record-metadata-heat       off
features.ctr_link_consistency           off
features.ctr_lookupheal_link_timeout    300
features.ctr_lookupheal_inode_timeout   300
features.ctr-sql-db-cachesize           12500
features.ctr-sql-db-wal-autocheckpoint  25000
features.selinux                        on
locks.trace                             off
locks.mandatory-locking                 off
cluster.disperse-self-heal-daemon       enable
cluster.quorum-reads                    no
client.bind-insecure                    (null)
features.shard                          off
features.shard-block-size               64MB
features.scrub-throttle                 lazy
features.scrub-freq                     biweekly
features.scrub                          false
features.expiry-time                    120
features.cache-invalidation             off
features.cache-invalidation-timeout     60
features.leases                         off
features.lease-lock-recall-timeout      60
disperse.background-heals               8
disperse.heal-wait-qlength              128
cluster.heal-timeout                    600
dht.force-readdirp                      on
disperse.read-policy                    gfid-hash
cluster.shd-max-threads                 1
cluster.shd-wait-qlength                1024
cluster.locking-scheme                  full
cluster.granular-entry-heal             no
features.locks-revocation-secs          0
features.locks-revocation-clear-all     false
features.locks-revocation-max-blocked   0
features.locks-monkey-unlocking         false
disperse.shd-max-threads                1
disperse.shd-wait-qlength               1024
disperse.cpu-extensions                 auto
disperse.self-heal-window-size          1
cluster.use-compound-fops               off
performance.parallel-readdir            off
performance.rda-request-size            131072
performance.rda-low-wmark               4096
performance.rda-high-wmark              128KB
performance.rda-cache-limit             10MB
performance.nl-cache-positive-entry     false
performance.nl-cache-limit              10MB
performance.nl-cache-timeout            60
cluster.brick-multiplex                 off
cluster.max-bricks-per-process          0
disperse.optimistic-change-log          on
cluster.halo-enabled                    False
cluster.halo-shd-max-latency            99999
cluster.halo-nfsd-max-latency           5
cluster.halo-max-latency                5
cluster.halo-max-replicas

Thanks,
Soumya


     Does this process make more sense than a version upgrade process to
     4.1, then 5, then 6? What "gotchas" do I need to be ready for? I
     have until late May to prep and test on old, slow hardware with a
     small number of files and volumes.


You can directly upgrade from 3.12 to 6.x. I would suggest that rather
than deleting and recreating the Gluster volumes. +Hari and +Sanju for further
guidelines on the upgrade, as they recently did upgrade tests. +Soumya to
add to the nfs-ganesha aspect.
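
(A rough per-server outline of that in-place path, under the usual caveats: the heal backlog must be zero before touching the next node, package/repo names depend on the distribution, the op-version value is taken from the 6.x release notes, and Hari/Sanju can correct the details:)

  # one server at a time; replica 3 keeps the volumes online during the upgrade
  gluster volume heal home info            # confirm no pending heals first (per volume)
  systemctl stop glusterd
  pkill glusterfs; pkill glusterfsd        # stop remaining brick/self-heal processes
  yum update glusterfs-server              # from the gluster 6 repo
  systemctl start glusterd
  gluster volume heal home info            # wait for heals to finish before the next node
  # once all servers (and ideally the ~80 clients) are on 6.x:
  gluster volume set all cluster.op-version 60000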

Regards,
Poornima

     --

     James P. Kinney III

     Every time you stop a school, you will have to build a jail. What you
     gain at one end you lose at the other. It's like feeding a dog on his
     own tail. It won't fatten the dog.
     - Speech 11/23/1900 Mark Twain

http://heretothereideas.blogspot.com/


     _______________________________________________
     Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-users


--

James P. Kinney III

Every time you stop a school, you will have to build a jail. What you
gain at one end you lose at the other. It's like feeding a dog on his
own tail. It won't fatten the dog.
- Speech 11/23/1900 Mark Twain

http://heretothereideas.blogspot.com/

_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-users



