Hi all,
I have a 3 node gluster cluster:
glusterfs 3.7.12 built on Jun 27 2016 12:40:53
glusterfs 3.7.12 built on Jun 27 2016 12:40:53
3 x AWS EC2 m4.xlarge 1 x 1.5Tb STI EBS
We have had several issues where GlusterFS FUSE client hosts get into 'D' on reading the file system and now have one of the 3 nodes with substantially more CPU load than the other two.
This morning (several hours ago) we shut down one node exhibiting the high CPU behaviour and now that node is back to normal and one of the other 3 has high CPU.
The brick logs on both the 'good' nodes are emitting this over and over.
[2016-10-30 20:55:48.483045] I [MSGID: 115072] [server-rpc-fops.c:1786:server_setattr_cbk] 0-marketplace_nfs-server: 676120: SETATTR /ftpdata/[REDACTED]/bulk_import/.batches/fa44bf76-5706-40fc-b068-33e11a22bdd9/source (f4569925-6c13-4e51-8d97-38cf6c0b198a) ==> (Operation not permitted) [Operation not permitted]
[2016-10-30 20:55:51.511782] I [MSGID: 115072] [server-rpc-fops.c:1786:server_setattr_cbk] 0-marketplace_nfs-server: 695527: SETATTR /ftpdata/[REDACTED]/bulk_import/.batches/fa44bf76-5706-40fc-b068-33e11a22bdd9/source (f4569925-6c13-4e51-8d97-38cf6c0b198a) ==> (Operation not permitted) [Operation not permitted]
[2016-10-30 20:55:51.553858] I [MSGID: 115072] [server-rpc-fops.c:1786:server_setattr_cbk] 0-marketplace_nfs-server: 708963: SETATTR /ftpdata/[REDACTED]/bulk_import/.batches/c42cc627-adc6-43a4-93dd-ea6ec0eaa9cb/source (eaceb846-9ee1-4fab-8f4d-e2a93e85710f) ==> (Operation not permitted) [Operation not permitted]
[2016-10-30 20:55:54.660251] I [MSGID: 115072] [server-rpc-fops.c:1786:server_setattr_cbk] 0-marketplace_nfs-server: 709004: SETATTR /ftpdata/[REDACTED]/bulk_import/.batches/fa44bf76-5706-40fc-b068-33e11a22bdd9/source (f4569925-6c13-4e51-8d97-38cf6c0b198a) ==> (Operation not permitted) [Operation not permitted]
[2016-10-30 20:55:54.866259] I [MSGID: 115072] [server-rpc-fops.c:1786:server_setattr_cbk] 0-marketplace_nfs-server: 356367: SETATTR /ftpdata/[REDACTED]/bulk_import (d7dda9df-bf56-4f71-b6b8-59f2ccc41016) ==> (Operation not permitted) [Operation not permitted]
There are loads of files that are in the gluster volume heal <vol> info.
Client nodes using GlusterFS FUSE clients have been hanging in an uninterruptible state causing issues. I can see files are missing on some of the GlusterFS bricks and I have used a client mounting the bricks via NFS and doing a file traverse and stat on each file and directory (find /<mountpoints> -type f -exec stat {} \;) and I can see files being healed.
We are hoping that this will get us back into a normal state but we are unsure if there is a larger problem looming.
I have included the statedump and current setting:
Option Value
------ -----
cluster.lookup-unhashed on
cluster.lookup-optimize on
cluster.min-free-disk 10%
cluster.min-free-inodes 5%
cluster.rebalance-stats off
cluster.subvols-per-directory (null)
cluster.readdir-optimize off
cluster.rsync-hash-regex (null)
cluster.extra-hash-regex (null)
cluster.dht-xattr-name trusted.glusterfs.dht
cluster.randomize-hash-range-by-gfid off
cluster.rebal-throttle normal
cluster.local-volume-name (null)
cluster.weighted-rebalance on
cluster.switch-pattern (null)
cluster.entry-change-log on
cluster.read-subvolume (null)
cluster.read-subvolume-index -1
cluster.read-hash-mode 1
cluster.background-self-heal-count 8
cluster.metadata-self-heal on
cluster.data-self-heal on
cluster.entry-self-heal on
cluster.self-heal-daemon disable
cluster.heal-timeout 600
cluster.self-heal-window-size 1
cluster.data-change-log on
cluster.metadata-change-log on
cluster.data-self-heal-algorithm full
cluster.eager-lock on
disperse.eager-lock on
cluster.quorum-type none
cluster.quorum-count (null)
cluster.choose-local true
cluster.self-heal-readdir-size 1KB
cluster.post-op-delay-secs 1
cluster.ensure-durability on
cluster.consistent-metadata no
cluster.heal-wait-queue-length 128
cluster.stripe-block-size 128KB
cluster.stripe-coalesce true
diagnostics.latency-measurement off
diagnostics.dump-fd-stats off
diagnostics.count-fop-hits off
diagnostics.brick-log-level INFO
diagnostics.client-log-level INFO
diagnostics.brick-sys-log-level CRITICAL
diagnostics.client-sys-log-level CRITICAL
diagnostics.brick-logger (null)
diagnostics.client-logger (null)
diagnostics.brick-log-format (null)
diagnostics.client-log-format (null)
diagnostics.brick-log-buf-size 5
diagnostics.client-log-buf-size 5
diagnostics.brick-log-flush-timeout 120
diagnostics.client-log-flush-timeout 120
performance.cache-max-file-size 0
performance.cache-min-file-size 0
performance.cache-refresh-timeout 1
performance.cache-priority
performance.cache-size 512MB
performance.io-thread-count 16
performance.high-prio-threads 16
performance.normal-prio-threads 16
performance.low-prio-threads 16
performance.least-prio-threads 1
performance.enable-least-priority on
performance.least-rate-limit 0
performance.cache-size 512MB
performance.flush-behind on
performance.nfs.flush-behind on
performance.write-behind-window-size 1MB
performance.resync-failed-syncs-after-fsyncoff
performance.nfs.write-behind-window-size1MB
performance.strict-o-direct off
performance.nfs.strict-o-direct off
performance.strict-write-ordering off
performance.nfs.strict-write-ordering off
performance.lazy-open yes
performance.read-after-open no
performance.read-ahead-page-count 4
performance.md-cache-timeout 1
performance.cache-swift-metadata true
features.encryption off
encryption.master-key (null)
encryption.data-key-size 256
encryption.block-size 4096
network.frame-timeout 1800
network.ping-timeout 15
network.tcp-window-size (null)
features.lock-heal off
features.grace-timeout 10
network.remote-dio disable
client.event-threads 2
network.ping-timeout 15
network.tcp-window-size (null)
network.inode-lru-limit 16384
auth.allow *
auth.reject (null)
transport.keepalive (null)
server.allow-insecure (null)
server.root-squash off
server.anonuid 65534
server.anongid 65534
server.statedump-path /var/run/gluster
server.outstanding-rpc-limit 64
features.lock-heal off
features.grace-timeout (null)
server.ssl (null)
auth.ssl-allow *
server.manage-gids off
server.dynamic-auth on
client.send-gids on
server.gid-timeout 300
server.own-thread (null)
server.event-threads 2
ssl.own-cert (null)
ssl.private-key (null)
ssl.ca-list (null)
ssl.crl-path (null)
ssl.certificate-depth (null)
ssl.cipher-list (null)
ssl.dh-param (null)
ssl.ec-curve (null)
performance.write-behind on
performance.read-ahead on
performance.readdir-ahead on
performance.io-cache on
performance.quick-read on
performance.open-behind on
performance.stat-prefetch on
performance.client-io-threads off
performance.nfs.write-behind on
performance.nfs.read-ahead off
performance.nfs.io-cache off
performance.nfs.quick-read off
performance.nfs.stat-prefetch off
performance.nfs.io-threads off
performance.force-readdirp true
features.file-snapshot off
features.uss off
features.snapshot-directory .snaps
features.show-snapshot-directory off
network.compression off
network.compression.window-size -15
network.compression.mem-level 8
network.compression.min-size 0
network.compression.compression-level -1
network.compression.debug false
features.limit-usage (null)
features.quota-timeout 0
features.default-soft-limit 80%
features.soft-timeout 60
features.hard-timeout 5
features.alert-time 86400
features.quota-deem-statfs off
geo-replication.indexing off
geo-replication.indexing off
geo-replication.ignore-pid-check off
geo-replication.ignore-pid-check off
features.quota off
features.inode-quota off
features.bitrot disable
debug.trace off
debug.log-history no
debug.log-file no
debug.exclude-ops (null)
debug.include-ops (null)
debug.error-gen off
debug.error-failure (null)
debug.error-number (null)
debug.random-failure off
debug.error-fops (null)
nfs.enable-ino32 no
nfs.mem-factor 15
nfs.export-dirs on
nfs.export-volumes on
nfs.addr-namelookup off
nfs.dynamic-volumes off
nfs.register-with-portmap on
nfs.outstanding-rpc-limit 16
nfs.port 2049
nfs.rpc-auth-unix on
nfs.rpc-auth-null on
nfs.rpc-auth-allow all
nfs.rpc-auth-reject none
nfs.ports-insecure off
nfs.trusted-sync off
nfs.trusted-write off
nfs.volume-access read-write
nfs.export-dir
nfs.disable false
nfs.nlm on
nfs.acl on
nfs.mount-udp off
nfs.mount-rmtab /var/lib/glusterd/nfs/rmtab
nfs.rpc-statd /sbin/rpc.statd
nfs.server-aux-gids off
nfs.drc off
nfs.drc-size 0x20000
nfs.read-size (1 * 1048576ULL)
nfs.write-size (1 * 1048576ULL)
nfs.readdir-size (1 * 1048576ULL)
nfs.rdirplus on
nfs.exports-auth-enable (null)
nfs.auth-refresh-interval-sec (null)
nfs.auth-cache-ttl-sec (null)
features.read-only off
features.worm off
storage.linux-aio off
storage.batch-fsync-mode reverse-fsync
storage.batch-fsync-delay-usec 0
storage.owner-uid -1
storage.owner-gid -1
storage.node-uuid-pathinfo off
storage.health-check-interval 30
storage.build-pgfid off
storage.bd-aio off
cluster.server-quorum-type off
cluster.server-quorum-ratio 51%
changelog.changelog off
changelog.changelog-dir (null)
changelog.encoding ascii
changelog.rollover-time 15
changelog.fsync-interval 5
changelog.changelog-barrier-timeout 120
changelog.capture-del-path off
features.barrier disable
features.barrier-timeout 120
features.trash off
features.trash-dir .trashcan
features.trash-eliminate-path (null)
features.trash-max-filesize 5MB
features.trash-internal-op off
cluster.enable-shared-storage disable
cluster.write-freq-threshold 0
cluster.read-freq-threshold 0
cluster.tier-pause off
cluster.tier-promote-frequency 120
cluster.tier-demote-frequency 3600
cluster.watermark-hi 90
cluster.watermark-low 75
cluster.tier-mode cache
cluster.tier-max-mb 4000
cluster.tier-max-files 10000
features.ctr-enabled off
features.record-counters off
features.ctr-record-metadata-heat off
features.ctr_link_consistency off
features.ctr_lookupheal_link_timeout 300
features.ctr_lookupheal_inode_timeout 300
features.ctr-sql-db-cachesize 1000
features.ctr-sql-db-wal-autocheckpoint 1000
locks.trace (null)
cluster.disperse-self-heal-daemon enable
cluster.quorum-reads no
client.bind-insecure (null)
ganesha.enable off
features.shard off
features.shard-block-size 4MB
features.scrub-throttle lazy
features.scrub-freq biweekly
features.scrub false
features.expiry-time 120
features.cache-invalidation off
features.cache-invalidation-timeout 60
disperse.background-heals 8
disperse.heal-wait-qlength 128
dht.force-readdirp on
disperse.read-policy round-robin
cluster.shd-max-threads 1
cluster.shd-wait-qlength 1024
cluster.locking-scheme full
Help is appreciated
Den
Attachment:
data-data0-marketplace_nfs.4868.dump.1477862666.gz
Description: GNU Zip compressed data
_______________________________________________ Gluster-users mailing list Gluster-users@xxxxxxxxxxx http://www.gluster.org/mailman/listinfo/gluster-users