Re: Issue with Pro active self healing for Erasure coding


 



Could you file a bug for this?

I'll investigate the problem.

Xavi

On 06/26/2015 08:58 AM, Mohamed Pakkeer wrote:
Hi Xavier

We are facing the same I/O error after upgrading to gluster 3.7.2.

Description of problem:
=======================
In a 3 x (4 + 2) = 18 distributed disperse volume, some files return
input/output errors on the FUSE mount after simulating the following
scenario:

1.   Simulate a disk failure by killing the brick pid, then re-add the
same disk after formatting the drive (a rough command sketch follows below)
2.   Try to read the recovered/healed file after 2 bricks/nodes have been
brought down
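
For clarity, the disk-failure simulation in step 1 was roughly the following
(volume name as in the output further below; the brick pid and device name
are placeholders, not values recorded from our run):

# find the brick process serving the failed disk (volume status lists brick pids)
admin@node003:~$ sudo gluster volume status vaulttest21 | grep disk2
admin@node003:~$ sudo kill -9 <brick-pid>

# format the drive and mount it back on the same brick path
admin@node003:~$ sudo umount /media/disk2
admin@node003:~$ sudo mkfs.xfs -f /dev/<device>
admin@node003:~$ sudo mount /dev/<device> /media/disk2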

Version-Release number of selected component (if applicable):
==============================================================

admin@node001:~$ sudo gluster --version
glusterfs 3.7.2 built on Jun 19 2015 16:33:27
Repository revision: git://git.gluster.com/glusterfs.git
<http://git.gluster.com/glusterfs.git>
Copyright (c) 2006-2011 Gluster Inc. <http://www.gluster.com>
GlusterFS comes with ABSOLUTELY NO WARRANTY.
You may redistribute copies of GlusterFS under the terms of the GNU
General Public License.

Steps to Reproduce:

1. Create a 3 x (4 + 2) disperse volume across the nodes
2. FUSE mount on the client and start creating files/directories with mkdir and rsync/dd
3. Simulate a disk failure by killing the brick pid of any disk on one node, then re-add the same disk after formatting the drive
4. Start the volume with force
5. Self-healing creates the file name with 0 bytes on the newly formatted drive
6. Wait for self-healing to finish, but healing does not happen; the file stays at 0 bytes
7. Try to read the same file from the client; the 0-byte file now starts to recover and recovery completes. Get the md5sum of the file with all bricks live and the result is positive
8. Bring down 2 of the nodes
9. Try to get the md5sum of the same recovered file; the client throws an I/O error (a rough command sketch of these steps follows below)
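
The sketch below uses the brick layout from our volume info; the test-file
size and the exact way the two nodes are taken offline are illustrative
rather than the exact commands from our run:

# step 1: 3 x (4+2) disperse volume, redundancy 2, 18 bricks
admin@node001:~$ sudo gluster volume create vaulttest21 disperse 6 redundancy 2 \
    10.1.2.{1..6}:/media/disk1 10.1.2.{1..6}:/media/disk2 10.1.2.{1..6}:/media/disk3
admin@node001:~$ sudo gluster volume start vaulttest21

# step 2: FUSE mount on the client and write a large test file
root@mas03:~# mount -t glusterfs 10.1.2.1:/vaulttest21 /mnt/gluster
root@mas03:~# dd if=/dev/urandom of=/mnt/gluster/up1 bs=1M count=3200

# steps 3-4: kill and re-add one brick as sketched earlier, then restart with force
admin@node001:~$ sudo gluster volume start vaulttest21 force

# steps 5-6: watch the heal queue and the fragment size on the re-added brick
admin@node001:~$ sudo gluster volume heal vaulttest21 info
admin@node003:~$ ls -l -h /media/disk2

# steps 7-9: read from the client, take two nodes down, then read again
root@mas03:/mnt/gluster# md5sum up1
admin@node005:~$ sudo pkill glusterfsd    # repeat on the second node (or stop the gluster services)
root@mas03:/mnt/gluster# md5sum up1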

Screenshots

admin@node001:~$ sudo gluster volume info

Volume Name: vaulttest21
Type: Distributed-Disperse
Volume ID: ac6a374d-a0a2-405c-823d-0672fd92f0af
Status: Started
Number of Bricks: 3 x (4 + 2) = 18
Transport-type: tcp
Bricks:
Brick1: 10.1.2.1:/media/disk1
Brick2: 10.1.2.2:/media/disk1
Brick3: 10.1.2.3:/media/disk1
Brick4: 10.1.2.4:/media/disk1
Brick5: 10.1.2.5:/media/disk1
Brick6: 10.1.2.6:/media/disk1
Brick7: 10.1.2.1:/media/disk2
Brick8: 10.1.2.2:/media/disk2
Brick9: 10.1.2.3:/media/disk2
Brick10: 10.1.2.4:/media/disk2
Brick11: 10.1.2.5:/media/disk2
Brick12: 10.1.2.6:/media/disk2
Brick13: 10.1.2.1:/media/disk3
Brick14: 10.1.2.2:/media/disk3
Brick15: 10.1.2.3:/media/disk3
Brick16: 10.1.2.4:/media/disk3
Brick17: 10.1.2.5:/media/disk3
Brick18: 10.1.2.6:/media/disk3
Options Reconfigured:
performance.readdir-ahead: on

*_After simulating the disk failure (node3, disk2) and adding the disk again
after formatting the drive_*

admin@node003:~$ date

Thu Jun 25 *16:21:58* IST 2015


admin@node003:~$ ls -l -h /media/disk2

total 1.6G

drwxr-xr-x 3 root root   22 Jun 25 16:18 1

*-rw-r--r-- 2 root root    0 Jun 25 16:17 up1*

*-rw-r--r-- 2 root root    0 Jun 25 16:17 up2*

-rw-r--r-- 2 root root 797M Jun 25 16:03 up3

-rw-r--r-- 2 root root 797M Jun 25 16:04 up4

--

admin@node003:~$ date

Thu Jun 25 *16:25:09* IST 2015


admin@node003:~$ ls -l -h  /media/disk2

total 1.6G

drwxr-xr-x 3 root root   22 Jun 25 16:18 1

*-rw-r--r-- 2 root root    0 Jun 25 16:17 up1*

*-rw-r--r-- 2 root root    0 Jun 25 16:17 up2*

-rw-r--r-- 2 root root 797M Jun 25 16:03 up3

-rw-r--r-- 2 root root 797M Jun 25 16:04 up4


admin@node003:~$ date

Thu Jun 25 *16:41:25* IST 2015


admin@node003:~$  ls -l -h  /media/disk2

total 1.6G

drwxr-xr-x 3 root root   22 Jun 25 16:18 1

-rw-r--r-- 2 root root    0 Jun 25 16:17 up1

-rw-r--r-- 2 root root    0 Jun 25 16:17 up2

-rw-r--r-- 2 root root 797M Jun 25 16:03 up3

-rw-r--r-- 2 root root 797M Jun 25 16:04 up4


*After waiting nearly 20 minutes, self-healing still has not recovered the full
data chunk. Then we try to read the file using md5sum:*
root@mas03:/mnt/gluster# time md5sum up1
4650543ade404ed5a1171726e76f8b7c  up1

real    1m58.010s
user    0m6.243s
sys     0m0.778s

*The missing chunk starts growing (the read triggered recovery):*

admin@node003:~$ ls -l -h  /media/disk2
total 2.6G
drwxr-xr-x 3 root root   22 Jun 25 16:18 1
-rw-r--r-- 2 root root 797M Jun 25 15:57 up1
-rw-r--r-- 2 root root    0 Jun 25 16:17 up2
-rw-r--r-- 2 root root 797M Jun 25 16:03 up3
-rw-r--r-- 2 root root 797M Jun 25 16:04 up4
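
As a side note, besides waiting for the self-heal daemon or reading the file
from a client, the heal state can be inspected and a full heal requested
explicitly from the CLI (a sketch; we have not changed any heal-related
options from their defaults):

admin@node001:~$ sudo gluster volume heal vaulttest21 info    # entries still pending heal, per brick
admin@node001:~$ sudo gluster volume heal vaulttest21 full    # explicitly request a full crawl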

*_To verify the healed file after nodes 5 & 6 were taken offline_*

root@mas03:/mnt/gluster# time md5sum up1
md5sum: up1: *Input/output error*

The I/O error is still not rectified. Could you suggest whether anything is
wrong in our testing?
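
If it helps the investigation, these are additional checks we can run on brick
availability and on the healed fragment (as far as we understand, getfattr will
show the trusted.ec.* attributes used by the disperse translator):

# confirm which bricks are offline after stopping nodes 5 & 6
admin@node001:~$ sudo gluster volume status vaulttest21

# inspect the extended attributes of the healed fragment on the formatted brick
admin@node003:~$ sudo getfattr -d -m . -e hex /media/disk2/up1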


admin@node001:~$ sudo gluster volume get vaulttest21 all
Option                                  Value
------                                  -----
cluster.lookup-unhashed                 on
cluster.lookup-optimize                 off
cluster.min-free-disk                   10%
cluster.min-free-inodes                 5%
cluster.rebalance-stats                 off
cluster.subvols-per-directory           (null)
cluster.readdir-optimize                off
cluster.rsync-hash-regex                (null)
cluster.extra-hash-regex                (null)
cluster.dht-xattr-name                  trusted.glusterfs.dht
cluster.randomize-hash-range-by-gfid    off
cluster.rebal-throttle                  normal
cluster.local-volume-name               (null)
cluster.weighted-rebalance              on
cluster.entry-change-log                on
cluster.read-subvolume                  (null)
cluster.read-subvolume-index            -1
cluster.read-hash-mode                  1
cluster.background-self-heal-count      16
cluster.metadata-self-heal              on
cluster.data-self-heal                  on
cluster.entry-self-heal                 on
cluster.self-heal-daemon                on
cluster.heal-timeout                    600
cluster.self-heal-window-size           1
cluster.data-change-log                 on
cluster.metadata-change-log             on
cluster.data-self-heal-algorithm        (null)
cluster.eager-lock                      on
cluster.quorum-type                     none
cluster.quorum-count                    (null)
cluster.choose-local                    true
cluster.self-heal-readdir-size          1KB
cluster.post-op-delay-secs              1
cluster.ensure-durability               on
cluster.consistent-metadata             no
cluster.stripe-block-size               128KB
cluster.stripe-coalesce                 true
diagnostics.latency-measurement         off
diagnostics.dump-fd-stats               off
diagnostics.count-fop-hits              off
diagnostics.brick-log-level             INFO
diagnostics.client-log-level            INFO
diagnostics.brick-sys-log-level         CRITICAL
diagnostics.client-sys-log-level        CRITICAL
diagnostics.brick-logger                (null)
diagnostics.client-logger               (null)
diagnostics.brick-log-format            (null)
diagnostics.client-log-format           (null)
diagnostics.brick-log-buf-size          5
diagnostics.client-log-buf-size         5
diagnostics.brick-log-flush-timeout     120
diagnostics.client-log-flush-timeout    120
performance.cache-max-file-size         0
performance.cache-min-file-size         0
performance.cache-refresh-timeout       1
performance.cache-priority
performance.cache-size                  32MB
performance.io-thread-count             16
performance.high-prio-threads           16
performance.normal-prio-threads         16
performance.low-prio-threads            16
performance.least-prio-threads          1
performance.enable-least-priority       on
performance.least-rate-limit            0
performance.cache-size                  128MB
performance.flush-behind                on
performance.nfs.flush-behind            on
performance.write-behind-window-size    1MB
performance.nfs.write-behind-window-size 1MB
performance.strict-o-direct             off
performance.nfs.strict-o-direct         off
performance.strict-write-ordering       off
performance.nfs.strict-write-ordering   off
performance.lazy-open                   yes
performance.read-after-open             no
performance.read-ahead-page-count       4
performance.md-cache-timeout            1
features.encryption                     off
encryption.master-key                   (null)
encryption.data-key-size                256
encryption.block-size                   4096
network.frame-timeout                   1800
network.ping-timeout                    42
network.tcp-window-size                 (null)
features.lock-heal                      off
features.grace-timeout                  10
network.remote-dio                      disable
client.event-threads                    2
network.ping-timeout                    42
network.tcp-window-size                 (null)
network.inode-lru-limit                 16384
auth.allow                              *
auth.reject                             (null)
transport.keepalive                     (null)
server.allow-insecure                   (null)
server.root-squash                      off
server.anonuid                          65534
server.anongid                          65534
server.statedump-path                   /var/run/gluster
server.outstanding-rpc-limit            64
features.lock-heal                      off
features.grace-timeout                  (null)
server.ssl                              (null)
auth.ssl-allow                          *
server.manage-gids                      off
client.send-gids                        on
server.gid-timeout                      300
server.own-thread                       (null)
server.event-threads                    2
performance.write-behind                on
performance.read-ahead                  on
performance.readdir-ahead               on
performance.io-cache                    on
performance.quick-read                  on
performance.open-behind                 on
performance.stat-prefetch               on
performance.client-io-threads           off
performance.nfs.write-behind            on
performance.nfs.read-ahead              off
performance.nfs.io-cache                off
performance.nfs.quick-read              off
performance.nfs.stat-prefetch           off
performance.nfs.io-threads              off
performance.force-readdirp              true
features.file-snapshot                  off
features.uss                            off
features.snapshot-directory             .snaps
features.show-snapshot-directory        off
network.compression                     off
network.compression.window-size         -15
network.compression.mem-level           8
network.compression.min-size            0
network.compression.compression-level   -1
network.compression.debug               false
features.limit-usage                    (null)
features.quota-timeout                  0
features.default-soft-limit             80%
features.soft-timeout                   60
features.hard-timeout                   5
features.alert-time                     86400
features.quota-deem-statfs              off
geo-replication.indexing                off
geo-replication.indexing                off
geo-replication.ignore-pid-check        off
geo-replication.ignore-pid-check        off
features.quota                          off
features.inode-quota                    off
features.bitrot                         disable
debug.trace                             off
debug.log-history                       no
debug.log-file                          no
debug.exclude-ops                       (null)
debug.include-ops                       (null)
debug.error-gen                         off
debug.error-failure                     (null)
debug.error-number                      (null)
debug.random-failure                    off
debug.error-fops                        (null)
nfs.enable-ino32                        no
nfs.mem-factor                          15
nfs.export-dirs                         on
nfs.export-volumes                      on
nfs.addr-namelookup                     off
nfs.dynamic-volumes                     off
nfs.register-with-portmap               on
nfs.outstanding-rpc-limit               16
nfs.port                                2049
nfs.rpc-auth-unix                       on
nfs.rpc-auth-null                       on
nfs.rpc-auth-allow                      all
nfs.rpc-auth-reject                     none
nfs.ports-insecure                      off
nfs.trusted-sync                        off
nfs.trusted-write                       off
nfs.volume-access                       read-write
nfs.export-dir
nfs.disable                             false
nfs.nlm                                 on
nfs.acl                                 on
nfs.mount-udp                           off
nfs.mount-rmtab                         /var/lib/glusterd/nfs/rmtab
nfs.rpc-statd                           /sbin/rpc.statd
nfs.server-aux-gids                     off
nfs.drc                                 off
nfs.drc-size                            0x20000
nfs.read-size                           (1 * 1048576ULL)
nfs.write-size                          (1 * 1048576ULL)
nfs.readdir-size                        (1 * 1048576ULL)
nfs.exports-auth-enable                 (null)
nfs.auth-refresh-interval-sec           (null)
nfs.auth-cache-ttl-sec                  (null)
features.read-only                      off
features.worm                           off
storage.linux-aio                       off
storage.batch-fsync-mode                reverse-fsync
storage.batch-fsync-delay-usec          0
storage.owner-uid                       -1
storage.owner-gid                       -1
storage.node-uuid-pathinfo              off
storage.health-check-interval           30
storage.build-pgfid                     off
storage.bd-aio                          off
cluster.server-quorum-type              off
cluster.server-quorum-ratio             0
changelog.changelog                     off
changelog.changelog-dir                 (null)
changelog.encoding                      ascii
changelog.rollover-time                 15
changelog.fsync-interval                5
changelog.changelog-barrier-timeout     120
changelog.capture-del-path              off
features.barrier                        disable
features.barrier-timeout                120
features.trash                          off
features.trash-dir                      .trashcan
features.trash-eliminate-path           (null)
features.trash-max-filesize             5MB
features.trash-internal-op              off
cluster.enable-shared-storage           disable
features.ctr-enabled                    off
features.record-counters                off
features.ctr_link_consistency           off
locks.trace                             (null)
cluster.disperse-self-heal-daemon       enable
cluster.quorum-reads                    no
client.bind-insecure                    (null)
ganesha.enable                          off
features.shard                          off
features.shard-block-size               4MB
features.scrub-throttle                 lazy
features.scrub-freq                     biweekly
features.expiry-time                    120
features.cache-invalidation             off
features.cache-invalidation-timeout     60


Thanks & regards
Backer





On Mon, Jun 15, 2015 at 1:26 PM, Xavier Hernandez <xhernandez@xxxxxxxxxx> wrote:

    On 06/15/2015 09:25 AM, Mohamed Pakkeer wrote:

        Hi Xavier,

        When can we expect the 3.7.2 release for fixing the I/O error
        which we discussed on this mail thread?


    As per the latest meeting held last Wednesday [1] it will be
    released this week.

    Xavi

    [1]
    http://meetbot.fedoraproject.org/gluster-meeting/2015-06-10/gluster-meeting.2015-06-10-12.01.html


        Thanks
        Backer

        On Wed, May 27, 2015 at 8:02 PM, Xavier Hernandez
        <xhernandez@xxxxxxxxxx> wrote:

             Hi again,

             in today's gluster meeting [1] it has been decided that
        3.7.1 will
             be released urgently to solve a bug in glusterd. All fixes
        planned
             for 3.7.1 will be moved to 3.7.2 which will be released
        soon after.

             Xavi

             [1]
        http://meetbot.fedoraproject.org/gluster-meeting/2015-05-27/gluster-meeting.2015-05-27-12.01.html


             On 05/27/2015 12:01 PM, Xavier Hernandez wrote:

                 On 05/27/2015 11:26 AM, Mohamed Pakkeer wrote:

                     Hi Xavier,

                     Thanks for your reply. When can we expect the 3.7.1
        release?


                 AFAIK a beta of 3.7.1 will be released very soon.


                     cheers
                     Backer

                      On Wed, May 27, 2015 at 1:22 PM, Xavier Hernandez
                      <xhernandez@xxxxxxxxxx> wrote:

                          Hi,

                          some Input/Output error issues have been
        identified and
                     fixed. These
                          fixes will be available on 3.7.1.

                          Xavi


                          On 05/26/2015 10:15 AM, Mohamed Pakkeer wrote:

                              Hi Glusterfs Experts,

                              We are testing the glusterfs 3.7.0 tarball on our
                              10 node glusterfs cluster. Each node has 36 drives;
                              please find the volume info below

                              Volume Name: vaulttest5
                              Type: Distributed-Disperse
                              Volume ID:
        68e082a6-9819-4885-856c-1510cd201bd9
                              Status: Started
                              Number of Bricks: 36 x (8 + 2) = 360
                              Transport-type: tcp
                              Bricks:
                              Brick1: 10.1.2.1:/media/disk1
                              Brick2: 10.1.2.2:/media/disk1
                              Brick3: 10.1.2.3:/media/disk1
                              Brick4: 10.1.2.4:/media/disk1
                              Brick5: 10.1.2.5:/media/disk1
                              Brick6: 10.1.2.6:/media/disk1
                              Brick7: 10.1.2.7:/media/disk1
                              Brick8: 10.1.2.8:/media/disk1
                              Brick9: 10.1.2.9:/media/disk1
                              Brick10: 10.1.2.10:/media/disk1
                              Brick11: 10.1.2.1:/media/disk2
                              Brick12: 10.1.2.2:/media/disk2
                              Brick13: 10.1.2.3:/media/disk2
                              Brick14: 10.1.2.4:/media/disk2
                              Brick15: 10.1.2.5:/media/disk2
                              Brick16: 10.1.2.6:/media/disk2
                              Brick17: 10.1.2.7:/media/disk2
                              Brick18: 10.1.2.8:/media/disk2
                              Brick19: 10.1.2.9:/media/disk2
                              Brick20: 10.1.2.10:/media/disk2
                              ...
                              ....
                              Brick351: 10.1.2.1:/media/disk36
                              Brick352: 10.1.2.2:/media/disk36
                              Brick353: 10.1.2.3:/media/disk36
                              Brick354: 10.1.2.4:/media/disk36
                              Brick355: 10.1.2.5:/media/disk36
                              Brick356: 10.1.2.6:/media/disk36
                              Brick357: 10.1.2.7:/media/disk36
                              Brick358: 10.1.2.8:/media/disk36
                              Brick359: 10.1.2.9:/media/disk36
                              Brick360: 10.1.2.10:/media/disk36
                              Options Reconfigured:
                              performance.readdir-ahead: on

                              We did some performance testing and simulated the
                              proactive self-healing for erasure coding. The
                              disperse volume has been created across nodes.

                              _*Description of problem*_

                              I disconnected the *network of two nodes* and tried
                              to write some video files, and *glusterfs wrote the
                              video files on the remaining 8 nodes perfectly*. I
                              tried to download the uploaded file and it was
                              downloaded perfectly. Then I re-enabled the network
                              of the two nodes; the proactive self-healing
                              mechanism worked perfectly and wrote the unavailable
                              chunk of data to the recently enabled nodes from the
                              other 8 nodes. But when I tried to download the same
                              file, it showed an Input/Output error. I couldn't
                              download the file. I think there is an issue in
                              proactive self-healing.

                              We also tried the simulation with a one-node network
                              failure. We faced the same I/O error issue while
                              downloading the file.


                              _Error while downloading file_

                              root@master02:/home/admin# rsync -r --progress
                              /mnt/gluster/file13_AN
                              ./1/file13_AN-2

                              sending incremental file list

                              file13_AN

                                  3,342,355,597 100% 4.87MB/s    0:10:54
        (xfr#1,
                     to-chk=0/1)

                              rsync: read errors mapping
        "/mnt/gluster/file13_AN":
                              Input/output error (5)

                              WARNING: file13_AN failed verification --
        update
                     discarded (will
                              try again).

                                 root@master02:/home/admin# cp
        /mnt/gluster/file13_AN
                              ./1/file13_AN-3

                              cp: error reading ‘/mnt/gluster/file13_AN’:
                     Input/output error

                              cp: failed to extend ‘./1/file13_AN-3’: Input/output error


                              We can't conclude whether the issue is with
                              glusterfs 3.7.0 or with our glusterfs configuration.

                              Any help would be greatly appreciated

                              --
                              Cheers
                              Backer




                              _______________________________________________
                              Gluster-users mailing list
                              Gluster-users@xxxxxxxxxxx
                              http://www.gluster.org/mailman/listinfo/gluster-users






                 _______________________________________________
                  Gluster-users mailing list
                  Gluster-users@xxxxxxxxxxx
                  http://www.gluster.org/mailman/listinfo/gluster-users








--
Thanks & Regards
K.Mohamed Pakkeer
Mobile- 0091-8754410114

_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-users



