Re: Replicated striped data lose

Krutika Dhananjay <kdhananj@xxxxxxxxxx> · Tue, 15 Mar 2016 17:33:43 +0530

Hmm ok. Could you share the nfs.log content?

-Krutika

On Tue, Mar 15, 2016 at 1:45 PM, Mahdi Adnan <mahdi.adnan@xxxxxxxxxxxxxxxxx> wrote:

    Okay, here's what I did:

    Volume Name: v

    Type: Distributed-Replicate

    Volume ID: b348fd8e-b117-469d-bcc0-56a56bdfc930

    Status: Started

    Number of Bricks: 3 x 2 = 6

    Transport-type: tcp

    Bricks:

    Brick1: gfs001:/bricks/b001/v

    Brick2: gfs001:/bricks/b002/v

    Brick3: gfs001:/bricks/b003/v

    Brick4: gfs002:/bricks/b004/v

    Brick5: gfs002:/bricks/b005/v

    Brick6: gfs002:/bricks/b006/v

    Options Reconfigured:

    features.shard-block-size: 128MB

    features.shard: enable

    cluster.server-quorum-type: server

    cluster.quorum-type: auto

    network.remote-dio: enable

    cluster.eager-lock: enable

    performance.stat-prefetch: off

    performance.io-cache: off

    performance.read-ahead: off

    performance.quick-read: off

    performance.readdir-ahead: on

    Same error.

    Mounting with glusterfs still works just fine.

    Respectfully

          Mahdi A. Mahdi

    On 03/15/2016 11:04 AM, Krutika Dhananjay wrote:

          OK but what if you use it with replication? Do you still
            see the error? I think not.

          Could you give it a try and tell me what you find?

        -Krutika

        On Tue, Mar 15, 2016 at 1:23 PM, Mahdi Adnan <mahdi.adnan@xxxxxxxxxxxxxxxxx> wrote:

             Hi,

              I have created the following volume;

              Volume Name: v

              Type: Distribute

              Volume ID: 90de6430-7f83-4eda-a98f-ad1fabcf1043

              Status: Started

              Number of Bricks: 3

              Transport-type: tcp

              Bricks:

              Brick1: gfs001:/bricks/b001/v

              Brick2: gfs001:/bricks/b002/v

              Brick3: gfs001:/bricks/b003/v

              Options Reconfigured:

              features.shard-block-size: 128MB

                features.shard: enable

                cluster.server-quorum-type: server

                cluster.quorum-type: auto

                network.remote-dio: enable

                cluster.eager-lock: enable

                performance.stat-prefetch: off

                performance.io-cache: off

                performance.read-ahead: off

                performance.quick-read: off

               performance.readdir-ahead: on

              After mounting it in ESXi and trying to clone a VM to
              it, I got the same error.

              Respectfully

                    Mahdi A. Mahdi

                  On 03/15/2016 10:44 AM, Krutika Dhananjay wrote:

                                  Hi,

                                  Do not use sharding and stripe
                                  together in the same volume because

                                a) It is not recommended and there is no
                                point in using both. Using sharding
                                alone on your volume should work fine.

                              b) Nobody tested it.

                            c) Like Niels said, stripe feature is
                            virtually deprecated.

                          I would suggest that you create an nx3 volume
                          where n is the number of distribute subvols
                          you prefer, enable group virt options on it,
                          and enable sharding on it,

                          set the shard-block-size that you feel
                          appropriate and then just start off with VM
                          image creation etc.

                        If you run into any issues even after you do
                        this, let us know and we'll help you out.
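
A rough sketch of the setup suggested above, for n = 1 (a single replica-3 set). The volume name `vmstore`, the host names `node1` to `node3`, the brick paths, and the 64MB shard size are placeholders, not values recommended anywhere in this thread. By default the script only prints the commands; set `DRY_RUN=0` to actually run them.

```shell
#!/bin/sh
# Sketch: 1 x 3 replicated volume with the virt option group and sharding.
# VOL, the host names and the brick paths are hypothetical placeholders.
VOL=${VOL:-vmstore}
BRICKS=${BRICKS:-"node1:/bricks/b1/$VOL node2:/bricks/b1/$VOL node3:/bricks/b1/$VOL"}
SHARD_SIZE=${SHARD_SIZE:-64MB}

# Print each command; execute it only when DRY_RUN=0.
run() {
    echo "$*"
    if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

setup_volume() {
    # $BRICKS is intentionally unquoted so it splits into one argument per brick.
    run gluster volume create "$VOL" replica 3 $BRICKS
    run gluster volume set "$VOL" group virt
    run gluster volume set "$VOL" features.shard on
    run gluster volume set "$VOL" features.shard-block-size "$SHARD_SIZE"
    run gluster volume start "$VOL"
}

setup_volume
```

For n > 1, further bricks would be listed in multiples of three. `group virt` applies the option set shipped in the virt group file, typically found under /var/lib/glusterd/groups/ on the servers.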

                      -Krutika  

                      On Tue, Mar 15, 2016 at 1:07 PM, Mahdi Adnan <mahdi.adnan@xxxxxxxxxxxxxxxxx> wrote:

                            Thanks Krutika,

                            I have deleted the volume and created a new
                            one.

                            I found that it may be an issue with NFS
                            itself. I created a new striped volume,
                            enabled sharding, and mounted it via
                            glusterfs, and it worked just fine. If I
                            mount it with NFS, it fails and gives me
                            the same errors.
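
To reproduce the comparison above from a Linux test client rather than from ESXi, the two mounts might look like the sketch below. The server `gfs001` and volume `v` are taken from elsewhere in this thread; the mount point `/mnt/v` is hypothetical. Gluster's built-in NFS server speaks NFSv3, hence `vers=3`. The function only prints the commands so they can be reviewed before being run as root.

```shell
#!/bin/sh
# Sketch: mount the same volume over FUSE and over Gluster NFS (NFSv3)
# from a Linux test client, to compare behaviour outside ESXi.
SERVER=${SERVER:-gfs001}   # any server in the pool
VOL=${VOL:-v}
MNT=${MNT:-/mnt/v}         # hypothetical mount point

mount_cmd() {
    case "$1" in
        fuse) echo "mount -t glusterfs $SERVER:/$VOL $MNT" ;;
        nfs)  echo "mount -t nfs -o vers=3,mountproto=tcp $SERVER:/$VOL $MNT" ;;
        *)    echo "usage: mount_cmd fuse|nfs" >&2; return 1 ;;
    esac
}

mount_cmd fuse
mount_cmd nfs
```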

                            Respectfully

                                  Mahdi A. Mahdi

                                On 03/15/2016 06:24 AM, Krutika Dhananjay wrote:

                                                Hi,

                                                So could you share the
                                                xattrs associated with
                                                the file at
                                                <BRICK_PATH>/.glusterfs/c3/e8/c3e88cc1-7e0a-4d46-9685-2d12131a5e1c

                                            Here's what you need to
                                            execute:

                                          # getfattr -d -m . -e hex /mnt/b1/v/.glusterfs/c3/e8/c3e88cc1-7e0a-4d46-9685-2d12131a5e1c
                                          on the first node and

                                          # getfattr -d -m . -e hex /mnt/b2/v/.glusterfs/c3/e8/c3e88cc1-7e0a-4d46-9685-2d12131a5e1c
                                          on the second.
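
The backend path used in the commands above follows the pattern `<brick>/.glusterfs/<first two hex chars>/<next two>/<gfid>`. A small sketch that builds that path for each brick and dumps the xattrs if the file is present; the helper name `gfid_path` is made up for illustration, and the brick paths are the ones from the testv volume.

```shell
#!/bin/sh
# Sketch: build the .glusterfs backend path for a GFID on each brick and
# dump its extended attributes. Run as root on the server holding the bricks.
GFID=${GFID:-c3e88cc1-7e0a-4d46-9685-2d12131a5e1c}

# .glusterfs/<first two hex chars>/<next two>/<full gfid>
gfid_path() {
    brick=$1
    gfid=$2
    a=$(printf '%s' "$gfid" | cut -c1-2)
    b=$(printf '%s' "$gfid" | cut -c3-4)
    echo "$brick/.glusterfs/$a/$b/$gfid"
}

for brick in /mnt/b1/v /mnt/b2/v; do
    p=$(gfid_path "$brick" "$GFID")
    echo "# $p"
    # Only query bricks that actually hold the file.
    if [ -e "$p" ]; then
        getfattr -d -m . -e hex "$p"
    fi
done
```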

                                      Also, it is normally advised to
                                      use a replica 3 volume as opposed
                                      to replica 2 volume to guard
                                      against split-brains.

                                    -Krutika

                                    On Mon, Mar 14, 2016 at 3:17 PM, Mahdi Adnan <mahdi.adnan@xxxxxxxxxxxxxxxxx> wrote:

                                          Sorry for the serial posting, but I got new
                                          logs that might help.

                                          These messages appear during the
                                          migration:

                                          /var/log/glusterfs/nfs.log

                                          [2016-03-14 09:45:04.573765] I [MSGID: 109036] [dht-common.c:8043:dht_log_new_layout_for_dir_selfheal] 0-testv-dht: Setting layout of /New Virtual Machine_1 with [Subvol_name: testv-stripe-0, Err: -1 , Start: 0 , Stop: 4294967295 , Hash: 1 ],

                                          [2016-03-14 09:45:04.957499] E [shard.c:369:shard_modify_size_and_block_count] (-->/usr/lib64/glusterfs/3.7.8/xlator/cluster/distribute.so(dht_file_setattr_cbk+0x14f) [0x7f27a13c067f] -->/usr/lib64/glusterfs/3.7.8/xlator/features/shard.so(shard_common_setattr_cbk+0xcc) [0x7f27a116681c] -->/usr/lib64/glusterfs/3.7.8/xlator/features/shard.so(shard_modify_size_and_block_count+0xdd) [0x7f27a116584d] ) 0-testv-shard: Failed to get trusted.glusterfs.shard.file-size for c3e88cc1-7e0a-4d46-9685-2d12131a5e1c

                                          [2016-03-14 09:45:04.957577] W [MSGID: 112199] [nfs3-helpers.c:3418:nfs3_log_common_res] 0-nfs-nfsv3: /New Virtual Machine_1/New Virtual Machine-flat.vmdk => (XID: 3fec5a26, SETATTR: NFS: 22(Invalid argument for operation), POSIX: 22(Invalid argument)) [Invalid argument]

                                          [2016-03-14 09:45:05.079657] E [MSGID: 112069] [nfs3.c:3649:nfs3_rmdir_resume] 0-nfs-nfsv3: No such file or directory: (192.168.221.52:826) testv : 00000000-0000-0000-0000-000000000001
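
When sharing log content like the excerpt above, it can help to cut it down to error and warning entries first. A minimal sketch, assuming the usual Gluster log layout of `[timestamp] LEVEL [...]` with one entry per line in the file on disk; the function name is made up for illustration.

```shell
#!/bin/sh
# Sketch: show only error (E) and warning (W) entries from a Gluster log.
LOG=${LOG:-/var/log/glusterfs/nfs.log}

# Entries look like: [2016-03-14 09:45:04.957499] E [shard.c:369:...] ...
errors_and_warnings() {
    grep -E '^\[[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9:.]+\] [EW] ' "$1"
}

# Print the most recent 50 matching entries if the log is readable.
if [ -r "$LOG" ]; then
    errors_and_warnings "$LOG" | tail -n 50
fi
```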

                                          Respectfully

                                                Mahdi A. Mahdi

                                              On 03/14/2016 11:14 AM, Mahdi Adnan wrote:

                                                So I have deployed a new server (Cisco UCS C220M4) and created a
                                                new volume:

                                                Volume Name: testv
                                                Type: Stripe
                                                Volume ID: 55cdac79-fe87-4f1f-90c0-15c9100fe00b
                                                Status: Started
                                                Number of Bricks: 1 x 2 = 2
                                                Transport-type: tcp
                                                Bricks:
                                                Brick1: 10.70.0.250:/mnt/b1/v
                                                Brick2: 10.70.0.250:/mnt/b2/v
                                                Options Reconfigured:
                                                nfs.disable: off
                                                features.shard-block-size: 64MB
                                                features.shard: enable
                                                cluster.server-quorum-type: server
                                                cluster.quorum-type: auto
                                                network.remote-dio: enable
                                                cluster.eager-lock: enable
                                                performance.stat-prefetch: off
                                                performance.io-cache: off
                                                performance.read-ahead: off
                                                performance.quick-read: off
                                                performance.readdir-ahead: off

                                                Same error.

                                                Can anyone share the volume info of a working striped volume?

                                                On 03/14/2016 09:02 AM, Mahdi Adnan wrote:

                                                  I have a pool of two bricks on the same
                                                  server:

                                                  Volume Name: k
                                                  Type: Stripe
                                                  Volume ID: 1e9281ce-2a8b-44e8-a0c6-e3ebf7416b2b
                                                  Status: Started
                                                  Number of Bricks: 1 x 2 = 2
                                                  Transport-type: tcp
                                                  Bricks:
                                                  Brick1: gfs001:/bricks/t1/k
                                                  Brick2: gfs001:/bricks/t2/k
                                                  Options Reconfigured:
                                                  features.shard-block-size: 64MB
                                                  features.shard: on
                                                  cluster.server-quorum-type: server
                                                  cluster.quorum-type: auto
                                                  network.remote-dio: enable
                                                  cluster.eager-lock: enable
                                                  performance.stat-prefetch: off
                                                  performance.io-cache: off
                                                  performance.read-ahead: off
                                                  performance.quick-read: off
                                                  performance.readdir-ahead: off

                                                  Same issue.

                                                  glusterfs 3.7.8 built on Mar 10 2016 20:20:45.

                                                  Respectfully

                                                        Mahdi A. Mahdi

                                                    Systems Administrator
                                                    IT Department
                                                    Earthlink Telecommunications
                                                    Cell: 07903316180
                                                    Work: 3352
                                                    Skype: mahdi.adnan@xxxxxxxxxxx

                                                  On 03/14/2016 08:11 AM, Niels de Vos wrote:

                                                    On Mon, Mar 14, 2016 at 08:12:27AM +0530, Krutika Dhananjay wrote:

                                                      It would be better to use sharding over stripe for your vm use case. It
offers better distribution and utilisation of bricks and better heal
performance.
And it is well tested.

                                                    Basically the "striping" feature is deprecated, "sharding" is its
improved replacement. I expect to see "striping" completely dropped in
the next major release.

Niels

                                                      Couple of things to note before you do that:
1. Most of the bug fixes in sharding have gone into 3.7.8. So it is advised
that you use 3.7.8 or above.
2. When you enable sharding on a volume, already existing files in the
volume do not get sharded. Only files created after sharding is enabled
will be.
    If you do want to shard the existing files, you would need to cp
them to a temp name within the volume, and then rename them back to the
original file name.
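
The copy-and-rename step in point 2 could be sketched as below. This is an illustration only: the function name and the example path are hypothetical, it is meant to be run against a client mount of the volume (not directly on a brick), and it presumably should only be used on images that are not in use at the time.

```shell
#!/bin/sh
# Sketch: rewrite an existing file so that it is created anew (and thus
# sharded) on a volume where sharding was enabled after the file was written.
reshard_file() {
    f=$1
    tmp="$f.reshard.$$"          # temp name in the same directory, same volume
    # Copy to the temp name; clean up and bail out if the copy fails.
    cp -p -- "$f" "$tmp" || { rm -f -- "$tmp"; return 1; }
    # Rename the new copy back over the original name.
    mv -f -- "$tmp" "$f"
}

# Usage (hypothetical path): reshard_file /mnt/v/vm1/vm1-flat.vmdk
```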

HTH,
Krutika

On Sun, Mar 13, 2016 at 11:49 PM, Mahdi Adnan <mahdi.adnan@xxxxxxxxxxxxxxxxx> wrote:

                                                        I couldn't find anything related to cache in the HBAs.
What logs are useful in my case? I see only brick logs, which contain
nothing during the failure.

###
[2016-03-13 18:05:19.728614] E [MSGID: 113022] [posix.c:1232:posix_mknod] 0-vmware-posix: mknod on /bricks/b003/vmware/.shard/17d75e20-16f1-405e-9fa5-99ee7b1bd7f1.511 failed [File exists]
[2016-03-13 18:07:23.337086] E [MSGID: 113022] [posix.c:1232:posix_mknod] 0-vmware-posix: mknod on /bricks/b003/vmware/.shard/eef2d538-8eee-4e58-bc88-fbf7dc03b263.4095 failed [File exists]
[2016-03-13 18:07:55.027600] W [trash.c:1922:trash_rmdir] 0-vmware-trash: rmdir issued on /.trashcan/, which is not permitted
[2016-03-13 18:07:55.027635] I [MSGID: 115056] [server-rpc-fops.c:459:server_rmdir_cbk] 0-vmware-server: 41987: RMDIR /.trashcan/internal_op (00000000-0000-0000-0000-000000000005/internal_op) ==> (Operation not permitted) [Operation not permitted]
[2016-03-13 18:11:34.353441] I [login.c:81:gf_auth] 0-auth/login: allowed user names: c0c72c37-477a-49a5-a305-3372c1c2f2b4
[2016-03-13 18:11:34.353463] I [MSGID: 115029] [server-handshake.c:612:server_setvolume] 0-vmware-server: accepted client from gfs002-2727-2016/03/13-20:17:43:613597-vmware-client-4-0-0 (version: 3.7.8)
[2016-03-13 18:11:34.591139] I [login.c:81:gf_auth] 0-auth/login: allowed user names: c0c72c37-477a-49a5-a305-3372c1c2f2b4
[2016-03-13 18:11:34.591173] I [MSGID: 115029] [server-handshake.c:612:server_setvolume] 0-vmware-server: accepted client from gfs002-2719-2016/03/13-20:17:42:609388-vmware-client-4-0-0 (version: 3.7.8)
###
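
The numeric suffix on the `.shard/<gfid>.<N>` paths in the brick log above is the shard index. As a rough aid for reading such entries, the sketch below maps an index to the byte range it would cover, assuming shard N holds bytes [N*size, (N+1)*size) and that "MB" means 1024*1024 bytes. With the 16MB block size reported for the vmware volume elsewhere in this thread, index 511 would end exactly at the 8 GiB boundary and index 4095 at 64 GiB; whether that block size was in effect when these entries were logged is not confirmed by the thread.

```shell
#!/bin/sh
# Sketch: map a shard index to the byte range it covers.
# Assumes shard N holds bytes [N*size, (N+1)*size) and 1 MB = 1024*1024 bytes.
shard_range() {
    index=$1
    size_mb=$2
    size=$((size_mb * 1024 * 1024))
    start=$((index * size))
    end=$((start + size - 1))
    echo "$start $end"
}

# Indices taken from the two mknod errors, with a 16MB shard block size.
shard_range 511 16
shard_range 4095 16
```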

ESXi just keeps telling me "Cannot clone T: The virtual disk is either
corrupted or not a supported format."
(error, 3/13/2016 9:06:20 PM, Clone virtual machine, T,
VCENTER.LOCAL\Administrator)

My setup is 2 servers with a floating IP controlled by CTDB, and my ESXi
server mounts the NFS export via the floating IP.

On 03/13/2016 08:40 PM, pkoelle wrote:

                                                          Am 13.03.2016 um 18:22 schrieb David Gossage:

                                                          On Sun, Mar 13, 2016 at 11:07 AM, Mahdi Adnan <
mahdi.adnan@xxxxxxxxxxxxxxxxx

                                                          wrote:

                                                          My HBAs are LSISAS1068E, and the filesystem is XFS.

                                                          I tried EXT4 and it did not help.
I have created a striped volume on one server with two bricks, same
issue.
I also tried a replicated volume with just sharding enabled, same issue;
as soon as I disable sharding it works just fine. Neither sharding nor
striping works for me.
I did follow some of the threads on the mailing list and tried some of
the fixes that worked for others; none worked for me. :(

                                                          Is it possible the LSI has write-cache enabled?

                                                          Why is that relevant? Even the backing filesystem has no idea if there is
a RAID or write cache or whatever. There are blocks and sync(), end of
story.
If you lose power and screw up your recovery OR do funky stuff with SAS
multipathing, that might be an issue with a controller cache. AFAIK that's
not what we are talking about.

I'm afraid that unless the OP has some logs from the server, a
reproducible test case, or a backtrace from client or server, this isn't
getting us anywhere.

cheers
Paul

                                                          On 03/13/2016 06:54 PM, David Gossage wrote:

                                                          On Sun, Mar 13, 2016 at 8:16 AM, Mahdi Adnan <
mahdi.adnan@xxxxxxxxxxxxxxxxx> wrote:

Okay, so I have enabled sharding in my test volume and it did not help.

                                                          Stupidly enough, I also enabled it in a production volume
("Distributed-Replicate") and it corrupted half of my VMs.
I have updated Gluster to the latest version and nothing seems to have
changed in my situation.
Below is the info of my volume:

                                                          I was pointing at the settings in that email as an example for fixing
corruption. I wouldn't recommend enabling sharding if you haven't gotten the
base working yet on that cluster. What HBAs are you using and what is the
layout of the filesystem for the bricks?

Number of Bricks: 3 x 2 = 6

                                                          Transport-type: tcp
Bricks:
Brick1: gfs001:/bricks/b001/vmware
Brick2: gfs002:/bricks/b004/vmware
Brick3: gfs001:/bricks/b002/vmware
Brick4: gfs002:/bricks/b005/vmware
Brick5: gfs001:/bricks/b003/vmware
Brick6: gfs002:/bricks/b006/vmware
Options Reconfigured:
performance.strict-write-ordering: on
cluster.server-quorum-type: server
cluster.quorum-type: auto
network.remote-dio: enable
performance.stat-prefetch: disable
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
cluster.eager-lock: enable
features.shard-block-size: 16MB
features.shard: on
performance.readdir-ahead: off

On 03/12/2016 08:11 PM, David Gossage wrote:

On Sat, Mar 12, 2016 at 10:21 AM, Mahdi Adnan <mahdi.adnan@xxxxxxxxxxxxxxxxx> wrote:

Both servers have HBAs, no RAID, and I can set up a replicated or

                                                          dispersed volume without any issues.
Logs are clean; when I tried to migrate a VM and got the error,
nothing showed up in the logs.
I tried mounting the volume on my laptop and it mounted fine, but
if I use dd to create a data file it just hangs and I can't cancel it or
unmount it; I just have to reboot.
The same servers have another distributed-replicated volume on other
bricks, and it works fine.
I have even tried the same setup in a virtual environment (created two
VMs, installed Gluster, and created a replicated striped volume) and again
got the same thing: data corruption.

                                                          I'd look through the mail archives for a topic called "Shard in
Production", I think.  The shard portion may not be relevant, but it does
discuss certain settings that had to be applied to avoid corruption
with VMs.  You may also want to try disabling
performance.readdir-ahead.

                                                          On 03/12/2016 07:02 PM, David Gossage wrote:

On Sat, Mar 12, 2016 at 9:51 AM, Mahdi Adnan <mahdi.adnan@xxxxxxxxxxxxxxxxx> wrote:

Thanks David,

                                                          My settings are all defaults; I have just created the pool and
started it.
I have applied the settings you recommended and it seems to be the
same issue:

Type: Striped-Replicate
Volume ID: 44adfd8c-2ed1-4aa5-b256-d12b64f7fc14
Status: Started
Number of Bricks: 1 x 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: gfs001:/bricks/t1/s
Brick2: gfs002:/bricks/t1/s
Brick3: gfs001:/bricks/t2/s
Brick4: gfs002:/bricks/t2/s
Options Reconfigured:
performance.stat-prefetch: off
network.remote-dio: on
cluster.eager-lock: enable
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
performance.readdir-ahead: on

                                                          Is there a RAID controller perhaps doing any caching?

Are any errors being reported in the gluster logs during the migration
process?
Since they aren't in use yet, have you tested making just mirrored
bricks using different pairings of servers, two at a time, to see if the
problem follows a certain machine or network port?

                                                          On 03/12/2016 03:25 PM, David Gossage wrote:

On Sat, Mar 12, 2016 at 1:55 AM, Mahdi Adnan <mahdi.adnan@xxxxxxxxxxxxxxxxx> wrote:

Dears,

                                                          I have created a replicated striped volume with two bricks and two
servers, but I can't use it because when I mount it in ESXi and try to
migrate a VM to it, the data gets corrupted.
Does anyone have any idea why this is happening?

Dell 2950 x2
Seagate 15k 600GB
CentOS 7.2
Gluster 3.7.8

Appreciate your help.

                                                          Most reports of this I have seen end up being settings related.  Post
your gluster volume info. Below is what I have seen as the most commonly
recommended settings.
I'd hazard a guess you may have some of the read-ahead cache or prefetch
on.

quick-read=off
read-ahead=off
io-cache=off
stat-prefetch=off
eager-lock=enable
remote-dio=on
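
The short names above correspond to the full option names that appear in the volume info listings elsewhere in this thread. A sketch of applying them with `gluster volume set`; the volume name `myvol` is a placeholder, and the script only prints the commands unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Sketch: apply the settings above using their full option names.
VOL=${VOL:-myvol}   # hypothetical volume name

# Print each command; execute it only when DRY_RUN=0.
run() {
    echo "$*"
    # stdin is redirected so the command cannot consume the option list below.
    if [ "${DRY_RUN:-1}" = "0" ]; then "$@" </dev/null; fi
}

apply_vm_settings() {
    while read -r key value; do
        run gluster volume set "$VOL" "$key" "$value"
    done <<'EOF'
performance.quick-read off
performance.read-ahead off
performance.io-cache off
performance.stat-prefetch off
cluster.eager-lock enable
network.remote-dio on
EOF
}

apply_vm_settings
```

`gluster volume info <volname>` afterwards should list the changed values under "Options Reconfigured".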

                                                          Mahdi Adnan
System Admin

_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-users
