Re: Replicated striped data loss


 



On Mon, Mar 14, 2016 at 08:12:27AM +0530, Krutika Dhananjay wrote:
> It would be better to use sharding over stripe for your VM use case. It
> offers better distribution and utilisation of bricks and better heal
> performance.
> And it is well tested.

Basically, the "striping" feature is deprecated; "sharding" is its
improved replacement. I expect to see "striping" completely dropped in
the next major release.
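
For anyone moving off stripe, the replacement path is a plain replicated
(or distributed-replicated) volume with sharding turned on. A minimal
sketch with the standard CLI (the volume name "myvol", the brick paths and
the 64MB block size are placeholders, not values from this thread):

    # plain 2-way replica instead of a striped layout
    gluster volume create myvol replica 2 \
        gfs001:/bricks/b1/myvol gfs002:/bricks/b1/myvol
    gluster volume start myvol

    # enable sharding; only files written after this point get sharded
    gluster volume set myvol features.shard on
    gluster volume set myvol features.shard-block-size 64MB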

Niels


> A couple of things to note before you do that:
> 1. Most of the bug fixes in sharding have gone into 3.7.8. So it is advised
> that you use 3.7.8 or above.
> 2. When you enable sharding on a volume, already existing files in the
> volume do not get sharded; only files that are created after sharding is
> enabled will be.
>     If you do want to shard the existing files, then you would need to cp
> them to a temp name within the volume, and then rename them back to the
> original file name.
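
For clarity, a minimal sketch of that copy-and-rename step, run on a FUSE
mount of the volume (the mount point and file name here are hypothetical):

    # work on a client mount of the volume, never directly on a brick
    cd /mnt/vmware
    cp -p bigvm-flat.vmdk bigvm-flat.vmdk.tmp   # the new copy is written as shards
    mv bigvm-flat.vmdk.tmp bigvm-flat.vmdk      # rename back over the original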
> 
> HTH,
> Krutika
> 
> > On Sun, Mar 13, 2016 at 11:49 PM, Mahdi Adnan <mahdi.adnan@xxxxxxxxxxxxxxxxx> wrote:
> 
> > I couldn't find anything related to cache in the HBAs.
> > What logs are useful in my case? I see only the brick logs, which contain
> > nothing during the failure.
> >
> > ###
> > [2016-03-13 18:05:19.728614] E [MSGID: 113022] [posix.c:1232:posix_mknod]
> > 0-vmware-posix: mknod on
> > /bricks/b003/vmware/.shard/17d75e20-16f1-405e-9fa5-99ee7b1bd7f1.511 failed
> > [File exists]
> > [2016-03-13 18:07:23.337086] E [MSGID: 113022] [posix.c:1232:posix_mknod]
> > 0-vmware-posix: mknod on
> > /bricks/b003/vmware/.shard/eef2d538-8eee-4e58-bc88-fbf7dc03b263.4095 failed
> > [File exists]
> > [2016-03-13 18:07:55.027600] W [trash.c:1922:trash_rmdir] 0-vmware-trash:
> > rmdir issued on /.trashcan/, which is not permitted
> > [2016-03-13 18:07:55.027635] I [MSGID: 115056]
> > [server-rpc-fops.c:459:server_rmdir_cbk] 0-vmware-server: 41987: RMDIR
> > /.trashcan/internal_op (00000000-0000-0000-0000-000000000005/internal_op)
> > ==> (Operation not permitted) [Operation not permitted]
> > [2016-03-13 18:11:34.353441] I [login.c:81:gf_auth] 0-auth/login: allowed
> > user names: c0c72c37-477a-49a5-a305-3372c1c2f2b4
> > [2016-03-13 18:11:34.353463] I [MSGID: 115029]
> > [server-handshake.c:612:server_setvolume] 0-vmware-server: accepted client
> > from gfs002-2727-2016/03/13-20:17:43:613597-vmware-client-4-0-0 (version:
> > 3.7.8)
> > [2016-03-13 18:11:34.591139] I [login.c:81:gf_auth] 0-auth/login: allowed
> > user names: c0c72c37-477a-49a5-a305-3372c1c2f2b4
> > [2016-03-13 18:11:34.591173] I [MSGID: 115029]
> > [server-handshake.c:612:server_setvolume] 0-vmware-server: accepted client
> > from gfs002-2719-2016/03/13-20:17:42:609388-vmware-client-4-0-0 (version:
> > 3.7.8)
> > ###
> >
> > ESXi just keeps telling me "Cannot clone T: The virtual disk is either
> > corrupted or not a supported format.
> > error
> > 3/13/2016 9:06:20 PM
> > Clone virtual machine
> > T
> > VCENTER.LOCAL\Administrator
> > "
> >
> > My setup is two servers with a floating IP controlled by CTDB, and my ESXi
> > server mounts the NFS share via the floating IP.
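
For reference, mounting a Gluster NFS (v3) export from the ESXi side is a
single command; a sketch assuming a hypothetical floating IP of 10.0.0.10
and the "vmware" volume exported at /vmware:

    # on the ESXi host: mount the export via the CTDB floating IP
    esxcli storage nfs add --host=10.0.0.10 --share=/vmware --volume-name=gluster-vmware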
> >
> >
> >
> >
> >
> > On 03/13/2016 08:40 PM, pkoelle wrote:
> >
> >> On 13.03.2016 at 18:22, David Gossage wrote:
> >>
> >>> On Sun, Mar 13, 2016 at 11:07 AM, Mahdi Adnan <mahdi.adnan@xxxxxxxxxxxxxxxxx> wrote:
> >>>>
> >>>> My HBAs are LSISAS1068E, and the filesystem is XFS.
> >>>> I tried EXT4 and it did not help.
> >>>> I have created a striped volume on one server with two bricks, same
> >>>> issue.
> >>>> And I tried a replicated volume with just sharding enabled, same issue;
> >>>> as soon as I disable sharding it works just fine. Neither sharding nor
> >>>> striping works for me.
> >>>> I did follow up on some of the threads in the mailing list and tried
> >>>> some of the fixes that worked for others; none worked for me. :(
> >>>>
> >>>>
> >>> Is it possible the LSI has write-cache enabled?
> >>>
> >> Why is that relevant? Even the backing filesystem has no idea if there is
> >> a RAID or write cache or whatever. There are blocks and sync(), end of
> >> story.
> >> If you lose power and screw up your recovery, or do funky stuff with SAS
> >> multipathing, a controller cache might be an issue. AFAIK that's not what
> >> we are talking about.
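
As a side note, the on-disk write cache behind a plain SAS HBA like the
LSISAS1068E can be checked from the OS with sdparm; a sketch (the device
name is hypothetical):

    # query the Write Cache Enable (WCE) bit on the disk
    sdparm --get=WCE /dev/sdb
    # clear it for write-through behaviour, persisting across resets
    sdparm --clear=WCE --save /dev/sdb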
> >>
> >> I'm afraid that unless the OP has some logs from the server, a
> >> reproducible test case, or a backtrace from the client or server, this
> >> isn't getting us anywhere.
> >>
> >> cheers
> >> Paul
> >>
> >>
> >>>
> >>>
> >>>
> >>> On 03/13/2016 06:54 PM, David Gossage wrote:
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> On Sun, Mar 13, 2016 at 8:16 AM, Mahdi Adnan <
> >>>> mahdi.adnan@xxxxxxxxxxxxxxxxx> wrote:
> >>>>
> >>>>> Okay, so I have enabled shard on my test volume and it did not help;
> >>>>> stupidly enough, I have enabled it on a production volume
> >>>>> ("Distributed-Replicate") and it corrupted half of my VMs.
> >>>>> I have updated Gluster to the latest and nothing seems to have changed
> >>>>> in my situation.
> >>>>> Below is the info of my volume:
> >>>>>
> >>>>>
> >>>> I was pointing at the settings in that email as an example of
> >>>> corruption fixing. I wouldn't recommend enabling sharding if you haven't
> >>>> gotten the base working yet on that cluster. What HBAs are you using,
> >>>> and what is the filesystem layout for the bricks?
> >>>>
> >>>>
> >>>>> Number of Bricks: 3 x 2 = 6
> >>>>> Transport-type: tcp
> >>>>> Bricks:
> >>>>> Brick1: gfs001:/bricks/b001/vmware
> >>>>> Brick2: gfs002:/bricks/b004/vmware
> >>>>> Brick3: gfs001:/bricks/b002/vmware
> >>>>> Brick4: gfs002:/bricks/b005/vmware
> >>>>> Brick5: gfs001:/bricks/b003/vmware
> >>>>> Brick6: gfs002:/bricks/b006/vmware
> >>>>> Options Reconfigured:
> >>>>> performance.strict-write-ordering: on
> >>>>> cluster.server-quorum-type: server
> >>>>> cluster.quorum-type: auto
> >>>>> network.remote-dio: enable
> >>>>> performance.stat-prefetch: disable
> >>>>> performance.io-cache: off
> >>>>> performance.read-ahead: off
> >>>>> performance.quick-read: off
> >>>>> cluster.eager-lock: enable
> >>>>> features.shard-block-size: 16MB
> >>>>> features.shard: on
> >>>>> performance.readdir-ahead: off
> >>>>>
> >>>>>
> >>>>> On 03/12/2016 08:11 PM, David Gossage wrote:
> >>>>>
> >>>>>
> >>>>> On Sat, Mar 12, 2016 at 10:21 AM, Mahdi Adnan <
> >>>>> mahdi.adnan@xxxxxxxxxxxxxxxxx> wrote:
> >>>>>
> >>>>>> Both servers have HBAs, no RAID, and I can set up a replicated or
> >>>>>> dispersed volume without any issues.
> >>>>>> Logs are clean; when I tried to migrate a VM and got the error,
> >>>>>> nothing showed up in the logs.
> >>>>>> I tried mounting the volume on my laptop and it mounted fine, but if I
> >>>>>> use dd to create a data file it just hangs and I can't cancel it, and I
> >>>>>> can't unmount it or anything, I just have to reboot.
> >>>>>> The same servers have another volume on other bricks in a
> >>>>>> distributed-replicated layout, and it works fine.
> >>>>>> I have even tried the same setup in a virtual environment (created two
> >>>>>> VMs, installed Gluster, and created a replicated striped volume) and
> >>>>>> again the same thing, data corruption.
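
A reproducible test along those lines can be scripted from any client; a
sketch (the mount point, file size and server name are placeholders):

    # mount the volume over FUSE and write a test file with direct I/O
    mount -t glusterfs gfs001:/vmware /mnt/vmware
    dd if=/dev/zero of=/mnt/vmware/ddtest.img bs=1M count=1024 oflag=direct
    # checksum now and again later to catch silent corruption
    md5sum /mnt/vmware/ddtest.img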
> >>>>>>
> >>>>>>
> >>>>> I'd look through the mail archives for a topic called "Shard in
> >>>>> Production", I think. The shard portion may not be relevant, but it
> >>>>> does discuss certain settings that had to be applied to avoid
> >>>>> corruption with VMs. You may want to try disabling
> >>>>> performance.readdir-ahead as well.
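
That is a single volume option; a sketch using the volume name that appears
later in this thread (substitute your own):

    # disable client-side readdir caching for the volume
    gluster volume set vmware performance.readdir-ahead off
    # confirm under "Options Reconfigured"
    gluster volume info vmware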
> >>>>>
> >>>>>
> >>>>>
> >>>>>> On 03/12/2016 07:02 PM, David Gossage wrote:
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Sat, Mar 12, 2016 at 9:51 AM, Mahdi Adnan <
> >>>>>> mahdi.adnan@xxxxxxxxxxxxxxxxx> wrote:
> >>>>>>
> >>>>>>> Thanks David,
> >>>>>>>
> >>>>>>> My settings are all defaults; I have just created the pool and
> >>>>>>> started it.
> >>>>>>> I have applied the settings you recommended and it seems to be the
> >>>>>>> same issue:
> >>>>>>>
> >>>>>>> Type: Striped-Replicate
> >>>>>>> Volume ID: 44adfd8c-2ed1-4aa5-b256-d12b64f7fc14
> >>>>>>> Status: Started
> >>>>>>> Number of Bricks: 1 x 2 x 2 = 4
> >>>>>>> Transport-type: tcp
> >>>>>>> Bricks:
> >>>>>>> Brick1: gfs001:/bricks/t1/s
> >>>>>>> Brick2: gfs002:/bricks/t1/s
> >>>>>>> Brick3: gfs001:/bricks/t2/s
> >>>>>>> Brick4: gfs002:/bricks/t2/s
> >>>>>>> Options Reconfigured:
> >>>>>>> performance.stat-prefetch: off
> >>>>>>> network.remote-dio: on
> >>>>>>> cluster.eager-lock: enable
> >>>>>>> performance.io-cache: off
> >>>>>>> performance.read-ahead: off
> >>>>>>> performance.quick-read: off
> >>>>>>> performance.readdir-ahead: on
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>> Is there a RAID controller perhaps doing any caching?
> >>>>>>
> >>>>>> Are any errors being reported in the gluster logs during the
> >>>>>> migration process?
> >>>>>> Since they aren't in use yet, have you tested making just mirrored
> >>>>>> bricks using different pairings of servers, two at a time, to see if
> >>>>>> the problem follows a certain machine or network port?
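
A throwaway volume for that kind of pairing test only takes a minute; a
sketch with hypothetical brick paths:

    # plain 2-way replica between one pair of servers, no stripe, no shard
    gluster volume create pairtest replica 2 \
        gfs001:/bricks/test/pair gfs002:/bricks/test/pair
    gluster volume start pairtest
    # repeat with other server/NIC pairings and re-run the migration test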
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On 03/12/2016 03:25 PM, David Gossage wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On Sat, Mar 12, 2016 at 1:55 AM, Mahdi Adnan <
> >>>>>>> mahdi.adnan@xxxxxxxxxxxxxxxxx> wrote:
> >>>>>>>
> >>>>>>>> Dear all,
> >>>>>>>>
> >>>>>>>> I have created a replicated striped volume with two bricks and two
> >>>>>>>> servers, but I can't use it because when I mount it in ESXi and try
> >>>>>>>> to migrate a VM to it, the data gets corrupted.
> >>>>>>>> Does anyone have any idea why this is happening?
> >>>>>>>>
> >>>>>>>> Dell 2950 x2
> >>>>>>>> Seagate 15k 600GB
> >>>>>>>> CentOS 7.2
> >>>>>>>> Gluster 3.7.8
> >>>>>>>>
> >>>>>>>> Appreciate your help.
> >>>>>>>>
> >>>>>>>>
> >>>>>>> Most reports of this that I have seen end up being settings related.
> >>>>>>> Post your gluster volume info. Below are what I have seen as the most
> >>>>>>> commonly recommended settings.
> >>>>>>> I'd hazard a guess you may have the read-ahead cache or prefetch
> >>>>>>> turned on.
> >>>>>>>
> >>>>>>> quick-read=off
> >>>>>>> read-ahead=off
> >>>>>>> io-cache=off
> >>>>>>> stat-prefetch=off
> >>>>>>> eager-lock=enable
> >>>>>>> remote-dio=on
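
If you are applying those by hand rather than via a group file, the
equivalent volume-set commands would be roughly (the volume name is a
placeholder):

    gluster volume set myvol performance.quick-read off
    gluster volume set myvol performance.read-ahead off
    gluster volume set myvol performance.io-cache off
    gluster volume set myvol performance.stat-prefetch off
    gluster volume set myvol cluster.eager-lock enable
    gluster volume set myvol network.remote-dio on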
> >>>>>>>
> >>>>>>>
> >>>>>>>> Mahdi Adnan
> >>>>>>>> System Admin
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>>
> >>>>>
> >>>>
> >>>>
> >>>
> >>>
> >>>
> >>>
> >>
> >
> >



_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-users

