+ Justin
> -b
>
> On Thu, Feb 5, 2015 at 6:04 PM, David F. Robinson
> <david.robinson@xxxxxxxxxxxxx> wrote:
> Isn't rsync what geo-rep uses?
>
> David (Sent from mobile)
>
> ===============================
> David F. Robinson, Ph.D.
> President - Corvid Technologies
> 704.799.6944 x101 [office]
> 704.252.1310 [cell]
> 704.799.7974 [fax]
> David.Robinson@xxxxxxxxxxxxx
> http://www.corvidtechnologies.com
>
> > > On Feb 5, 2015, at 5:41 PM, Ben Turner <bturner@xxxxxxxxxx> wrote:
> >
> > ----- Original Message -----
> >> From: "Ben Turner" <bturner@xxxxxxxxxx>
> >> To: "David F. Robinson" <david.robinson@xxxxxxxxxxxxx>
> >> Cc: "Pranith Kumar Karampuri" <pkarampu@xxxxxxxxxx>, "Xavier
Hernandez"
> >> <xhernandez@xxxxxxxxxx>, "Benjamin Turner"
> >> <bennyturns@xxxxxxxxx>, gluster-users@xxxxxxxxxxx, "Gluster
Devel"
> >> <gluster-devel@xxxxxxxxxxx>
> >> Sent: Thursday, February 5, 2015 5:22:26 PM
> >> Subject: Re: [Gluster-devel] missing files
> >>
> >> ----- Original Message -----
> >>> From: "David F. Robinson" <david.robinson@xxxxxxxxxxxxx>
> >>> To: "Ben Turner" <bturner@xxxxxxxxxx>
> >>> Cc: "Pranith Kumar Karampuri" <pkarampu@xxxxxxxxxx>, "Xavier
Hernandez"
> >>> <xhernandez@xxxxxxxxxx>, "Benjamin Turner"
> >>> <bennyturns@xxxxxxxxx>, gluster-users@xxxxxxxxxxx, "Gluster
Devel"
> >>> <gluster-devel@xxxxxxxxxxx>
> >>> Sent: Thursday, February 5, 2015 5:01:13 PM
> >>> Subject: Re: [Gluster-devel] missing files
> >>>
> >>> I'll send you the emails I sent Pranith with the logs. What causes
> >>> these disconnects?
> >>
> >> Thanks David! Disconnects happen when there are interruptions in
> >> communication between peers; normally a ping timeout is what triggers
> >> them. It could be anything from a flaky NW to the system being too busy
> >> to respond to the pings. My initial take leans towards the latter, as
> >> rsync is absolutely the worst use case for gluster - IIRC it writes in
> >> 4KB blocks. I try to keep my writes at least 64KB, as in my testing
> >> that is the smallest block size I can write with before perf starts to
> >> really drop off. I'll try something similar in the lab.
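> >>
> >> If you want a rough feel for where that knee is on your own setup,
> >> timing a couple of writes on the fuse mount with different block sizes
> >> should show it (the path below is just a placeholder for wherever
> >> homegfs is mounted), something like:
> >>
> >> dd if=/dev/zero of=/path/to/homegfs-mount/ddtest bs=4k count=16384 conv=fsync
> >> dd if=/dev/zero of=/path/to/homegfs-mount/ddtest bs=64k count=1024 conv=fsync
> >>
> >> Both write the same 64MB total, so the reported throughput is directly
> >> comparable.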
> >
> > Ok, I do think that the files being self-healed is the RCA for what you
> > were seeing. Let's look at one of the disconnects:
> >
> > data-brick02a-homegfs.log:[2015-02-03 20:54:02.772180] I [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection from gfs01b.corvidtec.com-4175-2015/02/02-16:44:31:179119-homegfs-client-2-0-1
> >
> > And in the glustershd log from gfs01b (gfs01b_glustershd.log):
> >
> > [2015-02-03 20:55:48.001797] I [afr-self-heal-entry.c:554:afr_selfheal_entry_do] 0-homegfs-replicate-0: performing entry selfheal on 6c79a368-edaa-432b-bef9-ec690ab42448
> > [2015-02-03 20:55:49.341996] I [afr-self-heal-common.c:476:afr_log_selfheal] 0-homegfs-replicate-0: Completed entry selfheal on 6c79a368-edaa-432b-bef9-ec690ab42448. source=1 sinks=0
> > [2015-02-03 20:55:49.343093] I [afr-self-heal-entry.c:554:afr_selfheal_entry_do] 0-homegfs-replicate-0: performing entry selfheal on 792cb0d6-9290-4447-8cd7-2b2d7a116a69
> > [2015-02-03 20:55:50.463652] I [afr-self-heal-common.c:476:afr_log_selfheal] 0-homegfs-replicate-0: Completed entry selfheal on 792cb0d6-9290-4447-8cd7-2b2d7a116a69. source=1 sinks=0
> > [2015-02-03 20:55:51.465289] I [afr-self-heal-metadata.c:54:__afr_selfheal_metadata_do] 0-homegfs-replicate-0: performing metadata selfheal on 403e661a-1c27-4e79-9867-c0572aba2b3c
> > [2015-02-03 20:55:51.466515] I [afr-self-heal-common.c:476:afr_log_selfheal] 0-homegfs-replicate-0: Completed metadata selfheal on 403e661a-1c27-4e79-9867-c0572aba2b3c. source=1 sinks=0
> > [2015-02-03 20:55:51.467098] I [afr-self-heal-entry.c:554:afr_selfheal_entry_do] 0-homegfs-replicate-0: performing entry selfheal on 403e661a-1c27-4e79-9867-c0572aba2b3c
> > [2015-02-03 20:55:55.257808] I [afr-self-heal-common.c:476:afr_log_selfheal] 0-homegfs-replicate-0: Completed entry selfheal on 403e661a-1c27-4e79-9867-c0572aba2b3c. source=1 sinks=0
> > [2015-02-03 20:55:55.258548] I [afr-self-heal-metadata.c:54:__afr_selfheal_metadata_do] 0-homegfs-replicate-0: performing metadata selfheal on c612ee2f-2fb4-4157-a9ab-5a2d5603c541
> > [2015-02-03 20:55:55.259367] I [afr-self-heal-common.c:476:afr_log_selfheal] 0-homegfs-replicate-0: Completed metadata selfheal on c612ee2f-2fb4-4157-a9ab-5a2d5603c541. source=1 sinks=0
> > [2015-02-03 20:55:55.259980] I [afr-self-heal-entry.c:554:afr_selfheal_entry_do] 0-homegfs-replicate-0: performing entry selfheal on c612ee2f-2fb4-4157-a9ab-5a2d5603c541
> >
> > As you can see, the self heal logs are spammed with files being healed,
> > and for the couple of disconnects I looked at I see self heals getting
> > run shortly afterwards on the bricks that were down. Now we need to find
> > the cause of the disconnects; I am thinking that once the disconnects
> > are resolved the files should be copied over properly without SH having
> > to fix things. Like I said, I'll give this a go on my lab systems and
> > see if I can repro the disconnects - I'll have time to run through it
> > tomorrow. If in the meantime anyone else has a theory / anything to add
> > here, it would be appreciated.
> >
> > -b
> >
> >> -b
> >>
> >>> David (Sent from mobile)
> >>>
> >>> ===============================
> >>> David F. Robinson, Ph.D.
> >>> President - Corvid Technologies
> >>> 704.799.6944 x101 [office]
> >>> 704.252.1310 [cell]
> >>> 704.799.7974 [fax]
> >>> David.Robinson@xxxxxxxxxxxxx
> >>> http://www.corvidtechnologies.com
> >>>
> >>>> On Feb 5, 2015, at 4:55 PM, Ben Turner <bturner@xxxxxxxxxx> wrote:
> >>>>
> >>>> ----- Original Message -----
> >>>>> From: "Pranith Kumar Karampuri" <pkarampu@xxxxxxxxxx>
> >>>>> To: "Xavier Hernandez" <xhernandez@xxxxxxxxxx>, "David F.
Robinson"
> >>>>> <david.robinson@xxxxxxxxxxxxx>, "Benjamin Turner"
> >>>>> <bennyturns@xxxxxxxxx>
> >>>>> Cc: gluster-users@xxxxxxxxxxx, "Gluster Devel"
> >>>>> <gluster-devel@xxxxxxxxxxx>
> >>>>> Sent: Thursday, February 5, 2015 5:30:04 AM
> >>>>> Subject: Re: [Gluster-devel] missing files
> >>>>>
> >>>>>
> >>>>>> On 02/05/2015 03:48 PM, Pranith Kumar Karampuri wrote:
> >>>>>> I believe David already fixed this. I hope this is the same
> >>>>>> permissions issue he told us about.
> >>>>> Oops, it is not. I will take a look.
> >>>>
> >>>> Yes David, exactly like these:
> >>>>
> >>>> data-brick02a-homegfs.log:[2015-02-03 19:09:34.568842] I [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection from gfs02a.corvidtec.com-18563-2015/02/03-19:07:58:519134-homegfs-client-2-0-0
> >>>> data-brick02a-homegfs.log:[2015-02-03 19:09:41.286551] I [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection from gfs01a.corvidtec.com-12804-2015/02/03-19:09:38:497808-homegfs-client-2-0-0
> >>>> data-brick02a-homegfs.log:[2015-02-03 19:16:35.906412] I [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection from gfs02b.corvidtec.com-27190-2015/02/03-19:15:53:458467-homegfs-client-2-0-0
> >>>> data-brick02a-homegfs.log:[2015-02-03 19:51:22.761293] I [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection from gfs01a.corvidtec.com-25926-2015/02/03-19:51:02:89070-homegfs-client-2-0-0
> >>>> data-brick02a-homegfs.log:[2015-02-03 20:54:02.772180] I [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection from gfs01b.corvidtec.com-4175-2015/02/02-16:44:31:179119-homegfs-client-2-0-1
> >>>>
> >>>> You can 100% verify my theory if you can correlate the times of the
> >>>> disconnects to the times that the missing files were healed. Can you
> >>>> have a look at /var/log/glusterfs/glustershd.log? That has all of the
> >>>> healed files + timestamps; if we can see a disconnect during the rsync
> >>>> and a self heal of the missing file, I think we can safely assume that
> >>>> the disconnects may have caused this. I'll try this on my test systems
> >>>> - how much data did you rsync? Roughly what size of files / an idea of
> >>>> the dir layout?
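> >>>>
> >>>> If it helps, grepping for the two messages should line the timestamps
> >>>> up quickly (assuming the default log locations; adjust the brick log
> >>>> path to match your setup), for example:
> >>>>
> >>>> grep "disconnecting connection" /var/log/glusterfs/bricks/*.log
> >>>> grep "Completed .* selfheal" /var/log/glusterfs/glustershd.log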
> >>>>
> >>>> @Pranith - Could bricks flapping up and down during the rsync be a
> >>>> possible cause here: the files were missing on the first ls because
> >>>> they were written to one subvol but not the other (since it was down),
> >>>> the ls triggered SH, and that's why the files were there for the
> >>>> second ls?
> >>>>
> >>>> -b
> >>>>
> >>>>
> >>>>> Pranith
> >>>>>>
> >>>>>> Pranith
> >>>>>>> On 02/05/2015 03:44 PM, Xavier Hernandez wrote:
> >>>>>>> Is the failure repeatable ? with the same directories ?
> >>>>>>>
> >>>>>>> It's very weird that the directories appear on the volume when you
> >>>>>>> do an 'ls' on the bricks. Could it be that you only did a single
> >>>>>>> 'ls' on the fuse mount, which did not show the directory? Is it
> >>>>>>> possible that this 'ls' triggered a self-heal that repaired the
> >>>>>>> problem, whatever it was, and when you did another 'ls' on the fuse
> >>>>>>> mount after the 'ls' on the bricks, the directories were there?
> >>>>>>>
> >>>>>>> The first 'ls' could have healed the files, so that the following
> >>>>>>> 'ls' on the bricks showed the files as if nothing were damaged. If
> >>>>>>> that's the case, it's possible that there were some disconnections
> >>>>>>> during the copy.
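> >>>>>>>
> >>>>>>> If that is what happened, the heals should have left traces:
> >>>>>>> something like 'gluster volume heal homegfs info' would show
> >>>>>>> anything still pending, and the glustershd log on the servers
> >>>>>>> should show what was already repaired around the time of the copy.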
> >>>>>>>
> >>>>>>> Added Pranith because he knows the replication and self-heal
> >>>>>>> details better.
> >>>>>>>
> >>>>>>> Xavi
> >>>>>>>
> >>>>>>>> On 02/04/2015 07:23 PM, David F. Robinson wrote:
> >>>>>>>> Distributed/replicated
> >>>>>>>>
> >>>>>>>> Volume Name: homegfs
> >>>>>>>> Type: Distributed-Replicate
> >>>>>>>> Volume ID: 1e32672a-f1b7-4b58-ba94-58c085e59071
> >>>>>>>> Status: Started
> >>>>>>>> Number of Bricks: 4 x 2 = 8
> >>>>>>>> Transport-type: tcp
> >>>>>>>> Bricks:
> >>>>>>>> Brick1: gfsib01a.corvidtec.com:/data/brick01a/homegfs
> >>>>>>>> Brick2: gfsib01b.corvidtec.com:/data/brick01b/homegfs
> >>>>>>>> Brick3: gfsib01a.corvidtec.com:/data/brick02a/homegfs
> >>>>>>>> Brick4: gfsib01b.corvidtec.com:/data/brick02b/homegfs
> >>>>>>>> Brick5: gfsib02a.corvidtec.com:/data/brick01a/homegfs
> >>>>>>>> Brick6: gfsib02b.corvidtec.com:/data/brick01b/homegfs
> >>>>>>>> Brick7: gfsib02a.corvidtec.com:/data/brick02a/homegfs
> >>>>>>>> Brick8: gfsib02b.corvidtec.com:/data/brick02b/homegfs
> >>>>>>>> Options Reconfigured:
> >>>>>>>> performance.io-thread-count: 32
> >>>>>>>> performance.cache-size: 128MB
> >>>>>>>> performance.write-behind-window-size: 128MB
> >>>>>>>> server.allow-insecure: on
> >>>>>>>> network.ping-timeout: 10
> >>>>>>>> storage.owner-gid: 100
> >>>>>>>> geo-replication.indexing: off
> >>>>>>>> geo-replication.ignore-pid-check: on
> >>>>>>>> changelog.changelog: on
> >>>>>>>> changelog.fsync-interval: 3
> >>>>>>>> changelog.rollover-time: 15
> >>>>>>>> server.manage-gids: on
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> ------ Original Message ------
> >>>>>>>> From: "Xavier Hernandez" <xhernandez@xxxxxxxxxx>
> >>>>>>>> To: "David F. Robinson" <david.robinson@xxxxxxxxxxxxx>;
"Benjamin
> >>>>>>>> Turner" <bennyturns@xxxxxxxxx>
> >>>>>>>> Cc: "gluster-users@xxxxxxxxxxx"
<gluster-users@xxxxxxxxxxx>;
> >>>>>>>> "Gluster
> >>>>>>>> Devel" <gluster-devel@xxxxxxxxxxx>
> >>>>>>>> Sent: 2/4/2015 6:03:45 AM
> >>>>>>>> Subject: Re: [Gluster-devel] missing files
> >>>>>>>>
> >>>>>>>>>> On 02/04/2015 01:30 AM, David F. Robinson wrote:
> >>>>>>>>>> Sorry. Thought about this a little more. I should have been
> >>>>>>>>>> clearer.
> >>>>>>>>>> The files were on both bricks of the replica, not just one side.
> >>>>>>>>>> So, both bricks had to have been up... The files/directories
> >>>>>>>>>> just don't show up on the mount.
> >>>>>>>>>> I was reading and saw a related bug
> >>>>>>>>>> (https://bugzilla.redhat.com/show_bug.cgi?id=1159484), where it
> >>>>>>>>>> was suggested to run:
> >>>>>>>>>> find <mount> -d -exec getfattr -h -n trusted.ec.heal {} \;
> >>>>>>>>>
> >>>>>>>>> This command is specific to dispersed volumes. It won't do
> >>>>>>>>> anything (aside from the error you are seeing) on a replicated
> >>>>>>>>> volume.
> >>>>>>>>>
> >>>>>>>>> I think you are using a replicated volume, right ?
> >>>>>>>>>
> >>>>>>>>> In that case I'm not sure what could be happening. Is your volume
> >>>>>>>>> a pure replicated one or a distributed-replicated one ? On a pure
> >>>>>>>>> replicated volume it doesn't make sense that some entries do not
> >>>>>>>>> show up in an 'ls' when the file is on both replicas (at least
> >>>>>>>>> without any error message in the logs). On a
> >>>>>>>>> distributed-replicated volume it could be caused by some problem
> >>>>>>>>> while combining the contents of each replica set.
> >>>>>>>>>
> >>>>>>>>> What's the configuration of your volume ?
> >>>>>>>>>
> >>>>>>>>> Xavi
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> I get a bunch of errors for operation not supported:
> >>>>>>>>>> [root@gfs02a homegfs]# find wks_backup -d -exec getfattr -h -n trusted.ec.heal {} \;
> >>>>>>>>>> find: warning: the -d option is deprecated; please use -depth instead, because the latter is a POSIX-compliant feature.
> >>>>>>>>>> wks_backup/homer_backup/backup: trusted.ec.heal: Operation not supported
> >>>>>>>>>> wks_backup/homer_backup/logs/2014_05_20.log: trusted.ec.heal: Operation not supported
> >>>>>>>>>> wks_backup/homer_backup/logs/2014_05_21.log: trusted.ec.heal: Operation not supported
> >>>>>>>>>> wks_backup/homer_backup/logs/2014_05_18.log: trusted.ec.heal: Operation not supported
> >>>>>>>>>> wks_backup/homer_backup/logs/2014_05_19.log: trusted.ec.heal: Operation not supported
> >>>>>>>>>> wks_backup/homer_backup/logs/2014_05_22.log: trusted.ec.heal: Operation not supported
> >>>>>>>>>> wks_backup/homer_backup/logs: trusted.ec.heal: Operation not supported
> >>>>>>>>>> wks_backup/homer_backup: trusted.ec.heal: Operation not supported
> >>>>>>>>>> ------ Original Message ------
> >>>>>>>>>> From: "Benjamin Turner" <bennyturns@xxxxxxxxx
> >>>>>>>>>> <mailto:bennyturns@xxxxxxxxx>>
> >>>>>>>>>> To: "David F. Robinson" <david.robinson@xxxxxxxxxxxxx
> >>>>>>>>>> <mailto:david.robinson@xxxxxxxxxxxxx>>
> >>>>>>>>>> Cc: "Gluster Devel" <gluster-devel@xxxxxxxxxxx
> >>>>>>>>>> <mailto:gluster-devel@xxxxxxxxxxx>>;
"gluster-users@xxxxxxxxxxx"
> >>>>>>>>>> <gluster-users@xxxxxxxxxxx
<mailto:gluster-users@xxxxxxxxxxx>>
> >>>>>>>>>> Sent: 2/3/2015 7:12:34 PM
> >>>>>>>>>> Subject: Re: [Gluster-devel] missing files
> >>>>>>>>>>> It sounds to me like the files were only copied to one replica,
> >>>>>>>>>>> weren't there for the initial ls which triggered a self heal,
> >>>>>>>>>>> and were there for the last ls because they were healed. Is
> >>>>>>>>>>> there any chance that one of the replicas was down during the
> >>>>>>>>>>> rsync? It could be that you lost a brick during the copy or
> >>>>>>>>>>> something like that. To confirm, I would look for disconnects
> >>>>>>>>>>> in the brick logs as well as check glustershd.log to verify
> >>>>>>>>>>> that the missing files were actually healed.
> >>>>>>>>>>>
> >>>>>>>>>>> -b
> >>>>>>>>>>>
> >>>>>>>>>>> On Tue, Feb 3, 2015 at 5:37 PM, David F. Robinson
> >>>>>>>>>>> <david.robinson@xxxxxxxxxxxxx> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> I rsync'd 20 TB over to my gluster system and noticed that I
> >>>>>>>>>>> had some directories missing even though the rsync completed
> >>>>>>>>>>> normally. The rsync logs showed that the missing files were
> >>>>>>>>>>> transferred.
> >>>>>>>>>>> I went to the bricks and did an 'ls -al
> >>>>>>>>>>> /data/brick*/homegfs/dir/*'; the files were on the bricks.
> >>>>>>>>>>> After I did this 'ls', the files then showed up on the FUSE
> >>>>>>>>>>> mounts.
> >>>>>>>>>>> 1) Why are the files hidden on the fuse mount?
> >>>>>>>>>>> 2) Why does the ls make them show up on the FUSE mount?
> >>>>>>>>>>> 3) How can I prevent this from happening again?
> >>>>>>>>>>> Note, I also mounted the gluster volume using NFS and saw the
> >>>>>>>>>>> same behavior. The files/directories were not shown until I did
> >>>>>>>>>>> the "ls" on the bricks.
> >>>>>>>>>>> David
> >>>>>>>>>>> ===============================
> >>>>>>>>>>> David F. Robinson, Ph.D.
> >>>>>>>>>>> President - Corvid Technologies
> >>>>>>>>>>> 704.799.6944 x101 [office]
> >>>>>>>>>>> 704.252.1310 [cell]
> >>>>>>>>>>> 704.799.7974 [fax]
> >>>>>>>>>>> David.Robinson@xxxxxxxxxxxxx
> >>>>>>>>>>> http://www.corvidtechnologies.com
> >>>>>>>>>>>
> >>>>>>>>>>> _______________________________________________
> >>>>>>>>>>> Gluster-devel mailing list
> >>>>>>>>>>> Gluster-devel@xxxxxxxxxxx
> >>>>>>>>>>> http://www.gluster.org/mailman/listinfo/gluster-devel
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> _______________________________________________
> >>>>>>>>>> Gluster-devel mailing list
> >>>>>>>>>> Gluster-devel@xxxxxxxxxxx
> >>>>>>>>>> http://www.gluster.org/mailman/listinfo/gluster-devel
> >>>>>>
> >>>>>> _______________________________________________
> >>>>>> Gluster-users mailing list
> >>>>>> Gluster-users@xxxxxxxxxxx
> >>>>>> http://www.gluster.org/mailman/listinfo/gluster-users
> >>>>>
> >>>>> _______________________________________________
> >>>>> Gluster-users mailing list
> >>>>> Gluster-users@xxxxxxxxxxx
> >>>>> http://www.gluster.org/mailman/listinfo/gluster-users
> >>
>
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel@xxxxxxxxxxx
> http://www.gluster.org/mailman/listinfo/gluster-devel
--
GlusterFS - http://www.gluster.org
An open source, distributed file system scaling to several
petabytes, and handling thousands of clients.
My personal twitter: twitter.com/realjustinclift