Thanks for the info Pranith;

  <pranithk> the option to increase the max number of background
  self-heals is cluster.background-self-heal-count; its default
  value is 16.  I assume you know what you are doing to the
  performance of the system by increasing this number.

I didn't know this.  Is there a queue length for what is yet to be
handled by the background self-heal count?  If so, can it also be
adjusted?
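For the archives: my reading is that with hand-written volfiles (as
we use) this knob is an option on the cluster/replicate translator.
The fragment below is an untested sketch based on that reading, with
illustrative volume names, rather than a verified config:

  volume replicate
    type cluster/replicate
    # default is 16; raising it allows more concurrent background
    # self-heals at the cost of memory, disk and network contention
    option background-self-heal-count 32
    subvolumes local-distribute saturni-remote
  end-volume

On a glusterd-managed volume I believe the equivalent would be:

  gluster volume set <volname> cluster.background-self-heal-count 32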
----- Original Message -----
> From: "Pranith Kumar K" <pranithk@xxxxxxxxxxx>
> To: "Ian Latter" <ian.latter@xxxxxxxxxxxxxxxx>
> Subject: Re: replicate background threads
> Date: Tue, 13 Mar 2012 21:07:53 +0530
>
> On 03/13/2012 07:52 PM, Ian Latter wrote:
> > Hello,
> >
> > Well, we've now been privy to our first true error in Gluster,
> > and we're not sure of the cause.
> >
> > The SaturnI machine, with 1GB of RAM, was exhausting its memory
> > and crashing, and we saw core dumps on SaturnM and MMC.  After
> > replacing the SaturnI hardware with hardware identical to
> > SaturnM's, but retaining SaturnI's original disks (so fixing the
> > memory capacity problem), we saw random crashes at all nodes.
> >
> > Looking for irregularities at the file system, we noticed that
> > (we'd estimate) about 60% of the files at the OS/EXT3 layer of
> > SaturnI (sourced via replicate from SaturnM) were of size
> > 2147483648 (2^31) where they should have been substantially
> > larger.  While we would happily accept "you shouldn't be using
> > a 32-bit gluster package" as the answer, we note two deltas:
> >   1) All files used in testing were copied from 32-bit clients
> >      to 32-bit servers, with no observable errors.
> >   2) Of the files that were replicated, not all were corrupted
> >      (capped at 2GB -- note that we confirmed this was the
> >      first 2GB of the source file's contents).
> >
> > So is there a known replicate issue with files greater than
> > 2GB?  Has anyone done any serious testing with significant
> > numbers of files of this size?  Are there configurations
> > specific to files/structures of these dimensions?
> >
> > We noted that when we reversed the configuration, such that
> > SaturnI provides the replicate brick over its local distribute
> > plus a remote map to SaturnM, and SaturnM simply serves its
> > local distribute, the data served to MMC is accurate (it
> > continues to show 15GB files, even where there is a local 2GB
> > copy).  Further, a client "cp" at MMC of a 15GB file that has a
> > 2GB local replica will result in a 15GB file being created and
> > replicated via Gluster (i.e. the correct size at both server
> > nodes).
> >
> > So my other question is: is it possible that we've managed to
> > corrupt something in this environment, i.e. during the initial
> > memory exhaustion events?  And is there a robust way to have
> > the replicated files revalidated by Gluster?  A stat doesn't
> > seem to be correcting files in this state (i.e. replicate on
> > SaturnM results in daemon crashes, while replicate on SaturnI
> > results in files being left in the bad state).
> >
> > Also, I'm not a member of the users list; if these questions
> > are better posed there then let me know and I'll re-post them
> > there.
> >
> > Thanks,
> >
> >
> > ----- Original Message -----
> >> From: "Ian Latter" <ian.latter@xxxxxxxxxxxxxxxx>
> >> To: <gluster-devel@xxxxxxxxxx>
> >> Subject: replicate background threads
> >> Date: Sun, 11 Mar 2012 20:17:15 +1000
> >>
> >> Hello,
> >>
> >> My mate Michael and I have been steadily advancing our Gluster
> >> testing, and today we finally reached some heavier conditions.
> >> The outcome was different from the expectations built from our
> >> more basic testing, so I think we have a couple of questions
> >> regarding the AFR/replicate background threads that may need a
> >> developer's contribution.  Any help appreciated.
> >>
> >> The setup is a 3-box environment, but let's start with two:
> >>
> >> SaturnM (Server)
> >>   - 6-core CPU, 16GB RAM, 1Gbps net
> >>   - 3.2.6 kernel (custom distro)
> >>   - 3.2.5 Gluster (32-bit)
> >>   - 3 x 2TB drives, CFQ, EXT3
> >>   - Bricked up into a single local 6TB "distribute" brick
> >>   - "brick" served to the network
> >>
> >> MMC (Client)
> >>   - 4-core CPU, 8GB RAM, 1Gbps net
> >>   - Ubuntu
> >>   - 3.2.5 Gluster (32-bit)
> >>   - Don't recall the disk space locally
> >>   - "brick" from SaturnM mounted
> >>
> >> 500 x 15GB files were copied from MMC to a single
> >> sub-directory on the brick served from SaturnM, and all went
> >> well and dandy.  So then we moved on to a 3-box environment
> >> (lines marked "=" are new or changed relative to the two-box
> >> setup):
> >>
> >> SaturnI (Server)
> >>   = 1-core CPU, 1GB RAM, 1Gbps net
> >>   = 3.2.6 kernel (custom distro)
> >>   = 3.2.5 Gluster (32-bit)
> >>   = 4 x 2TB drives, CFQ, EXT3
> >>   = Bricked up into a single local 8TB "distribute" brick
> >>   = "brick" served to the network
> >>
> >> SaturnM (Server/Client)
> >>   - 6-core CPU, 16GB RAM, 1Gbps net
> >>   - 3.2.6 kernel (custom distro)
> >>   - 3.2.5 Gluster (32-bit)
> >>   - 3 x 2TB drives, CFQ, EXT3
> >>   - Bricked up into a single local 6TB "distribute" brick
> >>   = Replicate brick added to sit over the local distribute
> >>     brick and a client "brick" mapped from SaturnI
> >>   - Replicate "brick" served to the network
> >>
> >> MMC (Client)
> >>   - 4-core CPU, 8GB RAM, 1Gbps net
> >>   - Ubuntu
> >>   - 3.2.5 Gluster (32-bit)
> >>   - Don't recall the disk space locally
> >>   - "brick" from SaturnM mounted
> >>   = "brick" from SaturnI mounted
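To save anyone reconstructing it from the description above: SaturnM's
stack in that test is conceptually the server volfile below.  Treat it
as a from-memory sketch rather than our exact config -- the volume
names, hostname and per-disk stacks are illustrative only:

  # disk1..disk3 are each a storage/posix -> features/locks ->
  # performance/io-threads stack, one per physical disk (elided here)

  volume saturni-remote
    type protocol/client
    option transport-type tcp
    option remote-host saturni
    option remote-subvolume brick
  end-volume

  volume local-distribute
    type cluster/distribute
    subvolumes disk1 disk2 disk3
  end-volume

  # replicate sits over the local distribute plus the remote SaturnI
  # brick; this volume is what protocol/server exports as "brick"
  volume replicate
    type cluster/replicate
    subvolumes local-distribute saturni-remote
  end-volume

The point relevant to this thread is the cluster/replicate volume at
the top: AFR's background self-heals happen in that translator, which
would explain why tuning the per-disk io-threads pools underneath it
(as we describe further down) didn't change the 16-way limit.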
> >> Now, in lesser testing in this scenario all was well -- any
> >> files on SaturnI would appear on SaturnM (not a functional
> >> part of our test) and the content on SaturnM would appear on
> >> SaturnI (the real objective).
> >>
> >> Earlier testing used a handful of smaller files (10s to 100s
> >> of MB) and a single 15GB file.  The 15GB file would be
> >> "stat"ed via an "ls", which would kick off a background
> >> replication (the ls appeared unblocked) and the file would be
> >> transferred.  Also, interrupting the transfer (pulling the LAN
> >> cable) would result in the partial 15GB file being corrected
> >> by a subsequent background process on another stat.
> >>
> >> *However* .. when confronted with 500 x 15GB files in a single
> >> directory (but not the root directory), things don't quite
> >> work out as nicely.
> >>   - First, the "ls" (at MMC against the SaturnM brick) is
> >>     blocking and hangs the terminal (ctrl-c doesn't kill it).
> <pranithk> At max 16 files can be self-healed in the background in
> parallel.  The 17th file's self-heal will happen in the foreground.
> >>   - Then, looking from MMC at the SaturnI file system (ls -s)
> >>     once per second, and comparing the output
> >>     (diff ls1.txt ls2.txt | grep -v '>'), we can see that
> >>     between 10 and 17 files are being updated simultaneously
> >>     by the background process.
> <pranithk> This is expected.
> >>   - Further, a request at MMC for a single file that was
> >>     originally in the 500 x 15GB sub-dir on SaturnM (which
> >>     should return unblocked with correct results) will:
> >>       a) work as expected if there are fewer than 17 active
> >>          background file tasks;
> >>       b) block/hang if there are more.
> >>   - Whereas a stat (ls) outside of the 500 x 15GB
> >>     sub-directory, such as at the root of that brick, would
> >>     always work as expected (return immediately, unblocked).
> <pranithk> stat on the directory will only create the files with '0'
> file size.  Then when you ls/stat the actual file, the self-heal for
> that file gets triggered.
> >>
> >> Thus, to us, it appears as though there is a queue feeding a
> >> set of (around) 16 worker threads in AFR.  If your request was
> >> to the loaded directory then you would be blocked until a
> >> worker was available, and if your request was to any other
> >> location, it would return unblocked regardless of the worker
> >> pool state.
> >>
> >> The only thread metric that we could find to tweak was
> >> performance/io-threads (which was set to 16 per physical disk
> >> -- well, per locks + posix brick stack), but increasing this
> >> to 64 per stack didn't change the outcome (10 to 17 active
> >> background transfers).
> <pranithk> the option to increase the max number of background
> self-heals is cluster.background-self-heal-count; its default value
> is 16.  I assume you know what you are doing to the performance of
> the system by increasing this number.
> >>
> >> So, given the above: is our analysis sound, and if so, is
> >> there a way to increase the size of the pool of active worker
> >> threads?  The objective is to allow unblocked access to an
> >> existing repository of files (on SaturnM) while a
> >> secondary/back-up is being filled, via GlusterFS.
> >>
> >> Note that I understand that performance (throughput) will be
> >> an issue in the described environment: this replication
> >> process is estimated to run for between 10 and 40 hours, which
> >> is acceptable so long as it isn't blocking (there's a
> >> production-capable file set in place).
> >>
> >> Any help appreciated.
> Please let us know how it goes.
> >>
> >> Thanks,
> >>
> >> --
> >> Ian Latter
> >> Late night coder ..
> >> http://midnightcode.org/
> >>
> >> _______________________________________________
> >> Gluster-devel mailing list
> >> Gluster-devel@xxxxxxxxxx
> >> https://lists.nongnu.org/mailman/listinfo/gluster-devel
> >
> > --
> > Ian Latter
> > Late night coder ..
> > http://midnightcode.org/
> >
> > _______________________________________________
> > Gluster-devel mailing list
> > Gluster-devel@xxxxxxxxxx
> > https://lists.nongnu.org/mailman/listinfo/gluster-devel
> hi Ian,
>     inline replies with <pranithk>.
>
> Pranith.

--
Ian Latter
Late night coder ..
http://midnightcode.org/
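P.S. For anyone who wants to reproduce the "between 10 and 17 files"
observation quoted above, this is roughly the one-second sampling
loop we ran at MMC; the mount point is illustrative:

  # snapshot the file sizes once per second and show what changed;
  # entries whose size moved between samples are the files that the
  # background self-heal is actively writing
  ls -s /mnt/saturni/bigdir > ls1.txt
  while sleep 1; do
    ls -s /mnt/saturni/bigdir > ls2.txt
    diff ls1.txt ls2.txt | grep -v '>'
    mv ls2.txt ls1.txt
  done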