Re: replicate background threads


On 03/14/2012 01:47 AM, Ian Latter wrote:
Thanks for the info Pranith;

<pranithk>  The option to increase the maximum number of
background self-heals is cluster.background-self-heal-count,
which defaults to 16. I assume you know what increasing this
number will do to the performance of the system.


I didn't know this.  Is there a queue length for what
is yet to be handled by the background self heal
count?  If so, can it also be adjusted?


----- Original Message -----
From: "Pranith Kumar K"<pranithk@xxxxxxxxxxx>
To: "Ian Latter"<ian.latter@xxxxxxxxxxxxxxxx>
Subject:  Re: replicate background threads
Date: Tue, 13 Mar 2012 21:07:53 +0530

On 03/13/2012 07:52 PM, Ian Latter wrote:
Hello,


    Well we've been privy to our first true error in
Gluster now, and we're not sure of the cause.

    The SaturnI machine with 1Gbyte of RAM was
exhausting its memory and crashing and we saw
core dumps on SaturnM and MMC.  Replacing
the SaturnI hardware with identical hardware to
SaturnM, but retaining SaturnI's original disks,
(so fixing the memory capacity problem) we saw
crashes randomly at all nodes.

    Looking for irregularities at the file system
we noticed that (we'd estimate) about 60% of
the files at the OS/EXT3 layer of SaturnI
(sourced via replicate from SaturnM) were of
size 2147483648 (2^31) where they should
have been substantially larger.  While we would
happily accept "you shouldn't be using a 32bit
gluster package" as the answer, we note two
deltas;
    1) All files used in testing were copied on from
         32 bit clients to 32 bit servers, with no
         observable errors
    2) Of the files that were replicated, not all were
         corrupted (capped at 2G -- note that we
         confirmed that this was the first 2G of the
         source file contents).
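
A minimal sketch of how the capped copies could be listed on a brick's backend file system. The function name and the brick path in the usage note are hypothetical; the only figure taken from this report is the 2147483648 byte (2^31) size. It only matches on size and does not verify file contents.

```python
import os

TWO_GIB = 2 ** 31  # 2147483648 bytes, the size the truncated copies were capped at


def find_capped_files(brick_dir, cap=TWO_GIB):
    """Return sorted paths under brick_dir whose size is exactly `cap` bytes."""
    capped = []
    for root, _dirs, files in os.walk(brick_dir):
        for name in files:
            path = os.path.join(root, name)
            if os.path.getsize(path) == cap:
                capped.append(path)
    return sorted(capped)
```

Calling find_capped_files("/path/to/brick") (placeholder path) on SaturnI would return the candidates to compare against their sources on SaturnM.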


So is there a known replicate issue with files
greater than 2GB?  Has anyone done any
serious testing with significant numbers of files
of this size?  Are there configurations specific
to files/structures of these dimensions?

We noted that reversing the configuration, such
that SaturnI provides the replicate Brick amongst
a local distribute and a remote map to SaturnM
where SaturnM simply serves a local distribute;
that the data served to MMC is accurate (it
continues to show 15GB files, even where there
is a local 2GB copy).  Further, a client "cp" at
MMC, of a file with a 2GB local replicate of a
15GB file, will result in a 15GB file being
created and replicated via Gluster (i.e. the
correct specification at both server nodes).

So my other question is; Is it possible that we've
managed to corrupt something in this
environment?  I.e. during the initial memory
exhaustion events?  And is there a robust way
to have the replicate files revalidated by gluster
as a stat doesn't seem to be correcting files in
this state (i.e. replicate on SaturnM results in
daemon crashes, replicate on SaturnI results
in files being left in the bad state).


Also, I'm not a member of the users list; if these
questions are better posed there then let me
know and I'll re-post them there.



Thanks,





----- Original Message -----
From: "Ian Latter"<ian.latter@xxxxxxxxxxxxxxxx>
To:<gluster-devel@xxxxxxxxxx>
Subject:  replicate background threads
Date: Sun, 11 Mar 2012 20:17:15 +1000

Hello,


    My mate Michael and I have been steadily
advancing our Gluster testing and today we finally
reached some heavier conditions.  The outcome
was different from expectations built from our more
basic testing so I think we have a couple of
questions regarding the AFR/Replicate background
threads that may need a developer's contribution.
Any help appreciated.


    The setup is a 3 box environment, but let's start
with two;

      SaturnM (Server)
         - 6core CPU, 16GB RAM, 1Gbps net
         - 3.2.6 Kernel (custom distro)
         - 3.2.5 Gluster (32bit)
         - 3x2TB drives, CFQ, EXT3
         - Bricked up into a single local 6TB
            "distribute" brick
         - "brick" served to the network

      MMC (Client)
         - 4core CPU, 8GB RAM, 1Gbps net
         - Ubuntu
         - 3.2.5 Gluster (32bit)
         - Don't recall the disk space locally
         - "brick" from SaturnM mounted

      500 x 15Gbyte files were copied from MMC
to a single sub-directory on the brick served from
SaturnM, all went well and dandy.  So then we
moved on to a 3 box environment;

      SaturnI (Server)
         = 1core CPU, 1GB RAM, 1Gbps net
         = 3.2.6 Kernel (custom distro)
         = 3.2.5 Gluster (32bit)
         = 4x2TB drives, CFQ, EXT3
         = Bricked up into a single local 8TB
            "distribute" brick
         = "brick" served to the network

      SaturnM (Server/Client)
         - 6core CPU, 16GB RAM, 1Gbps net
         - 3.2.6 Kernel (custom distro)
         - 3.2.5 Gluster (32bit)
         - 3x2TB drives, CFQ, EXT3
         - Bricked up into a single local 6TB
            "distribute" brick
         = Replicate brick added to sit over
            the local distribute brick and a
            client "brick" mapped from SaturnI
         - Replicate "brick" served to the network

      MMC (Client)
         - 4core CPU, 8GB RAM, 1Gbps net
         - Ubuntu
         - 3.2.5 Gluster (32bit)
         - Don't recall the disk space locally
         - "brick" from SaturnM mounted
         = "brick" from SaturnI mounted


    Now, in lesser testing in this scenario all was
well - any files on SaturnI would appear on SaturnM
(not a functional part of our test) and the content on
SaturnM would appear on SaturnI (the real
objective).

    Earlier testing used a handful of smaller files (10s
to 100s of Mbytes) and a single 15Gbyte file.  The
15Gbyte file would be "stat" via an "ls", which would
kick off a background replication (ls appeared
unblocked) and the file would be transferred.  Also,
interrupting the transfer (pulling the LAN cable)
would result in a partial 15Gbyte file being corrected
in a subsequent background process on another
stat.

    *However* .. when confronted with 500 x 15Gbyte
files, in a single directory (but not the root directory)
things don't quite work out as nicely.
    - First, the "ls" (at MMC against the SaturnM brick)
      is blocking and hangs the terminal (ctrl-c doesn't
      kill it).
<pranithk>  At most 16 files can be self-healed in the
background in parallel. The 17th file's self-heal will
happen in the foreground.
    - Then, looking from MMC at the SaturnI file
       system (ls -s) once per second, and then
       comparing the output (diff ls1.txt ls2.txt |
       grep -v '>') we can see that between 10 and 17
       files are being updated simultaneously by the
       background process
<pranithk>  This is expected.
    - Further, a request at MMC for a single file that
      was originally in the 500 x 15Gbyte sub-dir on
      SaturnM (which should return unblocked with
      correct results) will;
        a) work as expected if there are fewer than 17
            active background file tasks
        b) block/hang if there are more
    - Where-as a stat (ls) outside of the 500 x 15
       sub-directory, such as the root of that brick,
       would always work as expected (return
       immediately, unblocked).
<pranithk>  A stat on the directory will only create the
files with a file size of '0'. Then, when you ls/stat the
actual file, the self-heal for that file gets triggered.
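
The once-per-second "ls -s" and diff comparison described in the list above could be sketched roughly as follows. This is an illustration, not the commands actually used: the function names are made up, and it compares apparent file size (st_size) rather than the allocated blocks that "ls -s" reports. It is meant to be pointed at the SaturnI brick directly, since stat-ing files through the replicate mount is what triggers self-heal.

```python
import os
import time


def snapshot_sizes(directory):
    """Map each regular file name in `directory` to its current size in bytes."""
    sizes = {}
    for entry in os.scandir(directory):
        if entry.is_file():
            sizes[entry.name] = entry.stat().st_size
    return sizes


def changed_files(before, after):
    """Names whose size differs between two snapshots, or that are new in `after`."""
    return sorted(name for name, size in after.items() if before.get(name) != size)


def watch(directory, interval=1.0, rounds=10):
    """Print how many files changed size during each interval."""
    previous = snapshot_sizes(directory)
    for _ in range(rounds):
        time.sleep(interval)
        current = snapshot_sizes(directory)
        print(len(changed_files(previous, current)), "files changed size")
        previous = current
```

With this, a steady count of around 16 changed files per interval would match the background self-heal limit mentioned in the replies.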

    Thus, to us, it appears as though there is a
queue feeding a set of (around) 16 worker threads
in AFR.  If your request was to the loaded directory
then you would be blocked until a worker was
available, and if your request was to any other
location, it would return unblocked regardless of
the worker pool state.

    The only thread metric that we could find to tweak
was performance/io-threads (which was set to
16 per physical disk; well per locks + posix brick
stacks) but increasing this to 64 per stack didn't
change the outcome (10 to 17 active background
transfers).
<pranithk>  The option to increase the maximum number of
background self-heals is cluster.background-self-heal-count,
which defaults to 16. I assume you know what increasing this
number will do to the performance of the system.

    So, given the above, is our analysis sound, and
if so, is there a way to increase the size of the
pool of active worker threads?  The objective
being to allow unblocked access to an existing
repository of files (on SaturnM) while a
secondary/back-up is being filled, via GlusterFS?

    Note that I understand that performance
(through-put) will be an issue in the described
environment: this replication process is
estimated to run for between 10 and 40 hours,
which is acceptable so long as it isn't blocking
(there's a production-capable file set in place).
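
As a back-of-envelope check on that estimate, the transfer time can be computed from the figures in this post (500 files, 15 Gbyte each, a 1Gbps link). The sketch below assumes decimal units, a single link as the only bottleneck, and a hypothetical "efficiency" factor for protocol and disk overhead; none of these assumptions were measured.

```python
def transfer_hours(file_count, file_size_bytes, link_bits_per_sec, efficiency=1.0):
    """Hours needed to move file_count files of file_size_bytes over one link."""
    total_bits = file_count * file_size_bytes * 8
    return total_bits / (link_bits_per_sec * efficiency) / 3600


# 500 files of 15 GB (decimal) over 1 Gbps at full line rate
print(round(transfer_hours(500, 15e9, 1e9), 1))  # 16.7
```

At an assumed 50% efficiency the same call gives roughly 33 hours, which falls inside the 10 to 40 hour range quoted above.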





Any help appreciated.

Please let us know how it goes.
Thanks,






--
Ian Latter
Late night coder ..
http://midnightcode.org/

_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxx
https://lists.nongnu.org/mailman/listinfo/gluster-devel

hi Ian,
       inline replies with <pranithk>.

Pranith.


hi Ian,
Maintaining a queue of files that need to be self-healed does not scale in practice when there are millions of files that need self-heal, so no such queue is implemented. The idea is to make self-heal run in the foreground once a certain limit (background-self-heal-count) is reached, so there is no need for such a queue.

Pranith.
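
The behaviour described in this reply (a fixed number of background self-heals, with any further heal running in the foreground of the request that triggered it, and no queue) can be modelled with a toy sketch like the one below. This is not the AFR code; the class and method names are invented for illustration, and only the limit of 16 comes from the thread.

```python
import threading


class SelfHealLimiter:
    """Toy model: up to `limit` heals run in background threads; the rest run inline."""

    def __init__(self, limit=16):
        self.limit = limit
        self.active = 0
        self.lock = threading.Lock()

    def heal(self, heal_fn, *args):
        """Run heal_fn in the background if a slot is free, else in the caller.

        Returns "background" or "foreground" to show which path was taken.
        """
        with self.lock:
            background = self.active < self.limit
            if background:
                self.active += 1
        if background:
            threading.Thread(target=self._run, args=(heal_fn, args)).start()
            return "background"
        heal_fn(*args)  # caller blocks here until the heal completes
        return "foreground"

    def _run(self, heal_fn, args):
        try:
            heal_fn(*args)
        finally:
            with self.lock:
                self.active -= 1  # free the slot for a later background heal
```

In this model the 17th concurrent heal blocks its caller, which would be consistent with the hanging "ls" reported earlier in the thread, and nothing is ever queued.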


