Re: replicate background threads

"Ian Latter" <ian.latter@xxxxxxxxxxxxxxxx> · Tue, 03 Apr 2012 19:46:51 +1000

FYI - untested patch attached.
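
For anyone wanting to try it, a sketch of how the new option might be set in a
hand-written volfile once the patch is applied. The option names come from the
patch and the thread; the volume and subvolume names are placeholders, and the
values shown are only an illustration, not a tested configuration.

```
volume replicate
  type cluster/replicate
  # existing limit on concurrent background self-heals (default 16)
  option background-self-heal-count 16
  # new option from the attached patch: skip instead of healing in foreground
  option background-self-heal-only on
  # placeholder subvolume names
  subvolumes local-distribute saturni-client
end-volume
```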



----- Original Message -----
>From: "Ian Latter" <ian.latter@xxxxxxxxxxxxxxxx>
>To: "Pranith Kumar K" <pranithk@xxxxxxxxxxx>
>Subject:  Re: replicate background threads
>Date: Tue, 03 Apr 2012 18:50:11 +1000
>
> 
> FYI - I can see that this option doesn't exist; I'm adding it
> now.
> 
> 
> ----- Original Message -----
> >From: "Ian Latter" <ian.latter@xxxxxxxxxxxxxxxx>
> >To: "Pranith Kumar K" <pranithk@xxxxxxxxxxx>
> >Subject:  Re: replicate background threads
> >Date: Mon, 02 Apr 2012 18:02:26 +1000
> >
> > 
> > Hello Pranith,
> > 
> > 
> >   Michael has come back from his business trip and
> > we're about to start testing again (though now under 
> > kernel 3.2.13 and GlusterFS 3.2.6).  
> > 
> >   I've published the 32bit (i586) client on the Saturn 
> > project site if anyone is chasing it;
> >   http://midnightcode.org/projects/saturn/
> > 
> >   One quick question: is there a tunable parameter
> > that will allow a stat to be non-blocking (i.e. to stop
> > self-heal going foreground) when the background
> > self-heal count is reached?
> >   I.e. rather than having the stat hang for 2 days
> > while the files are replicated, we'd rather it fell
> > through and allowed subsequent stats to attempt
> > background self-healing (perhaps at a time when
> > background self-heal slots are available).
> > 
> > 
> > Thanks,
> > 
> > 
> > 
> > ----- Original Message -----
> > >From: "Ian Latter" <ian.latter@xxxxxxxxxxxxxxxx>
> > >To: "Pranith Kumar K" <pranithk@xxxxxxxxxxx>
> > >Subject:  Re: replicate background threads
> > >Date: Wed, 14 Mar 2012 19:36:24 +1000
> > >
> > > Hello,
> > > 
> > > > hi Ian,
> > > >      Maintaining a queue of files that need to be 
> > > > self-healed does not scale in practice, in cases
> > > > where there are millions of files that need self-
> > > > heal. So such a thing is not implemented. The 
> > > > idea is to make self-heal foreground after a 
> > > > certain-limit (background-self-heal-count) so 
> > > > there is no necessity for such a queue.
> > > > 
> > > > Pranith.
> > > 
> > > Ok, I understand - it will be interesting to observe
> > > the system with the new knowledge from your
> > > messages - thanks for your help, appreciate it.
> > > 
> > > 
> > > Cheers,
> > > 
> > > ----- Original Message -----
> > > >From: "Pranith Kumar K" <pranithk@xxxxxxxxxxx>
> > > >To: "Ian Latter" <ian.latter@xxxxxxxxxxxxxxxx>
> > > > >Subject:  Re: replicate background threads
> > > >Date: Wed, 14 Mar 2012 07:33:32 +0530
> > > >
> > > > On 03/14/2012 01:47 AM, Ian Latter wrote:
> > > > > Thanks for the info Pranith;
> > > > >
> > > > > > <pranithk>  the option to increase the max num of background
> > > > > > self-heals is cluster.background-self-heal-count. Default
> > > > > > value of which is 16. I assume you know what you are doing
> > > > > > to the performance of the system by increasing this number.
> > > > >
> > > > >
> > > > > I didn't know this.  Is there a queue length for what
> > > > > is yet to be handled by the background self heal
> > > > > count?  If so, can it also be adjusted?
> > > > >
> > > > >
> > > > > ----- Original Message -----
> > > > >> From: "Pranith Kumar K"<pranithk@xxxxxxxxxxx>
> > > > >> To: "Ian Latter"<ian.latter@xxxxxxxxxxxxxxxx>
> > > > > >> Subject:  Re: replicate background threads
> > > > >> Date: Tue, 13 Mar 2012 21:07:53 +0530
> > > > >>
> > > > >> On 03/13/2012 07:52 PM, Ian Latter wrote:
> > > > >>> Hello,
> > > > >>>
> > > > >>>
> > > > >>>     Well we've been privy to our first true error in
> > > > >>> Gluster now, and we're not sure of the cause.
> > > > >>>
> > > > >>>     The SaturnI machine with 1Gbyte of RAM was
> > > > >>> exhausting its memory and crashing and we saw
> > > > >>> core dumps on SaturnM and MMC.  Replacing
> > > > >>> the SaturnI hardware with identical hardware to
> > > > >>> SaturnM, but retaining SaturnI's original disks,
> > > > >>> (so fixing the memory capacity problem) we saw
> > > > >>> crashes randomly at all nodes.
> > > > >>>
> > > > >>>     Looking for irregularities at the file system
> > > > >>> we noticed that (we'd estimate) about 60% of
> > > > >>> the files at the OS/EXT3 layer of SaturnI
> > > > >>> (sourced via replicate from SaturnM) were of
> > > > >>> size 2147483648 (2^31) where they should
> > > > >>> have been substantially larger.  While we would
> > > > >>> happily accept "you shouldn't be using a 32bit
> > > > >>> gluster package" as the answer, we note two
> > > > >>> deltas;
> > > > >>>     1) All files used in testing were copied on from
> > > > >>>          32 bit clients to 32 bit servers, with no
> > > > >>>          observable errors
> > > > > >>>     2) Of the files that were replicated, not all were
> > > > >>>          corrupted (capped at 2G -- note that we
> > > > >>>          confirmed that this was the first 2G of the
> > > > >>>          source file contents).
> > > > >>>
> > > > >>>
> > > > >>> So is there a known replicate issue with files
> > > > >>> greater than 2GB?  Has anyone done any
> > > > >>> serious testing with significant numbers of files
> > > > >>> of this size?  Are there configurations specific
> > > > >>> to files/structures of these dimensions?
> > > > >>>
> > > > >>> We noted that reversing the configuration, such
> > > > >>> that SaturnI provides the replicate Brick amongst
> > > > >>> a local distribute and a remote map to SaturnM
> > > > >>> where SaturnM simply serves a local distribute;
> > > > >>> that the data served to MMC is accurate (it
> > > > >>> continues to show 15GB files, even where there
> > > > >>> is a local 2GB copy).  Further, a client "cp" at
> > > > >>> MMC, of a file with a 2GB local replicate of a
> > > > >>> 15GB file, will result in a 15GB file being
> > > > >>> created and replicated via Gluster (i.e. the
> > > > >>> correct specification at both server nodes).
> > > > >>>
> > > > >>> So my other question is; Is it possible that we've
> > > > >>> managed to corrupt something in this
> > > > >>> environment?  I.e. during the initial memory
> > > > >>> exhaustion events?  And is there a robust way
> > > > >>> to have the replicate files revalidated by gluster
> > > > >>> as a stat doesn't seem to be correcting files in
> > > > >>> this state (i.e. replicate on SaturnM results in
> > > > >>> daemon crashes, replicate on SaturnI results
> > > > >>> in files being left in the bad state).
> > > > >>>
> > > > >>>
> > > > >>> Also, I'm not a member of the users list; if these
> > > > >>> questions are better posed there then let me
> > > > >>> know and I'll re-post them there.
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>> Thanks,
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>> ----- Original Message -----
> > > > >>>> From: "Ian Latter"<ian.latter@xxxxxxxxxxxxxxxx>
> > > > >>>> To:<gluster-devel@xxxxxxxxxx>
> > > > > >>>> Subject:  replicate background threads
> > > > >>>> Date: Sun, 11 Mar 2012 20:17:15 +1000
> > > > >>>>
> > > > >>>> Hello,
> > > > >>>>
> > > > >>>>
> > > > >>>>     My mate Michael and I have been steadily
> > > > >>>> advancing our Gluster testing and today we finally
> > > > >>>> reached some heavier conditions.  The outcome
> > > > >>>> was different from expectations built from our more
> > > > >>>> basic testing so I think we have a couple of
> > > > >>>> questions regarding the AFR/Replicate background
> > > > >>>> threads that may need a developer's contribution.
> > > > >>>> Any help appreciated.
> > > > >>>>
> > > > >>>>
> > > > > >>>>     The setup is a 3 box environment, but let's start
> > > > >>>> with two;
> > > > >>>>
> > > > >>>>       SaturnM (Server)
> > > > >>>>          - 6core CPU, 16GB RAM, 1Gbps net
> > > > >>>>          - 3.2.6 Kernel (custom distro)
> > > > >>>>          - 3.2.5 Gluster (32bit)
> > > > >>>>          - 3x2TB drives, CFQ, EXT3
> > > > >>>>          - Bricked up into a single local 6TB
> > > > >>>>             "distribute" brick
> > > > >>>>          - "brick" served to the network
> > > > >>>>
> > > > >>>>       MMC (Client)
> > > > >>>>          - 4core CPU, 8GB RAM, 1Gbps net
> > > > >>>>          - Ubuntu
> > > > >>>>          - 3.2.5 Gluster (32bit)
> > > > >>>>          - Don't recall the disk space locally
> > > > >>>>          - "brick" from SaturnM mounted
> > > > >>>>
> > > > >>>>       500 x 15Gbyte files were copied from MMC
> > > > >>>> to a single sub-directory on the brick served from
> > > > >>>> SaturnM, all went well and dandy.  So then we
> > > > >>>> moved on to a 3 box environment;
> > > > >>>>
> > > > >>>>       SaturnI (Server)
> > > > >>>>          = 1core CPU, 1GB RAM, 1Gbps net
> > > > >>>>          = 3.2.6 Kernel (custom distro)
> > > > >>>>          = 3.2.5 Gluster (32bit)
> > > > >>>>          = 4x2TB drives, CFQ, EXT3
> > > > >>>>          = Bricked up into a single local 8TB
> > > > >>>>             "distribute" brick
> > > > >>>>          = "brick" served to the network
> > > > >>>>
> > > > >>>>       SaturnM (Server/Client)
> > > > >>>>          - 6core CPU, 16GB RAM, 1Gbps net
> > > > >>>>          - 3.2.6 Kernel (custom distro)
> > > > >>>>          - 3.2.5 Gluster (32bit)
> > > > >>>>          - 3x2TB drives, CFQ, EXT3
> > > > >>>>          - Bricked up into a single local 6TB
> > > > >>>>             "distribute" brick
> > > > >>>>          = Replicate brick added to sit over
> > > > >>>>             the local distribute brick and a
> > > > >>>>             client "brick" mapped from SaturnI
> > > > >>>>          - Replicate "brick" served to the network
> > > > >>>>
> > > > >>>>       MMC (Client)
> > > > >>>>          - 4core CPU, 8GB RAM, 1Gbps net
> > > > >>>>          - Ubuntu
> > > > >>>>          - 3.2.5 Gluster (32bit)
> > > > >>>>          - Don't recall the disk space locally
> > > > >>>>          - "brick" from SaturnM mounted
> > > > >>>>          = "brick" from SaturnI mounted
> > > > >>>>
> > > > >>>>
> > > > >>>>     Now, in lesser testing in this scenario all was
> > > > >>>> well - any files on SaturnI would appear on SaturnM
> > > > > >>>> (not a functional part of our test) and the content on
> > > > >>>> SaturnM would appear on SaturnI (the real
> > > > >>>> objective).
> > > > >>>>
> > > > > >>>>     Earlier testing used a handful of smaller files (10s
> > > > > >>>> to 100s of Mbytes) and a single 15Gbyte file.  The
> > > > > >>>> 15Gbyte file would be "stat" via an "ls", which would
> > > > > >>>> kick off a background replication (ls appeared
> > > > > >>>> unblocked) and the file would be transferred.  Also,
> > > > > >>>> interrupting the transfer (pulling the LAN cable)
> > > > > >>>> would result in a partial 15Gbyte file being corrected
> > > > > >>>> in a subsequent background process on another
> > > > > >>>> stat.
> > > > >>>>
> > > > > >>>>     *However* .. when confronted with 500 x 15Gbyte
> > > > > >>>> files, in a single directory (but not the root directory)
> > > > > >>>> things don't quite work out as nicely.
> > > > > >>>>     - First, the "ls" (at MMC against the SaturnM brick)
> > > > > >>>>       is blocking and hangs the terminal (ctrl-c doesn't
> > > > > >>>>       kill it).
> > > > > >> <pranithk>  At max 16 files can be self-healed in the back-ground in
> > > > > >> parallel. 17th file self-heal will happen in the foreground.
> > > > >>>>     - Then, looking from MMC at the SaturnI file
> > > > >>>>        system (ls -s) once per second, and then
> > > > >>>>        comparing the output (diff ls1.txt ls2.txt |
> > > > > >>>>        grep -v '>') we can see that between 10 and 17
> > > > > >>>>        files are being updated simultaneously by the
> > > > > >>>>        background process
> > > > >> <pranithk>  This is expected.
> > > > > >>>>     - Further, a request at MMC for a single file that
> > > > > >>>>       was originally in the 500 x 15Gbyte sub-dir on
> > > > > >>>>       SaturnM (which should return unblocked with
> > > > > >>>>       correct results) will;
> > > > > >>>>         a) work as expected if there are fewer than 17
> > > > > >>>>             active background file tasks
> > > > >>>>         b) block/hang if there are more
> > > > > >>>>     - Whereas a stat (ls) outside of the 500 x 15
> > > > > >>>>        sub-directory, such as the root of that brick,
> > > > > >>>>        would always work as expected (return
> > > > > >>>>        immediately, unblocked).
> > > > > >> <pranithk>  stat on the directory will only create the files with '0'
> > > > > >> file size. Then when you ls/stat the actual file the self-heal for the
> > > > > >> file gets triggered.
> > > > >>>>
> > > > >>>>     Thus, to us, it appears as though there is a
> > > > >>>> queue feeding a set of (around) 16 worker threads
> > > > > >>>> in AFR.  If your request was to the loaded directory
> > > > >>>> then you would be blocked until a worker was
> > > > >>>> available, and if your request was to any other
> > > > >>>> location, it would return unblocked regardless of
> > > > >>>> the worker pool state.
> > > > >>>>
> > > > > >>>>     The only thread metric that we could find to tweak
> > > > >>>> was performance/io-threads (which was set to
> > > > >>>> 16 per physical disk; well per locks + posix brick
> > > > >>>> stacks) but increasing this to 64 per stack didn't
> > > > >>>> change the outcome (10 to 17 active background
> > > > >>>> transfers).
> > > > > >> <pranithk>  the option to increase the max num of background self-heals
> > > > > >> is cluster.background-self-heal-count. Default value of which is 16. I
> > > > > >> assume you know what you are doing to the performance of the system by
> > > > > >> increasing this number.
> > > > >>>>
> > > > >>>>     So, given the above, is our analysis sound, and
> > > > >>>> if so, is there a way to increase the size of the
> > > > >>>> pool of active worker threads?  The objective
> > > > >>>> being to allow unblocked access to an existing
> > > > >>>> repository of files (on SaturnM) while a
> > > > >>>> secondary/back-up is being filled, via GlusterFS?
> > > > >>>>
> > > > >>>>     Note that I understand that performance
> > > > >>>> (through-put) will be an issue in the described
> > > > >>>> environment: this replication process is
> > > > >>>> estimated to run for between 10 and 40 hours,
> > > > >>>> which is acceptable so long as it isn't blocking
> > > > >>>> (there's a production-capable file set in place).
> > > > >>>>
> > > > >>>>
> > > > >>>>
> > > > >>>>
> > > > >>>>
> > > > >>>> Any help appreciated.
> > > > >>>>
> > > > >> Please let us know how it goes.
> > > > >>>> Thanks,
> > > > >>>>
> > > > >>>>
> > > > >>>>
> > > > >>>>
> > > > >>>>
> > > > >>>>
> > > > >>>> --
> > > > >>>> Ian Latter
> > > > >>>> Late night coder ..
> > > > >>>> http://midnightcode.org/
> > > > >>>>
> > > > >>>> _______________________________________________
> > > > >>>> Gluster-devel mailing list
> > > > >>>> Gluster-devel@xxxxxxxxxx
> > > > >>>>
> https://lists.nongnu.org/mailman/listinfo/gluster-devel
> > > > >>>>
> > > > >>> --
> > > > >>> Ian Latter
> > > > >>> Late night coder ..
> > > > >>> http://midnightcode.org/
> > > > >>>
> > > > >>> _______________________________________________
> > > > >>> Gluster-devel mailing list
> > > > >>> Gluster-devel@xxxxxxxxxx
> > > > >>>
> https://lists.nongnu.org/mailman/listinfo/gluster-devel
> > > > >> hi Ian,
> > > > >>        inline replies with<pranithk>.
> > > > >>
> > > > >> Pranith.
> > > > >>
> > > > >
> > > > > --
> > > > > Ian Latter
> > > > > Late night coder ..
> > > > > http://midnightcode.org/
> > > > hi Ian,
> > > > >       Maintaining a queue of files that need to be self-healed does not
> > > > > scale in practice, in cases where there are millions of files that need
> > > > > self-heal. So such a thing is not implemented. The idea is to make
> > > > > self-heal foreground after a certain-limit (background-self-heal-count)
> > > > > so there is no necessity for such a queue.
> > > > 
> > > > Pranith.
> > > > 
> > > 
> > > 
> > > --
> > > Ian Latter
> > > Late night coder ..
> > > http://midnightcode.org/
> > > 
> > > _______________________________________________
> > > Gluster-devel mailing list
> > > Gluster-devel@xxxxxxxxxx
> > > https://lists.nongnu.org/mailman/listinfo/gluster-devel
> > > 
> > 
> > 
> > --
> > Ian Latter
> > Late night coder ..
> > http://midnightcode.org/
> > 
> > _______________________________________________
> > Gluster-devel mailing list
> > Gluster-devel@xxxxxxxxxx
> > https://lists.nongnu.org/mailman/listinfo/gluster-devel
> > 
> 
> 
> --
> Ian Latter
> Late night coder ..
> http://midnightcode.org/
> 
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel@xxxxxxxxxx
> https://lists.nongnu.org/mailman/listinfo/gluster-devel
> 


--
Ian Latter
Late night coder ..
http://midnightcode.org/
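
A minimal standalone sketch of the decision the patch below is aiming for,
assuming the behaviour described in the thread: up to
background-self-heal-count heals run in the background, and past that limit a
lookup either heals in the foreground (current behaviour) or skips the heal
(new option). This is not GlusterFS code; struct heal_state, enum heal_action
and decide_heal() are hypothetical names, and incrementing the counter inside
the same function is a simplification.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the few afr_private_t fields the patch touches. */
struct heal_state {
        pthread_mutex_t lock;
        unsigned int    background_self_heal_count;    /* limit, default 16 */
        unsigned int    background_self_heals_started; /* heals in flight */
        bool            background_self_heal_only;     /* new option */
};

enum heal_action { HEAL_BACKGROUND, HEAL_FOREGROUND, HEAL_SKIP };

/* Decide what a lookup does with a file that needs self-heal.
 * The lock is taken once and released on every path. */
static enum heal_action
decide_heal (struct heal_state *st)
{
        enum heal_action action;

        assert (st != NULL);

        pthread_mutex_lock (&st->lock);
        if (st->background_self_heals_started
            < st->background_self_heal_count) {
                /* A background slot is free: claim it. */
                st->background_self_heals_started++;
                action = HEAL_BACKGROUND;
        } else if (st->background_self_heal_only) {
                /* Slots full, option on: fall through without healing. */
                action = HEAL_SKIP;
        } else {
                /* Slots full, option off: caller blocks on the heal. */
                action = HEAL_FOREGROUND;
        }
        pthread_mutex_unlock (&st->lock);

        return action;
}
```

With a limit of 16 and the option on, the first 16 calls would claim
background slots and the 17th would be skipped rather than blocking; a later
stat can try again once a slot has been released.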

--- xlators/cluster/afr/src/afr.h-3.2.6	2012-04-03 18:18:57.738893869 +1000
+++ xlators/cluster/afr/src/afr.h	2012-04-03 18:20:40.631877186 +1000
@@ -74,6 +74,7 @@
 
         unsigned int background_self_heal_count;
         unsigned int background_self_heals_started;
+        gf_boolean_t background_self_heal_only;   /* on/off */
         gf_boolean_t metadata_self_heal;   /* on/off */
         gf_boolean_t entry_self_heal;      /* on/off */
 
--- xlators/cluster/afr/src/afr.c-3.2.6	2012-04-03 18:17:51.133895275 +1000
+++ xlators/cluster/afr/src/afr.c	2012-04-03 18:56:57.748850077 +1000
@@ -103,6 +103,7 @@
         gf_boolean_t metadata_change_log;   /* on/off */
         gf_boolean_t entry_change_log;      /* on/off */
         gf_boolean_t strict_readdir;
+        gf_boolean_t background_self_heal_only; /* on/off */
 
         afr_private_t * priv        = NULL;
         xlator_list_t * trav        = NULL;
@@ -112,6 +113,7 @@
         char * change_log      = NULL;
         char * str_readdir     = NULL;
         char * self_heal_algo  = NULL;
+        char * background_only = NULL;
 
         int32_t background_count  = 0;
         int32_t window_size       = 0;
@@ -134,6 +136,26 @@
                 priv->background_self_heal_count = background_count;
         }
 
+        dict_ret = dict_get_str (options, "background-self-heal-only",
+                                 &background_only);
+        if (dict_ret == 0) {
+                temp_ret = gf_string2boolean (background_only,
+				&background_self_heal_only);
+                if (temp_ret < 0) {
+                        gf_log (this->name, GF_LOG_WARNING,
+                                "Reconfiguration Invalid 'option background"
+                                "-self-heal-only %s'. Defaulting to off.",
+                                background_only);
+                        ret = -1;
+                        goto out;
+                }
+
+                priv->background_self_heal_only = background_self_heal_only;
+                gf_log (this->name, GF_LOG_DEBUG,
+                        "Reconfiguring 'option background"
+                        "-self-heal-only %s'.", background_only);
+        }
+
         dict_ret = dict_get_str (options, "metadata-self-heal",
                                  &self_heal);
         if (dict_ret == 0) {
@@ -380,6 +402,7 @@
         char * inodelk_trace   = NULL;
         char * entrylk_trace   = NULL;
         char * def_val         = NULL;
+        char * background_only = NULL;
         int32_t background_count  = 0;
         int32_t lock_server_count = 1;
         int32_t window_size       = 0;
@@ -422,6 +445,23 @@
                 priv->background_self_heal_count = background_count;
         }
 
+        priv->background_self_heal_only = 0;
+
+        dict_ret = dict_get_str (this->options, "background-self-heal-only",
+                                 &background_only);
+        if (dict_ret == 0) {
+                ret = gf_string2boolean (background_only,
+                        &priv->background_self_heal_only);
+                if (ret < 0) {
+                        gf_log (this->name, GF_LOG_WARNING,
+                                "Invalid 'option background-self-heal-only %s'"
+                                ". Defaulting to background-self-heal-only as"
+                                " 'off'.",
+                                background_only);
+                        priv->background_self_heal_only = 0;
+                }
+        }
+
         /* Default values */
 
         priv->data_self_heal     = 1;
@@ -828,6 +868,16 @@
           .type = GF_OPTION_TYPE_INT,
           .min  = 0
         },
+        { .key  = {"background-self-heal-only"},
+          .type = GF_OPTION_TYPE_BOOL,
+          .default_value = "0",
+          .description = "Action to take once background-self-heal-count has "
+                         "been reached. The default is \"off\" which blocks "
+                         "subsequent requests, by self healing in the "
+                         "foreground. Setting this to \"on\" ensures that "
+                         "subsequent requests are passed through without "
+                         "triggering the self heal process."
+        },
         { .key  = {"data-self-heal"},
           .type = GF_OPTION_TYPE_BOOL
         },
--- xlators/cluster/afr/src/afr-common.c-3.2.6	2012-04-03 18:11:58.288906265 +1000
+++ xlators/cluster/afr/src/afr-common.c	2012-04-03 19:12:52.838829695 +1000
@@ -1298,6 +1298,19 @@
                 goto out;
         }
 
+        if (priv->background_self_heal_only) {
+                LOCK (&priv->lock);
+                if (priv->background_self_heals_started
+                    >= priv->background_self_heal_count) {
+                        UNLOCK (&priv->lock); /* drop lock before bailing */
+                        gf_log (this->name, GF_LOG_DEBUG,
+                                "Max background self heals reached - "
+                                "do not attempt to detect self heal");
+                        goto out;
+                }
+                UNLOCK (&priv->lock);
+        }
+
         afr_lookup_set_self_heal_data (local, this);
         if (afr_can_self_heal_proceed (&local->self_heal, priv)) {
                 if  (afr_is_self_heal_running (local))
--- xlators/cluster/afr/src/pump.c-3.2.6	2012-04-03 18:44:14.933864242 +1000
+++ xlators/cluster/afr/src/pump.c	2012-04-03 18:44:38.479865804 +1000
@@ -2311,6 +2311,7 @@
         priv->read_child = source_child;
         priv->favorite_child = source_child;
         priv->background_self_heal_count = 0;
+        priv->background_self_heal_only = 0;
 
 	priv->data_self_heal     = 1;
 	priv->metadata_self_heal = 1;