Re: replicate background threads

Anand Avati <anand.avati@xxxxxxxxx> · Wed, 4 Apr 2012 12:43:16 -0700

Can you submit this patch to gerrit? Instructions -
http://www.gluster.org/community/documentation/index.php/Development_Work_Flow

Avati

On Tue, Apr 3, 2012 at 3:41 AM, Ian Latter <ian.latter@xxxxxxxxxxxxxxxx> wrote:

Pizza reveals all ;-)

There's an error in there with the LOCK going without a paired
UNLOCK in the afr-common test.  Revised (untested) patch attached.
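
For anyone following along, the class of bug described here (a LOCK taken on a path that returns without the matching UNLOCK) can be illustrated outside the GlusterFS source. The sketch below is not the attached patch and does not use the afr-common code; HealCounter, try_reserve_buggy and try_reserve are hypothetical names, and Python's threading.Lock stands in for the LOCK/UNLOCK pair.

```python
import threading


class HealCounter:
    """Hypothetical lock-protected counter (illustration, not GlusterFS code)."""

    def __init__(self, limit):
        self.limit = limit
        self.count = 0
        self.lock = threading.Lock()

    def try_reserve_buggy(self):
        # Bug pattern: the early return leaves the lock held.
        self.lock.acquire()
        if self.count >= self.limit:
            return False          # LOCK without a paired UNLOCK
        self.count += 1
        self.lock.release()
        return True

    def try_reserve(self):
        # Fixed pattern: "with" releases the lock on every exit path.
        with self.lock:
            if self.count >= self.limit:
                return False
            self.count += 1
            return True


buggy = HealCounter(limit=0)
buggy.try_reserve_buggy()
print(buggy.lock.locked())   # the lock is still held after the call

fixed = HealCounter(limit=0)
fixed.try_reserve()
print(fixed.lock.locked())   # the lock has been released
```

In C the equivalent fix is usually a single exit label that performs the UNLOCK, so that every branch taken after the LOCK passes through it.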

----- Original Message -----

>From: "Ian Latter" <ian.latter@xxxxxxxxxxxxxxxx>

>To: "Pranith Kumar K" <pranithk@xxxxxxxxxxx>

>Subject:  Re: replicate background threads

>Date: Tue, 03 Apr 2012 19:46:51 +1000

>

>

> FYI - untested patch attached.

>

>

>

> ----- Original Message -----

> >From: "Ian Latter" <ian.latter@xxxxxxxxxxxxxxxx>

> >To: "Pranith Kumar K" <pranithk@xxxxxxxxxxx>

> >Subject:  Re: replicate background threads

> >Date: Tue, 03 Apr 2012 18:50:11 +1000

> >

> >

> > FYI - I can see that this option doesn't exist, I'm adding it
> > now.

> >

> >

> > ----- Original Message -----

> > >From: "Ian Latter" <ian.latter@xxxxxxxxxxxxxxxx>

> > >To: "Pranith Kumar K" <pranithk@xxxxxxxxxxx>

> > >Subject:  Re: replicate background threads

> > >Date: Mon, 02 Apr 2012 18:02:26 +1000

> > >

> > >

> > > Hello Pranith,

> > >

> > >

> > >   Michael has come back from his business trip and

> > > we're about to start testing again (though now under

> > > kernel 3.2.13 and GlusterFS 3.2.6).

> > >

> > >   I've published the 32bit (i586) client on the Saturn

> > > project site if anyone is chasing it;

> > >   http://midnightcode.org/projects/saturn/

> > >

> > >   One quick question, is there a tune-able parameter

> > > that will allow a stat to be non blocking (i.e. to stop

> > > self-heal going foreground) when the background

> > > self heal count is reached?

> > >   I.e. rather than having the stat hang for 2 days

> > > while the files are replicated, we'd rather it fell

> > > through and allowed subsequent stats to attempt

> > > background self healing (perhaps at a time when

> > > background self heal slots are available).
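
To make the request above concrete, here is a rough model of the two behaviours. It is only a sketch: SelfHealGate, on_stat and the skip_when_full switch are made-up names for illustration, not GlusterFS options or functions, and the limit of 16 is the background-self-heal-count default quoted elsewhere in this thread.

```python
class SelfHealGate:
    """Toy model of a bounded pool of background self-heal slots."""

    def __init__(self, background_limit=16, skip_when_full=False):
        self.background_limit = background_limit
        self.skip_when_full = skip_when_full  # hypothetical tunable
        self.active = 0

    def on_stat(self, path):
        """Decide what a stat of a file needing self-heal triggers."""
        if self.active < self.background_limit:
            self.active += 1
            return "background"   # caller returns immediately
        if self.skip_when_full:
            return "skipped"      # fall through; a later stat retries
        return "foreground"       # caller blocks until the heal finishes

    def on_heal_done(self):
        """Free a slot when a background heal completes."""
        self.active -= 1


# Behaviour as described in the thread: the 17th heal goes foreground.
current = SelfHealGate()
current_results = [current.on_stat(f"file{i}") for i in range(17)]

# Behaviour being asked for: the 17th stat falls through instead.
wanted = SelfHealGate(skip_when_full=True)
wanted_results = [wanted.on_stat(f"file{i}") for i in range(17)]

print(current_results[-1], wanted_results[-1])
```

The trade-off in the second mode is that a skipped file stays unhealed until something stats it again after a slot has been freed.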

> > >

> > >

> > > Thanks,

> > >

> > >

> > >

> > > ----- Original Message -----

> > > >From: "Ian Latter" <ian.latter@xxxxxxxxxxxxxxxx>

> > > >To: "Pranith Kumar K" <pranithk@xxxxxxxxxxx>

> > > > >Subject:  Re: replicate background threads

> > > >Date: Wed, 14 Mar 2012 19:36:24 +1000

> > > >

> > > > Hello,

> > > >

> > > > > hi Ian,

> > > > >      Maintaining a queue of files that need to be

> > > > > self-healed does not scale in practice, in cases

> > > > > where there are millions of files that need self-

> > > > > heal. So such a thing is not implemented. The

> > > > > idea is to make self-heal foreground after a

> > > > > certain-limit (background-self-heal-count) so

> > > > > there is no necessity for such a queue.

> > > > >

> > > > > Pranith.

> > > >

> > > > Ok, I understand - it will be interesting to observe

> > > > the system with the new knowledge from your

> > > > messages - thanks for your help, appreciate it.

> > > >

> > > >

> > > > Cheers,

> > > >

> > > > ----- Original Message -----

> > > > >From: "Pranith Kumar K" <pranithk@xxxxxxxxxxx>

> > > > >To: "Ian Latter" <ian.latter@xxxxxxxxxxxxxxxx>

> > > > > >Subject:  Re: replicate background threads

> > > > >Date: Wed, 14 Mar 2012 07:33:32 +0530

> > > > >

> > > > > On 03/14/2012 01:47 AM, Ian Latter wrote:

> > > > > > Thanks for the info Pranith;

> > > > > >

> > > > > > <pranithk>  the option to increase the max num of

> > > background

> > > > > > self-heals

> > > > > > is cluster.background-self-heal-count. Default

> value of

> > > > > > which is 16. I

> > > > > > assume you know what you are doing to the

performance

> > > of the

> > > > > > system by

> > > > > > increasing this number.

> > > > > >

> > > > > >

> > > > > > > I didn't know this.  Is there a queue length for what
> > > > > > > is yet to be handled by the background self heal
> > > > > > > count?  If so, can it also be adjusted?

> > > > > >

> > > > > >

> > > > > > ----- Original Message -----

> > > > > >> From: "Pranith Kumar K"<pranithk@xxxxxxxxxxx>

> > > > > >> To: "Ian Latter"<ian.latter@xxxxxxxxxxxxxxxx>

> > > > > > >> Subject:  Re: replicate background threads

> > > > > >> Date: Tue, 13 Mar 2012 21:07:53 +0530

> > > > > >>

> > > > > >> On 03/13/2012 07:52 PM, Ian Latter wrote:

> > > > > >>> Hello,

> > > > > >>>

> > > > > >>>

> > > > > >>>     Well we've been privy to our first true

error in

> > > > > >>> Gluster now, and we're not sure of the cause.

> > > > > >>>

> > > > > >>>     The SaturnI machine with 1Gbyte of RAM was

> > > > > >>> exhausting its memory and crashing and we saw

> > > > > >>> core dumps on SaturnM and MMC.  Replacing

> > > > > >>> the SaturnI hardware with identical hardware to

> > > > > >>> SaturnM, but retaining SaturnI's original disks,

> > > > > >>> (so fixing the memory capacity problem) we saw

> > > > > >>> crashes randomly at all nodes.

> > > > > >>>

> > > > > >>>     Looking for irregularities at the file system

> > > > > >>> we noticed that (we'd estimate) about 60% of

> > > > > >>> the files at the OS/EXT3 layer of SaturnI

> > > > > >>> (sourced via replicate from SaturnM) were of

> > > > > >>> size 2147483648 (2^31) where they should

> > > > > >>> have been substantially larger.  While we would

> > > > > >>> happily accept "you shouldn't be using a 32bit

> > > > > >>> gluster package" as the answer, we note two

> > > > > >>> deltas;

> > > > > >>>     1) All files used in testing were copied

on from

> > > > > >>>          32 bit clients to 32 bit servers, with no

> > > > > >>>          observable errors

> > > > > >>>     2) Of the file that were replicated, not all

> were

> > > > > >>>          corrupted (capped at 2G -- note that we

> > > > > >>>          confirmed that this was the first 2G

of the

> > > > > >>>          source file contents).
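
The reported size stands out because 2147483648 is exactly 2^31, one more than the largest value a signed 32-bit integer can hold. That is an observation about the number only, not a diagnosis of where the truncation happens. As a sketch, something like the following could list files on a brick that stopped at exactly that size; find_capped_files is a hypothetical helper, and any path passed to it is up to the reader.

```python
import os

CAP = 2 ** 31                 # 2147483648 bytes, the size reported above
INT32_MAX = 2 ** 31 - 1       # largest value of a signed 32-bit integer


def find_capped_files(root, cap=CAP):
    """Return sorted paths under root whose size is exactly cap bytes."""
    capped = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.path.getsize(path) == cap:
                    capped.append(path)
            except OSError:
                # File vanished or is unreadable; skip it.
                continue
    return sorted(capped)
```

Running it against the backend EXT3 directory rather than the Gluster mount avoids triggering further stats through the replicate translator while the list is being built.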

> > > > > >>>

> > > > > >>>

> > > > > >>> So is there a known replicate issue with files

> > > > > >>> greater than 2GB?  Has anyone done any

> > > > > >>> serious testing with significant numbers of files

> > > > > >>> of this size?  Are there configurations specific

> > > > > >>> to files/structures of these dimensions?

> > > > > >>>

> > > > > >>> We noted that reversing the configuration, such

> > > > > >>> that SaturnI provides the replicate Brick amongst

> > > > > >>> a local distribute and a remote map to SaturnM

> > > > > >>> where SaturnM simply serves a local distribute;

> > > > > >>> that the data served to MMC is accurate (it

> > > > > >>> continues to show 15GB files, even where there

> > > > > >>> is a local 2GB copy).  Further, a client "cp" at

> > > > > >>> MMC, of a file with a 2GB local replicate of a

> > > > > >>> 15GB file, will result in a 15GB file being

> > > > > >>> created and replicated via Gluster (i.e. the

> > > > > >>> correct specification at both server nodes).

> > > > > >>>

> > > > > >>> So my other question is; Is it possible that we've

> > > > > >>> managed to corrupt something in this

> > > > > >>> environment?  I.e. during the initial memory

> > > > > >>> exhaustion events?  And is there a robust way

> > > > > >>> to have the replicate files revalidated by gluster

> > > > > >>> as a stat doesn't seem to be correcting files in

> > > > > >>> this state (i.e. replicate on SaturnM results in

> > > > > >>> daemon crashes, replicate on SaturnI results

> > > > > >>> in files being left in the bad state).

> > > > > >>>

> > > > > >>>

> > > > > >>> Also, I'm not a member of the users list; if these

> > > > > >>> questions are better posed there then let me

> > > > > >>> know and I'll re-post them there.

> > > > > >>>

> > > > > >>>

> > > > > >>>

> > > > > >>> Thanks,

> > > > > >>>

> > > > > >>>

> > > > > >>>

> > > > > >>>

> > > > > >>>

> > > > > >>> ----- Original Message -----

> > > > > >>>> From: "Ian Latter"<ian.latter@xxxxxxxxxxxxxxxx>

> > > > > >>>> To:<gluster-devel@xxxxxxxxxx>

> > > > > > >>>> Subject:  replicate background threads

> > > > > >>>> Date: Sun, 11 Mar 2012 20:17:15 +1000

> > > > > >>>>

> > > > > >>>> Hello,

> > > > > >>>>

> > > > > >>>>

> > > > > >>>>     My mate Michael and I have been steadily

> > > > > >>>> advancing our Gluster testing and today we

finally

> > > > > >>>> reached some heavier conditions.  The outcome

> > > > > >>>> was different from expectations built from

our more

> > > > > >>>> basic testing so I think we have a couple of

> > > > > >>>> questions regarding the AFR/Replicate background

> > > > > >>>> threads that may need a developer's contribution.

> > > > > >>>> Any help appreciated.

> > > > > >>>>

> > > > > >>>>

> > > > > >>>>     The setup is a 3 box environment, but lets

> start

> > > > > >>>> with two;

> > > > > >>>>

> > > > > >>>>       SaturnM (Server)

> > > > > >>>>          - 6core CPU, 16GB RAM, 1Gbps net

> > > > > >>>>          - 3.2.6 Kernel (custom distro)

> > > > > >>>>          - 3.2.5 Gluster (32bit)

> > > > > >>>>          - 3x2TB drives, CFQ, EXT3

> > > > > >>>>          - Bricked up into a single local 6TB

> > > > > >>>>             "distribute" brick

> > > > > >>>>          - "brick" served to the network

> > > > > >>>>

> > > > > >>>>       MMC (Client)

> > > > > >>>>          - 4core CPU, 8GB RAM, 1Gbps net

> > > > > >>>>          - Ubuntu

> > > > > >>>>          - 3.2.5 Gluster (32bit)

> > > > > >>>>          - Don't recall the disk space locally

> > > > > >>>>          - "brick" from SaturnM mounted

> > > > > >>>>

> > > > > >>>>       500 x 15Gbyte files were copied from MMC

> > > > > >>>> to a single sub-directory on the brick served

from

> > > > > >>>> SaturnM, all went well and dandy.  So then we

> > > > > >>>> moved on to a 3 box environment;

> > > > > >>>>

> > > > > >>>>       SaturnI (Server)

> > > > > >>>>          = 1core CPU, 1GB RAM, 1Gbps net

> > > > > >>>>          = 3.2.6 Kernel (custom distro)

> > > > > >>>>          = 3.2.5 Gluster (32bit)

> > > > > >>>>          = 4x2TB drives, CFQ, EXT3

> > > > > >>>>          = Bricked up into a single local 8TB

> > > > > >>>>             "distribute" brick

> > > > > >>>>          = "brick" served to the network

> > > > > >>>>

> > > > > >>>>       SaturnM (Server/Client)

> > > > > >>>>          - 6core CPU, 16GB RAM, 1Gbps net

> > > > > >>>>          - 3.2.6 Kernel (custom distro)

> > > > > >>>>          - 3.2.5 Gluster (32bit)

> > > > > >>>>          - 3x2TB drives, CFQ, EXT3

> > > > > >>>>          - Bricked up into a single local 6TB

> > > > > >>>>             "distribute" brick

> > > > > >>>>          = Replicate brick added to sit over

> > > > > >>>>             the local distribute brick and a

> > > > > >>>>             client "brick" mapped from SaturnI

> > > > > >>>>          - Replicate "brick" served to the

network

> > > > > >>>>

> > > > > >>>>       MMC (Client)

> > > > > >>>>          - 4core CPU, 8GB RAM, 1Gbps net

> > > > > >>>>          - Ubuntu

> > > > > >>>>          - 3.2.5 Gluster (32bit)

> > > > > >>>>          - Don't recall the disk space locally

> > > > > >>>>          - "brick" from SaturnM mounted

> > > > > >>>>          = "brick" from SaturnI mounted

> > > > > >>>>

> > > > > >>>>

> > > > > >>>>     Now, in lesser testing in this scenario

all was

> > > > > >>>> well - any files on SaturnI would appear on

SaturnM

> > > > > >>>> (not a functional part of our test) and the

> > content on

> > > > > >>>> SaturnM would appear on SaturnI (the real

> > > > > >>>> objective).

> > > > > >>>>

> > > > > >>>>     Earlier testing used a handful of smaller

files

> > > (10s

> > > > > >>>> to 100s of Mbytes) and a single 15Gbyte file.

 The

> > > > > >>>> 15Gbyte file would be "stat" via an "ls", which

> would

> > > > > >>>> kick off a background replication (ls

appeared un-

> > > > > >>>> blocked) and the file would be transferred.

Also,

> > > > > >>>> interrupting the transfer (pulling the LAN cable)

> > > > > >>>> would result in a partial 15Gbyte file being

> > corrected

> > > > > >>>> in a subsequent background process on another

> > > > > >>>> stat.

> > > > > >>>>

> > > > > >>>>     *However* .. when confronted with 500 x

15Gbyte

> > > > > >>>> files, in a single directory (but not the root

> > > directory)

> > > > > >>>> things don't quite work out as nicely.

> > > > > >>>>     - First, the "ls" (at MMC against the SaturnM

> > > brick)

> > > > > >>>>       is blocking and hangs the terminal (ctrl-c

> > > doesn't

> > > > > >>>>       kill it).

> > > > > >> <pranithk>  At max 16 files can be self-healed

in the

> > > > > > back-ground in

> > > > > >> parallel. 17th file self-heal will happen in the

> > > > foreground.

> > > > > >>>>     - Then, looking from MMC at the SaturnI file

> > > > > >>>>        system (ls -s) once per second, and then

> > > > > >>>>        comparing the output (diff ls1.txt

ls2.txt |

> > > > > >>>>        grep -v '>') we can see that between 10

> and 17

> > > > > >>>>        files are being updated simultaneously

> by the

> > > > > >>>>        background process

> > > > > >> <pranithk>  This is expected.

> > > > > >>>>     - Further, a request at MMC for a single file

> > that

> > > > > >>>>       was originally in the 500 x 15Gbyte

> sub-dir on

> > > > > >>>>       SaturnM (which should return unblocked with

> > > > > >>>>       correct results) will;

> > > > > >>>>         a) work as expected if there are less

> than 17

> > > > > >>>>             active background file tasks

> > > > > >>>>         b) block/hang if there are more

> > > > > >>>>     - Where-as a stat (ls) outside of the 500

x 15

> > > > > >>>>        sub-directory, such as the root of that

> brick,

> > > > > >>>>        would always work as expected (return

> > > > > >>>>        immediately, unblocked).

> > > > > >> <pranithk>  stat on the directory will only

> create the

> > > > > > files with '0'

> > > > > >> file size. Then when you ls/stat the actual

file the

> > > > > > self-heal for the

> > > > > >> file gets triggered.
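
The once-per-second "ls -s" comparison described in the list above can be approximated in a script. This is a sketch only: snapshot_sizes, changed_files and watch are hypothetical helper names, it compares byte sizes from stat rather than the block counts that "ls -s" prints, and the directory to watch is left to the reader.

```python
import os
import time


def snapshot_sizes(directory):
    """Map each regular file name in directory to its current size."""
    sizes = {}
    for entry in os.scandir(directory):
        if entry.is_file(follow_symlinks=False):
            sizes[entry.name] = entry.stat(follow_symlinks=False).st_size
    return sizes


def changed_files(before, after):
    """Return names present in both snapshots whose size differs."""
    return sorted(
        name for name, size in after.items()
        if name in before and before[name] != size
    )


def watch(directory, interval=1.0):
    """Take two snapshots interval seconds apart and list changed files."""
    first = snapshot_sizes(directory)
    time.sleep(interval)
    second = snapshot_sizes(directory)
    return changed_files(first, second)
```

The length of the list returned by watch() is then the number of files that changed size during the interval, which is the figure the diff and grep pipeline was being used to estimate.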

> > > > > >>>>

> > > > > >>>>     Thus, to us, it appears as though there is a

> > > > > >>>> queue feeding a set of (around) 16 worker threads

> > > > > >>>> in AFR.  If your request was to the loaded

> directory

> > > > > >>>> then you would be blocked until a worker was

> > > > > >>>> available, and if your request was to any other

> > > > > >>>> location, it would return unblocked regardless of

> > > > > >>>> the worker pool state.

> > > > > >>>>

> > > > > >>>>     The only thread metric that we could find to

> > tweak

> > > > > >>>> was performance/io-threads (which was set to

> > > > > >>>> 16 per physical disk; well per locks + posix

brick

> > > > > >>>> stacks) but increasing this to 64 per stack

didn't

> > > > > >>>> change the outcome (10 to 17 active background

> > > > > >>>> transfers).

> > > > > >> <pranithk>  the option to increase the max num of

> > > > > > background self-heals

> > > > > >> is cluster.background-self-heal-count. Default

> value of

> > > > > > which is 16. I

> > > > > >> assume you know what you are doing to the

> > performance of

> > > > > > the system by

> > > > > >> increasing this number.

> > > > > >>>>

> > > > > >>>>     So, given the above, is our analysis

sound, and

> > > > > >>>> if so, is there a way to increase the size of the

> > > > > >>>> pool of active worker threads?  The objective

> > > > > >>>> being to allow unblocked access to an existing

> > > > > >>>> repository of files (on SaturnM) while a

> > > > > >>>> secondary/back-up is being filled, via GlusterFS?

> > > > > >>>>

> > > > > >>>>     Note that I understand that performance

> > > > > >>>> (through-put) will be an issue in the described

> > > > > >>>> environment: this replication process is

> > > > > >>>> estimated to run for between 10 and 40 hours,

> > > > > >>>> which is acceptable so long as it isn't blocking

> > > > > >>>> (there's a production-capable file set in place).

> > > > > >>>>

> > > > > >>>>

> > > > > >>>>

> > > > > >>>>

> > > > > >>>>

> > > > > >>>> Any help appreciated.

> > > > > >>>>

> > > > > >> Please let us know how it goes.

> > > > > >>>> Thanks,

> > > > > >>>>

> > > > > >>>>

> > > > > >>>>

> > > > > >>>>

> > > > > >>>>

> > > > > >>>>

> > > > > >>>> --

> > > > > >>>> Ian Latter

> > > > > >>>> Late night coder ..

> > > > > >>>> http://midnightcode.org/

> > > > > >>>>

> > > > > >>>> _______________________________________________

> > > > > >>>> Gluster-devel mailing list

> > > > > >>>> Gluster-devel@xxxxxxxxxx

> > > > > >>>>

> > https://lists.nongnu.org/mailman/listinfo/gluster-devel

> > > > > >>>>

> > > > > >>> --

> > > > > >>> Ian Latter

> > > > > >>> Late night coder ..

> > > > > >>> http://midnightcode.org/

> > > > > >>>

> > > > > >>> _______________________________________________

> > > > > >>> Gluster-devel mailing list

> > > > > >>> Gluster-devel@xxxxxxxxxx

> > > > > >>>

> > https://lists.nongnu.org/mailman/listinfo/gluster-devel

> > > > > >> hi Ian,

> > > > > >>        inline replies with<pranithk>.

> > > > > >>

> > > > > >> Pranith.

> > > > > >>

> > > > > >

> > > > > > --

> > > > > > Ian Latter

> > > > > > Late night coder ..

> > > > > > http://midnightcode.org/

> > > > > hi Ian,

> > > > >       Maintaining a queue of files that need to be

> > > > self-healed does not

> > > > > scale in practice, in cases where there are

millions of

> > > > files that need

> > > > > self-heal. So such a thing is not implemented. The

> idea is

> > > > to make

> > > > > self-heal foreground after a certain-limit

> > > > (background-self-heal-count)

> > > > > so there is no necessity for such a queue.

> > > > >

> > > > > Pranith.

> > > > >

> > > >

> > > >

> > > > --

> > > > Ian Latter

> > > > Late night coder ..

> > > > http://midnightcode.org/

> > > >

> > > > _______________________________________________

> > > > Gluster-devel mailing list

> > > > Gluster-devel@xxxxxxxxxx

> > > > https://lists.nongnu.org/mailman/listinfo/gluster-devel

> > > >

> > >

> > >

> > > --

> > > Ian Latter

> > > Late night coder ..

> > > http://midnightcode.org/

> > >

> > > _______________________________________________

> > > Gluster-devel mailing list

> > > Gluster-devel@xxxxxxxxxx

> > > https://lists.nongnu.org/mailman/listinfo/gluster-devel

> > >

> >

> >

> > --

> > Ian Latter

> > Late night coder ..

> > http://midnightcode.org/

> >

> > _______________________________________________

> > Gluster-devel mailing list

> > Gluster-devel@xxxxxxxxxx

> > https://lists.nongnu.org/mailman/listinfo/gluster-devel

> >

>

>

> --

> Ian Latter

> Late night coder ..

> http://midnightcode.org/

> _______________________________________________

> Gluster-devel mailing list

> Gluster-devel@xxxxxxxxxx

> https://lists.nongnu.org/mailman/listinfo/gluster-devel

>

--

Ian Latter

Late night coder ..

http://midnightcode.org/

_______________________________________________

Gluster-devel mailing list

Gluster-devel@xxxxxxxxxx

https://lists.nongnu.org/mailman/listinfo/gluster-devel