Not only the skipped column but all columns are 0 in the rebalance status command. It seems rebalance does not do anything; all the '---------T' files are still there. Anyway, we wrote our own custom mapreduce tool and it is copying files to gluster right now, utilizing all 60 nodes as expected. I will delete the distcp folder and continue, unless you need any further log/debug files to examine the issue.

Thanks for the help,
Serkan

On Fri, Apr 22, 2016 at 9:15 AM, Xavier Hernandez <xhernandez@xxxxxxxxxx> wrote:
> When you execute a rebalance 'force', the skipped column should be 0 for all nodes and all '---------T' files must have disappeared. Otherwise something failed. Is this true in your case?
>
> On 21/04/16 15:19, Serkan Çoban wrote:
>> Same result. I also checked the rebalance.log file; it has no reference to the part files either...
>>
>> On Thu, Apr 21, 2016 at 3:34 PM, Xavier Hernandez <xhernandez@xxxxxxxxxx> wrote:
>>> Can you try a 'gluster volume rebalance v0 start force'?
>>>
>>> On 21/04/16 14:23, Serkan Çoban wrote:
>>>>> Has the rebalance operation finished successfully? Has it skipped any files?
>>>>
>>>> Yes, according to 'gluster v rebalance status' it completed without any errors. The rebalance status report looks like this:
>>>>
>>>> Node         Rebalanced files   size     Scanned   failures   skipped
>>>> 1.1.1.185    158                29GB     1720      0          314
>>>> 1.1.1.205    93                 46.5GB   761       0          95
>>>> 1.1.1.225    74                 37GB     779       0          94
>>>>
>>>> All other hosts have 0 values.
>>>>
>>>> I double-checked that the files with '---------T' attributes are there; maybe some of them were deleted, but I still see them in the bricks... I am also concerned why the part files were not distributed to all 60 nodes. Rebalance should do that?
>>>>
>>>> On Thu, Apr 21, 2016 at 1:55 PM, Xavier Hernandez <xhernandez@xxxxxxxxxx> wrote:
>>>>> Hi Serkan,
>>>>>
>>>>> On 21/04/16 12:39, Serkan Çoban wrote:
>>>>>> I started a 'gluster v rebalance v0 start' command hoping that it would redistribute the files equally across the 60 nodes, but it did not do that... Why did it not redistribute the files? Any thoughts?
>>>>>
>>>>> Has the rebalance operation finished successfully? Has it skipped any files?
>>>>>
>>>>> After a successful rebalance all files with attributes '---------T' should have disappeared.
>>>>>
>>>>>> On Thu, Apr 21, 2016 at 11:24 AM, Xavier Hernandez <xhernandez@xxxxxxxxxx> wrote:
>>>>>>> Hi Serkan,
>>>>>>>
>>>>>>> On 21/04/16 10:07, Serkan Çoban wrote:
>>>>>>>>> I think the problem is in the temporary name that distcp gives to the file while it's being copied before renaming it to the real name. Do you know what is the structure of this name?
>>>>>>>>
>>>>>>>> The distcp temporary file name format is ".distcp.tmp.attempt_1460381790773_0248_m_000001_0", and the same temporary file name is reused by one map process. For example, I see in the logs that one map copies the files part-m-00031, part-m-00047 and part-m-00063 sequentially, and they all use the same temporary file name above. So no part of the original file name appears in the temporary file name.
>>>>>>>
>>>>>>> This explains the problem. With the default options, DHT sends all files to the subvolume that should store a file named 'distcp.tmp'. With this temporary name format, little can be done.
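A quick way to see why every distcp temporary name hashes to the same place is to apply DHT's rsync hash regex (quoted further down in this thread, ^\.(.+)\.[^.]+$) by hand. The sed call below only illustrates which part of the name DHT would hash; it is not something GlusterFS itself runs:

    # Extract the part of the name DHT would hash when the rsync regex matches.
    # Names that do not match the regex pass through unchanged.
    dht_name() { echo "$1" | sed -E 's/^\.(.+)\.[^.]+$/\1/'; }

    dht_name '.test.712hd'                                        # -> test (rsync case: same as the final name)
    dht_name '.distcp.tmp.attempt_1460381790773_0248_m_000001_0'  # -> distcp.tmp (every distcp temp file)
    dht_name 'part-m-00031'                                       # -> part-m-00031 (final names spread normally)

Every distcp temporary name collapses to 'distcp.tmp', so they all land on one ec set; the final part-m-* names would have spread fine.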
>>>>>>>> I will check if we can modify the distcp behaviour, or we will have to write our own mapreduce procedures instead of using distcp.
>>>>>>>>
>>>>>>>>> 2. define the option 'extra-hash-regex' to an expression that matches your temporary file names and returns the same name that the file will finally have. Depending on the differences between the original and temporary file names, this option could be useless.
>>>>>>>>> 3. set the option 'rsync-hash-regex' to 'none'. This will prevent the name conversion, so the files will be evenly distributed. However this will cause a lot of files to be placed in incorrect subvolumes, creating a lot of link files until a rebalance is executed.
>>>>>>>>
>>>>>>>> How can I set these options?
>>>>>>>
>>>>>>> You can set gluster options using:
>>>>>>>
>>>>>>> gluster volume set <volname> <option> <value>
>>>>>>>
>>>>>>> for example:
>>>>>>>
>>>>>>> gluster volume set v0 rsync-hash-regex none
>>>>>>>
>>>>>>> Xavi
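If option 3 is the route taken, the end-to-end sequence would look roughly like the sketch below. The fully scoped option name cluster.rsync-hash-regex is an assumption on my part (the short form Xavi shows may also be accepted), and the rebalance commands are the same ones discussed at the top of this thread:

    # Sketch, not verified on this cluster: disable the rsync rename heuristic,
    # run the copy, then clean up the '---------T' link files with a rebalance.
    gluster volume set v0 cluster.rsync-hash-regex none

    # ... run the distcp / mapreduce copy here ...

    gluster volume rebalance v0 start force
    gluster volume rebalance v0 status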
>>>>>>>> On Thu, Apr 21, 2016 at 10:00 AM, Xavier Hernandez <xhernandez@xxxxxxxxxx> wrote:
>>>>>>>>> Hi Serkan,
>>>>>>>>>
>>>>>>>>> I think the problem is in the temporary name that distcp gives to the file while it's being copied, before renaming it to the real name. Do you know what is the structure of this name?
>>>>>>>>>
>>>>>>>>> DHT selects the subvolume (in this case the ec set) on which the file will be stored based on the name of the file. This is a problem when a file is being renamed, because the rename could change the subvolume where the file should be found.
>>>>>>>>>
>>>>>>>>> DHT has a feature to avoid incorrect file placements when executing renames for the rsync case. What it does is check whether the file name matches the following regular expression:
>>>>>>>>>
>>>>>>>>> ^\.(.+)\.[^.]+$
>>>>>>>>>
>>>>>>>>> If a match is found, it only considers the part between parentheses to calculate the destination subvolume.
>>>>>>>>>
>>>>>>>>> This is useful for rsync because temporary file names are constructed in the following way: suppose the original filename is 'test'. The temporary filename while rsync is being executed is made by prepending a dot and appending '.<random chars>': .test.712hd
>>>>>>>>>
>>>>>>>>> As you can see, the original name and the part of the name between parentheses that matches the regular expression are the same. As a result, after renaming the temporary file to its original filename, both names are considered by DHT to belong to the same subvolume.
>>>>>>>>>
>>>>>>>>> In your case it's very probable that distcp uses a temporary name like '.part.<number>'. In this case the portion of the name used to select the subvolume is always 'part'. This would explain why all files go to the same subvolume. Once the file is renamed to another name, DHT realizes that it should go to another subvolume. At this point it creates a link file (those files with access rights = '---------T') in the correct subvolume, but it doesn't move the data. As you can see, this kind of file is better balanced.
>>>>>>>>>
>>>>>>>>> To solve this problem you have three options:
>>>>>>>>>
>>>>>>>>> 1. change the temporary filename used by distcp to correctly match the regular expression. I'm not sure if this can be configured, but if it is possible, this is the best option.
>>>>>>>>>
>>>>>>>>> 2. define the option 'extra-hash-regex' to an expression that matches your temporary file names and returns the same name that the file will finally have. Depending on the differences between the original and temporary file names, this option could be useless.
>>>>>>>>>
>>>>>>>>> 3. set the option 'rsync-hash-regex' to 'none'. This will prevent the name conversion, so the files will be evenly distributed. However this will cause a lot of files to be placed in incorrect subvolumes, creating a lot of link files until a rebalance is executed.
>>>>>>>>>
>>>>>>>>> Xavi
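To check whether the '---------T' entries on a brick really are DHT link files, something like the following can be run directly on a storage node. This is a sketch based on general DHT knowledge rather than on anything in this thread: the sticky-bit-only permission (mode 1000) and the trusted.glusterfs.dht.linkto xattr are what DHT link files normally carry, but verify the xattr name on your version before relying on it; the paths are examples from the test setup described below.

    # List the sticky-bit-only entries (the '---------T' files) under one brick
    find /bricks/02/s1 -type f -perm 1000

    # For any file found above, show which subvolume its link points to
    # (xattr name assumed; example path taken from the listing described below)
    getfattr -n trusted.glusterfs.dht.linkto -e text /bricks/02/s1/teragen-10tb/part-m-00063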
>>>>>>>>> On 20/04/16 14:13, Serkan Çoban wrote:
>>>>>>>>>> Here are the steps that I do in detail and the relevant output from the bricks:
>>>>>>>>>>
>>>>>>>>>> I am using the command below for volume creation:
>>>>>>>>>>
>>>>>>>>>> gluster volume create v0 disperse 20 redundancy 4 \
>>>>>>>>>> 1.1.1.{185..204}:/bricks/02 1.1.1.{205..224}:/bricks/02 1.1.1.{225..244}:/bricks/02 \
>>>>>>>>>> 1.1.1.{185..204}:/bricks/03 1.1.1.{205..224}:/bricks/03 1.1.1.{225..244}:/bricks/03 \
>>>>>>>>>> 1.1.1.{185..204}:/bricks/04 1.1.1.{205..224}:/bricks/04 1.1.1.{225..244}:/bricks/04 \
>>>>>>>>>> 1.1.1.{185..204}:/bricks/05 1.1.1.{205..224}:/bricks/05 1.1.1.{225..244}:/bricks/05 \
>>>>>>>>>> 1.1.1.{185..204}:/bricks/06 1.1.1.{205..224}:/bricks/06 1.1.1.{225..244}:/bricks/06 \
>>>>>>>>>> 1.1.1.{185..204}:/bricks/07 1.1.1.{205..224}:/bricks/07 1.1.1.{225..244}:/bricks/07 \
>>>>>>>>>> 1.1.1.{185..204}:/bricks/08 1.1.1.{205..224}:/bricks/08 1.1.1.{225..244}:/bricks/08 \
>>>>>>>>>> 1.1.1.{185..204}:/bricks/09 1.1.1.{205..224}:/bricks/09 1.1.1.{225..244}:/bricks/09 \
>>>>>>>>>> 1.1.1.{185..204}:/bricks/10 1.1.1.{205..224}:/bricks/10 1.1.1.{225..244}:/bricks/10 \
>>>>>>>>>> 1.1.1.{185..204}:/bricks/11 1.1.1.{205..224}:/bricks/11 1.1.1.{225..244}:/bricks/11 \
>>>>>>>>>> 1.1.1.{185..204}:/bricks/12 1.1.1.{205..224}:/bricks/12 1.1.1.{225..244}:/bricks/12 \
>>>>>>>>>> 1.1.1.{185..204}:/bricks/13 1.1.1.{205..224}:/bricks/13 1.1.1.{225..244}:/bricks/13 \
>>>>>>>>>> 1.1.1.{185..204}:/bricks/14 1.1.1.{205..224}:/bricks/14 1.1.1.{225..244}:/bricks/14 \
>>>>>>>>>> 1.1.1.{185..204}:/bricks/15 1.1.1.{205..224}:/bricks/15 1.1.1.{225..244}:/bricks/15 \
>>>>>>>>>> 1.1.1.{185..204}:/bricks/16 1.1.1.{205..224}:/bricks/16 1.1.1.{225..244}:/bricks/16 \
>>>>>>>>>> 1.1.1.{185..204}:/bricks/17 1.1.1.{205..224}:/bricks/17 1.1.1.{225..244}:/bricks/17 \
>>>>>>>>>> 1.1.1.{185..204}:/bricks/18 1.1.1.{205..224}:/bricks/18 1.1.1.{225..244}:/bricks/18 \
>>>>>>>>>> 1.1.1.{185..204}:/bricks/19 1.1.1.{205..224}:/bricks/19 1.1.1.{225..244}:/bricks/19 \
>>>>>>>>>> 1.1.1.{185..204}:/bricks/20 1.1.1.{205..224}:/bricks/20 1.1.1.{225..244}:/bricks/20 \
>>>>>>>>>> 1.1.1.{185..204}:/bricks/21 1.1.1.{205..224}:/bricks/21 1.1.1.{225..244}:/bricks/21 \
>>>>>>>>>> 1.1.1.{185..204}:/bricks/22 1.1.1.{205..224}:/bricks/22 1.1.1.{225..244}:/bricks/22 \
>>>>>>>>>> 1.1.1.{185..204}:/bricks/23 1.1.1.{205..224}:/bricks/23 1.1.1.{225..244}:/bricks/23 \
>>>>>>>>>> 1.1.1.{185..204}:/bricks/24 1.1.1.{205..224}:/bricks/24 1.1.1.{225..244}:/bricks/24 \
>>>>>>>>>> 1.1.1.{185..204}:/bricks/25 1.1.1.{205..224}:/bricks/25 1.1.1.{225..244}:/bricks/25 \
>>>>>>>>>> 1.1.1.{185..204}:/bricks/26 1.1.1.{205..224}:/bricks/26 1.1.1.{225..244}:/bricks/26 \
>>>>>>>>>> 1.1.1.{185..204}:/bricks/27 1.1.1.{205..224}:/bricks/27 1.1.1.{225..244}:/bricks/27 force
>>>>>>>>>>
>>>>>>>>>> Then I mount the volume on 50 clients:
>>>>>>>>>> mount -t glusterfs 1.1.1.185:/v0 /mnt/gluster
>>>>>>>>>>
>>>>>>>>>> Then I make a directory from one of the clients and chmod it:
>>>>>>>>>> mkdir /mnt/gluster/s1 && chmod 777 /mnt/gluster/s1
>>>>>>>>>>
>>>>>>>>>> Then I start distcp on the clients. There are 1059 x 8.8GB files in one folder and they are copied to /mnt/gluster/s1 with 100 parallel maps, which means 2 copy jobs per client at the same time:
>>>>>>>>>> hadoop distcp -m 100 http://nn1:8020/path/to/teragen-10tb file:///mnt/gluster/s1
>>>>>>>>>>
>>>>>>>>>> After the job finished, here is the status of the s1 directory on the bricks:
>>>>>>>>>> The s1 directory is present in all 1560 bricks.
>>>>>>>>>> The s1/teragen-10tb folder is present in all 1560 bricks.
>>>>>>>>>>
>>>>>>>>>> Full listing of the files in the bricks:
>>>>>>>>>> https://www.dropbox.com/s/rbgdxmrtwz8oya8/teragen_list.zip?dl=0
>>>>>>>>>>
>>>>>>>>>> You can ignore the .crc files in the brick output above, they are checksum files...
>>>>>>>>>>
>>>>>>>>>> As you can see, the part-m-xxxx files were written to only some bricks on nodes 0205..0224. All bricks have some files, but they have zero size.
>>>>>>>>>>
>>>>>>>>>> I increased the file descriptor limit to 65k, so that is not the issue...
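A rough way to see this distribution from the brick side, given the /bricks/NN layout and the s1/teragen-10tb path above, is to count the non-empty part files per server. This assumes passwordless ssh to the storage nodes and is only a sketch:

    # Count part-m-* files larger than zero bytes on each storage node
    for host in 1.1.1.{185..244}; do
        n=$(ssh "$host" 'find /bricks/*/s1/teragen-10tb -name "part-m-*" -size +0c 2>/dev/null | wc -l')
        echo "$host $n"
    done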
>>>>>>>>>> On Wed, Apr 20, 2016 at 9:34 AM, Xavier Hernandez <xhernandez@xxxxxxxxxx> wrote:
>>>>>>>>>>> Hi Serkan,
>>>>>>>>>>>
>>>>>>>>>>> On 19/04/16 15:16, Serkan Çoban wrote:
>>>>>>>>>>>>>>> I assume that gluster is used to store the intermediate files before the reduce phase
>>>>>>>>>>>>
>>>>>>>>>>>> Nope, gluster is the destination of the distcp command: hadoop distcp -m 50 http://nn1:8020/path/to/folder file:///mnt/gluster
>>>>>>>>>>>> This runs maps on the datanodes, all of which have /mnt/gluster mounted.
>>>>>>>>>>>
>>>>>>>>>>> I don't know hadoop, so I'm of little help here. However it seems that -m 50 means to execute 50 copies in parallel. This means that even if the distribution worked fine, at most 50 (probably fewer) of the 78 ec sets would be used in parallel.
>>>>>>>>>>>
>>>>>>>>>>>>>>> This means that this is caused by some peculiarity of the mapreduce.
>>>>>>>>>>>>
>>>>>>>>>>>> Yes, but how can a client write 500 files to the gluster mount and have those files written to only a subset of the subvolumes? I cannot use gluster as a backup cluster if I cannot write with distcp.
>>>>>>>>>>>
>>>>>>>>>>> All 500 files were created on only one of the 78 ec sets and the remaining 77 stayed empty?
>>>>>>>>>>>
>>>>>>>>>>>>>>> You should look which files are created in each brick and how many while the process is running.
>>>>>>>>>>>>
>>>>>>>>>>>> Files are only created on nodes 185..204 or 205..224 or 225..244; only on 20 nodes in each test.
>>>>>>>>>>>
>>>>>>>>>>> How many files were there in each brick?
>>>>>>>>>>>
>>>>>>>>>>> Not sure if this can be related, but standard linux distributions have a default limit of 1024 open file descriptors. With such a big volume and a massive copy, maybe this limit is affecting something?
>>>>>>>>>>>
>>>>>>>>>>> Are there any error or warning messages in the mount or brick logs?
>>>>>>>>>>>
>>>>>>>>>>> Xavi
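For anyone reading along, the descriptor limit and the logs Xavi mentions can be checked roughly like this on a client. The log file name is derived from the mount point on typical installs (/mnt/gluster becomes mnt-gluster.log), but treat the paths, the pgrep pattern and the severity grep as assumptions for your setup:

    # Shell limit for new processes
    ulimit -n

    # Effective limit of the running fuse client for /mnt/gluster
    pid=$(pgrep -f 'glusterfs.*/mnt/gluster' | head -1)
    grep 'open files' /proc/$pid/limits

    # Recent errors/warnings in the client log (path/format may differ)
    grep -E ' [EW] ' /var/log/glusterfs/mnt-gluster.log | tail -20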
>>>>>>>>>>>> On Tue, Apr 19, 2016 at 1:05 PM, Xavier Hernandez <xhernandez@xxxxxxxxxx> wrote:
>>>>>>>>>>>>> Hi Serkan,
>>>>>>>>>>>>>
>>>>>>>>>>>>> moved to gluster-users since this doesn't belong to the devel list.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 19/04/16 11:24, Serkan Çoban wrote:
>>>>>>>>>>>>>> I am copying 10.000 files to a gluster volume using mapreduce on the clients. Each map process takes one file at a time and copies it to the gluster volume.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I assume that gluster is used to store the intermediate files before the reduce phase.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> My disperse volume consists of 78 subvolumes of 16+4 disks each. So if I copy more than 78 files in parallel, I expect each file to go to a different subvolume, right?
>>>>>>>>>>>>>
>>>>>>>>>>>>> If you only copy 78 files, most probably you will get some subvolumes empty and some others with more than one or two files. It's not an exact distribution, it's a statistically balanced distribution: over time and with enough files, each brick will contain an amount of files in the same order of magnitude, but they won't have the *same* number of files.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> In my tests with fio I can see every file go to a different subvolume, but when I start the mapreduce process from the clients only 78/3 = 26 subvolumes are used for writing files.
>>>>>>>>>>>>>
>>>>>>>>>>>>> This means that this is caused by some peculiarity of the mapreduce.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I see that clearly from the network traffic. Mapreduce on the client side can be run multi-threaded. I tested with 1, 5 and 10 threads on each client, but every time only 26 subvolumes were used. How can I debug the issue further?
>>>>>>>>>>>>>
>>>>>>>>>>>>> You should look at which files are created in each brick, and how many, while the process is running.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Xavi
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Apr 19, 2016 at 11:22 AM, Xavier Hernandez <xhernandez@xxxxxxxxxx> wrote:
>>>>>>>>>>>>>>> Hi Serkan,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 19/04/16 09:18, Serkan Çoban wrote:
>>>>>>>>>>>>>>>> Hi, I just reinstalled a fresh 3.7.11 and I am seeing the same behavior: 50 clients copying part-0-xxxx named files using mapreduce to gluster, using one thread per server, and they are using only 20 servers out of 60. On the other hand the fio tests use all the servers. Is there anything I can do to solve the issue?
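Xavi's point above that DHT's placement is statistical rather than exact can be illustrated with a quick simulation. cksum below is only a stand-in for a hash (DHT uses its own hash, and with the rsync regex involved the real behaviour is as described earlier in the thread); the point is just that ~1000 distinct names spread over 78 buckets gives roughly even, but not identical, counts:

    # Hash 1000 distinct part-style names into 78 buckets and count per bucket;
    # the first and last output lines show the least and most loaded buckets.
    for i in $(seq -w 0 999); do
        echo $(( $(printf 'part-0-%s' "$i" | cksum | cut -d' ' -f1) % 78 ))
    done | sort -n | uniq -c | sort -n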
On the other hand fio tests use all the servers. >>>>>>>>>>>>>>>> Anything >>>>>>>>>>>>>>>> I >>>>>>>>>>>>>>>> can >>>>>>>>>>>>>>>> do >>>>>>>>>>>>>>>> to solve the issue? >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Distribution of files to ec sets is done by dht. In theory if >>>>>>>>>>>>>>> you >>>>>>>>>>>>>>> create >>>>>>>>>>>>>>> many files each ec set will receive the same amount of files. >>>>>>>>>>>>>>> However >>>>>>>>>>>>>>> when >>>>>>>>>>>>>>> the number of files is small enough, statistics can fail. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Not sure what you are doing exactly, but a mapreduce >>>>>>>>>>>>>>> procedure >>>>>>>>>>>>>>> generally >>>>>>>>>>>>>>> only creates a single output. In that case it makes sense >>>>>>>>>>>>>>> that >>>>>>>>>>>>>>> only >>>>>>>>>>>>>>> one >>>>>>>>>>>>>>> ec >>>>>>>>>>>>>>> set is used. If you want to use all ec sets for a single >>>>>>>>>>>>>>> file, >>>>>>>>>>>>>>> you >>>>>>>>>>>>>>> should >>>>>>>>>>>>>>> enable sharding (I haven't tested that) or split the result >>>>>>>>>>>>>>> in >>>>>>>>>>>>>>> multiple >>>>>>>>>>>>>>> files. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Xavi >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>>> Serkan >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> ---------- Forwarded message ---------- >>>>>>>>>>>>>>>> From: Serkan Çoban <cobanserkan@xxxxxxxxx> >>>>>>>>>>>>>>>> Date: Mon, Apr 18, 2016 at 2:39 PM >>>>>>>>>>>>>>>> Subject: disperse volume file to subvolume mapping >>>>>>>>>>>>>>>> To: Gluster Users <gluster-users@xxxxxxxxxxx> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Hi, I have a problem where clients are using only 1/3 of >>>>>>>>>>>>>>>> nodes >>>>>>>>>>>>>>>> in >>>>>>>>>>>>>>>> disperse volume for writing. >>>>>>>>>>>>>>>> I am testing from 50 clients using 1 to 10 threads with file >>>>>>>>>>>>>>>> names >>>>>>>>>>>>>>>> part-0-xxxx. >>>>>>>>>>>>>>>> What I see is clients only use 20 nodes for writing. How is >>>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>> file >>>>>>>>>>>>>>>> name to sub volume hashing is done? Is this related to file >>>>>>>>>>>>>>>> names >>>>>>>>>>>>>>>> are >>>>>>>>>>>>>>>> similar? >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> My cluster is 3.7.10 with 60 nodes each has 26 disks. >>>>>>>>>>>>>>>> Disperse >>>>>>>>>>>>>>>> volume >>>>>>>>>>>>>>>> is 78 x (16+4). Only 26 out of 78 sub volumes used during >>>>>>>>>>>>>>>> writes.. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>> >>>>>>> >>>>> >>> > _______________________________________________ Gluster-users mailing list Gluster-users@xxxxxxxxxxx http://www.gluster.org/mailman/listinfo/gluster-users