Here are the steps I do in detail, with the relevant output from the bricks.

I am using the command below for volume creation:

gluster volume create v0 disperse 20 redundancy 4 \
1.1.1.{185..204}:/bricks/02 \
1.1.1.{205..224}:/bricks/02 \
1.1.1.{225..244}:/bricks/02 \
1.1.1.{185..204}:/bricks/03 \
1.1.1.{205..224}:/bricks/03 \
1.1.1.{225..244}:/bricks/03 \
1.1.1.{185..204}:/bricks/04 \
1.1.1.{205..224}:/bricks/04 \
1.1.1.{225..244}:/bricks/04 \
1.1.1.{185..204}:/bricks/05 \
1.1.1.{205..224}:/bricks/05 \
1.1.1.{225..244}:/bricks/05 \
1.1.1.{185..204}:/bricks/06 \
1.1.1.{205..224}:/bricks/06 \
1.1.1.{225..244}:/bricks/06 \
1.1.1.{185..204}:/bricks/07 \
1.1.1.{205..224}:/bricks/07 \
1.1.1.{225..244}:/bricks/07 \
1.1.1.{185..204}:/bricks/08 \
1.1.1.{205..224}:/bricks/08 \
1.1.1.{225..244}:/bricks/08 \
1.1.1.{185..204}:/bricks/09 \
1.1.1.{205..224}:/bricks/09 \
1.1.1.{225..244}:/bricks/09 \
1.1.1.{185..204}:/bricks/10 \
1.1.1.{205..224}:/bricks/10 \
1.1.1.{225..244}:/bricks/10 \
1.1.1.{185..204}:/bricks/11 \
1.1.1.{205..224}:/bricks/11 \
1.1.1.{225..244}:/bricks/11 \
1.1.1.{185..204}:/bricks/12 \
1.1.1.{205..224}:/bricks/12 \
1.1.1.{225..244}:/bricks/12 \
1.1.1.{185..204}:/bricks/13 \
1.1.1.{205..224}:/bricks/13 \
1.1.1.{225..244}:/bricks/13 \
1.1.1.{185..204}:/bricks/14 \
1.1.1.{205..224}:/bricks/14 \
1.1.1.{225..244}:/bricks/14 \
1.1.1.{185..204}:/bricks/15 \
1.1.1.{205..224}:/bricks/15 \
1.1.1.{225..244}:/bricks/15 \
1.1.1.{185..204}:/bricks/16 \
1.1.1.{205..224}:/bricks/16 \
1.1.1.{225..244}:/bricks/16 \
1.1.1.{185..204}:/bricks/17 \
1.1.1.{205..224}:/bricks/17 \
1.1.1.{225..244}:/bricks/17 \
1.1.1.{185..204}:/bricks/18 \
1.1.1.{205..224}:/bricks/18 \
1.1.1.{225..244}:/bricks/18 \
1.1.1.{185..204}:/bricks/19 \
1.1.1.{205..224}:/bricks/19 \
1.1.1.{225..244}:/bricks/19 \
1.1.1.{185..204}:/bricks/20 \
1.1.1.{205..224}:/bricks/20 \
1.1.1.{225..244}:/bricks/20 \
1.1.1.{185..204}:/bricks/21 \
1.1.1.{205..224}:/bricks/21 \
1.1.1.{225..244}:/bricks/21 \
1.1.1.{185..204}:/bricks/22 \
1.1.1.{205..224}:/bricks/22 \
1.1.1.{225..244}:/bricks/22 \
1.1.1.{185..204}:/bricks/23 \
1.1.1.{205..224}:/bricks/23 \
1.1.1.{225..244}:/bricks/23 \
1.1.1.{185..204}:/bricks/24 \
1.1.1.{205..224}:/bricks/24 \
1.1.1.{225..244}:/bricks/24 \
1.1.1.{185..204}:/bricks/25 \
1.1.1.{205..224}:/bricks/25 \
1.1.1.{225..244}:/bricks/25 \
1.1.1.{185..204}:/bricks/26 \
1.1.1.{205..224}:/bricks/26 \
1.1.1.{225..244}:/bricks/26 \
1.1.1.{185..204}:/bricks/27 \
1.1.1.{205..224}:/bricks/27 \
1.1.1.{225..244}:/bricks/27 force

Then I mount the volume on 50 clients:

mount -t glusterfs 1.1.1.185:/v0 /mnt/gluster

Then I make a directory from one of the clients and chmod it:

mkdir /mnt/gluster/s1 && chmod 777 /mnt/gluster/s1

Then I start distcp on the clients. There are 1059 x 8.8GB files in one folder; they are copied to /mnt/gluster/s1 with 100 parallel mappers, which means 2 copy jobs per client at the same time:

hadoop distcp -m 100 http://nn1:8020/path/to/teragen-10tb file:///mnt/gluster/s1

After the job finished, here is the status of the s1 directory as seen from the bricks:

- The s1 directory is present on all 1560 bricks.
- The s1/teragen-10tb folder is present on all 1560 bricks.
- Full listing of the files on the bricks: https://www.dropbox.com/s/rbgdxmrtwz8oya8/teragen_list.zip?dl=0

You can ignore the .crc files in the brick output above; they are checksum files. As you can see, the part-m-xxxx files were written to only some of the bricks on nodes 0205..0224. All bricks have some files, but they have zero size. I increased the file descriptor limit to 65k, so that is not the issue.
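To make it easier to see which ec sets actually received data, the quick check below can be run on every node. It is only a sketch: it assumes the /bricks/02../bricks/27 paths from the create command above and only counts non-empty part-m-* files, since the zero-size copies exist on every brick.

# Sketch: count non-empty part-m-* files per brick on this node.
for b in /bricks/{02..27}; do
  count=$(find "$b/s1/teragen-10tb" -maxdepth 1 -type f -name 'part-m-*' -size +0 2>/dev/null | wc -l)
  echo "$(hostname) $b $count"
done

If DHT were spreading the files over all 78 ec sets, every /bricks/NN on every node should report a non-zero count; in my runs only some bricks on nodes 0205..0224 do.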
On Wed, Apr 20, 2016 at 9:34 AM, Xavier Hernandez <xhernandez@xxxxxxxxxx> wrote:
> Hi Serkan,
>
> On 19/04/16 15:16, Serkan Çoban wrote:
>>>>> I assume that gluster is used to store the intermediate files
>>>>> before the reduce phase
>>
>> Nope, gluster is the destination for the distcp command: hadoop distcp -m
>> 50 http://nn1:8020/path/to/folder file:///mnt/gluster
>> This runs maps on the datanodes, which all have /mnt/gluster mounted.
>
> I don't know hadoop, so I'm of little help here. However, it seems that
> -m 50 means to execute 50 copies in parallel. This means that even if
> the distribution worked fine, at most 50 (probably fewer) of the 78 ec
> sets would be used in parallel.
>
>>>>> This means that this is caused by some peculiarity of the mapreduce.
>>
>> Yes, but how can a client write 500 files to the gluster mount and have
>> those files written to only a subset of subvolumes? I cannot use
>> gluster as a backup cluster if I cannot write with distcp.
>
> All 500 files were created on only one of the 78 ec sets and the
> remaining 77 stayed empty?
>
>>>>> You should look which files are created in each brick and how many
>>>>> while the process is running.
>>
>> Files are only created on nodes 185..204 or 205..224 or 225..244; only
>> on 20 nodes in each test.
>
> How many files were there in each brick?
>
> Not sure if this can be related, but standard Linux distributions have a
> default limit of 1024 open file descriptors. With such a big volume and
> a massive copy, maybe this limit is affecting something?
>
> Are there any error or warning messages in the mount or brick logs?
>
> Xavi
>
>> On Tue, Apr 19, 2016 at 1:05 PM, Xavier Hernandez <xhernandez@xxxxxxxxxx>
>> wrote:
>>> Hi Serkan,
>>>
>>> moved to gluster-users since this doesn't belong to the devel list.
>>>
>>> On 19/04/16 11:24, Serkan Çoban wrote:
>>>> I am copying 10,000 files to the gluster volume using mapreduce on
>>>> the clients. Each map process takes one file at a time and copies it
>>>> to the gluster volume.
>>>
>>> I assume that gluster is used to store the intermediate files before
>>> the reduce phase.
>>>
>>>> My disperse volume consists of 78 subvolumes of 16+4 disks each. So
>>>> if I copy >78 files in parallel, I expect each file to go to a
>>>> different subvolume, right?
>>>
>>> If you only copy 78 files, most probably you will get some subvolumes
>>> empty and some others with more than one or two files. It's not an
>>> exact distribution, it's a statistically balanced distribution: over
>>> time and with enough files, each brick will contain an amount of files
>>> of the same order of magnitude, but they won't have the *same* number
>>> of files.
>>>
>>>> In my tests with fio I can see every file goes to a different
>>>> subvolume, but when I start the mapreduce process from the clients
>>>> only 78/3=26 subvolumes are used for writing files.
>>>
>>> This means that this is caused by some peculiarity of the mapreduce.
>>>
>>>> I see that clearly from the network traffic. Mapreduce on the client
>>>> side can be run multi-threaded. I tested with 1, 5 and 10 threads on
>>>> each client, but every time only 26 subvolumes are used.
>>>> How can I debug the issue further?
>>>
>>> You should look at which files are created in each brick, and how
>>> many, while the process is running.
>>> Xavi
>>>
>>>> On Tue, Apr 19, 2016 at 11:22 AM, Xavier Hernandez
>>>> <xhernandez@xxxxxxxxxx> wrote:
>>>>> Hi Serkan,
>>>>>
>>>>> On 19/04/16 09:18, Serkan Çoban wrote:
>>>>>> Hi, I just reinstalled a fresh 3.7.11 and I am seeing the same
>>>>>> behavior. 50 clients are copying part-0-xxxx named files using
>>>>>> mapreduce to gluster, using one thread per server, and they are
>>>>>> using only 20 servers out of 60. On the other hand, fio tests use
>>>>>> all the servers. Anything I can do to solve the issue?
>>>>>
>>>>> Distribution of files to ec sets is done by dht. In theory, if you
>>>>> create many files, each ec set will receive the same amount of
>>>>> files. However, when the number of files is small enough, statistics
>>>>> can fail.
>>>>>
>>>>> Not sure what you are doing exactly, but a mapreduce procedure
>>>>> generally only creates a single output. In that case it makes sense
>>>>> that only one ec set is used. If you want to use all ec sets for a
>>>>> single file, you should enable sharding (I haven't tested that) or
>>>>> split the result into multiple files.
>>>>>
>>>>> Xavi
>>>>>
>>>>>> Thanks,
>>>>>> Serkan
>>>>>>
>>>>>> ---------- Forwarded message ----------
>>>>>> From: Serkan Çoban <cobanserkan@xxxxxxxxx>
>>>>>> Date: Mon, Apr 18, 2016 at 2:39 PM
>>>>>> Subject: disperse volume file to subvolume mapping
>>>>>> To: Gluster Users <gluster-users@xxxxxxxxxxx>
>>>>>>
>>>>>> Hi, I have a problem where clients are using only 1/3 of the nodes
>>>>>> in a disperse volume for writing.
>>>>>> I am testing from 50 clients using 1 to 10 threads with file names
>>>>>> part-0-xxxx.
>>>>>> What I see is that clients only use 20 nodes for writing. How is
>>>>>> the file name to subvolume hashing done? Is this related to the
>>>>>> file names being similar?
>>>>>>
>>>>>> My cluster is 3.7.10 with 60 nodes, each with 26 disks. The
>>>>>> disperse volume is 78 x (16+4). Only 26 out of 78 subvolumes are
>>>>>> used during writes.
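A side note on Xavi's point that the distribution of files to ec sets is done by dht: each brick's copy of a directory carries a trusted.glusterfs.dht extended attribute recording the 32-bit hash range its ec set is responsible for, and dht hashes a file's basename into that space to choose the subvolume. The loop below is only a rough sketch for inspecting that layout for the distcp target directory; it assumes root access on the brick servers, the /bricks/NN/s1/teragen-10tb paths from my earlier mail, and that getfattr (from the attr package) is installed. The exact byte layout of the value can differ between gluster versions, so treat it as indicative.

# Sketch: dump the dht layout range each local brick holds for the target
# directory (getfattr prints the brick path with each value; the last two
# 32-bit words of the hex value are the start/stop of the hash range).
for b in /bricks/{02..27}; do
  getfattr -n trusted.glusterfs.dht -e hex "$b/s1/teragen-10tb" 2>/dev/null
done

Comparing these ranges with where the non-empty part-m-xxxx files actually landed should show whether dht is honouring the layout or whether the directory's layout itself only covers a subset of the ec sets.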