Hi Pranith,

Sorry, it took a while to count the directories. I'll try to answer your
questions as well as possible.

> What kind of data do you have?
> How many directories in the filesystem?
> On average how many files per directory?
> What is the depth of your directory hierarchy on average?
> What is average filesize?

We have mostly images (more than 95% of disk usage, 90% of file count), some
text files (like css, jsp, gpx etc.) and some binaries.

There are about 190,000 directories in the file system; maybe there are some
more, because we're hit by bug 1512371 (parallel-readdir = TRUE prevents
directories listing). The number of directories could/will rise in the future
(maybe into the millions).

Files per directory: ranges from 0 to 100, on average about 20 files per
directory (well, at least in the deepest dirs, see the explanation below).

Average filesize: ranges from a few hundred bytes up to 30 MB, on average
about 2-3 MB.

Directory hierarchy: maximum depth as seen from within the volume is 6, the
average is about 3.

Volume name: shared
Mount point on clients: /data/repository/shared/

Below /shared/ there are 2 directories:

- public/: mainly calculated images (file sizes from a few KB up to max 1 MB)
  and some resources (small PNGs with a size of a few hundred bytes).
- private/: mainly source images; file sizes from 50 KB up to 30 MB.

We migrated from an NFS server (SPOF) to glusterfs and simply copied our
files. The images (which have an ID) are stored in the deepest directories of
the dir tree. I'll explain it better :-)

Directory structure for the images (I'll omit some other miscellaneous stuff,
but it looks quite similar):

- the ID of an image has 7 or 8 digits
- /shared/private/: /(first 3 digits of ID)/(next 3 digits of ID)/$ID.jpg
- /shared/public/: /(first 3 digits of ID)/(next 3 digits of ID)/$ID/$misc_formats.jpg

That's why we have that many (sub-)directories. Files are only stored at the
lowest level of the directory hierarchy (see the small sketch in the PS
below).

I hope I could make our structure at least a bit more transparent, and I hope
there's something we can do to raise performance a bit. Thx in advance :-)
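PS: To make the ID-to-path mapping concrete, here is a minimal sketch
(Python; not our actual code - the function name and the "large" format name
are just examples for illustration):

    import os

    def image_paths(image_id: str, fmt: str = "large") -> dict:
        """Map a 7- or 8-digit image ID to its private/public paths."""
        assert image_id.isdigit() and len(image_id) in (7, 8)
        first, second = image_id[:3], image_id[3:6]   # first 3 / next 3 digits
        private = os.path.join("/shared/private", first, second,
                               image_id + ".jpg")
        public = os.path.join("/shared/public", first, second, image_id,
                              fmt + ".jpg")
        return {"private": private, "public": public}

    # Example: ID 1234567 ->
    #   private: /shared/private/123/456/1234567.jpg
    #   public:  /shared/public/123/456/1234567/large.jpg
    print(image_paths("1234567"))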
2018-07-24 10:40 GMT+02:00 Pranith Kumar Karampuri <pkarampu@xxxxxxxxxx>:
>
> On Mon, Jul 23, 2018 at 4:16 PM, Hu Bert <revirii@xxxxxxxxxxxxxx> wrote:
>>
>> Well, over the weekend about 200GB were copied, so now there are
>> ~400GB copied to the brick. That's far below a speed of 10GB per
>> hour. If I copied the 1.6 TB directly, that would be done within max 2
>> days. But with the self heal this will take at least 20 days minimum.
>>
>> Why is the performance that bad? No chance of speeding this up?
>
> What kind of data do you have?
> How many directories in the filesystem?
> On average how many files per directory?
> What is the depth of your directory hierarchy on average?
> What is average filesize?
>
> Based on this data we can see if anything can be improved, or if there are
> some enhancements that need to be implemented in gluster to address this
> kind of data layout.
>>
>> 2018-07-20 9:41 GMT+02:00 Hu Bert <revirii@xxxxxxxxxxxxxx>:
>> > Hmm... does no one have any idea?
>> >
>> > Additional question: the hdd on server gluster12 was changed; so far
>> > ~220 GB were copied. On the other 2 servers i see a lot of entries in
>> > glustershd.log, about 312,000 and 336,000 entries there yesterday,
>> > most of them (current log output) looking like this:
>> >
>> > [2018-07-20 07:30:49.757595] I [MSGID: 108026]
>> > [afr-self-heal-common.c:1724:afr_log_selfheal] 0-shared-replicate-3:
>> > Completed data selfheal on 0d863a62-0dd8-401c-b699-2b642d9fd2b6.
>> > sources=0 [2] sinks=1
>> > [2018-07-20 07:30:49.992398] I [MSGID: 108026]
>> > [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do]
>> > 0-shared-replicate-3: performing metadata selfheal on
>> > 0d863a62-0dd8-401c-b699-2b642d9fd2b6
>> > [2018-07-20 07:30:50.243551] I [MSGID: 108026]
>> > [afr-self-heal-common.c:1724:afr_log_selfheal] 0-shared-replicate-3:
>> > Completed metadata selfheal on 0d863a62-0dd8-401c-b699-2b642d9fd2b6.
>> > sources=0 [2] sinks=1
>> >
>> > or like this:
>> >
>> > [2018-07-20 07:38:41.726943] I [MSGID: 108026]
>> > [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do]
>> > 0-shared-replicate-3: performing metadata selfheal on
>> > 9276097a-cdac-4d12-9dc6-04b1ea4458ba
>> > [2018-07-20 07:38:41.855737] I [MSGID: 108026]
>> > [afr-self-heal-common.c:1724:afr_log_selfheal] 0-shared-replicate-3:
>> > Completed metadata selfheal on 9276097a-cdac-4d12-9dc6-04b1ea4458ba.
>> > sources=[0] 2 sinks=1
>> > [2018-07-20 07:38:44.755800] I [MSGID: 108026]
>> > [afr-self-heal-entry.c:887:afr_selfheal_entry_do]
>> > 0-shared-replicate-3: performing entry selfheal on
>> > 9276097a-cdac-4d12-9dc6-04b1ea4458ba
>> >
>> > Is this behaviour normal? I'd expect these messages on the server
>> > with the failed brick, not on the other ones.
>> >
>> > 2018-07-19 8:31 GMT+02:00 Hu Bert <revirii@xxxxxxxxxxxxxx>:
>> >> Hi there,
>> >>
>> >> I sent this mail yesterday, but somehow it didn't work? It wasn't
>> >> archived, so please be indulgent if you receive this mail again :-)
>> >>
>> >> We are currently running a replicate setup and are experiencing
>> >> quite poor performance. It got even worse when, within a couple of
>> >> weeks, 2 bricks (disks) crashed. Some general information about our
>> >> setup:
>> >>
>> >> 3 Dell PowerEdge R530 (Xeon E5-1650 v3 Hexa-Core, 64 GB DDR4, OS on
>> >> separate disks); each server has 4 10TB disks -> each is a brick;
>> >> replica 3 setup (see gluster volume status below). Debian stretch,
>> >> kernel 4.9.0, gluster version 3.12.12. Servers and clients are
>> >> connected via 10 GBit ethernet.
>> >>
>> >> About a month ago, and again 2 days ago, a disk died (on different
>> >> servers); the disks were replaced, brought back into the volume, and
>> >> a full self heal was started. But the speed for this is quite...
>> >> disappointing. Each brick has ~1.6TB of data on it (mostly the
>> >> infamous small files). The full heal i started yesterday copied only
>> >> ~50GB within 24 hours (48 hours: about 100GB) - at this rate it
>> >> would take weeks until the self heal finishes.
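[For scale: a quick back-of-the-envelope check of the heal rate quoted above,
as a minimal Python sketch. The ~1.6 TB per brick and the ~50-100 GB per day
figures come from the mails in this thread; everything else is an assumption.]

    # Rough estimate of a full self heal of one brick at the observed rates.
    TOTAL_GB = 1.6 * 1000   # ~1.6 TB of data per brick
    for rate_gb_per_day in (50, 100):
        days = TOTAL_GB / rate_gb_per_day
        print("at ~%d GB/day: ~%d days" % (rate_gb_per_day, days))
    # at ~50 GB/day:  ~32 days
    # at ~100 GB/day: ~16 days
    # roughly in line with the "weeks" / "at least 20 days" estimates above.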
>> >>
>> >> After the first heal (started on gluster13 about a month ago, it
>> >> took about 3 weeks) finished, we had terrible performance; CPU on
>> >> one or two of the nodes (gluster11, gluster12) was up to 1200%,
>> >> consumed by the brick process of the formerly crashed brick
>> >> (bricksdd1) - interestingly not on the server with the failed disk,
>> >> but on the other 2...
>> >>
>> >> Well... am i doing something wrong? Some options wrongly configured?
>> >> Terrible setup? Anyone got an idea? Any additional information
>> >> needed?
>> >>
>> >> Thx in advance :-)
>> >>
>> >> gluster volume status
>> >>
>> >> Volume Name: shared
>> >> Type: Distributed-Replicate
>> >> Volume ID: e879d208-1d8c-4089-85f3-ef1b3aa45d36
>> >> Status: Started
>> >> Snapshot Count: 0
>> >> Number of Bricks: 4 x 3 = 12
>> >> Transport-type: tcp
>> >> Bricks:
>> >> Brick1: gluster11:/gluster/bricksda1/shared
>> >> Brick2: gluster12:/gluster/bricksda1/shared
>> >> Brick3: gluster13:/gluster/bricksda1/shared
>> >> Brick4: gluster11:/gluster/bricksdb1/shared
>> >> Brick5: gluster12:/gluster/bricksdb1/shared
>> >> Brick6: gluster13:/gluster/bricksdb1/shared
>> >> Brick7: gluster11:/gluster/bricksdc1/shared
>> >> Brick8: gluster12:/gluster/bricksdc1/shared
>> >> Brick9: gluster13:/gluster/bricksdc1/shared
>> >> Brick10: gluster11:/gluster/bricksdd1/shared
>> >> Brick11: gluster12:/gluster/bricksdd1_new/shared
>> >> Brick12: gluster13:/gluster/bricksdd1_new/shared
>> >> Options Reconfigured:
>> >> cluster.shd-max-threads: 4
>> >> performance.md-cache-timeout: 60
>> >> cluster.lookup-optimize: on
>> >> cluster.readdir-optimize: on
>> >> performance.cache-refresh-timeout: 4
>> >> performance.parallel-readdir: on
>> >> server.event-threads: 8
>> >> client.event-threads: 8
>> >> performance.cache-max-file-size: 128MB
>> >> performance.write-behind-window-size: 16MB
>> >> performance.io-thread-count: 64
>> >> cluster.min-free-disk: 1%
>> >> performance.cache-size: 24GB
>> >> nfs.disable: on
>> >> transport.address-family: inet
>> >> performance.high-prio-threads: 32
>> >> performance.normal-prio-threads: 32
>> >> performance.low-prio-threads: 32
>> >> performance.least-prio-threads: 8
>> >> performance.io-cache: on
>> >> server.allow-insecure: on
>> >> performance.strict-o-direct: off
>> >> transport.listen-backlog: 100
>> >> server.outstanding-rpc-limit: 128
>
> --
> Pranith

_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-users