Regarding hardware, the machines are identical: Intel Xeon E5-1650 v3
hexa-core; 64 GB DDR4 ECC RAM; Dell PERC H330 8-port SAS/SATA 12 Gbit/s
RAID controller; the operating system runs on a RAID 1, then there are
4 disks (JBOD) as bricks.

Ok, I ran perf for a few seconds.

------------------------
perf record --call-graph=dwarf -p 7897 -o /tmp/perf.gluster11.bricksdd1.out
^C[ perf record: Woken up 378 times to write data ]
Warning:
Processed 83690 events and lost 96 chunks!

Check IO/CPU overload!

[ perf record: Captured and wrote 423.087 MB /tmp/perf.gluster11.bricksdd1.out (51744 samples) ]
------------------------

I copied a couple of lines:

+   8.10%  0.00%  glusteriotwr22  [unknown]          [k] 0xffffffffffffffff
+   8.10%  0.00%  glusteriotwr22  [kernel.kallsyms]  [k] iterate_dir
+   8.10%  0.00%  glusteriotwr22  [kernel.kallsyms]  [k] sys_getdents
+   8.10%  0.00%  glusteriotwr22  [kernel.kallsyms]  [k] filldir
+   8.10%  0.00%  glusteriotwr22  [kernel.kallsyms]  [k] do_syscall_64
+   8.10%  0.00%  glusteriotwr22  [kernel.kallsyms]  [k] entry_SYSCALL_64_after_swapgs
+   8.10%  0.00%  glusteriotwr22  libc-2.24.so       [.] 0xffff80c60db8ef2b
+   8.10%  0.00%  glusteriotwr22  libc-2.24.so       [.] readdir64
+   8.10%  0.00%  glusteriotwr22  index.so           [.] 0xffff80c6192a1888
+   8.10%  0.04%  glusteriotwr22  [kernel.kallsyms]  [k] ext4_htree_fill_tree
+   8.10%  0.00%  glusteriotwr22  [kernel.kallsyms]  [k] ext4_readdir
+   7.95%  0.12%  glusteriotwr22  [kernel.kallsyms]  [k] htree_dirblock_to_tree
+   5.78%  0.96%  glusteriotwr22  [kernel.kallsyms]  [k] __ext4_read_dirblock
+   4.80%  0.02%  glusteriotwr22  [kernel.kallsyms]  [k] ext4_bread
+   4.78%  0.04%  glusteriotwr22  [kernel.kallsyms]  [k] ext4_getblk
+   4.72%  0.02%  glusteriotwr22  [kernel.kallsyms]  [k] __getblk_gfp
+   4.57%  0.00%  glusteriotwr3   [unknown]          [k] 0xffffffffffffffff
+   4.55%  0.00%  glusteriotwr3   [kernel.kallsyms]  [k] do_syscall_64

Do you need different or additional information?
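For reference, here is the capture sequence in script form. This is a
minimal sketch, assuming bash; the pgrep pattern, the 10-second window
and the output path are placeholders to adjust per host:

------------------------
#!/bin/bash
# Sketch of the perf workflow used in this thread.
# Assumption: the brick process is a glusterfsd whose command line
# contains the brick path; adjust the pattern for your setup.
BRICK_PID=$(pgrep -f 'glusterfsd.*bricksdd1' | head -n 1)
OUT="/tmp/perf.$(hostname).bricksdd1.out"

# Keep the recording short, since sampling with DWARF call graphs is
# expensive and affects brick performance; perf record flushes its
# data when timeout stops it.
timeout 10 perf record --call-graph=dwarf -p "$BRICK_PID" -o "$OUT"

# Inspect the recorded call graph afterwards:
perf report -i "$OUT"
------------------------

Running the recording under timeout avoids leaving the profiler
attached longer than intended, which matters here since even a short
recording already lost chunks (see the IO/CPU overload warning above).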
2018-08-20 11:20 GMT+02:00 Pranith Kumar Karampuri <pkarampu@xxxxxxxxxx>:
> Even the brick which doesn't have high CPU seems to have the same number
> of lookups, so that's not it.
> Is there any difference at all between the machines which have high CPU
> vs low CPU?
> I think the only other thing I would do is to install perf tools and try
> to figure out the call graph which is leading to so much CPU.
>
> This affects the performance of the brick, I think, so you may have to
> do it quickly and only for a short time.
>
> perf record --call-graph=dwarf -p <brick-pid> -o </path/to/output>
> then
> perf report -i </path/to/output/given/in/the/previous/command>
>
> On Mon, Aug 20, 2018 at 2:40 PM Hu Bert <revirii@xxxxxxxxxxxxxx> wrote:
>>
>> gluster volume heal shared info | grep -i number
>> Number of entries: 0
>> Number of entries: 0
>> Number of entries: 0
>> Number of entries: 0
>> Number of entries: 0
>> Number of entries: 0
>> Number of entries: 0
>> Number of entries: 0
>> Number of entries: 0
>> Number of entries: 0
>> Number of entries: 0
>> Number of entries: 0
>>
>> Looks good to me.
>>
>> 2018-08-20 10:51 GMT+02:00 Pranith Kumar Karampuri <pkarampu@xxxxxxxxxx>:
>> > There are a lot of Lookup operations in the system, but I am not able
>> > to find out why. Could you check the output of
>> >
>> > # gluster volume heal <volname> info | grep -i number
>> >
>> > It should print all zeros.
>> >
>> > On Fri, Aug 17, 2018 at 1:49 PM Hu Bert <revirii@xxxxxxxxxxxxxx> wrote:
>> >>
>> >> I don't know what exactly you mean by workload, but the main
>> >> function of the volume is storing (incl. writing and reading) images
>> >> (from hundreds of bytes up to 30 MB, overall ~7 TB). The work is done
>> >> by Apache Tomcat servers writing to / reading from the volume. Besides
>> >> images there are some text files and binaries that are stored on the
>> >> volume and get updated regularly (every x hours); we'll try to migrate
>> >> the latter to local storage asap.
>> >>
>> >> Interestingly it's only one process (and its threads) of the same
>> >> brick on 2 of the gluster servers that consumes the CPU:
>> >>
>> >> gluster11: bricksdd1; not healed; full CPU
>> >> gluster12: bricksdd1; got healed; normal CPU
>> >> gluster13: bricksdd1; got healed; full CPU
>> >>
>> >> Besides: performance during the heal (e.g. gluster12, bricksdd1) was
>> >> way better than it is now. I've attached 2 PNGs showing the differing
>> >> CPU usage of last week, before/after the heal.
>> >>
>> >> 2018-08-17 9:30 GMT+02:00 Pranith Kumar Karampuri
>> >> <pkarampu@xxxxxxxxxx>:
>> >> > There seem to be too many lookup operations compared to any other
>> >> > operations. What is the workload on the volume?
>> >> >
>> >> > On Fri, Aug 17, 2018 at 12:47 PM Hu Bert <revirii@xxxxxxxxxxxxxx>
>> >> > wrote:
>> >> >>
>> >> >> I hope I got it right:
>> >> >>
>> >> >> gluster volume profile shared start
>> >> >> wait 10 minutes
>> >> >> gluster volume profile shared info
>> >> >> gluster volume profile shared stop
>> >> >>
>> >> >> If that's ok, I've attached the output of the info command.
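In script form, the profiling sequence above is roughly the following;
a minimal sketch, assuming bash, the volume name "shared" used in this
thread, and a hypothetical output path:

------------------------
#!/bin/bash
# Sketch of the gluster volume profile sequence quoted above.
gluster volume profile shared start
sleep 600   # sample for ~10 minutes while CPU usage is high
gluster volume profile shared info > /tmp/profile.shared.$(date +%F).txt
gluster volume profile shared stop
------------------------

The info command reports both cumulative and interval statistics, so
one ~10-minute window while the CPU is high should be enough to see
which file operations dominate.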
>> >> >>
>> >> >> 2018-08-17 8:31 GMT+02:00 Pranith Kumar Karampuri
>> >> >> <pkarampu@xxxxxxxxxx>:
>> >> >> > Please do volume profile also for around 10 minutes when CPU% is
>> >> >> > high.
>> >> >> >
>> >> >> > On Fri, Aug 17, 2018 at 11:56 AM Pranith Kumar Karampuri
>> >> >> > <pkarampu@xxxxxxxxxx> wrote:
>> >> >> >>
>> >> >> >> As per the output, all io-threads are using a lot of CPU. It is
>> >> >> >> better to check the volume profile to see what is leading to so
>> >> >> >> much work for the io-threads. Please follow the documentation at
>> >> >> >>
>> >> >> >> https://gluster.readthedocs.io/en/latest/Administrator%20Guide/Monitoring%20Workload/
>> >> >> >>
>> >> >> >> section "Running GlusterFS Volume Profile Command" and attach
>> >> >> >> the output of "gluster volume profile info".
>> >> >> >>
>> >> >> >> On Fri, Aug 17, 2018 at 11:24 AM Hu Bert <revirii@xxxxxxxxxxxxxx>
>> >> >> >> wrote:
>> >> >> >>>
>> >> >> >>> Good morning,
>> >> >> >>>
>> >> >> >>> I ran the command during 100% CPU usage and attached the file.
>> >> >> >>> Hopefully it helps.
>> >> >> >>>
>> >> >> >>> 2018-08-17 7:33 GMT+02:00 Pranith Kumar Karampuri
>> >> >> >>> <pkarampu@xxxxxxxxxx>:
>> >> >> >>> > Could you do the following on one of the nodes where you are
>> >> >> >>> > observing high CPU usage and attach that file to this thread?
>> >> >> >>> > We can find what threads/processes are leading to high usage.
>> >> >> >>> > Do this for, say, 10 minutes when you see the ~100% CPU.
>> >> >> >>> >
>> >> >> >>> > top -bHd 5 > /tmp/top.${HOSTNAME}.txt
>> >> >> >>> >
>> >> >> >>> > On Wed, Aug 15, 2018 at 2:37 PM Hu Bert
>> >> >> >>> > <revirii@xxxxxxxxxxxxxx> wrote:
>> >> >> >>> >>
>> >> >> >>> >> Hello again :-)
>> >> >> >>> >>
>> >> >> >>> >> The self heal must have finished, as there are no log entries
>> >> >> >>> >> in the glustershd.log files anymore. According to Munin, disk
>> >> >> >>> >> latency (average I/O wait) has gone down to 100 ms, and disk
>> >> >> >>> >> utilization has gone down to ~60% - both on all servers and
>> >> >> >>> >> hard disks.
>> >> >> >>> >>
>> >> >> >>> >> But now the system load on 2 servers (which were in the good
>> >> >> >>> >> state) fluctuates between 60 and 100; the server with the
>> >> >> >>> >> formerly failed disk has a load of 20-30. I've uploaded some
>> >> >> >>> >> Munin graphs of the CPU usage:
>> >> >> >>> >>
>> >> >> >>> >> https://abload.de/img/gluster11_cpu31d3a.png
>> >> >> >>> >> https://abload.de/img/gluster12_cpu8sem7.png
>> >> >> >>> >> https://abload.de/img/gluster13_cpud7eni.png
>> >> >> >>> >>
>> >> >> >>> >> This can't be normal: 2 of the servers are under heavy load
>> >> >> >>> >> and one is not. Does anyone have an explanation for this
>> >> >> >>> >> strange behaviour?
>> >> >> >>> >>
>> >> >> >>> >> Thx :-)
>> >> >> >>> >>
>> >> >> >>> >> 2018-08-14 9:37 GMT+02:00 Hu Bert <revirii@xxxxxxxxxxxxxx>:
>> >> >> >>> >> > Hi there,
>> >> >> >>> >> >
>> >> >> >>> >> > well, it seems the heal has finally finished. Couldn't
>> >> >> >>> >> > see/find any related log message; is there such a message
>> >> >> >>> >> > in a specific log file?
>> >> >> >>> >> >
>> >> >> >>> >> > But I see the same behaviour as when the last heal
>> >> >> >>> >> > finished: all CPU cores are consumed by brick processes;
>> >> >> >>> >> > not only by the formerly failed bricksdd1, but by all 4
>> >> >> >>> >> > brick processes (and their threads). Load goes up to > 100
>> >> >> >>> >> > on the 2 servers with the not-failed brick, and
>> >> >> >>> >> > glustershd.log gets filled with a lot of entries. Load on
>> >> >> >>> >> > the server with the then-failed brick is not that high,
>> >> >> >>> >> > but still ~60.
>> >> >> >>> >> >
>> >> >> >>> >> > Is this behaviour normal? Is there some post-heal activity
>> >> >> >>> >> > after a heal has finished?
>> >> >> >>> >> >
>> >> >> >>> >> > thx in advance :-)
>> >> >> >>> >
>> >> >> >>> > --
>> >> >> >>> > Pranith
>> >> >> >>
>> >> >> >> --
>> >> >> >> Pranith
>> >> >> >
>> >> >> > --
>> >> >> > Pranith
>> >> >
>> >> > --
>> >> > Pranith
>> >
>> > --
>> > Pranith
>
> --
> Pranith

_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-users