Hi Geoffrey,
Sorry for the delayed response. We will look at the log files you
provided in your previous email and get back to you with a workaround
as soon as possible.
Thanks,
Vijay
On Thursday 11 June 2015 05:43 PM, Geoffrey Letessier wrote:
Hi Vijay,
Could you take some time to look at this? I found only one report
related to my issues in the Red Hat Bugzilla
(https://bugzilla.redhat.com/show_bug.cgi?id=917901).
My storage and computing clusters are still in production, and I
wonder whether I should warn my community about a required production
outage, or whether I can apply a fix while in production (i.e. without
updating the GlusterFS version on my storage cluster).
Thanks in advance,
Geoffrey
------------------------------------------------------
Geoffrey Letessier
IT manager & systems engineer
UPR 9080 - CNRS - Laboratoire de Biochimie Théorique
Institut de Biologie Physico-Chimique
13, rue Pierre et Marie Curie - 75005 Paris
Tel: 01 58 41 50 93 - eMail: geoffrey.letessier@xxxxxxx
Hi Geoffrey,
Please grep for 'ERROR' in the log files; those lines alone would be
sufficient.
Thanks,
Vijay
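For output files of several GB, the filtering can be done as a stream so that only the matching lines are kept. The following is a minimal sketch in Python, assuming the quota-verify output is plain text with one record per line and that the marker is the literal string 'ERROR' mentioned above. The function names and the file names in the usage note are hypothetical.

```python
from typing import Iterable, Iterator


def error_lines(lines: Iterable[str], marker: str = "ERROR") -> Iterator[str]:
    """Yield only the lines that contain the marker (case-sensitive)."""
    for line in lines:
        if marker in line:
            yield line


def filter_log(src_path: str, dst_path: str, marker: str = "ERROR") -> int:
    """Stream src_path into dst_path, keeping matching lines; return the count."""
    kept = 0
    # errors="replace" keeps the scan going if the log contains odd bytes.
    with open(src_path, "r", errors="replace") as src, open(dst_path, "w") as dst:
        for line in error_lines(src, marker):
            dst.write(line)
            kept += 1
    return kept
```

For example, filter_log("brick1.log", "brick1.errors.log") would write only the matching lines to the second (hypothetical) file, which should be far smaller to send.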
On Wednesday 10 June 2015 04:38 AM, Geoffrey Letessier wrote:
Hello Vijay,
Quota-verify has been running for many hours now (more than 10), and
the output files (4 files, because there are 4 bricks per replica) are
huge: around 800MB per file on the first server and 5GB per file on
the second one. Do you still want them? How can I send them to you?
Good night (from France),
Geoffrey
Hi Geoffrey,
The file content was deleted because of the behaviour of the 'vi
editor', which truncates the file when writing the updated content.
Regarding the quota size/usage problem, could you please run the
attached script on each brick and send us the output it generates?
This will help us analyse why quota list is showing the wrong size.
The script crawls the directory given as argument and collects the
quota "contri" and "size" extended attributes, as well as the
"block size" from the stat call.
Usage:
./quota-verify -b <brick_path> | tee brick_name.log
Thanks,
Vijay
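The crawl described above can be pictured roughly as follows. This is only an illustrative sketch in Python, not the attached quota-verify script: it walks a directory, takes the allocated size from the stat call (st_blocks counts 512-byte units on Linux), and collects any extended attribute whose name contains "quota". Matching on that substring is an assumption made here to avoid guessing exact attribute names, and reading trusted.* attributes on a brick normally requires root.

```python
import os
from typing import Dict, Iterator, Tuple


def quota_xattrs(path: str) -> Dict[str, bytes]:
    """Return extended attributes whose name contains 'quota' (assumed filter)."""
    found: Dict[str, bytes] = {}
    try:
        names = os.listxattr(path, follow_symlinks=False)
    except (AttributeError, OSError):
        # No xattr support on this platform/filesystem, or no permission.
        return found
    for name in names:
        if "quota" in name:
            try:
                found[name] = os.getxattr(path, name, follow_symlinks=False)
            except OSError:
                continue
    return found


def crawl(root: str) -> Iterator[Tuple[str, int, Dict[str, bytes]]]:
    """Yield (path, allocated_bytes, quota_xattrs) for each directory and file."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for path in [dirpath] + [os.path.join(dirpath, f) for f in filenames]:
            st = os.lstat(path)
            # st_blocks is expressed in 512-byte units, like the size du reports.
            yield path, st.st_blocks * 512, quota_xattrs(path)
```

Comparing the allocated bytes with the values stored in the quota attributes, entry by entry, is the kind of check that could show where the accounting drifts.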
On Tuesday 09 June 2015 03:45 PM, Vijaikumar M wrote:
On Tuesday 09 June 2015 03:40 PM, Geoffrey Letessier wrote:
Hi Vijay,
Thanks for replying.
Unfortunately, I checked every brick in my storage pool and did not
find any backup file. What a pity!
Please check for a backup file on the client machine where the file
was edited, in the home directory of the user whose login was used to
edit the file.
Thanks,
Vijay
Thank you again!
Good luck and see you,
Geoffrey
On Tuesday 09 June 2015 01:08 PM, Geoffrey Letessier wrote:
Hi,
Yes of course:
[root@lucifer ~]# pdsh -w cl-storage[1,3] du -s /export/brick_home/brick*/amyloid_team
cl-storage1: 1608522280 /export/brick_home/brick1/amyloid_team
cl-storage3: 1619630616 /export/brick_home/brick1/amyloid_team
cl-storage1: 1614057836 /export/brick_home/brick2/amyloid_team
cl-storage3: 1602653808 /export/brick_home/brick2/amyloid_team
The sum is 6444864540 (around 6.4-6.5TB), while the quota list
displays 7.7TB. So the discrepancy is roughly 1.2-1.3TB, in other
words around 16%, which is too large, no?
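The sum can be double-checked, and converted to explicit units, with a few lines of Python. This is a sketch under one stated assumption: that `du -s` reports 1 KiB blocks (the GNU du default when no block-size option or environment variable is set). Which unit the quota list calls "TB" is not stated in this thread, so both the decimal and the binary conversion are computed; the helper name is hypothetical.

```python
# Per-brick values copied from the `du -s` output above.
DU_BLOCKS = {
    ("cl-storage1", "brick1"): 1608522280,
    ("cl-storage3", "brick1"): 1619630616,
    ("cl-storage1", "brick2"): 1614057836,
    ("cl-storage3", "brick2"): 1602653808,
}


def totals(du_blocks, block_bytes=1024):
    """Return (total_blocks, decimal_TB, binary_TiB) for the given du values.

    block_bytes=1024 is an assumption about how du was invoked.
    """
    total_blocks = sum(du_blocks.values())
    total_bytes = total_blocks * block_bytes
    return total_blocks, total_bytes / 10**12, total_bytes / 2**40


total_blocks, tb, tib = totals(DU_BLOCKS)
print(f"sum of du values: {total_blocks}")
print(f"decimal: {tb:.2f} TB, binary: {tib:.2f} TiB")
```

Under that assumption the size of the gap depends on the unit used for the comparison, so it is worth settling the units before drawing conclusions from the percentage.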
In addition, since the quota was exceeded, I have noticed a lot of
files like the following:
[root@lucifer ~]# pdsh -w cl-storage[1,3] "cd /export/brick_home/brick2/amyloid_team/tarus/project/ab1-40-x1_sen304-x2_inh3-x2/remd_charmm22star_scripts/; ls -ail remd_100.sh 2> /dev/null" 2>/dev/null
cl-storage3: 133325688 ---------T 2 tarus amyloid_team 0 16 févr. 10:20 remd_100.sh
Note the 'T' at the end of the permissions and the file size of 0B.
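Entries shown as `---------T` with size 0 are regular files whose only mode bit is the sticky bit. GlusterFS's distribute translator is known to create pointer ("linkto") files of this shape on bricks; whether the files listed above are such pointers is an assumption that this thread does not confirm. A minimal Python sketch to list candidates on a brick, with hypothetical function names:

```python
import os
import stat
from typing import Iterator


def is_sticky_only_empty(mode: int, size: int) -> bool:
    """True for a regular, zero-byte file whose only mode bit is the sticky bit."""
    return (
        stat.S_ISREG(mode)
        and stat.S_IMODE(mode) == stat.S_ISVTX
        and size == 0
    )


def find_candidates(root: str) -> Iterator[str]:
    """Yield paths under root that look like '---------T' zero-byte files."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                continue  # the file vanished or is unreadable; skip it
            if is_sticky_only_empty(st.st_mode, st.st_size):
                yield path
```

Running find_candidates on a brick directory only reads metadata; it does not modify anything.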
Also, yesterday some files were duplicated, but they no longer are.
The worst part is that all these files were previously fine. In other
words, exceeding the quota caused files or their contents to be
deleted or corrupted. What can I do to prevent this situation in the
future? I guess I cannot do anything to roll back the current
situation, right?
Hi Geoffrey,
I tried re-creating the problem. Here is the behaviour of the vi
editor.
When a file is saved in vi, it creates a backup file under the home
directory and opens the original file with the 'O_TRUNC' flag; hence
the file was truncated.
Here is the strace of the vi editor when it gets the 'EDQUOT' error:
open("hello", O_WRONLY|O_CREAT|O_TRUNC, 0644) = 3
write(3, "line one\nline two\n", 18) = 18
fsync(3) = 0
close(3) = -1 EDQUOT (Disk quota exceeded)
chmod("hello", 0100644) = 0
open("/root/hello~", O_RDONLY) = 3
open("hello", O_WRONLY|O_CREAT|O_TRUNC, 0644) = 7
read(3, "line one\n", 256) = 9
write(7, "line one\n", 9) = 9
read(3, "", 256) = 0
close(7) = -1 EDQUOT (Disk quota exceeded)
close(3) = 0
To recover the truncated file, please check whether there is a backup
file 'remd_115.sh~' under '~/' or in the same directory as the
original file. If it exists, you can copy it back.
Thanks,
Vijay
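The trace shows why the content is lost: the original is opened with O_TRUNC before the new data is safely stored, and the error only surfaces at close(). A common way for scripts and tools to avoid this pattern is to write a temporary file in the same directory and rename it over the original only once the write has succeeded. The sketch below illustrates that general technique in Python; it is not how vi works and not a GlusterFS feature, and the function name is hypothetical.

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Replace path with data without ever truncating the existing file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the same directory so the rename stays
    # within one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        # Leaving the with-block closes the file; an error such as EDQUOT at
        # write, fsync or close lands in the except branch below.
        os.replace(tmp_path, path)
    except BaseException:
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

If the quota is exceeded, the exception propagates and the original file keeps its previous content. Note that mkstemp creates the file with mode 0600, so a real tool would also copy the original permissions and ownership.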
Geoffrey
On Monday 08 June 2015 07:11 PM, Geoffrey Letessier wrote:
In addition, I notice a very big difference between the sum of du on
each brick and the "quota list" display, as you can read below:
[root@lucifer ~]# pdsh -w cl-storage[1,3] du -sh /export/brick_home/brick*/amyloid_team
cl-storage1: 1,6T /export/brick_home/brick1/amyloid_team
cl-storage3: 1,6T /export/brick_home/brick1/amyloid_team
cl-storage1: 1,6T /export/brick_home/brick2/amyloid_team
cl-storage3: 1,6T /export/brick_home/brick2/amyloid_team
[root@lucifer ~]# gluster volume quota vol_home list /amyloid_team
Path             Hard-limit   Soft-limit   Used    Available
--------------------------------------------------------------------------------
/amyloid_team    9.0TB        90%          7.8TB   1.2TB
As you can see, the sum over all bricks gives me roughly 6.4TB, while
"quota list" reports around 7.8TB; so there is a difference of 1.4TB
that I am not able to explain. Do you have any idea?
There were a few issues in quota size accounting; we have fixed some
of these issues in 3.7.
'df -h' rounds off the values. Can you please provide the output of
'df' without the -h option?
Thanks,
Geoffrey
Hello,
Concerning version 3.5.3 of GlusterFS, this morning I ran into a
strange issue when writing a file while the quota is exceeded.
One person in my lab, whose quota was exceeded (although she did not
know it), tried to modify a file but, because of the exceeded quota,
was unable to and decided to exit vi. Now her file is empty/blank, as
you can read below:
We suspect 'vi' might have created a tmp file before writing to the
file. We are working on re-creating this problem and will update you
on it.
pdsh@lucifer: cl-storage3: ssh exited with exit code 2
cl-storage1: ---------T 2 tarus amyloid_team 0 19 févr. 12:34 /export/brick_home/brick1/amyloid_team/tarus/project/ab1-40-x1_sen304-x2_inh3-x2/remd_charmm22star_scripts/remd_115.sh
cl-storage1: -rwxrw-r-- 2 tarus amyloid_team 0 8 juin 12:38 /export/brick_home/brick2/amyloid_team/tarus/project/ab1-40-x1_sen304-x2_inh3-x2/remd_charmm22star_scripts/remd_115.sh
In addition, I do not understand why, my volume being a distributed
volume within a replica (cl-storage[1,3] is replicated only on
cl-storage[2,4]), I have 2 "identical" files (same complete path) in 2
different bricks (as you can read above).
Thanks in advance for your help and clarification.
Geoffrey
Hi Ben,
I just checked my messages log files, on both client and server, and I
do not find any hung task like the ones you noticed on yours.
As you can read below, I do not see the performance issue with a
simple dd, but I think my issue concerns sets of small files (tens of
thousands, or even more).
[root@nisus test]# ddt -t 10g /mnt/test/
Writing to /mnt/test/ddt.8362 ... syncing ... done.
sleeping 10 seconds ... done.
Reading from /mnt/test/ddt.8362 ... done.
10240MiB    KiB/s  CPU%
Write      114770     4
Read        40675     4
for info: /mnt/test concerns the single v2 GlusterFS volume
[root@nisus test]# ddt -t 10g /mnt/fhgfs/
Writing to /mnt/fhgfs/ddt.8380 ... syncing ... done.
sleeping 10 seconds ... done.
Reading from /mnt/fhgfs/ddt.8380 ... done.
10240MiB    KiB/s  CPU%
Write      102591     1
Read        98079     2
Do you have any idea how to tune/optimize the performance settings
and/or the TCP settings (MTU, etc.)?
---------------------------------------------------------------
|             | UNTAR  | DU    | FIND   | TAR    | RM     |
---------------------------------------------------------------
| single      | ~3m45s | ~43s  | ~47s   | ~3m10s | ~3m15s |
---------------------------------------------------------------
| replicated  | ~5m10s | ~59s  | ~1m6s  | ~1m19s | ~1m49s |
---------------------------------------------------------------
| distributed | ~4m18s | ~41s  | ~57s   | ~2m24s | ~1m38s |
---------------------------------------------------------------
| dist-repl   | ~8m18s | ~1m4s | ~1m11s | ~1m24s | ~2m40s |
---------------------------------------------------------------
| native FS   | ~11s   | ~4s   | ~2s    | ~56s   | ~10s   |
---------------------------------------------------------------
| BeeGFS      | ~3m43s | ~15s  | ~3s    | ~1m33s | ~46s   |
---------------------------------------------------------------
| single (v2) | ~3m6s  | ~14s  | ~32s   | ~1m2s  | ~44s   |
---------------------------------------------------------------
For info:
- BeeGFS is a distributed FS (4 bricks, 2 bricks per server and 2
  servers)
- single (v2): simple gluster volume with default settings
I also note that I get the same tar/untar performance issue with
FhGFS/BeeGFS, but the rest (DU, FIND, RM) looks OK.
Thank you very much for your reply and help.
Geoffrey
I am seeing problems on 3.7 as well. Can you check /var/log/messages
on both the clients and servers for hung tasks like:
Jun 2 15:23:14 gqac006 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 2 15:23:14 gqac006 kernel: iozone D 0000000000000001 0 21999 1 0x00000080
Jun 2 15:23:14 gqac006 kernel: ffff880611321cc8 0000000000000082 ffff880611321c18 ffffffffa027236e
Jun 2 15:23:14 gqac006 kernel: ffff880611321c48 ffffffffa0272c10 ffff88052bd1e040 ffff880611321c78
Jun 2 15:23:14 gqac006 kernel: ffff88052bd1e0f0 ffff88062080c7a0 ffff880625addaf8 ffff880611321fd8
Jun 2 15:23:14 gqac006 kernel: Call Trace:
Jun 2 15:23:14 gqac006 kernel: [<ffffffffa027236e>] ? rpc_make_runnable+0x7e/0x80 [sunrpc]
Jun 2 15:23:14 gqac006 kernel: [<ffffffffa0272c10>] ? rpc_execute+0x50/0xa0 [sunrpc]
Jun 2 15:23:14 gqac006 kernel: [<ffffffff810aaa21>] ? ktime_get_ts+0xb1/0xf0
Jun 2 15:23:14 gqac006 kernel: [<ffffffff811242d0>] ? sync_page+0x0/0x50
Jun 2 15:23:14 gqac006 kernel: [<ffffffff8152a1b3>] io_schedule+0x73/0xc0
Jun 2 15:23:14 gqac006 kernel: [<ffffffff8112430d>] sync_page+0x3d/0x50
Jun 2 15:23:14 gqac006 kernel: [<ffffffff8152ac7f>] __wait_on_bit+0x5f/0x90
Jun 2 15:23:14 gqac006 kernel: [<ffffffff81124543>] wait_on_page_bit+0x73/0x80
Jun 2 15:23:14 gqac006 kernel: [<ffffffff8109eb80>] ? wake_bit_function+0x0/0x50
Jun 2 15:23:14 gqac006 kernel: [<ffffffff8113a525>] ? pagevec_lookup_tag+0x25/0x40
Jun 2 15:23:14 gqac006 kernel: [<ffffffff8112496b>] wait_on_page_writeback_range+0xfb/0x190
Jun 2 15:23:14 gqac006 kernel: [<ffffffff81124b38>] filemap_write_and_wait_range+0x78/0x90
Jun 2 15:23:14 gqac006 kernel: [<ffffffff811c07ce>] vfs_fsync_range+0x7e/0x100
Jun 2 15:23:14 gqac006 kernel: [<ffffffff811c08bd>] vfs_fsync+0x1d/0x20
Jun 2 15:23:14 gqac006 kernel: [<ffffffff811c08fe>] do_fsync+0x3e/0x60
Jun 2 15:23:14 gqac006 kernel: [<ffffffff811c0950>] sys_fsync+0x10/0x20
Jun 2 15:23:14 gqac006 kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Do you see a perf problem with just a simple dd, or do you need a more
complex workload to hit the issue? I think I saw an issue with
metadata performance that I am trying to run down. Let me know if you
can see the problem with simple dd reads/writes, or if we need to do
some sort of dir/metadata access as well.
-b
----- Original Message -----
From: "Geoffrey Letessier" <geoffrey.letessier@xxxxxxx>
To: "Pranith Kumar Karampuri" <pkarampu@xxxxxxxxxx>
Cc: gluster-users@xxxxxxxxxxx
Sent: Tuesday, June 2, 2015 8:09:04 AM
Subject: Re: GlusterFS 3.7 - slow/poor performances
Hi Pranith,
I'm sorry, but I cannot give you a comparison, because any comparison
would be distorted by the fact that on my production HPC cluster the
network technology is InfiniBand QDR and my volumes are quite
different (bricks in RAID6 (12x2TB), 2 bricks per server and 4 servers
in my pool).
As you requested, you will find all the expected results in the
attachments. I hope they help you solve this serious performance issue
(maybe I need to play with the glusterfs parameters?).
Thank you very much in advance,
Geoffrey
On 2 June 2015 at 10:09, Pranith Kumar Karampuri <pkarampu@xxxxxxxxxx> wrote:
Hi Geoffrey,
Since you are saying it happens on all types of volumes, let's do the
following:
1) Create a dist-repl volume
2) Set the options etc. you need.
3) Enable gluster volume profile using "gluster volume profile <volname> start"
4) Run the workload
5) Give the output of "gluster volume profile <volname> info"
Repeat the steps above on the new and old versions you are comparing.
That should give us insight into what could be causing the slowness.
Pranith
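Steps 3 to 5 above can be wrapped so that the same sequence runs identically on the old and the new version. The following Python sketch only uses the two gluster commands quoted in the steps; the function names, the injectable `run` parameter and the workload callable are hypothetical helpers, and the script is assumed to run as root on a node where the gluster CLI is available.

```python
import subprocess
from typing import Callable, List, Tuple


def profile_commands(volname: str) -> Tuple[List[str], List[str]]:
    """Build the 'start' and 'info' profile commands for a volume."""
    base = ["gluster", "volume", "profile", volname]
    return base + ["start"], base + ["info"]


def profile_workload(volname: str, workload: Callable[[], None],
                     run=subprocess.run) -> str:
    """Start profiling, run the workload, and return the 'info' output."""
    start_cmd, info_cmd = profile_commands(volname)
    run(start_cmd, check=True)
    workload()
    result = run(info_cmd, check=True, capture_output=True, text=True)
    return result.stdout
```

Saving the returned text once per version gives two reports that can be compared side by side.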
On 06/02/2015 03:22 AM, Geoffrey Letessier wrote:
Dear all,
I have a crash-test cluster where I have tested the new version of
GlusterFS (v3.7) before upgrading my production HPC cluster. But all
my tests show very poor performance.
For my benchmarks, as you can read below, I perform some actions
(untar, du, find, tar, rm) on the Linux kernel sources, dropping
caches, on each of the distributed, replicated, distributed-replicated
and single (single brick) volumes, and on the native FS of one brick.
# time (echo 3 > /proc/sys/vm/drop_caches; tar xJf ~/linux-4.1-rc5.tar.xz; sync; echo 3 > /proc/sys/vm/drop_caches)
# time (echo 3 > /proc/sys/vm/drop_caches; du -sh linux-4.1-rc5/; echo 3 > /proc/sys/vm/drop_caches)
# time (echo 3 > /proc/sys/vm/drop_caches; find linux-4.1-rc5/|wc -l; echo 3 > /proc/sys/vm/drop_caches)
# time (echo 3 > /proc/sys/vm/drop_caches; tar czf linux-4.1-rc5.tgz linux-4.1-rc5/; echo 3 > /proc/sys/vm/drop_caches)
# time (echo 3 > /proc/sys/vm/drop_caches; rm -rf linux-4.1-rc5.tgz linux-4.1-rc5/; echo 3 > /proc/sys/vm/drop_caches)
And here are the process times:
---------------------------------------------------------------
|             | UNTAR  | DU    | FIND   | TAR    | RM     |
---------------------------------------------------------------
| single      | ~3m45s | ~43s  | ~47s   | ~3m10s | ~3m15s |
---------------------------------------------------------------
| replicated  | ~5m10s | ~59s  | ~1m6s  | ~1m19s | ~1m49s |
---------------------------------------------------------------
| distributed | ~4m18s | ~41s  | ~57s   | ~2m24s | ~1m38s |
---------------------------------------------------------------
| dist-repl   | ~8m18s | ~1m4s | ~1m11s | ~1m24s | ~2m40s |
---------------------------------------------------------------
| native FS   | ~11s   | ~4s   | ~2s    | ~56s   | ~10s   |
---------------------------------------------------------------
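To make the gap easier to read, the timings can be converted to seconds and expressed as a slowdown factor relative to the native FS row. The Python sketch below copies the values from the table above; the helper names are hypothetical, and the "~" prefix is simply dropped since the figures are approximate.

```python
import re

OPS = ["UNTAR", "DU", "FIND", "TAR", "RM"]
TIMES = {
    "single":      ["~3m45s", "~43s", "~47s", "~3m10s", "~3m15s"],
    "replicated":  ["~5m10s", "~59s", "~1m6s", "~1m19s", "~1m49s"],
    "distributed": ["~4m18s", "~41s", "~57s", "~2m24s", "~1m38s"],
    "dist-repl":   ["~8m18s", "~1m4s", "~1m11s", "~1m24s", "~2m40s"],
    "native FS":   ["~11s", "~4s", "~2s", "~56s", "~10s"],
}


def to_seconds(text: str) -> int:
    """Convert strings such as '~3m45s' or '~43s' to whole seconds."""
    match = re.fullmatch(r"~?(?:(\d+)m)?(\d+)s", text.strip())
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    minutes = int(match.group(1) or 0)
    return minutes * 60 + int(match.group(2))


def slowdown(times, baseline="native FS"):
    """Return {volume: [factor per operation]} relative to the baseline row."""
    base = [to_seconds(t) for t in times[baseline]]
    return {
        name: [round(to_seconds(t) / b, 1) for t, b in zip(row, base)]
        for name, row in times.items()
        if name != baseline
    }


for name, factors in slowdown(TIMES).items():
    print(name, dict(zip(OPS, factors)))
```

For instance, the untar on the single-brick volume takes 225 seconds against 11 seconds natively, a factor of roughly 20.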
I get the same results with either the default configuration or custom
configurations.
Looking at the output of the ifstat command, I can see that my write
I/O never exceeds 3MB/s...
The EXT4 native FS seems to be faster (roughly 15-20%, but no more)
than the XFS one.
My [test] storage cluster consists of 2 identical servers (dual-CPU
Intel Xeon X5355, 8GB of RAM, 2x2TB HDD (no RAID) and Gb ethernet).
My volume settings:
single: 1 server, 1 brick
replicated: 2 servers, 1 brick each
distributed: 2 servers, 2 bricks each
dist-repl: 2 bricks in the same server and replica 2
Everything seems OK in the gluster status command output.
Do you have any idea why I get such bad results?
Thanks in advance.
Geoffrey
_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-users
<quota-verify.gz>