Hi John...
Regarding logs, we still do not have them available. We just
realized that ceph-fuse tries to log to /var/log/ceph, which in
our case did not exist on the clients. So, we had to create that
directory everywhere, and we are in the process of remounting
every client so that they start logging. Since unmounting forces
the client to free its inodes, we now have to wait for the
situation to reappear.
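For reference, this is roughly the sequence we are running on each
client (the /cephfs mount point and the monitor address are just
placeholders for our local setup, not anything mandated by
ceph-fuse):
$ mkdir -p /var/log/ceph              # where ceph-fuse tries to log
$ fusermount -u /cephfs               # assumes nothing is still using the mount
$ ceph-fuse -m mon-host:6789 /cephfs  # remount so logging starts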
However, I do have a bit more information. Maybe it can shed some
further light on the topic.
- If I loop through all my clients right now, I get a total of
29604 inodes.
$ cat clients_inodes_20161216-0938.txt | grep inode_count | awk '{print $2}' | sed 's/,//g' | awk '{s+=$1} END {print s}'
29604
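(Roughly speaking, that file is just the 'status' output of each
client's admin socket collected in a loop; a sketch of what I mean,
with a made-up host list file:)
$ for h in $(cat client_hosts.txt); do ssh "$h" ceph --admin-daemon /var/run/ceph/ceph-client.mount_user.asok status; done > clients_inodes_20161216-0938.txt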
- However, the mds reports '"inodes": 1779521' and
'"inodes_with_caps": 32823'. Is there a need for the MDS to keep
in memory such a large number of inodes without associated caps? I
would also expect these to be the first ones to be trimmed once
inodes > inode_max.
"mds": {
(...)
"inode_max": 2000000,
"inodes": 1779521,
"inodes_top": 18119,
"inodes_bottom": 1594129,
"inodes_pin_tail": 167273,
"inodes_pinned": 182643,
"inodes_expired": 53391877,
"inodes_with_caps": 32823,
"caps": 35133,
(...)
},
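(The extract above comes from the mds admin socket; something like
the following pulls out just the counters I keep looking at, with
mds.<id> being the daemon name in your setup:)
$ ceph daemon mds.<id> perf dump | grep -E '"(inode_max|inodes|inodes_with_caps|caps)"'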
- I am also seeing some false positives (I think). As I explained
before, we have currently unmounted all clients except 2 (they are
interactive machines where our users run tmux sessions and so
on... so it is hard to kick them out :-) ). One of those two is
still reported as problematic by the MDS although inodes <
inode_max. Looking at the number of inodes of that machine, I get
"inode_count": 13862. So, it seems that the client is still tagged
as problematic although it has an inode_count below 16384 and
inodes < inode_max. Maybe a consequence of
https://github.com/ceph/ceph/pull/11373 ? And that fix seems to be
going only into Kraken?
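In case it helps anyone reproduce the check: this is roughly how I
am cross-checking which clients the MDS is still complaining about
against what each client reports locally (the exact health text may
differ between releases, so treat the grep pattern as a guess):
$ ceph health detail | grep 'failing to respond to cache pressure'
$ ceph daemon mds.<id> session ls    # per-session details (client id, num_caps, client metadata)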
Cheers
Goncalo
On 12/14/2016 10:16 AM, Goncalo Borges wrote:
Hi John.
Comments inline.
Hi Ceph(FS)ers...
I am currently running in production the following
environment:
- ceph/cephfs in 10.2.2.
- All infrastructure is in the same version (rados cluster, mons,
mds and cephfs clients).
- We mount cephfs using ceph-fuse.
Since yesterday our cluster has been in a warning state with the
message "mds0: Many clients (X) failing to respond to cache
pressure". X has been changing over time, from ~130 to ~70. I am
able to correlate the appearance of this message with bursts of
jobs in our cluster.
This subject has been discussed on the mailing list many times,
and normally the recipe is to look for something wrong in the
clients. So, I have tried to look at the clients first:
1) I've started by looping through all my clients and running
'ceph --admin-daemon /var/run/ceph/ceph-client.mount_user.asok
status' to get the inode_count reported by each client.
$ cat all.txt | grep inode_count | awk '{print $2}' | sed 's/,//g' | awk '{s+=$1} END {print s}'
2407659
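(If a machine carried more than one ceph-fuse mount, one could
simply loop over its admin sockets instead of hard-coding one; just
a sketch, with the asok path being whatever the client is
configured to use:)
$ for sock in /var/run/ceph/ceph-client.*.asok; do ceph --admin-daemon "$sock" status; done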
2) I've then compared that with the number of inodes the mds had
in its cache (obtained by a perf dump):
"inode_max": 2000000 and "inodes": 2413826
3) I've tried to understand how many clients had a number of
inodes higher than 16384 (the default client_cache_size) and got
$ for i in `cat all.txt | grep inode_count | awk '{print $2}' | sed 's/,//g'`; do if [ $i -ge 16384 ]; then echo $i; fi; done | wc -l
27
4) My conclusion is that the bulk of the inodes is held by a
handful of machines. However, while most of them are running user
jobs, others are not doing anything at all. For example, an idle
machine (which had no users logged in, no jobs running, and whose
updatedb does not index the cephfs filesystem) reported more than
300000 inodes. To regain those inodes, I had to unmount and
remount cephfs on that machine.
5) Based on my previous observations I suspect that there are
still some problems in the ceph-fuse client with releasing these
inodes (or it happens at a very slow rate).
Seems that way. Can you come up with a reproducer for us,
and/or
gather some client+mds debug logs where a client is failing to
respond
to cache pressure?
I think I've nailed this down to a specific user workload. Every
time this user runs, it leaves the client with a huge number of
inodes, normally more than 100000. The workload consists of
generating a large number of analysis files spread over multiple
directories. I am going to try to inject some debug parameters and
see what we come up with (a rough sketch of what I have in mind is
below). Will reply on this thread later on.
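Something along these lines should mimic that workload on a
ceph-fuse mount and raise the debug levels while it runs; the mount
point, file counts and debug values below are just my guesses, not
anything official:
$ # create many small files spread over many directories (hypothetical reproducer)
$ for d in $(seq 1 200); do mkdir -p /cephfs/repro/dir$d; for f in $(seq 1 1000); do echo x > /cephfs/repro/dir$d/f$f; done; done
$ # bump debug on the client and on the active mds via their admin sockets
$ ceph --admin-daemon /var/run/ceph/ceph-client.mount_user.asok config set debug_client 20
$ ceph daemon mds.<id> config set debug_mds 10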
Also, what kernel is in use on the clients? It's possible that the
issue is in FUSE itself (or the way that it responds to
ceph-fuse's attempts to ask for some inodes to be released).
All our clusters run SL6 because the CERN experiments' software is
only certified for that OS flavour. Because of the SL6 restriction,
to enable post-infernalis ceph clients on those machines, we have
to recompile ceph-fuse as well as some of its dependencies which
are not available in SL6. In summary, we recompile ceph-fuse 10.2.2
with gcc 4.8.4 against boost-1.53.0-25 and fuse-2.9.7. The kernel
version on the clients is 2.6.32-642.6.2.el6.x86_64.
Thanks for the explanations about the mds memory usage. I am glad
something is on its way to make the memory usage more effective.
Cheers
Goncalo
However, I also do not completely
understand what is happening on the server
side:
6) The current memory usage of my mds is the following:
PID USER PR NI VIRT RES SHR S %CPU %MEM
TIME+ COMMAND
17831 ceph 20 0 13.667g 0.012t 10048 S 37.5 40.2
1068:47 ceph-mds
The mds cache size is set to 2000000. Running 'ceph daemon
mds.<id> perf dump', I get "inode_max": 2000000 and
"inodes": 2413826. Assuming ~4 KB per inode, one gets ~10 GB. So
why is it taking much more than that?
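(Spelling out the arithmetic behind that estimate, and assuming the
0.012t RES value in the top output above means roughly 12.3 GiB:
2,413,826 inodes x 4 KiB ~= 9.2 GiB (~9.9 GB), so the resident size
is a few GiB beyond the naive per-inode estimate.)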
7) I have been running cephfs for more than a year and, looking at
ganglia, the mds memory never decreases but always increases (even
in cases where we unmount almost all the clients). Why does that
happen?
Coincidentally someone posted about this on ceph-devel just
yesterday.
The answer is that the MDS uses memory pools for allocation, and
it
doesn't (currently) ever bother releasing memory back to the
operating
system because it's doing its own cache size enforcement.
However,
when the cache size limits aren't being enforced (for example
because
of clients failing to release caps) this becomes a problem.
There's a
patch for master (https://github.com/ceph/ceph/pull/12443).
8) I am running 2 mds daemons, in active / standby-replay mode.
The memory usage of the standby-replay is much lower:
PID USER PR NI VIRT RES SHR S %CPU %MEM
TIME+ COMMAND
716 ceph 20 0 6149424 5.115g 8524 S 1.2 43.6
53:19.74 ceph-mds
If I trigger a restart of my active mds, the standby-replay will
start acting as active but will continue with the same amount of
memory. Why can the second mds become active and do the same job
while using so much less memory?
Presumably this also makes sense once you know about the
allocator in use.
9) Finally, I am sending an extract of 'ceph daemon mds.<id> perf
dump' from my active and standby mdses. What exactly is the
meaning of inodes_pin_tail, inodes_expired and inodes_with_caps?
Is the standby mds supposed to show the same numbers? They
don't...
It's not really possible to explain these counters without a
substantial explanation of MDS internals, sorry. I will say
though
that there is absolutely no guarantee of performance counters on
the
standby replay daemon matching those on the active daemon.
John
Thanks in advance for your answers /
suggestions
Cheers
Goncalo
active:
"mds": {
"request": 93941296,
"reply": 93940671,
"reply_latency": {
"avgcount": 93940671,
"sum": 188398.004552299
},
"forward": 0,
"dir_fetch": 309878,
"dir_commit": 1736194,
"dir_split": 0,
"inode_max": 2000000,
"inodes": 2413826,
"inodes_top": 201,
"inodes_bottom": 568,
"inodes_pin_tail": 2413057,
"inodes_pinned": 2413303,
"inodes_expired": 19693168,
"inodes_with_caps": 2409737,
"caps": 2440565,
"subtrees": 2,
"traverse": 113291068,
"traverse_hit": 57822611,
"traverse_forward": 0,
"traverse_discover": 0,
"traverse_dir_fetch": 154708,
"traverse_remote_ino": 1085,
"traverse_lock": 66063,
"load_cent": 9394314733,
"q": 22,
"exported": 0,
"exported_inodes": 0,
"imported": 0,
"imported_inodes": 0
},
standby-replay:
"mds": {
"request": 0,
"reply": 0,
"reply_latency": {
"avgcount": 0,
"sum": 0.000000000
},
"forward": 0,
"dir_fetch": 0,
"dir_commit": 0,
"dir_split": 0,
"inode_max": 2000000,
"inodes": 2000058,
"inodes_top": 0,
"inodes_bottom": 1993207,
"inodes_pin_tail": 6851,
"inodes_pinned": 124135,
"inodes_expired": 10651484,
"inodes_with_caps": 0,
"caps": 0,
"subtrees": 2,
"traverse": 0,
"traverse_hit": 0,
"traverse_forward": 0,
"traverse_discover": 0,
"traverse_dir_fetch": 0,
"traverse_remote_ino": 0,
"traverse_lock": 0,
"load_cent": 0,
"q": 0,
"exported": 0,
"exported_inodes": 0,
"imported": 0,
"imported_inodes": 0
},
--
Goncalo Borges
Research Computing
ARC Centre of Excellence for Particle Physics at the Terascale
School of Physics A28 | University of Sydney, NSW 2006
T: +61 2 93511937
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com