Unsolved questions

Hi everyone,

I've been running our two-node mini-cluster for a few months now. An OSD, an MDS and a monitor run on each of the two nodes. Additionally, there is a very small third node which only runs a third monitor, but no MDS/OSD. On both main servers, CephFS is mounted via fstab using the kernel driver. The mounted folder is /var/www, hosting many websites. We use Ceph here to achieve redundancy, so we can easily switch over to the other node in case one of them fails. The kernel version is 4.9.6. For the most part it runs great and the performance of the filesystem is very good. Only a few stubborn problems/questions have remained over the whole time, and I'd like to settle them once and for all:
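For reference, the CephFS entry in /etc/fstab looks roughly like the sketch below (monitor addresses, the client name and the secretfile path are placeholders, not our real values):

    # placeholder monitor addresses and auth details
    192.168.0.1:6789,192.168.0.2:6789,192.168.0.3:6789:/  /var/www  ceph  name=admin,secretfile=/etc/ceph/admin.secret,noatime,_netdev  0  2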

1) Every once in a while, some processes (PHP) accessing the filesystem get stuck in the D state (uninterruptible sleep). I wonder whether this happens due to network fluctuations (both servers are connected via a simple Gigabit crosslink cable) and how to diagnose it. Why exactly does this happen in the first place? And what is the proper way to get these processes out of this state? Why doesn't a timeout kick in, or anything else? I've read about client eviction, but when I run "ceph daemon mds.node1 session ls" I only see two entries, one for each server. I obviously don't want to evict all processes on the server, only the stuck one. So far, the only method I have found to get rid of the D-state process is to reboot, which is of course not a great solution. When I tried to only restart the MDS service instead of rebooting, many more processes got stuck and the load went above 500 (most probably not CPU-bound, but processes waiting for I/O).
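In case it helps with the diagnosis, this is roughly what I look at when a process gets stuck (plain Linux tooling; the PID is just an example, and the sysrq dump requires kernel.sysrq to allow "w"):

    # kernel stack of the stuck process, to see what it is blocked on
    cat /proc/12345/stack
    # dump all blocked (D-state) tasks into the kernel log
    echo w > /proc/sysrq-trigger
    # look for ceph/libceph messages and the blocked-task dump
    dmesg | tail -n 100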

I found this thread here: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-May/001513.html

Is this (still) relevant to my problem? I also read somewhere that you should not mount CephFS on the same server that runs the MDS, unless you have a "newer" kernel (I can't find where I read this). The information was a bit older, though, so I wonder whether 4.9.6 isn't sufficient, or whether this is still a problem at all...
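If the kernel client really is the problem on nodes that also run the MDS/OSD, would switching those mounts to ceph-fuse be the usual workaround? Something along these lines (monitor address is a placeholder):

    # placeholder monitor address
    ceph-fuse -m 192.168.0.1:6789 /var/www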

2) A second, also still unsolved problem: most of the time "ceph health" shows something like "Client node2 failing to respond to cache pressure". Restarting the MDS removes this message for a while before it appears again. I could make the message go away by setting "mds cache size" higher than the total number of files/folders on the whole filesystem, which is obviously not a scalable solution. The message doesn't seem to cause any problems, though. Nevertheless, I'd like to solve this. By the way: when I run "session ls" I see a very high number of caps held (num_caps is around 80000). Doesn't this mean that around 80000 files are open/occupied by one or more processes? Is this normal? I have some cron jobs which run find or chmod over the whole filesystem from time to time. Could they be responsible for this? Is there some setting to have Ceph release those caps faster/earlier?
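For completeness, this is roughly how I bump the cache size and watch the inode/caps counters at the moment (the value is just an example; mds cache size counts inodes and defaults to 100000 as far as I know):

    # in ceph.conf on the MDS nodes
    [mds]
        mds cache size = 500000

    # check the effective value and the current counters
    ceph daemon mds.node1 config show | grep mds_cache_size
    ceph daemon mds.node1 perf dump mds | grep -i inode
    ceph daemon mds.node1 session ls | grep num_caps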

Thank you / BR

Ranjan

