cephfs: apache locks up after parallel reloads on multiple nodes

Stefan Kooman <stefan@xxxxxx> · Thu, 12 Sep 2019 17:06:43 +0200

Dear list,

We recently switched the shared storage for our Linux shared hosting
platforms from "nfs" to "cephfs". The performance improvements are
noticeable. It all works fine, but there is one peculiar thing: when
Apache reloads after a logrotate of the "error" logs, all but one node
hang for ~ 15 minutes. The log rotations are scheduled with cron, and
the nodes' clocks are synced with ntp. The first node that reloads
Apache keeps working; all the others hang, and after a period of
~ 15 minutes they all recover almost simultaneously.

Our setup looks like this: 10 webservers all sharing the same cephfs
filesystem. Each webserver runs around 100 Apache threads and has around
10,000 open file handles to "error" logs on cephfs. To be clear, all
webservers have file handles on _the same_ "error" logs. The logrotate
takes around two seconds on the "surviving" node.

What could be the reason for this? Does it have something to do with
file locking, i.e. is locking stricter on cephfs than on nfs? What would
be a good way to find the root cause? We have sysdig traces of different
nodes, but on the nodes where Apache hangs not a lot is going on ...
until it all recovers.

We remediated this by delaying the Apache reloads on all but one node.
Then there is no issue at all, even though all the other web servers
still reload at almost the same time.
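
To illustrate the workaround, here is a minimal sketch (not our actual
script). The designated first node "web01", the 30 second delay, the
function names and the "apachectl graceful" reload command are all
assumptions made for the example.

```python
import subprocess
import time


def reload_delay(hostname, first_node, delay_seconds):
    """Return how long this host should wait before reloading Apache.

    The designated first node reloads immediately; every other node waits
    the same fixed delay, so the others still reload together.
    """
    return 0 if hostname == first_node else delay_seconds


def delayed_reload(hostname, first_node="web01", delay_seconds=30,
                   sleep=time.sleep, run=subprocess.run):
    """Wait for this host's delay, then ask Apache to reload gracefully.

    `first_node` and `delay_seconds` are hypothetical example values.
    `sleep` and `run` are injectable so the logic can be tested without
    actually sleeping or touching Apache.
    """
    delay = reload_delay(hostname, first_node, delay_seconds)
    if delay:
        sleep(delay)
    # Assumed reload command; substitute whatever the distribution uses.
    return run(["apachectl", "graceful"], check=True)
```

Something like delayed_reload(socket.gethostname()) could then be called
from the logrotate postrotate hook on every node, so that only the first
node reloads right away.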

Any info / hints on how to investigate this issue further are highly
appreciated.

Gr. Stefan

-- 
| BIT BV  https://www.bit.nl/        Kamer van Koophandel 09090351
| GPG: 0xD14839C6                   +31 318 648 688 / info@xxxxxx
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com