On Fri, Dec 16, 2011 at 00:36, Amon Ott <a.ott@xxxxxxxxxxxx> wrote:
> Our server clusters have quite a few cron jobs as well as Nagios health checks
> that also access the common data area on Ceph FS for configuration and status
> storage. If these jobs hang forever because of a blocked access, they cannot
> finish their other tasks - even if that access is not vital for these other
> tasks. Specially, they can never return a result. You cannot even shutdown
> the system cleanly, if umount blocks forever.

Especially for Nagios checks, you really want to be very defensive
about things that might hang. I've seen too many Nagios checks hang,
e.g. on stalled TCP connections, so I'd recommend using a generic
timeout mechanism to make sure all your checks, everywhere, fail if
they take too long. There's no point in waiting more than 5 minutes
for a health check that runs every 5 minutes. Something like
/usr/bin/timeout from coreutils is your friend.

(Whether this applies to cron jobs is a different conversation. I tend
to think "yes"; I've seen too many systems pile up a hundred instances
of the same hourly cron job.)
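To make the "generic timeout mechanism" idea concrete, here is a minimal sketch of a check wrapper. It swaps in Python's subprocess timeout for coreutils timeout, purely for illustration. The helper name run_check and the choice to report a timed-out check as CRITICAL are my assumptions, not anything Nagios or Ceph mandates; the exit codes 0 to 3 follow the usual Nagios plugin convention.

```python
import subprocess

# Nagios plugin exit codes (the usual plugin convention).
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def run_check(cmd, timeout_s):
    """Run cmd (a list of arguments) and return (exit_code, output_text).

    Hypothetical helper: a check that runs longer than timeout_s seconds
    is reported as CRITICAL (an assumption; UNKNOWN is also defensible).
    """
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child and waits for it before raising.
        return CRITICAL, "CRITICAL: check timed out after %s seconds" % timeout_s
    # Pass the check's own exit code and output through unchanged.
    return proc.returncode, proc.stdout.decode(errors="replace").strip()
```

A caller would print the returned text and exit with the returned code, with timeout_s kept well below the check interval. One caveat: a process stuck in uninterruptible I/O on a blocked filesystem may not die even on SIGKILL, so a wrapper like this bounds the common cases (stalled sockets, slow commands) rather than every possible hang.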