Hi,

> Do you have logs/cores which can help us?

I'll try to produce some for you soon. I've been busy trying to stabilise
the affected production systems during our high-demand period, so this will
have to wait until it's a safe time to incur a deliberate outage.

(The configuration I'm running now, while less prone to crashes, is not one
I intend to keep running long-term, as only one daemon is in use across two
client machines, and without any performance translators. So I have to wait
until after peak time to try to debug the performance-enhanced
one-daemon-per-server configuration I normally run, which I want to get
working again.)

Oh - one thing I have noticed is that post-upgrade from 1.3 TLA 646, there
have been a large number of (for want of a better word) 'unhealable' files -
files that I know were present previously on at least one dataspace block,
but are now only present in the namespace block.

I mention this as there seems to be some correlation between deleting these
files and increasing the time between crashes. It doesn't seem to be as
clear cut as 'self-heal is causing the crash', as processes accessing the
affected files through the GlusterFS export don't cause a crash right there
and then. It just seems to increase the risk of a crash over time. Perhaps
it's some sort of resource leak in self-heal?

Anyway, hopefully the logs - when I can safely produce them - will be able
to resolve the true cause.

> Given the fact that there is a reasonably high demand for it, I think
> we should be adding this support as an option in our protocol. There
> are a few challenges with the current design (like having stateful fd)
> which will need some trickery to accommodate them across reconnects.
> So it may not be implemented immediately, but maybe in 2.1.x or 2.2.x.

Thanks for considering this. If I had a wish list for GlusterFS, this
feature would be at the top of it.

Kind regards,

Geoff Kassel.

On Sat, 17 Jan 2009, Anand Avati wrote:
> > What I've realized is that a blocking GlusterFS client would solve this
> > negative visibility problem for me while I look again at the crash
> > issues. (I've just upgraded to the latest 1.4/2.0 TLA, so my experiences
> > are relevant to the majority again. Yes, I'm still getting crashes.)
>
> Do you have logs/cores which can help us?
>
> > That way, I'd just have to restart the GlusterFS daemon(s), and my
> > running services would block, but not have to be restarted. My clients
> > would see a lack of responsiveness for up to 20 seconds, not a five to
> > ten minute outage.
> >
> > Is there any possibility of this feature being added to GlusterFS?
>
> Given the fact that there is a reasonably high demand for it, I think
> we should be adding this support as an option in our protocol. There
> are a few challenges with the current design (like having stateful fd)
> which will need some trickery to accommodate them across reconnects.
> So it may not be implemented immediately, but maybe in 2.1.x or 2.2.x.
>
> avati
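
For reference, a rough sketch of one way such 'unhealable' files could be
enumerated, assuming a unify-style layout in which the namespace and
dataspace backend directories are readable locally on the server. The paths
and the script itself are illustrative only, not part of GlusterFS or of the
exact configuration described above.

#!/usr/bin/env python
# Sketch: print files present under the namespace export but missing from
# every dataspace export. All paths below are assumed examples.
import os

NAMESPACE = "/export/namespace"                   # assumed namespace brick path
DATASPACES = ["/export/data1", "/export/data2"]   # assumed dataspace brick paths

for dirpath, dirnames, filenames in os.walk(NAMESPACE):
    rel = os.path.relpath(dirpath, NAMESPACE)
    for name in filenames:
        relpath = os.path.normpath(os.path.join(rel, name))
        # A file present only in the namespace (and in no dataspace) is a
        # candidate 'unhealable' entry of the kind described above.
        if not any(os.path.exists(os.path.join(ds, relpath))
                   for ds in DATASPACES):
            print(relpath)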
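
For clarity on the behaviour being requested: it is roughly that of an NFS
'hard' mount, where a call against the mount point blocks and retries for a
bounded window while the server restarts, instead of failing immediately. A
minimal sketch of those retry semantics follows, purely as an illustration -
it is not GlusterFS client code, and the window length, retry interval and
error codes are assumptions.

# Illustration only: "block and retry transport failures for a bounded
# window instead of failing the call immediately".
import errno
import time

RETRY_WINDOW = 20.0     # seconds a call may block while the daemon restarts
RETRY_INTERVAL = 0.5    # pause between retries

def call_with_reconnect(op, *args, **kwargs):
    # Invoke op(); on a transport-style error keep retrying until the
    # window expires, then let the error propagate to the caller.
    deadline = time.time() + RETRY_WINDOW
    while True:
        try:
            return op(*args, **kwargs)
        except (IOError, OSError) as e:
            if e.errno not in (errno.ENOTCONN, errno.ECONNREFUSED):
                raise
            if time.time() >= deadline:
                raise
            time.sleep(RETRY_INTERVAL)

# e.g. wrapping a read against a (hypothetical) mount point:
#   data = call_with_reconnect(lambda: open("/mnt/glusterfs/some/file").read())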