On Mon, Oct 22, 2012 at 3:27 AM, Yann ROBIN <yann.robin@xxxxxxxxxxxxx> wrote:
> Hi,
>
> We use ceph to store small files (lots of them) on different servers and access them through the rados gateway.
> Our data size is 380 GB (very small). We have two hosts with 5 OSDs each.
> We use a small config for ceph: 2 GB RAM servers with 5 x 2 TB disks (one OSD per disk).
> This is a very cheap config that lets us keep our storage cost under control, and it's enough to get the read performance we need.
> (We use this config with mogilefs to store 150 TB of data.)

These node sizes are one of your problems: while the OSDs in normal operation only use 100-200 MB of memory, they can spike quite a lot during recovery. We generally recommend 1 GB of RAM per daemon.

> This weekend we had an alert saying ceph was down.
>
> Looking at the OSDs, we saw a very high load on them (a load of 450), and some were down.
> ceph -s showed that we had down PGs, peering+down PGs, remapped PGs, etc.
>
> So we started to see that while we were peering and so on, the load was very high.
> OSDs stopped responding and we could see log messages like:
> FileStore timeout and Abort Signal

Right. That means the OSD was sending operations down to disk that were taking so long to complete that it timed them out. The default timeout on a FileStore operation is 60 seconds, and if the OSD was actually suiciding, that requires the disk to be nonresponsive for 180 seconds.

> So basically the cluster was under load because it was recovering... but because it was under load, the recovery could not complete.
>
> We changed these params to get a longer timeout:
> filestore op thread suicide timeout = 360
> filestore op thread timeout = 180
> osd default notify timeout = 360

Okay, the first two of those are fine. The third one is actually related to the "notify" OSD operation and isn't helping you here.

> The cluster was still under heavy load, and OSDs were still timing out (fewer timeouts, but still some).
>
> So we tested params to "throttle" the recovery process:
> filestore op threads = 6
> filestore queue max ops = 24
> osd recovery max active = 1

Okay, so now you've increased the number of simultaneous disk operations the FileStore will dispatch from 2 to 6. That probably didn't help. Decreasing "filestore queue max ops" from 500 to 24... probably didn't do anything, but it might have; I'll defer to others on that.
The one thing here that definitely did help is bringing down "osd recovery max active": that's the number of PGs each OSD will try to recover simultaneously, so by dropping it from 5 to 1 you've reduced the total number of recovery operations going on across the cluster.
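For reference, pulling together just the knobs discussed in this thread, a ceph.conf sketch for a cluster in this shape might look something like the snippet below. Treat the values as a starting point rather than a recommendation; they are simply the numbers from this thread.

    [osd]
        # leave the FileStore concurrency at its default of 2 rather than raising it
        filestore op threads = 2
        # give slow disks more time before an op is declared hung or the OSD suicides
        filestore op thread timeout = 180
        filestore op thread suicide timeout = 360
        # recover one PG at a time per OSD instead of the default 5
        osd recovery max active = 1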
> Load was better, but still very high (30).
>
> We also tried putting the journal in a tmpfs with zram.

So, you took away RAM from the system? Ouch, but okay, maybe it's reducing disk usage overall...

> We set noout so the cluster wouldn't copy files to satisfy the replication count while OSDs were out.
>
> We then updated to kernel 3.5 to get the latest xfs optimizations.
>
> In the end nothing was working; we were in the same infinite death loop of recovering => load => timeout => recovering.
> So we updated from ceph 0.48.2 to 0.53; the load was better and recovery finally completed.

Right. You've run into a problem we call "cluster thrashing", in which a problem with one OSD causes it to go out, and then the subsequent map changes and data movement cause other OSDs to fall over as well. This is a problem in argonaut which has been smoothed out a great deal in subsequent development releases by greatly reducing the cost of OSD map updates.

> As we don't want to be in this position again (24h of downtime), I have some questions on ceph/rados.
>
> 1/ Even after we switched to ceph 0.53, the rados gateway was still not responding; the log was showing "Initialization timeout".
> Is it normal that the recovery process keeps us from reading data from ceph?
> The data is there, it is just moving, so why can't we access it?

You had PGs that were in a "down" state, meaning that the OSD which is supposed to be primary for them wasn't servicing requests yet. It takes some time to establish who has the newest version of the data in a PG and gather up an active set.

> 2/ In case of very high load because ceph is moving data, is there a way to tell ceph to go slowly?

There are a lot of switches you can throw to do this, and you threw a number of them. I'm not aware of any others off the top of my head, but Sam or Sage might have more.
-Greg
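P.S. For the archives: most of these switches can also be flipped at runtime with injectargs instead of editing ceph.conf and restarting the OSDs, along the lines of the commands below. The noout flag and the recovery option are the ones from this thread; the injectargs invocation is from memory, so check the exact syntax against the docs for the release you're running.

    # keep the monitors from marking down OSDs out, so no re-replication gets triggered
    ceph osd set noout
    # lower recovery parallelism on a running OSD (repeat for each OSD id)
    ceph osd tell 0 injectargs '--osd-recovery-max-active 1'
    # once the cluster is stable again, let OSDs be marked out normally
    ceph osd unset noout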