On Mon, Oct 22, 2012 at 3:27 AM, Yann ROBIN <yann.robin@xxxxxxxxxxxxx> wrote:
> Hi,
>
> We use ceph to store small files (lots of them) on different servers and access them through the rados gateway.
> Our data size is 380 GB (very small). We have two hosts with 5 OSDs each.
> We use a small config for ceph: 2 GB RAM servers with 5 x 2 TB disks (one OSD per disk).
> This is a very cheap config that lets us keep our storage cost under control, and it's enough to get the read performance we need.
> (We use this config with mogilefs to store 150 TB of data.)

These node sizes are one of your problems: while the OSDs in normal operation only use 100-200 MB of memory, they can spike quite a lot during recovery. We generally recommend 1 GB of RAM per daemon.

> This weekend we had an alert saying ceph was down.
>
> Looking at the OSDs, we saw a very high load on them (a load of 450), and some were down.
> ceph -s showed that we had down PGs, peering+down PGs, remapped PGs, etc.
>
> So we started to see that while we were peering and so on, the load was very high.
> OSDs stopped responding and we could see log messages like:
> FileStore timeout and Abort Signal

Right. That means the OSD was sending operations down to disk that were taking so long to complete that it timed them out. The default timeout on a FileStore operation is 60 seconds, and if the OSD was actually suiciding, that requires the disk to be nonresponsive for 180 seconds.

> So basically the cluster was under load because it was recovering... but because it was under load, the recovery could not complete.
>
> We changed these params to get a longer timeout:
> filestore op thread suicide timeout = 360
> filestore op thread timeout = 180
> osd default notify timeout = 360

Okay, the first two of those are fine. The third one is actually related to the "notify" OSD operation and isn't helping you here.

> The cluster was still under heavy load, and OSDs were still timing out (fewer timeouts, but still some).
>
> So we tested params to "throttle" the recovery process:
> filestore op threads = 6
> filestore queue max ops = 24
> osd recovery max active = 1

Okay, so now you've increased the number of simultaneous disk operations the FileStore will dispatch from 2 to 6. That probably didn't help. Decreasing "filestore queue max ops" from 500 to 24... probably didn't do anything, but it might have; I'll defer to others on that.
The one thing here that definitely did help is bringing down "osd recovery max active": that's the number of PGs each OSD will try to recover simultaneously, so by dropping it from 5 to 1 you've reduced the total number of recovery operations going on across the cluster.
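For reference, pulling together just the knobs discussed in this thread, a ceph.conf sketch for a cluster in this shape might look something like the snippet below. Treat the values as a starting point rather than a recommendation; they are simply the numbers from this thread.

    [osd]
        # leave the FileStore concurrency at its default of 2 rather than raising it
        filestore op threads = 2
        # give slow disks more time before an op is declared hung or the OSD suicides
        filestore op thread timeout = 180
        filestore op thread suicide timeout = 360
        # recover one PG at a time per OSD instead of the default 5
        osd recovery max active = 1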
> Load was better, but still very high (30).
>
> We also tried putting the journal in a tmpfs with zram.

So, you took away RAM from the system? Ouch, but okay, maybe it's reducing disk usage overall...

> We set noout so the cluster wouldn't copy files to satisfy the replication count while OSDs were out.
>
> We then updated to kernel 3.5 to get the latest xfs optimizations.
>
> In the end nothing was working; we were in the same infinite death loop of recovering => load => timeout => recovering.
> So we updated from ceph 0.48.2 to 0.53; the load was better and recovery finally completed.

Right. You've run into a problem we call "cluster thrashing", in which a problem with one OSD causes it to go out, and then the subsequent map changes and data movement cause other OSDs to fall over as well. This is a problem in argonaut which has been smoothed out a great deal in subsequent development releases by greatly reducing the cost of OSD map updates.

> As we don't want to be in this position again (24h of downtime), I have some questions on ceph/rados.
>
> 1/ Even after we switched to ceph 0.53, the rados gateway was still not responding; the log was showing "Initialization timeout".
> Is it normal that the recovery process keeps us from reading data from ceph?
> The data is there, it is just moving, so why can't we access it?

You had PGs that were in a "down" state, meaning that the OSD which is supposed to be primary for them wasn't servicing requests yet. It takes some time to establish who has the newest version of the data in a PG and gather up an active set.

> 2/ In case of very high load because ceph is moving data, is there a way to tell ceph to go slowly?

There are a lot of switches you can throw to do this, and you threw a number of them. I'm not aware of any others off the top of my head, but Sam or Sage might have more.
-Greg
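P.S. For the archives: most of these switches can also be flipped at runtime with injectargs instead of editing ceph.conf and restarting the OSDs, along the lines of the commands below. The noout flag and the recovery option are the ones from this thread; the injectargs invocation is from memory, so check the exact syntax against the docs for the release you're running.

    # keep the monitors from marking down OSDs out, so no re-replication gets triggered
    ceph osd set noout
    # lower recovery parallelism on a running OSD (repeat for each OSD id)
    ceph osd tell 0 injectargs '--osd-recovery-max-active 1'
    # once the cluster is stable again, let OSDs be marked out normally
    ceph osd unset noout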