Thanks for your answers, we will also experiment with osd recovery max active / threads and will come back to you.

Regards,
Kostis

On 16 July 2015 at 12:29, Jan Schermer <jan@xxxxxxxxxxx> wrote:
> For me, setting recovery_delay_start helps during the OSD bootup _sometimes_, but it clearly does something different from what’s in the docs.
>
> Docs say:
> After peering completes, Ceph will delay for the specified number of seconds before starting to recover objects.
>
> However, what I see is greatly slowed recovery, not a delayed start of recovery. It seems to basically sleep between recovering the PGs. AFAIK peering is already done unless I was remapping the PGs at the same moment, so I’m not sure what’s happening there in reality.
>
> We had this set to 20 for some time and recovery after a host restart took close to two hours.
> With this parameter set to 0, it recovered in less than 30 seconds (and caused no slow requests or anything).
>
> So what I usually do is set this to a high number (like 200), and after all the OSDs are started I set it to 0. This does not completely prevent slow requests from happening, but it does somewhat help…
>
> Jan
>
>> On 15 Jul 2015, at 11:52, Andrey Korolyov <andrey@xxxxxxx> wrote:
>>
>> On Wed, Jul 15, 2015 at 12:15 PM, Jan Schermer <jan@xxxxxxxxxxx> wrote:
>>> We have the same problems; we need to start the OSDs slowly.
>>> The problem seems to be CPU congestion. A booting OSD will use all the CPU power you give it, and if it doesn’t have enough, nasty stuff happens (this might actually be the manifestation of some kind of problem in our setup as well).
>>> It doesn’t always do that - I was restarting our hosts this weekend and most of them came up fine with a simple “service ceph start”, but some just sat there spinning the CPU and not doing any real work (and the cluster was not very happy about that).
>>>
>>> Jan
>>>
>>>> On 15 Jul 2015, at 10:53, Kostis Fardelas <dante1234@xxxxxxxxx> wrote:
>>>>
>>>> Hello,
>>>> after some trial and error we concluded that if we start the 6 stopped OSD daemons with a delay of 1 minute between them, we do not experience slow requests (the threshold is set to 30 sec), although there are still some ops that last up to 10 s, which is already high enough. I assume that if we spread the delay out more, the slow requests will vanish. The possibility of not having tuned our setup down to the finest detail cannot be ruled out, but I wonder whether we are missing some ceph tuning in terms of ceph configuration.
>>>>
>>>> We run the latest stable firefly version.
>>>>
>>>> Regards,
>>>> Kostis
>>>>
>>>> On 13 July 2015 at 13:28, Kostis Fardelas <dante1234@xxxxxxxxx> wrote:
>>>>> Hello,
>>>>> after rebooting a ceph node and its OSDs booting and rejoining the cluster, we experience slow requests that get resolved immediately after the cluster recovers. It is important to note that before the node reboot we set the noout flag in order to prevent recovery - so there are only degraded PGs while the OSDs are down - and let the cluster handle the OSDs going down/up in the lightest way.
>>>>>
>>>>> Is there any tunable we should consider in order to avoid service degradation for our ceph clients?
>>>>>
>>>>> Regards,
>>>>> Kostis
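For reference, the staggered restart and recovery throttling discussed above would look roughly like this (a sketch only: the OSD ids and the throttle values are placeholders, and the service invocation assumes firefly-era sysvinit scripts):

    # Before the planned reboot: stop the down OSDs from being marked out,
    # so no data re-replication starts while the node is offline
    ceph osd set noout

    # After the node is back up, start its OSDs one at a time,
    # one minute apart (osd ids 0-5 are placeholders)
    for id in 0 1 2 3 4 5; do
        service ceph start osd.$id
        sleep 60
    done

    ceph osd unset noout

    # Throttle recovery/backfill concurrency at runtime (illustrative values)
    ceph tell osd.* injectargs '--osd_recovery_max_active 1 --osd_max_backfills 1'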
>>
>> As far as I’ve seen this problem, the main issue for regular disk-backed OSDs is IOPS starvation for some interval after the OSD reads its maps from the filestore and marks itself as 'in' - even if the in-memory caches are still hot, I/O will significantly degrade for a short period. The possible workaround for an otherwise healthy cluster and a node-wide restart is to set the norecover flag; it greatly reduces the chance of hitting slow operations. Of course it is applicable only to a non-empty cluster with tens of percent of average utilization on rotating media. I pointed out this issue a couple of years ago (it *does* break a 30s I/O SLA for the returning OSD, whereas refilling the same OSDs from scratch would not violate that SLA, at the cost of a far longer completion time for the refill). From the UX side, it would be great to introduce some kind of recovery throttler for newly started OSDs, as recovery_delay_start does not prevent immediate recovery procedures.
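For what it's worth, the norecover workaround (combined with Jan's recovery_delay_start trick from earlier in the thread) would look roughly like this on the command line; a sketch only, not a tested procedure:

    # Hold off recovery while the node restarts and its OSDs boot and peer
    ceph osd set norecover

    # ... reboot the node, wait for its OSDs to come up and for PGs to peer ...

    # Let recovery proceed once peering is done
    ceph osd unset norecover

    # Jan's workaround: keep recovery_delay_start high while the OSDs start,
    # then drop it to 0 once they are all up
    ceph tell osd.* injectargs '--osd_recovery_delay_start 200'
    ceph tell osd.* injectargs '--osd_recovery_delay_start 0'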