Re: OSD's load_pgs takes a lot of time

Sage Weil <sweil@xxxxxxxxxx> · Tue, 26 Aug 2014 06:03:43 -0700 (PDT)

On Tue, 26 Aug 2014, Michał Szymański wrote:
> I have noticed that sometimes it takes a lot of time for an OSD to go
> back up and in. From what I can see in the logs it is stuck on
> load_pgs for a while:
> 2014-08-21 15:32:04.711048 7fba11569780  0 osd.1 139 load_pgs
> 2014-08-21 15:32:04.712512 7fba11569780 10 osd.1 139 load_pgs 3.165_TEMP clearing temp
> 2014-08-21 15:32:19.648610 7fba11569780 10 osd.1 139 load_pgs 3.13b_TEMP clearing temp
> 2014-08-21 15:32:34.674773 7fba11569780 10 osd.1 139 load_pgs 3.36b_TEMP clearing temp
> 
> It happens when you restart an OSD while there was an ongoing recovery
> in the cluster. The process is neither IO nor CPU heavy at that
> time, and judging by strace output it mostly does futex calls and a
> little IO on PGs. I am using Ceph 0.80.5.
> 
> Has anybody noticed this behavior? Isn't it possible to clear temp faster?

load_pgs is doing a lot more than just clearing temp; most of the work
just isn't logged at debug osd = 10.  It loads the PG logs (recent
operations) into memory and scans them to ensure the objects they
reference aren't missing locally.
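
Since the work between the "clearing temp" lines is not logged at this
level, one rough way to see where the time goes is to measure the gaps
between successive load_pgs entries in the OSD log. The following is
only a sketch, not part of Ceph: the function name load_pgs_gaps and the
regular expression are assumptions based on the log format quoted above,
and it expects each log entry on a single, unwrapped line.

```python
import re
from datetime import datetime

# Assumed line shape (taken from the log excerpt quoted above):
# 2014-08-21 15:32:04.712512 7fba11569780 10 osd.1 139 load_pgs 3.165_TEMP clearing temp
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) "  # timestamp
    r"\S+\s+\d+ "                                     # thread id, debug level
    r"osd\.\d+ \d+ load_pgs\s*(.*)$"                  # osd id, epoch, message
)


def load_pgs_gaps(lines):
    """Return (seconds_since_previous_load_pgs_line, message) pairs.

    Each gap is the time that elapsed before the given line was logged,
    i.e. time spent in work that was not logged in between.
    """
    gaps = []
    prev = None
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip anything that is not a load_pgs entry
        ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f")
        if prev is not None:
            gaps.append(((ts - prev).total_seconds(), m.group(2)))
        prev = ts
    return gaps
```

Feeding it the lines of an OSD log file (for example via open(path))
and sorting the result by the first element would show which entries
were preceded by the longest pauses.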

There are likely several things we could do to speed this up (especially 
the search_for_missing() call) but they're a bit complex and not at the 
top of the list...

sage