Re: Significantly increased CPU footprint on OSDs after Hammer -> Jewel upgrade, OSDs occasionally wrongly marked as down

----- On 26 Oct 2016, at 16:37, Sage Weil sage@xxxxxxxxxxxx wrote:
> On Wed, 26 Oct 2016, Trygve Vea wrote:
>> ----- On 26 Oct 2016, at 14:41, Sage Weil sage@xxxxxxxxxxxx wrote:
>> > On Wed, 26 Oct 2016, Trygve Vea wrote:
>> >> Hi,
>> >> 
>> >> We have two Ceph clusters: one exposing pools for both RGW and RBD
>> >> (OpenStack/KVM), and one exposing pools for RBD only.
>> >> 
>> >> After upgrading both to Jewel, we have seen a significantly increased CPU
>> >> footprint on the OSDs that are a part of the cluster which includes RGW.
>> >> 
>> >> This graph illustrates this: http://i.imgur.com/Z81LW5Y.png
>> > 
>> > That looks pretty significant!
>> > 
>> > This doesn't ring any bells--I don't think it's something we've seen.  Can
>> > you do a 'perf top -p `pidof ceph-osd`' on one of the OSDs and grab a
>> > snapshot of the output?  It would be nice to compare to hammer but I
>> > expect you've long since upgraded all of the OSDs...
>> 
>> # perf record -p 18001
>> ^C[ perf record: Woken up 57 times to write data ]
>> [ perf record: Captured and wrote 18.239 MB perf.data (408850 samples) ]
>> 
>> 
>> This is a screenshot of one of the osds during high utilization:
>> http://i.imgur.com/031MyIJ.png
> 
> It looks like a ton of time spent in std::string methods and a lot more
> map<string,ghobject_t> activity than I would expect.  Can you do a
> 
> perf record -p `pidof ceph-osd` -g
> perf report --stdio

Here you go:

http://employee.tv.situla.bitbit.net/stdio-report.gz?AWSAccessKeyId=V4NZ37SLP3VOPR2BI5UW&Expires=1477579744&Signature=pt8CvsaVHhYCtJ1kUfRsKq4MY7k%3D
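For anyone skimming a large `perf report --stdio` dump like the one above, a quick way to see where the std::string time concentrates is to sum the overhead column per symbol. This is a minimal sketch, not part of the thread's tooling; the line format and sample data below are assumptions based on typical perf flat-report output (`percent  command  dso  [.] symbol`).

```python
import re
from collections import defaultdict

# Assumed flat-report line format: "  12.34%  comm  dso  [.] symbol"
# ([.k] distinguishes user-space vs. kernel symbols).
LINE = re.compile(r"^\s*(\d+\.\d+)%\s+\S+\s+\S+\s+\[[.k]\]\s+(.+?)\s*$")

def top_symbols(report_text, n=5):
    """Return the n symbols with the largest summed overhead percentage."""
    totals = defaultdict(float)
    for line in report_text.splitlines():
        m = LINE.match(line)
        if m:
            totals[m.group(2)] += float(m.group(1))
    return sorted(totals.items(), key=lambda kv: -kv[1])[:n]

# Hypothetical sample lines, for illustration only.
sample = """\
    40.10%  ceph-osd  libstdc++.so.6  [.] std::string::assign
    12.50%  ceph-osd  ceph-osd        [.] std::map<std::string, ghobject_t>::find
     5.00%  ceph-osd  libstdc++.so.6  [.] std::string::assign
"""

for sym, pct in top_symbols(sample):
    print(f"{pct:6.2f}%  {sym}")
```

With `-g` call-graph output the report gains indented caller/callee trees, so a summary like this should only be run over the flat (self-overhead) lines.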


>> Link to download binary format sent directly to you.
>> 
>> 
>> Your expectation about upgrades is correct.  We actually had some
>> problems performing the upgrade, so we ended up re-initializing the
>> OSDs as empty and backfilling into Jewel.  When we first started them
>> on Jewel, they ended up blocking.
> 
> Hrm, this is a new one for me too.  They've all been upgraded now?  It
> would be nice to see a log or backtrace to see why they got stuck.

Sorry, I cannot provide this information anymore :(


-- 
Trygve
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


