Hi,

I have a cluster with 16 OSDs spread over 4 physical machines, each machine running 4 OSD processes. One of these OSDs is periodically using 100% of the CPU. If you aggregate the total CPU time of the processes over a long period, it clearly uses roughly 6x more CPU than any other OSD. The numbers for the other 15 OSDs (both on the same machine and on the other machines) are quite consistent with one another.

The PG distribution isn't ideal (some OSDs have more PGs than others), but it isn't bad either; there is no OSD with twice as many PGs as another, for example. I also ran a full SMART self-test on all the drives hosting OSD data, but that didn't uncover anything. The logs (at the default logging level) don't show anything abnormal either.

The problem also seems to have been exacerbated by my recent update from Dumpling to Emperor this weekend. For reference, here are the CPU usage graphs for the 16 OSDs over the last 6 months: http://i.imgur.com/cno73Ea.png

The red line is osd.14, the problematic one. As you can see, it recently "flared up" a lot, but even before the update it was much higher than the others and rising, which is a troubling trend.

Any idea what this could be? How can I isolate it and solve it?

Cheers,
Sylvain
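
P.S. In case anyone wants to reproduce the per-OSD PG count comparison mentioned above, here is a rough sketch of one way to tally PGs per OSD from "ceph pg dump --format json". The JSON key names used below ("pg_stats", "pg_map", "acting") are assumptions and may differ between Ceph releases, so adjust the lookups to match your output.

#!/usr/bin/env python
# Rough sketch: tally how many PGs each OSD serves by parsing the
# output of "ceph pg dump --format json". The key names ("pg_stats",
# "pg_map", "acting") are assumptions and may vary between Ceph
# releases, so adjust them if your JSON layout is different.
import json
import subprocess
from collections import Counter

raw = subprocess.check_output(["ceph", "pg", "dump", "--format", "json"])
data = json.loads(raw)

# Some releases put the stats at the top level, others nest them
# under "pg_map"; try both.
pg_stats = data.get("pg_stats") or data.get("pg_map", {}).get("pg_stats", [])

counts = Counter()
for pg in pg_stats:
    for osd in pg.get("acting", []):
        counts[osd] += 1

for osd in sorted(counts):
    print("osd.%s: %s PGs" % (osd, counts[osd]))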