RGW hung, 2 OSDs using 100% CPU

clewis@xxxxxxxxxxxxxxxxxx (Craig Lewis) · Thu, 18 Sep 2014 17:49:40 -0700

No, removing the snapshots didn't solve my problem.  I eventually traced
it to XFS deadlocks caused by these mkfs options:
[osd]
  "osd mkfs options xfs": "-l size=1024m -n size=64k -i size=2048 -s size=4096"

Changing to just "-s size=4096" and reformatting all OSDs solved this
problem.
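
As a rough sketch, assuming the option is set directly in ceph.conf
rather than in the quoted key/value form shown above, the working
setting would look something like this.  mkfs options only apply when
an OSD's filesystem is created, which is why the reformat was needed.

```ini
# Sketch only: plain ceph.conf layout is an assumption, the value is the
# one that worked for me.
[osd]
  # Dropped: -l size=1024m -n size=64k -i size=2048
  osd mkfs options xfs = -s size=4096
```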

Since then, I ran into http://tracker.ceph.com/issues/5699.  Snapshots are
off until I've deployed Firefly.

On Wed, Sep 17, 2014 at 8:09 AM, Florian Haas <florian at hastexo.com> wrote:

> Hi Craig,
>
> just dug this up in the list archives.
>
> On Fri, Mar 28, 2014 at 2:04 AM, Craig Lewis <clewis at centraldesktop.com> wrote:
> > In the interest of removing variables, I removed all snapshots on all pools,
> > then restarted all ceph daemons at the same time.  This brought up osd.8 as
> > well.
>
> So just to summarize this: your 100% CPU problem at the time went away
> after you removed all snapshots, and the actual cause of the issue was
> never found?
>
> I am seeing a similar issue now, and have filed
> http://tracker.ceph.com/issues/9503 to make sure it doesn't get lost
> again. Can you take a look at that issue and let me know if anything
> in the description sounds familiar?
>
> You mentioned in a later message in the same thread that you would
> keep your snapshot script running and "repeat the experiment". Did the
> situation change in any way after that? Did the issue come back? Or
> did you just stop using snapshots altogether?
>
> Cheers,
> Florian
>