Dear all,

For about a month we have been experiencing something strange in our small cluster. Let me first describe what happened along the way.

On Oct 4th smartmon warned us that the journal SSDs in one of our two ceph nodes were about to fail. Since getting replacements took far longer than expected, we decided to place the journal on a spare HDD rather than let the SSD fail and leave us in an uncertain state. On Oct 17th we finally received the replacement SSDs. First we replaced the broken (or soon-to-be-broken) SSD and moved the journals from the temporarily used HDD to the new SSD. Then we also replaced the journal SSD on the other ceph node, since it would probably have failed sooner or later.

We performed all operations by setting noout first, then taking down the OSDs, flushing the journals, replacing the disks, creating new journals and starting the OSDs again (see the command sketch at the end of this mail). We waited until the cluster was back in HEALTH_OK state before we proceeded to the next node. AFAIR mkjournal crashed once on the second node; we ran the command again and the journals were created.

The next morning at 6:25 (the time of the cron.daily jobs on Debian systems) we registered almost 2000 slow requests. We have had slow requests before, but never more than 900 per day, and even that was rare. Another odd thing we noticed is that the cluster had grown by 50GB overnight! We currently run 12 vservers from ceph images and none of them are particularly busy; normally used data grows by 2GB per week or less. Network traffic between our three monitors roughly doubled at the same time and has stayed at that level ever since.

We eventually got rid of all the slow requests by removing all but one snapshot per image. We used to take nightly snapshots of all images and keep 14 snapshots per image. Now we take one snapshot per image per night, use export-diff to offload the diff to storage outside of ceph, and remove the nightly snapshot right away (a sketch of that workflow is also at the end of this mail). The only snapshot we keep is the one the diffs are based on.

What remains is the growth of used data in the cluster. I have put background information about our cluster and some graphs of different metrics on a wiki page: https://wiki.mur.at/Dokumentation/CephCluster

Basically we need to reduce the growth in the cluster, but since we are not sure what is causing it, we don't know where to start. So the main question I have is: what went wrong when we replaced the journal disks? And of course: how can we fix it?

As always, any hint appreciated!

Regards,
--
J.Hofmüller

Ich zitiere wie Espenlaub. - https://twitter.com/TheGurkenkaiser/status/463444397678690304
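
P.S.: For reference, the journal move on each OSD went roughly like this. This is only a sketch: $ID and the new journal partition are placeholders, the symlink path assumes a default /var/lib/ceph layout, and the stop/start commands depend on the init system (sysvinit shown, systemd variant in the comments).

    ceph osd set noout                   # keep CRUSH from marking the OSD out and rebalancing
    service ceph stop osd.$ID            # or: systemctl stop ceph-osd@$ID
    ceph-osd -i $ID --flush-journal      # write everything still in the old journal to the store
    # swap/partition the new SSD, then point the OSD at the new journal device,
    # e.g. by recreating the symlink /var/lib/ceph/osd/ceph-$ID/journal
    ceph-osd -i $ID --mkjournal          # initialize a fresh journal on the new device
    service ceph start osd.$ID           # or: systemctl start ceph-osd@$ID
    ceph osd unset noout
    ceph -s                              # wait for HEALTH_OK before touching the next node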
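
The nightly backup job is now roughly the following (again just a sketch: $POOL, $IMG, the "base" snapshot name and the target path are placeholders for our actual names):

    TODAY=$(date +%F)
    rbd snap create $POOL/$IMG@$TODAY                       # nightly snapshot
    rbd export-diff --from-snap base $POOL/$IMG@$TODAY \
        /backup/$IMG-$TODAY.diff                            # diff against the kept base snapshot, stored outside ceph
    rbd snap rm $POOL/$IMG@$TODAY                           # drop the nightly snapshot right away

The diffs can later be replayed onto a copy of the base image with rbd import-diff if we ever need to restore.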