Dear all,

For about a month we have been experiencing something strange in our small cluster. Let me first describe what happened along the way.

On Oct 4th smartmon warned us that the journal SSDs in one of our two ceph nodes were about to fail. Since getting replacements took far longer than expected, we decided to place the journal on a spare HDD rather than let the SSD fail and leave us in an uncertain state. On Oct 17th we finally received the replacement SSDs. First we replaced the broken (or soon-to-be-broken) SSD and moved the journals from the temporarily used HDD to the new SSD. Then we also replaced the journal SSD on the other ceph node, since it would probably have failed sooner or later.

We performed all operations by setting noout first, then taking down the OSDs, flushing the journals, replacing the disks, creating new journals and starting the OSDs again (see the command sketch at the end of this mail). We waited until the cluster was back in HEALTH_OK state before we proceeded to the next node. AFAIR mkjournal crashed once on the second node; we ran the command again and the journals were created.

The next morning at 6:25 (the time of the cron.daily jobs on Debian systems) we registered almost 2000 slow requests. We have had slow requests before, but never more than 900 per day, and even that was rare. Another odd thing we noticed is that the cluster had grown by 50GB overnight! We currently run 12 vservers from ceph images and none of them are particularly busy; normally used data grows by 2GB per week or less. Network traffic between our three monitors roughly doubled at the same time and has stayed at that level ever since.

We eventually got rid of all the slow requests by removing all but one snapshot per image. We used to take nightly snapshots of all images and keep 14 snapshots per image. Now we take one snapshot per image per night, use export-diff to offload the diff to storage outside of ceph, and remove the nightly snapshot right away (a sketch of that workflow is also at the end of this mail). The only snapshot we keep is the one the diffs are based on.

What remains is the growth of used data in the cluster. I have put background information about our cluster and some graphs of different metrics on a wiki page: https://wiki.mur.at/Dokumentation/CephCluster

Basically we need to reduce the growth in the cluster, but since we are not sure what is causing it, we don't know where to start. So the main question I have is: what went wrong when we replaced the journal disks? And of course: how can we fix it?

As always, any hint appreciated!

Regards,
--
J.Hofmüller

Ich zitiere wie Espenlaub. - https://twitter.com/TheGurkenkaiser/status/463444397678690304
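
P.S.: For reference, the journal move on each OSD went roughly like this. This is only a sketch: $ID and the new journal partition are placeholders, the symlink path assumes a default /var/lib/ceph layout, and the stop/start commands depend on the init system (sysvinit shown, systemd variant in the comments).

    ceph osd set noout                   # keep CRUSH from marking the OSD out and rebalancing
    service ceph stop osd.$ID            # or: systemctl stop ceph-osd@$ID
    ceph-osd -i $ID --flush-journal      # write everything still in the old journal to the store
    # swap/partition the new SSD, then point the OSD at the new journal device,
    # e.g. by recreating the symlink /var/lib/ceph/osd/ceph-$ID/journal
    ceph-osd -i $ID --mkjournal          # initialize a fresh journal on the new device
    service ceph start osd.$ID           # or: systemctl start ceph-osd@$ID
    ceph osd unset noout
    ceph -s                              # wait for HEALTH_OK before touching the next node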
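
The nightly backup job is now roughly the following (again just a sketch: $POOL, $IMG, the "base" snapshot name and the target path are placeholders for our actual names):

    TODAY=$(date +%F)
    rbd snap create $POOL/$IMG@$TODAY                       # nightly snapshot
    rbd export-diff --from-snap base $POOL/$IMG@$TODAY \
        /backup/$IMG-$TODAY.diff                            # diff against the kept base snapshot, stored outside ceph
    rbd snap rm $POOL/$IMG@$TODAY                           # drop the nightly snapshot right away

The diffs can later be replayed onto a copy of the base image with rbd import-diff if we ever need to restore.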