Oh, forgot to ask, any core dumps?
Mark
On 11/30/2015 06:58 PM, Mark Nelson wrote:
Hi Laurent,
Wow, that's excessive! I'd see if anyone else has any tricks first, but
if nothing else helps, running an OSD through valgrind with massif will
probably help pinpoint what's going on. Have you tweaked the recovery
tunables at all?
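For anyone wanting to try the massif route, here is a minimal sketch of how the command line could be assembled. It assumes valgrind is installed and that the OSD is started by hand in the foreground with the same arguments shown in the process listing quoted below. The output file name is a made-up example, and the command is only printed, not run.

```python
import shlex


def massif_command(osd_id, cluster="ceph"):
    """Build a valgrind/massif command line for one OSD (not executed here)."""
    return [
        "valgrind",
        "--tool=massif",                                  # valgrind's heap profiler
        "--massif-out-file=massif.out.osd.%d" % osd_id,   # hypothetical file name
        "/usr/bin/ceph-osd",
        "--cluster=%s" % cluster,
        "-i", str(osd_id),
        "-f",                                             # keep the daemon in the foreground
    ]


cmd = massif_command(41)
print(" ".join(shlex.quote(part) for part in cmd))
```

The resulting output file can then be summarized with ms_print. Expect the OSD to run much slower under valgrind, so it is probably best tried on a single OSD.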
Mark
On 11/30/2015 06:52 PM, Laurent GUERBY wrote:
Hi,
We lost a disk today in our Ceph cluster, so we added a new machine with
4 disks to replace the capacity, and we also activated the straw1 tunable
(we tried straw2 as well, but quickly backed that change out).
During recovery, OSDs started crashing on all of our machines,
the issue being OSD RAM usage that goes very high, e.g.:
24078 root 20 0 27.784g 0.026t 10888 S 5.9 84.9 16:23.63 /usr/bin/ceph-osd --cluster=ceph -i 41 -f
/dev/sda1 2.7T 2.2T 514G 82% /var/lib/ceph/osd/ceph-41
That's about 8 GB of resident RAM per TB of disk, way above
the ~2-4 GB RAM/TB we provisioned.
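As a rough cross-check of that ratio, here is a small sketch that converts the RES column from top and the size column from df into GiB and divides them. It assumes both tools use binary suffixes (powers of 1024). The helper names are made up for illustration. With the rounded values pasted above it comes out close to 10 GiB per TiB, the same ballpark as the estimate and well above the 2-4 GB/TB budget.

```python
# Suffix -> multiplier to GiB, assuming binary (1024-based) units.
UNIT_GIB = {"m": 1.0 / 1024, "g": 1.0, "t": 1024.0}


def to_gib(field):
    """Convert a size such as '0.026t' or '2.7T' to GiB."""
    return float(field[:-1]) * UNIT_GIB[field[-1].lower()]


def resident_gib_per_tib(res_field, disk_field):
    """Resident memory (GiB) divided by disk size (TiB)."""
    return to_gib(res_field) / (to_gib(disk_field) / 1024.0)


# RES from the top line and Size from the df line for osd.41.
ratio = resident_gib_per_tib("0.026t", "2.7T")
print("%.1f GiB resident per TiB of disk" % ratio)
```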
We rebuilt 0.94.5 with the three memory related commits below but
it didn't change anything.
Right now our cluster is unable to fully restart and recover with the
machines and RAM we have been working with for the past year.
Any idea on what to look for?
Thanks in advance,
Sincerely,
Laurent
commit 296bec72649884447b59e785c345c53994df9e09
Author: xiexingguo <258156334@xxxxxx>
Date: Mon Oct 26 18:38:01 2015 +0800
FileStore: potential memory leak if _fgetattrs fails
Memory leak happens if _fgetattrs encounters some error and simply
returns.
Fixes: #13597
Signed-off-by: xie xingguo <xie.xingguo@xxxxxxxxxx>
(cherry picked from commit ace7dd096b58a88e25ce16f011aed09269f2a2b4)
commit 16aa14ab0208df568e64e2a4f7fe7692eaf6b469
Author: Xinze Chi <xmdxcxz@xxxxxxxxx>
Date: Sun Aug 2 18:36:40 2015 +0800
bug fix: osd: do not cache unused buffer in attrs
attrs only reference the origin bufferlist (decode from MOSDPGPush or
ECSubReadReply message) whose size is much greater than attrs in
recovery. If obc cache it (get_obc maybe cache the attr), this causes
the whole origin bufferlist would not be free until obc is evicted
from obc cache. So rebuild the bufferlist before cache it.
Fixes: #12565
Signed-off-by: Ning Yao <zay11022@xxxxxxxxx>
Signed-off-by: Xinze Chi <xmdxcxz@xxxxxxxxx>
(cherry picked from commit c5895d3fad9da0ab7f05f134c49e22795d5c61f3)
commit 51ea1ca7f4a7763bfeb110957cd8a6f33b8a1422
Author: xiexingguo <258156334@xxxxxx>
Date: Thu Oct 29 20:04:11 2015 +0800
Objecter: pool_op callback may hang forever.
pool_op callback may hang forever due to osdmap update during reply
handling.
Fixes: #13642
Signed-off-by: xie xingguo <xie.xingguo@xxxxxxxxxx>
(cherry picked from commit 00c6fa9e31975a935ed2bb33a099e2b4f02ad7f2)
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com