Hi Ilya,

> ISTR there were some anti-spam measures put in place. Is your account
> waiting for manual approval? If so, David should be able to help.

Yes, if I remember correctly I get "waiting approval" when I try to
log in.

>> Dec 1 03:14:36 c04 kernel: ceph: build_snap_context 100020c9287
>> ffff911a9a26bd00 fail -12
>> Dec 1 03:14:36 c04 kernel: ceph: build_snap_context 100020c9283
>
> It is failing to allocate memory. "low load" isn't very specific,
> can you describe the setup and the workload in more detail?

Four nodes (osd and mon combined); the 4th node has a local cephfs
mount, which is rsync'ing some files from VMs. By 'low load' I mean
this is a sort of test setup, on its way to production. Mostly the
nodes are below a load of 1 (except when the concurrent rsyncs start).

> How many snapshots do you have?

I don't know how to count them. I have a script running on about 2000
dirs; if one of these dirs is not empty, it creates a snapshot. So in
theory I could have 2000 x 7 days = 14000 snapshots. (A rough counting
sketch is at the end of this mail.) By the way, the cephfs snapshots
are in a different tree than the one rsync is using.

> Do you keep track of memory consumption on the node?

A bit; attached is a nagios graph. I have 100 GB in this node. Since
then, I have disabled all the hugepages (2 MB and 1 GB) I created
there, to free up more memory.

> Finally, you say "crash" in the subject. Does the kernel actually
> crash or perhaps it locks up? If it actually crashes, do you have the
> panic message?

The whole server was gone. The logs are from the remote syslog server.

New situation: with more memory, and the kernel updated to
3.10.0-1062.4.3.el7.x86_64, rsync is very slow and I have a kworker at
100% load.
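To put a number on the snapshot question, here is a minimal counting
sketch in Python. It assumes the snapshots live under each directory's
virtual .snap entry, which is how cephfs exposes them; the base path
/mnt/cephfs/backups is a placeholder, not my actual tree:

#!/usr/bin/env python
import os

# Placeholder for the real tree holding the ~2000 dirs the snapshot
# script runs on; adjust to the actual cephfs mount point.
BASE = "/mnt/cephfs/backups"

total = 0
for name in os.listdir(BASE):
    d = os.path.join(BASE, name)
    if not os.path.isdir(d):
        continue
    try:
        # cephfs exposes a directory's snapshots as entries in its
        # virtual .snap directory (one entry per snapshot).
        snaps = os.listdir(os.path.join(d, ".snap"))
    except OSError:
        continue  # .snap missing or unreadable, skip this dir
    total += len(snaps)

print("total snapshots: %d" % total)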
Attachment:
c04-memory.png
Description: Binary data