Hi,

The output of ceph -s:

    cluster 50961297-815c-4598-8efe-5e08203f9fea
     health HEALTH_OK
     monmap e5: 5 mons at {pshn05=10.71.13.5:6789/0,pshn06=10.71.13.6:6789/0,pshn13=10.71.13.13:6789/0,psosctl111=10.71.13.111:6789/0,psosctl112=10.71.13.112:6789/0}, election epoch 258, quorum 0,1,2,3,4 pshn05,pshn06,pshn13,psosctl111,psosctl112
     mdsmap e173: 1/1/1 up {0=pshn17=up:active}, 4 up:standby
     osdmap e21319: 16 osds: 16 up, 16 in
      pgmap v3301189: 384 pgs, 3 pools, 4906 GB data, 3794 kobjects
            9940 GB used, 10170 GB / 21187 GB avail
                 384 active+clean

I don't use any Ceph client (kernel or FUSE) on the same nodes that run the OSD/MON/MDS daemons.

Yes, I see slow-operation warnings from time to time when I'm watching ceph -w. The number of IOPS on the servers isn't that high, and I think the write-back cache of the RAID controller should be able to help with the journal ops.

Simion Rad.

________________________________________
From: Gregory Farnum [greg@xxxxxxxxxxx]
Sent: Tuesday, July 14, 2015 12:38
To: Simion Rad
Cc: ceph-users@xxxxxxxx
Subject: Re: ceph daemons stucked in FUTEX_WAIT syscall

On Mon, Jul 13, 2015 at 11:00 PM, Simion Rad <Simion.Rad@xxxxxxxxx> wrote:
> Hi,
>
> I'm running a small CephFS cluster (21 TB, 16 OSDs of different sizes
> between 400 GB and 3.5 TB) that is used as a file warehouse (both small
> and big files).
> Every day there are times when a lot of processes running on the client
> servers (using either the FUSE or the kernel client) become stuck in D
> state, and when I strace them I see them waiting in the FUTEX_WAIT
> syscall.
> I can see the same issue on all OSD daemons.
> The Ceph version I'm running is Firefly 0.80.10, both on the clients and
> on the server daemons.
> I use ext4 as the OSD filesystem.
> Operating system on the servers: Ubuntu 14.04, kernel 3.13.
> Operating system on the clients: Ubuntu 12.04 LTS with the HWE option,
> kernel 3.13.
> The OSD daemons use RAID5 virtual disks (6 x 300 GB 10K RPM disks on a
> Dell PERC H700 RAID controller with a 512 MB BBU in write-back mode).
> The servers the Ceph daemons run on are also hosting KVM VMs (OpenStack
> Nova).
> Because of this unfortunate setup the performance is really bad, but at
> least I shouldn't see this many locking issues (or should I?).
> The only thing that temporarily improves the performance is restarting
> every OSD. After such a restart I see some processes on the client
> machines resume I/O, but only for a couple of hours; then the whole
> procedure must be repeated.
> I cannot afford to run a setup without RAID because there isn't enough
> RAM left for a couple of OSD daemons.
>
> The ceph.conf settings I use:
>
> auth cluster required = cephx
> auth service required = cephx
> auth client required = cephx
> filestore xattr use omap = true
> osd pool default size = 2
> osd pool default min size = 1
> osd pool default pg num = 128
> osd pool default pgp num = 128
> public network = 10.71.13.0/24
> cluster network = 10.71.12.0/24
>
> Has anyone else experienced this kind of behaviour (processes stuck in
> the FUTEX_WAIT syscall) when running the Firefly release on Ubuntu 14.04?

What's the output of "ceph -s" on your cluster? When your clients get
stuck, is the cluster complaining about stuck ops on the OSDs? Are you
running kernel clients on the same boxes as your OSDs?
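For example, something like the following would show whether the cluster is
reporting blocked requests, and would let you peek at the in-flight ops on one
OSD (the admin-socket path is the stock default and osd.0 is just a
placeholder; adjust both for your install):

    ceph health detail | grep -i blocked
    ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_ops_in_flight
    ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_historic_ops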
If I were to guess, I'd imagine that you might just have overloaded your
cluster and the FUTEX_WAIT is the clients waiting for writes to get
acknowledged, but if restarting the OSDs brings everything back up for a few
hours, that might not be the case.
-Greg

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
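As a side note, one rough way to list the client processes stuck in
uninterruptible sleep and confirm they really are parked in futex() (this
assumes a Linux client with ps and strace installed; <PID> is a placeholder
for a process id from the first command) would be:

    ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /^D/'
    strace -f -p <PID> -e trace=futex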