Re: Ceph Volume Issue


 



Thanks everyone for your inputs.

 

Below is a small writeup that I wanted to share with the Ceph user community.

 

Summary of the Ceph Issue with Volumes

 

Our Setup

As mentioned earlier, our setup has OpenStack MOS 6.0 (Mirantis OpenStack) integrated with a Ceph storage cluster.

The version details are as follows:

Ceph version: 0.80.7

Libvirt version: 1.2.2

OpenStack version: Juno (Mirantis 6.0)

 

 

Statement of Problem

We attached multiple volumes (more than 6) to a VM instance, similar to adding multiple disks on a Hadoop bare-metal node, and then tried to write to all of these disks simultaneously, for example by running one dd per disk such as "dd if=/dev/zero of=/disk<N>/test bs=4K count=10485760" against each of /disk1 through /disk6 in parallel.
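For reference, here is a minimal sketch of how such a parallel write test can be driven from a shell, assuming the volumes are already formatted and mounted at /disk1 through /disk6 (the mount points are just our convention):

# Start one dd per mounted volume in the background, then wait for all of them to finish.
for n in {1..6}; do
    dd if=/dev/zero of=/disk${n}/test bs=4K count=10485760 &
done
wait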

Watching "vmstat 1" on the VM instance, we saw that over time the "bo" (blocks out) value trickled down towards zero.

As soon as "bo" reached zero, the load on the VM instance spiked and the system became unresponsive. We had to reboot the VM instance to recover.

 

Also, when we checked, all the "dd" processes were in the "D" (uninterruptible sleep) state.
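A generic way to confirm this (not specific to our setup) is to list the processes currently in uninterruptible sleep together with the kernel function they are blocked in:

# Show PID, state, wait channel and command for processes whose state starts with "D".
ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /^D/'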

 

Our Investigation and Probable Resolution

In /var/log/syslog on the compute node running the VM instance, we found the error message "Too many open files".

Example below, where ABCD is the PID of the qemu instance:

 

<8>Nov 18 04:56:49 node-XXX qemu-system-x86_64: 2016-11-18 04:56:49.939702 7fe9b569d700 -1 -- <COMPUTE IP>:0/70<ABCD> >> <CEPH MONITOR>:6830/14356 pipe(0x7fede65dcbf0 sd=-1 :0 s=1 pgs=0 cs=0 l=1 c=0x7fede37fc8c0).connect couldn't created socket (24) Too many open files

 

When we checked the open file limit for that process in /proc, we found the following:

XXXXXX@node-XXXX:~# cat /proc/<ABCD>/limits

...

Max open files            1024                 4096                 files

...
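To see how close the qemu process actually gets to that limit, the number of descriptors it has open can be counted directly from /proc (a generic Linux check; <ABCD> is again the qemu PID):

# Count the file descriptors currently open by the qemu process.
ls /proc/<ABCD>/fd | wc -l

Our understanding is that the librbd client inside qemu opens sockets to the Ceph monitors and to each OSD it talks to, so this count grows as I/O spreads across more OSDs, which fits our observation further below that the total amount of I/O, not the number of volumes, was the trigger.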

 

On this basis, we increased the open file descriptor limit for the libvirt-bin process from 1024 to 65536.

We had to put the following ulimit commands in /etc/default/libvirt-bin:

ulimit -Hn 65536

ulimit -Sn 65536

 

We had to restart the qemu instances via "nova stop" and "nova start" for the new limits to take effect.
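After the restart, the new limits can be verified on the respawned qemu process; for example (where <instance> and the new qemu PID <ABCD> are placeholders):

nova stop <instance>
nova start <instance>
# On the compute node hosting the instance, check the limits of the new qemu process:
grep "Max open files" /proc/<ABCD>/limits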

 

This workaround has solved our issue for now, and the test cases described above now complete successfully.

 

We also checked the points below, which were helpful in narrowing down the issue:

· Was the issue limited to a specific Linux OS (Ubuntu or CentOS)?

· Was the issue limited to a specific kernel? We upgraded the kernel, but the issue persisted.

· Was the issue due to any resource limits (CPU, RAM, network, disk I/O) on either the VM instance or the compute node?

· We also tried tuning kernel parameters such as vm.dirty_ratio and vm.dirty_background_ratio, but no improvement was observed (an illustrative example follows after this list).

· We also observed that the trigger was NOT the number of volumes attached but the total amount of I/O performed.
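For completeness, this is the kind of writeback tuning referred to above; the values are only illustrative (and, as noted, it did not help in our case):

# Lower the dirty page thresholds so that writeback starts earlier (illustrative values only).
sysctl -w vm.dirty_background_ratio=5
sysctl -w vm.dirty_ratio=10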

 

In our understanding this is a good resolution for now, but it may need monitoring and further tuning.

 

Please do let me know if there are any questions, concerns, or pointers.

 

Thanks once again.

 

Thanks,

Mehul  

 

 

From: Mehul1 Jani
Sent: 16 November 2016 11:40
To: 'ceph-users@xxxxxxxxxxxxxx'
Cc: Sanjeev Jaiswal; Harshit T Shah; Hardikv Desai
Subject: Ceph Volume Issue

 

Hi All,

 

We have a Ceph storage cluster integrated with our OpenStack private cloud.

We have created a pool for volumes, which allows our OpenStack private cloud users to create a volume from an image and boot from that volume.

Additionally, our images (both Ubuntu 14.04 and CentOS 7) are in raw format.

 

One of our use cases is to attach multiple volumes in addition to the boot volume.

We have observed that when we attach multiple volumes and try simultaneous writes to these attached volumes, for example via dd, all of these processes go into the "D" (uninterruptible sleep) state.

We can also see in the vmstat output that the "bo" value trickles down to zero.

We have checked the network utilization on the compute node, which does not show any issues.

 

Finally, after a while the system becomes unresponsive, and the only way to recover is to reboot the VM.

 

Some of our version details are as follows.

 

Ceph version: 0.80.7

Libvirt version: 1.2.2

OpenStack version: Juno (Mirantis 6.0)

 

Please do let me know if anyone has faced a similar issue or has any pointers.

 

Any direction will be helpful.

 

Thanks,

Mehul

 

 


"Confidentiality Warning: This message and any attachments are intended only for the use of the intended recipient(s), are confidential and may be privileged. If you are not the intended recipient, you are hereby notified that any review, re-transmission, conversion to hard copy, copying, circulation or other use of this message and any attachments is strictly prohibited. If you are not the intended recipient, please notify the sender immediately by return email and delete this message and any attachments from your system.

Virus Warning: Although the company has taken reasonable precautions to ensure no viruses are present in this email. The company cannot accept responsibility for any loss or damage arising from the use of this email or attachment."

