Any suggestions on the best way to migrate / fix my cluster configuration

Hi,

An engineer who worked for me a couple of years ago set up a Ceph cloud running an old version.  We are now seeing serious performance problems that are affecting other systems, so I have tried to research what to do.  I have updated to version 0.80.7 and added more hardware, and things are no better.  I am now stuck trying to decide how to move forward and wonder if anyone can advise?

The setup was originally 3 machines, each with two 3TB disks, using ext4 on the first disk and XFS on the second on each node.  One has 8GB RAM and the other two have 4GB RAM.

I decided to add a 4th node to the group and set it up with XFS on the primary disk and btrfs on the second.  On this machine the journal, whilst on the same disk as the OSD, is in a raw partition on each disk.  The memory is 8GB.

It took about 7 days for the data to finally move off the first node onto the fourth node, and already-bad performance became abysmal.
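I did not throttle recovery or backfill while that data was moving.  I assume settings along these lines would have reduced the impact on clients, but I have not tried them on this cluster and the values are only examples:

ceph tell osd.* injectargs '--osd-max-backfills 1'
ceph tell osd.* injectargs '--osd-recovery-max-active 1 --osd-recovery-op-priority 1'

Is that the right sort of thing to do during a rebalance like this, or would it just have stretched the 7 days out even further?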

I had planned to add a PCI-based SSD as a main drive and use the two disks with journals on the SSD, but I could not get the SSD to work at all.
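For reference, this is roughly the journal move I had in mind, untested since the card never appeared.  The device name /dev/ssd1 is just a placeholder and osd.6 is only an example:

ceph osd set noout                    # stop the cluster marking the OSD out while it is down
service ceph stop osd.6               # (or "stop ceph-osd id=6" on the upstart-based nodes)
ceph-osd -i 6 --flush-journal         # flush the existing on-disk journal
# point "osd journal" for osd.6 in ceph.conf (or the journal symlink in
# /var/lib/ceph/osd/ceph-6/) at the new SSD partition, e.g. /dev/ssd1
ceph-osd -i 6 --mkjournal             # create the new journal on the SSD
service ceph start osd.6
ceph osd unset noout

Does that look sane, or is there a better way to get the journals off the spinning disks?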

I am stuck trying to keep this production cluster operational and trying to find a way to migrate onto a better configuration, but I am worried that I may make the wrong decision and make things worse.  The original 3 machines are running Ubuntu 12.10 and the newest one is 14.10.  The node st001 does not have any active OSDs, but it does have an active NFS server with disk images for virtual machines on it.  Ideally I want to migrate these onto Ceph.
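For those VM images, the approach I was thinking of is roughly the following; the pool, image and file names are only placeholders:

rbd import /srv/nfs/images/vm1.img rbd/vm1                        # raw images straight into RBD
qemu-img convert -O raw /srv/nfs/images/vm2.qcow2 rbd:rbd/vm2     # qcow2 images via qemu-img

and then point the VMs at the RBD images instead of the NFS files.  If that is the wrong way to go about it I would be glad to hear alternatives.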

If anyone can offer any suggestions on the best way to proceed with this, then I would be more than happy to listen.  I will also be happy to provide additional information if that would be useful.

Thanks in advance,

Carl Taylor

The machines are SuperMicro blades with Intel(R) Xeon(R) E3-1220 v3 @ 3.10GHz CPUs and 4 or 8GB RAM (I intend to upgrade them all to 8GB).  The disks are 3TB SATA.

# id   weight  type name        up/down  reweight
-1     21.38   root default
-2     5.41      host st002
 3     2.68        osd.3        up       1         ext4
 1     2.73        osd.1        up       1         xfs
-3     5.41      host st003
 5     2.68        osd.5        up       1         ext4
 2     2.73        osd.2        up       1         xfs
-4     5.41      host st001
 4     2.68        osd.4        up       0
 0     2.73        osd.0        up       0
-5     5.15      host st004
 6     2.64        osd.6        up       1         btrfs
 7     2.51        osd.7        up       1         xfs

As you can see from the following, there is no rhyme or reason to the performance.

root@st004:/home/cjtaylor# time ceph tell osd.7 bench
{ "bytes_written": 1073741824,
  "blocksize": 4194304,
  "bytes_per_sec": "52021545.000000"}

real 0m20.885s
user 0m0.041s
sys 0m0.024s
root@st004:/home/cjtaylor# time ceph tell osd.6 bench
{ "bytes_written": 1073741824,
  "blocksize": 4194304,
  "bytes_per_sec": "20573023.000000"}

real 1m8.642s
user 0m0.109s
sys 0m0.031s

root@st003:~# time ceph tell osd.5 bench
{ "bytes_written": 1073741824,
  "blocksize": 4194304,
  "bytes_per_sec": "31145749.000000"}

real 0m36.698s
user 0m0.060s
sys 0m0.032s

root@st003:~# time ceph tell osd.2 bench
{ "bytes_written": 1073741824,
  "blocksize": 4194304,
  "bytes_per_sec": "24935604.000000"}

real 0m44.964s
user 0m0.076s
sys 0m0.020s

root@st002:~# time ceph tell osd.3 bench
{ "bytes_written": 1073741824,
  "blocksize": 4194304,
  "bytes_per_sec": "25826386.000000"}

real 0m44.951s
user 0m0.060s
sys 0m0.024s

root@st002:~# time ceph tell osd.1 bench
{ "bytes_written": 1073741824,
  "blocksize": 4194304,
  "bytes_per_sec": "17568847.000000"}

real 1m4.088s
user 0m0.072s
sys 0m0.024s


    cluster cd1fa211-911e-4b18-8392-9adcf0ed0bd5
     health HEALTH_OK
     monmap e3: 3 mons at {st001=172.16.2.109:6789/0,st002=172.16.2.101:6789/0,st003=172.16.2.106:6789/0}, election epoch 15498, quorum 0,1,2 st002,st003,st001
     mdsmap e344: 1/1/1 up {0=st001=up:active}
     osdmap e14769: 8 osds: 8 up, 6 in
      pgmap v23155467: 896 pgs, 19 pools, 3675 GB data, 8543 kobjects
            9913 GB used, 6159 GB / 16356 GB avail
                 894 active+clean
                   2 active+clean+scrubbing+deep
  client io 373 kB/s rd, 809 kB/s wr, 57 op/s

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
