kernel: [ 8773.432358] libceph: osd1 192.168.0.131:6803 socket error on read

Frerot, Jean-Sébastien <jsfrerot@xxxxxxxxxxxxxxxx> · Sat, 5 Oct 2013 21:42:47 -0400

Hi, I have a Ceph cluster running on 3 physical servers.

Here is how my setup is configured:
server1: mon, osd, mds
server2: mon, osd, mds
server3: mon
OS: Ubuntu 13.04
ceph version: 0.67.4-1raring (recently upgraded to see if my problem still persisted with the new version)

I was running Cuttlefish until yesterday. I had been using Ceph with OpenStack (using rbd), but I simplified my setup and removed OpenStack to simply use KVM with virt-manager.

So I created a new pool to be able to do live migration of KVM instances:
# ceph osd lspools
0 data,1 metadata,2 rbd,3 volumes,4 images,6 live_migration,

I ran VMs for some days without problems, but then I noticed that I couldn't use the full disk size of my first VM (web01, which was originally 160G but is now only 119G as stored in Ceph). I also have a Windows instance running on a 300G raw file, also located in Ceph. While trying to fix the issue, I decided to make a local backup of the file in case something went wrong, and guess what: I wasn't able to copy the file from Ceph to my local drive. The moment I ran "cp live_migration/web01 /mnt/", the OS hung, and syslog showed this at more than 30 lines/s:

Oct  5 15:25:45 server2 kernel: [ 8773.432358] libceph: osd1 192.168.0.131:6803 socket error on read
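
To put a number on that rate, the matching lines can be counted per second of kernel uptime. This is only a minimal sketch, assuming the syslog line format shown above; the function name count_socket_errors and the regular expression are illustrative and not part of Ceph.

```python
import re
from collections import Counter

# Assumed line format, taken from the syslog entry quoted above:
#   <date> <host> kernel: [ <uptime>] libceph: osdN <ip:port> socket error on read
LIBCEPH_ERR = re.compile(
    r"kernel: \[\s*(?P<uptime>\d+\.\d+)\] libceph: "
    r"(?P<osd>osd\d+) (?P<addr>\S+) socket error on (?P<op>read|write)"
)


def count_socket_errors(lines):
    """Count libceph socket errors per (osd, address, whole second of uptime)."""
    per_second = Counter()
    for line in lines:
        match = LIBCEPH_ERR.search(line)
        if match:
            # Truncate the kernel uptime to whole seconds to bucket the errors.
            second = int(float(match.group("uptime")))
            key = (match.group("osd"), match.group("addr"), second)
            per_second[key] += 1
    return per_second


# The only sample used here is the line quoted in this message.
sample = [
    "Oct  5 15:25:45 server2 kernel: [ 8773.432358] libceph: "
    "osd1 192.168.0.131:6803 socket error on read",
]
counts = count_socket_errors(sample)
```

To run it against a real log, pass an open file object instead of the sample list, for example open("/var/log/syslog") (the path is an assumption and depends on the syslog configuration).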

I couldn't kill my cp or reboot my server normally, so I had to reset it.

I tried to copy my other file, "win2012", also stored in the Ceph cluster, and got the same issue. Now I can't read anything from it or start my VM again.

[root@server1 ~]# ceph status
  cluster 50dc0404-c081-4c43-ac3f-872ba5494bd7
   health HEALTH_OK
   monmap e4: 3 mons at {server1=192.168.0.130:6789/0,server2=192.168.0.131:6789/0,server3=192.168.0.132:6789/0}, election epoch 120, quorum 0,1,2 server1,server2,server3
   osdmap e275: 2 osds: 2 up, 2 in
    pgmap v1508209: 576 pgs: 576 active+clean; 108 GB data, 214 GB used, 785 GB / 999 GB avail
   mdsmap e181: 1/1/1 up {0=server2=up:active}, 1 up:standby

I mount the FS with fstab like this: 
192.168.0.131:6789,192.168.0.130:6789:/live_migration /var/lib/instances ceph name=live_migration,secret=mysecret==,noatime 0 2

I get this message in ceph-osd.0.log, as spammy as the "socket error on read" error I get in syslog:
2013-10-05 23:07:23.586807 7f24731cc700  0 -- 192.168.0.130:6801/19182 >> 192.168.0.130:0/4212596483 pipe(0x128d8500 sd=115 :6801 s=0 pgs=0 cs=0 l=0 c=0x14ac09a0).accept peer addr is really 192.168.0.130:0/4212596483 (socket is 192.168.0.130:35078/0)

Other info:
df -h
/dev/mapper/server1--vg-ceph                         500G  108G  393G  22% /opt/data/ceph
192.168.0.131:6789,192.168.0.130:6789:/live_migration 1000G  215G  786G  22% /var/lib/instances
...

mount
/dev/mapper/server1--vg-ceph on /opt/data/ceph type xfs (rw,noatime)

192.168.0.131:6789,192.168.0.130:6789:/live_migration on /var/lib/instances type ceph (name=live_migration,key=client.live_migration)
...

How can I recover from this?

Thank you,
--
Jean-Sébastien Frerot
jsfrerot@xxxxxxxxxxxxxxxx

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com