Re: HEALTH_WARNING

I did clear some data and then restarted, but the osd didn't come online again. Instead the osds were running for some time and then they died one by one.

I re-created the filesystem and transferred the data again, with a similar result. This time the filesystem was not filled up.
It seems as if the filesystem is hanging and I can't get any response from it.

I have done the same process again. During the creation it complained about journaling (hdparm -W 0 /dev/sda2).
This time I made sure it didn't complain about the hdparm setting of the SSD disks while I was creating the filesystem.
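
For reference, something like this should turn off the write cache on the SSD journal disks before creating the filesystem (a rough sketch; /dev/sda and /dev/sdb stand in for whatever the actual SSD devices are):

# Disable the drive write cache on the SSD journal devices
# (hdparm -W acts on the whole drive, not on a single partition).
hdparm -W 0 /dev/sda
hdparm -W 0 /dev/sdb
# Check that write caching is now reported as off.
hdparm -W /dev/sda
hdparm -W /dev/sdb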

On the host where the filesystem is mounted I have seen some connection failures in dmesg:

[16143.534936] libceph: client4428 fsid 19be9ae7-cdf8-cb03-4178-568342d30fa5
[16143.535092] libceph: mon0 10.0.6.10:6789 session established
[16224.427969] libceph: mon0 10.0.6.10:6789 socket closed
[16224.427975] libceph: mon0 10.0.6.10:6789 session lost, hunting for new mon
[16224.429637] libceph: mon0 10.0.6.10:6789 connection failed
[16233.700478] libceph: mon1 10.0.6.11:6789 connection failed
[16243.716405] libceph: mon2 10.0.6.12:6789 connection failed
[16253.728529] libceph: mon2 10.0.6.12:6789 connection failed
[17008.794981] libceph: client4107 fsid 2c3fefe7-3362-f541-27b4-64176adb3f22
[17008.795127] libceph: mon0 10.0.6.10:6789 session established
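
To narrow this down, something like the following should show whether the monitors are still running and reachable (a rough sketch; the monitor addresses are the ones from the log above, other names and paths may need adjusting):

# On each monitor host (10.0.6.10 - 10.0.6.12): is the mon daemon
# still running and listening on the monitor port?
ps ax | grep cmon            # cmon is the monitor daemon in this release
netstat -tlnp | grep 6789
# From the client: can the monitor port be reached at all?
telnet 10.0.6.10 6789
# Overall cluster state as reported by the monitors.
ceph health
ceph -s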

I am not sure I have everything configured correctly.
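
For the next run I will turn up the debug levels as you suggested. I assume something like this in the [osd] section of ceph.conf is what you mean (not sure these are the right levels):

[osd]
        debug osd = 20
        debug filestore = 20
        debug ms = 1

and then restart the osds and put the logs (under /var/log/ceph here) somewhere you can reach them.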

Regards Martin

----- Original Message -----
From: "Gregory Farnum" <gregf@xxxxxxxxxxxxxxx>
To: "Martin Wilderoth" <martin.wilderoth@xxxxxxxxxx>
Cc: ceph-devel@xxxxxxxxxxxxxxx
Sent: Monday, 4 Apr 2011 1:38:48
Subject: Re: HEALTH_WARNING

On Sat, Apr 2, 2011 at 3:55 AM, Martin Wilderoth 
<martin.wilderoth@xxxxxxxxxx> wrote: 
> Hello, 
> 
> I have separate partitions for my osd and the btrfs file system. 
> I also use SSD disks for journaling. 
> 
> But I got a problem when the root filesystem was filled up with logfiles on one host; 
> the filesystem reported out of disk space. 
> 
> But the osds were not filled to 100%. Later I realised that the root filesystem on one of the osd hosts (osd2 and osd3) had no space left, too much logging. 
> 
> The only way I know to recover is to create a new filesystem in the cluster :-) 
> But it's bad for the data :-) 
> 
> When I get problems with one osd it seems as if they are crashing one by one. 
> And I don't know how to get them up again without deleting all the data. 
You should be able to simply clear up some space (don't remove any of 
the actual OSD data though!) and then start up the OSD daemon, at 
which point it ought to automatically rejoin the cluster. 
Is this not working? If not, please start up the daemon with higher 
levels of debug logging and put the logs somewhere accessible. 
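Something like this should be enough to find what is filling the root
filesystem and free it up without touching the osd data (a rough sketch;
the log paths are just examples and may differ on your hosts):

# See what is eating the root filesystem.
df -h /
du -xh --max-depth=1 / | sort -h | tail
# The ceph logs can be trimmed safely; the osd data directories must not be touched.
rm -f /var/log/ceph/*.log.*      # rotated logs, if any
> /var/log/ceph/osd0.log         # truncate the active log (name is an example)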
-Greg 

