Re: Cephfs unaccessible

Marco Aroldi <marco.aroldi@xxxxxxxxx> · Mon, 22 Apr 2013 10:54:03 +0200

In the original design I changed the rules because I wanted the data
placed with 2 replicas across 2 identical rooms (named p1 and p2).
Now that one room has 4 OSDs out of the cluster, do I have to change the
rules and use a "type host" rule instead of "type room"?
Could this help?

root default {
        id -1           # do not change unnecessarily
        # weight 122.500
        alg straw
        hash 0  # rjenkins1
        item p1 weight 57.500
        item p2 weight 65.000
}

# rules
rule data {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type room
        step emit
}
rule metadata {
        ruleset 1
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type room
        step emit
}
rule rbd {
        ruleset 2
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type room
        step emit
}

# end crush map

ceph health:

HEALTH_WARN 2072 pgs backfill; 43 pgs backfill_toofull; 131 pgs
backfilling; 68 pgs degraded; 594 pgs recovery_wait; 2802 pgs stuck
unclean; recovery 2811952/22351845 degraded (12.580%);  recovering 35
o/s, 197MB/s; 4 near full osd(s); noup,nodown flag(s) set

2013-04-22 10:53:26.800014 mon.0 [INF] pgmap v1457213: 17280 pgs:
14474 active+clean, 1975 active+remapped+wait_backfill, 18
active+degraded+wait_backfill, 37
active+remapped+wait_backfill+backfill_toofull, 569
active+recovery_wait, 123 active+remapped+backfilling, 3
active+remapped+backfill_toofull, 3 active+degraded+backfilling, 6
active+clean+scrubbing, 39 active+degraded+remapped+wait_backfill, 25
active+recovery_wait+remapped, 3
active+degraded+remapped+wait_backfill+backfill_toofull, 5
active+degraded+remapped+backfilling; 50432 GB data, 76277 GB used,
37154 GB / 110 TB avail; 2811241/22350671 degraded (12.578%);
recovering 29 o/s, 119MB/s
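
To cross-check the pgmap line against the health summary, here is a minimal Python sketch that tallies the PG states. It is not part of Ceph: the helper names (parse_pg_states, count_with_flag, degraded_percent) are made up for illustration, and the sample string is the pgmap line above with the line wraps joined.

```python
import re

# PG state summary copied from the pgmap line above (line wraps joined).
PGMAP = (
    "17280 pgs: 14474 active+clean, 1975 active+remapped+wait_backfill, "
    "18 active+degraded+wait_backfill, "
    "37 active+remapped+wait_backfill+backfill_toofull, "
    "569 active+recovery_wait, 123 active+remapped+backfilling, "
    "3 active+remapped+backfill_toofull, 3 active+degraded+backfilling, "
    "6 active+clean+scrubbing, 39 active+degraded+remapped+wait_backfill, "
    "25 active+recovery_wait+remapped, "
    "3 active+degraded+remapped+wait_backfill+backfill_toofull, "
    "5 active+degraded+remapped+backfilling; 50432 GB data"
)


def parse_pg_states(summary):
    """Return (declared_total, {state: count}) from a pgmap summary string."""
    head, _, rest = summary.partition(" pgs: ")
    states_part = rest.split(";", 1)[0]  # drop the capacity figures
    states = {}
    for count, state in re.findall(r"(\d+) ([a-z_+]+)", states_part):
        states[state] = states.get(state, 0) + int(count)
    return int(head), states


def count_with_flag(states, flag):
    """Sum the PGs whose combined state contains the given flag."""
    return sum(n for s, n in states.items() if flag in s.split("+"))


def degraded_percent(degraded, total):
    """Degraded object ratio as a percentage, rounded to 3 decimals."""
    return round(100.0 * degraded / total, 3)


total, states = parse_pg_states(PGMAP)
print(total, sum(states.values()))                  # declared vs. summed PGs
print(count_with_flag(states, "backfill_toofull"))  # 43
print(count_with_flag(states, "degraded"))          # 68
print(degraded_percent(2811241, 22350671))          # 12.578
```

The per-flag sums (43 backfill_toofull, 68 degraded) agree with the "ceph health" output above, so the two reports are consistent with each other.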

2013/4/22 Marco Aroldi <marco.aroldi@xxxxxxxxx>:
> The rebalance is still going
> and the mounts are still refused.
>
> I've set the nodown and noup flags again because the OSDs are flapping continuously,
> and added "osd backfill tooful ratio = 0.91" to ceph.conf, trying to
> get rid of all those "backfill_toofull" PGs.
>
> What do I have to do now to regain access?
>
> I can provide any logs or whatever else you need.
> Thanks for the support.
>
> in ceph -w I see this:
> 2013-04-22 09:25:46.601721 osd.8 [WRN] 1 slow requests, 1 included
> below; oldest blocked for > 5404.500806 secs
> 2013-04-22 09:25:46.601727 osd.8 [WRN] slow request 5404.500806
> seconds old, received at 2013-04-22 07:55:42.100886:
> osd_op(mds.0.9:177037 10000025d80.000017b3 [stat] 0.300279a9 RETRY
> rwordered) v4 currently reached pgosd
>
> this is the ceph mds dump:
>
> dumped mdsmap epoch 52
> epoch    52
> flags    0
> created    2013-03-18 14:42:29.330548
> modified    2013-04-22 09:08:45.599613
> tableserver    0
> root    0
> session_timeout    60
> session_autoclose    300
> last_failure    49
> last_failure_osd_epoch    33152
> compat    compat={},rocompat={},incompat={1=base v0.20,2=client
> writeable ranges,3=default file layouts on dirs,4=dir inode in
> separate object}
> max_mds    1
> in    0
> up    {0=6957}
> failed
> stopped
> data_pools    [0]
> metadata_pool    1
> 6957:    192.168.21.11:6800/5844 'm1' mds.0.10 up:active seq 23
> 5945:    192.168.21.13:6800/12999 'm3' mds.-1.0 up:standby seq 1
> 5963:    192.168.21.12:6800/22454 'm2' mds.-1.0 up:standby seq 1
>
> ceph health:
>
> HEALTH_WARN 2133 pgs backfill; 47 pgs backfill_toofull; 136 pgs
> backfilling; 74 pgs degraded; 1 pgs recovering; 599 pgs recovery_wait;
> 2877 pgs stuck unclean; recovery 2910416/22449672 degraded (12.964%);
> recovering 10 o/s, 48850KB/s; 7 near full osd(s); noup,nodown flag(s)
> set
>
> 2013-04-22 09:34:11.436514 mon.0 [INF] pgmap v1452450: 17280 pgs:
> 14403 active+clean, 2032 active+remapped+wait_backfill, 19
> active+degraded+wait_backfill, 35
> active+remapped+wait_backfill+backfill_toofull, 574
> active+recovery_wait, 126 active+remapped+backfilling, 9
> active+remapped+backfill_toofull, 3 active+degraded+backfilling, 2
> active+clean+scrubbing, 41 active+degraded+remapped+wait_backfill, 25
> active+recovery_wait+remapped, 3
> active+degraded+remapped+wait_backfill+backfill_toofull, 8
> active+degraded+remapped+backfilling; 50432 GB data, 76229 GB used,
> 37202 GB / 110 TB avail; 2908837/22447349 degraded (12.958%);
> recovering 6 o/s, 20408KB/s
>
> 2013/4/21 Marco Aroldi <marco.aroldi@xxxxxxxxx>:
>> Greg, your supposition about the small amount of data to be written is
>> right, but the rebalance is writing an insane amount of data to the new
>> nodes and the mount is still not working.
>>
>> This is node s203 (the OS is on /dev/sdl, not listed):
>>
>> /dev/sda1       1.9T  467G  1.4T  26% /var/lib/ceph/osd/ceph-44
>> /dev/sdb1       1.9T  595G  1.3T  33% /var/lib/ceph/osd/ceph-45
>> /dev/sdc1       1.9T  396G  1.5T  22% /var/lib/ceph/osd/ceph-46
>> /dev/sdd1       1.9T  401G  1.5T  22% /var/lib/ceph/osd/ceph-47
>> /dev/sde1       1.9T  337G  1.5T  19% /var/lib/ceph/osd/ceph-48
>> /dev/sdf1       1.9T  441G  1.4T  24% /var/lib/ceph/osd/ceph-49
>> /dev/sdg1       1.9T  338G  1.5T  19% /var/lib/ceph/osd/ceph-50
>> /dev/sdh1       1.9T  359G  1.5T  20% /var/lib/ceph/osd/ceph-51
>> /dev/sdi1       1.4T  281G  1.1T  21% /var/lib/ceph/osd/ceph-52
>> /dev/sdj1       1.4T  423G  964G  31% /var/lib/ceph/osd/ceph-53
>> /dev/sdk1       1.9T  421G  1.4T  23% /var/lib/ceph/osd/ceph-54
>>
>> 2013/4/21 Marco Aroldi <marco.aroldi@xxxxxxxxx>:
>>> What can I try to do/delete to regain access?
>>> Those OSDs are going crazy, flapping up and down. I think the situation
>>> is out of control.
>>>
>>>
>>> HEALTH_WARN 2735 pgs backfill; 13 pgs backfill_toofull; 157 pgs
>>> backfilling; 188 pgs degraded; 251 pgs peering; 13 pgs recovering;
>>> 1159 pgs recovery_wait; 159 pgs stuck inactive; 4641 pgs stuck
>>> unclean; recovery 4007916/23007073 degraded (17.420%);  recovering 4
>>> o/s, 31927KB/s; 19 near full osd(s)
>>>
>>> 2013-04-21 18:56:46.839851 mon.0 [INF] pgmap v1399007: 17280 pgs: 276
>>> active, 12791 active+clean, 2575 active+remapped+wait_backfill, 71
>>> active+degraded+wait_backfill, 6
>>> active+remapped+wait_backfill+backfill_toofull, 1121
>>> active+recovery_wait, 90 peering, 3 remapped, 1 active+remapped, 127
>>> active+remapped+backfilling, 1 active+degraded, 5
>>> active+remapped+backfill_toofull, 19 active+degraded+backfilling, 1
>>> active+clean+scrubbing, 79 active+degraded+remapped+wait_backfill, 36
>>> active+recovery_wait+remapped, 1
>>> active+degraded+remapped+wait_backfill+backfill_toofull, 46
>>> remapped+peering, 16 active+degraded+remapped+backfilling, 1
>>> active+recovery_wait+degraded+remapped, 14 active+recovering; 50435 GB
>>> data, 74790 GB used, 38642 GB / 110 TB avail; 4018849/23025448
>>> degraded (17.454%);  recovering 14 o/s, 54732KB/s
>>>
>>> # id    weight    type name    up/down    reweight
>>> -1    130    root default
>>> -9    65        room p1
>>> -3    44            rack r14
>>> -4    22                host s101
>>> 11    2                    osd.11    up    1
>>> 12    2                    osd.12    up    1
>>> 13    2                    osd.13    up    1
>>> 14    2                    osd.14    up    1
>>> 15    2                    osd.15    up    1
>>> 16    2                    osd.16    up    1
>>> 17    2                    osd.17    up    1
>>> 18    2                    osd.18    up    1
>>> 19    2                    osd.19    up    1
>>> 20    2                    osd.20    up    1
>>> 21    2                    osd.21    up    1
>>> -6    22                host s102
>>> 33    2                    osd.33    up    1
>>> 34    2                    osd.34    up    1
>>> 35    2                    osd.35    up    1
>>> 36    2                    osd.36    up    1
>>> 37    2                    osd.37    up    1
>>> 38    2                    osd.38    up    1
>>> 39    2                    osd.39    up    1
>>> 40    2                    osd.40    up    1
>>> 41    2                    osd.41    up    1
>>> 42    2                    osd.42    up    1
>>> 43    2                    osd.43    up    1
>>> -13    21            rack r10
>>> -12    21                host s103
>>> 55    2                    osd.55    up    1
>>> 56    2                    osd.56    up    1
>>> 57    2                    osd.57    up    1
>>> 58    2                    osd.58    up    1
>>> 59    2                    osd.59    down    0
>>> 60    2                    osd.60    down    0
>>> 61    2                    osd.61    down    0
>>> 62    2                    osd.62    up    1
>>> 63    2                    osd.63    up    1
>>> 64    1.5                    osd.64    up    1
>>> 65    1.5                    osd.65    down    0
>>> -10    65        room p2
>>> -7    22            rack r20
>>> -5    22                host s202
>>> 22    2                    osd.22    up    1
>>> 23    2                    osd.23    up    1
>>> 24    2                    osd.24    up    1
>>> 25    2                    osd.25    up    1
>>> 26    2                    osd.26    up    1
>>> 27    2                    osd.27    up    1
>>> 28    2                    osd.28    up    1
>>> 29    2                    osd.29    up    1
>>> 30    2                    osd.30    up    1
>>> 31    2                    osd.31    up    1
>>> 32    2                    osd.32    up    1
>>> -8    22            rack r22
>>> -2    22                host s201
>>> 0    2                    osd.0    up    1
>>> 1    2                    osd.1    up    1
>>> 2    2                    osd.2    up    1
>>> 3    2                    osd.3    up    1
>>> 4    2                    osd.4    up    1
>>> 5    2                    osd.5    up    1
>>> 6    2                    osd.6    up    1
>>> 7    2                    osd.7    up    1
>>> 8    2                    osd.8    up    1
>>> 9    2                    osd.9    up    1
>>> 10    2                    osd.10    up    1
>>> -14    21            rack r21
>>> -11    21                host s203
>>> 44    2                    osd.44    up    1
>>> 45    2                    osd.45    up    1
>>> 46    2                    osd.46    up    1
>>> 47    2                    osd.47    up    1
>>> 48    2                    osd.48    up    1
>>> 49    2                    osd.49    up    1
>>> 50    2                    osd.50    up    1
>>> 51    2                    osd.51    up    1
>>> 52    1.5                    osd.52    up    1
>>> 53    1.5                    osd.53    up    1
>>> 54    2                    osd.54    up    1
>>>
>>>
>>> 2013/4/21 Marco Aroldi <marco.aroldi@xxxxxxxxx>:
>>>> So, I've restarted as many of the new OSDs as possible and the cluster
>>>> started to move data to the 2 new nodes overnight.
>>>> This morning there was no network traffic and the health was:
>>>>
>>>> HEALTH_ERR 1323 pgs backfill; 150 pgs backfill_toofull; 100 pgs
>>>> backfilling; 114 pgs degraded; 3374 pgs peering; 36 pgs recovering;
>>>> 949 pgs recovery_wait; 3374 pgs stuck inactive; 6289 pgs stuck
>>>> unclean; recovery 2130652/20890113 degraded (10.199%); 58/8914654
>>>> unfound (0.001%); 1 full osd(s); 22 near full osd(s); full,noup,nodown
>>>> flag(s) set
>>>>
>>>> So I unset the noup and nodown flags and the data started moving again.
>>>> I've increased the full ratio to 97%, so now there's no "official" full
>>>> OSD and the HEALTH_ERR became HEALTH_WARN.
>>>>
>>>> However, there is still no access to the filesystem:
>>>>
>>>> HEALTH_WARN 1906 pgs backfill; 21 pgs backfill_toofull; 52 pgs
>>>> backfilling; 707 pgs degraded; 371 pgs down; 97 pgs incomplete; 3385
>>>> pgs peering; 35 pgs recovering; 1002 pgs recovery_wait; 4 pgs stale;
>>>> 683 pgs stuck inactive; 5898 pgs stuck unclean; recovery
>>>> 3081499/22208859 degraded (13.875%); 487/9433642 unfound (0.005%);
>>>> recovering 11722 o/s, 57040MB/s; 17 near full osd(s)
>>>>
>>>> The OSDs are flapping in/out again...
>>>>
>>>> I'm willing to start deleting some portion of the data.
>>>> What can I try to do now?
>>>>
>>>> 2013/4/21 Gregory Farnum <greg@xxxxxxxxxxx>:
>>>>> It's not entirely clear from your description and the output you've
>>>>> given us, but it looks like maybe you've managed to bring up all your
>>>>> OSDs correctly at this point? Or are they just not reporting down
>>>>> because you set the "no down" flag...
>>>>>
>>>>> In any case, CephFS isn't going to come up while the underlying RADOS
>>>>> cluster is this unhealthy, so you're going to need to get that going
>>>>> again. Since your OSDs have managed to get themselves so full it's
>>>>> going to be trickier than normal, but if all the rebalancing that's
>>>>> happening is only because you sort-of-didn't-really lose nodes, and
>>>>> you can bring them all back up, you should be able to sort it out by
>>>>> getting all the nodes back up, and then changing your full percentages
>>>>> (by a *very small* amount); since you haven't been doing any writes to
>>>>> the cluster it shouldn't take much data writes to get everything back
>>>>> where it was, although if this has been continuing to backfill in the
>>>>> meanwhile that will need to unwind.
>>>>> -Greg
>>>>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>>>>
>>>>>
>>>>> On Sat, Apr 20, 2013 at 12:21 PM, John Wilkins <john.wilkins@xxxxxxxxxxx> wrote:
>>>>>> I don't see anything related to lost objects in your output. I just see
>>>>>> waiting on backfill, backfill_toofull, remapped, and so forth. You can read
>>>>>> a bit about what is going on here:
>>>>>> http://ceph.com/docs/next/rados/operations/monitoring-osd-pg/
>>>>>>
>>>>>> Keep us posted as to the recovery, and let me know what I can do to improve
>>>>>> the docs for scenarios like this.
>>>>>>
>>>>>>
>>>>>> On Sat, Apr 20, 2013 at 10:52 AM, Marco Aroldi <marco.aroldi@xxxxxxxxx>
>>>>>> wrote:
>>>>>>>
>>>>>>> John,
>>>>>>> thanks for the quick reply.
>>>>>>> Below you can see my ceph osd tree.
>>>>>>> The problem was caused not by the failure itself, but by the bunch of
>>>>>>> "renamed" devices.
>>>>>>> It was like a deadly 15-puzzle.
>>>>>>> I think the solution would have been to mount the devices in fstab by UUID
>>>>>>> (/dev/disk/by-uuid) instead of /dev/sdX.
>>>>>>>
>>>>>>> However, yes, I have an entry in my ceph.conf (devs = /dev/sdX1 --
>>>>>>> osd_journal = /dev/sdX2) *and* an entry in my fstab for each OSD.
>>>>>>>
>>>>>>> The node with the failed disk is s103 (osd.59).
>>>>>>>
>>>>>>> Now I have 5 OSDs from s203 up and in, to try to let Ceph rebalance the
>>>>>>> data... but it is still a bloody mess.
>>>>>>> Look at the ceph -w output: it reports a total of 110 TB, which is wrong... all
>>>>>>> drives are 2 TB and I have 49 drives up and in, so the total should be 98 TB.
>>>>>>> I think that 110 TB (55 OSDs) was the size before the cluster became
>>>>>>> inaccessible.
>>>>>>>
>>>>>>> # id    weight    type name    up/down    reweight
>>>>>>> -1    130    root default
>>>>>>> -9    65        room p1
>>>>>>> -3    44            rack r14
>>>>>>> -4    22                host s101
>>>>>>> 11    2                    osd.11    up    1
>>>>>>> 12    2                    osd.12    up    1
>>>>>>> 13    2                    osd.13    up    1
>>>>>>> 14    2                    osd.14    up    1
>>>>>>> 15    2                    osd.15    up    1
>>>>>>> 16    2                    osd.16    up    1
>>>>>>> 17    2                    osd.17    up    1
>>>>>>> 18    2                    osd.18    up    1
>>>>>>> 19    2                    osd.19    up    1
>>>>>>> 20    2                    osd.20    up    1
>>>>>>> 21    2                    osd.21    up    1
>>>>>>> -6    22                host s102
>>>>>>> 33    2                    osd.33    up    1
>>>>>>> 34    2                    osd.34    up    1
>>>>>>> 35    2                    osd.35    up    1
>>>>>>> 36    2                    osd.36    up    1
>>>>>>> 37    2                    osd.37    up    1
>>>>>>> 38    2                    osd.38    up    1
>>>>>>> 39    2                    osd.39    up    1
>>>>>>> 40    2                    osd.40    up    1
>>>>>>> 41    2                    osd.41    up    1
>>>>>>> 42    2                    osd.42    up    1
>>>>>>> 43    2                    osd.43    up    1
>>>>>>> -13    21            rack r10
>>>>>>> -12    21                host s103
>>>>>>> 55    2                    osd.55    up    0
>>>>>>> 56    2                    osd.56    up    0
>>>>>>> 57    2                    osd.57    up    0
>>>>>>> 58    2                    osd.58    up    0
>>>>>>> 59    2                    osd.59    down    0
>>>>>>> 60    2                    osd.60    down    0
>>>>>>> 61    2                    osd.61    down    0
>>>>>>> 62    2                    osd.62    up    0
>>>>>>> 63    2                    osd.63    up    0
>>>>>>> 64    1.5                    osd.64    up    0
>>>>>>> 65    1.5                    osd.65    down    0
>>>>>>> -10    65        room p2
>>>>>>> -7    22            rack r20
>>>>>>> -5    22                host s202
>>>>>>> 22    2                    osd.22    up    1
>>>>>>> 23    2                    osd.23    up    1
>>>>>>> 24    2                    osd.24    up    1
>>>>>>> 25    2                    osd.25    up    1
>>>>>>> 26    2                    osd.26    up    1
>>>>>>> 27    2                    osd.27    up    1
>>>>>>> 28    2                    osd.28    up    1
>>>>>>> 29    2                    osd.29    up    1
>>>>>>> 30    2                    osd.30    up    1
>>>>>>> 31    2                    osd.31    up    1
>>>>>>> 32    2                    osd.32    up    1
>>>>>>> -8    22            rack r22
>>>>>>> -2    22                host s201
>>>>>>> 0    2                    osd.0    up    1
>>>>>>> 1    2                    osd.1    up    1
>>>>>>> 2    2                    osd.2    up    1
>>>>>>> 3    2                    osd.3    up    1
>>>>>>> 4    2                    osd.4    up    1
>>>>>>> 5    2                    osd.5    up    1
>>>>>>> 6    2                    osd.6    up    1
>>>>>>> 7    2                    osd.7    up    1
>>>>>>> 8    2                    osd.8    up    1
>>>>>>> 9    2                    osd.9    up    1
>>>>>>> 10    2                    osd.10    up    1
>>>>>>> -14    21            rack r21
>>>>>>> -11    21                host s203
>>>>>>> 44    2                    osd.44    up    1
>>>>>>> 45    2                    osd.45    up    1
>>>>>>> 46    2                    osd.46    up    1
>>>>>>> 47    2                    osd.47    up    1
>>>>>>> 48    2                    osd.48    up    1
>>>>>>> 49    2                    osd.49    up    0
>>>>>>> 50    2                    osd.50    up    0
>>>>>>> 51    2                    osd.51    up    0
>>>>>>> 52    1.5                    osd.52    up    0
>>>>>>> 53    1.5                    osd.53    up    0
>>>>>>> 54    2                    osd.54    up    0
>>>>>>>
>>>>>>>
>>>>>>> ceph -w
>>>>>>>
>>>>>>> 2013-04-20 19:46:48.608988 mon.0 [INF] pgmap v1352767: 17280 pgs: 58
>>>>>>> active, 12581 active+clean, 1686 active+remapped+wait_backfill, 24
>>>>>>> active+degraded+wait_backfill, 224
>>>>>>> active+remapped+wait_backfill+backfill_toofull, 1061
>>>>>>> active+recovery_wait, 4
>>>>>>> active+degraded+wait_backfill+backfill_toofull, 629 peering, 626
>>>>>>> active+remapped, 72 active+remapped+backfilling, 89 active+degraded,
>>>>>>> 14 active+remapped+backfill_toofull, 1 active+clean+scrubbing, 8
>>>>>>> active+degraded+remapped+wait_backfill, 20
>>>>>>> active+recovery_wait+remapped, 5
>>>>>>> active+degraded+remapped+wait_backfill+backfill_toofull, 162
>>>>>>> remapped+peering, 1 active+degraded+remapped+backfilling, 2
>>>>>>> active+degraded+remapped+backfill_toofull, 13 active+recovering; 49777
>>>>>>> GB data, 72863 GB used, 40568 GB / 110 TB avail; 2965687/21848501
>>>>>>> degraded (13.574%);  recovering 5 o/s, 16363B/s
>>>>>>>
>>>>>>> 2013/4/20 John Wilkins <john.wilkins@xxxxxxxxxxx>:
>>>>>>> > Marco,
>>>>>>> >
>>>>>>> > If you do a "ceph osd tree" can you see if your OSDs are all up? You seem to
>>>>>>> > have at least one problem related to the backfill OSDs being too full, and
>>>>>>> > some which are near full or full for the purposes of storage. See the
>>>>>>> > following in the documentation to see if this helps:
>>>>>>> >
>>>>>>> >
>>>>>>> > http://ceph.com/docs/master/rados/configuration/mon-config-ref/#storage-capacity
>>>>>>> >
>>>>>>> > http://ceph.com/docs/master/rados/configuration/osd-config-ref/#backfilling
>>>>>>> >
>>>>>>> > http://ceph.com/docs/master/rados/operations/troubleshooting-osd/#no-free-drive-space
>>>>>>> >
>>>>>>> > Before you start deleting data as a remedy, you'd want to at least try to
>>>>>>> > get the OSDs back up and running first.
>>>>>>> >
>>>>>>> > If rebooting changed the drive names, you might look here:
>>>>>>> >
>>>>>>> > http://ceph.com/docs/master/rados/configuration/osd-config-ref/#general-settings
>>>>>>> >
>>>>>>> > We have default settings for OSD and journal paths, which you could override
>>>>>>> > if you can locate the data and journal sources on the renamed drives. If you
>>>>>>> > mounted them, but didn't add them to the fstab, that might be the source of
>>>>>>> > the problem. I'd rather see you use the default paths, as it would be easier
>>>>>>> > to troubleshoot later. So did you mount the drives, but not add the mount
>>>>>>> > points to fstab?
>>>>>>> >
>>>>>>> > John
>>>>>>> >
>>>>>>> >
>>>>>>> >
>>>>>>> >
>>>>>>> > On Sat, Apr 20, 2013 at 8:46 AM, Marco Aroldi <marco.aroldi@xxxxxxxxx>
>>>>>>> > wrote:
>>>>>>> >>
>>>>>>> >> Hi,
>>>>>>> >> due to a hardware failure while expanding Ceph, I'm in big trouble
>>>>>>> >> because CephFS doesn't mount anymore.
>>>>>>> >> I was adding a couple of storage nodes, but a disk failed and after a
>>>>>>> >> reboot the OS (Ubuntu 12.04) renamed the remaining devices, so the
>>>>>>> >> entire node has been messed up.
>>>>>>> >>
>>>>>>> >> Now, from the "sane new node", I'm bringing some new OSDs up and in
>>>>>>> >> because the cluster is near full and I can't completely revert to the
>>>>>>> >> previous situation.
>>>>>>> >>
>>>>>>> >> *I can* afford data loss, but I need to regain access to the filesystem.
>>>>>>> >>
>>>>>>> >> My setup:
>>>>>>> >> 3 mon + 3 mds
>>>>>>> >> 4 storage nodes (i was adding no. 5 and 6)
>>>>>>> >>
>>>>>>> >> Ceph 0.56.4
>>>>>>> >>
>>>>>>> >>
>>>>>>> >> ceph health:
>>>>>>> >> HEALTH_ERR 2008 pgs backfill; 246 pgs backfill_toofull; 74 pgs
>>>>>>> >> backfilling; 134 pgs degraded; 790 pgs peering; 10 pgs recovering;
>>>>>>> >> 1116 pgs recovery_wait; 790 pgs stuck inactive; 4782 pgs stuck
>>>>>>> >> unclean; recovery 3049459/21926624 degraded (13.908%);  recovering 6
>>>>>>> >> o/s, 16316KB/s; 4 full osd(s); 30 near full osd(s); full,noup,nodown
>>>>>>> >> flag(s) set
>>>>>>> >>
>>>>>>> >>
>>>>>>> >>
>>>>>>> >> ceph mds dump:
>>>>>>> >> dumped mdsmap epoch 44
>>>>>>> >> epoch    44
>>>>>>> >> flags    0
>>>>>>> >> created    2013-03-18 14:42:29.330548
>>>>>>> >> modified    2013-04-20 17:14:32.969332
>>>>>>> >> tableserver    0
>>>>>>> >> root    0
>>>>>>> >> session_timeout    60
>>>>>>> >> session_autoclose    300
>>>>>>> >> last_failure    43
>>>>>>> >> last_failure_osd_epoch    18160
>>>>>>> >> compat    compat={},rocompat={},incompat={1=base v0.20,2=client
>>>>>>> >> writeable ranges,3=default file layouts on dirs,4=dir inode in
>>>>>>> >> separate object}
>>>>>>> >> max_mds    1
>>>>>>> >> in    0
>>>>>>> >> up    {0=6376}
>>>>>>> >> failed
>>>>>>> >> stopped
>>>>>>> >> data_pools    [0]
>>>>>>> >> metadata_pool    1
>>>>>>> >> 6376:    192.168.21.11:6800/13457 'm1' mds.0.9 up:replay seq 1
>>>>>>> >> 5945:    192.168.21.13:6800/12999 'm3' mds.-1.0 up:standby seq 1
>>>>>>> >> 5963:    192.168.21.12:6800/22454 'm2' mds.-1.0 up:standby seq 1
>>>>>>> >>
>>>>>>> >>
>>>>>>> >>
>>>>>>> >> ceph mon dump:
>>>>>>> >> epoch 1
>>>>>>> >> fsid d634f7b3-8a8a-4893-bdfb-a95ccca7fddd
>>>>>>> >> last_changed 2013-03-18 14:39:42.253923
>>>>>>> >> created 2013-03-18 14:39:42.253923
>>>>>>> >> 0: 192.168.21.11:6789/0 mon.m1
>>>>>>> >> 1: 192.168.21.12:6789/0 mon.m2
>>>>>>> >> 2: 192.168.21.13:6789/0 mon.m3
>>>>>>> >> _______________________________________________
>>>>>>> >> ceph-users mailing list
>>>>>>> >> ceph-users@xxxxxxxxxxxxxx
>>>>>>> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>>> >
>>>>>>> >
>>>>>>> >
>>>>>>> >
>>>>>>> > --
>>>>>>> > John Wilkins
>>>>>>> > Senior Technical Writer
>>>>>>> > Inktank
>>>>>>> > john.wilkins@xxxxxxxxxxx
>>>>>>> > (415) 425-9599
>>>>>>> > http://inktank.com
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> John Wilkins
>>>>>> Senior Technical Writer
>>>>>> Inktank
>>>>>> john.wilkins@xxxxxxxxxxx
>>>>>> (415) 425-9599
>>>>>> http://inktank.com
>>>>>>