Re: CephFS inaccessible

Greg, your supposition about the small amount of data to be written is
right, but the rebalance is writing an insane amount of data to the new
nodes and the mount is still not working.

This is node s203 (the OS is on /dev/sdl, not listed):

/dev/sda1       1.9T  467G  1.4T  26% /var/lib/ceph/osd/ceph-44
/dev/sdb1       1.9T  595G  1.3T  33% /var/lib/ceph/osd/ceph-45
/dev/sdc1       1.9T  396G  1.5T  22% /var/lib/ceph/osd/ceph-46
/dev/sdd1       1.9T  401G  1.5T  22% /var/lib/ceph/osd/ceph-47
/dev/sde1       1.9T  337G  1.5T  19% /var/lib/ceph/osd/ceph-48
/dev/sdf1       1.9T  441G  1.4T  24% /var/lib/ceph/osd/ceph-49
/dev/sdg1       1.9T  338G  1.5T  19% /var/lib/ceph/osd/ceph-50
/dev/sdh1       1.9T  359G  1.5T  20% /var/lib/ceph/osd/ceph-51
/dev/sdi1       1.4T  281G  1.1T  21% /var/lib/ceph/osd/ceph-52
/dev/sdj1       1.4T  423G  964G  31% /var/lib/ceph/osd/ceph-53
/dev/sdk1       1.9T  421G  1.4T  23% /var/lib/ceph/osd/ceph-54

2013/4/21 Marco Aroldi <marco.aroldi@xxxxxxxxx>:
> What can I try to do or delete to regain access?
> Those OSDs are going crazy, flapping up and down. I think the situation
> is out of control.
>
>
> HEALTH_WARN 2735 pgs backfill; 13 pgs backfill_toofull; 157 pgs
> backfilling; 188 pgs degraded; 251 pgs peering; 13 pgs recovering;
> 1159 pgs recovery_wait; 159 pgs stuck inactive; 4641 pgs stuck
> unclean; recovery 4007916/23007073 degraded (17.420%);  recovering 4
> o/s, 31927KB/s; 19 near full osd(s)
>
> 2013-04-21 18:56:46.839851 mon.0 [INF] pgmap v1399007: 17280 pgs: 276
> active, 12791 active+clean, 2575 active+remapped+wait_backfill, 71
> active+degraded+wait_backfill, 6
> active+remapped+wait_backfill+backfill_toofull, 1121
> active+recovery_wait, 90 peering, 3 remapped, 1 active+remapped, 127
> active+remapped+backfilling, 1 active+degraded, 5
> active+remapped+backfill_toofull, 19 active+degraded+backfilling, 1
> active+clean+scrubbing, 79 active+degraded+remapped+wait_backfill, 36
> active+recovery_wait+remapped, 1
> active+degraded+remapped+wait_backfill+backfill_toofull, 46
> remapped+peering, 16 active+degraded+remapped+backfilling, 1
> active+recovery_wait+degraded+remapped, 14 active+recovering; 50435 GB
> data, 74790 GB used, 38642 GB / 110 TB avail; 4018849/23025448
> degraded (17.454%);  recovering 14 o/s, 54732KB/s
>
> # id    weight    type name    up/down    reweight
> -1    130    root default
> -9    65        room p1
> -3    44            rack r14
> -4    22                host s101
> 11    2                    osd.11    up    1
> 12    2                    osd.12    up    1
> 13    2                    osd.13    up    1
> 14    2                    osd.14    up    1
> 15    2                    osd.15    up    1
> 16    2                    osd.16    up    1
> 17    2                    osd.17    up    1
> 18    2                    osd.18    up    1
> 19    2                    osd.19    up    1
> 20    2                    osd.20    up    1
> 21    2                    osd.21    up    1
> -6    22                host s102
> 33    2                    osd.33    up    1
> 34    2                    osd.34    up    1
> 35    2                    osd.35    up    1
> 36    2                    osd.36    up    1
> 37    2                    osd.37    up    1
> 38    2                    osd.38    up    1
> 39    2                    osd.39    up    1
> 40    2                    osd.40    up    1
> 41    2                    osd.41    up    1
> 42    2                    osd.42    up    1
> 43    2                    osd.43    up    1
> -13    21            rack r10
> -12    21                host s103
> 55    2                    osd.55    up    1
> 56    2                    osd.56    up    1
> 57    2                    osd.57    up    1
> 58    2                    osd.58    up    1
> 59    2                    osd.59    down    0
> 60    2                    osd.60    down    0
> 61    2                    osd.61    down    0
> 62    2                    osd.62    up    1
> 63    2                    osd.63    up    1
> 64    1.5                    osd.64    up    1
> 65    1.5                    osd.65    down    0
> -10    65        room p2
> -7    22            rack r20
> -5    22                host s202
> 22    2                    osd.22    up    1
> 23    2                    osd.23    up    1
> 24    2                    osd.24    up    1
> 25    2                    osd.25    up    1
> 26    2                    osd.26    up    1
> 27    2                    osd.27    up    1
> 28    2                    osd.28    up    1
> 29    2                    osd.29    up    1
> 30    2                    osd.30    up    1
> 31    2                    osd.31    up    1
> 32    2                    osd.32    up    1
> -8    22            rack r22
> -2    22                host s201
> 0    2                    osd.0    up    1
> 1    2                    osd.1    up    1
> 2    2                    osd.2    up    1
> 3    2                    osd.3    up    1
> 4    2                    osd.4    up    1
> 5    2                    osd.5    up    1
> 6    2                    osd.6    up    1
> 7    2                    osd.7    up    1
> 8    2                    osd.8    up    1
> 9    2                    osd.9    up    1
> 10    2                    osd.10    up    1
> -14    21            rack r21
> -11    21                host s203
> 44    2                    osd.44    up    1
> 45    2                    osd.45    up    1
> 46    2                    osd.46    up    1
> 47    2                    osd.47    up    1
> 48    2                    osd.48    up    1
> 49    2                    osd.49    up    1
> 50    2                    osd.50    up    1
> 51    2                    osd.51    up    1
> 52    1.5                    osd.52    up    1
> 53    1.5                    osd.53    up    1
> 54    2                    osd.54    up    1
>
>
> 2013/4/21 Marco Aroldi <marco.aroldi@xxxxxxxxx>:
>> So, I've restarted as many of the new OSDs as possible and the cluster
>> started moving data to the 2 new nodes overnight.
>> This morning there was no network traffic and the health was:
>>
>> HEALTH_ERR 1323 pgs backfill; 150 pgs backfill_toofull; 100 pgs
>> backfilling; 114 pgs degraded; 3374 pgs peering; 36 pgs recovering;
>> 949 pgs recovery_wait; 3374 pgs stuck inactive; 6289 pgs stuck
>> unclean; recovery 2130652/20890113 degraded (10.199%); 58/8914654
>> unfound (0.001%); 1 full osd(s); 22 near full osd(s); full,noup,nodown
>> flag(s) set
>>
>> So I unset the noup and nodown flags and the data started moving again.
>> I've increased the full ratio to 97%, so now there's no "official" full
>> OSD and the HEALTH_ERR became HEALTH_WARN.
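>> (Roughly, the commands for that would be something like this -- the
>> exact syntax may differ on 0.56:)
>>
>>   ceph osd unset noup
>>   ceph osd unset nodown
>>   # raise the full threshold so backfill is no longer blocked by the full flag
>>   ceph pg set_full_ratio 0.97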
>>
>> However, there is still no access to the filesystem.
>>
>> HEALTH_WARN 1906 pgs backfill; 21 pgs backfill_toofull; 52 pgs
>> backfilling; 707 pgs degraded; 371 pgs down; 97 pgs incomplete; 3385
>> pgs peering; 35 pgs recovering; 1002 pgs recovery_wait; 4 pgs stale;
>> 683 pgs stuck inactive; 5898 pgs stuck unclean; recovery
>> 3081499/22208859 degraded (13.875%); 487/9433642 unfound (0.005%);
>> recovering 11722 o/s, 57040MB/s; 17 near full osd(s)
>>
>> The OSDs are flapping in/out again...
>>
>> I'm willing to start deleting some portion of the data.
>> What can I try to do now?
>>
>> 2013/4/21 Gregory Farnum <greg@xxxxxxxxxxx>:
>>> It's not entirely clear from your description and the output you've
>>> given us, but it looks like maybe you've managed to bring up all your
>>> OSDs correctly at this point? Or are they just not reporting down
>>> because you set the "no down" flag...
>>>
>>> In any case, CephFS isn't going to come up while the underlying RADOS
>>> cluster is this unhealthy, so you're going to need to get that going
>>> again. Since your OSDs have managed to get themselves so full, it's
>>> going to be trickier than normal, but if all the rebalancing that's
>>> happening is only because you sort-of-didn't-really lose nodes, and
>>> you can bring them all back up, you should be able to sort it out by
>>> getting all the nodes back up and then raising your full percentages
>>> (by a *very small* amount). Since you haven't been doing any writes to
>>> the cluster, it shouldn't take much data being written to get everything
>>> back where it was, although if this has been continuing to backfill in
>>> the meanwhile, that will need to unwind.
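>>> A few commands that might help you watch that unwind and see when the
>>> MDS can actually replay (a rough sketch; check the exact syntax on your
>>> 0.56.4):
>>>
>>>   ceph -s                        # overall state and recovery rate
>>>   ceph pg dump_stuck inactive    # the PGs that are blocking I/O (and the MDS)
>>>   ceph mds dump | grep up:       # MDS should go from up:replay to up:active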
>>> -Greg
>>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>>
>>>
>>> On Sat, Apr 20, 2013 at 12:21 PM, John Wilkins <john.wilkins@xxxxxxxxxxx> wrote:
>>>> I don't see anything related to lost objects in your output. I just see
>>>> waiting on backfill, backfill_toofull, remapped, and so forth. You can read
>>>> a bit about what is going on here:
>>>> http://ceph.com/docs/next/rados/operations/monitoring-osd-pg/
>>>>
>>>> Keep us posted as to the recovery, and let me know what I can do to improve
>>>> the docs for scenarios like this.
>>>>
>>>>
>>>> On Sat, Apr 20, 2013 at 10:52 AM, Marco Aroldi <marco.aroldi@xxxxxxxxx>
>>>> wrote:
>>>>>
>>>>> John,
>>>>> thanks for the quick reply.
>>>>> Below you can see my ceph osd tree
>>>>> The problem is caused not by the failure itself, but by the bunch of
>>>>> "renamed" devices.
>>>>> It was like a deadly 15-puzzle.
>>>>> I think the solution is to mount the devices in fstab by UUID
>>>>> (/dev/disk/by-uuid) instead of /dev/sdX.
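>>>>> For example, something like this in /etc/fstab (the UUID below is made
>>>>> up -- blkid /dev/sda1 prints the real one; adjust the fs type/options
>>>>> to whatever is actually in use):
>>>>>
>>>>>   UUID=d27a1f0e-0000-0000-0000-000000000000  /var/lib/ceph/osd/ceph-44  xfs  noatime,inode64  0 0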
>>>>>
>>>>> However, yes, I have an entry in my ceph.conf (devs = /dev/sdX1 --
>>>>> osd_journal = /dev/sdX2) *and* an entry in my fstab for each OSD.
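>>>>> The same idea would apply to those ceph.conf entries, e.g. (illustrative
>>>>> paths, not my real ones):
>>>>>
>>>>>   [osd.44]
>>>>>       host = s203
>>>>>       devs = /dev/disk/by-uuid/<data-partition-uuid>
>>>>>       osd_journal = /dev/disk/by-id/<journal-disk-id>-part2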
>>>>>
>>>>> The node with the failed disk is s103 (osd.59).
>>>>>
>>>>> Now I have 5 OSDs from s203 up and in to try to let Ceph rebalance the
>>>>> data... but it is still a bloody mess.
>>>>> Look at the ceph -w output: it reports a total of 110 TB, which is
>>>>> wrong... all drives are 2 TB and I have 49 drives up and in -- total 98 TB.
>>>>> I think 110 TB (55 OSDs) was the size before the cluster became
>>>>> inaccessible.
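>>>>> (Quick math: 55 OSDs x ~2 TB = ~110 TB, while 49 up-and-in x ~2 TB =
>>>>> ~98 TB, ignoring the few 1.5-weight drives -- so 110 TB looks like the
>>>>> pre-failure total rather than the current one.)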
>>>>>
>>>>> # id    weight    type name    up/down    reweight
>>>>> -1    130    root default
>>>>> -9    65        room p1
>>>>> -3    44            rack r14
>>>>> -4    22                host s101
>>>>> 11    2                    osd.11    up    1
>>>>> 12    2                    osd.12    up    1
>>>>> 13    2                    osd.13    up    1
>>>>> 14    2                    osd.14    up    1
>>>>> 15    2                    osd.15    up    1
>>>>> 16    2                    osd.16    up    1
>>>>> 17    2                    osd.17    up    1
>>>>> 18    2                    osd.18    up    1
>>>>> 19    2                    osd.19    up    1
>>>>> 20    2                    osd.20    up    1
>>>>> 21    2                    osd.21    up    1
>>>>> -6    22                host s102
>>>>> 33    2                    osd.33    up    1
>>>>> 34    2                    osd.34    up    1
>>>>> 35    2                    osd.35    up    1
>>>>> 36    2                    osd.36    up    1
>>>>> 37    2                    osd.37    up    1
>>>>> 38    2                    osd.38    up    1
>>>>> 39    2                    osd.39    up    1
>>>>> 40    2                    osd.40    up    1
>>>>> 41    2                    osd.41    up    1
>>>>> 42    2                    osd.42    up    1
>>>>> 43    2                    osd.43    up    1
>>>>> -13    21            rack r10
>>>>> -12    21                host s103
>>>>> 55    2                    osd.55    up    0
>>>>> 56    2                    osd.56    up    0
>>>>> 57    2                    osd.57    up    0
>>>>> 58    2                    osd.58    up    0
>>>>> 59    2                    osd.59    down    0
>>>>> 60    2                    osd.60    down    0
>>>>> 61    2                    osd.61    down    0
>>>>> 62    2                    osd.62    up    0
>>>>> 63    2                    osd.63    up    0
>>>>> 64    1.5                    osd.64    up    0
>>>>> 65    1.5                    osd.65    down    0
>>>>> -10    65        room p2
>>>>> -7    22            rack r20
>>>>> -5    22                host s202
>>>>> 22    2                    osd.22    up    1
>>>>> 23    2                    osd.23    up    1
>>>>> 24    2                    osd.24    up    1
>>>>> 25    2                    osd.25    up    1
>>>>> 26    2                    osd.26    up    1
>>>>> 27    2                    osd.27    up    1
>>>>> 28    2                    osd.28    up    1
>>>>> 29    2                    osd.29    up    1
>>>>> 30    2                    osd.30    up    1
>>>>> 31    2                    osd.31    up    1
>>>>> 32    2                    osd.32    up    1
>>>>> -8    22            rack r22
>>>>> -2    22                host s201
>>>>> 0    2                    osd.0    up    1
>>>>> 1    2                    osd.1    up    1
>>>>> 2    2                    osd.2    up    1
>>>>> 3    2                    osd.3    up    1
>>>>> 4    2                    osd.4    up    1
>>>>> 5    2                    osd.5    up    1
>>>>> 6    2                    osd.6    up    1
>>>>> 7    2                    osd.7    up    1
>>>>> 8    2                    osd.8    up    1
>>>>> 9    2                    osd.9    up    1
>>>>> 10    2                    osd.10    up    1
>>>>> -14    21            rack r21
>>>>> -11    21                host s203
>>>>> 44    2                    osd.44    up    1
>>>>> 45    2                    osd.45    up    1
>>>>> 46    2                    osd.46    up    1
>>>>> 47    2                    osd.47    up    1
>>>>> 48    2                    osd.48    up    1
>>>>> 49    2                    osd.49    up    0
>>>>> 50    2                    osd.50    up    0
>>>>> 51    2                    osd.51    up    0
>>>>> 52    1.5                    osd.52    up    0
>>>>> 53    1.5                    osd.53    up    0
>>>>> 54    2                    osd.54    up    0
>>>>>
>>>>>
>>>>> ceph -w
>>>>>
>>>>> 2013-04-20 19:46:48.608988 mon.0 [INF] pgmap v1352767: 17280 pgs: 58
>>>>> active, 12581 active+clean, 1686 active+remapped+wait_backfill, 24
>>>>> active+degraded+wait_backfill, 224
>>>>> active+remapped+wait_backfill+backfill_toofull, 1061
>>>>> active+recovery_wait, 4
>>>>> active+degraded+wait_backfill+backfill_toofull, 629 peering, 626
>>>>> active+remapped, 72 active+remapped+backfilling, 89 active+degraded,
>>>>> 14 active+remapped+backfill_toofull, 1 active+clean+scrubbing, 8
>>>>> active+degraded+remapped+wait_backfill, 20
>>>>> active+recovery_wait+remapped, 5
>>>>> active+degraded+remapped+wait_backfill+backfill_toofull, 162
>>>>> remapped+peering, 1 active+degraded+remapped+backfilling, 2
>>>>> active+degraded+remapped+backfill_toofull, 13 active+recovering; 49777
>>>>> GB data, 72863 GB used, 40568 GB / 110 TB avail; 2965687/21848501
>>>>> degraded (13.574%);  recovering 5 o/s, 16363B/s
>>>>>
>>>>> 2013/4/20 John Wilkins <john.wilkins@xxxxxxxxxxx>:
>>>>> > Marco,
>>>>> >
>>>>> > If you do a "ceph osd tree", can you see whether your OSDs are all up?
>>>>> > You seem to have at least one problem related to the backfill OSDs
>>>>> > being too full, and some that are near full or full for the purposes
>>>>> > of storage. See the following in the documentation to see if it helps:
>>>>> >
>>>>> >
>>>>> > http://ceph.com/docs/master/rados/configuration/mon-config-ref/#storage-capacity
>>>>> >
>>>>> > http://ceph.com/docs/master/rados/configuration/osd-config-ref/#backfilling
>>>>> >
>>>>> > http://ceph.com/docs/master/rados/operations/troubleshooting-osd/#no-free-drive-space
>>>>> >
>>>>> > Before you start deleting data as a remedy, you'd want to at least try
>>>>> > to get the OSDs back up and running first.
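>>>>> > For example, something along these lines on the affected node (assuming
>>>>> > the stock sysvinit script from 0.56 on Ubuntu; adjust to however you
>>>>> > deployed):
>>>>> >
>>>>> >   sudo /etc/init.d/ceph start osd.59   # run on the node hosting osd.59
>>>>> >   ceph osd in 59                       # mark it back in once the daemon is up
>>>>> >   ceph osd tree                        # confirm it now shows as up and in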
>>>>> >
>>>>> > If rebooting changed the drive names, you might look here:
>>>>> >
>>>>> > http://ceph.com/docs/master/rados/configuration/osd-config-ref/#general-settings
>>>>> >
>>>>> > We have default settings for OSD and journal paths, which you could
>>>>> > override if you can locate the data and journal sources on the renamed
>>>>> > drives. If you mounted them but didn't add them to the fstab, that might
>>>>> > be the source of the problem. I'd rather see you use the default paths,
>>>>> > as it would be easier to troubleshoot later. So did you mount the
>>>>> > drives, but not add the mount points to fstab?
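>>>>> > For reference, the defaults are roughly these (check the docs for your
>>>>> > release):
>>>>> >
>>>>> >   [osd]
>>>>> >       osd data = /var/lib/ceph/osd/$cluster-$id
>>>>> >       osd journal = /var/lib/ceph/osd/$cluster-$id/journal
>>>>> >
>>>>> > If the data and journals already live under those paths, no per-OSD
>>>>> > override should be needed -- only the fstab mounts have to stay stable.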
>>>>> >
>>>>> > John
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> > On Sat, Apr 20, 2013 at 8:46 AM, Marco Aroldi <marco.aroldi@xxxxxxxxx>
>>>>> > wrote:
>>>>> >>
>>>>> >> Hi,
>>>>> >> due to a hardware failure while expanding Ceph, I'm in big trouble
>>>>> >> because CephFS doesn't mount anymore.
>>>>> >> I was adding a couple of storage nodes, but a disk failed and, after a
>>>>> >> reboot, the OS (Ubuntu 12.04) renamed the remaining devices, so the
>>>>> >> entire node got screwed up.
>>>>> >>
>>>>> >> Now, from the "sane" new node, I'm bringing some new OSDs up and in
>>>>> >> because the cluster is near full and I can't completely revert the
>>>>> >> situation to the way it was before.
>>>>> >>
>>>>> >> *I can* afford data loss, but I need to regain access to the filesystem.
>>>>> >>
>>>>> >> My setup:
>>>>> >> 3 mon + 3 mds
>>>>> >> 4 storage nodes (I was adding nodes 5 and 6)
>>>>> >>
>>>>> >> Ceph 0.56.4
>>>>> >>
>>>>> >>
>>>>> >> ceph health:
>>>>> >> HEALTH_ERR 2008 pgs backfill; 246 pgs backfill_toofull; 74 pgs
>>>>> >> backfilling; 134 pgs degraded; 790 pgs peering; 10 pgs recovering;
>>>>> >> 1116 pgs recovery_wait; 790 pgs stuck inactive; 4782 pgs stuck
>>>>> >> unclean; recovery 3049459/21926624 degraded (13.908%);  recovering 6
>>>>> >> o/s, 16316KB/s; 4 full osd(s); 30 near full osd(s); full,noup,nodown
>>>>> >> flag(s) set
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >> ceph mds dump:
>>>>> >> dumped mdsmap epoch 44
>>>>> >> epoch    44
>>>>> >> flags    0
>>>>> >> created    2013-03-18 14:42:29.330548
>>>>> >> modified    2013-04-20 17:14:32.969332
>>>>> >> tableserver    0
>>>>> >> root    0
>>>>> >> session_timeout    60
>>>>> >> session_autoclose    300
>>>>> >> last_failure    43
>>>>> >> last_failure_osd_epoch    18160
>>>>> >> compat    compat={},rocompat={},incompat={1=base v0.20,2=client
>>>>> >> writeable ranges,3=default file layouts on dirs,4=dir inode in
>>>>> >> separate object}
>>>>> >> max_mds    1
>>>>> >> in    0
>>>>> >> up    {0=6376}
>>>>> >> failed
>>>>> >> stopped
>>>>> >> data_pools    [0]
>>>>> >> metadata_pool    1
>>>>> >> 6376:    192.168.21.11:6800/13457 'm1' mds.0.9 up:replay seq 1
>>>>> >> 5945:    192.168.21.13:6800/12999 'm3' mds.-1.0 up:standby seq 1
>>>>> >> 5963:    192.168.21.12:6800/22454 'm2' mds.-1.0 up:standby seq 1
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >> ceph mon dump:
>>>>> >> epoch 1
>>>>> >> fsid d634f7b3-8a8a-4893-bdfb-a95ccca7fddd
>>>>> >> last_changed 2013-03-18 14:39:42.253923
>>>>> >> created 2013-03-18 14:39:42.253923
>>>>> >> 0: 192.168.21.11:6789/0 mon.m1
>>>>> >> 1: 192.168.21.12:6789/0 mon.m2
>>>>> >> 2: 192.168.21.13:6789/0 mon.m3
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> > --
>>>>> > John Wilkins
>>>>> > Senior Technical Writer
>>>>> > Inktank
>>>>> > john.wilkins@xxxxxxxxxxx
>>>>> > (415) 425-9599
>>>>> > http://inktank.com
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> John Wilkins
>>>> Senior Technical Writer
>>>> Inktank
>>>> john.wilkins@xxxxxxxxxxx
>>>> (415) 425-9599
>>>> http://inktank.com
>>>>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



