Re: how to re-add a deleted OSD device as an OSD with data

Maybe I found the problem:

smartctl -a /dev/sda | grep Media_Wearout_Indicator
233 Media_Wearout_Indicator 0x0032   001   001   000    Old_age   Always
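
On these Intel SSDs the normalized Media_Wearout_Indicator starts at 100 and
counts down; a value of 001 means the rated write endurance is used up. To
check both journal SSDs at once, a quick sketch (assuming the SSDs are sda
and sdb, as in the lsblk output quoted below):

    for dev in /dev/sda /dev/sdb; do
        echo "== $dev =="
        smartctl -A "$dev" | grep Media_Wearout_Indicator
    done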

root@node-65:~# fio -direct=1 -bs=4k -ramp_time=40 -runtime=100
-size=20g -filename=./testfio.file -ioengine=libaio -iodepth=8
-norandommap -randrepeat=0 -time_based -rw=randwrite -name "osd.0 4k
randwrite test"
osd.0 4k randwrite test: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K,
ioengine=libaio, iodepth=8
fio-2.1.10
Starting 1 process
osd.0 4k randwrite test: Laying out IO file(s) (1 file(s) / 20480MB)
Jobs: 1 (f=1): [w] [100.0% done] [0KB/252KB/0KB /s] [0/63/0 iops] [eta 00m:00s]
osd.0 4k randwrite test: (groupid=0, jobs=1): err= 0: pid=30071: Wed
Mar 30 13:38:27 2016
  write: io=79528KB, bw=814106B/s, iops=198, runt=100032msec
    slat (usec): min=5, max=1031.3K, avg=363.76, stdev=13260.26
    clat (usec): min=109, max=1325.7K, avg=39755.66, stdev=81798.27
     lat (msec): min=3, max=1325, avg=40.25, stdev=83.48
    clat percentiles (msec):
     |  1.00th=[   30],  5.00th=[   30], 10.00th=[   30], 20.00th=[   31],
     | 30.00th=[   31], 40.00th=[   31], 50.00th=[   31], 60.00th=[   36],
     | 70.00th=[   36], 80.00th=[   36], 90.00th=[   36], 95.00th=[   36],
     | 99.00th=[  165], 99.50th=[  799], 99.90th=[ 1221], 99.95th=[ 1237],
     | 99.99th=[ 1319]
    bw (KB  /s): min=    0, max= 1047, per=100.00%, avg=844.21, stdev=291.89
    lat (usec) : 250=0.01%
    lat (msec) : 4=0.02%, 10=0.14%, 20=0.31%, 50=98.13%, 100=0.25%
    lat (msec) : 250=0.23%, 500=0.16%, 750=0.24%, 1000=0.22%, 2000=0.34%
  cpu          : usr=0.18%, sys=1.27%, ctx=22838, majf=0, minf=27
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=143.7%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=19875/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=8

Run status group 0 (all jobs):
  WRITE: io=79528KB, aggrb=795KB/s, minb=795KB/s, maxb=795KB/s,
mint=100032msec, maxt=100032msec

Disk stats (read/write):
  sda: ios=864/28988, merge=0/5738, ticks=31932/1061860,
in_queue=1093892, util=99.99%
root@node-65:~#

The lifetime of this SSD is over: the wearout indicator is at its floor, and
4k random writes manage only ~200 IOPS with latency spikes above a second.
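
To double-check the journal path specifically, a single-threaded 4k O_SYNC
sequential write test (roughly the filestore journal's I/O pattern) should
show the same collapse. A sketch, writing to a scratch file on the SSD's
filesystem rather than to the live journal partition (the file path is just
an example):

    fio --name=journal-test --filename=/root/journal-test.file --size=1g \
        --direct=1 --sync=1 --rw=write --bs=4k --iodepth=1 --numjobs=1 \
        --runtime=60 --time_based

A healthy DC S3500 should sustain on the order of a few thousand such sync
write IOPS; a worn-out one will crawl.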

Thanks so much, Christian.
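
P.S. For the record, once the journal SSD is replaced, bringing an OSD back
should go roughly like this (a sketch using the standard ceph-osd and upstart
tooling on this node; osd.0 and the 2.73 weight of the other 3TB OSDs are
examples):

    stop ceph-osd id=0
    ceph-osd -i 0 --flush-journal       # flush pending journal entries to the filestore
    # ...replace the SSD and recreate the journal partition (/dev/sda5)...
    ceph-osd -i 0 --mkjournal           # initialize the new journal
    start ceph-osd id=0
    ceph osd crush reweight osd.0 2.73  # restore the crush weight so the OSD takes data again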

2016-03-30 12:19 GMT+08:00 lin zhou <hnuzhoulin2@xxxxxxxxx>:
> 2016-03-29 14:50 GMT+08:00 Christian Balzer <chibi@xxxxxxx>:
>>
>> Hello,
>>
>> On Tue, 29 Mar 2016 14:00:44 +0800 lin zhou wrote:
>>
>>> Hi, Christian.
>>> When I re-add these OSDs (0, 3, 9, 12, 15), the high latency occurs again.
>>> The default reweight of these OSDs is 0.0.
>>>
>> That makes no sense; at a crush weight (not reweight) of 0 they should not
>> get used at all.
>>
>> When you deleted the other OSD (6?) because of high latency, was your only
>> reason/data point the "ceph osd perf" output?
>
> Because this is a near-production environment, when the OSD latency and
> the system latency were high I deleted these OSDs to get it working again first.
>
>>> root@node-65:~# ceph osd tree
>>> # id    weight  type name       up/down reweight
>>> -1      103.7   root default
>>> -2      8.19            host node-65
>>> 18      2.73                    osd.18  up      1
>>> 21      0                       osd.21  up      1
>>> 24      2.73                    osd.24  up      1
>>> 27      2.73                    osd.27  up      1
>>> 30      0                       osd.30  up      1
>>> 33      0                       osd.33  up      1
>>> 0       0                       osd.0   up      1
>>> 3       0                       osd.3   up      1
>>> 6       0                       osd.6   down    0
>>> 9       0                       osd.9   up      1
>>> 12      0                       osd.12  up      1
>>> 15      0                       osd.15  up      1
>>>
>>> ceph osd perf:
>>>   osd  fs_commit_latency(ms)  fs_apply_latency(ms)
>>>     0                  9825                10211
>>>     3                  9398                 9775
>>>     9                 35852                36904
>>>    12                 24716                25626
>>>    15                 18893                19633
>>>
>> This could very well be old, stale data.
>> Still, these are some seriously bad numbers, if they are real.
>>
>> Do these perf numbers change at all? My guess would be no.
>
> Yes, they never change.
>
>>
>>> but iostat for these devices is empty.
>> Unsurprising, as they should not be used by Ceph with a weight of 0.
>> atop gives you an even better, complete view.
>>
>>> smartctl reports no errors on these OSD devices.
>>>
>> What exactly are these devices (model please)? 3TB SATA drives, I assume?
>> How are they connected (controller)?
>
> Yes, 3TB SATA, Model Number: WDC WD3000FYYZ-01UL1B1.
> Today I tried setting the reweight of osd.0 to 0.1 and then checked; some
> useful data turned up.
>
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>            1.63    0.00    0.48   16.15    0.00   81.75
>
> Device:  rrqm/s  wrqm/s   r/s   w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz    await  r_await  w_await   svctm   %util
> sda        0.00    0.00  0.00  2.00   0.00   1.00   1024.00     39.85  1134.00     0.00  1134.00  500.00  100.00
> sda1       0.00    0.00  0.00  0.00   0.00   0.00      0.00      0.00     0.00     0.00     0.00    0.00    0.00
> sda2       0.00    0.00  0.00  0.00   0.00   0.00      0.00      0.00     0.00     0.00     0.00    0.00    0.00
> sda3       0.00    0.00  0.00  0.00   0.00   0.00      0.00      1.00     0.00     0.00     0.00    0.00  100.40
> sda4       0.00    0.00  0.00  0.00   0.00   0.00      0.00      0.00     0.00     0.00     0.00    0.00    0.00
> sda5       0.00    0.00  0.00  2.00   0.00   1.00   1024.00     32.32  1134.00     0.00  1134.00  502.00  100.40
> sda6       0.00    0.00  0.00  0.00   0.00   0.00      0.00      0.66     0.00     0.00     0.00    0.00   66.40
> sda7       0.00    0.00  0.00  0.00   0.00   0.00      0.00      0.00     0.00     0.00     0.00    0.00    0.00
> sda8       0.00    0.00  0.00  0.00   0.00   0.00      0.00      1.00     0.00     0.00     0.00    0.00  100.00
> sda9       0.00    0.00  0.00  0.00   0.00   0.00      0.00      1.00     0.00     0.00     0.00    0.00  100.00
> sda10      0.00    0.00  0.00  0.00   0.00   0.00      0.00      1.00     0.00     0.00     0.00    0.00  100.00
>
> ^C^C^C^C^C^C^C^C^C
> root@node-65:~# ls -l /var/lib/ceph/osd/ceph-0
> total 62924048
> -rw-r--r--   1 root root         487 Oct 12 16:49 activate.monmap
> -rw-r--r--   1 root root           3 Oct 12 16:49 active
> -rw-r--r--   1 root root          37 Oct 12 16:49 ceph_fsid
> drwxr-xr-x 280 root root        8192 Mar 30 11:58 current
> -rw-r--r--   1 root root          37 Oct 12 16:49 fsid
> lrwxrwxrwx   1 root root           9 Oct 12 16:49 journal -> /dev/sda5
> -rw-------   1 root root          56 Oct 12 16:49 keyring
> -rw-r--r--   1 root root          21 Oct 12 16:49 magic
> -rw-r--r--   1 root root           6 Oct 12 16:49 ready
> -rw-r--r--   1 root root           4 Oct 12 16:49 store_version
> -rw-r--r--   1 root root          42 Oct 12 16:49 superblock
> -rw-r--r--   1 root root 64424509440 Mar 30 10:20 testfio.file
> -rw-r--r--   1 root root           0 Mar 28 09:54 upstart
> -rw-r--r--   1 root root           2 Oct 12 16:49 whoami
>
> The journal of osd.0 is sda5, and it is extremely busy; CPU iowait in top
> is 30% and the system is slow.
>
> So maybe the problem is sda5? It is an Intel SSDSC2BB120G4.
> We use two SSDs for the journals and the system.
>
> root@node-65:~# lsblk
> NAME               MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
> sda                  8:0    0 111.8G  0 disk
> ├─sda1               8:1    0    22M  0 part
> ├─sda2               8:2    0   191M  0 part /boot
> ├─sda3               8:3    0  43.9G  0 part /
> ├─sda4               8:4    0   3.8G  0 part [SWAP]
> ├─sda5               8:5    0  10.2G  0 part
> ├─sda6               8:6    0  10.2G  0 part
> ├─sda7               8:7    0  10.2G  0 part
> ├─sda8               8:8    0  10.2G  0 part
> ├─sda9               8:9    0  10.2G  0 part
> └─sda10              8:10   0  10.2G  0 part
> sdb                  8:16   0 111.8G  0 disk
> ├─sdb1               8:17   0    24M  0 part
> ├─sdb2               8:18   0  10.2G  0 part
> ├─sdb3               8:19   0  10.2G  0 part
> ├─sdb4               8:20   0  10.2G  0 part
> ├─sdb5               8:21   0  10.2G  0 part
> ├─sdb6               8:22   0  10.2G  0 part
> ├─sdb7               8:23   0  10.2G  0 part
> └─sdb8               8:24   0  50.1G  0 part
>
>> Christian
>>> 2016-03-29 13:22 GMT+08:00 lin zhou <hnuzhoulin2@xxxxxxxxx>:
>>> > Thanks. I tried this method, just as the Ceph documentation says.
>>> > But I only tested osd.6 this way, and the leveldb of osd.6 is
>>> > broken, so it cannot start.
>>> >
>>> > When I try this for the other OSDs, it works.
>>> >
>>> > 2016-03-29 8:22 GMT+08:00 Christian Balzer <chibi@xxxxxxx>:
>>> >> On Mon, 28 Mar 2016 18:36:14 +0800 lin zhou wrote:
>>> >>
>>> >>> > Hello,
>>> >>> >
>>> >>> > On Sun, 27 Mar 2016 13:41:57 +0800 lin zhou wrote:
>>> >>> >
>>> >>> > > Hi, guys.
>>> >>> > > A few days ago one OSD showed large latency in ceph osd
>>> >>> > > perf, and this device gave the node high CPU iowait.
>>> >>> > The thing to do at that point would have been to look at things with
>>> >>> > atop or iostat to verify that it was the device itself that was
>>> >>> > slow, and not genuinely busy due to uneven activity,
>>> >>> > as well as a quick glance at SMART, of course.
>>> >>>
>>> >>> Thanks. I will follow this the next time I face this problem.
>>> >>>
>>> >>> > > So I deleted this OSD and then checked the device.
>>> >>> > If that device (HDD, SSD, which model?) slowed down your cluster,
>>> >>> > you should not have deleted it.
>>> >>> > The best method would have been to set your cluster to noout and
>>> >>> > stop that specific OSD.
>>> >>> >
>>> >>> > When you say "delete", what exact steps did you take?
>>> >>> > Did this include removing it from the crush map?
>>> >>>
>>> >>> Yes, I deleted it from the crush map, deleted its auth, and removed the OSD.
>>> >>>
>>> >>
>>> >> Google is your friend; if you deleted it like in the link below, you
>>> >> should be able to re-add it the same way:
>>> >> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-June/002345.html
>>> >>
>>> >> Christian
>>> >>
>>> >>> > > But no errors were found.
>>> >>> > >
>>> >>> > > And now I want to re-add this device to the cluster with its data.
>>> >>> > >
>>> >>> > All the data was already replicated elsewhere when you
>>> >>> > deleted/removed the OSD, so you're likely not going to save much if
>>> >>> > any data movement by re-adding it.
>>> >>>
>>> >>> Yes, the cluster finished rebalancing, but I face a problem with one
>>> >>> unfound object. The recovery_state in the pg query output says
>>> >>> this OSD is down, but the other OSDs are OK.
>>> >>> So I want to recover this OSD in order to recover the unfound object.
>>> >>>
>>> >>> and mark_unfound_lost revert/delete do not work:
>>> >>> Error EINVAL: pg has 1 unfound objects but we haven't probed all
>>> >>> sources,
>>> >>>
>>> >>> For details, see:
>>> >>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-March/008452.html
>>> >>>
>>> >>> Thanks again.
>>> >>>
>>> >>> > >
>>> >>> > > I tried using ceph-osd to add it, but it cannot start. Logs are
>>> >>> > > pasted at:
>>> >>> > > https://gist.github.com/hnuzhoulin/836f9e633b90041e89ad
>>> >>> > >
>>> >>> > > So what are the recommended steps?
>>> >>> > That depends on how you deleted it, but at this point your data is
>>> >>> > likely to be mostly stale anyway, so I'd start from scratch.
>>> >>>
>>> >>> > Christian
>>> >>> > --
>>> >>> > Christian Balzer Network/Systems Engineer
>>> >>> > chibi@xxxxxxx Global OnLine Japan/Rakuten Communications
>>> >>> > http://www.gol.com/
>>> >>> >
>>> >>>
>>> >>
>>> >>
>>> >> --
>>> >> Christian Balzer        Network/Systems Engineer
>>> >> chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
>>> >> http://www.gol.com/
>>>
>>
>>
>> --
>> Christian Balzer        Network/Systems Engineer
>> chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
>> http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



