Hello,

On Wed, 30 Mar 2016 13:50:17 +0800 lin zhou wrote:

> maybe I found the problem:
>
> smartctl -a /dev/sda | grep Media_Wearout_Indicator
> 233 Media_Wearout_Indicator 0x0032 001 001 000 Old_age Always
>
Exactly what I thought it would be. See my previous mail.

Christian

> root@node-65:~# fio -direct=1 -bs=4k -ramp_time=40 -runtime=100
> -size=20g -filename=./testfio.file -ioengine=libaio -iodepth=8
> -norandommap -randrepeat=0 -time_based -rw=randwrite -name "osd.0 4k
> randwrite test"
> osd.0 4k randwrite test: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K,
> ioengine=libaio, iodepth=8
> fio-2.1.10
> Starting 1 process
> osd.0 4k randwrite test: Laying out IO file(s) (1 file(s) / 20480MB)
> Jobs: 1 (f=1): [w] [100.0% done] [0KB/252KB/0KB /s] [0/63/0 iops] [eta 00m:00s]
> osd.0 4k randwrite test: (groupid=0, jobs=1): err= 0: pid=30071: Wed Mar 30 13:38:27 2016
>   write: io=79528KB, bw=814106B/s, iops=198, runt=100032msec
>     slat (usec): min=5, max=1031.3K, avg=363.76, stdev=13260.26
>     clat (usec): min=109, max=1325.7K, avg=39755.66, stdev=81798.27
>      lat (msec): min=3, max=1325, avg=40.25, stdev=83.48
>     clat percentiles (msec):
>      |  1.00th=[   30],  5.00th=[   30], 10.00th=[   30], 20.00th=[   31],
>      | 30.00th=[   31], 40.00th=[   31], 50.00th=[   31], 60.00th=[   36],
>      | 70.00th=[   36], 80.00th=[   36], 90.00th=[   36], 95.00th=[   36],
>      | 99.00th=[  165], 99.50th=[  799], 99.90th=[ 1221], 99.95th=[ 1237],
>      | 99.99th=[ 1319]
>     bw (KB /s): min=    0, max= 1047, per=100.00%, avg=844.21, stdev=291.89
>     lat (usec) : 250=0.01%
>     lat (msec) : 4=0.02%, 10=0.14%, 20=0.31%, 50=98.13%, 100=0.25%
>     lat (msec) : 250=0.23%, 500=0.16%, 750=0.24%, 1000=0.22%, 2000=0.34%
>   cpu          : usr=0.18%, sys=1.27%, ctx=22838, majf=0, minf=27
>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=143.7%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued    : total=r=0/w=19875/d=0, short=r=0/w=0/d=0
>      latency   : target=0, window=0, percentile=100.00%, depth=8
>
> Run status group 0 (all jobs):
>   WRITE: io=79528KB, aggrb=795KB/s, minb=795KB/s, maxb=795KB/s,
> mint=100032msec, maxt=100032msec
>
> Disk stats (read/write):
>   sda: ios=864/28988, merge=0/5738, ticks=31932/1061860,
> in_queue=1093892, util=99.99%
> root@node-65:~#
>
> the lifetime of this SSD is over.
>
> Thanks so much, Christian.
>
> 2016-03-30 12:19 GMT+08:00 lin zhou <hnuzhoulin2@xxxxxxxxx>:
> > 2016-03-29 14:50 GMT+08:00 Christian Balzer <chibi@xxxxxxx>:
> >>
> >> Hello,
> >>
> >> On Tue, 29 Mar 2016 14:00:44 +0800 lin zhou wrote:
> >>
> >>> Hi, Christian.
> >>> When I re-add these OSDs (0,3,9,12,15), the high latency occurs
> >>> again. The default reweight of these OSDs is 0.0.
> >>>
> >> That makes no sense, at a crush weight (not reweight) of 0 they
> >> should not get used at all.
> >>
> >> When you deleted the other OSD (6?) because of high latency, was your
> >> only reason/data point the "ceph osd perf" output?
> >
> > Because this is a near-production environment, I deleted these OSDs as
> > soon as the OSD latency and the system latency went up, to keep things
> > working first.
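A minimal sketch for checking how much rated endurance is left on both journal
SSDs in a node at once (assuming smartmontools is installed and that sda and
sdb are the two Intel SSDs, as in the lsblk output further down):

for dev in /dev/sda /dev/sdb; do
    echo "=== $dev ==="
    # Attribute 233 counts down from 100; a normalized value of 001 means
    # the drive has used up its rated write endurance.
    smartctl -A "$dev" | grep Media_Wearout_Indicator
done

Both SSDs were presumably put in at the same time and see similar journal
traffic, so if sda is at 001 the second one is worth checking as well.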
> >
> >>> root@node-65:~# ceph osd tree
> >>> # id    weight  type name       up/down reweight
> >>> -1      103.7   root default
> >>> -2      8.19            host node-65
> >>> 18      2.73                    osd.18  up      1
> >>> 21      0                       osd.21  up      1
> >>> 24      2.73                    osd.24  up      1
> >>> 27      2.73                    osd.27  up      1
> >>> 30      0                       osd.30  up      1
> >>> 33      0                       osd.33  up      1
> >>> 0       0                       osd.0   up      1
> >>> 3       0                       osd.3   up      1
> >>> 6       0                       osd.6   down    0
> >>> 9       0                       osd.9   up      1
> >>> 12      0                       osd.12  up      1
> >>> 15      0                       osd.15  up      1
> >>>
> >>> ceph osd perf:
> >>> 0   9825  10211
> >>> 3   9398   9775
> >>> 9  35852  36904
> >>> 12 24716  25626
> >>> 15 18893  19633
> >>>
> >> This could very well be old, stale data.
> >> Still, these are some seriously bad numbers, if they are real.
> >>
> >> Do these perf numbers change at all? My guess would be no.
> >
> > Yes, they never change.
> >
> >>
> >>> but iostat for these devices is empty.
> >> Unsurprising, as they should not be used by Ceph with a weight of 0.
> >> atop gives you an even better, complete view.
> >>
> >>> smartctl finds no errors on these OSD devices.
> >>>
> >> What exactly are these devices (model please), 3TB SATA drives I
> >> assume? How are they connected (controller)?
> >
> > Yes, 3TB SATA, Model Number: WDC WD3000FYYZ-01UL1B1.
> > And today I tried setting the reweight of osd.0 to 0.1 and then checked
> > again; some useful data turned up.
> >
> > avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
> >            1.63   0.00     0.48    16.15    0.00  81.75
> >
> > Device: rrqm/s wrqm/s  r/s  w/s rMB/s wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> > sda       0.00   0.00 0.00 2.00  0.00  1.00  1024.00    39.85 1134.00    0.00 1134.00 500.00 100.00
> > sda1      0.00   0.00 0.00 0.00  0.00  0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
> > sda2      0.00   0.00 0.00 0.00  0.00  0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
> > sda3      0.00   0.00 0.00 0.00  0.00  0.00     0.00     1.00    0.00    0.00    0.00   0.00 100.40
> > sda4      0.00   0.00 0.00 0.00  0.00  0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
> > sda5      0.00   0.00 0.00 2.00  0.00  1.00  1024.00    32.32 1134.00    0.00 1134.00 502.00 100.40
> > sda6      0.00   0.00 0.00 0.00  0.00  0.00     0.00     0.66    0.00    0.00    0.00   0.00  66.40
> > sda7      0.00   0.00 0.00 0.00  0.00  0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
> > sda8      0.00   0.00 0.00 0.00  0.00  0.00     0.00     1.00    0.00    0.00    0.00   0.00 100.00
> > sda9      0.00   0.00 0.00 0.00  0.00  0.00     0.00     1.00    0.00    0.00    0.00   0.00 100.00
> > sda10     0.00   0.00 0.00 0.00  0.00  0.00     0.00     1.00    0.00    0.00    0.00   0.00 100.00
> >
> > ^C^C^C^C^C^C^C^C^C
> > root@node-65:~# ls -l /var/lib/ceph/osd/ceph-0
> > total 62924048
> > -rw-r--r--   1 root root         487 Oct 12 16:49 activate.monmap
> > -rw-r--r--   1 root root           3 Oct 12 16:49 active
> > -rw-r--r--   1 root root          37 Oct 12 16:49 ceph_fsid
> > drwxr-xr-x 280 root root        8192 Mar 30 11:58 current
> > -rw-r--r--   1 root root          37 Oct 12 16:49 fsid
> > lrwxrwxrwx   1 root root           9 Oct 12 16:49 journal -> /dev/sda5
> > -rw-------   1 root root          56 Oct 12 16:49 keyring
> > -rw-r--r--   1 root root          21 Oct 12 16:49 magic
> > -rw-r--r--   1 root root           6 Oct 12 16:49 ready
> > -rw-r--r--   1 root root           4 Oct 12 16:49 store_version
> > -rw-r--r--   1 root root          42 Oct 12 16:49 superblock
> > -rw-r--r--   1 root root 64424509440 Mar 30 10:20 testfio.file
> > -rw-r--r--   1 root root           0 Mar 28 09:54 upstart
> > -rw-r--r--   1 root root           2 Oct 12 16:49 whoami
> >
> > The journal of osd.0 is sda5 and it is completely busy; CPU iowait in
> > top is around 30% and the whole system is slow.
> >
> > So maybe the problem is sda5? It is an INTEL SSDSC2BB120G4.
> > We use two SSDs for the journals and the system.
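If sda really is the culprit, the journals on it can be moved to a fresh
device without rebuilding the OSDs. A rough sketch for one OSD, assuming a
filestore OSD with a journal symlink as in the ls output above, upstart (as
on this node), and a hypothetical replacement partition /dev/sdc5:

ceph osd set noout                  # keep the cluster from marking OSDs out and rebalancing
stop ceph-osd id=0
ceph-osd -i 0 --flush-journal       # write out anything still sitting in the old journal
ln -sf /dev/sdc5 /var/lib/ceph/osd/ceph-0/journal   # /dev/sdc5 is a placeholder for the new partition
ceph-osd -i 0 --mkjournal           # initialize the new journal
start ceph-osd id=0
ceph osd unset noout

Repeat for the other OSDs journaling on sda, adjusting the id and partition
each time.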
> >
> > root@node-65:~# lsblk
> > NAME     MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
> > sda        8:0    0 111.8G  0 disk
> > ├─sda1     8:1    0    22M  0 part
> > ├─sda2     8:2    0   191M  0 part /boot
> > ├─sda3     8:3    0  43.9G  0 part /
> > ├─sda4     8:4    0   3.8G  0 part [SWAP]
> > ├─sda5     8:5    0  10.2G  0 part
> > ├─sda6     8:6    0  10.2G  0 part
> > ├─sda7     8:7    0  10.2G  0 part
> > ├─sda8     8:8    0  10.2G  0 part
> > ├─sda9     8:9    0  10.2G  0 part
> > └─sda10    8:10   0  10.2G  0 part
> > sdb        8:16   0 111.8G  0 disk
> > ├─sdb1     8:17   0    24M  0 part
> > ├─sdb2     8:18   0  10.2G  0 part
> > ├─sdb3     8:19   0  10.2G  0 part
> > ├─sdb4     8:20   0  10.2G  0 part
> > ├─sdb5     8:21   0  10.2G  0 part
> > ├─sdb6     8:22   0  10.2G  0 part
> > ├─sdb7     8:23   0  10.2G  0 part
> > └─sdb8     8:24   0  50.1G  0 part
> >
> >> Christian
> >>> 2016-03-29 13:22 GMT+08:00 lin zhou <hnuzhoulin2@xxxxxxxxx>:
> >>> > Thanks. I tried this method just as the Ceph documentation says.
> >>> > But I only tested osd.6 this way, and the leveldb of osd.6 is
> >>> > broken, so it cannot start.
> >>> >
> >>> > When I try this for the other OSDs, it works.
> >>> >
> >>> > 2016-03-29 8:22 GMT+08:00 Christian Balzer <chibi@xxxxxxx>:
> >>> >> On Mon, 28 Mar 2016 18:36:14 +0800 lin zhou wrote:
> >>> >>
> >>> >>> > Hello,
> >>> >>> >
> >>> >>> > On Sun, 27 Mar 2016 13:41:57 +0800 lin zhou wrote:
> >>> >>> >
> >>> >>> > > Hi, guys.
> >>> >>> > > Some days ago one OSD showed a large latency in "ceph osd
> >>> >>> > > perf", and this device gave the node a high CPU iowait.
> >>> >>> > The thing to do at that point would have been to look at things
> >>> >>> > with atop or iostat to verify that it was the device itself
> >>> >>> > that was slow and not because it was genuinely busy due to
> >>> >>> > uneven activity maybe. As well as a quick glance at SMART of
> >>> >>> > course.
> >>> >>>
> >>> >>> Thanks. I will follow this when I face this problem next time.
> >>> >>>
> >>> >>> > > So I deleted this OSD and then checked the device.
> >>> >>> > If that device (HDD, SSD, which model?) slowed down your
> >>> >>> > cluster, you should not have deleted it.
> >>> >>> > The best method would have been to set your cluster to noout
> >>> >>> > and stop that specific OSD.
> >>> >>> >
> >>> >>> > When you say "delete", what exact steps did you take?
> >>> >>> > Did this include removing it from the crush map?
> >>> >>>
> >>> >>> Yes, I deleted it from the crush map, deleted its auth, and
> >>> >>> removed the OSD.
> >>> >>>
> >>> >>
> >>> >> Google is your friend, if you deleted it like in the link below
> >>> >> you should be able to re-add it the same way:
> >>> >> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-June/002345.html
> >>> >>
> >>> >> Christian
> >>> >>
> >>> >>> > > But no errors were found.
> >>> >>> > >
> >>> >>> > > And now I want to re-add this device to the cluster with its
> >>> >>> > > data.
> >>> >>> > >
> >>> >>> > All the data was already replicated elsewhere if you
> >>> >>> > deleted/removed the OSD, you're likely not going to save much
> >>> >>> > if any data movement by re-adding it.
> >>> >>>
> >>> >>> Yes, the cluster finished rebalancing, but I face a problem of
> >>> >>> one unfound object. And the recovery_state in the pg query output
> >>> >>> says this OSD is down, while the other OSDs are ok.
> >>> >>> So I want to recover this OSD in order to recover the unfound
> >>> >>> object.
> >>> >>>
> >>> >>> And mark_unfound_lost revert/delete does not work:
> >>> >>> Error EINVAL: pg has 1 unfound objects but we haven't probed all
> >>> >>> sources,
> >>> >>>
> >>> >>> For details see:
> >>> >>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-March/008452.html
> >>> >>>
> >>> >>> Thanks again.
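On the unfound object: a minimal sketch of how to see which OSDs the PG still
wants to probe before mark_unfound_lost is accepted (<pgid> is a placeholder
for the affected placement group):

ceph health detail                       # shows which PG has the unfound object
ceph pg <pgid> list_missing              # lists the unfound object(s) in that PG
ceph pg <pgid> query                     # recovery_state -> might_have_unfound shows
                                         # which OSDs still need to be queried
ceph pg <pgid> mark_unfound_lost revert  # only accepted once every source has been
                                         # probed or is marked lost

As long as one of the OSDs in might_have_unfound is down but not lost, the
EINVAL above is expected; that is exactly why getting the deleted OSD back up,
or explicitly marking it lost (ceph osd lost <id> --yes-i-really-mean-it),
matters here.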
> >>> >>>
> >>> >>> > >
> >>> >>> > > I tried using ceph-osd to add it, but it cannot start. Logs
> >>> >>> > > are pasted at:
> >>> >>> > > https://gist.github.com/hnuzhoulin/836f9e633b90041e89ad
> >>> >>> > >
> >>> >>> > > So what are the recommended steps?
> >>> >>> > That depends on how you deleted it, but at this point your
> >>> >>> > data is likely to be mostly stale anyway, so I'd start from
> >>> >>> > scratch.
> >>> >>>
> >>> >>> > Christian
> >>> >>> > --
> >>> >>> > Christian Balzer           Network/Systems Engineer
> >>> >>> > chibi@xxxxxxx              Global OnLine Japan/Rakuten Communications
> >>> >>> > http://www.gol.com/
> >>> >>> >
> >>> >>>
> >>> >>
> >>> >>
> >>> >> --
> >>> >> Christian Balzer           Network/Systems Engineer
> >>> >> chibi@xxxxxxx              Global OnLine Japan/Rakuten Communications
> >>> >> http://www.gol.com/
> >>>
> >>
> >>
> >> --
> >> Christian Balzer           Network/Systems Engineer
> >> chibi@xxxxxxx              Global OnLine Japan/Rakuten Communications
> >> http://www.gol.com/
>

--
Christian Balzer           Network/Systems Engineer
chibi@xxxxxxx              Global OnLine Japan/Rakuten Communications
http://www.gol.com/

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com