Hello,

On Wed, 30 Mar 2016 12:19:57 +0800 lin zhou wrote:

> 2016-03-29 14:50 GMT+08:00 Christian Balzer <chibi@xxxxxxx>:
> >
> > Hello,
> >
> > On Tue, 29 Mar 2016 14:00:44 +0800 lin zhou wrote:
> >
> >> Hi, Christian.
> >> When I re-add these OSDs (0,3,9,12,15), the high latency occurs again.
> >> The default reweight of these OSDs is 0.0.
> >>
> > That makes no sense, at a crush weight (not reweight) of 0 they should
> > not get used at all.
> >
> > When you deleted the other OSD (6?) because of high latency, was your
> > only reason/data point the "ceph osd perf" output?
>
> Because this is a near-production environment, when the osd latency and
> the system latency were high I deleted these osds to let it work first.
>
Understandable, but atop/iostat would have shown you the problem at that
time.

> >> root@node-65:~# ceph osd tree
> >> # id  weight  type name       up/down  reweight
> >> -1    103.7   root default
> >> -2    8.19        host node-65
> >> 18    2.73            osd.18   up      1
> >> 21    0               osd.21   up      1
> >> 24    2.73            osd.24   up      1
> >> 27    2.73            osd.27   up      1
> >> 30    0               osd.30   up      1
> >> 33    0               osd.33   up      1
> >> 0     0               osd.0    up      1
> >> 3     0               osd.3    up      1
> >> 6     0               osd.6    down    0
> >> 9     0               osd.9    up      1
> >> 12    0               osd.12   up      1
> >> 15    0               osd.15   up      1
> >>
> >> ceph osd perf:
> >> 0     9825    10211
> >> 3     9398    9775
> >> 9     35852   36904
> >> 12    24716   25626
> >> 15    18893   19633
> >>
> > This could very well be old, stale data.
> > Still, these are some seriously bad numbers, if they are real.
> >
> > Do these perf numbers change at all? My guess would be no.
>
> Yes, they are never changed.
>
They only get updated when something changes, as in written to the OSD.

> >
> >> but iostat of these devices is empty.
> > Unsurprising, as they should not be used by Ceph with a weight of 0.
> > atop gives you an even better, complete view.
> >
> >> smartctl says no errors were found on these OSD devices.
> >>
> > What exactly are these devices (model please), 3TB SATA drives I
> > assume? How are they connected (controller)?
>
> Yes, 3TB SATA, Model Number: WDC WD3000FYYZ-01UL1B1
> and today I tried to set the osd.0 reweight to 0.1 and then checked;
> some useful data was found.
>
These HDDs should be fine, I'm not aware of any bugs with them, unlike the
Toshiba DT drives.

> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>            1.63    0.00    0.48   16.15    0.00   81.75
>
> Device:  rrqm/s  wrqm/s    r/s    w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz    await  r_await  w_await   svctm   %util
> sda        0.00    0.00   0.00   2.00   0.00   1.00   1024.00     39.85  1134.00     0.00  1134.00  500.00  100.00
> sda1       0.00    0.00   0.00   0.00   0.00   0.00      0.00      0.00     0.00     0.00     0.00    0.00    0.00
> sda2       0.00    0.00   0.00   0.00   0.00   0.00      0.00      0.00     0.00     0.00     0.00    0.00    0.00
> sda3       0.00    0.00   0.00   0.00   0.00   0.00      0.00      1.00     0.00     0.00     0.00    0.00  100.40
> sda4       0.00    0.00   0.00   0.00   0.00   0.00      0.00      0.00     0.00     0.00     0.00    0.00    0.00
> sda5       0.00    0.00   0.00   2.00   0.00   1.00   1024.00     32.32  1134.00     0.00  1134.00  502.00  100.40
> sda6       0.00    0.00   0.00   0.00   0.00   0.00      0.00      0.66     0.00     0.00     0.00    0.00   66.40
> sda7       0.00    0.00   0.00   0.00   0.00   0.00      0.00      0.00     0.00     0.00     0.00    0.00    0.00
> sda8       0.00    0.00   0.00   0.00   0.00   0.00      0.00      1.00     0.00     0.00     0.00    0.00  100.00
> sda9       0.00    0.00   0.00   0.00   0.00   0.00      0.00      1.00     0.00     0.00     0.00    0.00  100.00
> sda10      0.00    0.00   0.00   0.00   0.00   0.00      0.00      1.00     0.00     0.00     0.00    0.00  100.00
>
That's incredibly slow, and all your other journal partitions are at 100%
utilization as well; clearly something is wrong with that SSD.
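A quick way to confirm that from the node itself (a sketch; it assumes the
journal SSD really is /dev/sda as in the iostat output above, adjust device
names to your layout):

    atop 2                   # per-disk busy%, queue depth and avio, refreshed every 2 seconds
    iostat -xm 2 /dev/sda    # extended stats for just the suspect device
    smartctl -a /dev/sda     # SMART attributes and error counters of the SSD

If await and %util stay that high while only ~1MB/s is being written, the
device itself is the bottleneck rather than genuine load.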
> ^C^C^C^C^C^C^C^C^C
> root@node-65:~# ls -l /var/lib/ceph/osd/ceph-0
> total 62924048
> -rw-r--r--   1 root root         487 Oct 12 16:49 activate.monmap
> -rw-r--r--   1 root root           3 Oct 12 16:49 active
> -rw-r--r--   1 root root          37 Oct 12 16:49 ceph_fsid
> drwxr-xr-x 280 root root        8192 Mar 30 11:58 current
> -rw-r--r--   1 root root          37 Oct 12 16:49 fsid
> lrwxrwxrwx   1 root root           9 Oct 12 16:49 journal -> /dev/sda5
> -rw-------   1 root root          56 Oct 12 16:49 keyring
> -rw-r--r--   1 root root          21 Oct 12 16:49 magic
> -rw-r--r--   1 root root           6 Oct 12 16:49 ready
> -rw-r--r--   1 root root           4 Oct 12 16:49 store_version
> -rw-r--r--   1 root root          42 Oct 12 16:49 superblock
> -rw-r--r--   1 root root 64424509440 Mar 30 10:20 testfio.file
> -rw-r--r--   1 root root           0 Mar 28 09:54 upstart
> -rw-r--r--   1 root root           2 Oct 12 16:49 whoami
>
> The journal of osd.0 is sda5; it is very busy, and cpu wait in top is
> 30%. The system is slow.
>
> So maybe this is a problem with sda5? It is an INTEL SSDSC2BB120G4;
> we use two SSDs for journal and system.
>
That's a 120GB Intel DC S3500. While not a speed daemon, it should not be
anywhere near that slow. Something is wrong with it; my suspicion is that
it's worn out or has encountered a firmware bug.
Can you post the "smartctl -a" output for it?

> root@node-65:~# lsblk
> NAME     MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
> sda        8:0    0 111.8G  0 disk
> ├─sda1     8:1    0    22M  0 part
> ├─sda2     8:2    0   191M  0 part /boot
> ├─sda3     8:3    0  43.9G  0 part /
> ├─sda4     8:4    0   3.8G  0 part [SWAP]
> ├─sda5     8:5    0  10.2G  0 part
> ├─sda6     8:6    0  10.2G  0 part
> ├─sda7     8:7    0  10.2G  0 part
> ├─sda8     8:8    0  10.2G  0 part
> ├─sda9     8:9    0  10.2G  0 part
> └─sda10    8:10   0  10.2G  0 part
> sdb        8:16   0 111.8G  0 disk
> ├─sdb1     8:17   0    24M  0 part
> ├─sdb2     8:18   0  10.2G  0 part
> ├─sdb3     8:19   0  10.2G  0 part
> ├─sdb4     8:20   0  10.2G  0 part
> ├─sdb5     8:21   0  10.2G  0 part
> ├─sdb6     8:22   0  10.2G  0 part
> ├─sdb7     8:23   0  10.2G  0 part
> └─sdb8     8:24   0  50.1G  0 part
>
Several points here:

1. A RAID1 for your OS/swap would be advantageous if you lose an SSD or
   have to replace one when it wears out.

2. DC S3500 120GB have an endurance of only 70TBW; I'd never use them for
   journals. DC S3610s if you know your write levels for sure, DC S3710s
   otherwise.

3. 6 journals on this kind of SSD (top write speed 135MB/s) is total
   overload. Your journals should roughly be as fast as your network (so
   if you have 2x1Gb/s that would actually be OK) and/or about 100MB/s
   per HDD, depending on which one is smaller.
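To put rough numbers on point 3 (a back-of-the-envelope sketch using the
figures above: ~135MB/s sequential write for the 120GB DC S3500, ~100MB/s
per HDD, and a hypothetical 2x1Gb/s network):

    6 journals x ~100MB/s   = ~600MB/s worst-case journal demand
    1 x DC S3500 120GB      = ~135MB/s sequential write ceiling
    2 x 1Gb/s network       = roughly 200-230MB/s aggregate client ingest

So even when the network caps client traffic, recovery/backfill can still
hit all six journals at once and the SSD becomes the choke point long
before the HDDs do.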
Christian

>
> Christian
> >> 2016-03-29 13:22 GMT+08:00 lin zhou <hnuzhoulin2@xxxxxxxxx>:
> >> > Thanks. I tried this method just like the ceph documentation says.
> >> > But I only tested osd.6 in this way, and the leveldb of osd.6 is
> >> > broken, so it cannot start.
> >> >
> >> > When I try this for other osds, it works.
> >> >
> >> > 2016-03-29 8:22 GMT+08:00 Christian Balzer <chibi@xxxxxxx>:
> >> >> On Mon, 28 Mar 2016 18:36:14 +0800 lin zhou wrote:
> >> >>
> >> >>> > Hello,
> >> >>> >
> >> >>> > On Sun, 27 Mar 2016 13:41:57 +0800 lin zhou wrote:
> >> >>> >
> >> >>> > > Hi, guys.
> >> >>> > > Some days ago one osd had a large latency, as seen in ceph osd
> >> >>> > > perf, and this device gave this node a high cpu await.
> >> >>> > The thing to do at that point would have been to look at things
> >> >>> > with atop or iostat to verify that it was the device itself
> >> >>> > that was slow and not because it was genuinely busy due to
> >> >>> > uneven activity maybe. As well as a quick glance at SMART of
> >> >>> > course.
> >> >>>
> >> >>> Thanks. I will follow this when I face this problem next time.
> >> >>>
> >> >>> > > So, I deleted this osd and then checked this device.
> >> >>> > If that device (HDD, SSD, which model?) slowed down your
> >> >>> > cluster, you should not have deleted it.
> >> >>> > The best method would have been to set your cluster to noout and
> >> >>> > stop that specific OSD.
> >> >>> >
> >> >>> > When you say "delete", what exact steps did you take?
> >> >>> > Did this include removing it from the crush map?
> >> >>>
> >> >>> Yes, I deleted it from the crush map, deleted its auth, and removed
> >> >>> the osd.
> >> >>>
> >> >>
> >> >> Google is your friend, if you deleted it like in the link below you
> >> >> should be able to re-add it the same way:
> >> >> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-June/002345.html
> >> >>
> >> >> Christian
> >> >>
> >> >>> > > But no errors were found.
> >> >>> > >
> >> >>> > > And now I want to re-add this device into the cluster with its
> >> >>> > > data.
> >> >>> > >
> >> >>> > All the data was already replicated elsewhere if you
> >> >>> > deleted/removed the OSD, you're likely not going to save much if
> >> >>> > any data movement by re-adding it.
> >> >>>
> >> >>> Yes, the cluster finished rebalancing, but I face a problem of one
> >> >>> unfound object. And the output of pg query in recovery_state
> >> >>> says this osd is down, but the other osds are ok.
> >> >>> So I want to recover this osd to recover this unfound object.
> >> >>>
> >> >>> and mark_unfound_lost revert/delete does not work:
> >> >>> Error EINVAL: pg has 1 unfound objects but we haven't probed all
> >> >>> sources,
> >> >>>
> >> >>> for details see:
> >> >>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-March/008452.html
> >> >>>
> >> >>> Thanks again.
> >> >>>
> >> >>> > >
> >> >>> > > I tried using ceph-osd to add it, but it cannot start. Logs are
> >> >>> > > pasted at:
> >> >>> > > https://gist.github.com/hnuzhoulin/836f9e633b90041e89ad
> >> >>> > >
> >> >>> > > So what are the recommended steps?
> >> >>> > That depends on how you deleted it, but at this point your data
> >> >>> > is likely to be mostly stale anyway, so I'd start from scratch.
> >> >>>
> >> >>> > Christian
> >> >>> > --
> >> >>> > Christian Balzer        Network/Systems Engineer
> >> >>> > chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
> >> >>> > http://www.gol.com/
> >> >>> >
> >> >>>
> >> >>
> >> >>
> >> >> --
> >> >> Christian Balzer        Network/Systems Engineer
> >> >> chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
> >> >> http://www.gol.com/
> >> >
> >
> >
> > --
> > Christian Balzer        Network/Systems Engineer
> > chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
> > http://www.gol.com/
>

--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com