Hello,

On Wed, 30 Mar 2016 13:50:17 +0800 lin zhou wrote:

> maybe I found the problem:
>
> smartctl -a /dev/sda | grep Media_Wearout_Indicator
> 233 Media_Wearout_Indicator 0x0032 001 001 000 Old_age Always
>
Exactly what I thought it would be. See my previous mail.

Christian

> root@node-65:~# fio -direct=1 -bs=4k -ramp_time=40 -runtime=100
> -size=20g -filename=./testfio.file -ioengine=libaio -iodepth=8
> -norandommap -randrepeat=0 -time_based -rw=randwrite -name "osd.0 4k
> randwrite test"
> osd.0 4k randwrite test: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K,
> ioengine=libaio, iodepth=8
> fio-2.1.10
> Starting 1 process
> osd.0 4k randwrite test: Laying out IO file(s) (1 file(s) / 20480MB)
> Jobs: 1 (f=1): [w] [100.0% done] [0KB/252KB/0KB /s] [0/63/0 iops] [eta 00m:00s]
> osd.0 4k randwrite test: (groupid=0, jobs=1): err= 0: pid=30071: Wed Mar 30 13:38:27 2016
>   write: io=79528KB, bw=814106B/s, iops=198, runt=100032msec
>     slat (usec): min=5, max=1031.3K, avg=363.76, stdev=13260.26
>     clat (usec): min=109, max=1325.7K, avg=39755.66, stdev=81798.27
>      lat (msec): min=3, max=1325, avg=40.25, stdev=83.48
>     clat percentiles (msec):
>      |  1.00th=[   30],  5.00th=[   30], 10.00th=[   30], 20.00th=[   31],
>      | 30.00th=[   31], 40.00th=[   31], 50.00th=[   31], 60.00th=[   36],
>      | 70.00th=[   36], 80.00th=[   36], 90.00th=[   36], 95.00th=[   36],
>      | 99.00th=[  165], 99.50th=[  799], 99.90th=[ 1221], 99.95th=[ 1237],
>      | 99.99th=[ 1319]
>     bw (KB /s): min=    0, max= 1047, per=100.00%, avg=844.21, stdev=291.89
>     lat (usec) : 250=0.01%
>     lat (msec) : 4=0.02%, 10=0.14%, 20=0.31%, 50=98.13%, 100=0.25%
>     lat (msec) : 250=0.23%, 500=0.16%, 750=0.24%, 1000=0.22%, 2000=0.34%
>   cpu          : usr=0.18%, sys=1.27%, ctx=22838, majf=0, minf=27
>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=143.7%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued    : total=r=0/w=19875/d=0, short=r=0/w=0/d=0
>      latency   : target=0, window=0, percentile=100.00%, depth=8
>
> Run status group 0 (all jobs):
>   WRITE: io=79528KB, aggrb=795KB/s, minb=795KB/s, maxb=795KB/s,
> mint=100032msec, maxt=100032msec
>
> Disk stats (read/write):
>   sda: ios=864/28988, merge=0/5738, ticks=31932/1061860,
> in_queue=1093892, util=99.99%
> root@node-65:~#
>
> the lifetime of this SSD is over.
>
> Thanks so much, Christian.
>
> 2016-03-30 12:19 GMT+08:00 lin zhou <hnuzhoulin2@xxxxxxxxx>:
> > 2016-03-29 14:50 GMT+08:00 Christian Balzer <chibi@xxxxxxx>:
> >>
> >> Hello,
> >>
> >> On Tue, 29 Mar 2016 14:00:44 +0800 lin zhou wrote:
> >>
> >>> Hi, Christian.
> >>> When I re-add these OSDs (0,3,9,12,15), the high latency occurs
> >>> again. The default reweight of these OSDs is 0.0.
> >>>
> >> That makes no sense, at a crush weight (not reweight) of 0 they
> >> should not get used at all.
> >>
> >> When you deleted the other OSD (6?) because of high latency, was your
> >> only reason/data point the "ceph osd perf" output?
> >
> > Because this is a near-production environment, I deleted these OSDs as
> > soon as the OSD latency and the system latency went up, to keep things
> > working first.
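A minimal sketch for checking how much rated endurance is left on both journal
SSDs in a node at once (assuming smartmontools is installed and that sda and
sdb are the two Intel SSDs, as in the lsblk output further down):

for dev in /dev/sda /dev/sdb; do
    echo "=== $dev ==="
    # Attribute 233 counts down from 100; a normalized value of 001 means
    # the drive has used up its rated write endurance.
    smartctl -A "$dev" | grep Media_Wearout_Indicator
done

Both SSDs were presumably put in at the same time and see similar journal
traffic, so if sda is at 001 the second one is worth checking as well.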
> >
> >>> root@node-65:~# ceph osd tree
> >>> # id    weight  type name       up/down reweight
> >>> -1      103.7   root default
> >>> -2      8.19            host node-65
> >>> 18      2.73                    osd.18  up      1
> >>> 21      0                       osd.21  up      1
> >>> 24      2.73                    osd.24  up      1
> >>> 27      2.73                    osd.27  up      1
> >>> 30      0                       osd.30  up      1
> >>> 33      0                       osd.33  up      1
> >>> 0       0                       osd.0   up      1
> >>> 3       0                       osd.3   up      1
> >>> 6       0                       osd.6   down    0
> >>> 9       0                       osd.9   up      1
> >>> 12      0                       osd.12  up      1
> >>> 15      0                       osd.15  up      1
> >>>
> >>> ceph osd perf:
> >>> 0   9825  10211
> >>> 3   9398   9775
> >>> 9  35852  36904
> >>> 12 24716  25626
> >>> 15 18893  19633
> >>>
> >> This could very well be old, stale data.
> >> Still, these are some seriously bad numbers, if they are real.
> >>
> >> Do these perf numbers change at all? My guess would be no.
> >
> > Yes, they never change.
> >
> >>
> >>> but iostat for these devices is empty.
> >> Unsurprising, as they should not be used by Ceph with a weight of 0.
> >> atop gives you an even better, complete view.
> >>
> >>> smartctl finds no errors on these OSD devices.
> >>>
> >> What exactly are these devices (model please), 3TB SATA drives I
> >> assume? How are they connected (controller)?
> >
> > Yes, 3TB SATA, Model Number: WDC WD3000FYYZ-01UL1B1.
> > And today I tried setting the reweight of osd.0 to 0.1 and then checked
> > again; some useful data turned up.
> >
> > avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
> >            1.63   0.00     0.48    16.15    0.00  81.75
> >
> > Device: rrqm/s wrqm/s  r/s  w/s rMB/s wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> > sda       0.00   0.00 0.00 2.00  0.00  1.00  1024.00    39.85 1134.00    0.00 1134.00 500.00 100.00
> > sda1      0.00   0.00 0.00 0.00  0.00  0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
> > sda2      0.00   0.00 0.00 0.00  0.00  0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
> > sda3      0.00   0.00 0.00 0.00  0.00  0.00     0.00     1.00    0.00    0.00    0.00   0.00 100.40
> > sda4      0.00   0.00 0.00 0.00  0.00  0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
> > sda5      0.00   0.00 0.00 2.00  0.00  1.00  1024.00    32.32 1134.00    0.00 1134.00 502.00 100.40
> > sda6      0.00   0.00 0.00 0.00  0.00  0.00     0.00     0.66    0.00    0.00    0.00   0.00  66.40
> > sda7      0.00   0.00 0.00 0.00  0.00  0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
> > sda8      0.00   0.00 0.00 0.00  0.00  0.00     0.00     1.00    0.00    0.00    0.00   0.00 100.00
> > sda9      0.00   0.00 0.00 0.00  0.00  0.00     0.00     1.00    0.00    0.00    0.00   0.00 100.00
> > sda10     0.00   0.00 0.00 0.00  0.00  0.00     0.00     1.00    0.00    0.00    0.00   0.00 100.00
> >
> > ^C^C^C^C^C^C^C^C^C
> > root@node-65:~# ls -l /var/lib/ceph/osd/ceph-0
> > total 62924048
> > -rw-r--r--   1 root root         487 Oct 12 16:49 activate.monmap
> > -rw-r--r--   1 root root           3 Oct 12 16:49 active
> > -rw-r--r--   1 root root          37 Oct 12 16:49 ceph_fsid
> > drwxr-xr-x 280 root root        8192 Mar 30 11:58 current
> > -rw-r--r--   1 root root          37 Oct 12 16:49 fsid
> > lrwxrwxrwx   1 root root           9 Oct 12 16:49 journal -> /dev/sda5
> > -rw-------   1 root root          56 Oct 12 16:49 keyring
> > -rw-r--r--   1 root root          21 Oct 12 16:49 magic
> > -rw-r--r--   1 root root           6 Oct 12 16:49 ready
> > -rw-r--r--   1 root root           4 Oct 12 16:49 store_version
> > -rw-r--r--   1 root root          42 Oct 12 16:49 superblock
> > -rw-r--r--   1 root root 64424509440 Mar 30 10:20 testfio.file
> > -rw-r--r--   1 root root           0 Mar 28 09:54 upstart
> > -rw-r--r--   1 root root           2 Oct 12 16:49 whoami
> >
> > The journal of osd.0 is sda5 and it is completely busy; CPU iowait in
> > top is around 30% and the whole system is slow.
> >
> > So maybe the problem is sda5? It is an INTEL SSDSC2BB120G4.
> > We use two SSDs for the journals and the system.
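If sda really is the culprit, the journals on it can be moved to a fresh
device without rebuilding the OSDs. A rough sketch for one OSD, assuming a
filestore OSD with a journal symlink as in the ls output above, upstart (as
on this node), and a hypothetical replacement partition /dev/sdc5:

ceph osd set noout                  # keep the cluster from marking OSDs out and rebalancing
stop ceph-osd id=0
ceph-osd -i 0 --flush-journal       # write out anything still sitting in the old journal
ln -sf /dev/sdc5 /var/lib/ceph/osd/ceph-0/journal   # /dev/sdc5 is a placeholder for the new partition
ceph-osd -i 0 --mkjournal           # initialize the new journal
start ceph-osd id=0
ceph osd unset noout

Repeat for the other OSDs journaling on sda, adjusting the id and partition
each time.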
> >
> > root@node-65:~# lsblk
> > NAME     MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
> > sda        8:0    0 111.8G  0 disk
> > ├─sda1     8:1    0    22M  0 part
> > ├─sda2     8:2    0   191M  0 part /boot
> > ├─sda3     8:3    0  43.9G  0 part /
> > ├─sda4     8:4    0   3.8G  0 part [SWAP]
> > ├─sda5     8:5    0  10.2G  0 part
> > ├─sda6     8:6    0  10.2G  0 part
> > ├─sda7     8:7    0  10.2G  0 part
> > ├─sda8     8:8    0  10.2G  0 part
> > ├─sda9     8:9    0  10.2G  0 part
> > └─sda10    8:10   0  10.2G  0 part
> > sdb        8:16   0 111.8G  0 disk
> > ├─sdb1     8:17   0    24M  0 part
> > ├─sdb2     8:18   0  10.2G  0 part
> > ├─sdb3     8:19   0  10.2G  0 part
> > ├─sdb4     8:20   0  10.2G  0 part
> > ├─sdb5     8:21   0  10.2G  0 part
> > ├─sdb6     8:22   0  10.2G  0 part
> > ├─sdb7     8:23   0  10.2G  0 part
> > └─sdb8     8:24   0  50.1G  0 part
> >
> >> Christian
> >>> 2016-03-29 13:22 GMT+08:00 lin zhou <hnuzhoulin2@xxxxxxxxx>:
> >>> > Thanks. I tried this method just as the Ceph documentation says.
> >>> > But I only tested osd.6 this way, and the leveldb of osd.6 is
> >>> > broken, so it cannot start.
> >>> >
> >>> > When I try this for the other OSDs, it works.
> >>> >
> >>> > 2016-03-29 8:22 GMT+08:00 Christian Balzer <chibi@xxxxxxx>:
> >>> >> On Mon, 28 Mar 2016 18:36:14 +0800 lin zhou wrote:
> >>> >>
> >>> >>> > Hello,
> >>> >>> >
> >>> >>> > On Sun, 27 Mar 2016 13:41:57 +0800 lin zhou wrote:
> >>> >>> >
> >>> >>> > > Hi, guys.
> >>> >>> > > Some days ago one OSD showed a large latency in "ceph osd
> >>> >>> > > perf", and this device gave the node a high CPU iowait.
> >>> >>> > The thing to do at that point would have been to look at things
> >>> >>> > with atop or iostat to verify that it was the device itself
> >>> >>> > that was slow and not because it was genuinely busy due to
> >>> >>> > uneven activity maybe. As well as a quick glance at SMART of
> >>> >>> > course.
> >>> >>>
> >>> >>> Thanks. I will follow this when I face this problem next time.
> >>> >>>
> >>> >>> > > So I deleted this OSD and then checked the device.
> >>> >>> > If that device (HDD, SSD, which model?) slowed down your
> >>> >>> > cluster, you should not have deleted it.
> >>> >>> > The best method would have been to set your cluster to noout
> >>> >>> > and stop that specific OSD.
> >>> >>> >
> >>> >>> > When you say "delete", what exact steps did you take?
> >>> >>> > Did this include removing it from the crush map?
> >>> >>>
> >>> >>> Yes, I deleted it from the crush map, deleted its auth, and
> >>> >>> removed the OSD.
> >>> >>>
> >>> >>
> >>> >> Google is your friend, if you deleted it like in the link below
> >>> >> you should be able to re-add it the same way:
> >>> >> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-June/002345.html
> >>> >>
> >>> >> Christian
> >>> >>
> >>> >>> > > But no errors were found.
> >>> >>> > >
> >>> >>> > > And now I want to re-add this device to the cluster with its
> >>> >>> > > data.
> >>> >>> > >
> >>> >>> > All the data was already replicated elsewhere if you
> >>> >>> > deleted/removed the OSD, you're likely not going to save much
> >>> >>> > if any data movement by re-adding it.
> >>> >>>
> >>> >>> Yes, the cluster finished rebalancing, but I face a problem of
> >>> >>> one unfound object. And the recovery_state in the pg query output
> >>> >>> says this OSD is down, while the other OSDs are ok.
> >>> >>> So I want to recover this OSD in order to recover the unfound
> >>> >>> object.
> >>> >>>
> >>> >>> And mark_unfound_lost revert/delete does not work:
> >>> >>> Error EINVAL: pg has 1 unfound objects but we haven't probed all
> >>> >>> sources,
> >>> >>>
> >>> >>> For details see:
> >>> >>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-March/008452.html
> >>> >>>
> >>> >>> Thanks again.
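On the unfound object: a minimal sketch of how to see which OSDs the PG still
wants to probe before mark_unfound_lost is accepted (<pgid> is a placeholder
for the affected placement group):

ceph health detail                       # shows which PG has the unfound object
ceph pg <pgid> list_missing              # lists the unfound object(s) in that PG
ceph pg <pgid> query                     # recovery_state -> might_have_unfound shows
                                         # which OSDs still need to be queried
ceph pg <pgid> mark_unfound_lost revert  # only accepted once every source has been
                                         # probed or is marked lost

As long as one of the OSDs in might_have_unfound is down but not lost, the
EINVAL above is expected; that is exactly why getting the deleted OSD back up,
or explicitly marking it lost (ceph osd lost <id> --yes-i-really-mean-it),
matters here.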
> >>> >>>
> >>> >>> > >
> >>> >>> > > I tried using ceph-osd to add it, but it cannot start. Logs
> >>> >>> > > are pasted at:
> >>> >>> > > https://gist.github.com/hnuzhoulin/836f9e633b90041e89ad
> >>> >>> > >
> >>> >>> > > So what are the recommended steps?
> >>> >>> > That depends on how you deleted it, but at this point your
> >>> >>> > data is likely to be mostly stale anyway, so I'd start from
> >>> >>> > scratch.
> >>> >>>
> >>> >>> > Christian
> >>> >>> > --
> >>> >>> > Christian Balzer           Network/Systems Engineer
> >>> >>> > chibi@xxxxxxx              Global OnLine Japan/Rakuten Communications
> >>> >>> > http://www.gol.com/
> >>> >>> >
> >>> >>>
> >>> >>
> >>> >>
> >>> >> --
> >>> >> Christian Balzer           Network/Systems Engineer
> >>> >> chibi@xxxxxxx              Global OnLine Japan/Rakuten Communications
> >>> >> http://www.gol.com/
> >>>
> >>
> >>
> >> --
> >> Christian Balzer           Network/Systems Engineer
> >> chibi@xxxxxxx              Global OnLine Japan/Rakuten Communications
> >> http://www.gol.com/
>

--
Christian Balzer           Network/Systems Engineer
chibi@xxxxxxx              Global OnLine Japan/Rakuten Communications
http://www.gol.com/

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com