Maybe I found the problem:

smartctl -a /dev/sda | grep Media_Wearout_Indicator
233 Media_Wearout_Indicator 0x0032   001   001   000    Old_age   Always

The normalized Media_Wearout_Indicator has dropped to 001, i.e. the drive reports its
rated write endurance as used up. A direct 4k random-write fio test against it confirms
this: only ~200 iops, with tail latencies of more than a second.

root@node-65:~# fio -direct=1 -bs=4k -ramp_time=40 -runtime=100 -size=20g -filename=./testfio.file -ioengine=libaio -iodepth=8 -norandommap -randrepeat=0 -time_based -rw=randwrite -name "osd.0 4k randwrite test"
osd.0 4k randwrite test: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=8
fio-2.1.10
Starting 1 process
osd.0 4k randwrite test: Laying out IO file(s) (1 file(s) / 20480MB)
Jobs: 1 (f=1): [w] [100.0% done] [0KB/252KB/0KB /s] [0/63/0 iops] [eta 00m:00s]
osd.0 4k randwrite test: (groupid=0, jobs=1): err= 0: pid=30071: Wed Mar 30 13:38:27 2016
  write: io=79528KB, bw=814106B/s, iops=198, runt=100032msec
    slat (usec): min=5, max=1031.3K, avg=363.76, stdev=13260.26
    clat (usec): min=109, max=1325.7K, avg=39755.66, stdev=81798.27
     lat (msec): min=3, max=1325, avg=40.25, stdev=83.48
    clat percentiles (msec):
     |  1.00th=[   30],  5.00th=[   30], 10.00th=[   30], 20.00th=[   31],
     | 30.00th=[   31], 40.00th=[   31], 50.00th=[   31], 60.00th=[   36],
     | 70.00th=[   36], 80.00th=[   36], 90.00th=[   36], 95.00th=[   36],
     | 99.00th=[  165], 99.50th=[  799], 99.90th=[ 1221], 99.95th=[ 1237],
     | 99.99th=[ 1319]
    bw (KB  /s): min=    0, max= 1047, per=100.00%, avg=844.21, stdev=291.89
    lat (usec) : 250=0.01%
    lat (msec) : 4=0.02%, 10=0.14%, 20=0.31%, 50=98.13%, 100=0.25%
    lat (msec) : 250=0.23%, 500=0.16%, 750=0.24%, 1000=0.22%, 2000=0.34%
  cpu          : usr=0.18%, sys=1.27%, ctx=22838, majf=0, minf=27
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=143.7%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=19875/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=8

Run status group 0 (all jobs):
  WRITE: io=79528KB, aggrb=795KB/s, minb=795KB/s, maxb=795KB/s, mint=100032msec, maxt=100032msec

Disk stats (read/write):
  sda: ios=864/28988, merge=0/5738, ticks=31932/1061860, in_queue=1093892, util=99.99%
root@node-65:~#

The lifetime of this SSD is over. Thanks so much, Christian.
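Before swapping anything I also want to rule out the second journal SSD. This is roughly
the check I plan to run on each node; just a sketch, with the device names of node-65, and
note that the wearout attribute name differs per vendor (this Intel drive exposes
Media_Wearout_Indicator, others report e.g. Wear_Leveling_Count):

# check endurance attributes on both journal SSDs
for dev in /dev/sda /dev/sdb; do
    echo "=== $dev ==="
    smartctl -a "$dev" | egrep 'Device Model|Media_Wearout_Indicator|Wear_Leveling_Count'
done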
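And to know which OSDs on this node are affected when the SSD gets replaced, the FileStore
journal symlinks can be listed directly. Again only a sketch, using the paths and devices as
they are on this node:

# map every OSD on this node to the block device its journal lives on
for osd in /var/lib/ceph/osd/ceph-*; do
    echo "$(basename "$osd") journal -> $(readlink -f "$osd/journal")"
done

# then watch only the suspect journal SSD while the cluster is loaded
iostat -x sda 5

Everything whose journal resolves to an sda partition will have to be stopped while the
drive is swapped.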
2016-03-30 12:19 GMT+08:00 lin zhou <hnuzhoulin2@xxxxxxxxx>:
> 2016-03-29 14:50 GMT+08:00 Christian Balzer <chibi@xxxxxxx>:
>>
>> Hello,
>>
>> On Tue, 29 Mar 2016 14:00:44 +0800 lin zhou wrote:
>>
>>> Hi, Christian.
>>> When I re-add these OSDs (0, 3, 9, 12, 15), the high latency occurs again.
>>> The default reweight of these OSDs is 0.0.
>>>
>> That makes no sense; at a crush weight (not reweight) of 0 they should not
>> get used at all.
>>
>> When you deleted the other OSD (6?) because of high latency, was your only
>> reason/data point the "ceph osd perf" output?
>
> Because this is a near-production environment, when the OSD latency and
> the system latency were high I deleted these OSDs to get it working again first.
>
>>> root@node-65:~# ceph osd tree
>>> # id    weight  type name       up/down reweight
>>> -1      103.7   root default
>>> -2      8.19            host node-65
>>> 18      2.73                    osd.18  up      1
>>> 21      0                       osd.21  up      1
>>> 24      2.73                    osd.24  up      1
>>> 27      2.73                    osd.27  up      1
>>> 30      0                       osd.30  up      1
>>> 33      0                       osd.33  up      1
>>> 0       0                       osd.0   up      1
>>> 3       0                       osd.3   up      1
>>> 6       0                       osd.6   down    0
>>> 9       0                       osd.9   up      1
>>> 12      0                       osd.12  up      1
>>> 15      0                       osd.15  up      1
>>>
>>> ceph osd perf (osd / fs_commit_latency ms / fs_apply_latency ms):
>>> 0       9825    10211
>>> 3       9398    9775
>>> 9       35852   36904
>>> 12      24716   25626
>>> 15      18893   19633
>>>
>> This could very well be old, stale data.
>> Still, these are some seriously bad numbers, if they are real.
>>
>> Do these perf numbers change at all? My guess would be no.
>
> Yes, they never change.
>
>>> but iostat of these devices is empty.
>> Unsurprising, as they should not be used by Ceph with a weight of 0.
>> atop gives you an even better, complete view.
>>
>>> smartctl reports no errors on these OSD devices.
>>>
>> What exactly are these devices (model please), 3TB SATA drives I assume?
>> How are they connected (controller)?
>
> Yes, 3TB SATA, Model Number: WDC WD3000FYYZ-01UL1B1.
> And today I tried setting the osd.0 reweight to 0.1 and then checked; some
> useful data turned up.
>
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>            1.63    0.00    0.48   16.15    0.00   81.75
>
> Device:  rrqm/s  wrqm/s   r/s   w/s  rMB/s  wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sda        0.00    0.00  0.00  2.00   0.00   1.00  1024.00    39.85 1134.00    0.00 1134.00 500.00 100.00
> sda1       0.00    0.00  0.00  0.00   0.00   0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
> sda2       0.00    0.00  0.00  0.00   0.00   0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
> sda3       0.00    0.00  0.00  0.00   0.00   0.00     0.00     1.00    0.00    0.00    0.00   0.00 100.40
> sda4       0.00    0.00  0.00  0.00   0.00   0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
> sda5       0.00    0.00  0.00  2.00   0.00   1.00  1024.00    32.32 1134.00    0.00 1134.00 502.00 100.40
> sda6       0.00    0.00  0.00  0.00   0.00   0.00     0.00     0.66    0.00    0.00    0.00   0.00  66.40
> sda7       0.00    0.00  0.00  0.00   0.00   0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
> sda8       0.00    0.00  0.00  0.00   0.00   0.00     0.00     1.00    0.00    0.00    0.00   0.00 100.00
> sda9       0.00    0.00  0.00  0.00   0.00   0.00     0.00     1.00    0.00    0.00    0.00   0.00 100.00
> sda10      0.00    0.00  0.00  0.00   0.00   0.00     0.00     1.00    0.00    0.00    0.00   0.00 100.00
>
> ^C
> root@node-65:~# ls -l /var/lib/ceph/osd/ceph-0
> total 62924048
> -rw-r--r--   1 root root          487 Oct 12 16:49 activate.monmap
> -rw-r--r--   1 root root            3 Oct 12 16:49 active
> -rw-r--r--   1 root root           37 Oct 12 16:49 ceph_fsid
> drwxr-xr-x 280 root root         8192 Mar 30 11:58 current
> -rw-r--r--   1 root root           37 Oct 12 16:49 fsid
> lrwxrwxrwx   1 root root            9 Oct 12 16:49 journal -> /dev/sda5
> -rw-------   1 root root           56 Oct 12 16:49 keyring
> -rw-r--r--   1 root root           21 Oct 12 16:49 magic
> -rw-r--r--   1 root root            6 Oct 12 16:49 ready
> -rw-r--r--   1 root root            4 Oct 12 16:49 store_version
> -rw-r--r--   1 root root           42 Oct 12 16:49 superblock
> -rw-r--r--   1 root root  64424509440 Mar 30 10:20 testfio.file
> -rw-r--r--   1 root root            0 Mar 28 09:54 upstart
> -rw-r--r--   1 root root            2 Oct 12 16:49 whoami
>
> The journal of osd.0 is sda5, and it is so busy that CPU iowait in top is
> 30% and the whole system is slow.
>
> So maybe the problem is sda5? It is an INTEL SSDSC2BB120G4.
> We use two SSDs for the journals and the system.
>
> root@node-65:~# lsblk
> NAME     MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
> sda        8:0    0 111.8G  0 disk
> ├─sda1     8:1    0    22M  0 part
> ├─sda2     8:2    0   191M  0 part /boot
> ├─sda3     8:3    0  43.9G  0 part /
> ├─sda4     8:4    0   3.8G  0 part [SWAP]
> ├─sda5     8:5    0  10.2G  0 part
> ├─sda6     8:6    0  10.2G  0 part
> ├─sda7     8:7    0  10.2G  0 part
> ├─sda8     8:8    0  10.2G  0 part
> ├─sda9     8:9    0  10.2G  0 part
> └─sda10    8:10   0  10.2G  0 part
> sdb        8:16   0 111.8G  0 disk
> ├─sdb1     8:17   0    24M  0 part
> ├─sdb2     8:18   0  10.2G  0 part
> ├─sdb3     8:19   0  10.2G  0 part
> ├─sdb4     8:20   0  10.2G  0 part
> ├─sdb5     8:21   0  10.2G  0 part
> ├─sdb6     8:22   0  10.2G  0 part
> ├─sdb7     8:23   0  10.2G  0 part
> └─sdb8     8:24   0  50.1G  0 part
>
>> Christian
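Given the partition layout quoted above, my rough plan for swapping the worn journal SSD
would be something like the following. This is only an untested sketch: the OSD id list has
to be filled in from the journal symlinks (osd.0 at least points at sda5 here), the new SSD
has to be partitioned the same way (not shown), and this node uses upstart for the service
commands.

ceph osd set noout                       # avoid rebalancing while the OSDs are down

OSDS="0"                                 # plus whichever other ids have their journal on sda
for id in $OSDS; do stop ceph-osd id=$id; done             # stop the affected OSDs (upstart)
for id in $OSDS; do ceph-osd -i $id --flush-journal; done  # flush journals before pulling the SSD

# ...physically replace the SSD and recreate the sda5..sda10 partitions (not shown)...

for id in $OSDS; do ceph-osd -i $id --mkjournal; done      # fresh journals on the new partitions
for id in $OSDS; do start ceph-osd id=$id; done
ceph osd unset noout

The journal symlinks here point at raw /dev/sdaN names, so the replacement partitions need
to come back up under the same names.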
>>> 2016-03-29 13:22 GMT+08:00 lin zhou <hnuzhoulin2@xxxxxxxxx>:
>>> > Thanks. I tried this method just as the Ceph documentation says.
>>> > But I only tested osd.6 this way, and the leveldb of osd.6 is
>>> > broken, so it cannot start.
>>> >
>>> > When I tried this for the other OSDs, it worked.
>>> >
>>> > 2016-03-29 8:22 GMT+08:00 Christian Balzer <chibi@xxxxxxx>:
>>> >> On Mon, 28 Mar 2016 18:36:14 +0800 lin zhou wrote:
>>> >>
>>> >>> > Hello,
>>> >>> >
>>> >>> > On Sun, 27 Mar 2016 13:41:57 +0800 lin zhou wrote:
>>> >>> >
>>> >>> > > Hi, guys.
>>> >>> > > Some days ago one OSD showed a large latency in "ceph osd perf",
>>> >>> > > and this device gave this node a high CPU iowait.
>>> >>> > The thing to do at that point would have been to look at things with
>>> >>> > atop or iostat to verify that it was the device itself that was
>>> >>> > slow and not because it was genuinely busy due to uneven activity
>>> >>> > maybe. As well as a quick glance at SMART, of course.
>>> >>>
>>> >>> Thanks. I will follow this when I face this problem next time.
>>> >>>
>>> >>> > > So I deleted this OSD and then checked this device.
>>> >>> > If that device (HDD, SSD, which model?) slowed down your cluster,
>>> >>> > you should not have deleted it.
>>> >>> > The best method would have been to set your cluster to noout and
>>> >>> > stop that specific OSD.
>>> >>> >
>>> >>> > When you say "delete", what exact steps did you take?
>>> >>> > Did this include removing it from the crush map?
>>> >>>
>>> >>> Yes, I deleted it from the crush map, deleted its auth, and removed the OSD.
>>> >>>
>>> >>
>>> >> Google is your friend; if you deleted it like in the link below you
>>> >> should be able to re-add it the same way:
>>> >> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-June/002345.html
>>> >>
>>> >> Christian
>>> >>
>>> >>> > > But no errors were found.
>>> >>> > >
>>> >>> > > And now I want to re-add this device into the cluster with its data.
>>> >>> > >
>>> >>> > All the data was already replicated elsewhere if you
>>> >>> > deleted/removed the OSD, you're likely not going to save much if
>>> >>> > any data movement by re-adding it.
>>> >>>
>>> >>> Yes, the cluster finished rebalancing, but I face a problem of one
>>> >>> unfound object. And the output of "pg query" in recovery_state
>>> >>> says this OSD is down, but the other OSDs are OK.
>>> >>> So I want to recover this OSD to recover this unfound object.
>>> >>>
>>> >>> And mark_unfound_lost revert/delete does not work:
>>> >>> Error EINVAL: pg has 1 unfound objects but we haven't probed all
>>> >>> sources,
>>> >>>
>>> >>> Details:
>>> >>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-March/008452.html
>>> >>>
>>> >>> Thanks again.
>>> >>>
>>> >>> > >
>>> >>> > > I tried using ceph-osd to add it, but it cannot start. Logs are
>>> >>> > > pasted at:
>>> >>> > > https://gist.github.com/hnuzhoulin/836f9e633b90041e89ad
>>> >>> > >
>>> >>> > > So what are the recommended steps?
>>> >>> > That depends on how you deleted it, but at this point your data is
>>> >>> > likely to be mostly stale anyway, so I'd start from scratch.
>>> >>>
>>> >>> > Christian
>>> >>> > --
>>> >>> > Christian Balzer        Network/Systems Engineer
>>> >>> > chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
>>> >>> > http://www.gol.com/
>>> >>> >
>>> >>>
>>> >>
>>> >>
>>> >> --
>>> >> Christian Balzer        Network/Systems Engineer
>>> >> chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
>>> >> http://www.gol.com/
>>> >>
>>
>> --
>> Christian Balzer        Network/Systems Engineer
>> chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
>> http://www.gol.com/

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com