-28 == -ENOSPC (No space left on device). I think it's due to the fact
that some osds are near full.

Yan, Zheng

On Mon, Sep 30, 2013 at 10:39 PM, Eric Eastman <eric0e@xxxxxxx> wrote:
> I have 5 RBD kernel based clients, all using kernel 3.11.1, running
> Ubuntu 13.04, that all failed with a write error at the same time, and I
> need help to figure out what caused the failure.
>
> The 5 clients were all using the same pool, and each had its own image,
> with an 18TB XFS file system on each client.
>
> The errors reported in syslog on all 5 clients, which all came at about
> the same time, were:
>
> Sep 26 16:51:44 tca10 kernel: [244870.621836] rbd: rbd1: write 8000 at 89591958000 (158000)
> Sep 26 16:51:44 tca10 kernel: [244870.621842] rbd: rbd1: result -28 xferred 8000
>
> Sep 26 16:51:52 tca14 kernel: [245058.782519] rbd: rbd1: write 8000 at 89593150000 (150000)
> Sep 26 16:51:52 tca14 kernel: [245058.782524] rbd: rbd1: result -28 xferred 8000
>
> Sep 26 16:51:33 tca15 kernel: [245043.427752] rbd: rbd1: write 8000 at 89593638000 (238000)
> Sep 26 16:51:33 tca15 kernel: [245043.427758] rbd: rbd1: result -28 xferred 8000
>
> Sep 26 16:51:40 tca16 kernel: [245054.429278] rbd: rbd1: write 8000 at 89593128000 (128000)
> Sep 26 16:51:40 tca16 kernel: [245054.429284] rbd: rbd1: result -28 xferred 8000
>
> Sep 26 16:51:23 k6 kernel: [90574.093432] rbd: rbd1: write 80000 at f3e93a80000 (280000)
> Sep 26 16:51:23 k6 kernel: [90574.093441] rbd: rbd1: result -28 xferred 80000
>
> The client systems had been running read/write tests, and some of the
> clients had been running for more than 2 days before the failure.
>
> The ceph version on the cluster is 0.67.3, running on Ubuntu 13.04 with
> 3.11.1 kernels. The cluster config includes 3 monitors and 6 OSD nodes
> with 15 disk drives each, for a total of 90 OSDs.
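[Editor's note, not part of the thread: the "result -28" in the client logs above is a negated errno value, and on Linux errno 28 is ENOSPC, which is how Yan arrives at the diagnosis. A minimal check using Python's errno module:]

```python
import errno
import os

# The rbd kernel driver reports block-layer failures as negated errno
# values, so "rbd: rbd1: result -28" means -ENOSPC.
result = -28
assert -result == errno.ENOSPC  # errno 28 on Linux (values are platform-specific)

# Human-readable form, matching the message Yan quotes
print(os.strerror(-result))  # No space left on device
```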
> All monitors and OSDs are running:
>
> # ceph -v
> ceph version 0.67.3 (408cd61584c72c0d97b774b3d8f95c6b1b06341a)
>
> ---
> An ls of the rbd-pool shows:
>
> # rbd ls -l -p rbd-pool
> NAME       SIZE    PARENT  FMT  PROT  LOCK
> k6_tst     17578G          1
> tca10_tst  17578G          1
> tca14_tst  17578G          1
> tca15_tst  17578G          1
> tca16_tst  17578G          1
>
> ---
> There is still space in the pool:
>
> # ceph df
> GLOBAL:
>     SIZE  AVAIL  RAW USED  %RAW USED
>     249T  118T   131T      52.60
>
> POOLS:
>     NAME      ID  USED    %USED  OBJECTS
>     data      0   0       0      0
>     metadata  1   0       0      0
>     rbd       2   8       0      1
>     rbd-pool  3   67187G  26.30  17713336
>
> ---
> # ceph health detail
> HEALTH_WARN 9 near full osd(s)
> osd.9 is near full at 85%
> osd.29 is near full at 85%
> osd.43 is near full at 91%
> osd.45 is near full at 88%
> osd.47 is near full at 88%
> osd.55 is near full at 94%
> osd.59 is near full at 94%
> osd.67 is near full at 94%
> osd.83 is near full at 94%
>
> ---
> I did find these messages on one of my monitors, which occurred around
> the same time as the write failure:
>
> 2013-09-26 16:50:43.567007 7fc2cc197700  0 mon.tca11@0(leader).data_health(10) update_stats avail 91% total 70303160 used 2625788 avail 64083132
> 2013-09-26 16:51:23.519378 7fc2cc197700  1 mon.tca11@0(leader).osd e769 New setting for CEPH_OSDMAP_FULL -- doing propose
> 2013-09-26 16:51:23.520896 7fc2cb996700  1 mon.tca11@0(leader).osd e770 e770: 90 osds: 90 up, 90 in full
> 2013-09-26 16:51:23.521808 7fc2cb996700  0 log [INF] : osdmap e770: 90 osds: 90 up, 90 in full
> 2013-09-26 16:51:43.567118 7fc2cc197700  0 mon.tca11@0(leader).data_health(10) update_stats avail 91% total 70303160 used 2631904 avail 64077016
> 2013-09-26 16:52:43.567227 7fc2cc197700  0 mon.tca11@0(leader).data_health(10) update_stats avail 91% total 70303160 used 2632956 avail 64075964
> 2013-09-26 16:53:28.534868 7fc2cc197700  1 mon.tca11@0(leader).osd e770 New setting for CEPH_OSDMAP_FULL -- doing propose
> 2013-09-26 16:53:28.536477 7fc2cb996700  1 mon.tca11@0(leader).osd e771 e771: 90 osds: 90 up, 90 in
> 2013-09-26 16:53:28.538782 7fc2cb996700  0 log [INF] : osdmap e771: 90 osds: 90 up, 90 in
> 2013-09-26 16:53:43.567331 7fc2cc197700  0 mon.tca11@0(leader).data_health(10) update_stats avail 91% total 70303160 used 2623788 avail 64085132
>
> ---
> All my OSDs are reporting they are up:
>
> # ceph osd tree
> # id    weight  type name       up/down reweight
> -1      249.7   root default
> -2      51.72           host tca22
> 0       3.63                    osd.0   up      1
> 6       3.63                    osd.6   up      1
> 12      3.63                    osd.12  up      1
> 18      3.63                    osd.18  up      1
> 24      3.63                    osd.24  up      1
> 30      3.63                    osd.30  up      1
> 36      2.72                    osd.36  up      1
> 42      3.63                    osd.42  up      1
> 48      3.63                    osd.48  up      1
> 54      2.72                    osd.54  up      1
> 60      3.63                    osd.60  up      1
> 66      3.63                    osd.66  up      1
> 72      2.72                    osd.72  up      1
> 78      3.63                    osd.78  up      1
> 84      3.63                    osd.84  up      1
> -3      31.5            host tca23
> 1       3.63                    osd.1   up      1
> 7       0.26                    osd.7   up      1
> 13      2.72                    osd.13  up      1
> 19      2.72                    osd.19  up      1
> 25      0.26                    osd.25  up      1
> 31      3.63                    osd.31  up      1
> 37      2.72                    osd.37  up      1
> 43      0.26                    osd.43  up      1
> 49      3.63                    osd.49  up      1
> 55      0.26                    osd.55  up      1
> 61      3.63                    osd.61  up      1
> 67      0.26                    osd.67  up      1
> 73      3.63                    osd.73  up      1
> 79      0.26                    osd.79  up      1
> 85      3.63                    osd.85  up      1
> -4      51.72           host tca24
> 2       3.63                    osd.2   up      1
> 8       3.63                    osd.8   up      1
> 14      3.63                    osd.14  up      1
> 20      3.63                    osd.20  up      1
> 26      3.63                    osd.26  up      1
> 32      3.63                    osd.32  up      1
> 38      2.72                    osd.38  up      1
> 44      3.63                    osd.44  up      1
> 50      3.63                    osd.50  up      1
> 56      2.72                    osd.56  up      1
> 62      3.63                    osd.62  up      1
> 68      3.63                    osd.68  up      1
> 74      2.72                    osd.74  up      1
> 80      3.63                    osd.80  up      1
> 86      3.63                    osd.86  up      1
> -5      31.5            host tca25
> 3       3.63                    osd.3   up      1
> 9       0.26                    osd.9   up      1
> 15      2.72                    osd.15  up      1
> 21      2.72                    osd.21  up      1
> 27      0.26                    osd.27  up      1
> 33      3.63                    osd.33  up      1
> 39      2.72                    osd.39  up      1
> 45      0.26                    osd.45  up      1
> 51      3.63                    osd.51  up      1
> 57      0.26                    osd.57  up      1
> 63      3.63                    osd.63  up      1
> 69      0.26                    osd.69  up      1
> 75      3.63                    osd.75  up      1
> 81      0.26                    osd.81  up      1
> 87      3.63                    osd.87  up      1
> -6      51.72           host tca26
> 4       3.63                    osd.4   up      1
> 10      3.63                    osd.10  up      1
> 16      3.63                    osd.16  up      1
> 22      3.63                    osd.22  up      1
> 28      3.63                    osd.28  up      1
> 34      3.63                    osd.34  up      1
> 40      2.72                    osd.40  up      1
> 46      3.63                    osd.46  up      1
> 52      3.63                    osd.52  up      1
> 58      2.72                    osd.58  up      1
> 64      3.63                    osd.64  up      1
> 70      3.63                    osd.70  up      1
> 76      2.72                    osd.76  up      1
> 82      3.63                    osd.82  up      1
> 88      3.63                    osd.88  up      1
> -7      31.5            host tca27
> 5       3.63                    osd.5   up      1
> 11      0.26                    osd.11  up      1
> 17      2.72                    osd.17  up      1
> 23      2.72                    osd.23  up      1
> 29      0.26                    osd.29  up      1
> 35      3.63                    osd.35  up      1
> 41      2.72                    osd.41  up      1
> 47      0.26                    osd.47  up      1
> 53      3.63                    osd.53  up      1
> 59      0.26                    osd.59  up      1
> 65      3.63                    osd.65  up      1
> 71      0.26                    osd.71  up      1
> 77      3.63                    osd.77  up      1
> 83      0.26                    osd.83  up      1
> 89      3.63                    osd.89  up      1
>
> Kernel version on all systems:
>
> # cat /proc/version
> Linux version 3.11.1-031101-generic (apw@gomeisa) (gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5) ) #201309141102 SMP Sat Sep 14 15:02:49 UTC 2013
>
> I would really like to know why it failed before I restart my testing.
>
> Thanks in advance,
>
> Eric
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
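[Editor's note, not part of the thread: cross-referencing the `ceph health detail` and `ceph osd tree` output above shows that every near-full OSD is one of the small weight-0.26 devices, even though the cluster overall is only 52.6% full. The monitor's `CEPH_OSDMAP_FULL` lines mean one of those OSDs briefly crossed the cluster full ratio, which blocks writes cluster-wide and explains ENOSPC on all five clients at once. The 0.85/0.95 ratios below are Ceph's defaults (`mon_osd_nearfull_ratio` / `mon_osd_full_ratio`), assumed here because the thread does not show the cluster's actual settings:]

```python
# Illustrative cross-reference of the thread's own numbers; ratio values
# are ASSUMED Ceph defaults, not taken from this cluster's config.
NEARFULL_RATIO = 0.85
FULL_RATIO = 0.95

# osd id -> utilization, from the HEALTH_WARN output above
near_full = {9: 0.85, 29: 0.85, 43: 0.91, 45: 0.88, 47: 0.88,
             55: 0.94, 59: 0.94, 67: 0.94, 83: 0.94}

# osd id -> CRUSH weight for those same OSDs, from `ceph osd tree` above
weights = {9: 0.26, 29: 0.26, 43: 0.26, 45: 0.26, 47: 0.26,
           55: 0.26, 59: 0.26, 67: 0.26, 83: 0.26}

# Every flagged OSD is a weight-0.26 device: placement variance hits the
# smallest disks hardest, long before the cluster as a whole is full.
assert all(w == 0.26 for w in weights.values())

# The monitor sets CEPH_OSDMAP_FULL as soon as ANY single OSD reaches
# FULL_RATIO. At 94%, the fullest OSDs were one write burst from 95%.
headroom = FULL_RATIO - max(near_full.values())
print(f"headroom on fullest OSD: {headroom:.2%}")  # headroom on fullest OSD: 1.00%
```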