-28 == -ENOSPC (No space left on device). I think it's due to the fact
that some osds are near full.

Yan, Zheng

On Mon, Sep 30, 2013 at 10:39 PM, Eric Eastman <eric0e@xxxxxxx> wrote:
> I have 5 RBD kernel based clients, all using kernel 3.11.1, running
> Ubuntu 13.04, that all failed with a write error at the same time, and I
> need help to figure out what caused the failure.
>
> The 5 clients were all using the same pool, and each had its own image,
> with an 18TB XFS file system on each client.
>
> The errors reported in syslog on all 5 clients, which all came at about
> the same time, were:
>
> Sep 26 16:51:44 tca10 kernel: [244870.621836] rbd: rbd1: write 8000 at 89591958000 (158000)
> Sep 26 16:51:44 tca10 kernel: [244870.621842] rbd: rbd1: result -28 xferred 8000
>
> Sep 26 16:51:52 tca14 kernel: [245058.782519] rbd: rbd1: write 8000 at 89593150000 (150000)
> Sep 26 16:51:52 tca14 kernel: [245058.782524] rbd: rbd1: result -28 xferred 8000
>
> Sep 26 16:51:33 tca15 kernel: [245043.427752] rbd: rbd1: write 8000 at 89593638000 (238000)
> Sep 26 16:51:33 tca15 kernel: [245043.427758] rbd: rbd1: result -28 xferred 8000
>
> Sep 26 16:51:40 tca16 kernel: [245054.429278] rbd: rbd1: write 8000 at 89593128000 (128000)
> Sep 26 16:51:40 tca16 kernel: [245054.429284] rbd: rbd1: result -28 xferred 8000
>
> Sep 26 16:51:23 k6 kernel: [90574.093432] rbd: rbd1: write 80000 at f3e93a80000 (280000)
> Sep 26 16:51:23 k6 kernel: [90574.093441] rbd: rbd1: result -28 xferred 80000
>
> The client systems had been running read/write tests, and some of the
> clients had been running for more than 2 days before the failure.
>
> The ceph version on the cluster is 0.67.3, running on Ubuntu 13.04 with
> 3.11.1 kernels. The cluster config includes 3 monitors and 6 OSD nodes
> with 15 disk drives each, for a total of 90 OSDs.
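[Editor's note, not part of the thread: the "result -28" in the client logs above is a negated errno value, and on Linux errno 28 is ENOSPC, which is how Yan arrives at the diagnosis. A minimal check using Python's errno module:]

```python
import errno
import os

# The rbd kernel driver reports block-layer failures as negated errno
# values, so "rbd: rbd1: result -28" means -ENOSPC.
result = -28
assert -result == errno.ENOSPC  # errno 28 on Linux (values are platform-specific)

# Human-readable form, matching the message Yan quotes
print(os.strerror(-result))  # No space left on device
```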
> All monitors and OSDs are running:
>
> # ceph -v
> ceph version 0.67.3 (408cd61584c72c0d97b774b3d8f95c6b1b06341a)
>
> ---
> An ls of the rbd-pool shows:
>
> # rbd ls -l -p rbd-pool
> NAME       SIZE    PARENT  FMT  PROT  LOCK
> k6_tst     17578G          1
> tca10_tst  17578G          1
> tca14_tst  17578G          1
> tca15_tst  17578G          1
> tca16_tst  17578G          1
>
> ---
> There is still space in the pool:
>
> # ceph df
> GLOBAL:
>     SIZE  AVAIL  RAW USED  %RAW USED
>     249T  118T   131T      52.60
>
> POOLS:
>     NAME      ID  USED    %USED  OBJECTS
>     data      0   0       0      0
>     metadata  1   0       0      0
>     rbd       2   8       0      1
>     rbd-pool  3   67187G  26.30  17713336
>
> ---
> # ceph health detail
> HEALTH_WARN 9 near full osd(s)
> osd.9 is near full at 85%
> osd.29 is near full at 85%
> osd.43 is near full at 91%
> osd.45 is near full at 88%
> osd.47 is near full at 88%
> osd.55 is near full at 94%
> osd.59 is near full at 94%
> osd.67 is near full at 94%
> osd.83 is near full at 94%
>
> ---
> I did find these messages on one of my monitors, which occurred around
> the same time as the write failure:
>
> 2013-09-26 16:50:43.567007 7fc2cc197700  0 mon.tca11@0(leader).data_health(10) update_stats avail 91% total 70303160 used 2625788 avail 64083132
> 2013-09-26 16:51:23.519378 7fc2cc197700  1 mon.tca11@0(leader).osd e769 New setting for CEPH_OSDMAP_FULL -- doing propose
> 2013-09-26 16:51:23.520896 7fc2cb996700  1 mon.tca11@0(leader).osd e770 e770: 90 osds: 90 up, 90 in full
> 2013-09-26 16:51:23.521808 7fc2cb996700  0 log [INF] : osdmap e770: 90 osds: 90 up, 90 in full
> 2013-09-26 16:51:43.567118 7fc2cc197700  0 mon.tca11@0(leader).data_health(10) update_stats avail 91% total 70303160 used 2631904 avail 64077016
> 2013-09-26 16:52:43.567227 7fc2cc197700  0 mon.tca11@0(leader).data_health(10) update_stats avail 91% total 70303160 used 2632956 avail 64075964
> 2013-09-26 16:53:28.534868 7fc2cc197700  1 mon.tca11@0(leader).osd e770 New setting for CEPH_OSDMAP_FULL -- doing propose
> 2013-09-26 16:53:28.536477 7fc2cb996700  1 mon.tca11@0(leader).osd e771 e771: 90 osds: 90 up, 90 in
> 2013-09-26 16:53:28.538782 7fc2cb996700  0 log [INF] : osdmap e771: 90 osds: 90 up, 90 in
> 2013-09-26 16:53:43.567331 7fc2cc197700  0 mon.tca11@0(leader).data_health(10) update_stats avail 91% total 70303160 used 2623788 avail 64085132
>
> ---
> All my OSDs are reporting they are up:
>
> # ceph osd tree
> # id    weight  type name       up/down reweight
> -1      249.7   root default
> -2      51.72           host tca22
> 0       3.63                    osd.0   up      1
> 6       3.63                    osd.6   up      1
> 12      3.63                    osd.12  up      1
> 18      3.63                    osd.18  up      1
> 24      3.63                    osd.24  up      1
> 30      3.63                    osd.30  up      1
> 36      2.72                    osd.36  up      1
> 42      3.63                    osd.42  up      1
> 48      3.63                    osd.48  up      1
> 54      2.72                    osd.54  up      1
> 60      3.63                    osd.60  up      1
> 66      3.63                    osd.66  up      1
> 72      2.72                    osd.72  up      1
> 78      3.63                    osd.78  up      1
> 84      3.63                    osd.84  up      1
> -3      31.5            host tca23
> 1       3.63                    osd.1   up      1
> 7       0.26                    osd.7   up      1
> 13      2.72                    osd.13  up      1
> 19      2.72                    osd.19  up      1
> 25      0.26                    osd.25  up      1
> 31      3.63                    osd.31  up      1
> 37      2.72                    osd.37  up      1
> 43      0.26                    osd.43  up      1
> 49      3.63                    osd.49  up      1
> 55      0.26                    osd.55  up      1
> 61      3.63                    osd.61  up      1
> 67      0.26                    osd.67  up      1
> 73      3.63                    osd.73  up      1
> 79      0.26                    osd.79  up      1
> 85      3.63                    osd.85  up      1
> -4      51.72           host tca24
> 2       3.63                    osd.2   up      1
> 8       3.63                    osd.8   up      1
> 14      3.63                    osd.14  up      1
> 20      3.63                    osd.20  up      1
> 26      3.63                    osd.26  up      1
> 32      3.63                    osd.32  up      1
> 38      2.72                    osd.38  up      1
> 44      3.63                    osd.44  up      1
> 50      3.63                    osd.50  up      1
> 56      2.72                    osd.56  up      1
> 62      3.63                    osd.62  up      1
> 68      3.63                    osd.68  up      1
> 74      2.72                    osd.74  up      1
> 80      3.63                    osd.80  up      1
> 86      3.63                    osd.86  up      1
> -5      31.5            host tca25
> 3       3.63                    osd.3   up      1
> 9       0.26                    osd.9   up      1
> 15      2.72                    osd.15  up      1
> 21      2.72                    osd.21  up      1
> 27      0.26                    osd.27  up      1
> 33      3.63                    osd.33  up      1
> 39      2.72                    osd.39  up      1
> 45      0.26                    osd.45  up      1
> 51      3.63                    osd.51  up      1
> 57      0.26                    osd.57  up      1
> 63      3.63                    osd.63  up      1
> 69      0.26                    osd.69  up      1
> 75      3.63                    osd.75  up      1
> 81      0.26                    osd.81  up      1
> 87      3.63                    osd.87  up      1
> -6      51.72           host tca26
> 4       3.63                    osd.4   up      1
> 10      3.63                    osd.10  up      1
> 16      3.63                    osd.16  up      1
> 22      3.63                    osd.22  up      1
> 28      3.63                    osd.28  up      1
> 34      3.63                    osd.34  up      1
> 40      2.72                    osd.40  up      1
> 46      3.63                    osd.46  up      1
> 52      3.63                    osd.52  up      1
> 58      2.72                    osd.58  up      1
> 64      3.63                    osd.64  up      1
> 70      3.63                    osd.70  up      1
> 76      2.72                    osd.76  up      1
> 82      3.63                    osd.82  up      1
> 88      3.63                    osd.88  up      1
> -7      31.5            host tca27
> 5       3.63                    osd.5   up      1
> 11      0.26                    osd.11  up      1
> 17      2.72                    osd.17  up      1
> 23      2.72                    osd.23  up      1
> 29      0.26                    osd.29  up      1
> 35      3.63                    osd.35  up      1
> 41      2.72                    osd.41  up      1
> 47      0.26                    osd.47  up      1
> 53      3.63                    osd.53  up      1
> 59      0.26                    osd.59  up      1
> 65      3.63                    osd.65  up      1
> 71      0.26                    osd.71  up      1
> 77      3.63                    osd.77  up      1
> 83      0.26                    osd.83  up      1
> 89      3.63                    osd.89  up      1
>
> Kernel version on all systems:
>
> # cat /proc/version
> Linux version 3.11.1-031101-generic (apw@gomeisa) (gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5) ) #201309141102 SMP Sat Sep 14 15:02:49 UTC 2013
>
> I would really like to know why it failed before I restart my testing.
>
> Thanks in advance,
>
> Eric
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
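[Editor's note, not part of the thread: cross-referencing the `ceph health detail` and `ceph osd tree` output above shows that every near-full OSD is one of the small weight-0.26 devices, even though the cluster overall is only 52.6% full. The monitor's `CEPH_OSDMAP_FULL` lines mean one of those OSDs briefly crossed the cluster full ratio, which blocks writes cluster-wide and explains ENOSPC on all five clients at once. The 0.85/0.95 ratios below are Ceph's defaults (`mon_osd_nearfull_ratio` / `mon_osd_full_ratio`), assumed here because the thread does not show the cluster's actual settings:]

```python
# Illustrative cross-reference of the thread's own numbers; ratio values
# are ASSUMED Ceph defaults, not taken from this cluster's config.
NEARFULL_RATIO = 0.85
FULL_RATIO = 0.95

# osd id -> utilization, from the HEALTH_WARN output above
near_full = {9: 0.85, 29: 0.85, 43: 0.91, 45: 0.88, 47: 0.88,
             55: 0.94, 59: 0.94, 67: 0.94, 83: 0.94}

# osd id -> CRUSH weight for those same OSDs, from `ceph osd tree` above
weights = {9: 0.26, 29: 0.26, 43: 0.26, 45: 0.26, 47: 0.26,
           55: 0.26, 59: 0.26, 67: 0.26, 83: 0.26}

# Every flagged OSD is a weight-0.26 device: placement variance hits the
# smallest disks hardest, long before the cluster as a whole is full.
assert all(w == 0.26 for w in weights.values())

# The monitor sets CEPH_OSDMAP_FULL as soon as ANY single OSD reaches
# FULL_RATIO. At 94%, the fullest OSDs were one write burst from 95%.
headroom = FULL_RATIO - max(near_full.values())
print(f"headroom on fullest OSD: {headroom:.2%}")  # headroom on fullest OSD: 1.00%
```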