Hello, Debian Jessie cluster, thus kernel 3.16, ceph 0.80.7. 3 storage nodes with 8 OSDs (journals on 4 SSDs) each, 3 mons. 2 compute nodes, everything connected via Infiniband. This is pre-production, currently there are only 3 VMs and 2 of them were idle at the time. The non-idle one was having 600GB of maildirs copied onto it, which stresses things but not Ceph as those millions of small files coalesce nicely and result in rather few Ceph ops. A couple of hours into that copy marathon (the source FS and machine are slow and rsync isn't particular speedy with this kind of operation either) this happened: --- 2014-12-06 19:20:57.023974 osd.23 10.0.8.23:6815/3552 77 : [WRN] slow request 30 .673939 seconds old, received at 2014-12-06 19:20:26.346746: osd_op(client.33776 .0:743596 rb.0.819b.238e1f29.00000003f52f [set-alloc-hint object_size 4194304 wr ite_size 4194304,write 1748992~4096] 3.efa97e35 ack+ondisk+write e380) v4 curren tly waiting for subops from 4,8 2014-12-06 19:20:57.023991 osd.23 10.0.8.23:6815/3552 78 : [WRN] slow request 30 .673886 seconds old, received at 2014-12-06 19:20:26.346799: osd_op(client.33776 .0:743597 rb.0.819b.238e1f29.00000003f52f [set-alloc-hint object_size 4194304 wr ite_size 4194304,write 1945600~4096] 3.efa97e35 ack+ondisk+write e380) v4 curren tly waiting for subops from 4,8 2014-12-06 19:20:57.323976 osd.1 10.0.8.21:6815/4868 123 : [WRN] slow request 30 .910821 seconds old, received at 2014-12-06 19:20:26.413051: osd_op(client.33776 .0:743604 rb.0.819b.238e1f29.00000003e628 [set-alloc-hint object_size 4194304 wr ite_size 4194304,write 1794048~1835008] 3.5e76b8ba ack+ondisk+write e380) v4 cur rently waiting for subops from 8,17 --- There were a few more later, but they all involved OSD 8 as common factor. Alas there's nothing in the osd-8.log indicating why: --- 2014-12-06 19:13:13.933636 7fce85552700 0 -- 10.0.8.22:6835/5389 >> 10.0.8.6:0/ 716350435 pipe(0x7fcec3c25900 sd=23 :6835 s=0 pgs=0 cs=0 l=0 c=0x7fcebfad03c0).a ccept peer addr is really 10.0.8.6:0/716350435 (socket is 10.0.8.6:50592/0) 2014-12-06 19:20:56.595773 7fceac82f700 0 log [WRN] : 3 slow requests, 3 included below; oldest blocked for > 30.241397 secs 2014-12-06 19:20:56.595796 7fceac82f700 0 log [WRN] : slow request 30.241397 seconds old, received at 2014-12-06 19:20:26.354247: osd_sub_op(client.33776.0:743596 3.235 efa97e35/rb.0.819b.238e1f29.00000003f52f/head//3 [] v 380'3783 snapset=0=[]:[] snapc=0=[]) v11 currently started 2014-12-06 19:20:56.595825 7fceac82f700 0 log [WRN] : slow request 30.240286 seconds old, received at 2014-12-06 19:20:26.355358: osd_sub_op(client.33776.0:743597 3.235 efa97e35/rb.0.819b.238e1f29.00000003f52f/head//3 [] v 380'3784 snapset=0=[]:[] snapc=0=[]) v11 currently started 2014-12-06 19:20:56.595837 7fceac82f700 0 log [WRN] : slow request 30.177186 seconds old, received at 2014-12-06 19:20:26.418458: osd_sub_op(client.33776.0:743604 3.ba 5e76b8ba/rb.0.819b.238e1f29.00000003e628/head//3 [] v 380'6439 snapset=0=[]:[] snapc=0=[]) v11 currently started ---- The HDDs and SSDs are new, there's nothing in the pertinent logs or smart that indicates any problem with that HDD or its journal SSD, nor the system in general. This problem persisted (and the VM remained stuck) until OSD 8 was restarted the next day when I discovered this. I suppose this is another "this can't/shouldn't happen" case, but I'd be delighted about any suggestions as to what happened here, potential prevention measures and any insights on how to maybe coax more information out of Ceph if this happens again. Regards, Christian -- Christian Balzer Network/Systems Engineer chibi@xxxxxxx Global OnLine Japan/Fusion Communications http://www.gol.com/ _______________________________________________ ceph-users mailing list ceph-users@xxxxxxxxxxxxxx http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com