Thanks!

On Thu, Aug 31, 2017 at 1:28 PM, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
> On Thu, Aug 31, 2017 at 4:12 PM, Wyllys Ingersoll
> <wyllys.ingersoll@xxxxxxxxxxxxxx> wrote:
>> Sorry for the lack of detail; here is some more info:
>>
>> Currently using ceph 10.2.7.
>>
>> - Prior to the error there was nothing in the kernel log for several hours.
>> - cephfs snapshots are enabled, but are not currently being taken at
>>   regular intervals; the last one was taken 2 days before the error
>>   message appeared.
>> - The cephfs data pool holds 17123471 objects (34% full) and the
>>   metadata pool holds ~70K objects.
>> - The system has 85 OSDs and 3 MDS servers, all in a healthy state.
>> - We use 3-copy replication rules: 81739 GB used, 161 TB / 241 TB avail.
>
> OK, I see what happened.
>
> You have quite a lot of snapshots -- 4758 of them? send_request()
> attempted to encode an 8 + 4 + 4758*8 = ~38k snap context into a 4k
> buffer. Normally that's fine because the snap context is taken into
> account when allocating a message buffer. However, this particular
> code path (... ceph_osdc_writepages()) uses pre-allocated messages,
> which are always 4k in size.
>
> I think it's a known bug^Wlimitation. As a short-term fix, we can
> probably increase that pre-allocated size from 4k to something bigger.
> A proper resolution would take a considerable amount of time. Until
> then I'd recommend a much more aggressive snapshot rotation schedule,
> which is a good idea anyway -- your writes will transmit faster!
>
> Thanks,
>
>                 Ilya

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html