On Wed, 11 Dec 2013, Josh Durgin wrote:
> The PAUSEWR and PAUSERD flags are meant to stop the cluster from
> processing writes and reads, respectively. The FULL flag is set when
> the cluster determines that it is out of space, and will no longer
> process writes. PAUSEWR and PAUSERD are purely client-side settings
> already implemented in userspace clients. The osd does nothing special
> with these flags.
>
> When the FULL flag is set, however, the osd responds to all writes
> with -ENOSPC. For cephfs, this makes sense, but for rbd the block
> layer translates this into EIO. If a cluster goes from full to
> non-full quickly, a filesystem on top of rbd will not behave well,
> since some writes succeed while others get EIO.
>
> Fix this by blocking any writes when the FULL flag is set in the osd
> client. This is the same strategy used by userspace, so apply it by
> default. A follow-on patch makes this configurable.
>
> __map_request() is called to re-target osd requests in case the
> available osds changed. Add a paused field to a ceph_osd_request, and
> set it whenever an appropriate osd map flag is set. Avoid queueing
> paused requests in __map_request(), but force them to be resent if
> they become unpaused.
>
> Also subscribe to the next osd map from the monitor if any of these
> flags are set, so paused requests can be unblocked as soon as
> possible.
>
> Fixes: http://tracker.ceph.com/issues/6079
>
> Signed-off-by: Josh Durgin <josh.durgin@xxxxxxxxxxx>
> ---
>  include/linux/ceph/osd_client.h |    1 +
>  net/ceph/osd_client.c           |   29 +++++++++++++++++++++++++++--
>  2 files changed, 28 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
> index 8f47625..4fb6a89 100644
> --- a/include/linux/ceph/osd_client.h
> +++ b/include/linux/ceph/osd_client.h
> @@ -138,6 +138,7 @@ struct ceph_osd_request {
>  	__le64           *r_request_pool;
>  	void             *r_request_pgid;
>  	__le32           *r_request_attempts;
> +	bool              r_paused;
>  	struct ceph_eversion *r_request_reassert_version;
>  
>  	int               r_result;
> diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
> index a17eaae..1ad9866 100644
> --- a/net/ceph/osd_client.c
> +++ b/net/ceph/osd_client.c
> @@ -1232,6 +1232,22 @@ void ceph_osdc_set_request_linger(struct ceph_osd_client *osdc,
>  EXPORT_SYMBOL(ceph_osdc_set_request_linger);
>  
>  /*
> + * Returns whether a request should be blocked from being sent
> + * based on the current osdmap and osd_client settings.
> + *
> + * Caller should hold map_sem for read.
> + */
> +static bool __req_should_be_paused(struct ceph_osd_client *osdc,
> +				   struct ceph_osd_request *req)
> +{
> +	bool pauserd = ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_PAUSERD);
> +	bool pausewr = ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_PAUSEWR) ||
> +		ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_FULL);
> +	return (req->r_flags & CEPH_OSD_FLAG_READ && pauserd) ||
> +		(req->r_flags & CEPH_OSD_FLAG_WRITE && pausewr);
> +}
> +
> +/*
>   * Pick an osd (the first 'up' osd in the pg), allocate the osd struct
>   * (as needed), and set the request r_osd appropriately.  If there is
>   * no up osd, set r_osd to NULL.  Move the request to the appropriate list
> @@ -1248,6 +1264,7 @@ static int __map_request(struct ceph_osd_client *osdc,
>  	int acting[CEPH_PG_MAX_SIZE];
>  	int o = -1, num = 0;
>  	int err;
> +	bool was_paused;
>  
>  	dout("map_request %p tid %lld\n", req, req->r_tid);
>  	err = ceph_calc_ceph_pg(&pgid, req->r_oid, osdc->osdmap,
> @@ -1264,12 +1281,18 @@ static int __map_request(struct ceph_osd_client *osdc,
>  		num = err;
>  	}
>  
> +	was_paused = req->r_paused;
> +	req->r_paused = __req_should_be_paused(osdc, req);
> +	if (was_paused && !req->r_paused)
> +		force_resend = 1;
> +
>  	if ((!force_resend &&
>  	     req->r_osd && req->r_osd->o_osd == o &&
>  	     req->r_sent >= req->r_osd->o_incarnation &&
>  	     req->r_num_pg_osds == num &&
>  	     memcmp(req->r_pg_osds, acting, sizeof(acting[0])*num) == 0) ||
> -	    (req->r_osd == NULL && o == -1))
> +	    (req->r_osd == NULL && o == -1) ||
> +	    req->r_paused)

It seems like we could be a bit more aggressive (and more closely
aligned with what the other causes of changed mappings do) and cancel
the request if it is newly paused. Otherwise, we leave req->r_osd set
to the last person we sent the request to, which means we might get a
reply. I guess that is what we want, actually...

>  		return 0;  /* no change */
>  
>  	dout("map_request tid %llu pgid %lld.%x osd%d (was osd%d)\n",
> @@ -1811,7 +1834,9 @@ done:
>  	 * we find out when we are no longer full and stop returning
>  	 * ENOSPC.
>  	 */
> -	if (ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_FULL))
> +	if (ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_FULL) ||
> +	    ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_PAUSERD) ||
> +	    ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_PAUSEWR))
>  		ceph_monc_request_next_osdmap(&osdc->client->monc);
>  
>  	mutex_lock(&osdc->request_mutex);
> -- 
> 1.7.10.4
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
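
For anyone reading without the tree at hand, here is a rough userspace
sketch of the paused-request bookkeeping discussed in this thread. It is
not the kernel code. Every sketch_* and SKETCH_* name and every flag
value is a made-up stand-in, and the request is reduced to three fields.
It only aims to show the two behaviours in question: a request that
becomes unpaused is forced to be resent, and a newly paused request
keeps its last target (the r_osd point raised in the review comment).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; NOT the kernel's names or flag values. */
enum {
	SKETCH_MAP_FULL    = 1 << 0,
	SKETCH_MAP_PAUSERD = 1 << 1,
	SKETCH_MAP_PAUSEWR = 1 << 2,
};

enum {
	SKETCH_REQ_READ  = 1 << 0,
	SKETCH_REQ_WRITE = 1 << 1,
};

/* Heavily reduced request: just what the pause logic touches. */
struct sketch_request {
	unsigned int flags;	/* SKETCH_REQ_* bits */
	bool paused;		/* plays the role of r_paused */
	int osd;		/* last target, -1 if none (role of r_osd) */
};

/* Same shape as the predicate in the patch: FULL pauses writes only. */
bool sketch_should_pause(unsigned int map_flags,
			 const struct sketch_request *req)
{
	bool pauserd = (map_flags & SKETCH_MAP_PAUSERD) != 0;
	bool pausewr = (map_flags &
			(SKETCH_MAP_PAUSEWR | SKETCH_MAP_FULL)) != 0;

	return ((req->flags & SKETCH_REQ_READ) && pauserd) ||
	       ((req->flags & SKETCH_REQ_WRITE) && pausewr);
}

/*
 * Re-evaluate one request against a new map.  Returns true when the
 * request should be (re)sent to new_osd, false when nothing is queued.
 */
bool sketch_remap(unsigned int map_flags, struct sketch_request *req,
		  int new_osd)
{
	bool was_paused;
	bool force_resend = false;

	assert(req != NULL);

	was_paused = req->paused;
	req->paused = sketch_should_pause(map_flags, req);
	if (was_paused && !req->paused)
		force_resend = true;	/* unpaused: must go out again */

	if (req->paused)
		return false;	/* held back; req->osd is left as it was */
	if (!force_resend && req->osd == new_osd)
		return false;	/* no change */

	req->osd = new_osd;
	return true;
}
```

With these stand-ins, a write remapped under SKETCH_MAP_FULL is held and
keeps its old target, and the next map without the flag makes
sketch_remap() return true even though the target osd did not change.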