Re: osd full still writing data while cluster recovering

Luminous has the more complex fix, which prevents recovery/backfill from filling up a disk.

In your 3-node test cluster with 1 osd out, you have 66% of your storage available but up to 80% in use, so you are out of space. In Luminous, not only would new writes be blocked, but PGs would be marked "backfill_toofull" or "recovery_toofull".
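(A rough back-of-the-envelope, using the 10G OSDs from the test setup quoted below:

  3 osds x 10G = 30G raw; written to full_ratio 0.8 ~= 24G in use (with replicas)
  1 osd out:   2 x 10G = 20G raw capacity remaining
  recovery then tries to fit ~24G of data onto 20G of disk)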


A portion of the Luminous changes is in a pending Jewel backport. It includes code that warns about uneven OSD usage and increases mon_osd_min_in_ratio to 0.75 (75%).

In a more realistic Jewel cluster you can raise mon_osd_min_in_ratio to whatever is best for your situation. This will prevent too many OSDs from being marked out.
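For example (a sketch; 0.75 matches the backport's new default, but tune it for your cluster), set it in ceph.conf on the monitors:

[mon]
    mon_osd_min_in_ratio = 0.75

or inject it at runtime:

# ceph tell mon.* injectargs '--mon_osd_min_in_ratio 0.75'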

David


On 6/28/17 9:37 AM, Sage Weil wrote:
On Wed, 28 Jun 2017, handong He wrote:
Hello,

I'm using ceph-jewel 10.2.7 for some tests.
I discovered that when an osd is full (e.g. full_ratio=0.95), client
writes fail, which is normal. But a full osd cannot stop a recovering
cluster from writing data, which pushes the osd's used ratio from 95% to
100%. When that happens, the osd goes down because there is no space left,
and it cannot start up anymore.

So the question is: can the cluster automatically stop recovering while an
osd is reaching full, without setting the norecover flag manually? Or is
this already fixed in the latest version?
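(For reference, the manual workaround mentioned here looks like this; set the flags before recovery can start, and unset them once capacity is restored:

# ceph osd set norecover
# ceph osd set nobackfill
  ... free up or add capacity ...
# ceph osd unset nobackfill
# ceph osd unset norecover)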

Consider this situation: a half-full cluster with many osds. Through some
bad luck (a network link down, a server down, or something else) in the
middle of the night, some osds go down/out and trigger cluster recovery,
which drives some other healthy osds' usage to 100% (I have little
operations experience, so please correct me if I'm wrong). Unluckily, this
spreads like a plague and takes many more osds down. It may be easy to fix
one down osd like that, but it is a disaster to fix 10+ osds with 100% of
their space used.
There are additional thresholds for stopping backfill and (later) a
failsafe to prevent any writes, but you're not the first one to see these
not work properly in jewel.  David recently made a ton of
improvements here in master for luminous, but I'm not sure what the
status is for backporting some of the critical pieces to jewel...

sage
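(The additional thresholds referred to above are, in Jewel, roughly these OSD options; the values shown are the usual defaults and are adjustable at runtime:

# ceph tell osd.* injectargs '--osd_backfill_full_ratio 0.85'
# ceph tell osd.* injectargs '--osd_failsafe_full_ratio 0.97')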

here is my test environment and steps:

three nodes, each node has one monitor and one osd (10G hdd for
convenience), running in VMs.
ceph.conf is basic.
pool size set to 2.
using 'rados bench' to write data to the osds, as shown below.
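For example (the pool name and duration are placeholders):

# rados bench -p testpool 300 write --no-cleanup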

1. Run commands to set the osd full ratios:
# ceph pg set_full_ratio 0.8
# ceph pg set_nearfull_ratio 0.7

2. Write data; when an osd is nearing full, stop writing and mark
out one osd with:
# ceph osd out 0

3. Wait for cluster recovery to finish, then run:
# ceph osd df
# ceph osd tree

We can see that the other osds are down.

Thanks and Best Regards!

He Handong


