Thanks for the reply. It helps a lot. I will try Luminous later and keep
tracking this issue in Jewel.

Thanks,
Handong

2017-06-29 1:21 GMT+08:00 Nathan Cutler <ncutler@xxxxxxx>:
>
> On 06/28/2017 07:04 PM, David Zafman wrote:
>>
>> Luminous has the more complex fix, which prevents recovery/backfill from
>> filling up a disk.
>>
>> In your 3-node test cluster with 1 OSD out you have 66% of your storage
>> available with up to 80% in use, so you are out of space. In Luminous not
>> only would new writes be blocked, but PGs would be marked "backfill_toofull"
>> or "recovery_toofull".
>>
>> A portion of the Luminous changes is in a pending Jewel backport.
>
> That's https://github.com/ceph/ceph/pull/15050 in case anyone was wondering.
>
>> It includes code that warns about uneven OSD usage and increases
>> mon_osd_min_in_ratio to 0.75 (75%).
>>
>> In a more realistic Jewel cluster you can increase the value of
>> mon_osd_min_in_ratio to whatever is best for your situation. This will
>> prevent too many OSDs from being marked out.
>>
>> David
>>
>> On 6/28/17 9:37 AM, Sage Weil wrote:
>>>
>>> On Wed, 28 Jun 2017, handong He wrote:
>>>>
>>>> Hello,
>>>>
>>>> I'm using ceph-jewel 10.2.7 for some tests.
>>>> I discovered that when an OSD is full (e.g. full_ratio=0.95), client
>>>> writes fail, which is normal. But a full OSD cannot stop a recovering
>>>> cluster from writing data, which pushes the OSD's usage from 95% to
>>>> 100%. When that happens, the OSD goes down because no space is left
>>>> and cannot start up anymore.
>>>>
>>>> So the question is: can the cluster automatically stop recovering while
>>>> an OSD is reaching full, without setting the norecover flag manually?
>>>> Or is this already fixed in the latest version?
>>>>
>>>> Consider this situation: a half-full cluster with many OSDs. Through
>>>> some bad luck (network link down, server down, or something else) in
>>>> the middle of the night, some OSDs go down/out and trigger cluster
>>>> recovery, which drives some other healthy OSDs' usage to 100% (I'm
>>>> inexperienced in operations and maintenance, so please correct me if
>>>> I'm wrong). Unluckily, this spreads like a plague and takes down many
>>>> more OSDs. It may be easy to fix one down OSD like that, but it is a
>>>> disaster to fix 10+ OSDs with 100% space used.
>>>
>>> There are additional thresholds for stopping backfill and (later) a
>>> failsafe to prevent any writes, but you're not the first one to see
>>> these not work properly in jewel. David recently made a ton of
>>> improvements here in master for luminous, but I'm not sure what the
>>> status is for backporting some of the critical pieces to jewel...
>>>
>>> sage
>>>
>>>> Here is my test environment and steps:
>>>>
>>>> Three nodes, each with one monitor and one OSD (10 GB HDD for
>>>> convenience), running in VMs.
>>>> The ceph.conf is basic.
>>>> Pool size is set to 2.
>>>> I use 'rados bench' to write data to the OSDs.
>>>>
>>>> 1. Run commands to set the OSD full ratios:
>>>> # ceph pg set_full_ratio 0.8
>>>> # ceph pg set_nearfull_ratio 0.7
>>>>
>>>> 2. Write data; when an OSD is getting full, stop writing and mark out
>>>> one OSD with:
>>>> # ceph osd out 0
>>>>
>>>> 3. Wait for cluster recovery to finish, then run:
>>>> # ceph osd df
>>>> # ceph osd tree
>>>>
>>>> We can see that the other OSDs are down.
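
For anyone reproducing the write step above, here is a minimal sketch of
the 'rados bench' invocation. The pool name, duration, and use of
--no-cleanup are illustrative choices, not taken from the original report:

# rados bench -p rbd 60 write --no-cleanup
# ceph osd df

--no-cleanup keeps the benchmark objects around so that utilization
actually stays near the configured ratios, and 'ceph osd df' shows
per-OSD usage against the nearfull/full thresholds while the test runs.
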
>>>>
>>>> Thanks and Best Regards!
>>>>
>>>> He Handong
>
> --
> Nathan Cutler
> Software Engineer Distributed Storage
> SUSE LINUX, s.r.o.
> Tel.: +420 284 084 037
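
A practical footnote on the options David and Sage mention above, for
Jewel-era clusters. This is only a sketch: the values shown are examples
rather than recommendations, and defaults differ between releases, so
check them against your version before changing anything.

[mon]
    ; keep the monitors from marking out more than ~25% of the OSDs
    mon osd min in ratio = 0.75
[osd]
    ; stop backfilling into an OSD once it is more than 85% full
    osd backfill full ratio = 0.85
    ; last-resort cutoff that blocks writes to an almost-full OSD
    osd failsafe full ratio = 0.97

The same settings can be injected into a running cluster, for example:

# ceph tell mon.* injectargs '--mon-osd-min-in-ratio 0.75'
# ceph tell osd.* injectargs '--osd-backfill-full-ratio 0.85'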