Dear Igor,

thanks a lot for the analysis and recommendations.

> Here is a brief analysis:
>
> 1) Your DB is pretty large - 27GB at the DB device (making it full) and
> 279GB at the main spinning one, i.e. RocksDB is experiencing huge
> spillover to the slow main device - expect a performance drop. And in
> general the DB is highly under-provisioned.

Yes, we have known about this issue for a long time. This cluster, and in
particular its SSD devices, were dimensioned in the pre-BlueStore days.
We haven't yet found a viable migration path towards something more
sensible (with ~1500 OSDs on two separate clusters and quite a bit of
user data on them).

> 2) Main device space is highly fragmented - 0.84012572151981013, where
> 1.0 is the maximum. Can't say for sure, but I presume it's pretty full
> as well.

Not too full:

$ ceph osd df | sort -n
ID CLASS WEIGHT  REWEIGHT SIZE    RAW USE DATA    OMAP   META    AVAIL   %USE  VAR  PGS STATUS
[...]
 3   hdd 7.27699  1.00000 7.3 TiB 4.6 TiB 4.3 TiB 49 MiB 319 GiB 2.7 TiB 63.46 1.10   0   down

> The above are indirect factors in the current failure, though;
> primarily I just want to make you aware of them, since they might
> cause other issues later on.

Thanks.

> The major reason preventing the OSD from starting properly is a BlueFS
> attempt to claim additional space (~52GB), see the log:
[...]
> I can suggest the following workarounds to start the OSD for now:
>
> 1) Switch the allocator to 'stupid' by setting the 'bluestore allocator'
> parameter to 'stupid' (I presume you have the default setting of
> 'bitmap' now). This will allow more contiguous allocations for the
> BlueFS space claim, and hence a shorter log write. But given the high
> fragmentation of the main disk, this might not be enough. The 'stupid'
> allocator has some issues as well (e.g. high RAM utilization over time
> in some cases), but they're rather irrelevant for OSD startup.

Thanks, we'll try that & report (the exact commands we plan to run are
in the P.S. below).

> 2) Increase the 'bluefs_max_log_runway' parameter to 8-12 MB (the
> default value is 4 MB).
>
> I suggest starting with 1) and then additionally proceeding with 2) if
> the first one doesn't help.
>
> Once the OSD is up and the cluster is healthy, please consider adding
> more DB space and/or OSDs to your cluster to fight the dangerous
> factors I started with.
>
> BTW, I'm wondering what the primary payload for your cluster is - RGW
> or something else?

The payload has changed over the lifetime of the cluster (which has been
in operation for more than four years, growing and being upgraded).
Initially it was almost exclusively RBD (for OpenStack VMs), then we
added RadosGW (still all with 3-way replication). As RadosGW/S3 became
more popular, we added an EC 8+3 pool. (We also added an NVMe-only pool,
which is used for RadosGW indexes.) Lately this EC 8+3 pool has become
very popular, and users have been storing hundreds of terabytes on it.
Unfortunately they tend to use a small object size (~1 MB per object).
That's why we have close to a billion objects in the EC pool now, and
things start to fail.

As I said, it's a problem of finding a viable migration path to a better
configuration. Unfortunately we cannot just throw away the current
installation and start from scratch...

Cheers,
--
Simon.
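
P.S. For the record, here is roughly what we intend to try, following
your suggestions. This is only a sketch: it assumes a release with the
centralized config store ('ceph config set'); the same settings could
just as well go into the [osd] section of ceph.conf on the OSD host.
osd.3 is simply the down OSD from the 'ceph osd df' output above.

$ ceph config set osd.3 bluestore_allocator stupid     # workaround 1): replace the default bitmap allocator
$ ceph config set osd.3 bluefs_max_log_runway 8388608  # workaround 2): 8 MiB log runway, only if 1) alone is not enough
$ systemctl start ceph-osd@3                           # on the OSD host

And once the OSD is back up and the cluster healthy again, revert to the
defaults so they take effect on the next restart:

$ ceph config rm osd.3 bluestore_allocator
$ ceph config rm osd.3 bluefs_max_log_runway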