Re: Rocksdb: Corruption: missing start of fragmented record(1)

Hi Dan.

> ...however it is not unsafe to leave the cache enabled -- ceph uses
> fsync appropriately to make the writes durable.

Actually it is. You will be relying on the drive's firmware to implement this correctly, and that is, unfortunately, less than a given. Within the last one to two years somebody posted a link to a very interesting research paper on this list, where drives were tested under real conditions. It turns out that "fsync to make writes persistent" is very vulnerable to power loss if the volatile write cache is enabled. If I remember correctly, about 1-2% of drives ended up with data loss every time. In other words, for every drive with the volatile write cache enabled, you should expect roughly 1-2 data loss events per 100 power loss events (in certain situations the drive acknowledges the write before the volatile cache is actually flushed). I think even PLP did not prevent data loss in all cases.

It's all down to bugs in firmware that fail to catch all corner cases, and to internal race conditions in ops scheduling. Vendors very often prioritise performance over fixing a rare race condition, and I will not take chances here, nor recommend that anyone else does.

I think this kind of advice should really not be given in a ceph context without also stating the prerequisite: perfect firmware. Ceph is a scale-out system, and any reasonably large cluster will have enough drives to see low-probability events on a regular basis. At the very least, recommend testing this thoroughly, that is, performing power-loss tests under load; and I mean many power loss events per drive, with randomised intervals and under different load patterns.
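
To give an idea of what such a test could look like: one way to generate sync-heavy write load is something like fio with an fsync after every write. This is only an illustrative sketch (fio is just an example tool, the device name is a placeholder, and checking which acknowledged writes actually survived the cut needs extra tooling on top, e.g. fio's verify options or application-level checksums):

  # WARNING: this writes directly to the device and destroys its contents.
  # /dev/sdX is a placeholder -- point it at a scratch drive only.
  fio --name=powerloss-load --filename=/dev/sdX \
      --rw=randwrite --bs=4k --numjobs=4 --fsync=1 \
      --direct=1 --time_based --runtime=600 --group_reporting
  # Cut power at a random point during the run, power the node back on,
  # and check whether writes the drive acknowledged before the cut are
  # actually on stable storage. Repeat many times per drive and per load
  # pattern before trusting the volatile cache.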

The same applies to disk controllers with cache. Nobody recommends using the controller cache, because of firmware bugs that seem to be present in all models; we have seen enough cases on this list of data loss after power loss where the controller cache was the culprit. The recommendation there is to enable HBA mode and write-through. Do the same with your disk firmware and get better sleep and better performance in one go.
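
For completeness, checking and disabling the volatile cache looks roughly like this with the usual tools. The device names are placeholders and the exact options depend on tool version and on whether the drive is SATA or SAS, so please verify against your own man pages:

  # Check whether the volatile write cache is currently enabled
  smartctl -g wcache /dev/sdX          # ATA/SATA
  sdparm --get=WCE /dev/sdX            # SAS/SCSI

  # Disable it
  smartctl -s wcache,off /dev/sdX      # ATA/SATA
  hdparm -W 0 /dev/sdX                 # ATA/SATA alternative
  sdparm --clear=WCE --save /dev/sdX   # SAS/SCSI; --save tries to persist it

  # On some drives the setting does not survive a power cycle, so it is
  # common to re-apply it at boot, e.g. from a udev rule or a systemd unit.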

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Dan van der Ster <dan@xxxxxxxxxxxxxx>
Sent: 29 November 2021 09:24:29
To: Frank Schilder
Cc: huxiaoyu@xxxxxxxxxxxx; YiteGu; ceph-users
Subject: Re:  Re: Rocksdb: Corruption: missing start of fragmented record(1)

Hi Frank,

That's true from the performance perspective, however it is not unsafe
to leave the cache enabled -- ceph uses fsync appropriately to make
the writes durable.

This issue looks rather to be related to concurrent hardware failure.

Cheers, Dan

On Mon, Nov 29, 2021 at 9:21 AM Frank Schilder <frans@xxxxxx> wrote:
>
> This may sound counter-intuitive, but you need to disable the write cache so that only the PLP-protected cache is used. SSDs with PLP usually have two types of cache, volatile and non-volatile. The volatile cache will experience data loss on power loss. It is the volatile cache that gets disabled when issuing the hd-/sdparm/smartctl command to switch it off. In many cases this can increase the non-volatile cache and also performance.
>
> It is the non-volatile cache you want your writes to go to directly.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: huxiaoyu@xxxxxxxxxxxx <huxiaoyu@xxxxxxxxxxxx>
> Sent: 26 November 2021 22:41:10
> To: YiteGu; ceph-users
> Subject:  Re: Rocksdb: Corruption: missing start of fragmented record(1)
>
> wal/db are on Intel S4610 960GB SSDs, with PLP and the write-back cache enabled
>
>
>
> huxiaoyu@xxxxxxxxxxxx
>
> From: YiteGu
> Date: 2021-11-26 11:32
> To: huxiaoyu@xxxxxxxxxxxx; ceph-users
> Subject: Re: Rocksdb: Corruption: missing start of fragmented record(1)
> It looks like your wal/db device lost data.
> Please check whether your wal/db device has a write-back cache; a power loss can then cause data loss, and the log replay fails when rocksdb restarts.
>
>
>
> YiteGu
> ess_gyt@xxxxxx
>
>
>
> ------------------ Original ------------------
> From: "huxiaoyu@xxxxxxxxxxxx" <huxiaoyu@xxxxxxxxxxxx>;
> Date: Fri, Nov 26, 2021 06:02 PM
> To: "ceph-users"<ceph-users@xxxxxxx>;
> Subject:  Rocksdb: Corruption: missing start of fragmented record(1)
>
> Dear Cephers,
>
> I just had one Ceph OSD node (Luminous 12.2.13) lose power unexpectedly, and after restarting that node two OSDs out of 10 cannot be started, issuing the following errors (see below image); in particular, I see
>
> Rocksdb: Corruption: missing start of fragmented record(1)
> Bluestore(/var/lib/ceph/osd/osd-21) _open_db erroring opening db:
> ...
> **ERROR: OSD init failed: (5)  Input/output error
>
> I checked the db/wal SSDs, and they are working fine. So I am wondering the following:
> 1) Is there a method to restore the OSDs?
> 2) What could be the potential causes of the corrupted db/wal? The db/wal SSDs have PLP and have not been damaged during the power loss.
>
> Your help would be highly appreciated.
>
> best regards,
>
> samuel
>
>
>
>
> huxiaoyu@xxxxxxxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx