Hi Igor,

thank you very much for your explanation. I really appreciate it. You were right, as always. :-) There was not a single corrupted object. I ran `time ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-$X` and `time ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-$X --deep yes` for every HDD-based OSD, and every fsck printed "fsck success".

However, I observed something odd during the fsck: every OSD except the one that was recovered was reading at around 44 MByte/s, while the recovered OSD.4 was much faster at around 125 MByte/s.

Thanks,
Sebastian


> On 15.03.2022, at 09:52, Igor Fedotov <igor.fedotov@xxxxxxxx> wrote:
>
> Hi Sebastian,
>
> I don't think you have got tons of corrupted objects. The tricky thing about the bug is that corruption might occur only during a pretty short period: when the OSD is starting but hasn't applied deferred writes yet. This mostly applies to BlueFS/RocksDB performing some housekeeping on startup, e.g. recovery or compaction.
>
> Additionally there is a questionable period shortly before the shutdown when a similar issue is apparently possible. But that's just speculation so far.
>
> So IMO it's highly likely that there are no corrupted objects in your cluster at all. You can try to run fsck with the deep flag on a per-OSD basis to check that, though - it will verify object content using checksums. Expect it to take some time, though...
>
>
> Thanks,
>
> Igor
>
> On 3/14/2022 9:33 PM, Sebastian Mazza wrote:
>> Hi Igor,
>>
>> great that you were able to reproduce it!
>>
>> I read your comments on issue #54547. Am I right that I probably have hundreds of corrupted objects on my EC pools (cephFS and RBD)? But I only ever noticed it when a RocksDB was damaged. A deep scrub should find the other errors, right? But the PGs of my EC pools were never scrubbed, because they are all "undersized" and "degraded". Is it possible to scrub an "undersized" / "degraded" PG? I'm not able to (deep) scrub any of the PGs on my EC pools. I already set osd_scrub_during_recovery = true, but this does not help. Scrubbing and deep scrubbing work as expected on the replicated pools.
>>
>>
>> Thank you very much for helping me with my problem so patiently over the last 3 months!
>>
>>
>> Thanks,
>> Sebastian
>>
>>
>>> On 14.03.2022, at 17:07, Igor Fedotov <igor.fedotov@xxxxxxxx> wrote:
>>>
>>> Hi Sebastian,
>>>
>>> the proper parameter name is 'osd fast shutdown'.
>>>
>>> As with any other OSD config parameter, one can use either ceph.conf or the 'ceph config set osd.N osd_fast_shutdown false' command to adjust it.
>>>
>>> I'd recommend the latter form.
>>>
>>> And yeah, from my last experiments it looks like setting it to false is a workaround indeed...
>>>
>>> I've managed to reproduce the bug locally and it doesn't occur without fast shutdown.
>>>
>>> Thanks for all the information you shared, highly appreciated!
>>>
>>> FYI: I've created another ticket (missed that I already had one): https://tracker.ceph.com/issues/54547
>>>
>>> So please expect all the related news there.
>>>
>>>
>>> Kind regards,
>>>
>>> Igor
>>>
>>>
>>>
>>> On 3/14/2022 5:54 PM, Sebastian Mazza wrote:
>>>> Hello Igor,
>>>>
>>>> I'm glad I could be of help. Thank you for your explanation!
>>>>
>>>>> And I was right: this is related to the deferred write procedure and apparently to fast shutdown mode.
>>>> Does that mean I can prevent the error in the meantime, before you can fix the root cause, by disabling osd_fast_shutdown?
>>>> Can it be disabled by the following statements in ceph.conf?
>>>> ```
>>>> [osd]
>>>> fast shutdown = false
>>>> ```
>>>> (I wasn't able to find any documentation for "osd_fast_shutdown"; e.g. https://docs.ceph.com/en/pacific/rados/configuration/osd-config-ref/ does not contain "osd_fast_shutdown".)
>>>>
>>>>
>>>>> 1) From the log I can see you're using RBD over EC pool for the repro, right? What's the EC profile?
>>>> Yes, one of the two EC pools contains one RBD image and the other EC pool is used for cephFS.
>>>>
>>>> Rules from my crush map:
>>>> ```
>>>> rule ec4x2hdd {
>>>>     id 0
>>>>     type erasure
>>>>     min_size 8
>>>>     max_size 8
>>>>     step set_chooseleaf_tries 5
>>>>     step set_choose_tries 100
>>>>     step take default class hdd
>>>>     step choose indep 4 type host
>>>>     step choose indep 2 type osd
>>>>     step emit
>>>> }
>>>> rule c3nvme {
>>>>     id 1
>>>>     type replicated
>>>>     min_size 1
>>>>     max_size 10
>>>>     step take default class nvme
>>>>     step chooseleaf firstn 0 type host
>>>>     step emit
>>>> }
>>>> ```
>>>> The EC rule requires 4 hosts, but as I have already mentioned, the cluster only has 3 nodes, since I never added the fourth node after the problem occurred the first time.
>>>>
>>>> I created the EC profile by:
>>>> ```
>>>> $ ceph osd erasure-code-profile set ec4x2hdd-profile \
>>>>    k=5 \
>>>>    m=3 \
>>>>    crush-root=ec4x2hdd
>>>> ```
>>>>
>>>> And the EC pools by:
>>>> ```
>>>> $ ceph osd pool create fs.data-hdd.ec-pool 64 64 erasure ec4x2hdd-profile ec4x2hdd warn 36000000000000 # 36TB
>>>> $ ceph osd pool create block-hdd.ec-pool 16 16 erasure ec4x2hdd-profile ec4x2hdd warn 1000000000000 # 1TB
>>>> ```
>>>>
>>>> Replicated pools on the NVMes:
>>>> ```
>>>> $ ceph osd pool create block-nvme.c-pool 32 32 replicated c3nvme 0 warn 549755813888 # 512 GB
>>>> $ ceph osd pool create fs.metadata-root-pool 8 8 replicated c3nvme 0 warn 4294967296 # 4 GB
>>>> $ ceph osd pool create fs.data-root-pool 16 16 replicated c3nvme 0 warn 137438953472 # 128 GB
>>>> ```
>>>>
>>>> Enable overwrite support:
>>>> ```
>>>> $ ceph osd pool set fs.data-hdd.ec-pool allow_ec_overwrites true
>>>> $ ceph osd pool set block-hdd.ec-pool allow_ec_overwrites true
>>>> ```
>>>>
>>>> Set the application:
>>>> ```
>>>> $ ceph osd pool application enable fs.data-hdd.ec-pool cephfs
>>>> $ ceph osd pool application enable "block-hdd.ec-pool" rbd
>>>> ```
>>>>
>>>> Use the rbd tool to initialize the pools for use by RBD:
>>>> ```
>>>> $ rbd pool init "block-nvme.c-pool"
>>>> $ rbd pool init "block-hdd.ec-pool"
>>>> ```
>>>>
>>>> Create the RBD image:
>>>> ```
>>>> $ rbd --pool "block-nvme.c-pool" create vm-300-disk-1 --data-pool "block-hdd.ec-pool" --size 50G
>>>> ```
>>>>
>>>> Create the cephFS:
>>>> ```
>>>> $ ceph osd pool create fs.metadata-root-pool 8 8 replicated c3nvme 0 warn 4294967296 # 4 GB
>>>> $ ceph osd pool create fs.data-root-pool 16 16 replicated c3nvme 0 warn 137438953472 # 128 GB
>>>> $ ceph fs new cephfs fs.metadata-root-pool fs.data-root-pool
>>>> ```
>>>>
>>>> Add the EC pool to the cephFS:
>>>> ```
>>>> $ ceph fs add_data_pool cephfs fs.data-hdd.ec-pool
>>>> ```
>>>>
>>>> Set the layout of the directory "shares" so that files created inside it use the EC pool:
>>>> ```
>>>> $ setfattr -n ceph.dir.layout.pool -v fs.data-hdd.ec-pool /mnt/cephfs-olymp/shares/
>>>> ```
>>>>
>>>>
>>>>
>>>>> 2) How did you generate the load during the last experiments? Was it some benchmarking tool or any other artificial load generator? If so, could you share job descriptions or scripts, if any?
>>>> I did not use a benchmarking tool or any other artificial load generator. The RBD image "vm-300-disk-1" is formatted as BTRFS. I used rsync to copy daily website backups to the volume. The backups are versioned and go back 10 days. When a new version is created, a new folder is added and all files that did not change are hard linked to the file in the folder of the previous day. A new or updated file is copied from the production server.
>>>> This backup procedure is not done on the ceph cluster; it is done on another server, and I just copied the web backup directory of that server to the RBD image. However, I used rsync with the "-H" flag to copy the data from the "production" backup server to the RBD image, so the hard links are preserved.
>>>>
>>>> I copied shared group folders and Apple Time Machine backups to CephFS, also with rsync from another server. The shared user folders contain typical office data: text documents, CSV files, JPEG images, ... The Apple Time Machine backups are basically folders with lots of 8MB or 32MB files that contain encrypted data.
>>>>
>>>>
>>>> In order to reproduce the RocksDB corruption I started the rsync process whenever I found some time. I waited until rsync found at least some files that had changed and copied them to the RBD image or cephFS. Then I triggered a shutdown on all 3 cluster nodes.
>>>>
>>>> However, to the best of my knowledge, there was no IO at all between the last 3 restarts of the OSDs, except the IO that is caused by mounting the cephFS on 4 systems and the BTRFS file system on the RBD image on one system.
>>>>
>>>>
>>>> If you want to create a minimal example that is able to trigger the error, I would suggest creating an RBD image on an EC pool with k=5, m=3 on a cluster with only 6 HDD-based OSDs. Format the RBD image with BTRFS and start creating and deleting hard links on it. I don't know if this is really the trigger, but it is currently my best idea.
>>>>
>>>>
>>>> Best wishes,
>>>> Sebastian
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>> On 14.03.2022, at 12:49, Igor Fedotov <igor.fedotov@xxxxxxxx> wrote:
>>>>>
>>>>> Hi Sebastian,
>>>>>
>>>>> this is super helpful indeed. And I was right: this is related to the deferred write procedure and apparently to fast shutdown mode. I'm still in the process of obtaining more details on the root cause and looking for a way to fix (and preferably reproduce locally) that.
>>>>>
>>>>> Some additional questions on your setup meanwhile. Sorry if some have been asked before...
>>>>>
>>>>> 1) From the log I can see you're using RBD over EC pool for the repro, right? What's the EC profile?
>>>>>
>>>>> 2) How did you generate the load during the last experiments? Was it some benchmarking tool or any other artificial load generator? If so, could you share job descriptions or scripts, if any?
>>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Igor
>>>>>
>>>>>
>>>>> On 3/10/2022 10:36 PM, Sebastian Mazza wrote:
>>>>>> Hi Igor!
>>>>>>
>>>>>> I hope I've hit the jackpot now. I have logs with OSD debug level 20 for bluefs, bdev, and bluestore. The log file ceph-osd.4.log shows 2 consecutive startups of osd.4, where the second startup results in:
>>>>>> ```
>>>>>> rocksdb: Corruption: Bad table magic number: expected 9863518390377041911, found 0 in db/002193.sst
>>>>>> ```
>>>>>>
>>>>>> The log ceph-osd.4.log shows the last two starts of the OSD. However, I am also sending you the previous log, which contains the "*** Immediate shutdown (osd_fast_shutdown=true) ***" that happened before the last successful OSD startup, as the last line.
>>>>>> It is the last line, because the server was down for about 39 hours. Based on what you explained to me, the corruption of the RocksDB must have occurred during the OSD startup that begins at the first line of ceph-osd.4.log.
>>>>>>
>>>>>> Logs and the corrupted RocksDB file: https://we.tl/t-yVsF3akxqX
>>>>>>
>>>>>> Do the logs contain what you need?
>>>>>> Please tell me if you need more data from the OSD. If not, I would rebuild it.
>>>>>>
>>>>>>
>>>>>> Best wishes,
>>>>>> Sebastian
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>> On 26.02.2022, at 00:12, Igor Fedotov <igor.fedotov@xxxxxxxx> wrote:
>>>>>>>
>>>>>>> Sebastian,
>>>>>>>
>>>>>>> On 2/25/2022 7:17 PM, Sebastian Mazza wrote:
>>>>>>>> Hi Igor,
>>>>>>>>
>>>>>>>>> Unfortunately the last logs (from 24-02-2022) you shared don't include the point where the actual corruption happened, as the previous log did. These new logs miss the last successful OSD start, which apparently corrupts DB data. Do you have any output prior to their content?
>>>>>>>> I have logs from the last successful startup of all OSDs, since this was the cluster boot where the OSD.7 failed (21-02-2022 at 13:49:28.452+0100). That also means that the logs do not include the bluestore debug info, only bluefs and bdev. I will include all logs that I have for the crashed OSDs 5 and 6. The last successful startup is inside the files that end with "4". While the startups were collected with bluefs and bdev debug info, the logs between the startup (21-02-2022 at 13:49:28.452+0100) and the crash (2022-02-24T00:04:57.874+0100) were collected without any debug logging, since during that time I rebuilt OSD 7.
>>>>>>>>
>>>>>>>> Logs: https://we.tl/t-x0la8gZl0S
>>>>>>> Unfortunately they're of little help without verbose output :(
>>>>>>>> Since I have to manually rebuild the cluster, pools, cephFS, and RBD images now, it will take several days before I can try to reproduce the problem.
>>>>>>>>
>>>>>>>> Should, or could, I enable more verbose logging than
>>>>>>>> ```
>>>>>>>> [osd]
>>>>>>>> debug bluefs = 20
>>>>>>>> debug bdev = 20
>>>>>>>> debug bluestore = 20
>>>>>>>> ```
>>>>>>>> in ceph.conf? Or in some other way?
>>>>>>> This can be done either through ceph.conf or using the "ceph config set" CLI command.
>>>>>>>
>>>>>>> To make the resulting log smaller you can use slightly less verbose logging levels:
>>>>>>>
>>>>>>> debug bluefs = 10
>>>>>>>
>>>>>>> debug bdev = 5
>>>>>>>
>>>>>>> debug bluestore = 20
>>>>>>>
>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Sebastian
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> On 25.02.2022, at 16:18, Igor Fedotov <igor.fedotov@xxxxxxxx> wrote:
>>>>>>>>>
>>>>>>>>> Hi Sebastian,
>>>>>>>>>
>>>>>>>>> I submitted a ticket, https://tracker.ceph.com/issues/54409, which shows my analysis based on your previous log (from 21-02-2022), which wasn't verbose enough at the debug-bluestore level to draw a final conclusion.
>>>>>>>>>
>>>>>>>>> Unfortunately the last logs (from 24-02-2022) you shared don't include the point where the actual corruption happened, as the previous log did. These new logs miss the last successful OSD start, which apparently corrupts DB data. Do you have any output prior to their content?
>>>>>>>>>
>>>>>>>>> If not, could you please reproduce that once again? Generally I'd like to see the OSD log for a broken startup along with a couple of restarts back - the event sequence for the failure seems to be as follows:
>>>>>>>>>
>>>>>>>>> 1) OSD is shut down for the first time. It (for unclear reasons) keeps a set of deferred writes to be applied once again.
>>>>>>>>>
>>>>>>>>> 2) OSD is started up, which triggers the deferred write submissions. They overlap (again for unclear reasons so far) with DB data content written shortly before. The OSD starts properly, but the DB data corruption has happened at this point.
>>>>>>>>>
>>>>>>>>> 3) OSD is restarted again, which reveals the data corruption, and from that point on the OSD is unable to start.
>>>>>>>>>
>>>>>>>>> So these last new logs include only 3) for now, while I need 1) & 2) as well...
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Igor
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 2/24/2022 3:04 AM, Sebastian Mazza wrote:
>>>>>>>>>> Hi Igor,
>>>>>>>>>>
>>>>>>>>>> I let ceph rebuild the OSD.7. Then I added
>>>>>>>>>> ```
>>>>>>>>>> [osd]
>>>>>>>>>> debug bluefs = 20
>>>>>>>>>> debug bdev = 20
>>>>>>>>>> debug bluestore = 20
>>>>>>>>>> ```
>>>>>>>>>> to the ceph.conf of all 3 nodes and shut down all 3 nodes without writing anything to the pools on the HDDs (the Debian VM was not even running).
>>>>>>>>>> Immediately at the first boot, OSD.5 and 6 crashed with the same "Bad table magic number" error. OSDs 5 and 6 are on the same node, but not on the node of OSD 7, which crashed the last two times.
>>>>>>>>>>
>>>>>>>>>> Logs and corrupted RocksDB files: https://we.tl/t-ZBXYp8r4Hq
>>>>>>>>>> I have saved the entire /var/log directory of every node and the result of
>>>>>>>>>> ```
>>>>>>>>>> $ ceph-bluestore-tool bluefs-export --path /var/lib/ceph/osd/ceph-5 --out-dir /tmp/osd.5-data
>>>>>>>>>> $ ceph-bluestore-tool bluefs-export --path /var/lib/ceph/osd/ceph-6 --out-dir /tmp/osd.6-data
>>>>>>>>>> ```
>>>>>>>>>> Let me know if you need something else.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I hope you can now track it down. I'm really looking forward to your interpretation of the logs.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Best Regards,
>>>>>>>>>> Sebastian
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> On 22.02.2022, at 11:44, Igor Fedotov <igor.fedotov@xxxxxxxx> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi Sebastian,
>>>>>>>>>>>
>>>>>>>>>>> On 2/22/2022 3:01 AM, Sebastian Mazza wrote:
>>>>>>>>>>>> Hey Igor!
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> thanks a lot for the new logs - looks like they provide some insight.
>>>>>>>>>>>> I'm glad the logs are helpful.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> At this point I think the root cause is apparently a race between deferred writes replay and some DB maintenance task happening on OSD startup. It seems that the deferred write replay updates a block extent which RocksDB/BlueFS are using. Hence the target BlueFS file gets all-zeros content. Evidently that's just a matter of chance whether they use a conflicting physical extent or not, hence the occasional nature of the issue...
>>>>>>>>>>>> Do I understand that correctly: the corruption of the RocksDB (table overwritten by zeros) happens at the first start of the OSD after "*** Immediate shutdown (osd_fast_shutdown=true) ***"? Before the system launches the OSD service, the RocksDB is still fine?
>>>>>>>>>>> Looks like that. From the logs I can see an unexpected write to a specific extent (LBA 0x630000) which shouldn't occur and at which RocksDB subsequently fails.
>>>>>>>>>>>>> So first of all I'm curious if you have any particular write patterns that can be culprits? E.g. something like a disk wiping procedure which writes all-zeros to an object followed by object truncate or removal comes to my mind. If you can identify something like that - could you please collect an OSD log for such an operation (followed by an OSD restart) with debug-bluestore set to 20?
>>>>>>>>>>>> To the best of my knowledge the OSD was hardly doing anything, and I do not see any pattern that would fit your explanation.
>>>>>>>>>>>> However, you certainly understand a lot more about this than I do, so I will try to explain everything that could be relevant.
>>>>>>>>>>>>
>>>>>>>>>>>> The cluster has 3 nodes. Each has a 240GB NVMe M.2 SSD as boot drive, which should not be relevant. Each node has 3 OSDs: one is on a 2TB U.2 NVMe SSD and the other two are on 12TB HDDs.
>>>>>>>>>>>>
>>>>>>>>>>>> I have configured two crush rules, 'c3nvme' and 'ec4x2hdd'. The 'c3nvme' is a replicated rule that uses only OSDs with class 'nvme'. The second rule is a tricky erasure rule. It selects exactly 2 OSDs on exactly 4 hosts with class 'hdd'. So it only works for a size of exactly 8. That means that a pool that uses this rule always has only "undersized" placement groups, since the cluster has only 3 nodes. (I did not add the fourth server after the first crash in December, since we want to reproduce the problem.)
>>>>>>>>>>>>
>>>>>>>>>>>> The pools device_health_metrics, test-pool, fs.metadata-root-pool, fs.data-root-pool, fs.data-nvme.c-pool, and block-nvme.c-pool use the crush rule 'c3nvme' with a size of 3 and a min size of 2. The pools fs.data-hdd.ec-pool and block-hdd.ec-pool use the crush rule 'ec4x2hdd' with k=5, m=3 and a min size of 6.
>>>>>>>>>>>>
>>>>>>>>>>>> The pool fs.data-nvme.c-pool is not used, and the pool test-pool was used for rados bench a few months ago.
>>>>>>>>>>>>
>>>>>>>>>>>> The pool fs.metadata-root-pool is used as the metadata pool for cephFS and fs.data-root-pool as the root data pool for the cephFS. The pool fs.data-hdd.ec-pool is an additional data pool for the cephFS and is specified as ceph.dir.layout for some folders of the cephFS. The whole cephFS is mounted by each of the 3 nodes.
>>>>>>>>>>>>
>>>>>>>>>>>> The pool block-nvme.c-pool hosts two RBD images that are used as boot drives for two VMs. The first VM runs Ubuntu Desktop and the second Debian. The pool block-hdd.ec-pool hosts one RBD image (the data part; metadata on block-nvme.c-pool) that is attached to the Debian VM as a second drive formatted with BTRFS. Furthermore, the Debian VM mounts a subdirectory of the cephFS that has the fs.data-hdd.ec-pool set as layout. Both VMs were doing nothing, apart from being booted, in the last couple of days.
>>>>>>>>>>>>
>>>>>>>>>>>> I try to illustrate the pool usage as a tree:
>>>>>>>>>>>> * c3nvme (replicated, size=3, min_size=2)
>>>>>>>>>>>>    + device_health_metrics
>>>>>>>>>>>>    + test-pool
>>>>>>>>>>>>       - rados bench
>>>>>>>>>>>>    + fs.metadata-root-pool
>>>>>>>>>>>>       - CephFS (metadata)
>>>>>>>>>>>>    + fs.data-root-pool
>>>>>>>>>>>>       - CephFS (data root)
>>>>>>>>>>>>    + fs.data-nvme.c-pool
>>>>>>>>>>>>    + block-nvme.c-pool
>>>>>>>>>>>>       - RBD (Ubuntu VM, boot disk with ext4)
>>>>>>>>>>>>       - RBD (Debian VM, boot disk with ext4)
>>>>>>>>>>>> * ec4x2hdd (ec, k=5, m=3, min_size=6)
>>>>>>>>>>>>    + fs.data-hdd.ec-pool
>>>>>>>>>>>>       - CephFS (layout data)
>>>>>>>>>>>>    + block-hdd.ec-pool
>>>>>>>>>>>>       - RBD (Debian VM, disk 2 with BTRFS)
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Last week I was writing about 30TB to the CephFS inside the fs.data-hdd.ec-pool and around 50GB to the BTRFS volume on the block-hdd.ec-pool. The 50GB written to BTRFS contain hundreds of thousands of hard links. There was already nearly the same data on the storage, and I used rsync to update it; a sketch of the invocation follows below. I think something between 1 and 4 TB had changed and was updated by rsync.
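>>>>>>>>>>>>
>>>>>>>>>>>> For reference, a minimal sketch of such an rsync call (the paths are just placeholders and the exact option set is from memory; the important part is -H):
>>>>>>>>>>>> ```
>>>>>>>>>>>> # -a: archive mode; -H: preserve hard links across the copy
>>>>>>>>>>>> $ rsync -aH /path/to/backup-source/ /mnt/btrfs-on-rbd/
>>>>>>>>>>>> ```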
>>>>>>>>>>>>
>>>>>>>>>>>> I think the cluster was totally unused on Friday, but up and running and idling around. Then on Saturday I did a graceful shutdown of all cluster nodes. Around 5 minutes later, when I booted the servers again, the OSD.7 crashed. I copied the logs and exported the RocksDB. Then I deleted everything from the HDD and deployed the OSD.7 again. When I checked for the first time today, at around 12:00, ceph had already finished backfilling to OSD.7 and the cluster was idle again.
>>>>>>>>>>>>
>>>>>>>>>>>> I then spent 70 minutes writing 3 small files (one of about 500 Bytes and two of about 2KB) to the cephFS on the fs.data-hdd.ec-pool and shutting down the whole cluster. After I booted all 3 nodes again, I overwrote the 3 small files (some bytes had changed) and shut down all 3 nodes. I repeated that procedure 4 times within the 70 minutes. Then OSD 7 crashed again.
>>>>>>>>>>>>
>>>>>>>>>>>> There may have been some more IO caused by mounting the RBD image with the BTRFS volume inside the Debian VM, but I do not know. I can only say that no user used it. Only I wrote the 3 small files to the cephFS on the fs.data-hdd.ec-pool, 4 times.
>>>>>>>>>>>>
>>>>>>>>>>>> Does this make any sense to you?
>>>>>>>>>>> Well, that's actually not a write pattern I would expect to cause the issue. But thanks a lot anyway. So could you please repeat that procedure with overwriting the 3 small files and restarting the OSD, and collect the logs with debug-bluestore set to 20.
>>>>>>>>>>>>> with debug-bluestore set to 20
>>>>>>>>>>>> Do you mean
>>>>>>>>>>>> ```
>>>>>>>>>>>> [osd]
>>>>>>>>>>>> debug bluestore = 20
>>>>>>>>>>>> ```
>>>>>>>>>>>> in ceph.conf?
>>>>>>>>>>> Yes. Or you might use 'ceph config set osd.7 debug-bluestore 20' instead.
>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Sebastian
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> On 21.02.2022, at 21:19, Igor Fedotov <igor.fedotov@xxxxxxxx> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hey Sebastian,
>>>>>>>>>>>>>
>>>>>>>>>>>>> thanks a lot for the new logs - looks like they provide some insight. At this point I think the root cause is apparently a race between deferred writes replay and some DB maintenance task happening on OSD startup. It seems that the deferred write replay updates a block extent which RocksDB/BlueFS are using. Hence the target BlueFS file gets all-zeros content. Evidently that's just a matter of chance whether they use a conflicting physical extent or not, hence the occasional nature of the issue...
>>>>>>>>>>>>>
>>>>>>>>>>>>> And now I'd like to determine what's wrong with this deferred write replay.
>>>>>>>>>>>>>
>>>>>>>>>>>>> So first of all I'm curious if you have any particular write patterns that can be culprits? E.g. something like a disk wiping procedure which writes all-zeros to an object followed by object truncate or removal comes to my mind. If you can identify something like that - could you please collect an OSD log for such an operation (followed by an OSD restart) with debug-bluestore set to 20?
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Igor
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 2/21/2022 5:29 PM, Sebastian Mazza wrote:
>>>>>>>>>>>>>> Hi Igor,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> today (21-02-2022) at 13:49:28.452+0100, I crashed the OSD 7 again. And this time I have logs with "debug bluefs = 20" and "debug bdev = 20" for every OSD in the cluster! It was the OSD with the ID 7 again. So this HDD has now failed for the third time! Coincidence?
>>>>>>>>>>>>>> Probably not…
>>>>>>>>>>>>>> The important thing seems to be that a shutdown, and not only a restart, of the entire cluster is performed, since this time the OSD failed after just 4 shutdowns of all nodes in the cluster within 70 minutes.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I redeployed the OSD.7 after the crash from 2 days ago, and I started this new shutdown and boot series shortly after ceph had finished writing everything back to OSD.7, earlier today.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The corrupted RocksDB file (crash) is again only 2KB in size.
>>>>>>>>>>>>>> You can download the RocksDB file with the bad table magic number and the log of the OSD.7 under this link: https://we.tl/t-e0NqjpSmaQ
>>>>>>>>>>>>>> What else do you need?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> From the log of the OSD.7:
>>>>>>>>>>>>>> —————
>>>>>>>>>>>>>> 2022-02-21T13:47:39.945+0100 7f6fa3f91700 20 bdev(0x55f088a27400 /var/lib/ceph/osd/ceph-7/block) _aio_log_finish 1 0x96d000~1000
>>>>>>>>>>>>>> 2022-02-21T13:47:39.945+0100 7f6fa3f91700 10 bdev(0x55f088a27400 /var/lib/ceph/osd/ceph-7/block) _aio_thread finished aio 0x55f0b8c7c910 r 4096 ioc 0x55f0b8dbdd18 with 0 aios left
>>>>>>>>>>>>>> 2022-02-21T13:49:28.452+0100 7f6fa8a34700 -1 received signal: Terminated from /sbin/init (PID: 1) UID: 0
>>>>>>>>>>>>>> 2022-02-21T13:49:28.452+0100 7f6fa8a34700 -1 osd.7 4711 *** Got signal Terminated ***
>>>>>>>>>>>>>> 2022-02-21T13:49:28.452+0100 7f6fa8a34700 -1 osd.7 4711 *** Immediate shutdown (osd_fast_shutdown=true) ***
>>>>>>>>>>>>>> 2022-02-21T13:53:40.455+0100 7fc9645f4f00 0 set uid:gid to 64045:64045 (ceph:ceph)
>>>>>>>>>>>>>> 2022-02-21T13:53:40.455+0100 7fc9645f4f00 0 ceph version 16.2.6 (1a6b9a05546f335eeeddb460fdc89caadf80ac7a) pacific (stable), process ceph-osd, pid 1967
>>>>>>>>>>>>>> 2022-02-21T13:53:40.455+0100 7fc9645f4f00 0 pidfile_write: ignore empty --pid-file
>>>>>>>>>>>>>> 2022-02-21T13:53:40.459+0100 7fc9645f4f00 1 bdev(0x55bd400a0800 /var/lib/ceph/osd/ceph-7/block) open path /var/lib/ceph/osd/ceph-7/block
>>>>>>>>>>>>>> —————
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> To me this looks like the OSD did nothing for nearly 2 minutes before it received the termination request. Shouldn't that be enough time to flush every imaginable write cache?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I hope this helps you.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Best wishes,
>>>>>>>>>>>>>> Sebastian
>>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Igor Fedotov
>>>>>>>>>>>>> Ceph Lead Developer
>>>>>>>>>>>>>
>>>>>>>>>>>>> Looking for help with your Ceph cluster? Contact us at https://croit.io
>>>>>>>>>>>>>
>>>>>>>>>>>>> croit GmbH, Freseniusstr. 31h, 81247 Munich
>>>>>>>>>>>>> CEO: Martin Verges - VAT-ID: DE310638492
>>>>>>>>>>>>> Com. register: Amtsgericht Munich HRB 231263
>>>>>>>>>>>>> Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
>>>>>>>>>>> --
>>>>>>>>>>> Igor Fedotov
>>>>>>>>>>> Ceph Lead Developer
>>>>>>>>>>>
>>>>>>>>>>> Looking for help with your Ceph cluster? Contact us at https://croit.io
>>>>>>>>>>>
>>>>>>>>>>> croit GmbH, Freseniusstr. 31h, 81247 Munich
>>>>>>>>>>> CEO: Martin Verges - VAT-ID: DE310638492
>>>>>>>>>>> Com. register: Amtsgericht Munich HRB 231263
>>>>>>>>>>> Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
>>>>>>>>> --
>>>>>>>>> Igor Fedotov
>>>>>>>>> Ceph Lead Developer
>>>>>>>>>
>>>>>>>>> Looking for help with your Ceph cluster? Contact us at https://croit.io
>>>>>>>>>
>>>>>>>>> croit GmbH, Freseniusstr. 31h, 81247 Munich
>>>>>>>>> CEO: Martin Verges - VAT-ID: DE310638492
>>>>>>>>> Com. register: Amtsgericht Munich HRB 231263
>>>>>>>>> Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
>>>>>>> --
>>>>>>> Igor Fedotov
>>>>>>> Ceph Lead Developer
>>>>>>>
>>>>>>> Looking for help with your Ceph cluster? Contact us at https://croit.io
>>>>>>>
>>>>>>> croit GmbH, Freseniusstr. 31h, 81247 Munich
>>>>>>> CEO: Martin Verges - VAT-ID: DE310638492
>>>>>>> Com. register: Amtsgericht Munich HRB 231263
>>>>>>> Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
>>>>> --
>>>>> Igor Fedotov
>>>>> Ceph Lead Developer
>>>>>
>>>>> Looking for help with your Ceph cluster? Contact us at https://croit.io
>>>>>
>>>>> croit GmbH, Freseniusstr. 31h, 81247 Munich
>>>>> CEO: Martin Verges - VAT-ID: DE310638492
>>>>> Com. register: Amtsgericht Munich HRB 231263
>>>>> Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
>>>>>
>>> --
>>> Igor Fedotov
>>> Ceph Lead Developer
>>>
>>> Looking for help with your Ceph cluster? Contact us at https://croit.io
>>>
>>> croit GmbH, Freseniusstr. 31h, 81247 Munich
>>> CEO: Martin Verges - VAT-ID: DE310638492
>>> Com. register: Amtsgericht Munich HRB 231263
>>> Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
>>>
>>> _______________________________________________
>>> ceph-users mailing list -- ceph-users@xxxxxxx
>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
> --
> Igor Fedotov
> Ceph Lead Developer
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH, Freseniusstr. 31h, 81247 Munich
> CEO: Martin Verges - VAT-ID: DE310638492
> Com. register: Amtsgericht Munich HRB 231263
> Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
>
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx