Error repairing malformed piece_expiration.db

BBQMan · March 8, 2024, 3:14pm

Hi All,

A SAS controller’s cache failed on me, which in turn corrupted this DB on 2 nodes. The first one I tried to fix showed no errors during the repair, but when I remade the container, it wouldn’t start. So I put the backed-up malformed DB back in place, and the container is now working fine without giving any malformed errors. Odd, but that one seems to work fine now.

The 2nd DB I’ve tried to repair throws the following errors during the repair. Is this something to worry about, or should I try the new .db? The new .db is 1/8 the size of the malformed one…

Error: near line 67435: stepping, UNIQUE constraint failed: piece_expirations.satellite_id, piece_expirations.piece_id (19)
Error: near line 67436: stepping, UNIQUE constraint failed: piece_expirations.satellite_id, piece_expirations.piece_id (19)
Error: near line 67441: stepping, UNIQUE constraint failed: piece_expirations.satellite_id, piece_expirations.piece_id (19)

JWvdV · March 9, 2024, 7:17am

Did you check the filesystem before attempting the repair?

Actually, you could start the node without the databases (move them away temporarily). All databases are then recreated within a minute. Then just stop the node and move all the databases back, except for the piece_expirations database.
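The last step above (putting every database back except the damaged one) could be sketched roughly like this. It is only an illustration under assumptions: the directory arguments are hypothetical, the file name piece_expiration.db is taken from the thread title, and the node must be stopped while it runs. It copies instead of moving, so the backup stays untouched.

```python
import shutil
from pathlib import Path


def restore_databases(backup_dir, storage_dir, skip=("piece_expiration.db",)):
    """Copy every *.db file from backup_dir into storage_dir, except those in skip.

    Files named in skip are left alone, so the freshly recreated copy
    in storage_dir is the one the node keeps using.
    Returns the names of the files that were restored.
    """
    restored = []
    for src in sorted(Path(backup_dir).glob("*.db")):
        if src.name in skip:
            continue  # keep the newly created database for this one
        # copy2 overwrites the freshly created file and preserves timestamps
        shutil.copy2(src, Path(storage_dir) / src.name)
        restored.append(src.name)
    return restored
```

Calling restore_databases("/path/to/backup", "/path/to/storage") with your own (here hypothetical) paths returns the list of restored file names, which is handy for double-checking that only the intended database was skipped.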

Alexey · March 9, 2024, 12:27pm

You need to check your disk and check your databases while the node is stopped:

BBQMan · March 9, 2024, 11:17pm

Thanks for the suggestion. I did check the filesystems before the repairs and they were fine. I was unsure about the system recreating the DBs. So how about just leaving the existing ones that are fine in place and let it rebuild only the piece_expirations?

BBQMan · March 9, 2024, 11:23pm

I did follow these instructions while everything was shut down. The disks’ filesystems are in good shape. The error stated above occurred during step 13 on one of the DBs. I will try again this evening and see what happens.

BBQMan · March 10, 2024, 12:41am

Alexey, I re-ran the fix and got the same errors at the end of step 13, but this time the new file was much larger: 1/2 the size of the malformed one instead of 1/8 as on the previous attempt. I put the new .db in place on the node, started it, and all seemed well. There must have been 1000s of expired files, because it deleted files for about 30 minutes or so. Once that completed, I stopped and started the node and all was well. As a last test I removed and re-created the container, and everything was working, so it appears the problem is solved.

Thanks again for all the awesome support!

Alexey · March 10, 2024, 2:59am

If you mean the script with PRAGMA integrity_check;, then it’s not a fix, it’s a detection. The fix is manual: you need to unload the data from the corrupted database, clean the generated SQL script of transaction clauses, and load it into an empty database. This way you can recover as much as possible.
However, if you are OK with losing historical and statistical data, you may re-create this database from scratch instead:
https://support.storj.io/hc/en-us/articles/4403032417044-How-to-fix-database-file-is-not-a-database-error

For this database this will mean that your node will not remove expired pieces right away. They will first be moved to the trash by the garbage collector, then removed from the trash after 7 days. So expired pieces will stay on your disk for 7 more days, maybe longer if the garbage collector fails or doesn’t detect them as garbage (it has only a 90% probability because of how Bloom filters work).
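To illustrate the manual recovery described above (detect, unload, strip the transaction clauses, reload), here is a minimal Python sketch using the standard sqlite3 module. It is not the script from the support article. It assumes the unloaded data is already available as a list of SQL statements, one complete statement per string, and the function names are made up for this example. Statements that fail, such as the UNIQUE constraint errors reported earlier in this thread, are counted and skipped so that everything else still loads.

```python
import sqlite3

# Statements starting with these keywords are dropped before reloading, so that
# a trailing ROLLBACK in the dump cannot undo everything that was loaded.
TRANSACTION_PREFIXES = ("BEGIN", "COMMIT", "ROLLBACK")


def integrity_ok(db_path):
    """Detection only: True if PRAGMA integrity_check reports a single 'ok' row."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute("PRAGMA integrity_check;").fetchall()
    except sqlite3.DatabaseError:
        return False  # the file is too damaged to even run the check
    finally:
        conn.close()
    return rows == [("ok",)]


def strip_transaction_clauses(statements):
    """Yield every statement that is not a BEGIN/COMMIT/ROLLBACK clause."""
    for stmt in statements:
        if stmt.strip().upper().startswith(TRANSACTION_PREFIXES):
            continue
        yield stmt


def reload_statements(statements, new_db_path):
    """Load the cleaned statements into a new database, skipping failing ones.

    Returns (loaded, skipped). Assumes each item is one complete SQL statement.
    """
    # isolation_level=None puts the connection in autocommit mode,
    # so each successful statement is kept even if a later one fails.
    conn = sqlite3.connect(new_db_path, isolation_level=None)
    loaded = skipped = 0
    try:
        for stmt in strip_transaction_clauses(statements):
            try:
                conn.execute(stmt)
                loaded += 1
            except sqlite3.Error:
                skipped += 1  # e.g. "UNIQUE constraint failed" on a duplicate row
    finally:
        conn.close()
    return loaded, skipped
```

With this approach a duplicate (satellite_id, piece_id) row in the unloaded data costs only that one row: the first copy is kept and the later one is skipped. Keep a backup of the malformed file until the node has run cleanly on the new one.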

BBQMan · March 10, 2024, 3:25am

I did go through the process of unloading and reloading the data with the scripts. Thanks again for all the help and explanation of how things work.