Database disk image is malformed error in piece_expiration db

Hi all,

My node suddenly started spiking in CPU and disk IO (after the recent tests were done, I guess… the bandwidth had already gone back down to 20Mb/s from the peak of 170Mb/s), so it was rather surprising. Eventually, I noticed the log was full of errors, the dreaded “database disk image is malformed”. I followed https://support.storj.io/hc/en-us/articles/360029309111-How-to-fix-a-database-disk-image-is-malformed and found only the piece_expiration db had a problem, so I followed the guide to repair it.
My concern is that the original .db was 222MB in size. Step 14 of the guide, where we load the dumped data back into the database, seems to be working, as I can see the new piece_expiration db growing. However, it grows at about 10KB/s… is this normal? It’s never gonna finish; my node is going to be disqualified before it’s done :frowning:
sqlite3 is also giving me a bunch of errors
“Runtime error near line 2401: UNIQUE constraint failed: piece_expirations.satellite_id, piece_expirations.piece_id (19)”
did I basically just lose my node with this db corruption ?

Don’t panic, it needs 30 days offline to get disqualified. :wink:

1 Like

You can stop the node, delete that corrupted db, and move the other dbs to some folder, then restart the node. All the dbs are recreated. Then stop the node and move the other databases back. Restart the node. I’m not sure if you need to run the startup filewalker too.
The dbs are not critical. If you delete all of them, the node will recreate them. You just need to let the startup filewalker finish, to update the used space etc.
Also, you should move them to an SSD.

1 Like

replying to my own question, and I really think this should go into this guide https://support.storj.io/hc/en-us/articles/360029309111-How-to-fix-a-database-disk-image-is-malformed

sqlite by default wraps each record insert in its own transaction, so it syncs each record to disk. That is super slow.
To have sqlite just dump all the records without waiting on each one to be flushed to disk, the first line of the no_trans.sql file should be
PRAGMA synchronous = OFF ;
You can achieve this by creating a file with this single line in it, then using cat to concatenate it with the no_trans.sql file created at step 11 into a new .sql file.
This will allow the db to be recreated at a rate of at least a few MB/s.
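To make the step concrete, here is a self-contained sketch of the pragma trick on a throwaway database (file names here are stand-ins; on a real node you would use the dump_all.sql / no_trans.sql files from the guide):

```shell
# Build a tiny example db, dump it, and reload the dump with per-record
# fsyncs disabled. All file names are examples, not the guide's real paths.
cd "$(mktemp -d)"
sqlite3 old.db "CREATE TABLE t(id INTEGER PRIMARY KEY, v TEXT);
                INSERT INTO t(v) VALUES ('a'),('b');"

# Equivalent of step 11: dump without transaction statements
sqlite3 old.db .dump | grep -v -e TRANSACTION -e ROLLBACK -e COMMIT > no_trans.sql

# Prepend the pragma so sqlite3 does not wait for an fsync on every INSERT
{ echo "PRAGMA synchronous = OFF ;" ; cat no_trans.sql ; } > no_trans_fast.sql

# Rebuild the database from the combined script
sqlite3 new.db < no_trans_fast.sql
```

With `synchronous = OFF`, sqlite hands writes to the OS and moves on, which is safe enough here because the rebuild can simply be rerun from the dump if it is interrupted.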

2 Likes

ok wait you are saying that all the db files are created from the stored data under data/ ?
all the db are just caches ?

Not caches, but yes, they are recreated by the storagenode and the filewalkers.
I believe all the critical info your node must have undamaged is the blobs folders and the identity. Other stuff is not critical and can be recreated.
Yep, in a way they are caches.

ok good to know.
But I am not sure I understand your procedure. You say to stop, delete the offending db, then stop again and overwrite with the older dbs? Why?
I mean, if they can be entirely recreated, then why not just delete them all and restart the node?

alright well I finished using the guide, and now I just got one error in the logs

2024-06-25T00:11:41+02:00 ERROR piecestore download failed {“Process”: “storagenode”, “Piece ID”: “X6CPU6ACV3VCXIEYRQYHLMKP44TKRAIYC6NO4WXX253MMV6FEUUQ”, “Satellite ID”: “12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S”, “Action”: “GET”, “Offset”: 0, “Size”: 209152, “Remote Address”: “79.127.205.228:58914”, “error”: “untrusted: unable to get signee: trust: rpc: tcp connector failed: rpc: dial tcp: lookup us1.storj.io: operation was canceled”, “errorVerbose”: “untrusted: unable to get signee: trust: rpc: tcp connector failed: rpc: dial tcp: lookup us1.storj.io: operation was canceled\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).VerifyOrderLimitSignature:140\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).verifyOrderLimit:62\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:618\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:302\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:61\n\tstorj.io/common/experiment.(*Handler).HandleRPC:42\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:167\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:109\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:157\n\tstorj.io/drpc/drpcctx.(*Tracker).track:35”}

not sure how bad that is
probably because my node was again offline for 45 minutes

If you delete all databases, you lose the progress for the current month, I believe. I mean the progress shown in the dashboard. This doesn’t affect your node or payout in any way.
If you delete all databases, you must run the startup filewalker (piece scan on startup).
If you don’t want to lose the history, and maybe not run the filewalker, just lose the affected database.
In any case, back up the databases and delete them all. Run the node once to recreate them and stop it. Then bring back the backed-up databases, except the broken one if you want.
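Sketched as shell commands, that shuffle might look like this (a hypothetical layout: `storage/` stands in for your node’s db directory, the db names are examples, and the node start/stop steps are only noted as comments since they depend on your setup, docker vs. systemd etc.):

```shell
# Simulate a storage directory with some databases, for illustration only
cd "$(mktemp -d)"
mkdir storage backup
touch storage/piece_expiration.db storage/bandwidth.db storage/orders.db

# 1. (stop the node first)
# 2. Back up all databases, then remove them from the storage dir
cp storage/*.db backup/
rm storage/*.db
# 3. (start the node once so it recreates fresh databases, then stop it)
# 4. Restore every backed-up db EXCEPT the corrupted one
for f in backup/*.db ; do
  [ "$(basename "$f")" = "piece_expiration.db" ] && continue
  cp "$f" storage/
done
# 5. (start the node again)
ls storage   # shows the restored dbs, without piece_expiration.db
```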

1 Like

ok well it seems to be fixed with the procedure from the guide.
Now the weird thing is the storagenode process is still spiking CPU, even though network bandwidth is pretty low (50Mb/s) and disk activity is also just about 20MB/s (I have a zfs raidz and it can take 150MB/s or more, no problem). It’s not even the filewalker process that is taking all the CPU, just “storagenode run”.

You only lose history for the decommissioned satellites.

2 Likes

They hold non-critical data.

However, recreating and therefore losing piece_expiration sounds bad, as AFAIK it stores the TTLs for the pieces. If it is deleted, all the test data with a TTL will not get deleted immediately by the collector process, but only through the retain process: moved to trash, where it will sit another week before deletion.
That makes losing this db really costly.

2 Likes

yes, but you may use tmpfs for the database, it will be processed faster.

Thank you for the tip!

Not all, but some. However, the current and historic stats will be lost too; the filewalkers will restore only storage usage, but not, for example, bandwidth usage. The payout information would be received again from trusted satellites, but will be lost for untrusted ones, and so on.
Most of the databases are used as a persistent cache, yes; they are mostly for the dashboard.
However, this one database in question is designed to keep the TTL for pieces. So, if it were lost, these pieces wouldn’t be deleted automatically (bypassing the trash) as they should; instead they would be collected by the garbage collector, moved to the trash, and removed permanently after 7 days. So in the worst case the TTL data will be on the disk up to two weeks longer than it should.
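For context, you can peek at what that database holds with plain sqlite3 (a sketch; the table and column names are inferred from the UNIQUE-constraint error quoted earlier in this thread, so treat the schema as an assumption, and a stand-in db is created here so the snippet is self-contained):

```shell
# Stand-in db with the inferred schema; on a real node, point sqlite3
# at storage/piece_expiration.db instead of creating anything.
cd "$(mktemp -d)"
sqlite3 piece_expiration.db "CREATE TABLE piece_expirations(
  satellite_id BLOB NOT NULL,
  piece_id BLOB NOT NULL,
  piece_expiration TIMESTAMP NOT NULL,
  PRIMARY KEY(satellite_id, piece_id)
);"

# How many TTL records the node is tracking
sqlite3 piece_expiration.db "SELECT count(*) FROM piece_expirations;"
```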

Yes, the node is unable to properly re-create only the one missing database, so you need to delete the broken one and move the remaining ones to some other folder. When you start the node, all databases will be recreated correctly, with migrations applied, and will be ready to work. You then stop the node and move all the previous databases back, so as a result only the offending database is re-created and you don’t lose the stats in all the others.
However, it’s better to fix the database if possible; that keeps the stats and history.

This is either a problem with DNS, or the node was stopping and this is a consequent error during the stop.

So the right fix would be, at step 11, instead of

cat /storage/dump_all.sql | grep -v TRANSACTION | grep -v ROLLBACK | grep -v COMMIT >/storage/dump_all_notrans.sql

use

{ echo "PRAGMA synchronous = OFF ;" ; cat /storage/dump_all.sql ; } | grep -v TRANSACTION | grep -v ROLLBACK | grep -v COMMIT >/storage/dump_all_notrans.sql

1 Like

Thank you, added. Please check.

By the way did it help?

The pragma statement helped, yes; it recreated the database at a MB/s rate instead of 10KB/s.

Also rebuilding the db according to the guide definitely helped as I stopped getting the piece expiration errors.

1 Like

BTW, you can even use a single grep with multiple patterns:

grep -v -e TRANSACTION -e ROLLBACK -e COMMIT

to avoid spawning 3 grep processes

Also, the remark at step 11 regarding tmpfs is not really needed anymore if you use the synchronous pragma, and using tmpfs would (in theory) even be slower. I am pretty sure tmpfs speeds things up because each sync returns immediately from RAM, so sqlite still makes the fsync calls and waits for them, although both return immediately. But with the synchronous pragma, those calls are not even made.
That said, I did not test this; it’s just my understanding of how it works.

Thanks. However, the suggestion about tmpfs will remain until confirmed.
Please check. Do you see anything else?

The difference between using tmpfs to create the new db and just using the synchronous pragma would be very marginal anyway, probably insignificant. The biggest effect of the synchronous pragma is that you do not need to use RAM for the database rebuild.

1 Like

I replaced steps 9 to 14 with this one-liner (change the database file name and path accordingly):

export DBNAME=piece_expiration.db ; { echo "PRAGMA synchronous = OFF ;"; echo -e ".mode insert \n.output \n.dump \n.exit" | sqlite3 $DBNAME ;} | grep -v -e TRANSACTION -e ROLLBACK -e COMMIT | sqlite3 $DBNAME.new ; mv $DBNAME.new $DBNAME

Of course, make a backup copy of the database first, just in case.
I unfortunately do not have a corrupted database to test this on now to make sure it gives the same result as the original procedure, but it worked for me, and it’s basically all the same steps, just in one operation. Easier to cut and paste :slight_smile:

If someone else could verify and validate, that would be great
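One cheap way to sanity-check the rebuilt file before starting the node is sqlite’s built-in integrity check (a sketch; a stand-in db is created here so the snippet runs anywhere, but on a real node you would point sqlite3 at the rebuilt piece_expiration.db):

```shell
# Create a stand-in database just so this example is self-contained
cd "$(mktemp -d)"
sqlite3 rebuilt.db "CREATE TABLE t(id INTEGER PRIMARY KEY);"

# PRAGMA integrity_check prints "ok" when the file is healthy;
# a still-corrupted file would print a list of problems instead
sqlite3 rebuilt.db "PRAGMA integrity_check;"
```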

1 Like