Database malformed out of the blue on docker

gingerbread233 · July 25, 2024, 6:43am

Hello,

Out of the blue my storj container stopped working. I tried restarting it, without success, then I stopped it, deleted the container and redeployed it, also without success. My node was working fine, without any problems, but all of a sudden, this happened. I hope somebody could help me out to minimize my downtime. TiA

Docker-Log:

Error: Error migrating tables for database on storagenode: migrate: v60: database disk image is malformed

storj.io/storj/private/migrate.(*Migration).Run:212

storj.io/storj/storagenode/storagenodedb.(*DB).MigrateToLatest:425

main.cmdRun:100

main.newRunCmd.func1:33

storj.io/common/process.cleanup.func1.4:393

storj.io/common/process.cleanup.func1:411

github.com/spf13/cobra.(*Command).execute:983

github.com/spf13/cobra.(*Command).ExecuteC:1115

github.com/spf13/cobra.(*Command).Execute:1039

storj.io/common/process.ExecWithCustomOptions:112

main.main:34

runtime.main:271

2024-07-25 06:33:54,874 INFO exited: storagenode (exit status 1; not expected

Alexey · July 25, 2024, 6:49am

It was likely malformed before, but only now, when the node tried to migrate it, did it notice that the database is malformed.

Please note that 1.108.x includes database migrations, and they can take a few hours to complete. If you kill the container during the migration, it can leave a database malformed.

Now you need to fix it.

gingerbread233 · July 25, 2024, 7:00am

My node was offline for around 2 hours, so I restarted and redeployed my node, without success. I’ll try the troubleshooting and hope it will fix the issue.

Alexey · July 25, 2024, 7:05am

Yes, these actions likely corrupted the database. As I said, the database migration can take hours, and you killed it in the middle of the process.
I hope that you can recover it from the malformed state. The alternative is to re-create it.

In the latter case, the TTL data would be collected only by GC (+2 weeks), not when it expires.

gingerbread233 · July 25, 2024, 7:16am

I learned it the hard way. Fortunately, it is not my only node. In the past, some of my nodes occasionally hung, and restarting fixed it. Unfortunately, the logs weren’t clear enough, or I didn’t understand them correctly, so I didn’t realize that the node was migrating the database and shouldn’t be stopped or killed.

gingerbread233 · July 25, 2024, 2:52pm

I tried the tutorial you sent me. I managed to dump the DB into “dump_all.sql” and then ran the other command to “fix it”. The “dump_all.sql” and “dump_all_notrans.sql” files are exactly the same size, meaning nothing happened. Running the other commands to put it back together in a new .db file gives me an error:

PS C:\Users\User> $(echo "PRAGMA synchronous = OFF ;"; Get-Content C:\Users\User\Desktop\storj17DB\dump_all.sql) | Select-String -NotMatch "TRANSACTION|ROLLBACK|COMMIT" | Set-Content -Encoding utf8 C:\Users\User\Desktop\storj17DB\dump_all_notrans.sql
PS C:\Users\User> sqlite3 C:\Users\User\Desktop\storj17DB\piece_expiration.db ".read C:\Users\USER\Desktop\storj17DB\dump_all_notrans.sql"
Error: cannot open "C:\Users\USER\Desktop\storj17DB\dump_all_notrans.sql"
PS C:\Users\User> sqlite3 C:\Users\User\Desktop\storj17DB\piece_expiration.db ".read C:\Users\User\Desktop\storj17DB\dump_all_notrans.sql"
Parse error near line 161323: no such table: versions
Parse error near line 161324: no such table: versions
Parse error near line 161325: no such table: versions
PS C:\Users\User> $(echo "PRAGMA synchronous = OFF ;"; Get-Content dump_all.sql) | Select-String -NotMatch "TRANSACTION|ROLLBACK|COMMIT" | Set-Content -Encoding utf8 dump_all_notrans.sql

Is my DB unfixable? It seems to be just one DB that is broken (piece_expiration.db).

This is what my malformed SQL looks like

This is a working SQL

The affected lines

Something seems to be wrong here.
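For readers who want to reproduce the filtering step outside PowerShell, here is a minimal Python sketch of what the pipeline in this post appears to do: prepend `PRAGMA synchronous = OFF;` and drop every line that mentions TRANSACTION, ROLLBACK, or COMMIT. The function name `strip_transactions` is hypothetical, and the sketch assumes the dump is UTF-8 text. It also counts the dropped lines, which is a more direct check than comparing file sizes, since only a handful of lines are expected to be removed.

```python
import re
from pathlib import Path

# Same pattern as the Select-String -NotMatch call in the thread.
# Select-String is case-insensitive by default, so IGNORECASE mirrors that.
TXN_PATTERN = re.compile(r"TRANSACTION|ROLLBACK|COMMIT", re.IGNORECASE)


def strip_transactions(src, dst):
    """Copy src to dst without transaction lines; return (kept, dropped)."""
    kept = dropped = 0
    with Path(src).open("r", encoding="utf-8", errors="replace") as fin, \
            Path(dst).open("w", encoding="utf-8", newline="\n") as fout:
        # Prepended exactly once, like the `echo` in the PowerShell pipeline.
        fout.write("PRAGMA synchronous = OFF;\n")
        for line in fin:
            if TXN_PATTERN.search(line):
                dropped += 1
            else:
                fout.write(line)
                kept += 1
    return kept, dropped
```

A call such as `strip_transactions("dump_all.sql", "dump_all_notrans.sql")` returns how many lines were kept and how many were dropped.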

Alexey · July 26, 2024, 5:31am

You are correct, unfortunately. You can see that the log is not moving forward (no uploads/downloads/audits/repairs).

I do not like this error:

If you try to read an SQL dump into a database that does not exist, sqlite will create the database file first and then start to read. Since it was unable to read the file for some reason (perhaps a wrong path?), it created a database but likely corrupted it because of the error.
After that, you need to delete the newly created database file and start the next read attempt with the correct path.

So, please delete the C:\Users\User\Desktop\storj17DB\piece_expiration.db file:

rm C:\Users\User\Desktop\storj17DB\piece_expiration.db

and try to read again:

sqlite3 C:\Users\User\Desktop\storj17DB\piece_expiration.db ".read C:\Users\User\Desktop\storj17DB\dump_all_notrans.sql"

gingerbread233 · July 26, 2024, 5:49am

I tried it several times yesterday, also with the help of Perplexity. I first copied all the .db files and checked the integrity of each with sqlite3, then sorted out the good ones so that only the corrupted one was left. Then I tried the steps in the tutorial. My .db file is around 56.6 MB, but “dump_all.sql” was just 33 MB, and after filtering it into “dump_all_notrans.sql” the size stayed the same. I made a second attempt, which produced a 66 MB “dump_all.sql”; that seems more plausible, since the .db is 56.6 MB. Then I retried the filtering command, but in the end “dump_all_notrans.sql” was again trimmed to around 33 MB. Trying to write the SQL file back into a new .db resulted in the same error with the 3 affected lines.

My other node, with almost exactly the same amount of stored data, has a .db file twice the size. I don’t know if it’s worth trying to fix it when presumably half of the DB is missing. So, if this doesn’t work, I don’t know whether it would be best to go with your second “fix” and start the DB from zero. I don’t want to leave the node offline for too long.
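The "check every .db file, then set aside the good ones" step described above can be sketched in Python with the standard `sqlite3` module. This is only an illustration under assumptions: the function name `check_databases` and the directory layout are hypothetical, and it relies on SQLite's `PRAGMA integrity_check`, which reports `ok` for a healthy database. Files are opened read-only so the check itself cannot change them.

```python
import sqlite3
from pathlib import Path


def check_databases(db_dir):
    """Run PRAGMA integrity_check on every *.db file; return {name: result}."""
    results = {}
    for db_path in sorted(Path(db_dir).glob("*.db")):
        # Build a file: URI so the database can be opened in read-only mode.
        uri = db_path.resolve().as_uri() + "?mode=ro"
        try:
            conn = sqlite3.connect(uri, uri=True)
            try:
                rows = conn.execute("PRAGMA integrity_check;").fetchall()
                # A healthy database yields a single row containing "ok".
                results[db_path.name] = "; ".join(str(r[0]) for r in rows)
            finally:
                conn.close()
        except sqlite3.DatabaseError as exc:
            # Raised, for example, when the file is malformed or not a database.
            results[db_path.name] = f"error: {exc}"
    return results
```

Any file whose result is not exactly `ok` is a candidate for the dump-and-reload procedure discussed in this thread.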

Alexey · July 26, 2024, 6:52am

You should not compare the size of the text dump with the size of the binary DB.
Yes, if the DB is malformed, not all data can be dumped. This is just a way to recover at least something. It doesn’t guarantee that everything can be recovered.

Then you were likely unlucky, and not all data was dumped to SQL.
There is a second option: re-create this database, then stop the node and try to load the SQL dump. The database will already have all the needed structures, and you can likely load the remaining data (with warnings, because something may be duplicated). At least this database will not be malformed and will contain some of the TTL data, so there is less work for the Garbage Collector.
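As an illustration of loading a dump into an already re-created database while tolerating duplicates, here is a minimal Python sketch. It is not the procedure from the Storj documentation: the function name `load_dump_tolerantly` is hypothetical, and the sketch assumes a UTF-8 dump from which the transaction statements have already been filtered out. Each statement is executed on its own, so a failing statement (a duplicate row, a missing table) is counted and skipped instead of stopping the load.

```python
import sqlite3


def load_dump_tolerantly(db_path, dump_path):
    """Execute a SQL dump statement by statement; return (applied, skipped)."""
    # isolation_level=None puts the connection in autocommit mode, so every
    # successful statement is stored immediately (simple, but not fast).
    conn = sqlite3.connect(db_path, isolation_level=None)
    applied = skipped = 0
    buffer = ""
    try:
        with open(dump_path, "r", encoding="utf-8", errors="replace") as dump:
            for line in dump:
                buffer += line
                # Keep accumulating lines until the buffer is a full statement.
                if not sqlite3.complete_statement(buffer):
                    continue
                try:
                    conn.execute(buffer)
                    applied += 1
                except (sqlite3.Error, sqlite3.Warning):
                    # E.g. a duplicate key or "no such table": skip and go on.
                    skipped += 1
                buffer = ""
    finally:
        conn.close()
    return applied, skipped
```

The returned counts give a rough idea of how much of the dump actually made it into the database.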

gingerbread233 · July 26, 2024, 2:56pm

I tried it again, and the error with the 3 affected lines always appeared, even when there were no errors about opening the file. Just for fun, I tried loading the “dump_all” SQL file into the DB instead of “dump_all_notrans”, and it worked. It went through, my node is running again, and there is no malformed error message anymore. Now I get lots of “unable to delete file” messages, maybe because the database is missing some entries and the node tries to delete files that have already been deleted.