Bandwidthdb: database disk image is malformed

Hey,

I followed this how-to:

 ERROR   piecestore      failed to add bandwidth usage   {"error": "bandwidthdb: database disk image is malformed", "errorVerbose": "bandwidthdb: database disk image is malformed\n\tstorj.io/storj/storagenode/storagenodedb.(*bandwidthDB).Add:60\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).beginSaveOrder.func1:722\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Upload:434\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func1:220\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:58\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:104\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:60\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:97\n\tstorj.io/drpc/drpcctx.(*Tracker).track:52"}

sqlite3 storage/bandwith.db "PRAGMA integrity_check;"
ok

But the result is ok, so does anyone have an idea where the problem could be?

Have you checked the other db files, too?

There can be interdependencies between the databases that lead to a false-positive alert in the logs pointing at bandwidth.db.
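For example, you could run the integrity check over all of them in one go. A minimal sketch, assuming the databases live under /mnt/storj/storage (adjust the path to your setup):

for db in /mnt/storj/storage/*.db; do
    echo "== $db"
    sqlite3 "$db" "PRAGMA integrity_check;"
done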


@dan I would also check the other db files, like @Bivvo recommended.

Okay, I’ve checked the other dbs, and piece_expiration.db has problems.

I tried to rebuild the db, but it stops at 4 MB without an error.
The original db is 20 MB.

If you just migrated the node, shut it down and run rsync again without the --delete parameter; when rsync is done, start the node up again and the bandwidth db will work just fine…

no idea why…
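For reference, that kind of final sync pass might look something like this (a sketch; both paths are placeholders for your own source and destination):

rsync -aP /mnt/old-disk/storagenode/ /mnt/new-disk/storagenode/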

Nah, I didn’t migrate it with rsync.
I just moved the virtual disk (it’s on a Proxmox server).

Can you share the output of ls -al for the db folder, and the steps/commands you ran to try to fix the issue?

root@storjNode1:/mnt/storj/storage# ls -lah
insgesamt 307M
drwx------ 6 storj storj 4,0K 21. Jan 18:17 .
drwxr-xr-x 5 root  root  4,0K 21. Jan 14:27 ..
-rw-r--r-- 1 storj storj 272M 21. Jan 18:15 bandwidth.db
-rw-r--r-- 1 storj storj  32K 21. Jan 18:18 bandwidth.db-shm
-rw-r--r-- 1 storj storj 4,0M 21. Jan 18:18 bandwidth.db-wal
-rw-r--r-- 1 storj storj    0 18. Jan 22:56 bandwith.db
drwx------ 8 storj storj 4,0K 19. Jan 2021  blobs
-rw-r--r-- 1 root  root  3,4M 19. Jan 20:11 dump_all_notrans.sql
-rw-r--r-- 1 root  root  3,4M 19. Jan 20:11 dump_all.sql
drwx------ 2 storj storj 4,0K 21. Jan 17:51 garbage
-rw-r--r-- 1 storj storj  48K 21. Jan 09:22 heldamount.db
-rw-r--r-- 1 storj storj  16K 19. Jan 21:20 info.db
-rw-r--r-- 1 storj storj  24K 19. Jan 21:20 notifications.db
-rw-r--r-- 1 storj storj  32K 19. Jan 21:20 orders.db
-rw-r--r-- 1 storj storj  32K 21. Jan 17:46 orders.db-shm
-rw-r--r-- 1 storj storj    0 21. Jan 17:46 orders.db-wal
-rw-r--r-- 1 root  root  4,4M 21. Jan 17:54 piece_expiration.db
-rw-r--r-- 1 root  root   20M 19. Jan 20:10 piece_expiration.db.bak
-rw-r--r-- 1 root  root   32K 21. Jan 18:17 piece_expiration.db-shm
-rw-r--r-- 1 root  root  375K 21. Jan 18:17 piece_expiration.db-wal
-rw-r--r-- 1 storj storj  24K 19. Jan 21:20 pieceinfo.db
-rw-r--r-- 1 storj storj  32K 21. Jan 17:51 pieceinfo.db-shm
-rw-r--r-- 1 storj storj    0 21. Jan 17:51 pieceinfo.db-wal
-rw-r--r-- 1 storj storj  24K 21. Jan 17:28 piece_spaced_used.db
-rw-r--r-- 1 storj storj  32K 21. Jan 17:58 piece_spaced_used.db-shm
-rw-r--r-- 1 storj storj  29K 21. Jan 17:58 piece_spaced_used.db-wal
-rw-r--r-- 1 storj storj  24K 19. Jan 21:20 pricing.db
-rw-r--r-- 1 storj storj  40K 21. Jan 16:59 reputation.db
-rw-r--r-- 1 storj storj  32K 21. Jan 18:14 reputation.db-shm
-rw-r--r-- 1 storj storj    0 21. Jan 18:14 reputation.db-wal
-rw-r--r-- 1 storj storj  32K 19. Jan 21:20 satellites.db
-rw-r--r-- 1 storj storj  32K 21. Jan 18:17 satellites.db-shm
-rw-r--r-- 1 storj storj    0 21. Jan 18:17 satellites.db-wal
-rw-r--r-- 1 storj storj  24K 19. Jan 21:20 secret.db
-rw-r--r-- 1 storj storj   32 17. Jan 2021  storage-dir-verification
-rw-r--r-- 1 storj storj 396K 21. Jan 09:22 storage_usage.db
drwx------ 2 storj storj  12K 21. Jan 18:18 temp
drwx------ 8 storj storj 4,0K 28. Feb 2021  trash
-rw-r--r-- 1 storj storj  20K 19. Jan 21:20 used_serial.db

sqlite3 /storage/piece_expiration.db

.mode insert
.output dump_all.sql
.dump
.exit

cat dump_all.sql | grep -v TRANSACTION | grep -v ROLLBACK | grep -v COMMIT >dump_all_notrans.sql

rm piece_expiration.db && sqlite3 piece_expiration.db ".read dump_all_notrans.sql"
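After the rebuild, you can re-run the integrity check on the new file before starting the node again (same working directory assumed):

sqlite3 piece_expiration.db "PRAGMA integrity_check;"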

Apart from the fact that your files are owned by root instead of storj, it seems fine that the new file is smaller:

Because of the grep -v filters, some content is stripped out before being written to dump_all_notrans.sql.

Have you tried sudo chown storj piece_exp* && sudo chgrp storj piece_exp* and then started the node?

I guess you’re running as root (or at least used root to perform the dump), and the storj user is not allowed to access files owned by root.
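As a minimal sketch of that ownership fix, run from the host and assuming the storage path from the listing above, this also catches the -shm/-wal files in one go:

sudo chown storj:storj /mnt/storj/storage/piece_expiration.db*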


This isn’t really the reason the files end up smaller, as all that removes are the lines that wrap the inserts in a transaction. The resulting data is exactly the same without those lines.

It’s smaller because SQLite doesn’t really shrink the database file when data is removed. That only happens if you vacuum the db file. For Storj that really isn’t necessary, as most databases stay small anyway, and the node software doesn’t vacuum the dbs. But since this repair method starts over with a clean db file and then inserts the still-existing data back in, it gets rid of all that no-longer-needed space as well.
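If you ever do want to compact a db manually, a vacuum is a one-liner (a sketch, assuming the path from the listing above; only run it while the node is stopped):

sqlite3 /mnt/storj/storage/bandwidth.db "VACUUM;"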

TL;DR: the smaller file is perfectly fine.

One additional note: this db stores expiration dates for piece data on your node, and the data in it is technically non-essential. If data is missing, your node won’t immediately remove pieces when they expire. But the garbage collection processes that run frequently on your node will still figure out that the files are no longer needed, and the cleanup will still happen relatively soon after expiration anyway. So even if you don’t trust my explanation above, your node is going to be fine even if data is missing from that db.


They are owned by root because I tried the rebuild as the root user.

Okay, thanks for that information.

My storage node is working again without any errors!
First I had to “repair” the piece_expiration.db.
After that I tried to restart the node and got the errors again, but this time the bandwidth.db had problems.

I ran the same workflow for that db file, and now everything is working fine.

Thanks for the help!


It worries me that you’re having problems again this quickly. You had previously checked that file and it was fine.

So a question: are you running your storage location over a network share like SMB or NFS? If so, stop. SQLite doesn’t support this and you will keep getting corrupted files. The only network protocol that doesn’t suffer from this is iSCSI.

If that’s not it, avoid interrupting the node abruptly. When stopping the docker container, always include -t 300 to give it enough time to stop gracefully. Stop the node before restarting the system, and never do a hard reset without shutting the system down cleanly first. If that last part has happened, please also check the integrity of your file system (fsck) to make sure it isn’t the cause of the corruption.
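For example (a sketch; the container name storagenode and the device path are placeholders, substitute your own):

docker stop -t 300 storagenode

# only once the node is stopped and the filesystem is unmounted:
fsck /dev/sdX1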


The only thing I did was move the virtual disk from one zpool to another.
The zpools are local, so no NFS or SMB is involved.

Good point about the -t 300 flag; that could be the problem.

Thanks for the help!
