What happens if I remove the databases?

Skyblockpro1 · September 8, 2022, 8:25am

Hi guys, I just got a lovely error today:
2022-09-08T10:09:35.950+0200 FATAL Unrecoverable error {“error”: “Error starting master database on storagenode: database: piece_expiration opening file "D:\\piece_expiration.db" failed: disk I/O error: The file or directory is corrupted and unreadable.\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).openDatabase:324\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).openExistingDatabase:306\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).openDatabases:281\n\tstorj.io/storj/storagenode/storagenodedb.OpenExisting:248\n\tmain.cmdRun:193\n\tstorj.io/private/process.cleanup.func1.4:378\n\tstorj.io/private/process.cleanup.func1:396\n\tgithub.com/spf13/cobra.(*Command).execute:852\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:960\n\tgithub.com/spf13/cobra.(*Command).Execute:897\n\tstorj.io/private/process.ExecWithCustomConfigAndLogger:93\n\tstorj.io/private/process.ExecWithCustomConfig:75\n\tstorj.io/private/process.Exec:65\n\tmain.(*service).Execute.func1:61\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57”, “errorVerbose”: “Error starting master database on storagenode: database: piece_expiration opening file "D:\\piece_expiration.db" failed: disk I/O error: The file or directory is corrupted and 
unreadable.\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).openDatabase:324\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).openExistingDatabase:306\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).openDatabases:281\n\tstorj.io/storj/storagenode/storagenodedb.OpenExisting:248\n\tmain.cmdRun:193\n\tstorj.io/private/process.cleanup.func1.4:378\n\tstorj.io/private/process.cleanup.func1:396\n\tgithub.com/spf13/cobra.(*Command).execute:852\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:960\n\tgithub.com/spf13/cobra.(*Command).Execute:897\n\tstorj.io/private/process.ExecWithCustomConfigAndLogger:93\n\tstorj.io/private/process.ExecWithCustomConfig:75\n\tstorj.io/private/process.Exec:65\n\tmain.(*service).Execute.func1:61\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57\n\tmain.cmdRun:195\n\tstorj.io/private/process.cleanup.func1.4:378\n\tstorj.io/private/process.cleanup.func1:396\n\tgithub.com/spf13/cobra.(*Command).execute:852\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:960\n\tgithub.com/spf13/cobra.(*Command).Execute:897\n\tstorj.io/private/process.ExecWithCustomConfigAndLogger:93\n\tstorj.io/private/process.ExecWithCustomConfig:75\n\tstorj.io/private/process.Exec:65\n\tmain.(*service).Execute.func1:61\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57”}

The VM crashed today and was unreachable via RDP and via the virtual monitor, so I had to do a hard power-off. After powering it back on, this error came up.

I’m wondering: after deleting the databases, Storj recreates them. If I delete them and they are remade, will the node still function or will it get DQ’d? Do the databases hold something important?

Thanks

BrightSilence · September 8, 2022, 8:47am

You can do that eventually, but try fixing the file system issue first. Stop the node and run chkdsk on the HDD.

Alexey · September 8, 2022, 8:51am

You can also follow this article to recreate only the corrupted database (if that’s the case): https://support.storj.io/hc/en-us/articles/4403032417044-How-to-fix-database-file-is-not-a-database-error , but it’s better to fix the filesystem first.
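To find out which database is actually affected before recreating anything, each file can be checked on its own. This is a minimal sketch, not the procedure from the article: it assumes the node is stopped, that the `.db` files are SQLite databases, and that Python's built-in `sqlite3` module is available. The function names are made up for illustration.

```python
import sqlite3
from pathlib import Path


def check_database(path: Path) -> str:
    """Return 'ok' if SQLite considers the file healthy, else a problem description."""
    try:
        # Open read-only so the check never creates or modifies a file.
        uri = path.resolve().as_uri() + "?mode=ro"
        conn = sqlite3.connect(uri, uri=True)
        try:
            rows = conn.execute("PRAGMA integrity_check").fetchall()
        finally:
            conn.close()
    except sqlite3.Error as exc:
        # Covers errors such as "file is not a database" or "disk image is malformed".
        return f"error: {exc}"
    return "; ".join(str(row[0]) for row in rows)


def find_corrupted(db_dir: Path) -> dict[str, str]:
    """Check every *.db file in db_dir and return only the ones that are not 'ok'."""
    results = {p.name: check_database(p) for p in sorted(db_dir.glob("*.db"))}
    return {name: result for name, result in results.items() if result != "ok"}
```

Calling `find_corrupted()` on the folder that holds the `.db` files returns an empty dict when every database passes, so only the files it lists would need the repair steps from the article.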

andrew2.hart · September 8, 2022, 4:56pm

I have never been DQ’d for deleting the db, wal and shm files, and I have done it several times due to power cuts and failing disks.
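For anyone who does go that route, moving the files aside instead of deleting them keeps a way back. A minimal sketch, assuming the node is stopped and the databases are SQLite files (SQLite names the companion files `<name>.db-wal` and `<name>.db-shm`). The function name and both directory arguments are hypothetical.

```python
import shutil
from pathlib import Path

# SQLite keeps a write-ahead log and a shared-memory file next to each database.
SIDE_SUFFIXES = ("", "-wal", "-shm")


def move_databases_aside(db_dir: Path, backup_dir: Path) -> list[str]:
    """Move every *.db file and its -wal/-shm companions into backup_dir."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for db in sorted(db_dir.glob("*.db")):
        for suffix in SIDE_SUFFIXES:
            src = db.with_name(db.name + suffix)
            if src.exists():
                shutil.move(str(src), str(backup_dir / src.name))
                moved.append(src.name)
    return moved
```

The returned list shows exactly which files were moved, so they can be put back if the node turns out to need them.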

Toyoo · September 8, 2022, 9:10pm

Several times? What kind of setup did that to you? I’ve had several nodes for 2 years and haven’t lost a single database file yet, despite some power cuts in the meantime.

Skyblockpro1 · September 9, 2022, 7:04am

So my question, now that I’m waiting for the disk to get scanned: what do the databases hold? I never investigated that; it’s more of a curiosity.

BrightSilence · September 9, 2022, 7:53am

They mostly hold data for reporting on the dashboard. Things like bandwidth used, storage used and payout info. They used to hold piece metadata, but that has been moved to the pieces themselves. So it’s all non-critical right now. I believe the only remaining db with a function to the operation of the node is the piece expiration db, which holds expiration dates for pieces. Losing that means your node doesn’t immediately remove expired pieces. However, those pieces will eventually be cleaned up by garbage collection as well.

So you’ll lose stats, mostly historic ones. But information like held amount might never be correct again, because the node is missing that information for previous months. It’s not ideal to lose the info, but it won’t break your node if you start over with clean db’s. I recommend following the info on the page @Alexey linked as that instructs you on how to recover only the corrupted db’s, limiting loss of information.

andrew2.hart · September 9, 2022, 4:29pm

Ok.
The first time was moving from PC to pi4 (I think). I forgot the --delete step of the rsync and the WALs corrupted the dbs. Donald helped me out there.

Last time was when I ran out of inodes because I was using a disk from Chia that I had formatted as largefile4. It complained about the orders db, so I deleted that, then all the dbs, then all the files in orders. Then it worked again for a while; rinse and repeat until I managed to move to a brand new 18 TB disk. B.S. helped out there.

In between those I can’t remember specifics, but I had an NVMe SSD that did weird short writes, where you could read the whole file but not each block individually. That was weird. That node got moved to a WD Elements and spent a few months almost DQ’d but survived.

Deleting the dbs does not cause DQ.

Skyblockpro1 · September 14, 2022, 3:48pm

Hey guys, I need a bit more help.

I’m getting this fatal error:

2022-09-14T17:46:37.636+0200 FATAL Unrecoverable error {“error”: “piecestore monitor: disk space requirement not met”, “errorVerbose”: “piecestore monitor: disk space requirement not met\n\tstorj.io/storj/storagenode/monitor.(*Service).Run:125\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:40\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57”}

I don’t get it; the drive is 5 TB in size.

All help is warmly welcomed.

Thanks

Skyblockpro1 · September 14, 2022, 3:51pm

Small update: the node started after a second start. I am getting these errors now; should I be worried?

2022-09-14T17:50:25.634+0200 ERROR collector unable to delete piece {“Satellite ID”: “12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs”, “Piece ID”: “CUKWIXQUNUEQYQSF7AJJ6GDML5FEN4X5CRDXJKLR5DQVUCLNX2SQ”, “error”: “pieces error: filestore error: file does not exist”, “errorVerbose”: “pieces error: filestore error: file does not exist\n\tstorj.io/storj/storage/filestore.(*blobStore).Stat:103\n\tstorj.io/storj/storagenode/pieces.(*BlobsUsageCache).pieceSizes:245\n\tstorj.io/storj/storagenode/pieces.(*BlobsUsageCache).Delete:226\n\tstorj.io/storj/storagenode/pieces.(*Store).Delete:299\n\tstorj.io/storj/storagenode/collector.(*Service).Collect:97\n\tstorj.io/storj/storagenode/collector.(*Service).Run.func1:57\n\tstorj.io/common/sync2.(*Cycle).Run:99\n\tstorj.io/storj/storagenode/collector.(*Service).Run:53\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:40\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57”}

Thanks

BrightSilence · September 14, 2022, 3:58pm

If the piece_spaced_used.db is removed or recreated, the node is no longer aware of the space already used by its own data. If there is less than 500 GB of free space on the disk, you see this error as a result.

This should resolve itself after the filewalker runs once. But if not you can change the minimum requirement in the config.yaml. The setting is storage2.monitor.minimum-disk-space
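To make the arithmetic concrete, here is a simplified model of that check. It is only a sketch of the reasoning above, not code taken from the storagenode: the function names, the capping rule and the example figures are assumptions, and the 500 GB default is the value mentioned above.

```python
GB = 10**9
TB = 10**12


def effective_allocation(allocated: int, known_used: int, disk_free: int) -> int:
    """Cap the configured allocation at what the node believes the disk can hold.

    known_used is the usage recorded in the database; it reads as 0 right
    after the database has been recreated.
    """
    return min(allocated, known_used + disk_free)


def meets_minimum(allocated: int, known_used: int, disk_free: int,
                  minimum: int = 500 * GB) -> bool:
    """True if the effective allocation satisfies the minimum disk space requirement."""
    return effective_allocation(allocated, known_used, disk_free) >= minimum


# Illustrative 5 TB node holding 4.7 TB of pieces, with 300 GB still free.
print(meets_minimum(5 * TB, 4700 * GB, 300 * GB))  # usage known to the node
print(meets_minimum(5 * TB, 0, 300 * GB))          # usage forgotten after a db reset
```

In this model the first call passes and the second fails, which matches the behaviour described: the disk did not shrink, the node just lost track of how much of it is its own data.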

As for the “unable to delete piece” errors: I’m seeing these too lately. I’m not sure why, but they appear on all my nodes now. As long as it’s the collector, it’s not a problem. The piece probably got deleted the normal way before the collector got to it. Still annoying though.
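To see whether the collector keeps retrying the same pieces, the log lines can be grouped by piece ID. A minimal sketch, assuming the log file uses tab-separated fields in the order timestamp, level, component, message, JSON details, and that the JSON carries a "Piece ID" key as in the error above. The function name is made up for illustration.

```python
import json
from collections import defaultdict


def piece_history(lines, needle="unable to delete piece"):
    """Group matching log lines by piece ID: {piece_id: {timestamp: message}}."""
    history = defaultdict(dict)
    for line in lines:
        if needle not in line:
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # not in the assumed five-field layout
        try:
            body = json.loads(fields[4])
        except json.JSONDecodeError:
            continue  # truncated or non-JSON details
        if not isinstance(body, dict):
            continue
        piece_id = body.get("Piece ID")
        if piece_id:
            history[piece_id][fields[0]] = fields[3]
    return dict(history)
```

Passing an open log file (for example `piece_history(open(path, encoding="utf-8"))`, with `path` pointing at your own log) gives one entry per piece; a piece with many timestamps is one the collector keeps coming back to.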

Skyblockpro1 · September 14, 2022, 4:04pm

Thanks for the quick response, much appreciated.

I will keep monitoring this node for a few hours to see if anything changes.

hatred · September 15, 2022, 10:30am

I have the same question. My node is almost 1 year old, and I found this error appearing in the logs:

ERROR	bandwidth	Could not rollup bandwidth usage	{"Process": "storagenode", "error": "bandwidthdb: database disk image is malformed"

I don’t know how long this has been going on. I tried to recover it with the manual linked above, but with no success. Can I just delete the faulty file? What will happen after that? Will my node be suspended or disqualified?

Skyblockpro1 · September 15, 2022, 10:55am

OK, so just to summarize, the good news first I suppose: the DBs weren’t actually corrupted. Yay!

So what caused it?

Well, it seems the virtual drive crashed because the VM crashed. I tried to remove all the DBs, but one temp file was stuck open. Note that it was about 500 MB, so not the smallest one. Running the disk scan utility in Windows, as recommended by @BrightSilence, found and corrected a few errors. This also got rid of the stuck temp file. After the temp file was removed, the node would respond to the start signal but would still fail.

At this point I was disappointed and frustrated, so I don’t fully remember what I did, but I more or less put all the DBs back, and after a few back-and-forths the node started with all data working, and even the web GUI loaded with all valid data. Thanks again to all who helped me out. This was one of my oldest nodes, my 3rd one; I wouldn’t want to lose it.

Thanks

BrightSilence · September 15, 2022, 10:57am

That’s good news, glad you got it working again!

Erikvv · September 15, 2022, 11:05am

We have a fix for this in the works.

https://review.dev.storj.io/c/storj/storj/+/8394

github.com/storj/storj

[storagenode] The garbage collector is trying to delete a piece over and over again

Opened 08:24AM - 16 Sep 21 UTC · Closed 12:59PM - 15 Sep 22 UTC · AlexeyALeonov · Labels: Bug, SNO

```
$ grep "unable to delete piece" /mnt/y/storagenode2/storagenode.log | jq -R… '. | split("\t") | (.[4] | fromjson) as $body | {SatelliteID: $body."Satellite ID", ($body."Piece ID"): {(.[0]): .[3]}}' | jq -s 'reduce .[] as $item ({}; . * $item)'
{
  "SatelliteID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs",
  "PROQQDUMVAOIRVJVSVBRFQZOYTERNDAYXIUW55JAOKBXYFUYW5LQ": {
    "2021-08-31T21:30:07.645Z": "unable to delete piece",
    "2021-08-31T22:30:07.886Z": "unable to delete piece",
    "2021-08-31T23:30:08.010Z": "unable to delete piece",
    "2021-09-01T00:30:19.988Z": "unable to delete piece",
    "2021-09-01T01:30:07.814Z": "unable to delete piece",
    "2021-09-01T02:30:08.050Z": "unable to delete piece",
    "2021-09-01T03:30:08.362Z": "unable to delete piece",
    "2021-09-01T04:30:07.837Z": "unable to delete piece",
    "2021-09-01T05:30:08.137Z": "unable to delete piece",
    "2021-09-01T05:44:59.782Z": "unable to delete piece",
    "2021-09-01T06:45:00.599Z": "unable to delete piece",
    "2021-09-01T07:44:59.638Z": "unable to delete piece"
  }
}
```

If we check the history:

```
$ cat /mnt/y/storagenode2/storagenode.log | grep "PROQQDUMVAOIRVJVSVBRFQZOYTERNDAYXIUW55JAOKBXYFUYW5LQ"| jq -R '. | split("\t") | (.[4] | fromjson) as $body | {SatelliteID: $body."Satellite ID", ($body."Piece ID"): {(.[0]): .[3]}}' | jq -s 'reduce .[] as $item ({}; . * $item)'
{
  "SatelliteID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs",
  "PROQQDUMVAOIRVJVSVBRFQZOYTERNDAYXIUW55JAOKBXYFUYW5LQ": {
    "2021-08-31T21:30:07.645Z": "unable to delete piece",
    "2021-08-31T22:30:07.886Z": "unable to delete piece",
    "2021-08-31T23:30:08.010Z": "unable to delete piece",
    "2021-09-01T00:30:19.988Z": "unable to delete piece",
    "2021-09-01T01:30:07.814Z": "unable to delete piece",
    "2021-09-01T02:30:08.050Z": "unable to delete piece",
    "2021-09-01T03:30:08.362Z": "unable to delete piece",
    "2021-09-01T04:30:07.837Z": "unable to delete piece",
    "2021-09-01T05:30:08.137Z": "unable to delete piece",
    "2021-09-01T05:44:59.782Z": "unable to delete piece",
    "2021-09-01T06:45:00.599Z": "unable to delete piece",
    "2021-09-01T07:44:59.638Z": "unable to delete piece",
    "2021-09-01T08:44:59.763Z": "delete expired"
  }
}
```

and origin:

```
$ zcat /mnt/y/storagenode2/storagenode.log.1.gz | grep "PROQQDUMVAOIRVJVSVBRFQZOYTERNDAYXIUW55JAOKBXYFUYW5LQ"| jq -R '. | split("\t") | (.[4] | fromjson) as $body | {SatelliteID: $body."Satellite ID", ($body."Piece ID"): {(.[0]): .[3]}}' | jq -s 'reduce .[] as $item ({}; . * $item)'
{
  "SatelliteID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs",
  "PROQQDUMVAOIRVJVSVBRFQZOYTERNDAYXIUW55JAOKBXYFUYW5LQ": {
    "2021-08-12T20:28:23.817Z": "upload started",
    "2021-08-12T20:28:40.583Z": "uploaded",
    "2021-08-27T21:28:50.118Z": "unable to delete piece",
    "2021-08-27T22:28:49.868Z": "unable to delete piece",
    "2021-08-27T23:28:49.768Z": "unable to delete piece",
    "2021-08-28T00:28:49.555Z": "unable to delete piece",
    "2021-08-28T01:28:49.512Z": "unable to delete piece",
    "2021-08-28T02:28:49.265Z": "unable to delete piece",
    "2021-08-28T03:28:49.156Z": "unable to delete piece",
    ...
    "2021-08-31T18:30:08.254Z": "unable to delete piece",
    "2021-08-31T19:30:08.105Z": "unable to delete piece",
    "2021-08-31T20:30:07.630Z": "unable to delete piece"
  }
}
```

As we can see, the piece has been uploaded, but seems canceled in between and then the customer has deleted it (or the piece has been placed to other node) and the garbage collector started to try to delete this piece.

The garbage collector should report that a piece has been removed if it doesn't exist, instead of endlessly trying to remove what has been removed / canceled earlier. Or the uplink/node should report that the piece is actually not stored on this node.

Community forum discussion about it: https://forum.storj.io/t/error-collector-unable-to-delete-piece/12916/17