My node was recently suspended from one satellite, and I am now investigating my logs to try and find out what the root cause was, in hopes of fixing it before getting disqualified (my node is back online and operational on the other satellites though).
Backstory:
System: mid-grade NUC with 2x 4TB Seagate SMR connected via USB 3, win 10, set up in simple RAID.
Around three days ago I was performing a lot of extensive tasks on my computer, more or less maxing out various system resources, including CPU and RAM. Today, I was doing the same when suddenly the machine crashed. When restarted, one of my two HDDs did not register in the array, and after mucking about I was able to get them working again. But, I couldn’t get my node back online. There was an error message in the logs for Storj stating that " database disk image is malformed", so I followed the steps to repair this, and node went online after that.
However, the dashboard now stated that I was suspended on one satellite. Scrolling down to the satellite I could see that the error was related to the category “suspension”.
I tried searching in the logs using this command in admin Powershell
sls GET_AUDIT "D:\Log files/storagenode.log" | sls failed
but, it returned nothing. However, when using it like this
sls failed "D:\Log files/storagenode.log"
it produces multiple entries. Does this mean that | acts as “and” operator instead of “or” in this context? Regardless, the recurring error since around three days ago was:
D:\Log files\storagenode.log:432:2020-12-09T13:35:12.170+0100 FATAL Unrecoverable error {"error": "Error during preflight check for storagenode databases: preflight: database \"pieceinfo\": failed create test_table: disk I/O error\n\tstorj.io/storj/storagenode/storageno dedb.(*DB).preflight:418\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).Preflight:352\n\tmain.cmdRun:208\n\tstorj.io/private/process.cleanup.func1.4:362\n\tstorj.io/private/process.cleanup.func1:380\n\tgithub.com/spf13/cobra.(*Command).execute:842\n\tgithub.com/spf13 /cobra.(*Command).ExecuteC:950\n\tgithub.com/spf13/cobra.(*Command).Execute:887\n\tstorj.io/private/process.ExecWithCustomConfig:88\n\tstorj.io/private/process.Exec:65\n\tmain.(*service).Execute.func1:66\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57", "errorVerbose": "Error during preflight check for storagenode databases: preflight: database \"pieceinfo\": failed create test_table: disk I/O error\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).preflight:418\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).Preflig
ht:352\n\tmain.cmdRun:208\n\tstorj.io/private/process.cleanup.func1.4:362\n\tstorj.io/private/process.cleanup.func1:380\n\tgithub.com/spf13/cobra.(*Command).execute:842\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:950\n\tgithub.com/spf13/cobra.(*Command).Execute:887\n\ tstorj.io/private/process.ExecWithCustomConfig:88\n\tstorj.io/private/process.Exec:65\n\tmain.(*service).Execute.func1:66\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57\n\tmain.cmdRun:210\n\tstorj.io/private/process.cleanup.func1.4:362\n\tstorj.io/private/process.cl eanup.func1:380\n\tgithub.com/spf13/cobra.(*Command).execute:842\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:950\n\tgithub.com/spf13/cobra.(*Command).Execute:887\n\tstorj.io/private/process.ExecWithCustomConfig:88\n\tstorj.io/private/process.Exec:65\n\tmain.(*service) .Execute.func1:66\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}
I wonder if this was some kind of corruption caused by all CPU threads being exhausted, and thus negatively impacting the write cache buffer or something like that.
Is there any other way I can keep searching for clues? Because that error is not reoccuring, but I am still suspended, and the percentage suspension shown on dashboard for that satellite is not moving at all, so I don’t know if things are working as they should.
When running this command
Get-ChildItem F:\STORJ\*.db -File | %{$_.Name + " " + $(sqlite3.exe $_.FullName "PRAGMA integrity_check;")}
it seems all dbs are fine.