Identifying suspension cause

Propagandalf · December 12, 2020, 12:42am

My node was recently suspended from one satellite, and I am now investigating my logs to try and find out what the root cause was, in hopes of fixing it before getting disqualified (my node is back online and operational on the other satellites though).

Backstory:
System: mid-grade NUC with 2x 4TB Seagate SMR connected via USB 3, win 10, set up in simple RAID.

Around three days ago I was performing a lot of resource-intensive tasks on my computer, more or less maxing out various system resources, including CPU and RAM. Today, I was doing the same when suddenly the machine crashed. When restarted, one of my two HDDs did not register in the array, and after mucking about I was able to get them working again. But, I couldn’t get my node back online. There was an error message in the logs for Storj stating “database disk image is malformed”, so I followed the steps to repair this, and the node went online after that.

However, the dashboard now stated that I was suspended on one satellite. Scrolling down to the satellite I could see that the error was related to the category “suspension”.

I tried searching in the logs using this command in admin Powershell

sls GET_AUDIT "D:\Log files/storagenode.log" | sls failed

but, it returned nothing. However, when using it like this

sls failed "D:\Log files/storagenode.log"

it produces multiple entries. Does this mean that | acts as “and” operator instead of “or” in this context? Regardless, the recurring error since around three days ago was:

D:\Log files\storagenode.log:432:2020-12-09T13:35:12.170+0100 FATAL Unrecoverable error {"error": "Error during preflight check for storagenode databases: preflight: database \"pieceinfo\": failed create test_table: disk I/O error\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).preflight:418\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).Preflight:352\n\tmain.cmdRun:208\n\tstorj.io/private/process.cleanup.func1.4:362\n\tstorj.io/private/process.cleanup.func1:380\n\tgithub.com/spf13/cobra.(*Command).execute:842\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:950\n\tgithub.com/spf13/cobra.(*Command).Execute:887\n\tstorj.io/private/process.ExecWithCustomConfig:88\n\tstorj.io/private/process.Exec:65\n\tmain.(*service).Execute.func1:66\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57", "errorVerbose": "Error during preflight check for storagenode databases: preflight: database \"pieceinfo\": failed create test_table: disk I/O error\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).preflight:418\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).Preflight:352\n\tmain.cmdRun:208\n\tstorj.io/private/process.cleanup.func1.4:362\n\tstorj.io/private/process.cleanup.func1:380\n\tgithub.com/spf13/cobra.(*Command).execute:842\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:950\n\tgithub.com/spf13/cobra.(*Command).Execute:887\n\tstorj.io/private/process.ExecWithCustomConfig:88\n\tstorj.io/private/process.Exec:65\n\tmain.(*service).Execute.func1:66\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57\n\tmain.cmdRun:210\n\tstorj.io/private/process.cleanup.func1.4:362\n\tstorj.io/private/process.cleanup.func1:380\n\tgithub.com/spf13/cobra.(*Command).execute:842\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:950\n\tgithub.com/spf13/cobra.(*Command).Execute:887\n\tstorj.io/private/process.ExecWithCustomConfig:88\n\tstorj.io/private/process.Exec:65\n\tmain.(*service).Execute.func1:66\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}
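
On the question of whether | acts as “and”: piping one sls into another passes along only the lines that already matched, so the second filter narrows the first and the chain effectively behaves like “and”. That would explain the empty result if no GET_AUDIT line in the log also contains “failed”. A minimal Python sketch of the two behaviours is below. The sample log lines are invented for illustration and are not taken from the log above, and the function names are hypothetical.

```python
import re

def filter_lines(lines, *patterns):
    """Keep lines matching ALL patterns, like chaining `sls A | sls B`."""
    for pattern in patterns:
        regex = re.compile(pattern, re.IGNORECASE)  # sls ignores case by default
        lines = [line for line in lines if regex.search(line)]
    return lines

def any_match(lines, *patterns):
    """Keep lines matching ANY pattern (an "or" search)."""
    combined = re.compile("|".join(patterns), re.IGNORECASE)
    return [line for line in lines if combined.search(line)]

# Hypothetical log lines, made up for this example only.
sample = [
    "2020-12-09T13:35:12 FATAL Unrecoverable error: disk I/O error",
    "2020-12-09T13:40:00 INFO piecestore download started GET_AUDIT",
    "2020-12-09T13:41:00 ERROR piecestore download failed GET_AUDIT",
    "2020-12-09T13:42:00 ERROR piecestore upload failed PUT",
]

both = filter_lines(sample, "GET_AUDIT", "failed")   # "and": one line
either = any_match(sample, "GET_AUDIT", "failed")    # "or": three lines
```

Here the FATAL line appears in neither result, because it contains neither search term. That matches what happened above: the preflight error only showed up when searching for “failed” alone.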

I wonder if this was some kind of corruption caused by all CPU threads being exhausted, and thus negatively impacting the write cache buffer or something like that.

Is there any other way I can keep searching for clues? That error is not recurring, but I am still suspended, and the suspension percentage shown on the dashboard for that satellite is not moving at all, so I don’t know if things are working as they should.

When running this command

Get-ChildItem F:\STORJ\*.db -File | %{$_.Name + " " + $(sqlite3.exe $_.FullName "PRAGMA integrity_check;")}

it seems all dbs are fine.
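
For anyone without sqlite3.exe at hand, the same check can be sketched with Python’s built-in sqlite3 module. This is only a rough equivalent of the one-liner above, and the function name is hypothetical. It is probably safest to run it while the node is stopped. A passing result only says the database files are internally consistent, not that the underlying file system is healthy.

```python
import sqlite3
from pathlib import Path

def check_databases(folder):
    """Run PRAGMA integrity_check on every *.db file in `folder`.

    Returns a dict mapping file name to the list of messages reported.
    A healthy database reports the single message "ok".
    """
    results = {}
    for db_path in sorted(Path(folder).glob("*.db")):
        conn = sqlite3.connect(str(db_path))
        try:
            rows = conn.execute("PRAGMA integrity_check;").fetchall()
            results[db_path.name] = [row[0] for row in rows]
        except sqlite3.DatabaseError as exc:
            # Raised when the file is too damaged to be read as a database.
            results[db_path.name] = [f"error: {exc}"]
        finally:
            conn.close()
    return results
```

Calling `check_databases(r"F:\STORJ")` and printing the result would list each database with either “ok” or the problems found.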

BrightSilence · December 12, 2020, 12:51am

Make sure the data location is readable and writeable and please check your file system for errors.

Propagandalf · December 12, 2020, 12:59am

I added this to the initial paragraph for clarity: “(my node is back online and operational on the other satellites though)”.

I wrote in the post already that I repaired the file system errors, and as far as I know that was what brought the node back online, but I still wonder what caused it, and how I can tell if I am “clean” and recovering from suspension or not.

BrightSilence · December 12, 2020, 1:06am

Probably that. It likely didn’t help that an HDD was kicked from the array either. It’s not uncommon for that to result in file system issues. However, while you mentioned fixing the db file, you didn’t mention running chkdsk on it to actually fix the underlying file system issues. I recommend you still do that as more issues may exist. Keep a close eye on it in the meantime.

Alexey · December 12, 2020, 2:23pm

The “disk I/O error” in that log entry suggests disk corruption, not CPU usage, and this is the main reason your node could be suspended.
I would suggest stopping your storagenode and checking the disk for errors, including a full scan. This is especially important for the “Simple” type (i.e. RAID0, the most unreliable solution: with one disk corrupted, the whole volume is lost).
In the best case, move part of the space out of the Simple volume, create a separate volume on a separate disk (not another Simple volume) and migrate your node there. Then disassemble the Simple volume into two separate disks and use them independently for two nodes. It will be much more robust than your current highly unreliable solution.

Propagandalf · December 12, 2020, 10:28pm

When using chkdsk, is it necessary to run

chkdsk F: /f /r /x

if you have first run chkdsk F: (without parameters) and been told there were no errors and no further action was required?

F: is my simple RAID volume consisting of two drives. Will running chkdsk on F: check both drives, or should I first disassemble the array, and run chkdsk on each individual drive?
For migrating node to another disk on same computer, then disassembling RAID, and migrating node “back” to one of the now two individual drives: Is it enough to move all data over to the temporary drive on the same computer, and then move it back once I am done with disassembly?

Alexey · December 12, 2020, 11:02pm

Perhaps not, if you no longer see such errors in the storagenode’s log.

You cannot disassemble them without data loss, so no, do not disassemble them until your data is backed up somewhere. The check should be done on F:

yes

Propagandalf · December 13, 2020, 11:18pm

I have now disassembled the array, and moved my backed up storage node data back onto the single 4TB drive, and started the node successfully. I made sure to change path and capacity allocation in config. Chkdsk shows no errors, and I can no longer see any errors appear in the logs.

Saltlake satellite shows 31.64 % suspension score. Is this now something I can expect will increase until I am out of the “danger zone” of becoming disqualified? I read somewhere that you can be suspended for 7 days before getting disqualified.

Propagandalf · December 14, 2020, 1:13pm

Just a little while ago I received an e-mail stating that my node is no longer suspended, and that I will begin receiving new data again. However, when I check the dashboard for the node, it states that it is still suspended, and the suspension percentage of 31.64 % is unchanged. Shall I simply wait for a bit until it clears, as maybe there’s some lag between realtime status and what the dashboard shows?

baker · December 14, 2020, 3:31pm

I believe there is a delay between the satellite and the storagenode when it comes to these scores. I expect you will see the score recover on the dashboard soon.

Propagandalf · December 19, 2020, 7:27pm

Some days have passed now, and I still have the same percentage shown for suspension (31.64 %), and there is still a notice on the dashboard that the node is suspended. What is interesting, though, is that I got an e-mail a few days ago congratulating me that my node was no longer suspended, and then another e-mail after that saying it was suspended again.

Additionally, it suddenly says “0 % online” on the Europe North satellite. It was not gradual, it just showed 0 one day.

As far as I can tell, my logs look normal. The actions I am getting are “get, put and put_repair”. Sometimes there’s a “piecedeleter” deleting something, but I think all of this is normal?

Unless I figure out what is going on here, I think the best solution might be to start a new node. I am fortunate that there is not much data here, so it’s not a big loss. What would be the best way to kill the node and start a new one? It has 24.5 GB.

Alexey · December 19, 2020, 8:03pm

The best way would be to figure out why it’s falling offline from time to time, because with a new node you will likely have the same problem.

If you are going to create a new node, please split your simple array into two separate drives and use them independently.
Caution - it will destroy all data!

Propagandalf · December 19, 2020, 9:28pm

I did already disassemble the array like you advised me to, so the original node is now running on a single disk independently. It also doesn’t seem to be intermittently going offline, I just instantly got 0 % online status on one satellite, which was 100 % right before that.

Is there anything in particular I can look for in the logs now that might still be causing issues? Otherwise, I am inclined to believe that whatever happened originally, with the corruption of the DBs and the failing array setup, may have permanently messed up the integrity of my node.

If creating new node
Should I be doing something like graceful exit, with so little data in the node? Or just format and create new node?

Alexey · December 19, 2020, 10:23pm

The best thing is to monitor your node externally. uptimerobot.com is a very popular choice.
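
As a rough idea of what such an external check does, here is a minimal Python sketch that tests whether a TCP port accepts connections. The function name is hypothetical, and the host name and port in the usage note are placeholders for your own node’s public address and port. To be meaningful it has to run from a machine outside your own network. It only shows that the port is reachable, not that the node itself is healthy.

```python
import socket

def is_port_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return False
```

For example, `is_port_open("mynode.example.com", 28967)` could be run every few minutes from a scheduler, logging the timestamp whenever it returns False. Both the host name and the port number are assumptions here, so substitute the values from your own setup.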

You cannot invoke a Graceful Exit if the node is not older than 15 months (temporarily reduced to 6 months at the moment).
So, you can just start a second node on your second drive. Since you use Windows, you can use a Docker version (the supported way) or use the Windows Toolbox (you can search for it here on the forum).

Propagandalf · December 21, 2020, 8:01pm

I just wanted to follow up and say that all my negative stats are all of a sudden improving day by day. If this keeps up, it looks like the suspension will be gone for good, so I am leaving the node alone until further notice. But, I’m still setting up my second node just to learn how to run multiple nodes.