Received Node Disqualified Email But Node Still 100% Audit 100% Suspension

I checked the docker logs and found many errors like this:

I wonder why a failed GET_AUDIT is just a normal ERROR and not FATAL. Do you know where I can find a list of the error types that lead to audit failures and node DQ? Thank you!

FATAL is for errors that don’t allow the node process to keep running. A single audit failure isn’t a reason to kill the node, hence ERROR.

I don’t think there is a list of specific errors that can lead to DQ, and you may not even see an error in the log if your node responds to an audit with corrupt data without knowing it. But any issue where your node is unable to reply to an audit request with the correct data will eventually lead to DQ.
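
If you want to check your own log for failed audits, a quick sketch (assuming a docker node with the default container name storagenode; adjust the name and the filter to your setup):

docker logs storagenode 2>&1 | grep GET_AUDIT | grep -iE "failed|error"

Keep in mind that a clean result here doesn’t prove the node is healthy, for the reason above: the node can return corrupt data without logging an error.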

My approach to migrating nodes (or anything else that is fragile, complex, or labor intensive) is to never do anything manually; instead, I write a script that does it for me and test it on something inconsequential, in this case a small temporary node.

For example, this is how I migrated my node to a new dataset.

I had to rebalance data on the array when adding another vdev; the process is equivalent to migrating a node to different storage. This is how my script does it. I have already used it 4 times on different machines, and once across 2000 miles, sending a small node to an offsite location with a small modification: piping the stream through an ssh link (see the sketch after the script).

#!/usr/local/bin/zsh

# Abort on error
set -e 

# Source dataset and iocage jail of the node being migrated
dataset=pool1/storagenode-one
jail=storagenode-one

# Temporary destination dataset; renamed into the source's place at the end
tmp_target=pool1/target

echo "Copying first snapshot"
zfs snapshot -r ${dataset}@cloning1
zfs send -Rv ${dataset}@cloning1 | zfs receive ${tmp_target}

echo "Copying second snapshot"
zfs snapshot -r ${dataset}@cloning2
zfs send -Rvi ${dataset}@cloning1 ${dataset}@cloning2 | zfs receive ${tmp_target}

echo "Copying third snapshot"
zfs snapshot -r ${dataset}@cloning3
zfs send -Rvi ${dataset}@cloning2 ${dataset}@cloning3 | zfs receive ${tmp_target}

echo "Stopping node"
iocage stop $jail

echo "Copying fourth snapshot"
zfs snapshot -r ${dataset}@cloning4
zfs send -Rvi ${dataset}@cloning3 ${dataset}@cloning4 | zfs receive ${tmp_target}

echo "Renaming datasets"
zfs rename ${dataset} ${dataset}-old
zfs rename ${tmp_target} ${dataset}

echo "Starting node"
iocage start $jail

echo "Press enter to destroy old dataset"
read
zfs destroy -r "${dataset}-old"
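
For the offsite case I mentioned, the only real change is piping each send through ssh. A minimal sketch, assuming the receiving machine is reachable as user@remote-host and has a pool named pool1 (both are placeholders), with the zfs receive running on the remote side:

zfs send -Rv ${dataset}@cloning1 | ssh user@remote-host zfs receive pool1/storagenode-one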

I tried this, but 24 hours have passed and the data from the untrusted sats is still on the disk. Can I somehow delete it manually?

Did you try the --force flag and list the specific satellites you want to clean up?

I specified the sats in config.yaml, no --force yet. I’ll try again now. Thank you!

I stopped the node, manually deleted all folders related to the exited sats, started the node, and ran the --force command specifying the sats in the run command.
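
For reference, on a docker node the command from the how-to linked below looks roughly like this (the satellite IDs are placeholders; double-check the exact flags in that topic):

docker exec -it storagenode ./storagenode forget-satellite <untrusted-satellite-id> <another-untrusted-satellite-id> --force --config-dir config --identity-dir identity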

https://forum.storj.io/t/how-to-forget-untrusted-satellites/23821/107?u=snorkel

The untrusted sats' data is gone now. Thank you!
