Storagenode had corrupted data in over 400 blocks on the hard-drive storage

I’ve spent over a week battling with this.
To top the previous issue, somehow, the storagenode had corrupted data in over 400 blocks on the hard drive the storage resides on. I’ve gone over them manually, trying to recover data, but I am at the point of giving up. I have the node up and running, but there are so many

2019-12-31T07:04:38.713Z        INFO    piecestore      download started        {"Piece ID": "TDTVFDGCRPISDAYVQACCNIDATFMXWHRD4HZPHL5OMCMBJRM23UWA", "Satellite ID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "Action": "GET"}
2019-12-31T07:04:38.715Z        INFO    piecestore      download failed {"Piece ID": "TDTVFDGCRPISDAYVQACCNIDATFMXWHRD4HZPHL5OMCMBJRM23UWA", "Satellite ID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "Action": "GET", "error": "file does not exist"}

errors that I’m not sure about the data anymore.
Also, this issue has resulted in approximately 25% uptime on this node and I’m not sure what the consequences would be.

Anyhow, what course of action would you suggest with regards to the node and its data?

This should probably be a different topic. (Edit: Thanks @Alexey!)
I don’t think uptime is what you should be worried about. First would be to find out why the data got corrupted in the first place. It may just be a dying HDD. Even if you’ve managed to rescue the remaining data, your node will likely keep failing audits until it’s disqualified.
What to do next really depends on the damage. You can try to keep it alive long enough to eventually attempt a graceful exit, or start over now. Keep a close eye on your audit scores (not the percentages on the dashboard). You can use the Storage node dashboard API or the Earnings calculator (Update 2019-12-20: v8.1.0 - Now with Uptime and Audit scores, Vetting progress and DQ indication!). The latter was updated to display this info as well.

I’ve tested the hard drive extensively before attempting to rectify the information (which took a couple of days), so it was not only by reading the SMART data. In conclusion, all seems well. As for the reason for the failure - I am at an absolute loss. There have been no interruptions except for one graceful shutdown for an update to a new version without a node reboot.

As for the node itself - I’ve automated a collection of audit reports for the past few hours and I’ll keep collecting data for at least one more day before I check how the node pro-/regresses.

Thanks for your assistance, @BrightSilence and @Alexey!

I don’t think that would even work. Graceful exit requires transferring all pieces to another node. If you’re missing pieces, you can’t gracefully exit.

I guess it’ll depend on how big the impact is. I’m sure one missing piece won’t make graceful exit impossible. But a significant amount… yeah that probably can’t be fixed with a graceful exit.

From what I recall reading, this is a satellite-managed process and it will have your node send all pieces to other nodes. If any piece is missing, the exit is not graceful. (One of the engineers can correct me on this point, but I’m fairly certain this is what I’ve read on this forum before.)

Graceful exit is not on my agenda anyway. Right now I am still running the node and as long as it serves pieces, I will keep it. I hope that eventually any blobs that have been “abandoned” by the journal, or however they are kept track of, will get cleaned out and the node will gradually rebuild reputation and fill up with new data… as long as it doesn’t get disqualified.

From what I see, though, more than 50% of the download audits are failing. Repair downloads are at a 0% success rate, while uploads are normal (99.4% success with 100% acceptance).

A failed download is usually just a sign that the node was too slow against its competitors. It’s not related to missing data unless the reason given for the failure is “file not found”.
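
To see which of the two cases applies, the failed downloads in the log can be grouped by their error text. Below is a minimal sketch in Python, assuming the log format shown earlier in this thread (a JSON object at the end of each line). The function name `count_download_failures` is made up for illustration and is not part of any Storj tool.

```python
import json
from collections import Counter


def count_download_failures(lines):
    """Group 'download failed' log lines by the text of their "error" field."""
    reasons = Counter()
    for line in lines:
        # Only look at failed downloads; started/finished lines are skipped.
        if "download failed" not in line:
            continue
        # Assumption: the structured fields are a JSON object at the end of the line.
        start = line.find("{")
        if start == -1:
            continue
        try:
            fields = json.loads(line[start:])
        except json.JSONDecodeError:
            continue  # malformed or truncated line
        reasons[fields.get("error", "unknown")] += 1
    return reasons


# The two log lines quoted in the first post of this thread.
sample = [
    '2019-12-31T07:04:38.713Z        INFO    piecestore      download started        {"Piece ID": "TDTVFDGCRPISDAYVQACCNIDATFMXWHRD4HZPHL5OMCMBJRM23UWA", "Satellite ID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "Action": "GET"}',
    '2019-12-31T07:04:38.715Z        INFO    piecestore      download failed {"Piece ID": "TDTVFDGCRPISDAYVQACCNIDATFMXWHRD4HZPHL5OMCMBJRM23UWA", "Satellite ID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "Action": "GET", "error": "file does not exist"}',
]

for reason, count in count_download_failures(sample).items():
    print(f"{count:6d}  {reason}")
```

If most of the failures come back as “file does not exist”, the problem is missing pieces rather than a slow node.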

That is exactly the problem.
Further, normal uploads and upload audits indicate a good node.

The download failed line had this

"Action": "GET", "error": "file does not exist"

Oops. Then it doesn’t look good. How are your scores, @fragamemnon?

I’m kind of surprised that’s an INFO line and not an ERROR line though.

There are errors too.

2020-01-01T14:53:22.990Z        INFO    piecestore      download failed {"Piece ID": "HVREJGXXJ2KOM2HWVIUAGBQH57TS4PDTYCPR6DLZ2KZDVGCVETNQ", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET", "error": "file does not exist"}
2020-01-01T14:53:22.990Z        ERROR   server  gRPC stream error response      {"error": "file does not exist"}

@Alexey my scores are 0.59, 0.59, 0.67 and 0.6 for the 4 satellites.

Looks like you may be disqualified already.

Looks like one or maybe 2 are still going, but most likely not for much longer. You get disqualified when below 0.6.
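
As a quick check of that arithmetic, here is a small sketch in Python that compares the four reported scores against the 0.6 threshold mentioned above. It only illustrates the comparison and is not how the satellites actually evaluate a node. The reported values are rounded, so the 0.6 could really sit on either side of the line, hence “one or maybe 2”.

```python
DQ_THRESHOLD = 0.6  # threshold as stated in this thread


def still_active(scores, threshold=DQ_THRESHOLD):
    """Return the scores that are not below the disqualification threshold."""
    return [s for s in scores if s >= threshold]


# Audit scores reported above for the 4 satellites (rounded values).
scores = [0.59, 0.59, 0.67, 0.6]
print(still_active(scores))  # [0.67, 0.6]
```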

Well, that’s fantastic, there go all the months since v3 launch.

What course of action would you guys suggest I take?

I think the only option is to make a new node, but you will lose the escrow anyway.

In my opinion, Storj should read all possible data from a node as soon as it gets disqualified. This would lower the possible data loss and the money spent on recovering that data.
As I see it, every day I get repair blocks, which means that every day someone gets disqualified and their data goes to repair. Why repair all of the data if you can just send all possible data from the disqualified node and repair only the 400 lost blocks out of some TB of data? This function was needed last year.

I fully agree, but there seems to be an underlying issue, possibly hiding in one or more of these 400 blocks. It doesn’t make sense to fail over 50% of all download requests and audits if only a couple of megabytes are corrupted.

As for the escrow - I couldn’t care less. The vetting process and lost reputation, however (and also my identity generated with a fancy difficulty), really sadden me.

I know it sucks, but if the files aren’t there any more or got corrupted it’s kind of the only course of action to take while still ensuring data protection on the network.

Honestly I think new nodes gain reputation pretty fast, so I wouldn’t worry about that too much. Vetting will take a month or so though, not much you can do about that.
