Storage node to Satellite : I've lost this block, sorry

Only if the incentive is right. If it’s treated exactly like an audit failure, I’d disable it, because I’d pay for the additional I/O of the scrub only for my audit score to potentially go lower. On the other hand, if it didn’t cost anything, people would abuse it to get an instant graceful exit without having to transfer data out and without the node having to be old enough.

Just an update to say that I rsynced the data to another disk and there were 300 incomplete files, but the disk is OK. I think the cause was a power fault where the SSD failed to write.
A scrub could have fixed the problem with about 500 MB of download. Those cents could have been taken from my held amount, and it would have saved me a day.

There is no practical way to verify that the node supposedly has all other data intact after an audit failure.

The satellite uses a probabilistic algorithm to check random pieces from your node. Checking every single piece would need a lot of time and computation: it would take weeks to check a whole node, and it is very expensive to tie up the whole satellite auditing only one node out of thousands.
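As a rough back-of-the-envelope sketch of why random sampling is cheap but never exhaustive: assuming audits pick pieces uniformly at random and independently (an assumption, the thread does not describe the real selection), the chance of hitting a lost piece depends only on the lost fraction and the number of audits. The function names and the piece counts below are hypothetical.

```python
import math

def detection_probability(lost_fraction: float, audits: int) -> float:
    """Chance that at least one of `audits` random checks lands on a lost piece."""
    return 1.0 - (1.0 - lost_fraction) ** audits

def audits_needed(lost_fraction: float, confidence: float) -> int:
    """Smallest number of random audits that reaches the given detection chance."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - lost_fraction))

# Hypothetical node: 300 lost pieces out of 1,000,000 stored.
lost_fraction = 300 / 1_000_000
n = audits_needed(lost_fraction, 0.95)
print(f"audits for a 95% chance of noticing: {n}")
print(f"chance after 715 audits: {detection_probability(lost_fraction, 715):.1%}")
```

The point of the sketch is only that a small loss can go unnoticed for a long time, while a full check of every piece is a different order of cost.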

The node has no knowledge of what data it is supposed to have, and it is untrusted by default.
It can check checksums and store information on some trusted storage, as proposed in the linked thread, but:

Thus it’s much cheaper for the network to just disqualify your node and mark all held pieces as lost.
The result would be the same as in your proposal:

  • the held amount will be used to recover lost data;
  • the reputation of the node will be reset back to zero, so it starts again with vetting and 75% held back (because the held amount has been used);
  • since all pieces are marked as probably lost, they will be recovered when the trigger is fired, thus they will be deleted from your node by the Garbage collector.

So it’s the same as disqualification, but with slow cleaning of the used space.
It’s much simpler to just start over.

I guess one difference would be that if my node was disqualified on only one satellite (which might happen), I would not have to run two nodes - a new node for that satellite and the old node for all the other satellites.

Yes, I have to agree. However, all that time the data of that satellite will still be stored on the disqualified node and will only slowly be deleted over time.

It is not about disqualification for me. It is preemptive.
I am not even sure that the 300 files were ever completed as storagenode seems not to care much about leaving junk files around.

I’ll post an update if I see any failed audits

i would argue junk files are storj’s problem and they should have to pay for their storage… SNO’s run a storage space business… :smiley:

we cannot really care much about what kind of files are stored and what isn’t… people just have to pay for their allotted space…

so really imo there shouldn’t be junk files… ofc those kind of files would be included in the 10% extra space we are supposed to overallocate to your node… so really so long as they stay within their 10% it’s not like we were not informed and i’m sure they will get around to cleaning up after themselves eventually… they cannot expect us to clean up in their software… that’s just downright dangerous for their own data / software integrity

i do believe that files often take like a week to get into the deletion phase… which makes sense, this is a network after all and we store data… if a piece is missing then it makes sense the sats can ask / search for it and maybe find it in junk data… instead of having to repair…

i duno… i mean i use my trash folder on my computer… because sometimes i hit the wrong button or something goes wrong and it’s nice to easily retrieve it without having to use undelete software.
so i do see the point in why i think they are keeping junk files for extended periods…
i think SNO’s should be more interested in the customer experience of tardigrade… that is after all how we will make our money, if people think the system works great…

we won’t be making money on checking our system and deleting 100 mb files here and there to try and optimize space…
in most cases what you can delete manually adds up to less than 0.001% of what you earn on renting out the storage…

i don’t see what use deleting junk files really is… and i assume they will be deleted eventually anyways all by themselves… or by storj xD

That would be awesome, I’m gonna fill my blobs folder with junk right now!

Junk data is data that isn’t supposed to be there. Under normal operation, the node has been told about this and should have deleted it right away. Why would the node keep being paid if it didn’t clean up data when it should have? The only reason we’re even talking about this is because this system was used one time as a cleanup for a lot of zombie segments on stefan-benten. That was data that wasn’t owned by a customer, but nodes were still paid for it for a loooong time regardless. I would not complain about that. This is an exception that shouldn’t really occur again. My trash folder is already nearly empty again.

Anyway, I shouldn’t have even responded to this since this topic isn’t about that.

I get that. But if you tell the satellite you lost 100 files, why should the satellite trust your node not to do that again? You can claim the issue has been resolved and no other files were affected, but your node is inherently treated as an untrusted entity. And the satellite has no (affordable) way to verify that claim. That’s really the end of the story. So the only option left to the satellite is to assume you’re lying and repair all pieces to make sure your assumed lie doesn’t hurt segment health.

The only exception to this I can imagine is if a node failed a few audits and the node says: “My bad, I had temporarily lost access and I can now prove I still have the data”. After which the satellite could resend all audits the node previously failed and offer the node a chance to wipe their slate clean. This would still be a very generous system and I don’t even know if it’s worth it for the network to implement this, but it would help out many SNOs. It’s also a different situation from the one described in this topic.

wasn’t exactly meant like that… more like if storj put the files there and it’s their fault it hasn’t gotten deleted correctly, then they should pay for the space… but yeah if it’s a problem on the node / SNO’s fault then ofc they shouldn’t pay for it…
i like how you think tho… :smiley:

That would require satellites to keep metadata for all deleted pieces ever and keep track of which pieces were the node’s fault and which pieces were the satellite’s fault. That’s a ton of metadata, and I have no solution for how to find out whose fault the junk data is to begin with. Either way, it’s generally a non-issue as the trash usually doesn’t even get to 0.1% of your node’s size.

i didn’t say it would be practical nor worthwhile… heheh
but atleast in a perfect world it would work like that… :smiley:

“But if you tell the satellite you lost 100 files, why should the satellite trust your node not to do that again?”
Because the information was volunteered.

That’s not a reason to believe it.

What your node would be claiming has 2 parts.
“I lost these 100 files, but I swear the rest are all fine”
Is semantically the same as
“I broke into his house, but I swear I didn’t take anything”

The information being volunteered may make it more likely for that first part to be true, but not the second part. In fact the first part is a pretty good indicator that the second part is probably a lie. Nobody is arguing about that first part, I’m sure you lost those 100 files. But that fact makes me, as an outsider, really doubt that the rest is fine. Losing the 100 files could be a sign of an HDD starting to fail or of other issues that could easily have affected more files or will affect more files in the future. There is no reason for a satellite to accept that second part of the statement.

i think the issue for the satellites is not that they need data to help them rebuild… they can rebuild from the pieces they got and because of this don’t need anything from a bad node… at best they save a bit on repair… at worst the node holds unreliable data in other erasure coding strings… thus making the satellites assume something is good when it isn’t…

remember this isn’t raid or any other normal way of storing data… this is basically an active computation creating unique pieces which are then stored and then retrieved to recompute more pieces when the number of pieces gets too low…

this means the system has to know exactly how many good pieces it has… else the entire system breaks down…

so surprise surprise… data integrity for SNO’s is a critical thing to their node reliability… because the satellites will throw a fit if you start to lose stuff.

how would you feel if your bank was like… well we have 99.99% correct bits…

so only 1 transfer in 10000 has a wrong number in it… :smiley:
money would simply migrate around between the accounts and payments controlled… at random seemingly… sometimes somebody would pay a million for something others times 10 cents…

:smiley: do you feel lucky… well do ya punk xD

Update: The node is on a new disk with 7TB more space and is not logging failed audits in the docker log, however, it is showing varying 80%-99% good in both earnings.py and in the dashboard for two satellites.
My plan is to swamp the 1TB of bad with 7TB of new good.

In answer to your comment in another thread B.S., a node could not use this “I’ve lost this block, sorry” message to cheat and discard low download blocks because
a) the satellite would say “sorry, you already lost 100 blocks, you are DQ”
b) the satellite would use held amount for repair “sorry you have no held amount, you are DQ”
c) the satellite could even count this as a failed audit “why thanks for letting me know, you are DQ”. Why would a “selfish” SNO turn this feature on then, you ask? To get it out of the way and find out if this node has a future or not.
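The safeguards (a) and (b) above can be sketched as a small decision function. This is only an illustration of the proposal in this post, not anything the satellite actually implements: `NodeState`, `report_lost_pieces`, the 100-piece cap and the per-piece repair cost are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class NodeState:
    held_amount_usd: float
    reported_lost: int = 0
    disqualified: bool = False

def report_lost_pieces(node: NodeState, count: int,
                       repair_cost_per_piece_usd: float,
                       max_reported: int = 100) -> str:
    """Handle a voluntary "I've lost these pieces" report from a node."""
    # Rule (a): a node may only ever report a limited number of lost pieces.
    if node.reported_lost + count > max_reported:
        node.disqualified = True
        return "DQ: too many pieces reported lost"
    # Rule (b): the repair is paid out of the held amount, or not at all.
    cost = count * repair_cost_per_piece_usd
    if cost > node.held_amount_usd:
        node.disqualified = True
        return "DQ: held amount cannot cover repair"
    node.held_amount_usd -= cost
    node.reported_lost += count
    return "accepted: repair charged to held amount"

node = NodeState(held_amount_usd=9.70)
print(report_lost_pieces(node, 50, repair_cost_per_piece_usd=0.01))
print(report_lost_pieces(node, 60, repair_cost_per_piece_usd=0.01))
```

With these made-up numbers the first report is accepted and the second one pushes the node over the cap, which is the behaviour that would stop a node from quietly discarding rarely downloaded pieces.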

That was indeed a correct prediction, I would ask that. I see your point and it might be nice to be able to do some self diagnostics. But it would likely be more profitable to let 100 pieces just be missing. On a sufficiently large node that will not lead to disqualification, so forcing disqualification on such nodes because they did a self diagnosis would be a bit harsh. At that point it would just be a self diagnosis tool that informs SNOs they can shoot themselves in the foot by reporting issues to the satellite. I would also like to know if something is damaged or corrupt. But such a tool wouldn’t exactly be high on the list of priorities to develop first.

I think the misunderstanding is mostly based on the assumption that you will eventually be disqualified. If we’re talking about 100 missing or corrupt pieces, I’m almost certain that will not happen at all. I’m also certain that if your scores are dropping as low as 80, you have much more significant issues than that.
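To make the point about occasional failures concrete, here is a minimal sketch, assuming the audit score behaves like an exponentially forgetting success ratio. The forgetting factor 0.95 and the disqualification threshold 0.6 are illustrative assumptions, not values confirmed in this thread, and the helper names are hypothetical.

```python
LAMBDA = 0.95        # assumed forgetting factor (illustrative)
DQ_THRESHOLD = 0.6   # assumed disqualification threshold (illustrative)

def update(alpha, beta, success, lam=LAMBDA):
    """Apply one audit result: decay the history, then count the new result."""
    alpha = lam * alpha + (1.0 if success else 0.0)
    beta = lam * beta + (0.0 if success else 1.0)
    return alpha, beta

def score(alpha, beta):
    """Score is the weighted share of successful audits."""
    return alpha / (alpha + beta)

def steady_state(lam=LAMBDA, audits=1000):
    """State of a node that has passed a long run of audits."""
    alpha, beta = 1.0, 0.0
    for _ in range(audits):
        alpha, beta = update(alpha, beta, True, lam)
    return alpha, beta

def consecutive_failures_until_dq(lam=LAMBDA, threshold=DQ_THRESHOLD):
    """How many failures in a row push a healthy node below the threshold."""
    alpha, beta = steady_state(lam)
    failures = 0
    while score(alpha, beta) >= threshold:
        alpha, beta = update(alpha, beta, False, lam)
        failures += 1
    return failures

alpha, beta = update(*steady_state(), False)
print(f"score after one isolated failure: {score(alpha, beta):.3f}")
print(f"failures in a row needed for DQ: {consecutive_failures_until_dq()}")
```

Under these assumptions a single failed audit costs only a few percent and is forgotten again after a run of successes, so 100 bad pieces spread over a large node would rarely produce failures close enough together to matter. A score sitting at 80% points to failures arriving much more often than that.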

Maybe something about the bad data causes the logging to fail too?

It’s possible. Is the log written to the same HDD as the data? Do you see timestamp gaps in the log?
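A quick way to look for timestamp gaps is a small script like the sketch below. It assumes each log line starts with an ISO-8601 timestamp (for example `2020-08-05T07:01:39.123Z`); the sample lines are made up to show the input shape, and `find_gaps` is a hypothetical helper, not part of any Storj tool.

```python
import re
from datetime import datetime, timedelta

# Assumed line prefix: ISO-8601 timestamp; adjust if your log format differs.
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})")

def find_gaps(lines, min_gap=timedelta(minutes=10)):
    """Return (previous, next) timestamp pairs that are at least min_gap apart."""
    gaps = []
    previous = None
    for line in lines:
        match = TIMESTAMP.match(line)
        if not match:
            continue  # skip lines without a leading timestamp
        current = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S")
        if previous is not None and current - previous >= min_gap:
            gaps.append((previous, current))
        previous = current
    return gaps

# Made-up sample lines, only to show the shape of the input.
sample = [
    "2020-08-05T07:00:00.000Z INFO piecestore download started",
    "2020-08-05T07:01:30.000Z INFO piecestore downloaded",
    "2020-08-05T09:15:00.000Z INFO piecestore download started",
]
for start, end in find_gaps(sample):
    print(f"no log lines between {start} and {end} ({end - start})")
```

For a real check, the lines would come from the node log instead of `sample`, for instance the output of `docker logs storagenode1 2>&1` saved to a file and read line by line.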

Which score is dropping? Suspension or Audit? If the storage location is overloaded I would expect timeouts, which should be counted towards the suspension score, not the audit score.

I didn’t find anything
root@raspberrypi1:~/storj_success_rate# ./successrate.sh storagenode1
========== AUDIT ==============
Critically failed: 0
Critical Fail Rate: 0.000%
Recoverable failed: 0
Recoverable Fail Rate: 0.000%
Successful: 715
Success Rate: 100.000%
========== DOWNLOAD ===========
Failed: 460
Fail Rate: 3.958%
Canceled: 908
Cancel Rate: 7.812%
Successful: 10255
Success Rate: 88.230%
========== UPLOAD =============
Rejected: 0
Acceptance Rate: 100.000%
---------- accepted -----------
Failed: 2
Fail Rate: 0.006%
Canceled: 18
Cancel Rate: 0.058%
Successful: 31039
Success Rate: 99.936%
========== REPAIR DOWNLOAD ====
Failed: 0
Fail Rate: 0.000%
Canceled: 0
Cancel Rate: 0.000%
Successful: 6641
Success Rate: 100.000%
========== REPAIR UPLOAD ======
Failed: 0
Fail Rate: 0.000%
Canceled: 0
Cancel Rate: 0.000%
Successful: 11331
Success Rate: 100.000%
========== DELETE =============
Failed: 0
Fail Rate: 0.000%
Successful: 4880
Success Rate: 100.000%
root@raspberrypi1:~/storj_success_rate# cd /data/data1
root@raspberrypi1:/data/data1# python ~/storj_earnings/earnings.py
August 2020 (Version: 9.3.0) [snapshot: 2020-08-05 07:01:39Z]
TYPE                          PRICE            DISK         BANDWIDTH   PAYOUT
Upload              Ingress   -not paid-                    9.26 GB
Upload Repair       Ingress   -not paid-                    14.51 GB
Download            Egress    20 USD / TB                   20.67 GB    0.41 USD
Download Repair     Egress    10 USD / TB                   14.78 GB    0.15 USD
Download Audit      Egress    10 USD / TB                   539.90 KB   0.00 USD
Disk Current        Storage   -not paid-       1.02 TB
Disk Average Month  Storage   1.50 USD / TBm   137.77 GBm               0.21 USD
Disk Usage          Storage   -not paid-       100.57 TBh
________________________________________________________________________________+
Total                                          137.77 GBm   59.22 GB    0.77 USD
Estimated total by end of month                994.86 GBm   427.65 GB   5.55 USD

Payout and held amount by satellite:
SATELLITE       MONTH  JOINED      HELD TOTAL  EARNED      HELD%  HELD        PAID
us-central-1    8      2020-01-12  0.75 USD    0.0781 USD  25%    0.0195 USD  0.0586 USD
    Status: OK >> Audit[0.0% DQ|0.0% Susp]
europe-west-1   8      2020-01-12  2.10 USD    0.2089 USD  25%    0.0522 USD  0.1567 USD
    Status: OK >> Audit[0.0% DQ|0.0% Susp]
europe-north-1  5      2020-04-18  0.52 USD    0.0892 USD  50%    0.0446 USD  0.0446 USD
    Status: OK >> Audit[0.0% DQ|0.0% Susp]
asia-east-1     8      2020-01-12  0.63 USD    0.0622 USD  25%    0.0155 USD  0.0466 USD
    Status: OK >> Audit[0.0% DQ|0.0% Susp]
saltlake        7      2020-02-11  9.70 USD    0.3223 USD  25%    0.0806 USD  0.2417 USD
    Status: WARNING >> Audit[13.1% DQ|0.0% Susp]
stefan-benten   8      2020-01-12  2.55 USD    0.0072 USD  25%    0.0018 USD  0.0054 USD
    Status: OK >> Audit[0.0% DQ|0.0% Susp]
____________________________________________________________________________________+
TOTAL                              16.26 USD   0.7679 USD         0.2143 USD  0.5536 USD

root@raspberrypi1:/data/data1# docker logs storagenode1 2>&1 | grep -i audit | grep "downloaded" | wc
716 8592 153605
root@raspberrypi1:/data/data1# docker logs storagenode1 2>&1 | grep -i audit | grep "download started" | wc
716 9308 157901
root@raspberrypi1:/data/data1# docker logs storagenode1 2>&1 | grep -i audit | grep -i fail
root@raspberrypi1:/data/data1# docker logs storagenode1 2>&1 | grep -i audit | grep -i error
root@raspberrypi1:/data/data1# docker logs storagenode1 2>&1 | grep -i audit | grep -i unkn
root@raspberrypi1:/data/data1# docker logs storagenode1 2>&1 | grep -i audit | grep -i debug
root@raspberrypi1:/data/data1#