That’s an interesting suggestion, though it would come with some challenges.
I looked at some stats for last month. My node stored 7.5TB on average, and all audits added up to 6.9MB. So less than 1/1,000,000th of the data is audited every month. If every missing file were reported as a failed audit, you’d probably instantly disqualify a node that would have easily survived just a few missing files.
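A quick back-of-the-envelope with those numbers (assuming, as a simplification, that audits sample stored data uniformly at random):

```python
# Back-of-the-envelope using the figures above.
stored_bytes = 7.5e12   # ~7.5 TB stored on average last month
audited_bytes = 6.9e6   # ~6.9 MB of audit traffic in the same month

fraction_audited = audited_bytes / stored_bytes
print(f"fraction of stored data audited per month: {fraction_audited:.1e}")
# -> about 9e-07, i.e. less than 1/1,000,000th of the data.
# At that sampling rate a handful of missing files would almost never be hit
# by a normal audit, so treating every self-reported missing file as a failed
# audit would punish them far more heavily than random audits ever would.
```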
So there’d have to be some system that weighs these self-reported failures against normal audits.
Currently traffic is quite homogeneous. With more mixed customer use in the future, this will eventually change though. You can’t build a system just for how things are now; it needs to work in all scenarios.
My NAS has an option to lower the impact on system performance by slowing down the process. With a reasonable scrubbing frequency this is a decent option to prevent annoying slowdowns. You don’t want to rely on manually triggering this stuff anyway. Schedule it frequently and just let it do its thing.
Well, I’m not quite done with my system yet, so I haven’t automated everything I need to. Just like I haven’t gotten my watchdog to work; I should really get that checked as well.
My system is pretty adaptive. I’m going to be setting up a PXE server on the pool tonight, and then my main computer will run its OS from the pool, lol.
That’s going to be fun, or terrible, but from what I can see it should work fine. My pool might have plenty of configuration errors, but it manages fairly okay.
Yes, I should automate scrubs, but for now I like to run them when traffic is low. Maybe I’ll set up some sort of trigger for that and add some time constraints on top of it. I’m not sure the scrub in ZFS can be throttled, but it does take lower priority than everything else. Still, my pool sees at least 80 ms of latency during a scrub, at least from what I can measure, though I’m not sure that’s real for actual requested, important data; those requests might just jump ahead of the scrub in a way the iostat metric doesn’t account for.
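A minimal sketch of what such a “scrub only when the pool is quiet” trigger could look like, assuming a ZFS system with the standard `zpool` tooling; the pool name and bandwidth threshold are placeholders:

```python
#!/usr/bin/env python3
# Sketch: start a scrub only when the pool looks quiet.
# POOL and THRESHOLD are placeholders; run this from cron at off-peak hours.
import subprocess

POOL = "tank"                    # hypothetical pool name
THRESHOLD_BYTES_PER_SEC = 5e6    # "quiet" = under ~5 MB/s combined I/O

def pool_bandwidth(pool: str) -> float:
    """Sample current read+write bandwidth with `zpool iostat -Hp <pool> 5 2`.

    -H prints tab-separated, script-friendly output and -p exact byte values.
    The first sample is an average since boot, so only the second is used.
    """
    lines = subprocess.run(
        ["zpool", "iostat", "-Hp", pool, "5", "2"],
        capture_output=True, text=True, check=True,
    ).stdout.strip().splitlines()
    fields = lines[-1].split("\t")
    # fields: name, alloc, free, read ops, write ops, read B/s, write B/s
    return float(fields[5]) + float(fields[6])

if __name__ == "__main__":
    bw = pool_bandwidth(POOL)
    if bw < THRESHOLD_BYTES_PER_SEC:
        subprocess.run(["zpool", "scrub", POOL], check=True)
        print(f"{POOL} is quiet ({bw:.0f} B/s), scrub started")
    else:
        print(f"{POOL} is busy ({bw:.0f} B/s), skipping the scrub for now")
```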
It seems to run well while scrubbing too, but when I get my OS on the pool it should give me a much better sense of how it really reacts.
The graphs can only tell you so much. The other day I had this weird latency issue that didn’t show up anywhere; all the monitoring data I got back was fine, but in nano the blinking cursor would lag. I never did figure out why, though, and it seemed to vanish after a while.
Only if the incentive is right. If it’s treated exactly like an audit failure, I’d disable it, because I’d pay the additional I/O of the scrub only for my audit score to potentially go lower. On the other hand, if it didn’t cost anything, people would abuse it as an instant graceful exit without having to transfer data out and without the node having to be old enough.
Just an update to say that I rsynced the data to another disk and there were 300 incomplete files, but the disk is OK. I think the cause was a power fault during which the SSD failed to write.
A scrub could have fixed the problem with about 500MB of download; those few cents could have been taken from my held amount and it would have saved me a day.
There is no practical way to verify that the node still has all the other data it’s supposed to have after an audit failure.
The satellite uses a probabilistic algorithm to check random pieces from your node. Checking every single piece would need a lot of time and computation; it would take weeks to check a single node in full, and it’s far too expensive to dedicate the whole satellite to auditing one node out of thousands.
The node doesn’t have knowledge of what data it’s supposed to have; it’s untrusted by default.
It could check checksums and store that information on some trusted storage, as proposed in the linked thread, but:
Thus it’s much cheaper for the network to just disqualify your node and mark all pieces it holds as lost.
The result would be the same as in your proposition:
the held amount would be used to recover the lost data;
the reputation of the node would be reset back to zero, so it would start over with vetting and 75% held back (because the held amount has been used);
since all pieces are marked as probably lost, they would be recovered when the repair trigger fires, and then deleted from your node by the garbage collector.
So it’s the same as disqualification, but with slow cleaning of the used space.
It’s much simpler to just start over.
I guess one difference would be that if my node were disqualified on only one satellite (which might happen), I would not have to run two nodes: a new node for that satellite and the old node for all the other satellites.
It is not about disqualification for me. It is preemptive.
I am not even sure that the 300 files were ever completed, as storagenode doesn’t seem to care much about leaving junk files around.
I would argue junk files are Storj’s problem and they should have to pay for their storage; SNOs run a storage space business.
We can’t really care much about what kind of files are stored and what isn’t; people just have to pay for their allotted space.
So really, IMO, there shouldn’t be junk files. Of course that kind of file would be covered by the 10% extra space we’re supposed to overallocate for the node, so as long as they stay within that 10% it’s not like we weren’t informed, and I’m sure they’ll get around to cleaning up after themselves eventually. They can’t expect us to clean up inside their software; that’s just downright dangerous for their own data and software integrity.
I do believe files often take something like a week to get into the deletion phase, which makes sense; this is a network after all, and we store data. If a piece is missing, it makes sense that the satellites can ask or search for it and maybe find it in the junk data instead of having to repair.
I dunno; I mean, I use the trash folder on my computer because sometimes I hit the wrong button or something goes wrong, and it’s nice to easily retrieve a file without having to use undelete software.
So I do see the point of why I think they keep junk files around for extended periods.
I think SNOs should be more interested in the customer experience of Tardigrade; that is, after all, how we will make our money, if people think the system works great.
We won’t be making money by checking our systems and deleting 100MB files here and there to try to optimize space.
In most cases what you could delete manually adds up to less than 0.001% of what you earn from renting out the storage.
I don’t see what use manually deleting junk files really is, and I assume they’ll be deleted eventually anyway, all by themselves or by Storj. xD
That would be awesome, I’m gonna fill my blobs folder with junk right now!
Junk data is data that isn’t supposed to be there. Under normal operation, the node has been told about it and should have deleted it right away. Why would the node keep being paid if it didn’t clean up data when it should have? The only reason we’re even talking about this is because this system was used one time as a cleanup for a lot of zombie segments on stefan-benten: data that wasn’t owned by a customer, but that nodes were still paid for for a loooong time regardless. I would not complain about that. This is an exception that shouldn’t really occur again. My trash folder is already nearly empty again.
Anyway, I shouldn’t have even responded to this since this topic isn’t about that.
I get that. But if you tell the satellite you lost 100 files, why should the satellite trust your node not to do that again? You can claim the issue has been resolved and no other files were affected, but your node is inherently treated as an untrusted entity, and the satellite has no (affordable) way to verify that claim. That’s really the end of the story. So the only option left to the satellite is to assume you’re lying and repair all pieces to make sure your assumed lie doesn’t hurt segment health.
The only exception to this I can imagine is if a node failed a few audits and then says: “My bad, I had temporarily lost access and I can now prove I still have the data.” After that, the satellite could resend all audits the node previously failed and offer the node a chance to wipe its slate clean. This would still be a very generous system and I don’t even know if it’s worth it for the network to implement, but it would help out many SNOs. It’s also a different situation from the one described in this topic.
It wasn’t exactly meant like that; more like, if Storj put the files there and it’s their fault they haven’t been deleted correctly, then they should pay for the space. But yeah, if it’s a problem on the node / the SNO’s fault, then of course they shouldn’t pay for it.
I like how you think, though.
That would require satellites to keep metadata for every piece ever deleted and keep track of which pieces were the node’s fault and which were the satellite’s fault. That’s a ton of extra metadata, and I have no solution for how to find out whose fault the junk data is to begin with. Either way, it’s generally a non-issue, as the trash usually doesn’t even get to 0.1% of your node’s size.
But if you tell the satellite you lost 100 files, why should the satellite trust your node not to do that again?
Because the information was volunteered
What your node would be claiming has two parts: “I lost these 100 files, but I swear the rest are all fine.”
That is semantically the same as “I broke into his house, but I swear I didn’t take anything.”
The information being volunteered may make it more likely for that first part to be true, but not the second part. In fact, the first part is a pretty good indicator that the second part is probably a lie. Nobody is arguing about that first part; I’m sure you lost those 100 files. But the fact that it happened makes me as an outsider really doubt that the rest is fine. In fact, losing the 100 files could be a sign of an HDD starting to fail or of other issues that could easily have affected more files or could affect more files in the future. There is no reason for a satellite to accept that second part of the statement.
I think the issue for the satellites is not that they need data to help them rebuild; they can rebuild from the pieces they’ve got, and because of this they don’t need anything from a bad node. At best they save a bit on repair; at worst the node holds unreliable data belonging to other erasure-coded segments, making the satellites assume something is good when it isn’t.
Remember, this isn’t RAID or any other normal way of storing data. It’s basically an active computation creating unique pieces, which are then stored and later retrieved to recompute more pieces when the number of pieces gets too low.
This means the system has to know exactly how many good pieces it has, or else the whole thing breaks down.
So, surprise surprise, data integrity is critical to an SNO’s node reliability, because the satellites will throw a fit if you start to lose stuff.
How would you feel if your bank said, “Well, we have 99.99% correct bits, so only 1 transfer in 10,000 has a wrong number in it”? Money would simply migrate around between accounts and payments, seemingly at random; sometimes somebody would pay a million for something, other times 10 cents.
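To put a toy number on that “the satellite has to know exactly how many good pieces it has” point, here is a sketch with made-up thresholds rather than the network’s real parameters:

```python
# Toy illustration with made-up numbers (not the network's real parameters).
# A segment needs at least K pieces to be rebuilt, and the satellite only
# schedules repair once its *believed* good-piece count drops to REPAIR_AT.

K = 29          # hypothetical: minimum pieces needed to rebuild a segment
REPAIR_AT = 52  # hypothetical: repair is triggered at or below this count

believed_good = 54   # what the satellite thinks is still intact
silently_bad = 30    # pieces that nodes quietly lost or corrupted

actually_good = believed_good - silently_bad

print("repair triggered?", believed_good <= REPAIR_AT)   # False -> no repair
print("segment recoverable?", actually_good >= K)        # False -> data lost
# Because the satellite's count was wrong, it never repaired the segment even
# though the real piece count had already fallen below the rebuild threshold.
```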
Update: the node is on a new disk with 7TB more space and is not logging failed audits in the docker log; however, it is showing scores varying between 80% and 99% in both earnings.py and the dashboard for two satellites.
My plan is to swamp the 1TB of bad data with 7TB of new, good data.
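For what it’s worth, a rough sketch of why that dilution helps, assuming audits land on stored data roughly uniformly at random and ignoring how the satellite’s score formula smooths results over time; the totals below are hypothetical:

```python
# Rough sketch: chance that a random audit hits bad data as the node fills up.
# Assumes audits sample stored data roughly uniformly; the satellite's actual
# scoring formula is more involved than this.

bad_tb = 1.0  # roughly 1 TB of suspect data carried over from the old disk

def p_audit_hits_bad(bad_tb: float, total_tb: float) -> float:
    return bad_tb / total_tb

for total_tb in (2.0, 4.0, 8.0):  # hypothetical totals as new data arrives
    print(f"{total_tb:.0f} TB total -> "
          f"{p_audit_hits_bad(bad_tb, total_tb):.1%} of audits hit bad data")
# Filling up to ~8 TB drops the per-audit failure chance to about 12.5%; how
# survivable that is depends on the real score formula, not on this ratio alone.
```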
In answer to your comment in another thread, B.S.: a node could not use this “I’ve lost this block, sorry” message to cheat and discard rarely downloaded blocks, because:
a) the satellite would say, “Sorry, you already lost 100 blocks, you are DQ”;
b) the satellite would use the held amount for the repair: “Sorry, you have no held amount left, you are DQ”;
c) the satellite could even count this as a failed audit: “Why, thanks for letting me know, you are DQ.” Why would a “selfish” SNO turn this feature on then, you ask? To get it out of the way and find out whether this node has a future or not.
That was indeed a correct prediction; I would ask that. I see your point, and it might be nice to be able to do some self-diagnostics. But it would likely be more profitable to just let 100 pieces be missing. On a sufficiently large node that will not lead to disqualification, so forcing disqualification on such nodes because they ran a self-diagnosis would be a bit harsh. At that point it would just be a self-diagnosis tool that informs SNOs they can shoot themselves in the foot by reporting issues to the satellite. I would also like to know if something is damaged or corrupt, but such a tool wouldn’t exactly be high on the list of priorities to develop first.
I think the misunderstanding is mostly based on the assumption that you will eventually be disqualified. If we’re talking about 100 missing or corrupt pieces, I’m almost certain that will not happen at all. I’m also certain that if your scores are dropping as low as 80, you have much more significant issues than that.