Changelog v1.9.5

For Storage Nodes

Upload Canceled in Logfile
A few releases ago we changed the uplink behavior at the end of an upload. Instead of wasting time sending a final success message back to the storage node, the uplink simply closes the connection. On the storage node side this created a lot of false "upload canceled" log messages. We fixed the storage node behavior: it now keeps track of whether it has submitted the signed piece hash back to the uplink at the end of the upload. The uplink needs to submit all signed piece hashes to the satellite, and the satellite will reject any invalid or missing signed piece hash. This makes the signed piece hash a reliable checkpoint for distinguishing a successful upload from a canceled one.
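To make the distinction concrete, here is a minimal sketch in Go of the idea described above, assuming a hypothetical upload handler; the names and log messages are illustrative and not the actual storagenode code:

```go
// Hypothetical illustration: classify an upload when the uplink closes the
// connection, based on whether the signed piece hash was already returned.
package main

import "log"

type uploadState struct {
	pieceID       string
	sentPieceHash bool // set to true after the node returns the signed piece hash
}

// onConnectionClosed decides which log message to emit once the uplink hangs up.
func onConnectionClosed(s uploadState) {
	if s.sentPieceHash {
		// The uplink already has everything it needs for the satellite.
		log.Printf("uploaded %s", s.pieceID)
		return
	}
	// The connection closed before the handshake completed.
	log.Printf("upload canceled %s", s.pieceID)
}

func main() {
	onConnectionClosed(uploadState{pieceID: "example-piece", sentPieceHash: true})
	onConnectionClosed(uploadState{pieceID: "other-piece", sentPieceHash: false})
}
```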

Used and Free Space on Dashboard
The CLI and web UI dashboards now show the same values for used space and free space.

Free Space Fix
Storage nodes execute the used space calculation only once on startup. From then on, the storage node keeps and updates used space and free space in memory. The storage node should notify the satellite when free space drops below 500 MB. This in-memory value can get outdated, so a few releases ago we added a second free space check. That second check runs on every upload but didn't notify the satellite. With this release we combine both free space checks to make sure the satellite stops selecting full storage nodes.
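A rough sketch of what the combined check could look like; the 500 MB threshold comes from the paragraph above, while the notifier interface and function names are invented for illustration:

```go
// Hypothetical illustration of combining the startup free-space value with a
// per-upload check, notifying the satellite when space runs low.
package main

import "fmt"

const lowSpaceThreshold = 500 * 1000 * 1000 // 500 MB, as described above

type satelliteNotifier interface {
	NotifyLowSpace(freeBytes int64)
}

type printNotifier struct{}

func (printNotifier) NotifyLowSpace(freeBytes int64) {
	fmt.Printf("telling satellite we are nearly full: %d bytes free\n", freeBytes)
}

// checkFreeSpaceOnUpload runs on every upload and now also notifies the satellite.
func checkFreeSpaceOnUpload(freeBytes int64, n satelliteNotifier) bool {
	if freeBytes < lowSpaceThreshold {
		n.NotifyLowSpace(freeBytes) // previously this path skipped the notification
		return false                // reject the upload, the node is effectively full
	}
	return true
}

func main() {
	checkFreeSpaceOnUpload(200*1000*1000, printNotifier{})
}
```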

TBm on Payment Dashboard
The storage node payment dashboard will now show used space in TBm instead of TBh. That should be easier to read and understand.
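For a rough feel of the conversion, here is an illustrative calculation that assumes a 720-hour (30-day) month; the actual payout accounting may use the real length of each month:

```go
// Illustrative conversion only: a terabyte-month here assumes a 720-hour month.
package main

import "fmt"

func main() {
	const hoursPerMonth = 720.0 // 30 days; real months vary
	usedTBh := 360.0            // e.g. 1 TB stored for half of such a month
	usedTBm := usedTBh / hoursPerMonth
	fmt.Printf("%.0f TBh is about %.2f TBm\n", usedTBh, usedTBm) // ~0.50 TBm
}
```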

Held Amount History
We are adding a held amount history to the storage node payment dashboard. You will find it at the bottom of the payment dashboard.

Suspension Score
Finally, the suspension score is visible on the storage node dashboard. The calculation is the same as for the audit score: a score close to 100% is good. You will get suspended if the suspension score drops below 60% and disqualified if the audit score drops below 60%. Successful audits increase both scores. Most audit failures decrease the suspension score, except for missing or corrupted pieces; those two failures directly impact the audit score instead. A storage node can recover from suspension mode. Disqualification is permanent.
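A minimal sketch of those thresholds; the 60% cutoffs come from the paragraph above, everything else (types, function names) is made up for illustration:

```go
// Hypothetical illustration of the suspension/disqualification thresholds.
package main

import "fmt"

const threshold = 0.60 // 60%, per the description above

type nodeStatus struct {
	auditScore      float64
	suspensionScore float64
}

func evaluate(s nodeStatus) string {
	switch {
	case s.auditScore < threshold:
		return "disqualified (permanent)"
	case s.suspensionScore < threshold:
		return "suspended (recoverable)"
	default:
		return "healthy"
	}
}

func main() {
	fmt.Println(evaluate(nodeStatus{auditScore: 0.99, suspensionScore: 0.55}))
	fmt.Println(evaluate(nodeStatus{auditScore: 0.59, suspensionScore: 0.95}))
}
```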

Graceful Exit Initiation
The storage node graceful exit command now asks each selected satellite whether graceful exit is possible. If the storage node is not old enough on one satellite, the command errors out without starting graceful exit for any of the selected satellites. The storage node operator then has the choice to stay or to call graceful exit again, this time without the satellite that wouldn't allow it.
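The all-or-nothing behavior could be sketched roughly like this; the satellite names are taken from this thread, but the age values and function names are placeholders, not the real eligibility rule:

```go
// Hypothetical illustration: verify eligibility on every selected satellite
// before starting graceful exit on any of them.
package main

import "fmt"

type satellite struct {
	name          string
	nodeAgeMonths int
	minAgeMonths  int // example values only, not the real requirement
}

func startGracefulExit(selected []satellite) error {
	// First pass: ask every satellite whether graceful exit is possible.
	for _, sat := range selected {
		if sat.nodeAgeMonths < sat.minAgeMonths {
			return fmt.Errorf("node not old enough on %s; graceful exit not started anywhere", sat.name)
		}
	}
	// Second pass: only now start graceful exit everywhere.
	for _, sat := range selected {
		fmt.Printf("starting graceful exit on %s\n", sat.name)
	}
	return nil
}

func main() {
	err := startGracefulExit([]satellite{
		{name: "us-central-1.tardigrade.io:7777", nodeAgeMonths: 8, minAgeMonths: 6},
		{name: "asia-east-1.tardigrade.io:7777", nodeAgeMonths: 4, minAgeMonths: 6},
	})
	if err != nil {
		fmt.Println(err)
	}
}
```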

Graceful Exit Cleanup
The repair job uploads a few bonus pieces to compensate for expected upload errors (disk full, storage node overloaded, etc.). If the repair job is lucky, it might end up storing more than 80 pieces. Graceful exit notices that and simply skips these pieces. At the end of a successful graceful exit the storage node gets paid without having to transfer them. There are a few other edge cases like this: if the storage node didn't get an order to transfer a piece, that piece would be left on disk even after a successful graceful exit. Now a final cleanup removes all remaining pieces at the end. If graceful exit failed, the cleanup is not triggered; we want to be able to investigate why graceful exit failed and maybe even restart the process.
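A sketch of that cleanup decision under the rules described above; the types and names are hypothetical, not the real graceful exit implementation:

```go
// Hypothetical illustration: only wipe the remaining pieces for a satellite
// when graceful exit completed successfully.
package main

import "fmt"

type pieceStore struct {
	remaining map[string][]string // satellite -> leftover piece IDs
}

func (p *pieceStore) cleanupAfterExit(satellite string, exitSucceeded bool) {
	if !exitSucceeded {
		// Keep the pieces so a failed exit can be investigated or restarted.
		fmt.Printf("graceful exit on %s failed, keeping %d pieces\n",
			satellite, len(p.remaining[satellite]))
		return
	}
	// Successful exit: remove skipped or untransferred pieces to free the space.
	fmt.Printf("graceful exit on %s succeeded, deleting %d leftover pieces\n",
		satellite, len(p.remaining[satellite]))
	delete(p.remaining, satellite)
}

func main() {
	store := &pieceStore{remaining: map[string][]string{"us-central-1": {"a", "b"}}}
	store.cleanupAfterExit("us-central-1", true)
}
```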

Storage Node Update Notification
We fixed and enabled the storage node update notification. This time the update notification should not be triggered as soon as we start the rollout process. It should be triggered 4 days after the rollout has finished. These 4 days include the docker rollout.

Loading Screen
The storage node dashboard has to query a lot of data, and that can take a few seconds. To avoid possible confusion about empty values, we are adding a loading screen.

For Customers

Revoke Access Key
If one of your shared access keys gets compromised, you can revoke access with uplink revoke. This will revoke the access key and any sub access key that was derived from it.

Bucket Limit
We introduced a limit on the number of buckets per project. It is currently set to 100. Please contact support if you need a higher limit.


What happens to docker nodes that have yet to complete the previous release?

They will update from the current version to this version when it's their time to update.

If you're referring to 1.6.5, that one was never released for Docker. It didn't have any code changes, but was just released to force a restart of all Windows nodes.

Either way it's not a problem. You can skip a release and update to the most recent version. But keep in mind it will take a while before it's available on docker due to the phased rollout.


Why stop the GE and force the SNO to start over?

I would like to propose something like this:

root@kali:~# storagenode exit-satellite
Please be aware that by starting a graceful exit from a satellite, you will no longer be allowed to participate in repairs or uploads from that satellite. This action can not be undone. Are you sure you want to continue? y/n

: y
Domain Name                       Node ID                                              Space Used         Age (months)         Eligibility for GE
satellite.stefan-benten.de:7777   118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW   1.8 TB                  2                     No
asia-east-1.tardigrade.io:7777    121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6  37.9 GB                 4                     No
us-central-1.tardigrade.io:7777   12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S  38.7 GB                 8                     Yes
europe-west-1.tardigrade.io:7777  12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs  0.8 TB                  10                    Yes

Satellites available for Graceful Exit: 
Domain Name                       Node ID                                              Space Used         Age (months)
us-central-1.tardigrade.io:7777   12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S  38.7 GB                 8
europe-west-1.tardigrade.io:7777  12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs  0.8 TB                  10

Would you like to continue (Y/N): 

Please enter a space delimited list of satellite domain names you would like to gracefully exit. Press enter to continue:

:+1: :+1: :+1: Finally, one unit of measurement both for displaying metrics and for payment.


Do access denied errors (due to filesystem permissions) fall under "missing or corrupted data"?

I believe inaccessible data is counted that way, yes.

This screen is gonna become a common sight for our SMR users :wink:

Jokes aside, I like the idea. It keeps people from wildly clicking around while the dashboard isn't ready yet.


I predict there will be an awful lot of traffic on the successrate comparison thread... xD


Great update, looking forward to it!


Damn. Hope they fix this with some of the changes regarding inaccessible mountpoints resulting in disqualifications.

@Storgeez
Just put the entire storagenode on the save mount... then if the mount is gone, the storagenode should crash, and at worst no data is written onto the node while it is in that state...
We can't really control the satellites, but we can control our own systems...

I must admit I haven't tried it, because broken mounts aren't really an issue of mine...
and when I did break the mount and tried to launch the storagenode, it didn't want to run.

And the couple of times I stalled the system for a few hours, it also seemed no worse for wear...
Everything was so stalled that, even though I could type the shutdown command, neither the storagenode nor the OS wanted to shut down. Because my node was new, I figured I would take up the fight and the chance to see if I could recover the system and figure out what I had done wrong...

Eventually I figured out that I had removed my ZFS l2arc or slog device without writing log or cache in the remove command... and thus just removed the attached device without turning off the l2arc or slog setup... I suppose...

It took like 1½-2 hours before the system was able to shut down the first time... the second time I did a hard reboot; I didn't feel like waiting or risking the node for no gain...

One thing of note though: I moved the identity onto the storagenode data location... long ago...
but I'm not 100% sure how much better that would work. Of course, moving the entire storagenode onto the mount point might not be much better either... but at least in theory it should make everything inaccessible...

Otherwise, there are some shutdown scripts BrightSilence and others have made that will monitor files on a mount or check for audit failures and then shut down the node if a problem arises...

I tested out Bright's one-line audit code using screen so it was running in the background on my server. It seemed to work like a charm and would shut down the node on cue, even though I haven't had an audit failure in recent times... so I simulated one by just making the script shut down on all errors...

Something like that would also help protect the node... I must admit I never got around to fully implementing it myself, lol. Of course, if I had a semi-unstable mount point, or even thought it might be one... then I would be very worried... but it's all been rock stable for a long time now.

What's a save mount? I can't find any references to it.
I'm sure I can do it myself as others have, it just takes me too much time to figure it out. Usually I get stuck on some nonintuitive design things.

My adaptation of @BrightSilence 's script ended up looking something like this, I think... I can't find the source on the forum right now though... and I seem to not have it saved...

But I think this one should work for docker... though I ended up installing screen on my Linux box and running it in that while testing it...

docker logs --since "$(date -d "$date -2 minutes" +"%Y-%m-%dT%H:%M")" --until "$(date -d "$date -1 minutes" +"%Y-%m-%dT%H:%M")" storagenode 2>&1 | awk '/(ERROR|canceled|failed).*GET_AUDIT/ {system ("docker stop -t 300 storagenode")}'

It might be a bit resource heavy if you have a big docker log, but you can limit the docker log to a max size or such... all the $date stuff basically just says --since current time minus 2 minutes, --until current time minus 1 minute.

Thus it will always take exactly one minute of log to check, and so there will be no overload... I use the same approach to export my docker logs.
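For completeness, here is one way to repeat that check once per minute without screen, written as a Go sketch rather than a shell loop; it assumes the container is named storagenode and that docker is on the PATH, and it is purely illustrative:

```go
// Illustrative only: a small watcher that mimics the one-liner above by
// checking the previous minute of docker logs for failed audits and stopping
// the container when one shows up.
package main

import (
	"log"
	"os/exec"
	"regexp"
	"time"
)

const container = "storagenode" // assumed container name, as in the one-liner

var auditFailure = regexp.MustCompile(`(ERROR|canceled|failed).*GET_AUDIT`)

func checkOnce(now time.Time) {
	since := now.Add(-2 * time.Minute).Format(time.RFC3339)
	until := now.Add(-1 * time.Minute).Format(time.RFC3339)

	// The node writes its log to stderr inside docker, so capture both streams.
	out, err := exec.Command("docker", "logs", "--since", since, "--until", until, container).CombinedOutput()
	if err != nil {
		log.Printf("could not read logs: %v", err)
		return
	}
	if auditFailure.Match(out) {
		log.Printf("audit failure spotted, stopping %s", container)
		// Give the node up to 300 seconds to shut down cleanly, like the one-liner.
		if err := exec.Command("docker", "stop", "-t", "300", container).Run(); err != nil {
			log.Printf("docker stop failed: %v", err)
		}
	}
}

func main() {
	for now := range time.Tick(time.Minute) {
		checkOnce(now)
	}
}
```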

There are a few different scripts and ways of doing something similar, made by different people... but I cannot remember which thread it was in...

Running it in screen was nice though, because even after it shuts down the node, the script does not terminate, and while it is in a screen / virtual terminal I can always pull it to the front and easily terminate it, instead of it becoming a lost thread I have to use a kill command on...

I think he means the subfolder on the mounted disk, as recommended in the guide: Storage Node - Storj Docs


@Storgeez
Sorry, still quite the rookie in Linux... :smiley:
I meant the media you save your storj data to...

"Mount point" just sounds wrong, because that's not really where the data is... it's just a virtual folder / location. I used the designation "save" to underline that I meant the same mount point where your data / blobs folder is located.

"The mount point onto which you save your storj data" just kinda got shortened down to something that didn't make sense...

The script should work... though it will shut down the node on any audit failure, even just a single one... so be sure you have some sort of downtime alert if you use it, and that you don't really get audit failures too often in the first place...

Works like a charm though.

https://review.dev.storj.io/c/storj/storj/+/2272


Very happy to see this! It'll prevent a lot of lost nodes!


@BrightSilence
Likewise here... it can easily be very disheartening to have spent a long time running a node and then have it die after a year with little rhyme or reason... just because of a cable that got bumped or whatever...

And such SNOs, who could have been potentially useful long-term nodes, end up just giving the project bad PR. So yeah, this is a great change... I didn't know it was in the pipeline, but I suppose I should have assumed it was... it only makes tactical sense.
