Changelog v1.17.4

For Storage Nodes

Graceful Exit Fix
The satellite keeps track of the last IP address of each storage node. When an uplink requests a list of storage nodes, the satellite returns that last IP address so that the uplink doesn’t need to resolve any DNS entries. We will now use the same trick for graceful exit, so the exiting node doesn’t need to resolve any DNS entries either. We believe this will reduce the number of connection issues caused by overloaded routers or other network hardware.
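
To make the idea concrete, here is a minimal Go sketch (with illustrative types and names, not the actual storagenode code): dial the last-known IP the satellite handed out, and only fall back to resolving the DNS name if that dial fails.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// NodeAddress is a hypothetical view of what the satellite hands out:
// the DNS name a node announced plus the last IP the satellite saw it on.
type NodeAddress struct {
	DNSName string // e.g. "mynode.example.com:28967"
	LastIP  string // e.g. "203.0.113.7:28967"
}

// dialNode prefers the last-known IP so no DNS lookup is needed,
// and only falls back to resolving the DNS name if that fails.
func dialNode(ctx context.Context, addr NodeAddress) (net.Conn, error) {
	d := net.Dialer{Timeout: 10 * time.Second}

	if addr.LastIP != "" {
		conn, err := d.DialContext(ctx, "tcp", addr.LastIP)
		if err == nil {
			return conn, nil
		}
		fmt.Printf("dial via last IP failed (%v), falling back to DNS\n", err)
	}
	// Fallback: this path resolves the DNS name and may hit an
	// overloaded router's DNS forwarder.
	return d.DialContext(ctx, "tcp", addr.DNSName)
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 15*time.Second)
	defer cancel()

	conn, err := dialNode(ctx, NodeAddress{
		DNSName: "node.example.com:28967",
		LastIP:  "203.0.113.7:28967",
	})
	if err != nil {
		fmt.Println("could not reach node:", err)
		return
	}
	defer conn.Close()
	fmt.Println("connected to", conn.RemoteAddr())
}
```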

Handling Corrupted Order Files
The storage node will now detect and handle corrupted orders and corrupted order files. In the worst case it will skip an entire order file and simply continue with the next one, to make sure the storage node still gets paid for every valid order it is holding.
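
Roughly what that behaviour looks like, as a minimal Go sketch with a made-up record format (the real on-disk order file format is different): decode records one by one and, on the first corrupted record, stop reading that file and keep the valid orders collected so far.

```go
package main

import (
	"encoding/binary"
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"os"
)

// Order is a stand-in for a signed order; the real on-disk format is
// different, this only illustrates the skip-on-corruption behaviour.
type Order struct {
	SerialNumber string `json:"serial_number"`
	Amount       int64  `json:"amount"`
}

// readOrders decodes length-prefixed JSON records. On the first record
// that fails to decode, it stops reading this file and returns whatever
// valid orders it collected so far, so those can still be settled.
func readOrders(path string) ([]Order, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	var orders []Order
	for {
		var size uint32
		if err := binary.Read(f, binary.BigEndian, &size); err != nil {
			if errors.Is(err, io.EOF) {
				return orders, nil // clean end of file
			}
			fmt.Printf("%s: corrupted length prefix, skipping rest of file: %v\n", path, err)
			return orders, nil
		}
		buf := make([]byte, size)
		if _, err := io.ReadFull(f, buf); err != nil {
			fmt.Printf("%s: truncated record, skipping rest of file: %v\n", path, err)
			return orders, nil
		}
		var o Order
		if err := json.Unmarshal(buf, &o); err != nil {
			fmt.Printf("%s: corrupted record, skipping rest of file: %v\n", path, err)
			return orders, nil
		}
		orders = append(orders, o)
	}
}

func main() {
	for _, path := range os.Args[1:] {
		orders, err := readOrders(path)
		if err != nil {
			fmt.Printf("%s: cannot open, continuing with next file: %v\n", path, err)
			continue
		}
		fmt.Printf("%s: %d valid orders recovered\n", path, len(orders))
	}
}
```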

Untrusted Satellites in Payout History
With the removal of Stefan’s satellite, the corresponding line was missing from the payout history. The storage node dashboard will now display the payout, but with an empty satellite name. We hope this minimal solution will work for now.

Order Submission Phase 3
Over the last few releases, we have transitioned to a new accounting system. In the old accounting system, the satellite kept track of submitted and unsubmitted serial numbers. This was needed to reject double submissions, but it was an expensive validation. In the new accounting system, storage nodes group their orders by creation hour and submit each batch to the satellite once. The satellite writes the sum into the accounting tables. If a storage node tries to submit an order twice, the satellite will notice that it already has an entry in the accounting table and reject the double submission. In the future (not in the current release) we expect better performance and better scaling. First, we need to remove some leftovers from the old accounting system.
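
A minimal Go sketch of the windowed scheme, with invented types rather than the actual satellite code: orders are bucketed by creation hour, each window is submitted once, and a second submission of the same window is rejected because an accounting entry already exists.

```go
package main

import (
	"fmt"
	"time"
)

// Order is a minimal stand-in for a signed order held by a storage node.
type Order struct {
	SerialNumber string
	Amount       int64
	CreatedAt    time.Time
}

// groupByWindow buckets orders by the hour they were created in,
// which is the unit the node submits to the satellite.
func groupByWindow(orders []Order) map[time.Time][]Order {
	windows := make(map[time.Time][]Order)
	for _, o := range orders {
		w := o.CreatedAt.Truncate(time.Hour)
		windows[w] = append(windows[w], o)
	}
	return windows
}

// accounting mimics the satellite side: one summed entry per window;
// a second submission for the same window is rejected.
type accounting struct {
	settled map[time.Time]int64
}

func (a *accounting) submitWindow(window time.Time, orders []Order) error {
	if _, ok := a.settled[window]; ok {
		return fmt.Errorf("window %s already settled, rejecting double submission", window)
	}
	var sum int64
	for _, o := range orders {
		sum += o.Amount
	}
	a.settled[window] = sum
	return nil
}

func main() {
	now := time.Now()
	orders := []Order{
		{SerialNumber: "a", Amount: 100, CreatedAt: now},
		{SerialNumber: "b", Amount: 250, CreatedAt: now.Add(10 * time.Minute)},
		{SerialNumber: "c", Amount: 75, CreatedAt: now.Add(-2 * time.Hour)},
	}

	sat := &accounting{settled: make(map[time.Time]int64)}
	for window, batch := range groupByWindow(orders) {
		if err := sat.submitWindow(window, batch); err != nil {
			fmt.Println("submission rejected:", err)
			continue
		}
		fmt.Printf("settled window %s: %d orders\n", window.Format("2006-01-02 15:00"), len(batch))
	}

	// A retry of an already-settled window is rejected by the satellite.
	for window, batch := range groupByWindow(orders) {
		if err := sat.submitWindow(window, batch); err != nil {
			fmt.Println("submission rejected:", err)
		}
	}
}
```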

21 Likes

"satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "error": "order: unable to connect to the satellite: rpc: dial tcp 35.236.51.151:7777: connectex: No connection could be made because the target machine actively refused it.", "errorVerbose": "order: unable to connect to the satellite: rpc: dial tcp 35.236.51.151:7777: connectex: No connection could be made because the target machine actively refused it.\n\tstorj.io/storj/storagenode/orders.(*Service).settleWindow:454\n\tstorj.io/storj/storagenode/orders.(*Service).sendOrdersFromFileStore.func1:412\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}
2020-11-23T14:25:56.414-0500 ERROR contact:service ping satellite failed {"Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "attempts": 6, "error": "ping satellite error: rpc: dial tcp 35.236.51.151:7777: connectex: No connection could be made because the target machine actively refused it.", "errorVerbose": "ping satellite error: rpc: dial tcp 35.236.51.151:7777: connectex: No connection could be made because the target machine actively refused it.\n\tstorj.io/common/rpc.TCPConnector.DialContextUnencrypted:107\n\tstorj.io/common/rpc.TCPConnector.DialContext:71\n\tstorj.io/common/rpc.Dialer.dialTransport:146\n\tstorj.io/common/rpc.Dialer.dial:116\n\tstorj.io/common/rpc.Dialer.DialNodeURL:80\n\tstorj.io/storj/storagenode/contact.(*Service).pingSatelliteOnce:124\n\tstorj.io/storj/storagenode/contact.(*Service).pingSatellite:95\n\tstorj.io/storj/storagenode/contact.(*Chore).updateCycles.func1:87\n\tstorj.io/common/sync2.(*Cycle).Run:152\n\tstorj.io/common/sync2.(*Cycle).Start.func1:71\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}

looks like saltlake is down

We are migrating Kubernetes Clusters. What you are facing is a temporary DNS issue. This problem should disappear for you soon.

2 Likes

What happens if the storage node code keeps producing malformed order files? For example, if some future bug corrupts every new order file?

Then the storage node would skip everything it can’t read.

Wouldn’t this lead to another wave of forum posts of “I haven’t been paid this month”? Just asking.

Yes, that would be the consequence of your hypothetical future bug.

I do not know what you would have to do to corrupt all order files.

Bugs happen. Just that. Don’t worry, it’s just a hypothetical for now. Besides, if Storj engineers believe that the risk of having a systematic error in order file generation is smaller than an occasional error, then it’s fine.

I think it is a much better system. Today, if there is corruption, we usually only discover it next month, when the payout is smaller and it is too late to submit those orders. With the new system we will lose only the corrupted part.

2 Likes

Handling Corrupted Order Files sounds great, thanks. Though does anyone know why the order files get corrupted in the first place? This is too common to be a one-off hardware issue, and it would be good to understand the cause. If SNOs stored data files with this error rate, we would all be disqualified by now.

A hard drive write cache and/or a file system without journaling are common mistakes.

2 Likes

Was there any evidence of this being the cause?

I had corrupted orders on a few of my nodes, all of which have 100% audit/uptime, and I’m using ext4 with Filesystem features: has_journal. What would I check to confirm the drive cache issue?

1 Like

I don’t know your specific setup. I am just saying what the common mistakes are.

Docker kill timeout would be another easy one.

Or power interruptions or other unsafe shutdowns, or unplugging an external disk unexpectedly. There are many things that can cause data corruption.

1 Like

Will this version be released automatically? Meaning, will my node update automatically, or do I need to stop the node, remove the Docker image, and relaunch it? Is this already in production? Because my node is still on 1.16.1.

Rollout of version 1.17 isn’t done yet. Give it a few more days.

As long as you have watchtower set up (or the equivalent if you’re on Windows… I’m not sure how it works on Windows), it should update automatically in the coming days.

For the GUI install, Windows has a storagenode-updater executable that updates automatically.

1 Like

got it on docker this morning

1 Like