Known issues we are working on

I’d say that uplink can kill some cheap routers (network cards or drivers) as well, not just their DNS.

I can’t reproduce that. I have upgraded my router (the old one was too slow for my new internet connection anyway) and the DNS issue is gone. I am running some cheap boxes here and they are running rclone with 4 concurrent transfers just fine (besides the high CPU usage). I don’t see any issues on these boxes.
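If it helps to compare setups: rclone’s upload concurrency is set with the --transfers flag, so a rough equivalent of my workload looks like this (remote name, bucket and path are just placeholders):

# upload with 4 parallel transfers
rclone copy --transfers 4 /local/data remote-storj:my-bucket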

I managed to crash a “Realtek 8168/8111” network card (used as the “inside/LAN” port) of a pfSense router. The card lost its link for a few seconds and then stopped accepting packets. After a reboot it worked, but starting the uplink again made it crash again. This happened while uploading a big file (many gigabytes), rather quickly after starting the upload. Uploading a 100 MB file worked. Uploading the big file over ssh worked as well.

And that is how my first attempt at using Storj for actual things (not just running tests) went.

Now, obviously, most of the blame goes to the network card and its drivers, but there should be some setting to limit the number of connections etc., because I doubt that I managed to find the only router in the world that would crash like that.
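To at least see how many parallel connections the uplink opens while an upload is running (which is presumably what overwhelms the card), something like this works on Linux with iproute2; the process name may differ on your system:

# count established TCP connections held by the uplink process
ss -tnp | grep -c uplink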

How about we fix the DNS issue first and then you try again? I would like to see if that fixes it for you as well.


Can’t the satellite just give node IPs to uplink instead of their hostnames? The satellite has to know the IPs so it can contact the nodes for audits etc.

Please read the description in the first post to understand what the issue currently is.

Sorry, I missed that. The uplink binary for Linux does not seem to do that for me at least (three DNS requests total).
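In case anyone else wants to check the DNS behaviour on their own machine, watching port 53 while the uplink runs is enough; this assumes tcpdump is installed and you can run it as root:

# show every DNS query and response while the uplink is running
tcpdump -ni any port 53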

How about we update this with current information and pin it so people can track it more easily?
For example the used-more-space-than-allocated issue.


I have updated the post and added the following issues.


@littleskunk would you mind adding a note that node suspension for downtime likely won’t be released in prod for another few months? Thanks!


@jennifer I would rephrase that. We need good uptime, otherwise the repair job will cause problems, so I prefer not to spell out the full truth here. Let’s rather say suspension mode for downtime is currently not enabled, but it could be enabled at any time. I also try to avoid time expectations. We do know that it is disabled at the moment; not all related tickets are in the current sprint, so most likely it will not get finished in the next 2 weeks. I can say that with a bit of confidence. I don’t know which priorities we might have in the next sprint. For that reason, “a few months” is not a statement I feel comfortable with.


On the graceful exit issue I found one more:

The storage node submits graceful exit successes in one message at the end of a batch. Graceful exit failures are not batched; the storage node submits them one by one. Less powerful systems / routers can get overloaded by the number of connections. This creates a cycle: in the next batch the storage node will fail even more transfers, which increases the impact of the problem until the storage node finally gets disqualified for too many failures.

I am now 99% sure that exactly this is the big issue in production. I will put it at the top of the list.


The used-more-space-than-allocated bug should be added to the list:

I am currently not aware of any outstanding ticket in our backlog. There was a mismatch between the numbers in the backend and frontend, but that was fixed a few releases ago.

Mismatch? I don’t know, but I still have just under 200 GiB of overrun on the specified storage amount. As quoted, NikolaiYurchenko said it was being worked on. No idea.

Look at the original post here: Bug: SN gross space limit overstep.
Ignore the negative free space and high amount of trash: the node is physically using something like 160 GiB of space more than it shows as “Used” in SNO Board. Currently it shows 6.43 TiB used in the SNO Board with 6.59 TiB actually used on the disk, I’ve checked this with “du” command. This is accurate to within ~10 GiB or so due to rounding in the softwares.

Not sure if I’m doing something wrong perhaps but it seems to me that it’s off quite a bit for some reason.

The storage node does not account for the database size, and the databases can be about 1 GB in total or even more.

I think I did the calculation on the blobs folder only. But the databases aren’t 150 GiB, for sure.

Then check the trash and temp folders. You can also restart the node; it will recalculate the used space. It is a known issue that it shows wrong numbers from time to time, but the recalculation takes time depending on the size and can take a long time.
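For completeness, assuming a docker-based node named “storagenode” and the default layout where the databases sit next to the blobs folder (the mount path is a placeholder), the checks and the restart look roughly like this:

# trash, temp and the databases are not part of the blobs number
du -sh /mnt/storagenode/storage/trash /mnt/storagenode/storage/temp
du -sh /mnt/storagenode/storage/*.db
# restart the node so it re-scans the used space; the scan itself can take hours on a big node
docker restart -t 300 storagenode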

I restarted the node at least once some time in the last several months, and the trash and temp folders would not influence the blobs folder size. I can redo the calculations if needed though, it just takes time.