Known issues we are working on

I’d say that uplink can kill some cheap routers (network cards or drivers) as well, not just their DNS.

I can’t reproduce that. I have upgraded my router (the old one was too slow for my new internet connection anyway) and the DNS issue is gone. I am running some cheap boxes here and they are running rclone with 4 concurrent transfers just fine (besides the high CPU usage). I don’t see any issues on these boxes.
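If it helps to compare setups: rclone’s upload concurrency is set with the --transfers flag, so a rough equivalent of my workload looks like this (remote name, bucket and path are just placeholders):

# upload with 4 parallel transfers
rclone copy --transfers 4 /local/data remote-storj:my-bucket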

I managed to crash a “Realtek 8168/8111” network card (used as the “inside/LAN” port) of a pfSense router. The card lost its link for a few seconds and then stopped accepting packets. After a reboot it worked, but starting the uplink again made it crash again. This happened while uploading a big file (many gigabytes), rather quickly after starting the upload. Uploading a 100 MB file worked. Uploading the big file over ssh worked as well.

And that is how my first attempt at using Storj for actual things (not just running tests) went.

Now, obviously, most of the blame goes to the network card and its drivers, but there should be some setting to limit the number of connections etc., because I doubt that I managed to find the only router in the world that would crash like that.
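To at least see how many parallel connections the uplink opens while an upload is running (which is presumably what overwhelms the card), something like this works on Linux with iproute2; the process name may differ on your system:

# count established TCP connections held by the uplink process
ss -tnp | grep -c uplink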

How about we fix the DNS issue first and then you try again? I would like to see if that fixes it for you as well.


Can’t the satellite just give node IPs to uplink instead of their hostnames? The satellite has to know the IPs so it can contact the nodes for audits etc.

Please read the description in the first post to understand what the issue currently is.

Sorry, I missed that. The uplink binary for Linux does not seem to do that for me at least (three DNS requests total).
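In case anyone else wants to check the DNS behaviour on their own machine, watching port 53 while the uplink runs is enough; this assumes tcpdump is installed and you can run it as root:

# show every DNS query and response while the uplink is running
tcpdump -ni any port 53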

How about we update this with current information and pin it so people can track it more easily?
For example the used-more-space-than-allocated issue.


I have updated the post and added the following issues.


@littleskunk would you mind adding a note that node suspension for downtime likely won’t be released in prod for another few months? Thanks!


@jennifer I would rephrase that. We need good uptime, otherwise the repair job will cause problems, so I prefer not to spell out the full truth here. Let’s rather say suspension mode for downtime is currently not enabled, but it could be enabled at any time. I also try to avoid time expectations. We do know that it is disabled at the moment; not all related tickets are in the current sprint, so most likely it will not get finished in the next 2 weeks. I can say that with a bit of confidence. I don’t know which priorities we might have in the next sprint. For that reason, “a few months” is not a statement I feel comfortable with.


On the graceful exit issue I found one more:

The storage node submits graceful exit successes in one message at the end of a batch. Graceful exit failures are not batched; the storage node submits them one by one. Less powerful systems / routers can get overloaded by the number of connections. This creates a cycle: in the next batch the storage node will fail even more transfers, which increases the impact of the problem until the storage node finally gets disqualified for too many failures.

I am now 99% sure that exactly this is the big issue in production. I will put it at the top of the list.


The used-more-space-than-allocated bug should be added to the list:

I am currently not aware of any outstanding ticket in our backlog. There was a mismatch between the numbers in the backend and frontend, but that was fixed a few releases ago.

Mismatch? I don’t know, but I still have just under 200 GiB of overrun on the specified storage amount. As quoted, NikolaiYurchenko said it was being worked on. No idea.

Look at the original post here: Bug: SN gross space limit overstep.
Ignore the negative free space and high amount of trash: the node is physically using something like 160 GiB of space more than it shows as “Used” in SNO Board. Currently it shows 6.43 TiB used in the SNO Board with 6.59 TiB actually used on the disk, I’ve checked this with “du” command. This is accurate to within ~10 GiB or so due to rounding in the softwares.

Not sure if I’m doing something wrong perhaps but it seems to me that it’s off quite a bit for some reason.

The storage node does not account for the database size, and the databases can be about 1 GB in total or even more.

I think I did the calculation on the blobs folder only. But the databases aren’t 150 GiB, for sure.

Then check the trash and temp folders. You can also restart the node; it will recalculate the used space. It is a known issue that it shows wrong numbers from time to time, but the recalculation takes time depending on the size and can take a long time.
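For completeness, assuming a docker-based node named “storagenode” and the default layout where the databases sit next to the blobs folder (the mount path is a placeholder), the checks and the restart look roughly like this:

# trash, temp and the databases are not part of the blobs number
du -sh /mnt/storagenode/storage/trash /mnt/storagenode/storage/temp
du -sh /mnt/storagenode/storage/*.db
# restart the node so it re-scans the used space; the scan itself can take hours on a big node
docker restart -t 300 storagenode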

I restarted the node at least once some time in the last several months, and the trash and temp folders would not influence the blobs folder size. I can redo the calculations if needed though, it just takes time.