Known issues we are working on

littleskunk · January 26, 2020, 11:33am

I would like us to avoid wasting time testing things that are known issues. Here is a list of issues that we are aware of and currently working on.
14.08.2020

Uplink fast at the beginning, slow at the end
The uplink is currently unable to track the actual upload and download speed. It only tracks the speed at which the file is read into the buffer. At the beginning of a segment the buffer fills quickly, and at the end of the segment it looks like there is no progress while the buffer is still being drained and sent to the network. We are working on it but we might not be able to fix it in time. We understand that the uplink progress bar is confusing but it would be even worse if we removed it. Please use external tools to watch the transfer speed.

Upload failing with not enough pieces
Our protocol is optimized for fast transfers. We cut off slow connections in order to finish the transfer as fast as possible. The downside is that we currently sometimes cut off too many connections and the upload might fail. This happens especially when there is low bandwidth on the uplink side. As a workaround we recommend reducing the number of parallel transfers. QoS on the uplink side will also increase the risk. Instead of QoS we recommend a bandwidth limit in the application. We are working on a solution. It should be possible to upload large files even over very slow internet connections.

Billing includes download overhead
For fast download speed the uplink downloads more pieces than required to reconstruct the file. With the current Reed-Solomon settings and the implementation of the uplink the overhead is up to 30%. This additional traffic will show up in billing. We promised to be half the price of other cloud storage providers and are looking into other download implementations with similar speed but less overhead.
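
To get a feeling for what the overhead means on the invoice, here is a minimal sketch of the arithmetic. It assumes the worst case of 30% stated above; the function name and the integer percentage parameter are made up for illustration and are not part of the uplink or the satellite.

```python
def billed_download_bytes(file_bytes: int, overhead_percent: int = 30) -> int:
    """Upper bound on billed egress for one download.

    overhead_percent is the worst case from the post (up to 30%);
    the real overhead of a given download can be lower.
    """
    # Integer arithmetic avoids floating point rounding surprises.
    return file_bytes * (100 + overhead_percent) // 100


one_gb = 10**9
print(billed_download_bytes(one_gb))  # 1300000000
```

So a 1 GB download can show up as up to 1.3 GB of egress in billing.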

Billing includes zombie segments
We split file uploads into 64M segments. A file with, let’s say, 3 segments will create an s0, s1 and l segment in the database. The l segment is the last segment and it is the important one. If you call a listing it will search in the database for all l segments. Zombie segments don’t have an l segment. This can happen by canceling an upload after the first few segments. Zombie segments can’t be listed and they can’t be deleted. You can get rid of them by uploading a file to the same path and then deleting the new file. That will clean up the zombie segment.
On the satellite we are running a cleanup job twice per week. It will delete all zombie segments after 2 days. This is only a short term solution. Long term we want to eliminate zombie segments with the metainfo refactoring.
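
The difference between a listable file and a zombie can be sketched in a few lines. This is only a toy model of the description above: the (path, segment) tuples and the function name are assumptions for illustration and do not reflect the real metainfo database schema.

```python
from collections import defaultdict


def split_listable_and_zombie(segment_keys):
    """Group segment keys by path and separate paths that have an
    'l' (last) segment from paths that only have s0, s1, ... segments."""
    by_path = defaultdict(set)
    for path, segment in segment_keys:
        by_path[path].add(segment)
    # A listing only finds paths with an 'l' segment.
    listable = sorted(p for p, segs in by_path.items() if "l" in segs)
    # Everything else is invisible to listing: zombie segments.
    zombies = sorted(p for p, segs in by_path.items() if "l" not in segs)
    return listable, zombies


keys = [
    ("bucket/a.bin", "s0"), ("bucket/a.bin", "s1"), ("bucket/a.bin", "l"),
    ("bucket/b.bin", "s0"), ("bucket/b.bin", "s1"),  # upload canceled early
]
listable, zombies = split_listable_and_zombie(keys)
print(listable)  # ['bucket/a.bin']
print(zombies)   # ['bucket/b.bin']
```

In this model bucket/b.bin still occupies space but never shows up in a listing, which is why it can end up on the bill without the customer being able to see or delete it.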

Open previous summary

We are working on improving the zombie segment reaper to catch all zombie segments. At the moment it will ignore files with more than 64 segments and we are running it less than once per month which means we are still billing the zombie segments more or less for an entire month.

Download limit not matching billing information
Every time a customer requests a download the satellite will allocate traffic. A few hours later storage nodes will submit orders and the satellite will increase settled traffic. We use settled traffic for customer billing. It is accurate but it has a delay of up to 2 days which is a problem for the download limit. Allocated traffic is not as accurate but it has no delay. We use that for the download limit.
If a customer requests a download but doesn’t execute it that will increase allocated traffic but not settled traffic.
The issue is that we use allocated traffic for the entire month. We are working on a fix to use allocated traffic only for the last 2 days and otherwise settled traffic. This would mean that the download limit and billing will only differ for 2 days. After 2 days the satellite can see how many downloads have been requested but not executed and can adjust the download limit.
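
The planned fix can be sketched roughly like this. It is a simplified model under my own assumptions (per-day totals in plain dictionaries, a hypothetical function name and a 2 day window parameter), not the satellite code.

```python
from datetime import date, timedelta


def usage_for_download_limit(allocated_by_day, settled_by_day, today, window_days=2):
    """Use allocated traffic for the last `window_days` days and
    settled traffic for everything older than that."""
    cutoff = today - timedelta(days=window_days)
    total = 0
    for day in set(allocated_by_day) | set(settled_by_day):
        if day > cutoff:
            # Too recent: orders may not be settled yet, use the allocation.
            total += allocated_by_day.get(day, 0)
        else:
            # Old enough: settled traffic is accurate.
            total += settled_by_day.get(day, 0)
    return total


today = date(2020, 8, 14)
allocated = {date(2020, 8, 10): 100, date(2020, 8, 13): 50, date(2020, 8, 14): 20}
settled = {date(2020, 8, 10): 60}  # 40 were requested but never downloaded

print(usage_for_download_limit(allocated, settled, today))  # 130
print(sum(allocated.values()))                              # 170
```

With allocated traffic for the whole month the limit would count 170. With the hybrid approach it counts 130, because the 40 units that were requested but not executed on August 10 drop out once that day is older than the window.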

Graceful exit disqualification
Most of the storage nodes can finish graceful exit just fine but there are a few edge cases and some storage nodes are getting disqualified because of bugs. Here are the bugs that we currently know of (ordered by likelihood):

1. The storage node will submit graceful exit success in one message at the end of a batch. Graceful exit failures are not batched. The storage node will submit them one by one. Less powerful systems / routers can get overloaded by the number of connections. This creates a cycle. In the next batch the storage node will fail even more transfers which will increase the impact of this problem until the storage node finally gets disqualified for too many failures.
2. The storage node has problems with corrupted pieces. The storage node should identify the corruption, report it back to the satellite and continue. As long as the failure rate is low graceful exit should still be successful. For some reason the storage node identifies the corrupted piece but something is messing up the entire batch. The storage node doesn’t continue as expected. We are working on a fix.
3. Graceful exit is transferring the pieces in a specific order. It starts with pieces that are close to the repair threshold. These are most likely older pieces and they have a higher likelihood of getting corrupted over time. This means a storage node with an overall failure rate of 10% will get most of these failures at the beginning of graceful exit. The satellite is judging after each batch. The storage node might get disqualified early and has no chance to show the low failure rate at the end of graceful exit. We are working on a fix.
4. After each successful batch the storage node reports the results back to the satellite. We have a connection timeout in place to prevent storage nodes from getting stuck but no retry. This means the results are getting lost and a retry will be triggered. Fix is incoming.
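
To illustrate the point about transfer order, here is a small sketch of how a 10% overall failure rate can look much worse in the first batches. The numbers (1000 pieces, batches of 100, 60 and 40 failures in the first two batches) are invented for the example and the function is hypothetical. The real batch sizes and thresholds on the satellite are not part of this sketch.

```python
def batch_failure_rates(results, batch_size):
    """results is a list of booleans in transfer order, True = transfer failed.
    Returns the failure rate of each batch."""
    rates = []
    for start in range(0, len(results), batch_size):
        batch = results[start:start + batch_size]
        rates.append(sum(batch) / len(batch))
    return rates


# 1000 pieces, 100 failures overall, all among the oldest pieces
# that get transferred first.
results = [True] * 60 + [False] * 40 + [True] * 40 + [False] * 860

rates = batch_failure_rates(results, batch_size=100)
overall = sum(results) / len(results)

print(rates[:3])  # [0.6, 0.4, 0.0]
print(overall)    # 0.1
```

A satellite that judges after each batch sees 60% and 40% failures first, even though the node would end up at 10% if it were allowed to finish.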

Conclusion:
Let’s say there is a 50% chance that you might get disqualified when you execute graceful exit. Would you leave the network anyway or would you stay and wait for a fix? If you would leave anyway you can risk it. The overall success rate is high and most likely it will work. If you see any kind of audit errors in your logs (corrupted or missing pieces) I would recommend waiting for the fix.
If you have a less powerful router I would recommend reducing the batch size to lower the risk of running into issue 1.

Remaining coupon value not updated
Meanwhile fixed

Open previous summary

The coupon is a one-time coupon with an expiry date. Please be aware that we will charge you as soon as the entire coupon has been used or has expired. At the moment we are not showing the remaining coupon value on the satellite web UI. You might get charged at the end of the month.

DQ after finishing graceful exit
Meanwhile fixed

Open previous summary

Storage nodes can get disqualified after finishing graceful exit. Don’t worry: as long as your node received the signed success message you will get the held back payment. If you failed graceful exit you still get nothing.

History of fixed issues:

Libuplink kills router DNS
Fixed with latest libuplink version

Open previous summary

The satellite is resolving the DNS entries and returns a list of IP addresses to the uplink but for some reason the uplink is still sending DNS requests to the router. For cheap routers that is too much and the overall internet connection will be impacted. I have seen this behavior especially with rclone but it should affect other tools as well.

Storage node payment dashboard showing wrong data
We have implemented the first version of it. Stay tuned for additional updates. Here is what you currently get:

Screenshot + explanation

On the first screen you can see the payout for the current month across all satellites.

The total held amount is a bit off in my case because I have already received part of my held back amount. I believe that the held amount reduction is not factored in. However, this is an edge case. For most storage nodes the amount should be accurate.
The little red box is there because the held back amount for the current month is missing. I will not receive the 6.17 payout. About 2 will be held back on saltlake. I would expect to see $4 or something like that.

On the second screen, you can see the breakdown for the selected satellite.

Breakdown per satellite is working.
The node age and held amount rate indicator at the bottom is correct as well.
The held amount rate is sometimes not updated. I have seen it showing 0% while the held amount rate indicator at the bottom was showing the correct value.
The total held amount was not updated and is still showing the total for all satellites but there is a workaround to get that information (screen 3).

On the third screen, I have selected a previous month on the same satellite.

The total held amount is now showing the amount for this satellite. As long as I keep selecting the previous month I can go through all satellites and write down the held back amounts for each satellite.
One small issue: it doesn’t work for new satellites like EU-north. I would expect to see 0 held amount on that satellite but instead, it will show me the held amount of the previously selected satellite. By selecting back and forth between different satellites you can find out if the value is updated or still showing the previously selected satellite.
The node age at the bottom should now display 10 months instead of 11 months because I am going back in time. If I select the first payout on this satellite I expect to see 75% held amount rate.
The USD amounts are without surge pricing. I would have to multiply the numbers by 2.5.

Zombie Segments
Fixed with v0.33.4. Please let us know if you still have problems with zombie segments.

Open previous summary

We split file uploads into 64M segments. A file with, let’s say, 3 segments will create an s0, s1 and l segment in the database. The l segment is the last segment and it is the important one. If you call a listing it will search in the database for all l segments. Zombie segments don’t have an l segment. This can happen by canceling an upload after the first few segments. Zombie segments can’t be listed and they can’t be deleted. If you try to upload a file with the same path it will error out because you can’t overwrite the zombie segment. The only way around it is uploading the file to a different path.
We have a zombie segment reaper but even if we called it every day it would only delete zombie segments that are older than 3 days. It is a cleanup job but not a solution. The developer team is working on it. I will keep you updated.

Slow Deletes
Fixed over several versions. Storage nodes should get most of the delete messages in time. The only exceptions are storage nodes that are in suspension mode or disqualified, because repair will move the data without notification. The zombie segment reaper will also not send any delete messages. In these edge cases garbage collection will still kick in.

Open previous summary

Fixed with v0.31.12 but we have a new bug now. The satellite is too slow to communicate with all storage nodes and will drop most of the delete messages. GC has to handle it.

Open previous summary

We already moved the delete messages from the uplink to the satellite. Now the uplink only has to tell the satellite to delete a file and the satellite will contact the storage nodes. The performance is better but still not as good as we would like to have it. We are working on it. The satellite has to return as quickly as possible and send the delete messages in the background.

Timeouts and slow satellite responses
Fixed with v0.31.12

Open previous summary

~~We are working on that one as well but I don’t have a good description of the problem at the moment.~~

Upload fails with less than 80 pieces
Fixed with v0.31.12 in combination with a few other bugs that still need to be fixed.

Open previous summary

If you are using an old bucket that was created with an old uplink you are starting the upload with old Reed-Solomon settings. In the next release we will change the behavior on the satellite side and make sure we ignore the Reed-Solomon settings that are stored with the bucket. The new setting should be 110 instead of 95. This will give us a higher error tolerance. We are also hunting down the bad storage nodes and trying to fix some of the error messages they are returning. The next release will stop some of these nodes from starting.
I am not sure if it is a good idea to wait for the next release. If you are affected by this issue please make sure you are using the latest uplink and create a new bucket with that uplink. If possible please run uploads with log level debug and give us the output. It will contain all the storage node errors.

Bandwidth accounting delayed by 4 days
Fixed.

Open previous summary

Customer and storage node bandwidth accounting is delayed by 4 days.

Speed of graceful exit
Fixed with v0.31.12. Graceful exit is able to move data quickly. We have some tickets to improve the performance a bit more but at the current state the speed is acceptable.

Open previous summary

Should also get better with the next release. I have changed most of the graceful exit settings. I wouldn’t say it is fixed. I am only trying to get the maximum speed out of the current graceful exit implementation. My hope is that this will be enough to ignore the limitation of the current implementation for the moment.

Garbage collection deletes unpaid data after 7 days
Fixed

Open previous summary

Uplinks don’t have to send delete messages to the storage nodes. The satellite keeps the deleted segments in memory and contacts the storage nodes in the background. At the moment the satellite is unable to drain the queue fast enough. Garbage collection will kick in even if the storage node was online all the time and didn’t miss any delete messages. The storage node will keep the unpaid data for an additional 7 days. The 7 day delay is intentional and we are not going to change that. I have created an issue to make sure the satellite is able to send out the delete messages quickly and doesn’t fall back on garbage collection.

Project limits not reset
Fixed

Open previous summary

The project limits are getting reset after 30 days which is not the end of the month.
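
A quick way to see why a fixed 30 day period drifts away from the calendar month. This is only date arithmetic with hypothetical helper names, not the satellite logic.

```python
from datetime import date, timedelta


def reset_after_30_days(period_start: date) -> date:
    """Reset date if the limit is reset a fixed 30 days after the start."""
    return period_start + timedelta(days=30)


def reset_at_next_month(period_start: date) -> date:
    """Reset date if the limit is reset on the first day of the next month."""
    if period_start.month == 12:
        return date(period_start.year + 1, 1, 1)
    return date(period_start.year, period_start.month + 1, 1)


start = date(2020, 1, 1)
print(reset_after_30_days(start))  # 2020-01-31
print(reset_at_next_month(start))  # 2020-02-01
```

In this example the 30 day reset already happens on January 31, one day before the billing month ends, and the gap keeps changing with the length of each month.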

1 TB coupon missing
Fixed

Open previous summary

The 1 TB coupon is only getting triggered by a single STORJ transaction of 50$ or more. 2 transactions with 30$ each will not trigger it. Credit card should always trigger it.
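
The behavior described above boils down to checking each transaction on its own instead of the sum. A minimal sketch, with a made-up function name and the 50$ threshold from the post:

```python
def coupon_triggered(transactions_usd, threshold_usd=50):
    """Behavior as described: only a single transaction at or above
    the threshold triggers the coupon, the sum does not count."""
    return any(amount >= threshold_usd for amount in transactions_usd)


print(coupon_triggered([30, 30]))  # False, although the sum is 60
print(coupon_triggered([50]))      # True
```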

Low request limit
Fixed

Open previous summary

By default each project is limited to 10 requests. The limit is too low. We are running tests to find a better value.

KernelPanick · January 28, 2020, 1:52am

How can node owners detect if they are affected by one of the issues?

littleskunk · January 28, 2020, 1:58am

The storage node will stop working at some point. In the log messages you will find the reason for it.

Alexey · February 4, 2020, 8:53pm

8 posts were split to a new topic: I had my node simply stop working with just download and upload requests

littleskunk · February 4, 2020, 2:26am

Updated with the current situation after v0.31.12

jensamberg · February 12, 2020, 7:44pm

Which issues are now fixed? And which issues still need to be solved before going to production?

heunland · February 12, 2020, 7:51pm

This is likely to be published with the changelog for the next release. I would not expect an update here until the coming release is out.

twl · February 12, 2020, 7:54pm

When might that be?

Still a ton to do, keep it going, folks.

jensamberg · February 12, 2020, 8:10pm

When is the next release? The Aha roadmap is not up to date.

heunland · February 12, 2020, 9:38pm

We don’t have an ETA for the next release yet.

Odmin · February 13, 2020, 7:25am

@littleskunk Could you please add “DNS resolve issue” to the list?
I see the fix but I don’t see this problem on the list.

Storgeez · February 14, 2020, 7:15am

This thread should be pinned.

Great, so they actually implemented that! That should significantly improve the speed!

jocelyn · February 14, 2020, 4:32pm

Hi @jensamberg I spoke with our product manager @brandon. He said there are actually 3 new roadmaps coming out. So in the next few weeks they should be visible. Thanks for paying attention, we love when the community does that!

littleskunk · February 20, 2020, 11:05pm

Updated to reflect the current version v0.33.4

littleskunk · February 26, 2020, 11:17pm

That is interesting. Now I can’t edit my post anymore

The bandwidth accounting offset is fixed with v0.34.2. No change on the other issues.

Alexey · February 26, 2020, 11:19pm

You can make it wiki.

Made it wiki

twl · February 27, 2020, 8:24am

“will be fixed” or “is fixed”?

nerdatwork · February 27, 2020, 9:00am

“is fixed” because the code is in the commit for 0.34.2

“will be fixed” would have been suitable when devs were working on a fix.

littleskunk · April 14, 2020, 10:18am

List updated. New known issues are:

Libuplink kills router DNS
Billing includes zombie segments
Remaining coupon value not updated
DQ after finishing graceful exit
Storage node payment dashboard showing wrong data