Tardigrade test results from May to August 2020

I would like to share my test results with the community. I do not claim to be the ultimate truth; I just want to share some moments I found interesting during these tests.
Test platform: several servers with AMD Zen 2 (Ryzen and EPYC) CPUs, a guaranteed 1 Gbps channel and NVMe SSDs.
I used several testing points in Europe (France, Germany, Finland, central Russia, i.e. the western part of the map), Canada and the USA.

Logs:

  1. 512 MiB file:
  2. A few 400 GiB+ files

  3. Effective speed is only 15% of 1 Gbps! (Real avg. 386 Mbps incl. ×2.75 overhead; the arithmetic is spelled out after this list.)
    (screenshot: eth-speed)
  4. Time to delete a big file… and a bonus from the satellite :slight_smile:

2020-07-22T20:34:15.090+0200 DEBUG Unrecoverable error {“error”: “uplink: metainfo error: context canceled”, “errorVerbose”: “uplink: metainfo error: context canceled\n\tstorj.io/uplink/private/metainfo.(*Client).BeginDeleteObject:631\n\tstorj.io/uplink/private/metainfo.(*DB).DeleteObject:112\n\tstorj.io/uplink.(*Project).DeleteObject:99\n\tstorj.io/storj/cmd/uplink/cmd.deleteObject:51\n\tstorj.io/private/process.cleanup.func1.4:349\n\tstorj.io/private/process.cleanup.func1:367\n\tgithub.com/spf13/cobra.(*Command).execute:840\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:945\n\tgithub.com/spf13/cobra.(*Command).Execute:885\n\tstorj.io/private/process.ExecWithCustomConfig:88\n\tmain.main:16\n\truntime.main:203”}

The error appears after 10 minutes (reproduced more than 30 times). The frequency of this error directly depends on whether I delete 1-2 files (then the deletion completes in 6-9 minutes) or 10 files (then I got the error in 100% of cases).
  5. Zombie, zombie, zombie! I ran out of space and the file transfer was interrupted. Based on my experience with other services, support should have had time to increase the limit to 5 TB, but this did not happen within 18 hours.


  6. Support testing. Here I had a failure, which is described in a separate topic.

  7. Active node count (total detected by uplink). It gradually decreased from ~4500 at the end of May to ~3000 a week ago.
    My private IP counter, May 30)) (screenshot)
  8. The list of nodes that receive a file is approximately the same for each satellite. Collecting data from the interface showed roughly the same set of addresses when uploading a 100 GB file.
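
To spell out the arithmetic behind item 3 (my assumption: the 386 is Mbps measured on the wire, i.e. including the expansion overhead): 386 / 2.75 ≈ 140 Mbps of effective payload, which is roughly 14-15% of the 1 Gbps link.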


Thanks for sharing your test results, it is very interesting for me, especially to compare with my own in this thread.


I was inspired by your testing and felt that an update was missing. I wanted to see what has changed since the release.


Update: after more than 24h, my request to increase account space is still pending, which looks a little strange (lack of resources?).

Update 2: the request is still not resolved after ~48h.

  1. Uplink does not attempt to re-upload and/or re-download a segment if the first attempt fails. It’s “fun” when an error occurs on the last segments (not even through my fault) and the transfer is interrupted, but you still have to pay for it. (A crude workaround sketch follows.)
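
Until uplink retries failed segments on its own, the only workaround I see is a crude client-side retry of the whole transfer. A minimal sketch (the bucket and file names are placeholders, and note that this restarts the upload from scratch rather than resuming at the failed segment):

    # Crude client-side workaround: retry the whole transfer a few times,
    # because a single failed segment currently aborts the entire uplink cp.
    for attempt in 1 2 3; do
      uplink cp big.bin sj://bucket/big.bin && break
      echo "attempt $attempt failed, retrying" >&2
    done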

They have to fix these issues, it’s a very crude way to handle such errors.

I don’t know exactly what your test pattern looks like, but there are well over 3000 nodes participating in the network. I suspect that your tests are mostly dealing with new objects; in that case, they would mostly only see nodes which are accepting new data (nodes which are not full).

If that fits your definition of “active”, then that looks reasonably accurate. There are a lot of full nodes. We are exploring how much of a problem that is, or if it’s a problem at all.


Can you share some more details of the test scenario? I don’t think we expose any API for deleting multiple files at once in a single request, yet, so I presume you are making multiple requests. Are they being done in parallel on the same connection? And, when doing that with 10 simultaneous deletes, are you saying that all 10 deletes fail, individually, every time?

This is a curious comment to me. It sounds like you’re saying it’s strange for a tech startup to have limited resources?

Yes, exactly. Active = there is free space to receive files. I would expect the reasonable counter-argument that full nodes maintain network integrity, BUT! It is necessary to think in terms of time: if there are 3000 active nodes now, then the other X nodes will not participate in storing and repairing the file in the future, which means they will not ensure the integrity of the network for that particular file. What matters to the user is the number of nodes that can serve them here and now.
If we go to extremes, we can imagine a situation where there are 100,000 nodes in the network and only 110 of them have free space. In that case we cannot say that we have 100,000 active nodes, because there is nowhere to perform data repair.

The files contain a very large number of segments and are removed in parallel requests. (uplink rm sj://a/b & uplink rm sj://a/c & …)
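
For completeness, a minimal sketch of that pattern (the bucket and object names are placeholders, not my real paths):

    # Delete ~10 large multi-segment objects in parallel and wait for all of them.
    # With 10 simultaneous deletes this reliably ended in the
    # "context canceled" error after about 10 minutes.
    for name in big01 big02 big03 big04 big05 big06 big07 big08 big09 big10; do
      uplink rm "sj://bucket/$name" &
    done
    wait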

Yes, since my request concerned an extremely small amount of resources. I don’t think that 4 TB out of the 19 PB available is in any way noticeable.
I also found myself in a situation where 1 TB is available to me, of which 700 GB are zombie segments and another 250 GB are used to store backups.

I think you’re saying here that the number of nodes with free space is relevant. Definitely true! The more nodes that have free space, the more we can distribute the PUT load.

It also sounds like you are implying that file integrity suffers “in the future”, because of full nodes, but that would only be the case if the network permanently stopped growing.

Yep! I don’t think it would be particularly helpful to imagine that, though. It would be like imagining “what if AWS stopped buying digital storage devices out of spite, and ran out of space?” There are always options for us to take if we can’t recruit enough new nodes to store new data or repaired pieces. The most obvious one is to spin up our own nodes; it’s not the business we want to be in, but we could do it if necessary to keep the business running.

And you are saying that every individual invocation of uplink rm is failing when you do that?

Oh, of course that’s not significant. The resource that is limited is our team capacity. We’re still growing this product, and dealing with whatever problems arise as we do so. We’re in uncharted territory here, so there will be times when we have to deal with a flood of support requests and can’t respond as quickly as we would like. I’m not on the support team, but if I were, I think that requests for increased limits would not be at the very top of my priority list.

Yes, zombie segments are of course a current serious problem. We’re working on fixes both short-term and long-term. We cleaned up an awful lot of them over the weekend; maybe your project has more space available now.

I’d say the most obvious one right now is to remove a lot of the test data. Unless of course you get such large scale customers that that wouldn’t be enough.

They may not be high priority, but they’re also likely to have a very low impact on resources. It may be a good idea to handle these requests quickly, before diving into the more complex ones. Of course, I don’t know how large the load of these requests is. I can imagine that these are also the kinds of requests that can easily be partially automated: as long as they are small and a monitored total doesn’t go over certain thresholds, you could grant them automatically. Just a suggestion.


Yes, that’s an option right now. It wouldn’t always be, though- I wanted an example that would always apply :smiley:

For sure. I think the difficulty in this case might be in deciding on the appropriate policy. This may be one of the customers with a free-1TB coupon and no STORJ token or credit card info on file. With such customers, if we raise their limit, we stand to lose more money if they use up their free credits; we have no way to bill them for overage. I believe discussions are ongoing to devise something appropriate and fair.


Yeah, and having the fallback of hosting your own nodes means it’ll never be an issue that can’t be avoided.

Ahh I see, yeah that might change things. I can understand wanting to be more careful in situations like that. I guess that would make it one of the more complicated cases that naturally take a little longer. I guess it would be reasonable to require a credit card on file for limit increases over a certain level. Even just as a backup. If tokens are the preferred payment method you could perhaps even inform the customer prior to charging the credit card so they can get a chance to add more tokens to their balance.


This is definitely not a problem; instead of old small HDDs, a lot of people are starting to use new, bigger ones.
Last week I personally added 40 TB of space, and the week before, 16 TB.

Client requests should always be a top priority; clients pay you money. And I really don’t understand why the limits are so small: clients want to use the service and want to pay for it, so why does Storj limit them? If limits are needed, they could be something more reasonable, like 10 TB.


https://documentation.tardigrade.io/concepts/limits


In this case, a professional company sends an automatic response saying “your request has been submitted, we will answer within X time”, and then just does it. Then people understand how long to wait for a response.

If there is a problem with the free coupon, then the response should contain some information about it and how to solve it, for example by adding money to the account, and …

Why do you think that is not done?

People normally don’t complain then about response timing.

I’m guessing you normally would be right about that. :slight_smile: But in my experience, this particular person is a big fan of complaining.
