How does 'storage2.max-concurrent-requests' really work?

Apparently I had a false impression of how this setting works.
I tested it set to 1, believing this would mean there is always at least one upload request that goes through.
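(For reference, the setting goes into the node's config.yaml, roughly like this; the path is just an example from a typical docker setup, and the same option can also be passed as a --storage2.max-concurrent-requests flag on the run command:)

# add or edit this line in the node's config.yaml (hypothetical host path):
#   storage2.max-concurrent-requests: 1
grep 'max-concurrent-requests' /mnt/storj/config/config.yaml
docker restart storagenode    # the new limit only takes effect after a restart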
However, when I compare that with my logs, I see myriads of lines like this:

2022-08-11T03:26:46.616Z        ERROR   piecestore      upload rejected, too many requests      {"Process": "storagenode", "live requests": 2, "requestLimit": 1}
2022-08-11T03:26:46.667Z        ERROR   piecestore      upload rejected, too many requests      {"Process": "storagenode", "live requests": 2, "requestLimit": 1}
2022-08-11T03:26:48.117Z        ERROR   piecestore      upload rejected, too many requests      {"Process": "storagenode", "live requests": 3, "requestLimit": 1}
2022-08-11T03:26:48.419Z        ERROR   piecestore      upload rejected, too many requests      {"Process": "storagenode", "live requests": 3, "requestLimit": 1}
2022-08-11T03:26:48.518Z        ERROR   piecestore      upload rejected, too many requests      {"Process": "storagenode", "live requests": 3, "requestLimit": 1}
2022-08-11T03:26:48.708Z        ERROR   piecestore      upload rejected, too many requests      {"Process": "storagenode", "live requests": 3, "requestLimit": 1}
2022-08-11T03:26:51.794Z        ERROR   piecestore      upload rejected, too many requests      {"Process": "storagenode", "live requests": 3, "requestLimit": 1}
2022-08-11T03:26:52.204Z        ERROR   piecestore      upload rejected, too many requests      {"Process": "storagenode", "live requests": 3, "requestLimit": 1}
2022-08-11T03:26:54.930Z        ERROR   piecestore      upload rejected, too many requests      {"Process": "storagenode", "live requests": 3, "requestLimit": 1}
2022-08-11T03:26:55.913Z        ERROR   piecestore      upload rejected, too many requests      {"Process": "storagenode", "live requests": 3, "requestLimit": 1}
2022-08-11T03:26:58.945Z        ERROR   piecestore      upload rejected, too many requests      {"Process": "storagenode", "live requests": 3, "requestLimit": 1}
2022-08-11T03:26:59.378Z        ERROR   piecestore      upload rejected, too many requests      {"Process": "storagenode", "live requests": 3, "requestLimit": 1}
2022-08-11T03:27:02.076Z        ERROR   piecestore      upload rejected, too many requests      {"Process": "storagenode", "live requests": 3, "requestLimit": 1}
2022-08-11T03:27:03.164Z        ERROR   piecestore      upload rejected, too many requests      {"Process": "storagenode", "live requests": 3, "requestLimit": 1}
2022-08-11T03:27:03.463Z        ERROR   piecestore      upload rejected, too many requests      {"Process": "storagenode", "live requests": 3, "requestLimit": 1}
2022-08-11T03:27:06.437Z        ERROR   piecestore      upload rejected, too many requests      {"Process": "storagenode", "live requests": 4, "requestLimit": 1}
2022-08-11T03:27:08.708Z        ERROR   piecestore      upload rejected, too many requests      {"Process": "storagenode", "live requests": 4, "requestLimit": 1}
2022-08-11T03:27:08.936Z        ERROR   piecestore      upload rejected, too many requests      {"Process": "storagenode", "live requests": 4, "requestLimit": 1}
2022-08-11T03:27:09.368Z        ERROR   piecestore      upload rejected, too many requests      {"Process": "storagenode", "live requests": 4, "requestLimit": 1}
2022-08-11T03:27:10.026Z        ERROR   piecestore      upload rejected, too many requests      {"Process": "storagenode", "live requests": 3, "requestLimit": 1}
2022-08-11T03:27:12.064Z        ERROR   piecestore      upload rejected, too many requests      {"Process": "storagenode", "live requests": 3, "requestLimit": 1}
2022-08-11T03:27:18.102Z        ERROR   piecestore      upload rejected, too many requests      {"Process": "storagenode", "live requests": 3, "requestLimit": 1}
2022-08-11T03:27:19.002Z        ERROR   piecestore      upload rejected, too many requests      {"Process": "storagenode", "live requests": 3, "requestLimit": 1}
2022-08-11T03:27:19.834Z        ERROR   piecestore      upload rejected, too many requests      {"Process": "storagenode", "live requests": 3, "requestLimit": 1}
2022-08-11T03:27:20.025Z        ERROR   piecestore      upload rejected, too many requests      {"Process": "storagenode", "live requests": 3, "requestLimit": 1}
2022-08-11T03:27:21.475Z        ERROR   piecestore      upload rejected, too many requests      {"Process": "storagenode", "live requests": 3, "requestLimit": 1}
2022-08-11T03:27:22.145Z        ERROR   piecestore      upload rejected, too many requests      {"Process": "storagenode", "live requests": 3, "requestLimit": 1}
2022-08-11T03:27:23.511Z        ERROR   piecestore      upload rejected, too many requests      {"Process": "storagenode", "live requests": 3, "requestLimit": 1}

But I see only very few finished uploads. So if we see "live requests": 3, why is it not the case that only two of them get rejected and one gets accepted?


Further investigation:
In the meantime I had seen patterns in the log that looked more like what I was expecting:
First an upload starts, then while it is uploading I see rejected uploads, and finally the piece has been uploaded and a new upload starts.
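(To follow a single piece through the log, one can simply filter on its piece ID, for example — assuming a docker node named storagenode, with PIECEID being the ID from the "upload started" line:)

docker logs storagenode 2>&1 | grep 'PIECEID'    # prints the "upload started" / "uploaded" lines for that piece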

But then I also see this:

2022-08-11T07:15:57.025Z        INFO    piecestore      upload started  {"Process": "storagenode", "Piece ID": "PIECEID", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "PUT", "Available Space": 34231223}
2022-08-11T07:15:57.158Z        ERROR   piecestore      upload rejected, too many requests      {"Process": "storagenode", "live requests": 2, "requestLimit": 1}
2022-08-11T07:15:58.192Z        ERROR   piecestore      upload rejected, too many requests      {"Process": "storagenode", "live requests": 2, "requestLimit": 1}

***This goes on and on and on, far too many lines to post here.***

2022-08-11T07:39:42.978Z        ERROR   piecestore      upload rejected, too many requests      {"Process": "storagenode", "live requests": 4, "requestLimit": 1}
2022-08-11T07:39:43.146Z        ERROR   piecestore      upload rejected, too many requests      {"Process": "storagenode", "live requests": 4, "requestLimit": 1}
2022-08-11T07:39:45.407Z        INFO    piecestore      uploaded        {"Process": "storagenode", "Piece ID": "PIECEID", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "PUT", "Size": 399104}

Looking at the timestamps I am shocked:
Upload started: 07:15:57.025Z
Upload finished: 07:39:45.407Z

That is almost 24 minutes for a single piece of about 400 KB!
How can this be? My worst thought: is someone trying to block nodes by deliberately uploading slowly on purpose? :thinking:

The node handles requests in a highly parallel way, so if connections are initiated simultaneously, they might get through before the node starts blocking. It may also count egress connections, but it never blocks those.

A setting of 1 is a really bad idea, though I’m sure you were just doing this for testing.

And this is why you don't want to allow only a very few connections. Customers can be slow as well, and it takes just one slow customer to block your node from receiving anything if you set it to 1.

In general, using this setting is a bad idea. Not only will it stop you from receiving data, it will also reject customer transactions, increasing the possibility of their upload failing. If your system is overloaded, there are better options to alleviate that, like moving the databases to a different HDD/SSD.
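For example, roughly like this (a sketch with hypothetical paths; storage2.database-dir is the option that points the node at the new database location):

docker stop storagenode
cp /mnt/storj/config/storage/*.db /mnt/ssd/storj-db/    # hypothetical source and destination paths
# then add this line to config.yaml and start the node again:
#   storage2.database-dir: /mnt/ssd/storj-db
docker start storagenode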

I would not recommend using max-concurrent-requests unless you've tried everything else first.

This wouldn't do anything on most nodes, as the node would happily wait and just deal with other transactions in parallel.


This node is currently getting hammered with uploads, and with the file walker running on top of that it has trouble keeping up. The setting of 1 was a start to see how it does.

However, this slow? 24 minutes for a single piece? I am not sure I want that on this node. I am really looking forward to the new option to disable the file walker, to see whether that has a positive impact.

What kind of setup and hardware is it on? Did you already move the DBs, or is that not an option?

Yeah, I've looked at this before. You can also see stats on this in monkit, I believe (look for the "debugging your storagenode" topic). It happens quite frequently. It would be nice to be able to set a timeout on the node side too, though an option like that could again make the customer experience worse. It shouldn't be set by default. Maybe it should only do anything if you have also used the max-concurrent-requests option, and it shouldn't allow settings below 1 minute.

Do you know if the new setting will also block garbage collection?

Unfortunately that is not an option. The node has only one HDD.
Currently the file walker runs again after every restart, so it's a bit awkward for testing different settings…

I thought it would only stop the file walker on restart, so I would be surprised if it prevented garbage collection.

I guess you’d still have the same problems during garbage collection then. That’s a bummer, but then again, not running garbage collection, especially on a slow node, is probably also a bad idea.

I wouldn't ever set this setting below 5, though. Otherwise you basically block your node from receiving anything most of the time. For temporary relief you can also lower the allocated space to below what is already stored. That just tells the satellites not to even try to send data your way, so customers won't even initiate connections, and your node doesn't have to reject them in the first place. Of course you don't want to keep it that way. This would be a lot easier once you can stop the file walker from running after a restart, because a restart is unfortunately required to change the allocated space.
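For example, something like this (the value is only an illustration; pick something below what the node already stores):

# advertise less space than is currently stored so satellites stop selecting the node for ingress;
# either lower -e STORAGE="..." on the docker run command or set the config line below, then restart:
#   storage.allocated-disk-space: 500.00 GB
docker restart storagenode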


+1 for monkit. Now that I’ve been checking it more regularly, it has a lot of information.

Just at this random moment in time my node is receiving 15 uploads while the filewalker is still running.

Check the /mon/funcs endpoint for aggregated stats on functions.

It does feel like some uploads just get stuck forever.

This actually means: do not allow new uploads if there is at least 1 concurrent data transfer (either an upload or a download). Only uploads can be rejected, but both uploads and downloads go into the count of concurrent requests.


:scream:
(Post must be at least 20 characters)

Yes, I wish I had set it up before I restarted. Now I cannot really see whether the filewalker is still running, but I believe it is.

There should be a way to find the random port number… but I don't know Linux.

Wow, that was a great suggestion! This idea led me to something I think I should share:

You can use netstat to look for listening ports and established network connections. By default these utilities are not available inside the container, so install the net-tools package first:
apt update
apt install net-tools
Now we can use the netstat command as usual.

To see processes listening on ports, you can use:
netstat -tulnp

I found a port that does not belong there, so I guessed that this is the right one. I installed curl, ran curl localhost:port/mon/funcs, and got some output.
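Putting it together, the whole sequence looks roughly like this (run inside the container, e.g. via docker exec; the debug port is picked at random on startup, so <debug-port> is just a placeholder):

apt update && apt install -y net-tools curl    # netstat and curl are not in the image by default
netstat -tulnp | grep storagenode              # look for the extra listening port that isn't one of the node's usual ones
curl localhost:<debug-port>/mon/funcs          # dump the currently running monitored functions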

I have no idea what it is I am seeing :smile: I will have to go over it.


Now how do I read this?

[3326624525364586402] storj.io/storj/storagenode/pieces.(*CacheService).Run() (elapsed: 1h4m57.006355716s, orphaned)
 [1205168693044936770] storj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite() (elapsed: 1h4m57.006278507s)
  [2188566969813173146] storj.io/storj/storagenode/pieces.(*Store).WalkSatellitePieces() (elapsed: 1h4m57.005517671s)
   [7169027342028649691] storj.io/storj/storage/filestore.(*Dir).WalkNamespace() (elapsed: 1h4m57.005386962s)
    [312960706685582609] storj.io/storj/storage/filestore.(*Dir).walkNamespaceInPath() (elapsed: 1h4m57.005435128s)

[1183853040753984134] storj.io/storj/storagenode/pieces.(*Store).EmptyTrash() (elapsed: 51m35.743054515s)
 [3551158442265692859] storj.io/storj/storage/filestore.(*blobStore).EmptyTrash() (elapsed: 51m35.743030556s)
  [5918463843777401585] storj.io/storj/storage/filestore.(*Dir).EmptyTrash() (elapsed: 51m35.743014265s)
   [8285769245289110310] storj.io/storj/storage/filestore.(*Dir).walkNamespaceInPath() (elapsed: 51m35.742993806s)

[4984102521334705395] storj.io/storj/storagenode/retain.(*Service).Run() (elapsed: 1h4m56.996615879s, orphaned)

[3727133373339822375] storj.io/storj/storagenode/version.(*Chore).Run() (elapsed: 1h4m57.009240397s, orphaned)

This is the filewalker process. It runs every time the node starts.

This seems to empty the trash of anything older than x days. It seems to run every 2 hours.

Even running at max concurrent 120, I still see periods of uploads being rejected.
It does help a lot with keeping the network from spamming one's system to death during startup and the like, though.

Also, max concurrent only seems to apply to writes… so only ingress / uploads.
Lots of users might have poor connections or whatever, so with max concurrent 1 you might have one user uploading at 4 KB/s… and you are maxed out.
I'm sure you can see the problem with this.

120 gives me an average of about 0.5% rejected uploads, I think.
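(A rough way to get that percentage out of the logs, assuming a docker node named storagenode and counting rejected against started uploads:)

docker logs storagenode 2>&1 | awk '/upload rejected/{r++} /upload started/{s++} END{printf "rejected: %d  started: %d  reject rate: %.2f%%\n", r, s, 100*r/(r+s)}'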

So the processes that I can see there are the ones that are currently running?
What does it mean when it says orphaned?

And I am seeing an additional process now:

[4984102521334705395] storj.io/storj/storagenode/retain.(*Service).Run() (elapsed: 3h32m16.918975441s, orphaned)

I wonder why this has been running for so long?

It is hard to find a good setting when the filewalker kicks in after every restart.


Yes, an instant snapshot of the current processes.

This one just seems to sit there waiting for sub-processes to spawn.


There are a bunch of these ‘master’ processes just ‘waiting’ for slave processes on my Windows node:

[4222075399394417348] storj.io/storj/storagenode/bandwidth.(*Service).Run() (elapsed: 26h12m42.7942891s, orphaned)

[1520630502694312314] storj.io/storj/storagenode/collector.(*Service).Run() (elapsed: 26h12m42.809719s, orphaned)

[5109427644553286243] storj.io/storj/storagenode/console/consoleserver.(*Server).Run() (elapsed: 26h12m42.8117044s, orphaned)

[8222499173770810147] storj.io/storj/storagenode/contact.(*Chore).Run() (elapsed: 26h12m42.8115867s, orphaned)

[2343883767729150055] storj.io/storj/storagenode/gracefulexit.(*Chore).Run() (elapsed: 26h12m42.810148s, orphaned)

[3051294481966191890] storj.io/storj/storagenode/monitor.(*Service).Run() (elapsed: 26h12m42.8112076s, orphaned)

[1816414584413935279] storj.io/storj/storagenode/orders.(*Service).Run() (elapsed: 26h12m42.8106406s, orphaned)

[8634125806288229018] storj.io/storj/storagenode/version.(*Chore).Run() (elapsed: 26h12m42.811635s, orphaned)

True, but I can run the filewalker in an hour or two, maybe less, so that is not really a big issue here.
I can't say I have tried a great number of different max concurrent values, and I really barely need to limit it.
It's just that sometimes, when stuff goes wrong, the network tends to hammer my storage to death, so setting boundaries is helpful.

I have tried a great many max concurrent settings over the years, and it does seem like the demand has been going up.
Basically you want it to block something, but not too much, so it doesn't really have to be super accurate.

It just helps a ton when the network doesn't start trying to upload 8000 pieces at one time… that can be the difference between the system crashing or just lagging a bit.

At least for my use case I just want max concurrent to cut off the crazy demand spikes that can arise.

That is also why I bother commenting: to help people select a good setting, because it can be difficult to guesstimate.

100 to 200 would be the range I'd use… now it's infinitely easier to select :smiley: since the range is limited rather than infinite.