How does 'storage2.max-concurrent-requests' really work?

It really depends on the hardware and the setup, but also on node age and size. Now, with nodes getting older and therefore bigger, and with usage increasing due to Storj adoption by customers, this can become an issue. The decision to have the file walker and similar processes run independently of the current load was not a good one.
Here the file walker has now been running for 10 hours straight.

14 hours now… :face_with_spiral_eyes:

1 Like

I feel your pain. 40 hours 0 minutes and still going…

2 Likes

You’re seeing stats for every function, collected over time. It’s a really long list, but if you look for upload, you’ll find stats on how long the upload function takes to complete.
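For context, those numbers come from instrumentation along these lines (a minimal sketch assuming the monkit Task pattern used throughout the storj code; Upload here is a made-up stand-in, not the node's actual handler):

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/spacemonkeygo/monkit/v3"
)

// mon is a per-package scope; every function instrumented with mon.Task()
// gets its own call counts and timing distribution, which is what that
// long stats list is built from.
var mon = monkit.Package()

// Upload is a hypothetical stand-in for the node's upload handler.
func Upload(ctx context.Context) (err error) {
	defer mon.Task()(&ctx)(&err)      // records success/failure counts and how long the call took
	time.Sleep(50 * time.Millisecond) // pretend to do some work
	return nil
}

func main() {
	_ = Upload(context.Background())

	// dump everything monkit has collected so far
	monkit.Default.Stats(func(key monkit.SeriesKey, field string, val float64) {
		fmt.Println(key, field, val)
	})
}
```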

Retain is the garbage collection process. So yeah, that’s a long one.

That the parent thread/function no longer exists. This is fairly normal in how Go works, I think: a function just spawns a new goroutine to do what needs to be done and then returns.
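For illustration, a minimal Go sketch (not the actual storage node code) of why the parent shows up as already gone:

```go
package main

import (
	"fmt"
	"time"
)

// startBackgroundWork returns immediately after spawning a goroutine.
// The goroutine keeps running long after its parent function has exited,
// which is why a trace can show the child as "orphaned".
func startBackgroundWork() {
	go func() {
		time.Sleep(2 * time.Second) // stand-in for something long-running, like a piece walk
		fmt.Println("background work finished")
	}()
	// the parent function returns here; only the goroutine lives on
}

func main() {
	startBackgroundWork()
	time.Sleep(3 * time.Second) // keep main alive so the goroutine can finish
}
```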

1 Like

Ah really? Did not know that, good to know thx :+1:
That’s a shame, I think we should be able to configure both separately.

One of my nodes on a cheap 2.5" drive was taking more than 36 hours, but since my RPi 4B was switched to 64-bit Raspbian, it is way faster for reasons I can’t explain (less than 10 hours I think - still incredibly long, but way better anyway).

I do agree the max-concurrent-requests setting shouldn’t be used, but personally I did not find any other way to keep using cheap 2.5" SMR drives, which don’t perform well “by design” and really aren’t supposed to be used this way…

I can confirm that even on my crawlingly slow 2.5" SMR disks, everything works fine with such a low setting. In fact, all of my SMR nodes have higher values than that.

1 Like

I think the way it works now makes sense. You cannot deny downloads to customers, as by accepting an upload you promise that you will store the file and make it available on demand. But you can deny new uploads (e.g. because of a lack of free space). Yet both uploads and downloads require your node to perform potentially time-consuming operations, and both require bandwidth. So if your node can’t cope with, let’s say, 100 such concurrent operations, uploads are the only thing you can limit without violating the T&C.
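As a rough illustration of that kind of limit (a minimal sketch with made-up names like limiter and handleUpload, not the node's actual implementation): new uploads are simply rejected once a fixed number of requests are already in flight, while downloads are never turned away.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrBusy is what an overloaded node would answer with, so the uplink can
// retry the upload on another node.
var ErrBusy = errors.New("too many concurrent requests")

// limiter is a hypothetical stand-in for storage2.max-concurrent-requests:
// a counting semaphore that only ever gates uploads.
type limiter struct {
	slots chan struct{}
}

func newLimiter(max int) *limiter {
	return &limiter{slots: make(chan struct{}, max)}
}

// tryAcquire grabs a slot if one is free, otherwise reports ErrBusy.
func (l *limiter) tryAcquire() error {
	select {
	case l.slots <- struct{}{}:
		return nil
	default:
		return ErrBusy
	}
}

func (l *limiter) release() { <-l.slots }

func handleUpload(l *limiter) error {
	if err := l.tryAcquire(); err != nil {
		return err // deny the upload before promising to store anything
	}
	defer l.release()
	// ... receive and store the piece ...
	return nil
}

func main() {
	l := newLimiter(2)
	// simulate two uploads already in flight
	_ = l.tryAcquire()
	_ = l.tryAcquire()
	fmt.Println(handleUpload(l)) // prints: too many concurrent requests
}
```

Downloads would bypass the limiter entirely, which is exactly why only the upload side is tunable.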

The hidden assumption is that an upload and a download both take comparable resources to handle. Judging from I/O code paths in the storage node code, they’re indeed quite close now… except for SMR drives, which were not yet a known problem when this switch was invented. So, if anything, I’d probably think of some tunable cost of upload vs. download, while keeping storage2.max-concurrent-requests as it is now… but, frankly, I’d see it as an ugly workaround, as I believe it would be much more productive to just optimize storage node code instead and help all node operators, not just those with SMR drives.

3 Likes

Oh yes! After the file walker had been running for more than 24 hours, the updater decided it was a good time to kick in and restarted the node. And here goes the file walker again, right from the start.

:rage: :face_with_symbols_over_mouth: :rage:

4 Likes

Yeah, the filewalker should really be smarter… maybe that would be a good feature request:
that it doesn’t just reset when a node is restarted…

1 Like

It is still running and it is a pain in the :face_with_symbols_over_mouth:.

I see the new feature has been merged and a pre-release is being offered:

Now the question: Would it be possible to install it already? And would it be safe?
And more important: If something breaks, would it be possible to go back to an officially released version?

The version rollout is automated now, based on node ID… so StorjLabs basically has a list somewhere of which version a given node ID should have, and when it updates to the next one…

and it’s checked on startup of the node.
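A hash-plus-cursor scheme is a common way to do that kind of staged rollout keyed on node ID; here is a toy sketch of the general idea (the names and details are my assumptions, not the actual version server code):

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// inRollout illustrates a staged rollout: hash the node ID with a
// per-release seed and compare it against a cursor that is gradually
// moved from 0% to 100%.
func inRollout(nodeID, seed string, cursorPercent float64) bool {
	sum := sha256.Sum256([]byte(seed + nodeID))
	// map the first 8 bytes of the hash onto [0, 100)
	bucket := float64(binary.BigEndian.Uint64(sum[:8])%10000) / 100.0
	return bucket < cursorPercent
}

func main() {
	// with the cursor at 25%, roughly a quarter of node IDs update first;
	// the rest stay on the old version until the cursor is advanced
	for _, id := range []string{"node-a", "node-b", "node-c", "node-d"} {
		fmt.Println(id, inRollout(id, "release-1.2.3", 25))
	}
}
```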
We are running a 14-day update schedule now… sure, it doesn’t always happen that there is a new update after 14 days… but it does happen… that means that if your filewalker takes 48 hours or so… let’s call it two days… then at the start of each 14-day period you would be forced to run the filewalker.

So around 14% of one’s uptime (48 out of 336 hours) would in the best case be spent running the filewalker… and that doubles if one has even a single breakdown or reboot.

Having the filewalker take 48 hours is IMO not viable, not just because it makes everything annoying to work with… it also puts a very heavy workload on the disk, which certainly won’t help its lifespan…

Though I must admit I’m not sure how much this actually matters, it’s possible that these kinds of things are why HDD manufacturers put limits on how much data a drive can transfer/write before the warranty is no longer valid.

That limit isn’t even that high in some cases…
I think some consumer HDDs today come with something like 500 TB limits… which is sort of absurd… that’s worse than most SSDs, and SSDs have historically been the storage devices with the worst wear.

Of course it’s possible that modern HDDs are less wear-resistant, and of course SSD tech is getting better all the time… or HDD manufacturers just found it a good way to get around honoring warranties.

In any case… limiting the workload should, or at least could, extend HDD life and lower temperatures.
I mean, if the head moves at 50% pace rather than 100% pace, it could be argued it might have double the lifespan… at least if it’s the mechanics of the head movement that wear out first.

More wear is rarely better… so… I think I’ll just stop there… lol

At least now we will get the ability to turn off the filewalker… not sure I really understand how that is a good idea… it sort of makes sense… for troubleshooting, maybe…

But then one might just turn it off permanently, and what happens then… if I keep adding capacity faster than it’s used, does the filewalker even matter then…

It will be interesting… I won’t be the guinea pig that turns off the filewalker; I’ve got way too much caching to care.

Unless there were database migrations.

Did the filewalker restart again? How should I interpret this:

[3341639376970616775,4948462480863905764] storj.io/storj/storagenode/pieces.(*CacheService).Run() (elapsed: 39h48m9.101477867s, orphaned)
 [8290101857834522541,4948462480863905764] storj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite() (elapsed: 39h48m9.101399658s)
  [233612088496725171,4948462480863905764] storj.io/storj/storagenode/pieces.(*Store).WalkSatellitePieces() (elapsed: 1h31m34.008860976s)
   [1470727708712701612,4948462480863905764] storj.io/storj/storage/filestore.(*Dir).WalkNamespace() (elapsed: 1h31m34.008843935s)
    [2707843328928678054,4948462480863905764] storj.io/storj/storage/filestore.(*Dir).walkNamespaceInPath() (elapsed: 1h31m34.00889831s)

I’m surprised about the two different times showing:
39h48m and 1h31m. The second one seems to indicate that the filewalker started again about an hour and a half ago.

Now it finally seems the filewalker has come to an end. I don’t have exact times, but it surely was 38+ hours.
Looking at the metrics, I can tell that the filewalker puts a lot of stress on the node. It is unbelievable. Load metrics are now a tenth of what they were while the filewalker was running, and I could increase the concurrency again. But wait… when I do that and restart the node, the filewalker will come back. :crazy_face: :crazy_face: :crazy_face:
This filewalker design is :face_with_symbols_over_mouth:

1 Like

I think that’s just because it started with the next satellite.
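That would fit a structure roughly like this (a simplified sketch of the idea with made-up names and directory layout, not the actual storj code): the outer call loops over the satellites and starts a fresh walk for each one, so the inner function’s elapsed time resets while the outer one keeps counting.

```go
package main

import (
	"fmt"
	"io/fs"
	"path/filepath"
	"time"
)

// spaceUsedBySatellite is a simplified stand-in for the long-running outer
// call (SpaceUsedTotalAndBySatellite): one call that launches a separate
// walk per satellite.
func spaceUsedBySatellite(root string, satellites []string) map[string]int64 {
	start := time.Now()
	totals := make(map[string]int64)
	for _, sat := range satellites {
		// each satellite gets its own walk, so a trace taken right after this
		// line shows a small "elapsed" for the walk and a big one for the caller
		totals[sat] = walkSatellitePieces(filepath.Join(root, sat))
	}
	fmt.Println("outer call ran for", time.Since(start))
	return totals
}

// walkSatellitePieces sums up file sizes under one satellite's directory.
func walkSatellitePieces(dir string) int64 {
	var used int64
	_ = filepath.WalkDir(dir, func(path string, d fs.DirEntry, err error) error {
		if err != nil || d.IsDir() {
			return nil // skip unreadable entries and directories in this sketch
		}
		if info, err := d.Info(); err == nil {
			used += info.Size()
		}
		return nil
	})
	return used
}

func main() {
	// hypothetical layout: blobs/<satellite>/...
	fmt.Println(spaceUsedBySatellite("blobs", []string{"sat-a", "sat-b"}))
}
```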