It really depends on the hardware and the setup, but also on node age and size. Now, with nodes getting older and therefore bigger, and usage increasing due to Storj adoption by customers, this can become an issue. The decision to let the file walker and similar processes run independently of the current load was not a good one.
Here the file walker is running now for 10 hours straight.
14 hours now…
I feel your pain. 40 hours 0 minutes and still going…
You're seeing stats for every function, collected over time. It's a really long list, but if you look for upload, you'll find stats on how long the upload function takes to complete.
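If your node's debug endpoint is enabled, a small sketch like this could pull that list and filter it for the upload functions. The 127.0.0.1:5999 address and the /mon/funcs path are assumptions about a typical monkit setup, not something from this thread, so adjust them to your own debug.addr:

```go
package main

// Rough sketch (not part of the node): fetch the monkit function stats from a
// storage node's debug endpoint and keep only the lines mentioning "Upload".
// Address and path are assumptions -- check your node's debug.addr setting.

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
)

func main() {
	resp, err := http.Get("http://127.0.0.1:5999/mon/funcs") // assumed debug address
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.Contains(line, "Upload") {
			fmt.Println(line)
		}
	}
	if err := scanner.Err(); err != nil {
		panic(err)
	}
}
```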
Retain is the garbage collection process. So yeah, that's a long one.
That the parent thread/function no longer exists. This is fairly normal in how Go works, I think: functions just spawn a new goroutine (a lightweight thread) to do what needs to be done and then return.
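For illustration only (generic Go, not node code), this is roughly what "the parent returns, the child keeps working" looks like; a tracer would mark the long-lived child as orphaned once its parent's span is gone:

```go
package main

import (
	"fmt"
	"time"
)

// startWork returns immediately, but the goroutine it spawned keeps running.
// In a trace like the one later in this thread, such a still-running child
// span ends up marked "orphaned" because its parent function already returned.
func startWork() {
	go func() {
		time.Sleep(2 * time.Second) // stands in for a long-running walk
		fmt.Println("background work done")
	}()
}

func main() {
	startWork()
	fmt.Println("startWork has already returned")
	time.Sleep(3 * time.Second) // keep the process alive so the goroutine can finish
}
```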
Ah really? Did not know that, good to know thx
That's a shame, I think we should be able to configure both separately.
One of my nodes on a cheap 2.5" drive was taking more than 36 hours, but since my RPi 4B was switched to 64-bit Raspbian, it is way faster for reasons I can't explain (less than 10 hours I think; still incredibly long, but way better anyway).
I do agree the max-concurrent-requests setting shouldn't be used, but personally I did not find any other way for using cheap 2.5" SMR drives which are not performing well "by design" and really aren't supposed to be used this way…
I can confirm that even on my crawling poor 2.5" SMR disks, everything works fine with such a low setting. In fact, all of my SMR nodes have higher values than that.
I think the way it works now makes sense. You cannot deny downloads to customers, as by accepting the upload you promise that you will store the files and make them available on demand. But you can deny new uploads (e.g. because of lack of free space). Yet both uploads and downloads require your node to perform some potentially time-consuming operations, and require bandwidth. So if your node can't cope with, let's say, 100 such concurrent operations, uploads are the only thing that you can control while not violating the T&C.
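As a minimal sketch of that logic (made-up names, not the actual storage node code): uploads have to grab a slot and can be rejected, downloads are always served.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrBusy is returned when the node is at its upload concurrency limit.
var ErrBusy = errors.New("too many concurrent uploads, try again later")

// limiter gates only uploads; a buffered channel works as a counting semaphore.
type limiter struct {
	uploadSlots chan struct{}
}

func newLimiter(maxConcurrentUploads int) *limiter {
	return &limiter{uploadSlots: make(chan struct{}, maxConcurrentUploads)}
}

// StartUpload reserves a slot or rejects the request outright; the uploader
// would then simply pick faster nodes for that piece.
func (l *limiter) StartUpload() (done func(), err error) {
	select {
	case l.uploadSlots <- struct{}{}:
		return func() { <-l.uploadSlots }, nil
	default:
		return nil, ErrBusy
	}
}

// StartDownload is never rejected: the data was promised, so it must be served.
func (l *limiter) StartDownload() (done func()) {
	return func() {}
}

func main() {
	l := newLimiter(2)
	for i := 0; i < 3; i++ {
		if done, err := l.StartUpload(); err != nil {
			fmt.Println("upload", i, "rejected:", err)
		} else {
			fmt.Println("upload", i, "accepted")
			defer done()
		}
	}
}
```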
The hidden assumption is that an upload and a download both take comparable resources to handle. Judging from I/O code paths in the storage node code, they're indeed quite close now… except for SMR drives, which were not yet a known problem when this switch was invented. So, if anything, I'd probably think of some tunable cost of upload vs. download, while keeping storage2.max-concurrent-requests as it is now… but, frankly, I'd see it as an ugly workaround, as I believe it would be much more productive to just optimize storage node code instead and help all node operators, not just those with SMR drives.
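To make the "tunable cost" idea concrete, a weighted variant could charge a different cost per operation type. Everything below is invented for illustration; nothing like it exists in the node today.

```go
package main

import (
	"fmt"

	"golang.org/x/sync/semaphore"
)

// Hypothetical weights: how much "capacity" each operation consumes. On an SMR
// drive one might want uploads (writes) to cost more than downloads (reads).
const (
	uploadCost   = 3
	downloadCost = 1
	capacity     = 100 // plays the role of max-concurrent-requests
)

var sem = semaphore.NewWeighted(capacity)

// tryStart reserves cost units, or reports that the node is too busy. Per the
// reasoning above, only uploads would go through this path; downloads would
// still be admitted unconditionally.
func tryStart(cost int64) (release func(), ok bool) {
	if !sem.TryAcquire(cost) {
		return nil, false
	}
	return func() { sem.Release(cost) }, true
}

func main() {
	if release, ok := tryStart(uploadCost); ok {
		defer release()
		fmt.Println("upload accepted")
	} else {
		fmt.Println("upload rejected: node is busy")
	}
}
```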
Oh yes! Now after the file walker was running for more than 24 hours, update decided it was a good time to kick in and restarted the node. And here goes the file walker again right from the start.
yeah, the filewalker should really be smarter… maybe that would be a good feature request…
that it doesn't just reset when a node is restarted…
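Purely as an illustration of what "not starting from scratch" could look like (nothing like this is implied to exist in the node), the walk could checkpoint finished piece directories to disk and skip them after a restart:

```go
package main

// Illustrative only: persist which directory prefixes a walk has already
// finished, so a restart can skip them. All names here are made up.

import (
	"encoding/json"
	"fmt"
	"os"
)

type checkpoint struct {
	DonePrefixes map[string]bool `json:"done_prefixes"`
}

func loadCheckpoint(path string) checkpoint {
	cp := checkpoint{DonePrefixes: map[string]bool{}}
	if data, err := os.ReadFile(path); err == nil {
		_ = json.Unmarshal(data, &cp)
	}
	return cp
}

func saveCheckpoint(path string, cp checkpoint) error {
	data, err := json.Marshal(cp)
	if err != nil {
		return err
	}
	return os.WriteFile(path, data, 0o644)
}

func main() {
	const path = "filewalker-checkpoint.json" // hypothetical file
	cp := loadCheckpoint(path)

	for _, prefix := range []string{"aa", "ab", "ac"} { // stand-ins for piece directories
		if cp.DonePrefixes[prefix] {
			fmt.Println("skipping already walked prefix", prefix)
			continue
		}
		fmt.Println("walking prefix", prefix) // the expensive part
		cp.DonePrefixes[prefix] = true
		_ = saveCheckpoint(path, cp)
	}
}
```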
It is still running and it is a pain in the .
I see the new feature has been merged and a pre-release is being offered:
Now the question: Would it be possible to install it already? And would it be safe?
And more important: If something breaks, would it be possible to go back to an officially released version?
the versions are automated now, based upon node ID… so StorjLabs basically has a list somewhere of which version each node ID should have, and when it updates to the next one…
and it's checked on start up of the node.
we are running a 14 day update schedule now… sure, it doesn't always happen that there is a new update after 14 days… but it does happen… that means that if your filewalker takes 48 hours or so… let's call it 2 days… then at the start of each 14 day period you would be forced to run the filewalker.
so roughly 15% of one's uptime (48 of every 336 hours) would in the best case be spent running the filewalker… and double that if one has even a single breakdown or reboot.
having the filewalker take 48 hours is imo not viable, not just because it makes everything annoying to work with… but it also puts a very heavy workload on the disk, which certainly won't help its lifespan…
though I must admit I'm not sure how much this actually matters, but it's possible that these kinds of things are why HDD manufacturers put limits on how much data a drive can transfer / write before the warranty is no longer valid.
and that limit isn't even that high in some cases…
I think some consumer HDDs today come with like 500TB limits… which is sort of absurd… that's like worse than most SSDs, and SSDs have historically been the devices with the worst wear in storage.
ofc it's possible that modern HDDs are less wear resistant and ofc SSD tech is getting better all the time… or HDD manufacturers just found it a good way to get around upholding the warranty.
in any case… limiting the workload should or could extend HDD life and lower temps.
I mean, if the head moves at 50% pace rather than 100% pace, then it could be argued it might have double the lifespan… at least if it's the mechanics of the head movement that wear out first.
more wear is rarely better… so… I think I'll just stop there… lol
at least now we will get the ability to turn off the filewalker… not sure I really understand how that is a good idea… it sort of makes sense… for like troubleshooting…
but then one might just turn it off permanently, and what happens then… if I keep adding capacity faster than it's used, does the filewalker even matter then…
will be interesting… I won't be the guinea pig that turns off the filewalker, I've got way too much caching to care.
Unless there were database migrations.
Did the filewalker restart again? How to interpret that:
[3341639376970616775,4948462480863905764] storj.io/storj/storagenode/pieces.(*CacheService).Run() (elapsed: 39h48m9.101477867s, orphaned)
[8290101857834522541,4948462480863905764] storj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite() (elapsed: 39h48m9.101399658s)
[233612088496725171,4948462480863905764] storj.io/storj/storagenode/pieces.(*Store).WalkSatellitePieces() (elapsed: 1h31m34.008860976s)
[1470727708712701612,4948462480863905764] storj.io/storj/storage/filestore.(*Dir).WalkNamespace() (elapsed: 1h31m34.008843935s)
[2707843328928678054,4948462480863905764] storj.io/storj/storage/filestore.(*Dir).walkNamespaceInPath() (elapsed: 1h31m34.00889831s)
I'm surprised about the 2 different times showing: 39h48m and 1h31m. The second one seems to indicate that the filewalker started again about an hour and a half ago.
Now finally it seems the filewalker has come to an end. I don't have exact times, but surely it was 38 hrs+.
Looking at the metrics I can tell that the filewalker is putting a lot of stress on the node. It is unbelievable. Load metrics are now one tenth of what they were when the filewalker was running, and I could increase the concurrency again. But wait… when I do that and restart the node, the filewalker will come back.
This filewalker design is
I'm surprised about the 2 different times showing: 39h48m and 1h31m. The second one seems to indicate that the filewalker started again about an hour and a half ago.
I think that's just because it started with the next satellite.
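If that's the case, the two numbers make sense: the outer SpaceUsedTotalAndBySatellite call spans the whole run, while the inner WalkSatellitePieces span restarts for each satellite, so only its elapsed time resets. Very roughly, and simplified well beyond the real code:

```go
package main

import (
	"fmt"
	"time"
)

// Simplified picture (not the actual node code) of why the trace shows two
// very different elapsed times: the outer call spans the whole run, while the
// per-satellite walk starts fresh for each satellite, so its timer resets.
func spaceUsedTotalAndBySatellite(satellites []string) {
	start := time.Now()
	for _, sat := range satellites {
		walkStart := time.Now()
		walkSatellitePieces(sat) // inner span: restarts for every satellite
		fmt.Printf("walked %s in %s (total so far: %s)\n",
			sat, time.Since(walkStart).Round(time.Millisecond), time.Since(start).Round(time.Millisecond))
	}
}

func walkSatellitePieces(sat string) {
	time.Sleep(100 * time.Millisecond) // stands in for hours of directory walking
}

func main() {
	spaceUsedTotalAndBySatellite([]string{"us1", "eu1", "ap1"})
}
```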