Two weeks working for free in the waste storage business :-(

No, this node was not killed; at the moment it continues to work without restarts.
But your wording gave me an idea: this behavior may simply be caused by the large in-memory cache. Files are not written to disk during the upload process (they fit entirely in RAM, so there is no need to write to disk yet). When the upload is canceled, the piece of code you mentioned has nothing to delete from disk (the file content still exists only in RAM), and later, whenever the cache containing the canceled upload is flushed, it is written to disk instead of being discarded.
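Just to illustrate the sequence I have in mind, a toy sketch (made-up names, not the actual storagenode code):

```go
// Toy illustration of the suspected ordering problem, NOT real storagenode code.
// The piece data sits only in an in-memory write buffer; the cancel path only
// removes on-disk temp data, so a later flush still materializes the piece.
package main

import "fmt"

type bufferedPiece struct {
	data     []byte // piece content, held in RAM only so far
	onDisk   bool   // nothing has been written to disk yet
	canceled bool
}

// cancel is how I imagine the cleanup working: it deletes disk artifacts,
// but if nothing was ever written, it has nothing to remove.
func (p *bufferedPiece) cancel() {
	p.canceled = true
	if p.onDisk {
		fmt.Println("removing partial file from disk")
	} else {
		fmt.Println("nothing on disk to remove")
	}
}

// flush models a later buffer flush that does not check the canceled flag,
// which would write the piece into blobs as instant garbage.
func (p *bufferedPiece) flush() {
	fmt.Printf("flushing %d bytes to blobs (canceled=%v)\n", len(p.data), p.canceled)
	p.onDisk = true
}

func main() {
	p := &bufferedPiece{data: make([]byte, 64*1024)}
	p.cancel() // upload canceled while the data is still only in RAM
	p.flush()  // suspected bug: buffer flushed instead of discarded
}
```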

Now it looks like a potential bug; tomorrow I will collect some more test data and file a new bug report on GitHub.

5 Likes

It is up to the operating system to manage the filesystem cache. I don't think any operating system has such a race condition that would allow the delete operation to bypass the filesystem cache.

3 Likes

I think you mean the temp directory.

3 Likes

Maybe test with async disabled (filestore.force-sync: true)… see if the behavior is reproduced. Perhaps they just changed the working dir from the temp directory to the final piece ID destination, to construct the files there instead, and then forgot to redirect the temp move and/or delete routine to the new async "on" piece ID location.

Also of note, I believe they changed the default write buffer from 128 KiB to 4 MiB (with the ~152 KB average file size lately for test data, that is roughly 3.85 MiB of RAM wasted per piece). At the rates of test bandwidth saturation this can often cause a race condition, overwhelming Windows memory management and abruptly killing off a node (at least as noted on Windows Server, with a lower memory footprint). Changing the config obviously fixes that runaway condition, either by bandwidth restriction or by setting filestore.write-buffer-size: 128.0 KiB. However, I haven't yet noticed others posting about problems that aren't logged yet.

2 cents

2 Likes

Another interesting thing to check, which was my initial concern:

whether, in addition to uploaded-but-cancelled pieces, we also have .partial files like those that used to be in the temp folder. I haven't seen any yet, but maybe I just missed them.

I didn't mean the OS filesystem-level cache, but the internal cache implemented in the storagenode process itself, which is controlled via this parameter in the node config:
filestore.write-buffer-size:
It is designed to store (buffer) partially uploaded pieces before writing them to the filesystem, to reduce disk I/O.
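Conceptually it behaves something like a sized bufio.Writer placed in front of the piece file; here is my rough sketch of the idea (not the real storagenode implementation):

```go
// Rough conceptual sketch of a per-upload write buffer, NOT the real storagenode code.
// Incoming piece data is accumulated in RAM up to writeBufferSize before any disk I/O.
package main

import (
	"bufio"
	"bytes"
	"io"
	"log"
	"os"
)

func storePiece(path string, upload io.Reader, writeBufferSize int) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()

	// All writes smaller than writeBufferSize stay in memory until Flush (or overflow).
	buf := bufio.NewWriterSize(f, writeBufferSize)
	if _, err := io.Copy(buf, upload); err != nil {
		return err
	}
	return buf.Flush() // only now does a small piece actually hit the disk
}

func main() {
	// Example: a 100 KiB "piece" with a 4 MiB buffer never touches disk until Flush.
	piece := bytes.NewReader(make([]byte, 100*1024))
	if err := storePiece("piece.tmp", piece, 4*1024*1024); err != nil {
		log.Fatal(err)
	}
}
```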

Although I should have used the term "buffer" instead of "cache" to avoid possible confusion with the OS filesystem cache.

The same applies here. I think the filestore.force-sync option can only affect the filesystem-level cache and the hardware cache of an HDD (the controller's onboard RAM cache), and is not relevant for pieces held in the Storj write buffer.

P.S.
A bug report for canceled uploads being saved to blobs (and instantly becoming uncollected garbage) was created by another user (link in @Alexey's post above). But I have now added my data there too, and I think we should continue discussing this particular issue there.

I did not see and did not know when the default values changed, because I changed them to my own values a long time ago (currently it's 2.5 MiB on one of the nodes and 1 MiB on the others).

But I have also seen before that with any upload, regardless of the size of the piece being uploaded, the maximum amount of memory is allocated immediately. This can cause very high memory consumption with a large number of simultaneous connections sending lots of small pieces. I have seen up to 5 GB of RAM usage and about 1000 open connections with ingress traffic below 100 Mbit/s during SLC test spikes, with a 2.5 MiB write-buffer-size in the config. It can easily crash a lot of "potato" nodes, or get them killed by OOM.

This looks like something that should be framed as a suggestion for improvement in the appropriate section of the forum, or directly on GitHub. Are there any volunteers?

Draft:
If the upload size is known to the node in advance (when the upload starts), then it is simple: reserve exactly as much RAM as is needed to store the whole piece (but not more than the amount specified in filestore.write-buffer-size:).
If it is unknown in advance (I think this is the current case?), it is probably worth allocating memory in medium-sized chunks to avoid wasting RAM on small pieces.

For example, in steps of 128 KiB (the former default value) or a comparable amount, until one of two conditions is met: either the upload ends or the filestore.write-buffer-size limit is hit.
Allocating memory in very small chunks is not great either, because memory in the OS can ALSO become severely fragmented by tons of allocations (this fragmentation is not as bad as file fragmentation on disk, but it can still affect performance), and each allocation call carries overhead.
So we need to find some kind of compromise between the number of RAM allocations and the average memory overuse by write buffers caused by the small average piece size.
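Something along these lines; a purely illustrative sketch of the proposal (made-up names and constants, not an actual storagenode patch):

```go
// Sketch of the proposed chunked allocation: grow the buffer in 128 KiB steps
// instead of allocating the full write-buffer-size up front, capping growth
// at the configured limit. Illustration only, not storagenode code.
package main

import "fmt"

const (
	allocStep = 128 * 1024      // former default, used here as the growth step
	bufferCap = 4 * 1024 * 1024 // stand-in for filestore.write-buffer-size
)

type growableBuffer struct {
	data []byte
}

// write appends upload data, enlarging capacity in allocStep increments
// until bufferCap is reached; beyond that the remainder would spill to disk.
func (b *growableBuffer) write(p []byte) (spill []byte) {
	for len(p) > 0 {
		free := cap(b.data) - len(b.data)
		if free == 0 {
			if cap(b.data) >= bufferCap {
				return p // no room left in RAM: remainder goes to disk
			}
			newCap := cap(b.data) + allocStep
			if newCap > bufferCap {
				newCap = bufferCap
			}
			grown := make([]byte, len(b.data), newCap)
			copy(grown, b.data)
			b.data = grown
			continue
		}
		n := free
		if n > len(p) {
			n = len(p)
		}
		b.data = append(b.data, p[:n]...)
		p = p[n:]
	}
	return nil
}

func main() {
	var b growableBuffer
	b.write(make([]byte, 150*1024)) // a ~150 KiB piece
	fmt.Printf("buffered %d bytes using %d bytes of capacity\n", len(b.data), cap(b.data))
}
```

For a ~150 KiB piece this ends up reserving 256 KiB (two steps) instead of the full 4 MiB buffer.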

3 Likes

Both cases are true at the moment; it depends on the client. Some will report how much they want to allocate, some don't, and then you will only know when they have finished. But they may also cancel the upload in the middle without letting the storagenode know, so it will end with a timeout or a context cancellation down the road.

Hey M_M… sorry, I'm just going to stream-of-consciousness here; there'll be rambling, undoubtedly, but hopefully in the end a little helpful. I have to finish last week's projects before the weekend.

I know they changed it because that write-buffer-size config line specifically dropped out of the .yaml on new installs at some version or another… say, maybe last summer.
I haven't checked a --help query on the binary, but shouldn't it show the current defaults, if they're not too lazy to update them?
I agree the write-buffer-size could certainly use some augmentation; making it more dynamic would be advantageous and should be prioritized.
Of note, after the exodus of the abuse of free accounts (probably used for Chia), the average file size had gotten up to over 700 KB, by my notes. By contrast, I think the current average is ~250 KB/2272 for most customer data; and having just checked an .eu node, it was about a 530 KB average file size, which I suspect will decline as purging of old accounts there continues…

Moreover, it seems they're down to a 150-170 KB average file size expectation with this test data, indicative of a very 'live', low-TTL operational preference. Yet the max is the max, that's obvious. This is probably a good representation of a predominant archival bias: 64 MB segments (64 MB / 29), i.e. predominantly ~2.21 MB piece files. So I think that generalized S3 bucket/chunk-size MAX should be the basis for a more dynamic solution, as it was obvious enough to you and me to change the buffer size to a more efficient allocation: in my case not 4 MB, like I think the default is now, but 2272.0 KiB, or your 2.5/1 MB as described. Therefore, I suggest we specifically add the 2272.0 KiB to that draft.

2 cents for today
P.S. I would and will offer more, but I find myself having to reconfigure a PB of stuff to accommodate this new potential reality… which ain't easy, plus a life. Nonetheless,
I'm excited to engage further, as I've been a programmer since before the Z80/A & 6502 days - lol.

I've noted blobs in excess of 2.2 MB - very few. What is the max chunk size? It should also be considered what final and/or dynamic Reed-Solomon variable set is used. Will that be variable per satellite, or is it locked down by the testing yet - any insights? I have noted Littleskunk testing various permutations. Is adding an RS header to a direct-connect/S3 upload possible, for use with the aforesaid dynamic write allocation?

Thanks,
2 of 2 cents for today :slight_smile:

If you specified some value in the config.yaml, why should the node reset it? That would mean you could never provide anything in your config file. The node has no idea which options were defaults.
But you may submit a PR if you know how to do that; we are glad to accept a Community contribution!

unlimited, but you may check yours

It's not exactly the answer, but you may see which ones are stored on your node.

With a different YAML structure in the config.yaml, this should be possible, I believe.

Only if the parameter got renamed.

What size do you mean? The max segment size is still 64 MB; the max piece size should depend on RS settings, which have been changed several times recently.

Yes. But. It's not hard-coded. Right now the maximum possible segment size is 64 MiB; with the 80/29 setting it's probably 64 MiB/80 or 64 MiB/29. But I wouldn't confirm that's the maximum possible size for a piece…
Especially if we want to change RS settings almost dynamically.


I do recall pieces of ~8 MB being uploaded briefly as part of tests.

That's correct; yes, of course the binary's defaults should not supersede the configuration file's user variables. What I meant was that when querying the binary with a '--help' command, normally all defaults of the specific parameters are displayed alongside, not unlike the hashtag commentary in the config.yaml file. But sometimes programmers are lazy and forget to update help/usage responses when changing/obfuscating/etc. parameters, n' stuff.

Thx for the other info too, Alexey.

1 Like

OK… so, in essence, 'Now testing RS numbers 16/20/30/60 with bestofn n=1.9' == 64 MiB segment / 16 pieces = 4 MiB blobs, ergo the max memory needed. And in the case of 8 MB blobs, if those were max 64 MiB segments, that'd be only 8 pieces. Obviously 16 is a better divisor to match cluster sizes; I wonder why they ever started with 29? 32 would have seemed more rational. Anyway, back to my original thought: if the node can somehow be informed of the RS numbers used for the piece it's about to receive, it could react with an exactly sized memory allocation, or a run-length encoding of partials (in-stream) as a look-ahead, only needing to change when a new RS value is received; although, as M_M pointed out, fragmenting memory might impose inefficiencies.
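A quick back-of-the-envelope check of that arithmetic (my own scratch math, ignoring the piece header and any parity overhead):

```go
// Back-of-the-envelope piece-size arithmetic for a few RS "k" values,
// ignoring the piece header and any parity overhead.
package main

import "fmt"

func main() {
	const segment = 64 << 20 // 64 MiB maximum segment size
	for _, k := range []int{16, 29, 80} {
		fmt.Printf("k=%2d -> max piece ≈ %.2f MiB\n", k, float64(segment)/float64(k)/(1<<20))
	}
	// Prints: k=16 -> 4.00 MiB, k=29 -> 2.21 MiB, k=80 -> 0.80 MiB
}
```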
Ramble, ramble… rather than speculate, I should prolly go choke on some code… lol
1 & 1/2 cents

One thing is that a piece file also has a 512-byte header on top of the actual data, which already breaks your nice round numbers. Second, a significant majority of segments are not exactly 64 MiB. Third, the parity algorithm used by Storj also has some small overhead. There is no point in trying to exactly align the RS numbers.
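For example (just plugging in the numbers): a nominal 64 MiB / 16 = 4 MiB piece would land on disk as 4,194,304 + 512 = 4,194,816 bytes, so even the "clean" case is already not a round number.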

3 Likes