Two weeks working for free in the waste storage business :-(

No, this node was not killed; at the moment it continues to work without restarts.
But your wording gave me an idea: this behavior may simply be caused by the large in-memory cache. Files are not written to disk during the upload process (they fit entirely in RAM, so there is no need to write to disk yet). When the upload is canceled, the piece of code you mentioned has nothing to delete from disk (the file content still exists only in RAM), and later, whenever the cache containing the canceled upload is flushed, it is written to disk instead of being discarded.
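Just to illustrate the sequence I have in mind, a toy sketch (made-up names, not the actual storagenode code):

```go
// Toy illustration of the suspected ordering problem, NOT real storagenode code.
// The piece data sits only in an in-memory write buffer; the cancel path only
// removes on-disk temp data, so a later flush still materializes the piece.
package main

import "fmt"

type bufferedPiece struct {
	data     []byte // piece content, held in RAM only so far
	onDisk   bool   // nothing has been written to disk yet
	canceled bool
}

// cancel is how I imagine the cleanup working: it deletes disk artifacts,
// but if nothing was ever written, it has nothing to remove.
func (p *bufferedPiece) cancel() {
	p.canceled = true
	if p.onDisk {
		fmt.Println("removing partial file from disk")
	} else {
		fmt.Println("nothing on disk to remove")
	}
}

// flush models a later buffer flush that does not check the canceled flag,
// which would write the piece into blobs as instant garbage.
func (p *bufferedPiece) flush() {
	fmt.Printf("flushing %d bytes to blobs (canceled=%v)\n", len(p.data), p.canceled)
	p.onDisk = true
}

func main() {
	p := &bufferedPiece{data: make([]byte, 64*1024)}
	p.cancel() // upload canceled while the data is still only in RAM
	p.flush()  // suspected bug: buffer flushed instead of discarded
}
```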

Now it looks like a potential bug; tomorrow I will collect some more test data and file a new bug report on GitHub.

5 Likes

It is up to the operating system to manage the filesystem cache. I don't think any operating system has such a race condition that would allow the delete operation to bypass the filesystem cache.

3 Likes

I think you mean the temp directory.

3 Likes

Maybe test with async disabled (filestore.force-sync: true)… see if the behavior is reproduced. Perhaps they just changed the working dir from the temp directory to the final piece ID destination, to construct the files there instead, and then forgot to redirect the temp move and/or delete routine to the new async "on" piece ID location.

Also of note, I believe they changed the default write buffer from 128 KiB to 4 MiB (with the ~152 KB average file size lately for test data, that is roughly 3.85 MiB of RAM wasted per piece). At the rates of test bandwidth saturation this can often cause a race condition, overwhelming Windows memory management and abruptly killing off a node (at least as noted on Windows Server, with a lower memory footprint). Changing the config obviously fixes that runaway condition, either by bandwidth restriction or by setting filestore.write-buffer-size: 128.0 KiB. However, I haven't yet noticed others posting about problems that aren't logged yet.

2 cents

2 Likes

Another interesting thing to check, which was my initial concern:

whether, in addition to uploaded-but-cancelled pieces, we also have .partial files like those that used to be in the temp folder. I haven't seen any yet, but maybe I just missed them.

I didn't mean the OS filesystem-level cache, but the internal cache implemented in the storagenode process itself, which is controlled via this parameter in the node config:
filestore.write-buffer-size:
It is designed to store (buffer) partially uploaded pieces before writing them to the filesystem, to reduce disk I/O.
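Conceptually it behaves something like a sized bufio.Writer placed in front of the piece file; here is my rough sketch of the idea (not the real storagenode implementation):

```go
// Rough conceptual sketch of a per-upload write buffer, NOT the real storagenode code.
// Incoming piece data is accumulated in RAM up to writeBufferSize before any disk I/O.
package main

import (
	"bufio"
	"bytes"
	"io"
	"log"
	"os"
)

func storePiece(path string, upload io.Reader, writeBufferSize int) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()

	// All writes smaller than writeBufferSize stay in memory until Flush (or overflow).
	buf := bufio.NewWriterSize(f, writeBufferSize)
	if _, err := io.Copy(buf, upload); err != nil {
		return err
	}
	return buf.Flush() // only now does a small piece actually hit the disk
}

func main() {
	// Example: a 100 KiB "piece" with a 4 MiB buffer never touches disk until Flush.
	piece := bytes.NewReader(make([]byte, 100*1024))
	if err := storePiece("piece.tmp", piece, 4*1024*1024); err != nil {
		log.Fatal(err)
	}
}
```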

Although I should have used the term "buffer" instead of "cache" to avoid possible confusion with the OS filesystem cache.

The same applies here. I think the filestore.force-sync option can only affect the filesystem-level cache and the hardware cache of an HDD (the controller's onboard RAM cache), and is not relevant for pieces held in the Storj write buffer.

P.S.
A bug report for canceled uploads being saved to blobs (and instantly becoming uncollected garbage) was created by another user (link in @Alexey's post above). But I have now added my data there too, and I think we should continue discussing this particular issue there.

I did not see and did not know when the default values changed, because I changed them to my own values a long time ago (currently it's 2.5 MiB on one of the nodes and 1 MiB on the others).

But I have also seen before that with any upload, regardless of the size of the piece being uploaded, the maximum amount of memory is allocated immediately. This can cause very high memory consumption with a large number of simultaneous connections sending lots of small pieces. I have seen up to 5 GB of RAM usage and about 1000 open connections with ingress traffic below 100 Mbit/s during SLC test spikes, with a 2.5 MiB write-buffer-size in the config. It can easily crash a lot of "potato" nodes, or get them killed by OOM.

This looks like something that should be framed as a suggestion for improvement in the appropriate section of the forum, or directly on GitHub. Are there any volunteers?

Draft:
If the upload size is known to the node in advance (when the upload starts), then it is simple: reserve exactly as much RAM as is needed to store the whole piece (but not more than the amount specified in filestore.write-buffer-size:).
If it is unknown in advance (I think this is the current case?), it is probably worth allocating memory in medium-sized chunks to avoid wasting RAM on small pieces.

For example, in steps of 128 KiB (the former default value) or a comparable amount, until one of two conditions is met: either the upload ends or the filestore.write-buffer-size limit is hit.
Allocating memory in very small chunks is not great either, because memory in the OS can ALSO become severely fragmented by tons of allocations (this fragmentation is not as bad as file fragmentation on disk, but it can still affect performance), and each allocation call carries overhead.
So we need to find some kind of compromise between the number of RAM allocations and the average memory overuse by write buffers caused by the small average piece size.
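Something along these lines; a purely illustrative sketch of the proposal (made-up names and constants, not an actual storagenode patch):

```go
// Sketch of the proposed chunked allocation: grow the buffer in 128 KiB steps
// instead of allocating the full write-buffer-size up front, capping growth
// at the configured limit. Illustration only, not storagenode code.
package main

import "fmt"

const (
	allocStep = 128 * 1024      // former default, used here as the growth step
	bufferCap = 4 * 1024 * 1024 // stand-in for filestore.write-buffer-size
)

type growableBuffer struct {
	data []byte
}

// write appends upload data, enlarging capacity in allocStep increments
// until bufferCap is reached; beyond that the remainder would spill to disk.
func (b *growableBuffer) write(p []byte) (spill []byte) {
	for len(p) > 0 {
		free := cap(b.data) - len(b.data)
		if free == 0 {
			if cap(b.data) >= bufferCap {
				return p // no room left in RAM: remainder goes to disk
			}
			newCap := cap(b.data) + allocStep
			if newCap > bufferCap {
				newCap = bufferCap
			}
			grown := make([]byte, len(b.data), newCap)
			copy(grown, b.data)
			b.data = grown
			continue
		}
		n := free
		if n > len(p) {
			n = len(p)
		}
		b.data = append(b.data, p[:n]...)
		p = p[n:]
	}
	return nil
}

func main() {
	var b growableBuffer
	b.write(make([]byte, 150*1024)) // a ~150 KiB piece
	fmt.Printf("buffered %d bytes using %d bytes of capacity\n", len(b.data), cap(b.data))
}
```

For a ~150 KiB piece this ends up reserving 256 KiB (two steps) instead of the full 4 MiB buffer.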

3 Likes

Both cases are true at the moment; it depends on the client. Some will report how much they want to allocate, some don't, and then you will only know when they have finished. But they may also cancel the upload in the middle without letting the storagenode know, so it will end with a timeout or a context cancellation down the road.

Hey M_M… sorry, I'm just going to stream-of-consciousness here; there'll be rambling, undoubtedly, but hopefully in the end a little helpful. I have to finish last week's projects before the weekend.

I know they changed it because that write-buffer-size config line specifically dropped out of the .yaml on new installs at some version or another… say, maybe last summer.
I haven't checked a --help query on the binary, but shouldn't it show the current defaults, if they're not too lazy to update them?
I agree the write-buffer-size could certainly use some augmentation; making it more dynamic would be advantageous and should be prioritized.
Of note, after the exodus of the abuse of free accounts (probably used for Chia), the average file size had gotten up to over 700 KB, by my notes. By contrast, I think the current average is ~250 KB/2272 for most customer data; and having just checked an .eu node, it was about a 530 KB average file size, which I suspect will decline as purging of old accounts there continues…

Moreover, it seems they're down to a 150-170 KB average file size expectation with this test data, indicative of a very 'live', low-TTL operational preference. Yet the max is the max, that's obvious. This is probably a good representation of a predominant archival bias: 64 MB segments (64 MB / 29), i.e. predominantly ~2.21 MB piece files. So I think that generalized S3 bucket/chunk-size MAX should be the basis for a more dynamic solution, as it was obvious enough to you and me to change the buffer size to a more efficient allocation: in my case not 4 MB, like I think the default is now, but 2272.0 KiB, or your 2.5/1 MB as described. Therefore, I suggest we specifically add the 2272.0 KiB to that draft.

2 cents for today
P.S. I would and will offer more, but I find myself having to reconfigure a PB of stuff to accommodate this new potential reality… which ain't easy, plus a life. Nonetheless,
I'm excited to engage further, as I've been a programmer since before the Z80/A & 6502 days - lol.

I've noted blobs in excess of 2.2 MB - very few. What is the max chunk size? It should also be considered what final and/or dynamic Reed-Solomon variable set is used. Will that be variable per satellite, or is it locked down by the testing yet - any insights? I have noted Littleskunk testing various permutations. Is adding an RS header to a direct-connect/S3 upload possible, for use with the aforesaid dynamic write allocation?

Thanks,
2 of 2 cents for today :slight_smile:

If you specified some value in the config.yaml, why should the node reset it? That would mean you could never provide anything in your config file. The node has no idea which options were defaults.
But you may submit a PR if you know how to do that; we are glad to accept a Community contribution!

unlimited, but you may check yours

It's not exactly the answer, but you may see which ones are stored on your node.

With a different YAML structure in the config.yaml, this should be possible, I believe.

Only if the parameter got renamed.

What size do you mean? The max segment size is still 64 MB; the max piece size should depend on RS settings, which have been changed several times recently.

Yes. But. It's not hard-coded. Right now the maximum possible segment size is 64 MiB; with the 80/29 setting it's probably 64 MiB/80 or 64 MiB/29. But I wouldn't confirm that's the maximum possible size for a piece…
Especially if we want to change RS settings almost dynamically.


I do recall pieces of ~8 MB being uploaded briefly as part of tests.

That's correct; yes, of course the binary's defaults should not supersede the configuration file's user variables. What I meant was that when querying the binary with a '--help' command, normally all defaults of the specific parameters are displayed alongside, not unlike the hashtag commentary in the config.yaml file. But sometimes programmers are lazy and forget to update help/usage responses when changing/obfuscating/etc. parameters, n' stuff.

Thx for the other info too, Alexey.

1 Like

OK… so, in essence, 'Now testing RS numbers 16/20/30/60 with bestofn n=1.9' == 64 MiB segment / 16 pieces = 4 MiB blobs, ergo the max memory needed. And in the case of 8 MB blobs, if those were max 64 MiB segments, that'd be only 8 pieces. Obviously 16 is a better divisor to match cluster sizes; I wonder why they ever started with 29? 32 would have seemed more rational. Anyway, back to my original thought: if the node can somehow be informed of the RS numbers used for the piece it's about to receive, it could react with an exactly sized memory allocation, or a run-length encoding of partials (in-stream) as a look-ahead, only needing to change when a new RS value is received; although, as M_M pointed out, fragmenting memory might impose inefficiencies.
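A quick back-of-the-envelope check of that arithmetic (my own scratch math, ignoring the piece header and any parity overhead):

```go
// Back-of-the-envelope piece-size arithmetic for a few RS "k" values,
// ignoring the piece header and any parity overhead.
package main

import "fmt"

func main() {
	const segment = 64 << 20 // 64 MiB maximum segment size
	for _, k := range []int{16, 29, 80} {
		fmt.Printf("k=%2d -> max piece ≈ %.2f MiB\n", k, float64(segment)/float64(k)/(1<<20))
	}
	// Prints: k=16 -> 4.00 MiB, k=29 -> 2.21 MiB, k=80 -> 0.80 MiB
}
```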
Ramble, ramble… rather than speculate, I should prolly go choke on some code… lol
1 & 1/2 cents

One thing is that a piece file also has a 512-byte header on top of the actual data, which already breaks your nice round numbers. Second, a significant majority of segments are not exactly 64 MiB. Third, the parity algorithm used by Storj also has some small overhead. There is no point in trying to exactly align the RS numbers.
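For example (just plugging in the numbers): a nominal 64 MiB / 16 = 4 MiB piece would land on disk as 4,194,304 + 512 = 4,194,816 bytes, so even the "clean" case is already not a round number.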

3 Likes