No, this node was not killed; at the moment it continues to run without restarts.
But your wording gave me an idea: this behavior may simply be due to the large in-memory cache. Files are not written to disk during the upload process (they fit completely in RAM, so there is no need to write to disk yet). When an upload is canceled, the piece of code you mentioned has nothing to delete from disk (the content of the file still exists only in RAM), and after that code has run, the cache holding the canceled upload is later flushed to disk instead of being discarded.
Now it looks like a potential bug; tomorrow I will collect some more test data and file a new bug report on GitHub.
It is up to the operating system to manage the filesystem cache. I don't think any operating system has such a race condition that would allow the delete operation to bypass the filesystem cache.
Maybe test with async disabled (filestore.force-sync: true)… see if the behavior is reproduced. Perhaps they just changed the working dir from .temp to the final piece ID destination to assemble the files there instead, then forgot to subsequently redirect the .temp move and/or delete routine to the new async-'on' piece ID location.
Also of note, I believe they changed the default write buffer from 128 KiB to 4 MiB (with the ~152 KB average file size lately for test data, that's roughly 3.85 MiB of RAM wasted per piece). At test-level bandwidth saturation this can often cause a race condition, overwhelming Windows memory management and abruptly killing a node (at least as noted on Windows Server, with its lower memory footprint). Changing the config obviously fixes that runaway condition, either by bandwidth restriction or by setting filestore.write-buffer-size: 128.0 KiB. However, I haven't yet noticed others posting about problems that aren't already logged.
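For anyone who wants to try the settings mentioned above, a sketch of the relevant config.yaml lines (the values shown are illustrative choices, not official recommendations):

```yaml
# config.yaml — example values, adjust to your node
# revert the write buffer to the old 128 KiB default:
filestore.write-buffer-size: 128.0 KiB
# force synchronous writes to disk (disables async buffering):
filestore.force-sync: true
```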
Another interesting thing to check, which was my initial concern: whether, in addition to uploaded-but-canceled pieces, we also have .partial files that used to live in the temp folder. I haven't seen any yet, but maybe I just missed them.
I didn't mean the OS filesystem-level cache, but the internal cache implemented in the storagenode process itself, which is controlled via this parameter in the node config: filestore.write-buffer-size:
It is designed to store (buffer) partially uploaded pieces before writing them to the filesystem, to reduce disk I/O.
Although I should have used the term "buffer" instead of "cache" to avoid possible confusion with the OS filesystem cache.
The same applies here. I think the filestore.force-sync option affects only the filesystem-level cache and the HDD's hardware cache (controller onboard RAM cache), and is not relevant for pieces stored in the Storj write buffer.
P.S.
A bug report for canceled uploads being saved to blobs (and instantly becoming uncollected garbage) was already created by another user (link in @Alexey's post above). But I have now added my data there too, and I think we should continue discussing this particular issue there.
I did not notice when the default values changed, because I changed them to my own values a long time ago (currently it's 2.5 MiB on one of the nodes and 1 MiB on the others).
But I have also seen that with any upload, regardless of the size of the piece being uploaded, the maximum buffer size is allocated immediately. This can cause very high memory consumption with a large number of simultaneous connections sending many small pieces. I have seen up to 5 GB of RAM usage and about 1000 open connections with ingress traffic below 100 Mbit/s during SLC test spikes, with a 2.5 MiB write-buffer-size in the config. It can easily crash a lot of "potato" nodes or get them killed by OOM.
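Rough back-of-the-envelope arithmetic for the numbers above (this assumes every open connection immediately reserves the full write buffer, as described; it is not a measurement):

```python
# Estimate RAM consumed by write buffers alone, assuming each of the
# ~1000 open connections reserves the full 2.5 MiB buffer up front.
connections = 1000
buffer_bytes = int(2.5 * 1024 * 1024)   # 2.5 MiB write-buffer-size

total_gib = connections * buffer_bytes / 2**30
print(round(total_gib, 2))              # 2.44 -> ~2.4 GiB from buffers alone
```

The observed ~5 GB suggests the write buffers are only part of the footprint (Go runtime, connection state, and other allocations make up the rest).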
It looks like something that should be framed as an improvement suggestion in the appropriate section of the forum, or directly on GitHub. Are there any volunteers?
Draft:
If the upload size is known to the node in advance (when the upload starts), then it's simple: reserve exactly as much RAM as needed to store the whole piece (but no more than the amount specified in filestore.write-buffer-size:).
If it is unknown in advance (I think this is currently the case?), it is probably worth allocating memory in medium-sized chunks to avoid wasting RAM on small pieces.
For example, in steps of 128 KB (the former default value) or a comparable amount, until one of the conditions is met: either the end of the upload, or the filestore.write-buffer-size limit is hit.
Allocating memory in very small chunks is also not great, because memory in the OS can ALSO become severely fragmented by tons of allocations (this fragmentation is not as bad as file fragmentation on disk, but it can still affect performance), and each allocation call carries overhead.
So we need to find a compromise between the number of RAM allocations and the average memory overuse by write buffers due to the small average piece size.
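A minimal sketch of the stepwise-allocation idea from the draft (the class, its names, and the 128 KiB step are my illustration, not the actual storagenode code):

```python
STEP = 128 * 1024               # grow the buffer in 128 KiB increments (former default)
LIMIT = 4 * 1024 * 1024         # cap at filestore.write-buffer-size (4 MiB here)

class GrowableWriteBuffer:
    """Buffers an upload in RAM, reserving capacity in STEP-sized chunks up to LIMIT."""
    def __init__(self, step=STEP, limit=LIMIT):
        self.step = step
        self.limit = limit
        self.buf = bytearray()  # data received so far
        self.capacity = 0       # RAM reserved so far

    def write(self, chunk: bytes) -> int:
        needed = len(self.buf) + len(chunk)
        # reserve more capacity only when the incoming data outgrows it
        while self.capacity < needed and self.capacity < self.limit:
            self.capacity = min(self.capacity + self.step, self.limit)
        self.buf.extend(chunk)
        # anything beyond LIMIT would be flushed to disk in a real node
        return self.capacity

buf = GrowableWriteBuffer()
buf.write(b"x" * 100_000)       # small piece: only one 128 KiB step is reserved
print(buf.capacity)             # 131072
```

This way a canceled 100 KB upload ties up 128 KiB of RAM instead of the full 4 MiB, at the cost of a few extra reservation steps for large pieces.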
Both cases are true at the moment; it depends on the client. Some will report how much they want to allocate; some don't, and you will know only when they are finished. But they may also cancel the upload in the middle without letting the storagenode know, so it will end with a timeout or a context canceled down the road.
Hey M_M… Sorry, I'm just going to stream-of-consciousness here; there will undoubtedly be rambling, but hopefully it will be a little helpful in the end. I have to finish last week's projects before the weekend.
I know they changed it because that write-buffer-size config line specifically dropped out of the .yaml on new installs at some version or another… maybe last summer.
I haven't checked a --help query on the binary, but shouldn't it print the current defaults, if they weren't too lazy to update it?
I agree the write-buffer-size could certainly use some improvement; making it more dynamic would be advantageous and should be prioritized.
Of note, after the exodus of free-account abuse (probably used for Chia), the average file size had gotten up to over 700 KB, by my notes
(by contrast, I think the current average is ~250 KB/2272 for most customer data; having just checked an EU node, it was about 530 KB average file size,
which I suspect will decline as purging of old accounts there continues…). Moreover, they seem to be down to a 150-170 KB average file size expectation with this test data,
indicative of a very "live" and low-TTL operational preference. Yet the max is the max, that's obvious.
This is probably a good representation of a predominant archival bias: 64 MiB chunks (64 MiB segments / 29), i.e. predominantly ~2.21 MiB files.
So I think that generalized S3 bucket/chunk max size should be the basis for a more dynamic solution, as it was obvious enough to you and me
to change the buffer-size to a more efficient allocation: in my case not 4 MiB, as I think the default is now, but 2272.0 KiB, or your 2.5/1 MiB, as described.
Therefore, I suggest we specifically add the 2272.0 KiB to that draft.
2 cents for today
P.S. I would and will offer more, but I find myself having to reconfigure a PB of stuff to accommodate this new potential reality… which ain't easy, and there's life. Nonetheless,
I'm excited to engage further, as I've been a programmer since before the Z80/a & 6502 days - lol.
I've noted blobs in excess of 2.2 MiB - very few. What is the max chunk size? Also, it should be considered which final and/or dynamic Reed-Solomon variable set is used. Will that be variable per satellite, or is it locked down by the testing as yet - any insights? I have noted Littleskunk testing various permutations. Is adding an RS header to a direct-connect/S3 upload possible, for use with the aforesaid dynamic write allocation?
If you specified some value in the config.yaml, why should the node reset it? That would mean you could never provide anything in your config file. The node has no idea which options were defaults.
But you may submit a PR if you know how to do that; we are glad to accept a Community contribution!
Unlimited, but you may check yours.
It's not exactly the answer, but you may see which ones are stored on your node.
Yes. But. It's not hard-coded. Right now the maximum possible segment size is 64 MiB; with the 80/29 setting it's probably 64 MiB/80 or 64 MiB/29. But I wouldn't confirm that's the maximum possible size for a piece…
Especially if we want to change RS settings almost dynamically.
That's correct; yes, of course the binary's defaults should not supersede the configuration file's user variables. What I meant was that when querying the binary with a '--help' command, normally all defaults for specific parameters will be displayed alongside, not unlike the hashtag commentary in the config.yaml file. But sometimes programmers are lazy and forget to update help/usage responses when changing/obfuscating/etc. parameters, 'n' stuff.
OK… so, in essence, "Now testing RS number 16/20/30/60 with bestofn n=1.9" == 64 MiB segment / 16 pieces = 4 MiB blobs, ergo max memory needed. And in the case of 8 MiB blobs, if those were max 64 MiB segments, that'd be only 8 pieces. Obviously 16 is a better divisor to match cluster sizes; I wonder why they ever started with 29? 32 would have seemed more rational. Anyway, back to my original thought: if the node could somehow be informed of the RS used for the piece it's about to receive, it could react with an exactly sized memory allocation, or a run-length encoding of partials (in-stream) as a look-ahead, only needing to change when a new RS value is received; although, as M_M pointed out, fragmenting memory might impose inefficiencies.
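To sanity-check the arithmetic above (pure back-of-the-envelope, assuming a full 64 MiB segment and ignoring headers and parity overhead):

```python
SEGMENT = 64 * 1024 * 1024      # max segment size: 64 MiB

def piece_size(k: int) -> int:
    """Data bytes per piece when a segment is split into k data pieces."""
    return SEGMENT // k

print(piece_size(29))           # 2314098 bytes ≈ 2.21 MiB (current 29-piece setting)
print(piece_size(16))           # 4194304 bytes = exactly 4 MiB (the 16/20/30/60 test)
print(piece_size(8))            # 8388608 bytes = 8 MiB
```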
ramble ramble… Rather than speculate, I should probably go choke on some code… lol
1 & 1/2 cents
One thing is that a piece file also has a 512-byte header on top of the actual data, which already breaks your nice round numbers. Second, a significant majority of segments are not exactly 64 MiB. Third, the parity algorithm used by Storj also has some small overhead. There's no point in trying to exactly align RS numbers.
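A quick illustration of the first point (the 512-byte header size comes from the post above; the rest is my arithmetic):

```python
HEADER = 512                    # per-piece header, on top of the erasure-coded data
SEGMENT = 64 * 1024 * 1024      # max segment size: 64 MiB

# a "nice" 16-piece split gives exactly 4 MiB of data per piece...
data = SEGMENT // 16
print(data)                     # 4194304

# ...but the file on disk is data + header, no longer a power of two
on_disk = data + HEADER
print(on_disk)                  # 4194816
print(on_disk % 4096)           # 512 -> no longer aligned to a 4 KiB cluster
```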