Disk usage discrepancy?

I’ve been here since 2020 and I think I’ve already managed to forget a lot of problems :smiley:

Since the recent changes in payouts, where you get $12 for a full 8 TB drive and, if you are lucky, an extra $1-2 for traffic - indeed. The only way to get above the electricity bill is to put in larger and larger drives and wait.
Or… avoid those man-hours of skilled software engineer time that nobody will pay you for keeping your node working, delete all of it, and have free hardware for something else and peace of mind.

1 Like

Apparently there are more changes planned to improve the load on the drives.
One is to improve the trash folder cleanup, as that apparently still runs at standard I/O priority, periodically reading all the files in the trash folder to determine which ones to delete. You can imagine that running standard node I/O, the filewalker, and trash cleanups at the same time is really too much for a standard HDD.
There are also further changes, I believe already in progress, regarding the satellite cleanups that may be causing some of these discrepancies we see.
And the discrepancies between the satellite- and node-reported space were, I think, also partially caused by bloom filters, which were either not being distributed from some satellites for some time or were not big enough to clean up the majority of the deleted data.

1 Like

Yes, and all of them run over and over again from the beginning after a restart.

1 Like

Except I believe the GC filewalker. Once that is interrupted it won’t restart after the node is restarted.
And thinking about this, it might also be one of the causes of these satellite vs. node reported space discrepancies - the GC filewalker never finishes because a new update is pushed, causing the node to restart.
I have recently modified the bash script I use so that it does not update the node if there are multiple storagenode processes running, since in the case of the lazy filewalker a new process is spawned to do the filewalk. But the question is whether the storagenode updater does something similar, or whether it simply pushes the update and restarts the node in the middle of a GC, pretty much interrupting it until a new bloom filter is distributed a week later.

On Linux, the storagenode is using the OS-provided interfaces for traversing a filesystem: opendir() and readdir() and stat(). This is essentially the same way that OS tools (e.g., du) do it. I’m not aware yet of anyone demonstrating that du -s is significantly faster than the non-lazy storagenode filewalker, under the same load conditions. If that ends up being the case, there may be some deeper magic we can leverage (i.e., maybe if we go lower-level than the Golang standard library and use openat and fstatat directly it would make a big difference? Or maybe reimplement os.(*File).ReadDir() so we can tune buffer sizes to getdents64()?).
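If anyone wants to run that comparison themselves, a minimal sketch like the one below (not storagenode code; point it at your blobs directory) walks a tree through the Go stdlib with one stat per regular file, which can be timed against `du -s` on the same path under the same load:

    // walk.go: minimal sketch for comparing Go stdlib traversal with `du -s`.
    // Not storagenode code; it only mirrors the opendir/readdir/stat pattern.
    package main

    import (
        "fmt"
        "io/fs"
        "os"
        "path/filepath"
        "time"
    )

    func main() {
        if len(os.Args) != 2 {
            fmt.Fprintln(os.Stderr, "usage: walk <dir>")
            os.Exit(1)
        }
        root := os.Args[1]
        start := time.Now()
        var files, total int64

        err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
            if err != nil {
                return err
            }
            if d.Type().IsRegular() {
                info, err := d.Info() // one lstat per entry, like the non-lazy filewalker
                if err != nil {
                    return err
                }
                files++
                total += info.Size()
            }
            return nil
        })
        if err != nil {
            fmt.Fprintln(os.Stderr, err)
            os.Exit(1)
        }
        fmt.Printf("%d files, %d bytes, elapsed %s\n", files, total, time.Since(start))
    }

Running `time du -s <dir>` right before or after it gives a rough apples-to-apples number, as long as the cache conditions are comparable.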

If your problem is only with the lazy filewalker: if the lazy filewalker takes more than, say, 5x the time of the non-lazy one then your system is underprovisioned. It does not have enough I/O overhead to support the number of nodes it is running.

On Windows, we may be doing all sorts of things wrong. We have a few engineers who are experienced Windows programmers, but I don’t think we’ve had time for them to take a close look at file traversal and see what we could be doing to make it faster. Or, yes, if we can’t reach the same performance as built-in tools (which seems unlikely), then we could shell out to those tools and let them do the work instead.

I don’t know of any code in the storagenode that purports to prohibit the use of the filesystem cache. It’s possible Golang’s stdlib is doing something, but that seems like it would be absurd. Do you have any hard data you can share indicating this problem so we can examine further?

2 Likes

But why does it have to restart from scratch whenever the node restarts?
Why not save state and let it resume where it stopped, so it can finish at least some of the time?

I didn’t say that. However I can say that

is ignorant.

Sorry, not sure I understand you. I didn’t pay for development of the current solution, I’m just a node operator.

I know you are trying to mock Storj developers, but you shouldn’t. It’s them who made it work as a network, that’s a big achievement already.

The way you refer to Storj developers, it sounds like you’re new here. It would help a lot if you actually acted like wanting discussion.

One thing that the lazy filewalker broke is that before, if a file was to be moved/removed, this was done immediately after reading its direntry and inode. As such, these structures were still in cache and did not have to be re-read to actually move/remove the file. Now, however, a low-memory system has to first read all the inodes and only then move/remove the files, by which time their filesystem structures could already have been evicted from the memory cache.

Another obvious one is that previously if GC was interrupted, at least some files were already moved/removed. Now we need to wait until the whole reading phase is finished.

I wrote down one suggestion here.

Another suggestion would be to give up on the temp directory and write uploads directly to the blobs directory. This saves 2 synchronous writes (plus some reads as well) to directories, plus at least ext4 will try to place the inode close to the blobs subdirectory, which may later save on seeks for both downloads and file walkers.

Though, better returns would come from giving up on using one file per piece.

Out of curiosity, are there any statistics on how much of the storage is hosted on Windows nodes? @Alexey suggested in the past that we can establish a rough proportion of Windows to non-Windows nodes by looking at the number of downloads on Github, but it is probably also useful to look at the total storage of Windows vs. non-Windows nodes.

Ah, if we’re talking about the manipulation of the files, then yes, as you rightly point out, we have lots of room for improvement. I was only talking above about the directory traversal piece by itself.

Your suggestions are good ones and some we’ve talked about doing before. Some are harder than others. For example, giving up on the temp dir would mean we need some other way to differentiate partial uploads from full, intact blobs. We could try to use a database to keep track of which files are partial and which are not, but we’ve had lots of problems depending too much on sqlite databases or other single-file dbs. We could use different filenames, but then the rename() step still needs to do multiple writes (it does a link and unlink, unless Linux is able to overwrite a dirent with a new filename now). Also the directories would have a lot more entries in them to go through, which could hurt the performance of GET operations.
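Just to make the "different filenames" variant concrete, here is a hedged sketch (illustrative names only, not how the storagenode actually does or would do it): the upload is written to a `.partial` name inside the final directory and renamed into place once complete, so anything still carrying that suffix is known to be an incomplete blob:

    // Illustrative only: mark in-progress uploads with a ".partial" suffix in
    // the final blob directory, then rename() into place when complete. After
    // a crash, only ".partial" files are left behind for a cleanup pass to
    // recognize and delete.
    package blobsketch

    import "os"

    func commitBlob(finalPath string, data []byte) error {
        partial := finalPath + ".partial"

        f, err := os.OpenFile(partial, os.O_WRONLY|os.O_CREATE|os.O_EXCL, 0o644)
        if err != nil {
            return err
        }
        if _, err := f.Write(data); err != nil {
            f.Close()
            return err
        }
        if err := f.Sync(); err != nil { // the synchronous write discussed above
            f.Close()
            return err
        }
        if err := f.Close(); err != nil {
            return err
        }
        // Same-directory rename: still a directory update, but no move out of
        // a separate temp/ tree, so the inode stays where it was first placed.
        return os.Rename(partial, finalPath)
    }

The directory-entry concern from above still applies, of course: every in-flight upload adds one more entry to the blobs directory it targets.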

This is probably true, but we haven’t yet found the ideal way to do this that (a) doesn’t involve basically reimplementing the filesystem; (b) doesn’t hurt performance (both TTFB and throughput) for GET operations; and (c) still provides ranged reads and so on. We’ve had some promising results with experiments using LSM tree storage, but there are still big challenges to get over.

Regardless of all that, there’s a lot we can do, and as you note, it’s just a matter of assigning resources to the problem.

I think this information can be put together; nodes send build hashes with their telemetry data. If we collected all the hashes of released builds together with the target platforms, we’d at least have a good idea of the percentage of Windows nodes. But that would take a nontrivial amount of work. I don’t think we have anything better available, at least within the sphere of the project I’m familiar with. Downloads on Github is probably as good as anything. That would be one of the first things to fix once someone can be dedicated to improving node performance.

3 Likes

Can’t we just add an ID tag, in the data the node sends to the sats on restart, which defines the OS of the system? Like: 1 for Windows, 2 for Linux, etc. You could even go further and make unique tags for filesystems, or Linux flavours.
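For illustration only (the CheckInInfo struct below is hypothetical, not the real check-in message): the Go runtime already knows the platform, so attaching it to whatever the node sends at check-in would be cheap:

    // Hypothetical sketch of the suggestion above; CheckInInfo is illustrative
    // and not the actual protocol message.
    package sketch

    import "runtime"

    type CheckInInfo struct {
        Version         string
        OperatingSystem string // e.g. "windows", "linux", "darwin"
    }

    func newCheckInInfo(version string) CheckInInfo {
        return CheckInInfo{
            Version:         version,
            OperatingSystem: runtime.GOOS, // provided by the Go runtime for free
        }
    }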


I didn’t say that. However I can say that

is ignorant.

Maybe it is. But using a simple folder/file structure with millions of small files is ignorant too.
Developers have been at this point in history before. Solutions like this were used before people discovered databases. Why do you think records in databases aren’t just small files with the record data? This could have been a good temporary solution in 2020, but it’s been 3 years.

Sorry, not sure I understand you. I didn’t pay for development of the current solution, I’m just a node operator.

Sorry, I guess I didn’t understand the message here either.

Avoiding optimization to keep compatibility with a few percent or less of the hardware?

Not that I would ever run something that is supposed to run reliably on Windows,
but you write that instead of focusing on optimizing performance on the most popular consumer OS, you spent time making the program run on some archaic NAS? :smiley:

I know you are trying to mock Storj developers, but you shouldn’t. It’s them who made it work as a network, that’s a big achievement already.

No no, I’m sure the developers did all they could in the situation they were put in.

Projects fall apart for two reasons: poor developers make an even worse project, or the people who manage the project screw up. I have no basis for accusations against anyone.
But from my point of view… in 2020 it was a nice, promising project and almost finished. Now, after 4 years of developers working on it, what do I have? For keeping hardware at 99%+ uptime and many TB of data on disk: 2-3 packs of cigarettes per month, before costs are counted.

The way you refer to Storj developers, it sounds like you’re new here. It would help a lot if you actually acted like wanting discussion.

People on the developers’ side of the forum have helped operators countless times.
It’s just that after all this effort in my opinion it’s not a success. And certainly not for node operators.

And I didn’t mean to offend anyone, it’s just that sometimes I get easily wound up when I see the clusters of words typical of corpo brainwashing, like “man-hours of a skilled software engineer time”.

1 Like

Oh, I didn’t know there was a need to do this. What do nodes need this information for?

I have some sort of a draft of an idea here, though it makes some simplifying assumptions. The largest one is that it’s fine to keep partial uploads in RAM, as opposed to trying to write them to storage immediately. I guess I need to finally write it down. It would indeed require more engineering work, though.
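Roughly, it looks like this hedged sketch (illustrative names; error handling and limits omitted): the piece stays in a memory buffer while the upload is in flight and is only written to its final location once it is committed, so an aborted or crashed upload never touches the disk:

    // Hedged sketch of the "partial uploads in RAM" idea; not a real design.
    package sketch

    import (
        "bytes"
        "os"
    )

    type pendingPiece struct {
        buf bytes.Buffer // bounded by the max piece size, roughly 2 MB today
    }

    func (p *pendingPiece) Write(data []byte) (int, error) {
        return p.buf.Write(data)
    }

    // Commit writes the completed piece to its final path in one go.
    func (p *pendingPiece) Commit(finalPath string) error {
        return os.WriteFile(finalPath, p.buf.Bytes(), 0o644)
    }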

Well, overly dramatizing doesn’t help the discussion. The uptime threshold is 60% monthly, and sorry, I don’t smoke, so I don’t know how expensive cigarettes are now, but I can say that Storj is still net-profitable on my setup.

Besides, after those 4 years, thanks to the work of Storj engineers we have a product able to compete on metrics and features, not a technical novelty that would otherwise be quickly forgotten as soon as the VC money runs out. This is not easy. I’ve worked for three startups in my career, and in each one we had a great core engine, but we failed to turn it into a product with real customers. This was achieved not just by engineering work on storage nodes, but also by working on things like user interface, payments, geofencing, TTFB/latency improvements, and scaling up satellite operations. Right now the storage node code is good enough for most node operators. It would indeed be nice if this piece of code were more polished, but it would be nicer to have more customers.

3 Likes

I would go a step further - as an SNO, I have essentially no incentive to have writes be synchronous. Letting all writes be async instead lets me win more upload races, and the penalty for crashing in the middle of one such write is… I lose a few blobs that were in transit? That wouldn’t be enough to fail an audit.

1 Like

It would be bad if an uplink requested a piece and the storagenode started streaming the data in the piece but stopped early because the piece wasn’t fully uploaded. The uplink would have to “rewind” at that point, request a new set of pieces from other nodes, and continue from that point. The node should only begin streaming the data if it has a fully intact piece.

The space-used filewalker also needs to be able to differentiate; partially-uploaded blobs that may never be completed should not count toward used space.

I suppose the “cleanup temp files” step that happens on startup would also need to be able to differentiate, and it would have to traverse the whole data hierarchy. That would be awful. I suppose that step needs to be rearchitected anyway, though.

Oh, terrific! The assumption about partial uploads in RAM is probably valid, at least as things stand now. The max piece size is something like 2MB when using the max segment size and current RS parameters, so lots and lots of concurrent uploads would fit in memory just fine. And throwing away temporary files on crash is entirely correct behavior, so that’s good. My only concern would be that that max size could change if/when we start using different RS parameters. Probably it won’t change so much that this wouldn’t be feasible.

3 Likes

@snorkel @jammerdan @Mad_Max
We read the header of each piece, where its metadata is stored, not only the size and update time from the filesystem’s metadata, so they are different metadata. We do use system functions too, though.
We didn’t implement anything special to evade the system cache; it’s just that Windows is very bad at caching, unlike Linux. The 4 kB block size is how the system function is implemented, we didn’t invent it, and there are currently no options to change it. On Linux it’s likely implemented differently and depends on the filesystem.
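A hedged sketch of what that means for the filewalker (the 512-byte header size below is an assumption for illustration, not the actual on-disk format): besides the stat() that a tool like du performs, each blob is also opened and a small header is read, which is extra I/O per piece:

    // Illustrative only: the used-space walk does not just stat() each file,
    // it also opens the blob and reads a small header holding piece metadata.
    package sketch

    import (
        "io"
        "os"
    )

    const assumedHeaderSize = 512 // assumption for illustration only

    func readPieceHeader(path string) ([]byte, error) {
        f, err := os.Open(path)
        if err != nil {
            return nil, err
        }
        defer f.Close()

        header := make([]byte, assumedHeaderSize)
        if _, err := io.ReadFull(f, header); err != nil {
            return nil, err // an extra read per piece, on top of the stat()
        }
        return header, nil
    }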

Your node will check in with the satellite every hour by default and send its allocated and used space, as well as signed orders; the satellite will report back the used space it has accounted according to those signed orders. On some schedule each satellite will send a bloom filter to nodes so they can move the garbage to the trash.
The node’s task is to run a filewalker and move the garbage to the trash according to this bloom filter. Then the retain process will remove the expired data from the trash.
If any of the filewalkers does not finish its work, the data in the databases will be wrong.
The known reason for a filewalker failing to finish is a slow disk subsystem. The reasons for a slow disk subsystem are:

  • fragmentation on NTFS,
  • using metadata-hungry filesystems like BTRFS or ZFS on a single drive without an SSD cache tier or at least a huge RAM cache,
  • slow SMR HDDs,
  • bad USB controllers/cables,
  • using network filesystems,
  • using a VM,
  • etc.

So, until the slowness of the disk subsystem is fixed, the problem will remain.
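For reference, the garbage-collection pass described above works roughly like this hedged sketch (illustrative types, not the actual retain code): pieces the bloom filter does not claim to contain, and that are old enough to be covered by the filter, are moved to the trash; the filter can produce false positives (“keep”) but never false negatives (“delete”):

    // Hedged sketch of the bloom-filter garbage collection described above.
    // Types and the moveToTrash callback are illustrative only.
    package sketch

    import "time"

    type PieceID [32]byte

    type Piece struct {
        ID      PieceID
        Created time.Time
    }

    type BloomFilter interface {
        Contains(id PieceID) bool
    }

    func retain(pieces []Piece, filter BloomFilter, createdBefore time.Time, moveToTrash func(Piece) error) error {
        for _, p := range pieces {
            if !p.Created.Before(createdBefore) {
                continue // uploaded after the filter was built; keep it
            }
            if filter.Contains(p.ID) {
                continue // probably still referenced on the satellite; keep it
            }
            if err := moveToTrash(p); err != nil {
                return err
            }
        }
        return nil
    }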

We used databases too, and when the database got corrupted (and our Node Operators are very professional at doing that :frowning: ), the node was disqualified, even if the actual data was intact. So, I wouldn’t suggest returning to databases, at least not to single-file databases like SQLite.

I think you can do this

$ storagenode setup --help | grep write-buf

      --filestore.write-buffer-size memory.Size                  in-memory buffer for uploads (default 128.0 KiB)
1 Like

What would that mean in the case of bloated temp folders that don’t get cleaned up: Large temporary folder - #28 by jammerdan
I think I’d rather have these zombie files on disk than in RAM.

Are you sure there is such a process? There is a huge old thread about the temp folder getting bloated.
I have experienced it myself: Large temporary folder - #28 by jammerdan
And I have just checked a random node of mine and I see partial files from November in the temp folder; they don’t get cleaned up even after a full restart of the node container. So AFAIK there is no process at all that takes care of those leftover files, and the temp folder still has to be cleaned by hand.

Zombie temp files remain because of interruptions, not because of how it works. My nodes have almost no old temp files, because my nodes don’t restart randomly.
So, if the node restarted and the temp files were stored in RAM, they would simply disappear.
By the way, you can try to use tmpfs to store temp files even now.

I’m not sure that this process exists.

That would be an improvement.

Then you may use tmpfs for the temp folder.
See

But how to use it for the temp folder only?

--mount type=tmpfs,destination=/app/config/storage/temp

1 Like