Copying node data is abysmally slow

As long as they make it an option, it would be OK. If they disabled sync on uploads and didn’t make it configurable, that would be a problem. I would not be too happy running a node with sync disabled (unless they also disabled audits :slight_smile: ). Mounting the filesystem with the sync option could help, but then it would make every write synchronous, even those that do not need it.

image
I probably should add something here, maybe dispersion or the median size.

Can you please elaborate on why? What difference does it make whether writes are flushed right away or in a few seconds or minutes?

1 Like

Potential to lose data and fail audits if the server crashes. Normally, the node should synchronously write the file and only then report the upload as done, so that the satellite would not be thinking that my node has the file when it, in fact, does not.

If sync adds latency, well, that’s what SLOG is for.

If someone else wants to run with sync disabled, they can do that. It will either work great or I will get more repair traffic, which would also be good.

Right, but it’s not black and white, and the pros and cons should be justified.

How often does your server crash that this becomes a concern? If you lose a few kB or MB of data, it’s OK. That’s what the massive redundancy in the network is for. Essentially, losing data to a lost cache or bad blocks or bit rot here and there is accounted for and therefore harmless, inconsequential. Also, you are not instantly disqualified if you fail one audit. Seeing 100% in the dashboard all the time is not a goal :slight_smile:

So, benefits are questionable. But drawbacks — we are punishing everyone, artificially reducing performance of a node just to minimize impact of a rare and generally harmless event.

Sync is useful when data must be in a coherent state, e.g. databases. For a bunch of files on disk that represent erasure-coded pieces, I would not care at all.

2 Likes

Keep thinking like that. We get more repair traffic :slight_smile:

Databases use lots of IO but are small. Ideal use case for SSDs, that have become very cheap.

I would say, not likely at all. iXsystems seems to agree with me on that one and sells the TrueNAS Mini with only one drive for SLOG.

I think the way to go for STORJ is to store the DB on a sync file system, while the data itself should be async. Currently, my plan for my new setup is STORJ in a hypervisor VM with sync, and the data on a networked TrueNAS vdev with sync=disabled. That way I don’t even have to wait for STORJ to implement async, because TrueNAS lies to STORJ that the writes are synced.

1 Like

I know you are joking, but even then that repair will be a few megabytes, tops.

Storj themselves recommend against raid → no redundancy, no checksumming and repair → rot, bad block, data loss is an expectation.

2 Likes

Async writes are flushed every 5 seconds. That’s 3–5 files. ZFS lies to Storj, and Storj records those files as accepted. The server shuts down. You lose the files. Storj audits those files and disqualifies you. I get your repair traffic. All happy :slight_smile:

Do you have any source for the “fail once, get instantly disqualified” claim?

It does not help my node.

I may lose some data anyway because of other reasons, so why add one more way to lose data? My server does not crash often, but why take the chance? Especially when getting disqualified is expensive.

And that’s why I said it should be a setting. If people want to disable it - great, as long as I can keep it enabled. It either works great for them or I get more repair traffic, which is also good (for me).

Or it may be a lot if your node gets DQ.

As for the recommendations - well, if a node gets disqualified, the network just repairs the data and Storj does not care. The node operator might care that his node got DQ though.

You don’t fail once, you fail audits. Every failed audit drops your score, until you get disqualified at the end.

Why take the chance of losing data due to flooding? Did you waterproof your server cabinet? If not, was it because you think the chances of flooding are small? :kissing_heart:

Exactly. So how would I get disqualified because I have lost 5 seconds of writes?
Will the audits audit only these 10 MB multiple times?

You did not lose 5 seconds. You lost files. Files which can be audited.

Yes. If an audit fails, you lose score, and the audit will be repeated after some time. If the file is not restored, the score drops lower. And lower. And disqualification :slight_smile:
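The erosion described above can be sketched with a toy score model. The decay weight and the 0.6 disqualification threshold here are illustrative assumptions, not the satellite’s actual reputation parameters:

```python
# Simplified illustration of how repeated audit failures erode a score.
# LAMBDA and DQ_THRESHOLD are illustrative assumptions, not the
# satellite's actual parameters.
LAMBDA = 0.95        # weight given to the score history
DQ_THRESHOLD = 0.6   # assumed disqualification cutoff

def next_score(score: float, passed: bool) -> float:
    """Exponential moving average of audit outcomes (1 = pass, 0 = fail)."""
    return LAMBDA * score + (1 - LAMBDA) * (1.0 if passed else 0.0)

score, failures = 1.0, 0
while score > DQ_THRESHOLD:
    score = next_score(score, passed=False)
    failures += 1
print(f"disqualified after {failures} consecutive failed audits")
```

The point is that no single failure disqualifies a node; it is the accumulation of failures, weighted against history, that drags the score below the cutoff.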

I lost 5 seconds incoming data. Nothing more, nothing less.

If that is true, the audit system of STORJ is extremely stupid :joy:

You lost 5 seconds of data that was in RAM and that ZFS lied about having written. Nothing more, nothing less. But keep thinking wrong. I need repair traffic.

Half of the uploaded files are at or below 4kiB. ext4 pages are 4kiB (mostly relevant for storage of inodes and directory entries). IIRC sqlite uses 8kiB pages. The difference between 4kiB and 16kiB write on HDDs is pretty small, the seek time dominates, so up to 16kiB it’s more meaningful to talk about IOPS than bandwidth.
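To put numbers on why seek time dominates small writes, here is a rough service-time model for a single random HDD write. The seek time, spindle speed, and transfer rate are illustrative assumptions, not measurements:

```python
# Rough service-time model for one random write on a 7200 rpm HDD.
# All three figures below are illustrative assumptions.
SEEK_MS = 8.0                      # average seek time
ROTATION_MS = 60_000 / 7200 / 2    # half a revolution ≈ 4.17 ms
TRANSFER_MB_S = 150.0              # sequential transfer rate

def write_ms(size_kib: float) -> float:
    """Seek + rotational latency + transfer time, in milliseconds."""
    transfer_ms = size_kib / 1024 / TRANSFER_MB_S * 1000
    return SEEK_MS + ROTATION_MS + transfer_ms

for size in (4, 16, 64):
    print(f"{size:>3} kiB write: {write_ms(size):.2f} ms")
```

With these numbers a 16 kiB write takes well under 1% longer than a 4 kiB write, which is why IOPS, not bandwidth, is the right unit of cost for these operations.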

Just to show the complexity, at least the parts I am aware of, not claiming to be an expert in file system design. For a typical ext4 file system with an ordered journal, each upload involves following operations:

  • Creation of an inode for the file (4kiB write + another write for the ext4 journal entry).
  • Write of the file contents (size of a piece + 512 bytes for a header, rounded up to the next 4 kiB boundary). This is a mostly sequential write, likely not fragmented, so I assume the extent tree fits in the inode itself (best case).
  • Write of a directory entry in the temp/ directory (4kiB write + another write for the ext4 journal).
  • sync(), forcing all of the above onto the drive. Here the journal, inodes, directory entries and file contents are unlikely to be placed next to each other, as the journal and inodes are preallocated, and the directory will probably already be allocated somewhere too.
  • Rename from temp/ to actual storage directory, involving 2×4kiB directory entry writes (one removes from temp, another creates it in another place) + an 8kiB journal write for both sides. In case of some satellites, their storage directories are big enough to use an h-tree, making the update potentially require several page writes, but let’s consider the optimistic case here.
  • Update of the bandwidth.db database, so probably two 8kiB file writes (the transaction log and the main database file) + maybe extending the log file, so an update to the 4kiB transaction log file inode + maybe a 4kiB update to the file’s extent tree. Not sure though if they’re synced, this database is not that vital.
  • Update of the orders file; again, a file write, though this one is not synced, so it’s likely multiple uploads will be coalesced into a single write here.

Here I assume the directory entries are already cached, so they don’t need to be read prior to writing. temp/ is accessed often enough; I suspect the other directories are too, as long as the machine has enough RAM.

I’m counting 10 write operations in the optimistic case, suspecting it may go up to around 20 seeks (reads and writes) in many cases. Some might not be synced, allowing them to be merged across multiple uploads (but not during a single upload!). Except for one they’ll likely all be 4kiB or 8kiB. There were some optimizations for the journal in recent kernels, making the journal writes smaller and potentially coalescable.
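The per-upload operations above can be tallied in a quick sketch. The sizes are the assumed page-sized writes from the list, with an illustrative 16 kiB piece; the orders-file append is included even though it is usually coalesced across uploads, which is why the count lands one above the ~10 synchronous writes:

```python
# Back-of-the-envelope tally of the per-upload writes listed above for
# ext4 with an ordered journal (optimistic case). Sizes in kiB are the
# assumed page-sized writes, not measured values.
ops = {
    "inode creation":        [4, 4],     # inode + journal entry
    "piece contents":        [16],       # example piece, rounded up
    "temp/ dir entry":       [4, 4],     # dirent + journal
    "rename temp/ -> blobs": [4, 4, 8],  # two dirents + journal for both
    "bandwidth.db update":   [8, 8],     # transaction log + main db page
    "orders file append":    [4],        # usually coalesced across uploads
}
writes = sum(len(sizes) for sizes in ops.values())
kib = sum(sum(sizes) for sizes in ops.values())
print(f"{writes} writes, {kib} kiB total per upload")
```

Even in this optimistic tally, the bytes written are tiny; per the seek-dominated cost model, it is the count of distinct writes that sets the upload rate a single HDD can sustain.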

Other than that, any file system that coalesces other writes (like, maybe directory writes with inode updates) will fragment data structures (like, directory entries) so that reads become slow. Slow reads will mean slow file walker, and we’ve seen reports of the file walker taking >24h. So sometimes coalescing writes is actually not desirable.

1 Like

Well, my house is on a slight hill and my servers are on the second floor. The water would probably have to reach something like 50-70 meters above sea level to flood my server. I would probably have removed the hard drives from it and put them on the third floor by then. :slight_smile:

In any case, I understand that for others sync writes may impact performance too much, so this should be made a setting if possible. Just like you do not want to be forced to have sync writes enabled, I would not want to be forced to have sync writes disabled. As I use a SLOG, sync writes are not a problem for me.

2 Likes

ZFS transaction groups (TXGs) on TrueNAS are flushed at most every 5 seconds or at 1/8 of system memory. Let’s assume I get a very high 40 Mbit/s of ingress. That is 5 MB/s, so 5 × 5 is 25 MB of data loss. If STORJ can’t handle 25 MB of data loss on a 20 TB pool, and disqualifies the node because of it, we should really look into the audit score again.
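As a quick sanity check of the arithmetic, here is the worst-case data-at-risk figure, using the post’s assumed 5-second txg timeout and 40 Mbit/s ingress:

```python
# Worst-case unflushed data: ingress held in RAM for one ZFS
# transaction group. The ingress rate and txg timeout are the
# assumed figures from the post.
ingress_mbit_s = 40
txg_timeout_s = 5

ingress_mb_s = ingress_mbit_s / 8           # 5 MB/s
at_risk_mb = ingress_mb_s * txg_timeout_s   # 25 MB

pool_tb = 20
fraction = at_risk_mb / (pool_tb * 1_000_000)
print(f"{at_risk_mb:.0f} MB at risk, {fraction:.8f} of the pool")
```

So even in this pessimistic case, a crash loses on the order of a millionth of the pool’s contents.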

Does that save you from a broken pipe? :slight_smile:

100% agree. More options is always better. We are all different :rainbow_flag:

No pipes above me :slight_smile: If a pipe below me broke, the water would not reach me. In theory the roof could be leaky and rain could get in, but there won’t be enough to submerge the server (which is ~120cm above ground) and the top of the rack (and the servers above) would prevent water from dripping on that server.

1 Like

Sorry serger001, I think you got that one wrong. Audits are done on a random basis, not the same missing file again and again.

So if I want to be on the safe side, my pool should be no smaller than 1.25 GB (25 MB × 50) to survive the missing 25 MB I have lost during a crash :smile:

1 Like