Copying node data is abysmally slow

As long as they make it an option, it would be OK. If they disabled sync on uploads and didn’t make it configurable, that would be a problem. I would not be too happy running a node with sync disabled (unless they also disabled audits :slight_smile: ). Mounting the filesystem with the sync option could help, but then it would make every write synchronous, even those that do not need it.

image
I probably should add something here, maybe dispersion or the median size.

Can you please elaborate on why? What difference does it make whether writes are flushed right away or in a few seconds or minutes?

1 Like

Potential to lose data and fail audits if the server crashes. Normally, the node should synchronously write the file and only then report the upload as done, so that the satellite would not be thinking that my node has the file when it, in fact, does not.

If sync adds latency, well, that’s what SLOG is for.

If someone else wants to run with sync disabled, they can do that. It will either work great or I will get more repair traffic, which would also be good.

Right, but it’s not black and white, and the pros and cons should be justified.

How often does your server crash that this becomes a concern? If you lose a few kB or MB of data, it’s OK. That’s what the massive redundancy in the network is for. Essentially, losing data to a lost cache or bad blocks or bit rot here and there is accounted for and therefore harmless, inconsequential. Also, you are not instantly disqualified if you fail one audit. Seeing 100% in the dashboard all the time is not a goal :slight_smile:

So, benefits are questionable. But drawbacks — we are punishing everyone, artificially reducing performance of a node just to minimize impact of a rare and generally harmless event.

Sync is useful when data must be in a coherent state, e.g. databases. For a bunch of files on disk that represent erasure-coded pieces, I would not care at all.

2 Likes

Keep thinking like that. We get more repair traffic :slight_smile:

Databases use lots of IO but are small. Ideal use case for SSDs, that have become very cheap.

I would say, not likely at all. iXsystems seems to agree with me on that one and sells the TrueNAS Mini with only one drive for SLOG.

I think the way to go for STORJ is to store the DB on a sync file system, while the data itself should be async. Currently, my plan for my new setup is STORJ in a hypervisor VM with sync, and the data on a networked TrueNAS vdev with sync=disabled. That way I don’t even have to wait for STORJ to implement async, because TrueNAS lies to STORJ that the writes are synced.

1 Like

I know you are joking, but even then that repair will be a few megabytes, tops.

Storj themselves recommend against raid → no redundancy, no checksumming and repair → rot, bad block, data loss is an expectation.

2 Likes

Async writes are flushed every 5 seconds. That’s 3–5 files. ZFS lies to Storj, and Storj records those files as accepted. The server shuts down. You lose the files. Storj audits those files and disqualifies you. I get your repair traffic. All happy :slight_smile:

Do you have any source for the “fail once, get instantly disqualified” claim?

It does not help my node.

I may lose some data anyway because of other reasons, so why add one more way to lose data? My server does not crash often, but why take the chance? Especially when getting disqualified is expensive.

And that’s why I said it should be a setting. If people want to disable it - great, as long as I can keep it enabled. It either works great for them or I get more repair traffic, which is also good (for me).

Or it may be a lot if your node gets DQ.

As for the recommendations - well, if a node gets disqualified, the network just repairs the data and Storj does not care. The node operator might care that his node got DQ though.

You don’t fail once, you fail audits. Every failed audit drops your score, until you get disqualified at the end.

Why take the chance of losing data due to flooding? Did you waterproof your server cabinet? If not, was it because you think the chances of flooding are small? :kissing_heart:

Exactly. So how would I get disqualified because I have lost 5 seconds of writes?
Will the audits audit only these 10 MB multiple times?

You did not lose 5 seconds. You lost files. Files which can be audited.

Yes. If an audit fails, you lose score, and the audit will be repeated after some time. If the file is not restored, the score drops lower. And lower. And disqualification :slight_smile:
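The erosion described above can be sketched with a toy score model. The decay weight and the 0.6 disqualification threshold here are illustrative assumptions, not the satellite’s actual reputation parameters:

```python
# Simplified illustration of how repeated audit failures erode a score.
# LAMBDA and DQ_THRESHOLD are illustrative assumptions, not the
# satellite's actual parameters.
LAMBDA = 0.95        # weight given to the score history
DQ_THRESHOLD = 0.6   # assumed disqualification cutoff

def next_score(score: float, passed: bool) -> float:
    """Exponential moving average of audit outcomes (1 = pass, 0 = fail)."""
    return LAMBDA * score + (1 - LAMBDA) * (1.0 if passed else 0.0)

score, failures = 1.0, 0
while score > DQ_THRESHOLD:
    score = next_score(score, passed=False)
    failures += 1
print(f"disqualified after {failures} consecutive failed audits")
```

The point is that no single failure disqualifies a node; it is the accumulation of failures, weighted against history, that drags the score below the cutoff.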

I lost 5 seconds incoming data. Nothing more, nothing less.

If that is true, the audit system of STORJ is extremely stupid :joy:

You lost 5 seconds of data that was in RAM and that ZFS lied about having written. Nothing more, nothing less. But keep thinking wrong. I need repair traffic.

Half of the uploaded files are at or below 4kiB. ext4 pages are 4kiB (mostly relevant for storage of inodes and directory entries). IIRC sqlite uses 8kiB pages. The difference between 4kiB and 16kiB write on HDDs is pretty small, the seek time dominates, so up to 16kiB it’s more meaningful to talk about IOPS than bandwidth.
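To put numbers on why seek time dominates small writes, here is a rough service-time model for a single random HDD write. The seek time, spindle speed, and transfer rate are illustrative assumptions, not measurements:

```python
# Rough service-time model for one random write on a 7200 rpm HDD.
# All three figures below are illustrative assumptions.
SEEK_MS = 8.0                      # average seek time
ROTATION_MS = 60_000 / 7200 / 2    # half a revolution ≈ 4.17 ms
TRANSFER_MB_S = 150.0              # sequential transfer rate

def write_ms(size_kib: float) -> float:
    """Seek + rotational latency + transfer time, in milliseconds."""
    transfer_ms = size_kib / 1024 / TRANSFER_MB_S * 1000
    return SEEK_MS + ROTATION_MS + transfer_ms

for size in (4, 16, 64):
    print(f"{size:>3} kiB write: {write_ms(size):.2f} ms")
```

With these numbers a 16 kiB write takes well under 1% longer than a 4 kiB write, which is why IOPS, not bandwidth, is the right unit of cost for these operations.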

Just to show the complexity, at least the parts I am aware of, not claiming to be an expert in file system design. For a typical ext4 file system with an ordered journal, each upload involves following operations:

  • Creation of an inode for the file (4kiB write + another write for the ext4 journal entry).
  • Write of the file contents (size of a piece + 512 bytes for a header, rounded up to the next 4 kiB boundary). This is a mostly sequential write, likely not fragmented, so I assume the extent tree fits in the inode itself (best case).
  • Write of a directory entry in the temp/ directory (4kiB write + another write for the ext4 journal).
  • sync(), forcing all of the above onto the drive. Here the journal, inodes, directory entries and file contents are unlikely to be placed next to each other, as the journal and inodes are preallocated, and the directory will probably already be allocated somewhere too.
  • Rename from temp/ to actual storage directory, involving 2×4kiB directory entry writes (one removes from temp, another creates it in another place) + an 8kiB journal write for both sides. In case of some satellites, their storage directories are big enough to use an h-tree, making the update potentially require several page writes, but let’s consider the optimistic case here.
  • Update of the bandwidth.db database, so probably two 8kiB file writes (the transaction log and the main database file) + maybe extending the log file, so an update to the 4kiB transaction log file inode + maybe a 4kiB update to the file’s extent tree. Not sure though if they’re synced, this database is not that vital.
  • Update of the orders file; again, a file write, though this one is not synced, so it’s likely multiple uploads will be coalesced into a single write here.

Here I assume the directory entries are already cached, so they don’t need to be read prior to writing. temp/ is accessed often enough; I suspect the other directories are too, as long as the machine has enough RAM.

I’m counting 10 write operations in the optimistic case, suspecting it may go up to around 20 seeks (reads and writes) in many cases. Some might not be synced, allowing them to be merged across multiple uploads (but not during a single upload!). Except for one they’ll likely all be 4kiB or 8kiB. There were some optimizations for the journal in recent kernels, making the journal writes smaller and potentially coalescable.
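The per-upload operations above can be tallied in a quick sketch. The sizes are the assumed page-sized writes from the list, with an illustrative 16 kiB piece; the orders-file append is included even though it is usually coalesced across uploads, which is why the count lands one above the ~10 synchronous writes:

```python
# Back-of-the-envelope tally of the per-upload writes listed above for
# ext4 with an ordered journal (optimistic case). Sizes in kiB are the
# assumed page-sized writes, not measured values.
ops = {
    "inode creation":        [4, 4],     # inode + journal entry
    "piece contents":        [16],       # example piece, rounded up
    "temp/ dir entry":       [4, 4],     # dirent + journal
    "rename temp/ -> blobs": [4, 4, 8],  # two dirents + journal for both
    "bandwidth.db update":   [8, 8],     # transaction log + main db page
    "orders file append":    [4],        # usually coalesced across uploads
}
writes = sum(len(sizes) for sizes in ops.values())
kib = sum(sum(sizes) for sizes in ops.values())
print(f"{writes} writes, {kib} kiB total per upload")
```

Even in this optimistic tally, the bytes written are tiny; per the seek-dominated cost model, it is the count of distinct writes that sets the upload rate a single HDD can sustain.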

Other than that, any file system that coalesces other writes (like, maybe directory writes with inode updates) will fragment data structures (like, directory entries) so that reads become slow. Slow reads will mean slow file walker, and we’ve seen reports of the file walker taking >24h. So sometimes coalescing writes is actually not desirable.

1 Like

Well, my house is on a slight hill and my servers are on the second floor. The water would probably have to reach something like 50-70 meters above sea level to flood my server. I would probably have removed the hard drives from it and put them on the third floor by then. :slight_smile:

In any case, I understand that for others sync writes may impact performance too much, so this should be made a setting if possible. Just like you do not want to be forced to have sync writes enabled, I would not want to be forced to have sync writes disabled. As I use a SLOG, sync writes are not a problem for me.

2 Likes

ZFS transaction groups (TXGs) on TrueNAS are flushed at most every 5 seconds or at 1/8 of system memory. Let’s assume I get a very high 40 Mbit/s of ingress. That is 5 MB/s, so 5 × 5 is 25 MB of data loss. If STORJ can’t handle 25 MB of data loss on a 20 TB pool, and disqualifies the node because of it, we should really look into the audit score again.
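As a quick sanity check of the arithmetic, here is the worst-case data-at-risk figure, using the post’s assumed 5-second txg timeout and 40 Mbit/s ingress:

```python
# Worst-case unflushed data: ingress held in RAM for one ZFS
# transaction group. The ingress rate and txg timeout are the
# assumed figures from the post.
ingress_mbit_s = 40
txg_timeout_s = 5

ingress_mb_s = ingress_mbit_s / 8           # 5 MB/s
at_risk_mb = ingress_mb_s * txg_timeout_s   # 25 MB

pool_tb = 20
fraction = at_risk_mb / (pool_tb * 1_000_000)
print(f"{at_risk_mb:.0f} MB at risk, {fraction:.8f} of the pool")
```

So even in this pessimistic case, a crash loses on the order of a millionth of the pool’s contents.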

Does that save you from a broken pipe? :slight_smile:

100% agree. More options is always better. We are all different :rainbow_flag:

No pipes above me :slight_smile: If a pipe below me broke, the water would not reach me. In theory the roof could be leaky and rain could get in, but there won’t be enough to submerge the server (which is ~120cm above ground) and the top of the rack (and the servers above) would prevent water from dripping on that server.

1 Like

Sorry serger001, I think you got that one wrong. Audits are done on a random basis, not the same missing file again and again.

So if I want to be on the safe side, my pool should be no smaller than 1.25 GB (25 MB × 50) to survive the missing 25 MB I have lost during a crash :smile:

1 Like