Some of us run our storagenodes with redundancy, like mirrors, RAID6, RAIDZ2, RAIDZ1 or similar.
It seems like such a waste to store trash on good or even great disks, while we often have old spares that still work but aren't really doing much of anything, because they don't fit well into our current storage solutions.
I suggest the node configuration be expanded with an optional dedicated trash storage location, so that trash can live on less efficient hardware.
I don't know the implications of this, but my gut tells me it shouldn't mean much for network integrity, and it's a great way to make use of spare single disks.
You could do this by just replacing the trash folder with a symlink if you really want to. Stop the node first, then move the trash folder where you want it and create a symlink to that location.
I’m guessing this is a fairly niche request and it might have quite a bit of IO impact, since moves will turn into copies of pieces. So I can’t guarantee this is a good idea.
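In case anyone wants to try that, here's a rough sketch of the idea in Python (run it with the node stopped; the paths are just placeholders, not real node defaults, and the destination should not exist yet):

```python
import shutil
from pathlib import Path

# Placeholder paths -- adjust to your own node's storage directory
# and the mount point of the spare disk.
storage = Path("/mnt/storagenode/storage")
spare_trash = Path("/mnt/sparedisk/storagenode-trash")  # must not exist yet

trash = storage / "trash"

# Move the existing trash to the spare disk. shutil.move copies across
# filesystems and then removes the source.
shutil.move(str(trash), str(spare_trash))

# Leave a symlink behind so the node keeps writing to the same path.
trash.symlink_to(spare_trash, target_is_directory=True)
```

It's the same as doing it by hand with mv and ln -s, of course.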
Yeah, I thought about that myself and might do it, but I run 6 drives in a raidz1,
so at least with this configuration the gains in my case would be minimal. Still, I think it would be an interesting feature.
Maybe I'll turn on tracking of my trash folder so I can see what's going on in it;
I sort of assume most of it just gets deleted after a week or so.
Did a bit of further thinking on this IO thing… it might be a good idea to have it as a secondary trash folder, so trash initially stays on the primary storagenode storage and then after, let's say, 24 hours gets moved to the secondary trash storage.
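Just to make that idea a bit more concrete, here's the kind of thing I'm picturing, purely hypothetical (nothing like this exists in the node today; the paths and the 24-hour cutoff are made up):

```python
import os
import shutil
import time
from pathlib import Path

# Hypothetical settings -- none of this is real node configuration.
PRIMARY_TRASH = Path("/mnt/storagenode/storage/trash")
SECONDARY_TRASH = Path("/mnt/sparedisk/trash")
MAX_AGE = 24 * 60 * 60  # move pieces older than 24 hours

now = time.time()
for root, _dirs, files in os.walk(PRIMARY_TRASH):
    for name in files:
        src = Path(root) / name
        if now - src.stat().st_mtime < MAX_AGE:
            continue  # still fresh, keep it on the primary storage
        dst = SECONDARY_TRASH / src.relative_to(PRIMARY_TRASH)
        dst.parent.mkdir(parents=True, exist_ok=True)
        # Cross-device move: read + write + delete, so this is where
        # the extra IO would happen.
        shutil.move(str(src), str(dst))
```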
I don’t see the value in this idea. Where’s the problem with 100GB of trash on a 10TB node?
And the trash is there as a safety net. Putting the trash on a different drive introduces more points of failure for the stability of the node, because pieces can't be restored from trash if that other drive crashes.
I thought that was already what nodes did when putting files in the trash, for some reason: copying files to trash instead of simply using the system move operation. That would explain the massive IO activity on disks when lots of files get moved to trash, which happens now and then. Maybe I'm mistaken?
Or maybe that's never been the case and it's IO intensive simply because the files are so small and numerous?
It uses a rename function, which is a move. It’s just lots of small files. But it’ll definitely be a lot worse if that turns into a read, write, delete.
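To make the difference concrete, a small sketch (the paths are placeholders):

```python
import errno
import os
import shutil

piece = "/mnt/storagenode/storage/blobs/example-piece.sj1"            # placeholder
trash_same_disk = "/mnt/storagenode/storage/trash/example-piece.sj1"  # same filesystem
trash_other_disk = "/mnt/sparedisk/trash/example-piece.sj1"           # different filesystem

# Same filesystem: rename only updates directory metadata,
# the piece data itself is never read or rewritten.
os.rename(piece, trash_same_disk)

# Different filesystem: rename is impossible, the kernel returns EXDEV.
try:
    os.rename(trash_same_disk, trash_other_disk)
except OSError as e:
    if e.errno != errno.EXDEV:
        raise
    # The only option left is read + write + delete, i.e. IO proportional
    # to the piece size instead of a metadata update.
    shutil.move(trash_same_disk, trash_other_disk)
```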
If the data going out is headed for deletion and stays out, then it would only add the read and use a bit of bandwidth.
The real issue would be if pieces often had to be pulled back from the secondary dedicated trash storage.
Another thing to keep in mind is that HDDs lose about 50% of their speed when full, and even though this seems almost irrelevant, it's not a one-time advantage; it's a permanent advantage for all other data handling.
Of course, that is almost at the point of splitting hairs.
If we take an example, say a 3-disk raidz1 with 6TB HDDs, that's 12TB usable capacity.
Having a spare disk for trash could in theory increase max capacity during heavy delete periods by 10-25%, by using an old 1-3TB drive for trash.
Also, the 7% overhead plus the 80% max capacity before a disk starts getting heavy fragmentation of the stored data leave us with only about 9TB of usable space from 18TB worth of HDDs.
Of course that part is relevant in all cases, but then we can also add the 15% Storj Labs recommends as free space, call that another 1TB, and we're at an 8TB max before performance tanks.
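Spelling out that arithmetic with the numbers I'm quoting here (they are just my working assumptions):

```python
# Figures as quoted in this post -- the overhead and free-space
# percentages are my own assumptions and get challenged below.
raw = 3 * 6.0                          # three 6TB drives            -> 18TB raw
usable = raw * 2 / 3                   # raidz1: one drive of parity -> 12TB
after_overhead = usable * 0.93         # minus the ~7% overhead      -> ~11.2TB
after_80_pct = after_overhead * 0.80   # ZFS "stay under 80% full"   -> ~8.9TB
after_free = after_80_pct - 1.0        # roughly 1TB more kept free  -> ~7.9TB
print(after_overhead, after_80_pct, after_free)
```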
Adding dedicated storage for trash might not seem that relevant… but for larger nodes in the 20TB range it could be rather advantageous.
I did just check my trash: I'm at 50GB on the 15TB node, but it has been in the near-TB range before. At those times the max data the node can store is lower, so one could hit the point of getting no ingress from being at capacity while still holding on to trash.
In that case the node would end up a full week of ingress behind…
Seems trivial, sure, but I'm not sure that if one did all the math it would actually turn out to be as trivial as one might initially think.
Of course it would be very niche, mainly for older, very large nodes, and the advantage wouldn't always exist.
I’m not sure what you mean, but a move or rename is only a small file metadata update. Moving to another HDD would change that to reading from the original HDD, writing to the new one and then deleting the file on the original. The reading AND writing is new IO that wasn’t necessary before. Compared to only the rename that adds a lot of overhead.
It’s not. Only if something happens that requires data to be restored. I think that only happened once. Though in theory it could happen with quite a bit of data at the same time.
On my 15TB node I have 45GB of trash, I don't see how that factors into this equation. Just don't assign all of your free space. The trash is negligible.
I have never seen anywhere close to those amounts. The highest I've ever seen was around 2%, maybe 3%. I think you should divide those numbers by 10.
Different situations may occur if you leave your node offline for long times and trigger lots of repair though. But a good operating node should never get anywhere near even 5% trash.
Well then don't use RAID… that's your choice and completely independent from how trash is handled. Let's not confuse things. Also, where did that 7% number come from?
What I can say is that a 90% filled disk can handle Storj loads perfectly fine. I actually have one filled to 97% (which I don’t recommend anyone do) and it’s purring along just fine. As long as you don’t have an SMR disk, filling too much should not be a worry for Storj. Just follow the recommendation to not go beyond 90% and you’re golden. (yeah, it’s a do as I say, not as I do kind of situation. The 97% full HDD was a test that I decided to just keep running)
Since when is that 15%? (it’s not… it is and always has been 10%)
And why would that be additional if you are already keeping 20% space free for performance? That makes absolutely no sense.
Then you are most certainly doing something wrong. It should never get anywhere close to that kind of size if you operate your node reliably. 50GB is fine for a node that size. TB ranges are absolutely not. But even so… if you have such a peak, it’s there for only a week. It’s still not a big problem.
I think what you did here was drag in a lot of unrelated things (some of them complete nonsense) that would also require additional space but actually have nothing to do with how trash is handled, just to hide the fact that trash uses at the absolute most 3% of space on a well-running node. All the rest is not relevant, and even such 3% peaks would not last longer than 7 days, at which point it would drop back to the roughly 0.3% we are both currently seeing. And keep in mind that this is even with the temporary measure that deletes now go to trash as well; without that, trash amounts are significantly lower. It seems trivial, because it IS trivial.
You have a niche request, which is fine, it’s your right to request it. But here’s the thing, there is very little reason for anyone to want it, it comes with significant IO downsides and there is a very simple way you can actually implement this in your own setup without changes to the node code. So… please use symlinks to just do that last part yourself first and see if it helps you.
Still only read IO on the main storage media; the writes would of course land on the dedicated trash disk and thus become irrelevant to overall performance.
But yes, it would create more reads. What I was trying to say is that if data flip-flops back and forth, it might be a big issue for this type of solution.
On average, sure, but since we are working against a fixed maximum, we need to use the peak trash size to know the limits. Of course there is an argument to be made that data in trash was already on the disk, so it should only be weighed against the possible ingress over a week.
The 7% is the disk overhead; it's used for metadata and CRC, and it's part of the capacity lost when partitioning a drive.
It's something like 7-10%, and it's why you get about 5.5TB out of a 6TB HDD.
Yeah, short term it might not be bad, and it also depends on the data churn, or whatever it's called in the industry, when you delete and rewrite data over time.
If you have a lot of that at near-max capacity it will cause fragmentation, which leads to severe long-term performance decreases due to an increasing number of seeks.
For ZFS the recommendation is a max of 80% of capacity for live data, most likely due to its RAID-sized block sizes… I suppose a single disk running, say, a 4K block size might be able to go closer to max capacity before seeing long-term performance degradation.
Of course, if the data isn't live and doesn't get rewritten this becomes a non-issue; the problem doesn't show up immediately but builds over time as worse performance, ending at levels where the storage basically becomes unusable.
True, didn't think about that… I sort of think of my 80% capacity as my max, which is why I didn't catch it.
What if that was the good ingress week for that month? I know we saw pretty high deletes three months ago or so; back then I think my 15TB node hit just over 600GB.
And that would get worse over time as the network ages and the node grows.
I know it's niche, but it was meant more as a thing for the future. No point in arguing over it anyway, since this is clearly not going to get voted in.
I disagree with your numbers. I mean, you are of course right about the 10% free space; that was just my bad memory and unwillingness to look it up, since it's sort of semi-irrelevant to the point I was trying to make.
I'll try the symlinks, might be a fun experiment.
I think this will one day be a useful thing for nodes, even if people can't see it now.
And even if it sort of isn't…
It will significantly impact how long garbage collection runs take, though. And depending on the CPU, the IO wait can impact other things as well.
Those were peak numbers, definitely not average.
This is simply wrong. A 6TB HDD is 6TB, which is around 5.5TiB (6 × 10^12 bytes ≈ 5.46 TiB); that's just the unit used. The rest would be highly dependent on the file system, but file system overhead is usually negligible. It's definitely nowhere near 7%.
This is really not a worry for storagenodes. When the node is full writes stop. And files are small anyway, so fragmentation isn’t very likely to begin with. The vast majority of data is static at that point. With 90% full you won’t see an issue for sure.
ZFS, more than most other file systems, needs to be tuned to the intended use. I'm not convinced it's a great choice for nodes, but you do you. If you can't go over 80% for node operation, you're doing something wrong. I think you're trying to prevent performance degradation that isn't going to happen anyway. But then again… you made things a lot more complicated by opting for ZFS and RAID to begin with. So now you have to put in the time to do it right.
I saw a max of 450GB at the time, which is 3%. But 600GB on a 15TB node would be 4%. So clearly nowhere near the 10-25% you mentioned.
No, only absolute numbers would be higher, percentages would stay the same.
Please do and report back. I don’t see the use right now and see lots of downsides. But experiments never hurt. So please prove me wrong.