Tagging uploads

hashbackup · September 19, 2021, 5:34pm

Hi - I will take your advice and read the V3 whitepaper; I’m sure it’s interesting.

I guess maybe tagging could be done upfront by looking at file contents and file name extensions. It wouldn’t work for HashBackup files because they are encrypted and don’t have extensions, but then again, I guess you could measure a file’s entropy to see if it is encrypted and, if so, assume it’s a backup and place it on appropriate storage nodes.
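
As a rough illustration of the entropy idea, here is a minimal Python sketch that computes the Shannon entropy of a byte string in bits per byte. The function names and the 7.5 threshold are assumptions made up for this example, not anything Storj or HashBackup does; note that compressed files also score close to 8, so high entropy alone cannot tell a backup from a zip or a video.

```python
import math
import os
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of data in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)  # frequency of each byte value
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def looks_encrypted(data: bytes, threshold: float = 7.5) -> bool:
    """Guess whether data is encrypted (or compressed).

    The threshold is an illustrative assumption, not a tuned value.
    """
    return shannon_entropy(data) >= threshold


if __name__ == "__main__":
    plain = b"backup " * 1000           # repetitive text, low entropy
    random_like = os.urandom(64 * 1024)  # stand-in for encrypted data
    print(round(shannon_entropy(plain), 2), looks_encrypted(plain))
    print(round(shannon_entropy(random_like), 2), looks_encrypted(random_like))
```

In practice you would probably sample only the first few hundred KB of a file rather than read all of it.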

Thanks for the whitepaper tip!
Jim

Alexey · September 20, 2021, 6:28pm

On Storj DCS, data is encrypted before it is uploaded to the network, so tagging is possible only client-side, before encryption.
You can use the uplink meta command to add metainformation to the object.
But it will not be used by the satellites, because they do not proxy the data between the client and nodes.

hashbackup · September 20, 2021, 8:03pm

Here’s my original post about tagging:

I don’t know much about the Storj technology. It seems there are several differentiated use cases:

backups: high ingress, high storage, lower IOPS (except rclone!), low egress

image hosting: low ingress, medium storage, higher IOPS, high egress

video streaming: low ingress, high storage, lower IOPS, high egress

Maybe Storj already does this, but it might be useful to be able to tag uploaded data to optimize storage node selection. For example, a node with a low outgoing bandwidth speed or limits and high storage capacity could easily host backups, but would not be very successful with images or videos.

A meta tag as you suggest would work even if it wasn’t passed to SNs; I was thinking of it mostly for SN selection. I guess it could also be useful to a SN that wanted to optimize files by class, for example, putting high IOPS files like images on SSD while backups were put on spinning discs.

Yesterday I was looking through the requirements for SNs. I have a crappy connection, 20Mbit down, 1.5Mbit up, and only have a TB or 2 of free space. So my server would not be that great and doesn’t meet the minimum requirements. But my uptime is 72 days, so there is that. If Storj only sent backup data to my node and tried to avoid me when retrieving data, only using me as a last resort, it would probably work okay. I can see where you can dynamically classify my node capabilities, but it seems like you’d need a tag to know upfront you were dealing with backup data, though I just started on the whitepaper.
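
To make the tagging proposal concrete, here is a small Python sketch of tag-based node filtering. Everything in it is hypothetical: the `Node` fields, the `MIN_UPLOAD_MBPS` table and its numbers, and the `eligible_nodes` function are invented for illustration and are not part of Storj's actual node selection.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    upload_mbps: float  # node's outgoing bandwidth, used when customers download
    free_tb: float      # free disk space


# Hypothetical minimum upload speed per workload tag (invented numbers).
MIN_UPLOAD_MBPS = {"backup": 1.0, "image": 20.0, "video": 50.0}


def eligible_nodes(nodes, tag):
    """Return the nodes whose upload speed meets the floor for this tag."""
    floor = MIN_UPLOAD_MBPS[tag]
    return [n for n in nodes if n.upload_mbps >= floor and n.free_tb > 0]


if __name__ == "__main__":
    nodes = [
        Node("home-dsl", upload_mbps=1.5, free_tb=2.0),    # the connection described above
        Node("datacenter", upload_mbps=1000.0, free_tb=50.0),
    ]
    for tag in ("backup", "image", "video"):
        print(tag, [n.name for n in eligible_nodes(nodes, tag)])
```

With these made-up thresholds, the 1.5 Mbit uplink node qualifies only for data tagged as backup, which is the behavior the proposal is after.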

@jammerdan sent a link to some overall network stats that were very interesting. It caught my eye that there were 26K disqualified nodes out of 41K - around 60%. I’m sure there are lots of reasons for disqualifying, but maybe if files were tagged, some low-capability but reliable nodes that were disqualified might be able to participate for certain kinds of data without affecting performance of the network.

BrightSilence · September 20, 2021, 8:39pm

There is really no reason to implement tiering like that. I want to be very clear by starting out with that. The way Storj is implemented means that even with slow nodes, parallelism ensures the end user speeds are still high.

Ok, with that said, let’s look at what you are suggesting: treating nodes differently depending on their performance. That either requires node operators to state their performance… In which case, I assure you I have 1000 Gbps up and down, pinky promise! Clearly, you can’t just trust that. Alternatively, nodes need to be constantly stress tested to keep their performance stats up to date. But that would require a lot of testing with large amounts of data to get consistent results. It adds a lot of complexity.

Instead, the way Storj solves this is by over-provisioning transfers and long-tail cancellation. What this means is that for a download, a segment can be recreated with just 29 pieces, but the client starts downloading 39 pieces from 39 different nodes. As soon as the first 29 finish, the rest of the transfers get canceled. This eliminates the slowest transfers and effectively uses the 29 fastest transfers in parallel to get great speeds. On top of that, depending on how the client software is coded, you could also download multiple segments in parallel. A similar thing is done for uploads as well.
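
The long-tail cancellation described above can be simulated in a few lines of Python with `asyncio`. This is only a toy model under stated assumptions: `fetch_piece` fakes a transfer with a sleep, and the function names are made up here; it is not how the uplink client is actually implemented. It does show the core pattern: start 39 transfers, keep the first 29 to finish, cancel the rest.

```python
import asyncio
import random


async def fetch_piece(node_id: int, latency: float) -> int:
    """Pretend to download one piece; the sleep stands in for the transfer."""
    await asyncio.sleep(latency)
    return node_id


async def download_segment(latencies, needed=29):
    """Start one transfer per node and return the first `needed` that finish."""
    tasks = [
        asyncio.create_task(fetch_piece(i, lat))
        for i, lat in enumerate(latencies)
    ]
    finished = []
    try:
        for next_done in asyncio.as_completed(tasks):
            finished.append(await next_done)
            if len(finished) >= needed:
                break  # enough pieces to rebuild the segment
    finally:
        # Long-tail cancellation: drop whatever is still in flight.
        for task in tasks:
            task.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)
    return finished


if __name__ == "__main__":
    random.seed(1)
    # 39 simulated nodes with latencies between 10 ms and 300 ms.
    latencies = [random.uniform(0.01, 0.3) for _ in range(39)]
    got = asyncio.run(download_segment(latencies))
    print(f"kept {len(got)} pieces, cancelled {len(latencies) - len(got)}")
```

The total time is set by the 29th-fastest node rather than the slowest one, which is why a handful of slow nodes in the set does not hurt the download.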

This relatively simple system ensures top performance of ALL data stored on Storj. Since every segment is stored across fast and slower nodes, each segment has plenty of fast nodes that can be used in parallel to serve your files fast. Tiering really isn’t necessary as the current setup can ensure that everything is available at high speed, while still allowing lower performing nodes to participate. The only downside is that if your node is really slow, it’s likely not getting a lot of data and not getting a lot of egress. So it may be less profitable, but since it’s also less valuable to the network, I’d say that’s fair.

Toyoo · April 7, 2022, 3:39pm

I’d suspect a significant majority of the disqualified nodes were set up as some kind of experiment and purged before their vetting had finished. That’s the story of my first two nodes…