SQL Database with Storj backend

Recently MariaDB announced support for an Amazon S3 storage engine.
I also read some news about Storj being compatible with the S3 API.

So.

I am wondering if somebody has used MariaDB (or any other relational SQL database) with Storj as the storage backend?

A few benefits off the top of my head are:

  1. I will not have to worry about daily backups; Storj is reliable and redundant out of the box.
  2. I will never lose my data; Storj would take care of consistency.
  3. I will never have to worry about my DB being corrupted.
  4. Assuming my sharding strategy is robust enough (based on consistent hashing), I would not have to worry about scalability issues, because the throughput Storj offers is world class.

If any of my assumptions above seems naive… please feel free to point it out.

4 Likes

Very interesting!
But how will the query performance be?
If you run complex queries, I fear you would need to wait a long time, at least much longer than with local storage attached to your database server.
In modern times every ms counts for most applications…

2 Likes

Welcome to the forum @crackerjack00 :slight_smile:

That’s cool news, I didn’t know ^^

Surely points 1 to 3 are valid. Like @striker43 said though I’m not sure about point 4: latency and speed may not be as world class as you might think yet, but things keep improving month after month…

However I could be wrong and it would be really cool if someone could try it out!
I would love to see the results posted here :slight_smile:

The S3 storage engine is read only and allows one to archive MariaDB tables in Amazon S3, or any third-party public or private cloud that implements S3 API (of which there are many), but still have them accessible for reading in MariaDB.

The typical use case would be that there exist tables that after some time become fairly inactive, but are still important, so that they cannot be removed. In that case, an option is to move such a table to an archiving service, which is accessible through an S3 API.

One of the properties of many S3 implementations is that they favor large reads. It’s said that 4M gives the best performance, which is why the default value for S3_BLOCK_SIZE is 4M.

So, generally, you wouldn’t use it for queries that require any kind of performance. It’s more just another kind of layered storage, one that can be done per-table at the database level, which is a great feature. One for which Storj indeed should work well.

I used to work with druid.io, a “big data” non-relational database which also managed layered storage. You could partition a table by date, then decide to store the current week’s data on SSD, the current year’s data on HDD, and older data on S3. Very useful.

2 Likes

Sorry if I’m teaching you to suck eggs but… You should always have backups, especially of mission critical data.
Not for hardware failure but for data corruption, accidental deletions, accidental data manipulation, etc.

1 Like

Hello @crackerjack00 ,
Welcome to the forum!

These are good questions. As others pointed out:

  • It’s read-only, meaning it’s layered storage for infrequently accessed data; your active data will be on local storage anyway.
  • You should do backups (for example, with the MariaDB connector for Storj DCS) independently of durability, because of possible bugs in your software, unintended deletions, archival purposes, etc.
  • The latency will be greater than with a local filesystem (at least for now), but it’s possible that you have a lot of nodes around you and the latency could be small enough - you need to test it.

The other points are valid.
You can configure MariaDB to use Storj DCS as an S3-compatible storage backend; you need to specify two additional parameters, s3_host_name and s3_block_size:
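For reference, a minimal sketch of what that configuration could look like. The endpoint, bucket name, and credentials below are placeholders/assumptions (gateway.storjshare.io is the Storj-hosted S3 gateway); substitute your own values:

```ini
# my.cnf — hypothetical values, replace bucket, keys and host with your own
[mariadb]
plugin_load_add = ha_s3                # load the S3 storage engine
s3_host_name  = gateway.storjshare.io  # assumed: Storj DCS hosted S3 gateway
s3_bucket     = my-archive-bucket      # placeholder bucket name
s3_access_key = <your-access-key>
s3_secret_key = <your-secret-key>
s3_block_size = 4M                     # MariaDB default; larger values favor full scans
```

After restarting the server, an existing table can then be archived with `ALTER TABLE my_table ENGINE=S3;`, after which it is read-only.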

1 Like

After thinking a bit about it, I confess I have doubts regarding Storj for this use case. Though, not because of technical parameters of the Storj service, but the payment structure.

I would assume that the standard scenario for using S3 as database storage, even if for less-used data, is when deploying a database on a cloud server, potentially close to the S3 service. Within an AWS region (or within a region of a different centralized cloud service) data transfer is free, so querying data doesn’t incur additional cost and, even if slow, one can do that as many times as desired.

This is not the case with Storj, as data transfer fees are unavoidable. What’s more, Alexey’s recommendation of a 64MB block size makes it even more expensive. Assume a large table, running into tens of gigabytes. If a query can be satisfied by transferring a few scattered 4MB blocks, it will be faster and cheaper than with the recommended 64MB blocks. Sure, full table scans will be faster with 64MB blocks, but pretty expensive too.
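To make the block-size argument concrete, here is a back-of-the-envelope sketch. The per-GB egress price is a placeholder, not an official Storj figure; the point is only the 16x ratio between 64MB and 4MB blocks for a query touching the same number of scattered blocks:

```python
# Hypothetical egress price — placeholder only, check current Storj pricing.
PRICE_PER_GB = 0.007  # $/GB

def query_egress_cost(blocks_read: int, block_size_mb: int) -> float:
    """Egress cost of a query that must fetch `blocks_read` engine blocks."""
    gb_transferred = blocks_read * block_size_mb / 1024
    return gb_transferred * PRICE_PER_GB

# A point query needing 3 scattered blocks transfers 16x more data at 64M:
cost_4m = query_egress_cost(3, 4)
cost_64m = query_egress_cost(3, 64)
```

Whatever the actual price per GB, the ratio between the two block sizes stays the same, which is why scattered point reads favor the smaller block size.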

It feels like Storj is close to being a nice candidate for this use case, yet S3 might be quantitatively better right now.

I wonder whether having two tiers of pricing: one that is like the current one, another with no or very small egress costs, but more expensive storage, would make sense.

Thanks for the info Pac, could you please elaborate on “latency and speed may not be as world class as you might think yet”?

I naively assumed that the only difference between Amazon S3 and Storj was the price, Storj being cheaper and truly decentralized.

Fair enough, and yes, you are correct: backups are a must-have.

The fee structure is critical for my app, and I would assume for many other apps looking for cloud-based object storage. Besides being a crypto fan, the most attractive feature of Storj for me was the price.

But if it ends up being more expensive than other cloud providers… then I would have to rethink my strategy.

Hi there. Super excited to have a leader reply to my very first post, this community rocks!

Regarding “but it’s possible that you have a lot of nodes around you and the latency could be small enough”:

If I start my own nodes, assuming I have five Raspberry Pis with two 4TB drives each, would that reduce the latency of my app?

1 Like

Well, that’s a feeling I’m having when seeing other people’s results on this forum. I keep being unimpressed by throughput (especially considering how massively parallel it is) and latency, but maybe that’s just me…
Or sometimes it’s not directly because of the Storj network but rather because of third-party tools using it, I’m not sure. Here are a couple of examples where the Storj network isn’t perfect:

But maybe these use cases wouldn’t work on Amazon S3 either, I’m not sure. I may not be seeing the whole picture; others should feel free to chime in and explain why it is in fact an amazing product! Don’t get me wrong, I think Storj is a good and promising product that has great potential and could become awesome, but I think it’s not there yet :slight_smile:
It keeps getting better month after month though :+1:

1 Like

It depends on where your app runs, where your Pis are located, and also where the data is located.
Just becoming a node operator doesn’t mean the data you upload via your app will land on your Pis.

The data is distributed via some algorithm when uploaded. @Alexey can comment better to that.

When data is downloaded, the request is sent to all operators which hold the data, and the ones which respond the fastest will be the ones to deliver it.

So ultimately the speed depends on where the data is located, but that is not something you can control at the moment.

1 Like

It depends on where the compute is running.
AWS will charge for egress in all cases except transfers within the same region to an AWS service or another S3 bucket in the same region.

If your compute service is not hosted in AWS, or is in a different region, you will pay for egress from AWS every time you read. With Storj it will be cheaper if the host is not in AWS.

However, if your compute service is hosted on AWS, you will have to pay twice: when uploading to Storj, the traffic is free ingress for Storj but paid egress from AWS; when downloading from Storj, it is free ingress to AWS but paid egress from Storj.

Unlikely.
Node selection is random, but the fastest 80 nodes (usually the closest to your location) from the offered list of 110 nodes will host pieces of your file. When downloading, you will use 35 of the 80 nodes which store pieces of your file, and the 29 fastest of them will deliver the 29 pieces needed to reconstruct the file.