What is the worst that would happen if I delete all of my SQLite databases and restart storagenode?

Do you have a citation where anyone has found that to be the case? All I’ve heard on the forums for when there’s a database corruption error is “the databases aren’t critical, just delete it and restart storagenode.”

Ping @Alexey… can you chime in on this? What is the worst that would happen if I delete all of my SQLite databases and restart storagenode?

well, the RAM in your computer isn’t critical either… i’m sure your system would do fine if you injected corrupt data into it…

the fact is you cannot really predict what would happen in most cases… so yes, the databases are not critical for storagenode integrity, but damage to them could have nearly unpredictable outcomes…
i cannot say how bad it could be (that would require knowing how the storagenode works in detail), but i would seriously try to avoid doing risky stuff with them.

like putting them on a RAM drive. besides, what’s the gain from that… you most likely won’t get it to run much faster than with a proper SSD.
you would however end up spending a few GiB on keeping the databases in RAM, and you’d want ECC memory plus regular scrubs of your RAM to ensure the integrity of the entire db at all times… just to get performance that’s most likely equal to an SSD, or so close that the difference is practically irrelevant.

with ZFS i barely even know where my database is located… some of it is in RAM, it’s most likely also in my L2ARC SSD, and it gets written to disk from my SLOG SSD when it flushes… so i basically have the database in RAM, but without any of the downsides of a RAM drive… only most of the advantages… i could hot-pull most of my RAM and my L2ARC and my system would just run a bit slower, because of how my server is set up it doesn’t rely on those things as critical components and all of them are hot-swap.
though granted the OS might crash… not sure how well debian deals with somebody pulling RAM on a live system.

but then the watchdog of the BMC would register the system as frozen and reboot it after 10 minutes, everything would just boot up and continue, and it may have forgotten a few ms of what it was doing because the SLOG didn’t catch it in time, but none of the data would be corrupt because it’s Copy On Write, so it cannot create write holes during a lockup or power outage.

i dunno what you are running, but a RAM drive for the DB would most likely be a semi-dangerous waste of resources… when there are so many more practical solutions that can benefit your system in many more ways.

but if you do put your DB in RAM, do let us know how it turns out and whether you see any actual performance gains… always interested in learning new optional tricks, or hearing a horror story about how not to set up one’s system…

SSD is still 1-2 orders of magnitude lower throughput than RAM, and sync() is still expensive compared to RAM since RAM can return write success faster than an SSD can.

I’d also posit that the failure rate of RAM is lower than that of an SSD. If true, then the databases are actually safer in RAM. When rebooting you could either choose to lose them or temporarily copy them to persistent storage.

The databases on my nodes don’t even exceed 300MB.

I strongly disagree with both of these points.

When you read from a file, it goes into RAM. When you write to a file, the write buffer is in RAM.

The data touches RAM all the time anyway, and more than just the databases – piece data lives there during reading and writing. The I/O cache is stored in RAM. Therefore data stored on disk is susceptible to RAM errors as well.

Put it another way… whatever data you consume with the ramdisk would have been used for the I/O cache. If you have RAM errors then you’re probably serving corrupt pieces out of your storagenode as the corruption is in the I/O cache instead of the databases. It’s still a bad situation.

With bad RAM, the entire system is suspect. It doesn’t matter what’s in a ramdisk and what’s not… you can’t trust anything at that point.

Whether or not you’re using ECC RAM has no impact on the safety of a ramdisk vs disk. The same data stored on disk could be corrupted with non-ECC RAM too.

but would it matter? besides, the SSD is where you save the db… not where you work on it; the used part of it would most likely be in memory anyway.
so all you do is basically reverse the order… instead of having the db in RAM being worked on and saving changes to an SSD for recovery if shit hits the fan, you now have the db in RAM and save it in RAM, so you can lose the whole database. you would also still need some place to store it during reboots, so in the end i would almost argue that the performance benefit would be very low if any… and you just end up using more memory and more memory bandwidth.

dunno, i copied one of mine the other day, orders.db, and it was like 600MB just for that one… and there are a few of them to my knowledge, not sure how many.

ECC enables you to perform regular scrubbing of your memory with little performance loss, which is needed to keep the integrity of data that isn’t often used; RAM isn’t super stable over long periods of time with little usage.
sure, it’s not relevant for short periods, but if you expect the node to run for months without a reboot, you should expect such things to make life a little less troublesome… it’s not for fun that the server standard is ECC.

I cannot remember whether data is more or less safe in RAM than on disk, but i can tell you what happens if the power goes out… then it’s safer on disk :smiley:

i just don’t think there is much to gain from putting it on a RAM drive. yeah, RAM is faster than SSDs, no doubt about it, but you will reach a point where the added speed doesn’t do you any good…

you could also go to work in a jetfighter… would mean you get from point A to point B much faster… but i doubt it would make your daily life much easier… :sweat_smile:

get ZFS and it will basically put your DB in RAM by default… at least its most frequently accessed parts, to make everything run better… and it would give you so many more advantages on a server that has the RAM to spare, compared to stuff like RAM drives for storagenode databases.

didn’t really touch that much on the ARC
but it was short… all the other videos are 50min lol

Your node would lose all stats (used space and bandwidth, payment information from satellites, the ability to check the correctness of the payment, etc.) and all unsent orders (they will not be paid).
The customer can download some pieces for free from your node.
The repair request will fail to find any piece on your node.
Since your node will not be aware of its used space, it can accept more pieces than it has free space for and will then fail to add any record to the database. It will almost be stuck in this state until customers hopefully delete their data or the SNO finds a way to add more free space and correct the availability in the config.
That’s all I am aware of.

It’s not fatal, but very inconvenient.


Great info, thanks!

This is data my monitoring system tracks so the impact to me would be basically nothing.

Assuming database losses are rare I imagine the impact would be not very severe (maybe a few cents).

In perpetuity, or is this just about orders that weren’t submitted?

So a repair request checks the database for a piece and doesn’t look for the file on disk?

Could the software be updated to regenerate this data if the database is found to not exist during startup? Aside from any theoretical discussion about using ramdisks, this would be useful in the event that the database gets corrupted.
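For the used-space part at least, regeneration seems mechanically simple. A rough sketch of what I mean (Python, assuming pieces are stored as plain files under a blobs directory; the path is made up, the real layout may differ):

```python
import os

def recompute_used_space(blobs_dir):
    """Walk the piece directory and sum file sizes to rebuild the used-space figure.
    Hypothetical path layout; the real node may organize pieces differently."""
    total = 0
    for root, _dirs, files in os.walk(blobs_dir):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total

print(recompute_used_space("/storj/storage/blobs") / 1e9, "GB in use")
```

A full scan like that would obviously be slow on millions of pieces, so it would only make sense as a one-time recovery step at startup.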


One time for each piece. So eventually all of your used space could be downloaded at least once for free.
This is my understanding of the process and I could be wrong.
@littleskunk can correct me

I think so, this was the case a while ago: Download failed: file does not exist - #4 by Alexey
I can’t be sure, because I’m too lazy to check the code :slight_smile:

Topic for the Storage Node feature requests - voting - Storj Community Forum (official)


What if I regularly back up the database? What would the impact of restoring a 5, 10, 30 minute old copy be?

Would that be better or worse than losing the entire DB?

Let’s assume that the actual piece data stays intact.


aren’t you on zfs… how would it break unless the storagenode breaks it… and if you did back it up, you would need a way to verify it isn’t corrupt before you overwrite the old backup, otherwise you’ll just have a 5 min delayed copy which will most likely also end up corrupted when the file is updated after the storagenode db is damaged (see the sketch below).
maybe it would be easier to simply snapshot the whole thing… i dunno how viable that would be though… say every 5 min, keep a day’s worth, then after a day thin out to hourly or every 3 hours, with continually longer intervals until you reach like a week… and then merge them back into the file system…

not sure if zfs can even run like that… it seems kinda insane with what it would have to manage, but maybe… not sure what it would do for fragmentation of the whole storagenode either…
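something like sqlite’s own integrity check could gate the backup rotation… just a sketch in python with made-up paths, not how the node actually does anything:

```python
import sqlite3

def backup_if_healthy(db_path, backup_path):
    """Refresh the on-disk backup only if the live DB still passes
    PRAGMA integrity_check; otherwise keep the previous good copy."""
    src = sqlite3.connect(db_path)
    healthy = src.execute("PRAGMA integrity_check").fetchone()[0] == "ok"
    if healthy:
        dst = sqlite3.connect(backup_path)
        src.backup(dst)   # online copy, consistent even while the db is in use
        dst.close()
    src.close()
    return healthy

backup_if_healthy("/storj/storage/orders.db", "/backups/orders.db")
```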

My idea is to put the database on tmpfs or similar and back it up to disk every few minutes. If the power fails, I just restore the last backup.

This is why I asked the question - would restoring such a backup be useful (assuming the data, which is always on disk, stays intact)?
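Roughly what I have in mind (a sketch only; the paths and interval are placeholders, and the node would have to be pointed at the ramdisk copy):

```python
import sqlite3, time

LIVE = "/mnt/ramdisk/orders.db"      # tmpfs copy the node actually works on
SAFE = "/mnt/disk/backup/orders.db"  # persistent copy to restore after a power failure

while True:
    src = sqlite3.connect(LIVE)
    dst = sqlite3.connect(SAFE)
    src.backup(dst)                  # consistent online snapshot, no need to stop the node
    dst.close()
    src.close()
    time.sleep(300)                  # refresh every 5 minutes
```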

i would say most certainly… losing the entire database, though survivable, sounds horrible… so with a partial restoration of the database, only the time between the last backup and the corruption / loss should matter…

so let’s say for your node: deletions we will assume won’t matter, because deleted files must be caught somehow at a later point… that leaves uploads and downloads… downloads are irrelevant too because they don’t make us miss any actual data… which leaves the uploads.

which for the lost time period would then supposedly be unrecorded, meaning those files would be stored for free. i would see that as a small punishment, instead of, like alexey suggests, all the files stored when the database becomes corrupt essentially becoming free…

not sure if this is realistic, because why would a storagenode operator ever accept that… it would most likely be more worthwhile to simply kill the node and start over, or that’s what it almost sounds like to me…
and odds are it might happen anyway with the loss of the db’s

also the database isn’t the live data, so it doesn’t impact the erasure coding… if a satellite needs the data and knows it stored it on your node, it sounds like it would simply retrieve it based on its name on the file system… thus erasure codes would be unaffected, so the uploaded files that are missing database entries would still work for the network even if they might be downloaded for free…

i would take a few freely downloaded files… say 1 file a second is 3600 an hour, so about 100 hours (roughly 4 days) is 360000 files, and the loss of, let’s say, even 30 minutes would only be 1800 files… which is a minimal part of the whole…

and then there are the close to 8000000 files i currently have and keep track of… losing track of 1800 vs the whole database seems a trivial matter; at worst, let’s see, 1800 out of 8m is… about 0.02% (quick check below)

actually seems like a pretty solid solution, and since a node can survive without the db’s, the ability to restore our databases to a level of roughly 99.98% should be quite a nice ability to have.
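quick check of that math (assuming 1 upload per second and about 8 million pieces total):

```python
lost = 30 * 60        # pieces uploaded during a 30 minute gap at 1 piece/sec
total = 8_000_000     # pieces the node currently tracks
print(f"{lost} of {total} = {lost / total:.4%}")   # 0.0225%
```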

the real problem is if we don’t notice it’s gone bad and the node keeps running and writing to a corrupt database or something like that… of course that doesn’t really change anything aside from how long a period one loses information from…

hell, even a backup a day would make a world of difference; for a 3 month old node that’s only about 1% loss of records…

not familiar with tmpfs

why would you restore data on power failure… zfs should take care of all that… it’s basically built in

only if your db is corrupted or such will a backup be useful… not sure how to automate database corruption detection and correction, but i suppose the logs could trigger it… or at least send an alert…

like ZFS when you pull a non-redundant drive from a raidz… it doesn’t go all crazy… it waits until it can speak to the drive again… basically it waits for the drive to reconnect on its own, or for a sysadmin to be alerted to the issue and take care of it… and when it detects the drive is back, the pool is allowed to continue.

one might want to simply shut down the node rather than keep it running with a bad DB; this also keeps any weird automation issues from happening… such as the db being replaced with a backup because the automation was triggered by mistake.

at least for a good long while i would keep this a manual sysadmin action, so the automation doesn’t take possibly costly actions all by itself, until one understands how the code behaves over time and in real world usage.

You can do that. You will have the same impact as described above, but in lower volume.
Just do not do that for pieces.

Obviously losing customer data is unacceptable, but, if I understood correctly, the database does not contain any critical data, and restoring a slightly older version only loses a few orders, gets the used space a bit wrong, and allows some pieces (a few MB) to be downloaded once for free.
Assuming my node does not crash multiple times per day this is interesting. I do not know if I am going to do it (I can just use a couple of SSDs) though.

tmpfs is a RAM drive, so obviously it loses all data on a reboot.

not you too… lol, i just don’t think the possible rewards are worth the risk…

just get an IO accelerator card… granted it’s like $200 for a nice RAM-based one, but then you can simply set sync=always and it goes to the RAM storage on the IO accelerator, and in case of a power outage it gets stored on flash with power from a supercap, meaning no batteries, no real wear, and the RAM modules are usually replaceable.

all the advantages and none of the downsides… i’ve even seen RAM modules that fit in memory slots with batteries attached to them for this kind of stuff… might be cheaper than an io accelerator card.

the whole idea of IO accelerator cards comes from Sun and their old SPARC computers, for which i assume they also developed ZFS…
why do it halfway… besides, you can get older versions of IO accelerators down to $40 or so… they won’t sustain 10gbit speeds, but they can still be RAM based if that’s what one wants.
or i suppose they do, but then one would need to lower the SLOG flush interval depending on the capacity of the card… the $200 or $150 neat modern accelerator cards are 8GB i think, so they would easily sustain 1200MB/s for 5 sec as per the SLOG default… and by lowering it to like 3 sec per flush you could do dual 10gbit ingress on it.
or like a million random iops… of course you would need a pretty decent pcie connection to push that much data through it…
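quick sanity check on those flush numbers (assuming an 8GB card):

```python
card_gb = 8
print(1.2 * 5, "GB buffered at 1.2 GB/s over a 5 s flush interval")   # 6.0 GB, fits
print(2.5 * 3, "GB buffered at ~2.5 GB/s (dual 10gbit) over 3 s")     # 7.5 GB, fits
```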

like pcie 2.0 x8 or better just to supply the bandwidth from the nic, + 2x10gbit to the io accelerator, + 2x10gbit onward to the drives… + whatever else one needs… it would take a tricky setup to make that fly well on my 10 year old server…

but yeah, you could do that in RAM… i suppose that’s what the RAM modules with batteries attached are for… to avoid the pcie bottleneck.

Sorry, @Alexey, I wrote this post while you were splitting the topic. Please move this post too. :slight_smile:


Yes, I agree. But again, that discussion is totally irrelevant to whether the databases are stored in a ramdisk. Whatever RAM is free is used for the I/O cache so if you are picking between corrupting a DB in RAM or corrupting the I/O cache, you’re going to have a bad time either way.

If the databases are not critical for node operation and can be rebuilt, then it doesn’t actually matter. That’s why I want to hear from @Alexey on this topic.

I don’t care about any metrics I’d lose in the database if they don’t impact node operation. I have a separate monitoring server that collects and records historical data, so the metrics in the node database are redundant to me.

Seeing 1-2 orders of magnitude improvement on DB writes that are already contending for disk IOPS isn’t an improvement? Okay…

Keep in mind that sync() on RAM is ~free; the data is already there. On an SSD it means pushing a write buffer to NAND cells and there is still some latency on that. Not as much as an HDD but still significantly more than on RAM.
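You can see the gap yourself by timing fsync on a tmpfs path versus an SSD path (rough sketch; the mount points are assumptions):

```python
import os, time

def fsync_latency_ms(path, payload=b"x" * 4096, rounds=200):
    """Write a small payload and fsync repeatedly; return mean latency in ms."""
    with open(path, "wb") as f:
        start = time.perf_counter()
        for _ in range(rounds):
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())
        elapsed = time.perf_counter() - start
    os.remove(path)
    return elapsed / rounds * 1000

print("tmpfs:", fsync_latency_ms("/dev/shm/synctest.bin"), "ms per fsync")
print("ssd  :", fsync_latency_ms("/mnt/ssd/synctest.bin"), "ms per fsync")
```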

This is the only point that I will concede – you are going to shrink the available RAM for the I/O cache. But keep in mind that, at least on Linux, the contents of a tmpfs ramdisk can be swapped to disk. This has the net effect of ejecting unused DB pages to swap. If your swap is on SSD then you only get the additional performance hit of writing the database to disk when the system decides it is advantageous to free up RAM. So you’ll see significantly better DB performance in the average case, and in the worst case you’ll see the same performance as using an SSD as storage (which would be rare, as the swapping algorithm is pretty good about deciding what to swap out when there is a lot of free RAM).

For reads, sure. For writes you still need to have writethrough to some kind of persistent storage, even if it’s an SSD cache that does writeback to an HDD, and writes are where you see the contention. During writes, you cannot even read from the database so writes must be fast.

Uncontended reads on the database should generally be of no consequence with or without ZFS.

seen this one a couple of times, maybe more partial views xD
ZFS is very smart about dealing with that stuff… one of the most impressive things i’ve found recently is with my VM disk images: it will literally identify the most used parts of the drive and keep them in the ARC (RAM), or if out of RAM they go to the L2ARC. with this my VMs run like on SSD, unless i use something i don’t usually use and it doesn’t predict what i’m going to do and prefetch it.

ZFS uses checksums… it keeps a checksum for each record… say 128kb per default, so if a 128k chunk of my VM’s disk image is used often it will stay in RAM, else it stays on spindle… i don’t have to do anything… the system simply tunes itself.

of course it comes at a cost… i’ve got 48GB RAM of which 23GB is used for ARC, and then i’ve got 600GB L2ARC for my 42TB of spindles. it simply adapts; whatever goes in memory depends on most-frequently-used and most-recently-used parameters, which fight over what is most needed.

i can very much recommend ZFS if you want to do advanced caching; i’m unaware of any better way to go for that, and it’s safe because everything is stored on hdd. the others, like RAM and the L2ARC SSD, are just for working faster with stuff, giving you the advantages you seem to want…

Right – for reads. For writes it still has to write to persistent storage (at least the first storage tier) which is the whole idea behind using a ramdisk: reducing write contention. SQLite writes require multiple trips between the WAL and the database. IIRC, at least two syncs are necessary.

ZFS RAM caches cannot help there unless it doesn’t tell the truth when asked to sync() and returns success before the data has been persisted, in which case it would not be considered a durable filesystem.
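You can inspect what SQLite is configured to do on a copy of one of the node databases (a sketch; the path is an assumption, not the real node layout):

```python
import sqlite3

con = sqlite3.connect("/tmp/orders-copy.db")
print(con.execute("PRAGMA journal_mode").fetchone())  # e.g. ('wal',)
print(con.execute("PRAGMA synchronous").fetchone())   # 2 = FULL, 1 = NORMAL, 0 = OFF
# In WAL mode with synchronous=FULL, every committed transaction costs at least
# one fsync of the WAL, and checkpoints later sync the main database file as well.
con.close()
```

Either way, each commit has to reach something persistent before SQLite reports success, which is exactly where the ramdisk wins.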

yeah, but that write goes to my dedicated SLOG SSD, and then every 5 seconds it is written as one big sequential write to the spindles, and because it’s recent it still stays in RAM until it falls out of use or something more important comes along…

with the SLOG i mitigate fragmentation because i write everything to the slog instead of the spindles, errr hdds, you know… :smiley: so the fragmentation is on the SLOG, and after 5 sec it gets written to the disks in one go, which is then basically unfragmented… i know arguments could be made… but for all practical purposes let’s just call it that for simplicity’s sake.

so yeah, you might get some fragmentation, but it’s greatly decreased… i mean an hdd can do a write in 6ms, and 5 sec is basically 1000 times longer… not sure it makes that big a difference, but it’s way more than needed.

so really the writing all just happens in the background, and of course sequential writes are like 20 times faster, if not more like 40, on hdds compared to Q1T1 random RW.

it doesn’t make everything go faster, because i push everything through the SLOG SSD, but that’s a throughput issue that doesn’t really come into play when the max ingress i’ve seen on the storagenode is 6 or 7 MB/s
and one can get old cheap IO accelerator cards that will push 2-3GB/s at 300k or so IOPS for $40
might get one of those, but for this first attempt at ZFS i just went with a years-old MLC ssd i had in a little-used laptop.

anyway… watch the video, it’s well worth it if you are even the least bit interested in advanced caching.

Yep, that’s exactly what I mean. Writes to RAM are 1-2 orders of magnitude faster. Using a SLOG SSD in ZFS for this isn’t much different for writes than just using an SSD to store the databases. :slight_smile:

My point is that using a ramdisk would significantly reduce write contention compared to an SSD. Saying “use ZFS with an SSD cache” just ignores the entire point I’m making.

no, because it’s still in my ARC / RAM; the disk is simply there to write it out to stable storage, and the SLOG SSD is there to make it safe faster than if it was going directly to HDD…

well, i can usually keep my avg latency on hdd reads at about 12ms, which is basically the best speed they can get… of course it’s quite a few drives, but they can still write sequentially at 100-120 MB per sec, and it’s not that important because it’s in the SLOG, so its flushing will afaik be secondary to regular reads, which are the only thing that matters since all writes go to the SLOG while also staying in memory… sure, it might use a good deal of added capacity because the data can exist in 3 or even 4 places at the same time… ARC, L2ARC, SLOG and/or HDD
the SSD L2ARC just makes it so commonly reloaded stuff that there wasn’t room for in ARC (RAM) is accessed quicker and doesn’t disturb the HDDs

If it’s used often it will always stay in ARC, like say the storagenode database… it exists in more places so that if the memory is lost you can restore it without loss; it’s all really about limiting write contention on the HDDs
it does of course eat much more internal bandwidth on the system, but most computers, even very old ones, have fairly high internal bandwidth… one of my current bottlenecks is the 3gbit sata controller on the 10 year old server mobo which i ended up connecting my SSDs to… but i haven’t gotten around to rebooting to fix it, because it’s barely even a real problem… though for some reason i can currently only get 50MB/s write speeds… lol, 50MB/s, though not fast by modern standards, is still a significant amount of data… i’ve been reading my manual to figure out which sata ports on the 3gbit controller are separate, which might double my throughput again… but that’s only related to writes currently, most likely because it’s actually using the full 3gbit/s to write the approximately 400mbit that 50MB/s works out to…
because the L2ARC and SLOG are both on the same controller, the data is moved into one, from one into the other, and then back over the PCIe bus into the HBA and onto the drives…

i could of course move the SLOG to something faster, but then it would take up a bay on the backplane which i use for hdds, and it may come to something like that… but for now it’s an easy way to give my system low latency even when dealing with high-iops random reads/writes.
it will still read at 500MB/s sustained or something like that, most likely while writing 50MB/s sustained, because there shouldn’t really be any difference… maybe i will get 450MB/s reads though… but i doubt it… it seems i generally run into weird IO bottlenecks from the system shuffling data around to make everything run consistently better.

of course 50MB/s is kinda unacceptable… because i’ve got the server on a 1gbit network… so i kinda want to get it up to that point… not that i write enough data to really need it… but it’s nice that stuff takes a lot less time.

ZFS is amazing though, even if my rig is kinda outdated… only a decade old lol, so there will be a few bottlenecks to bypass when doing stuff like this… but it’s going to be mighty, mighty good by the time i’m done with it.
a $40 IO accelerator card and the write bottleneck is gone… of course, remove one bottleneck and you find another… i couldn’t use 400 or whatever MB of writes a second anyway… i’ve only got the one pool, so everything would go over the network… so anything above 120 wouldn’t matter in any practical sense for now… and i bet that after the next reboot i’ll get it back to 100… just a poorly placed cable issue.

and the L2ARC is going to be full sometime in the next 24-48 hours, which should be interesting to start monitoring and testing with… it takes the better part of a week, but everything i use on the pool will just run like it’s on ssd, or like it’s from ram if it’s used very often… it’s amazing