Just remove it, and you will find out. But there's a chance the node won't start. Then you probably have to remove them all, if you're not able to recover this particular database. And probably other databases will turn out to be corrupted as well. Then you're losing stats indeed. But it's a question of what is more important to you now: a running node or full stats.
But I actually think the databases are really a pain in the ass of STORJ. I never came across an application with so many failing databases. The developer team should consider another solution, in my opinion. It's probably even in the settings they use, because sqlite itself is used very widely.
Well, the node actually did not start, yes. I removed them all and started a new clean one. Not really a big deal, because this node was only about 10-15 days old and the reason for the failure was a corrupted ext4 fs, which I fixed with fsck. But still, for the future: if a node has been running for 6+ months, it will be really sad if some databases crash with no way to restore them, only delete.
Thread may be closed
You can just remove all databases; they will be recreated as soon as you restart the node. You don't have to start the node all over, as in from scratch.
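For anyone finding this later, a minimal sketch of that procedure (assuming a docker-based node named "storagenode" and a storage directory under /mnt/storj; adjust both to your own setup):

```bash
docker stop -t 300 storagenode                               # give the node time to exit cleanly
mkdir -p /mnt/storj/storage/db-broken
mv /mnt/storj/storage/*.db* /mnt/storj/storage/db-broken/    # keep the old files, just in case
docker start storagenode                                     # fresh, empty databases are created on start
```

You lose the history in those databases, but the node itself keeps running.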
My experience up to now is that databases are usually not corrupted on their own, but multiple at the same time. So I deliberately put them on a different disk than the data files (because of the many writes), which is partitioned as btrfs and is being snapshotted. In the end I only lose some hours of stats, but that's it.
In my opinion, STORJ should take those back-up measures on its own. For example: copy all sqlite databases to a tmpfs device at start, write them back under another name from time to time, and rename that copy to the original name as soon as the file has been successfully closed.
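Something like this write-back cycle could even be scripted outside the node today; a rough sketch (the paths are made up, and the raw copy is only safe while the node is stopped or when sqlite3 makes the copy itself):

```bash
SRC=/tmp/storj-db/bandwidth.db        # working copy, e.g. on tmpfs
DST=/mnt/ssd/storj-db/bandwidth.db    # persistent location
cp "$SRC" "$DST.tmp"                  # write under a temporary name first
sync "$DST.tmp"                       # make sure it actually reached the disk
mv "$DST.tmp" "$DST"                  # atomic rename: the old copy stays valid until this succeeds
```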
But just out of curiosity: are there plans to handle those database errors automatically? Especially because they come along that frequently? I mean:
Just recreating missing databases on the fly isn't that hard…?
Running a pragma check with a recovery attempt, or recreating an empty database if that doesn't work, wouldn't be that hard to automate either (see the sketch after this list)?
Working with more of a back-up idea, like I suggested in the previous post? (E.g. copying the database to a working file like database.db to database-working.db, closing them every N hours, checking in the folder whether the file is still well-formed, renaming the file over the previous database, restarting the cycle, …) Or creating a tmpfs disk in memory, which even brings down IO to the disks. If anything in the process hits a problem mentioned in the sqlite manual, you're off without a problem in most cases.
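To make the second point concrete, here is roughly what such automation could do, written out by hand (assuming sqlite3 is installed, the node is stopped, and the databases live under /mnt/ssd/storj-db; all of that is just an example):

```bash
for db in /mnt/ssd/storj-db/*.db; do
  if ! sqlite3 "$db" "PRAGMA integrity_check;" | grep -q '^ok$'; then
    echo "$db is malformed, trying to salvage what is readable"
    sqlite3 "$db" ".recover" | sqlite3 "$db.recovered" \
      && mv "$db.recovered" "$db" \
      || rm -f "$db"        # worst case: drop it and let the node recreate it empty
  fi
done
```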
Just trying to improve the workflow, because SNO isn't the most well-paid job, and I'm not kidding: I never came across a program with as many sqlite problems as STORJ. As you can see, many run into trouble by just power cycling their node, as if the pragma settings aren't right or something. Recoverability isn't guaranteed. Speaking for myself, it already happened 3 times in the last 2 months on the six nodes I'm running, and none of them were recoverable. My nodes are all behind a UPS; only one time did it happen because of an OOM failure, the other times it just happened after power cycling. I had already put the databases on another drive (SSD) in advance, because I saw this error come along on the forum so many times. I finally decided to use BTRFS for the host filesystem, so it would take me as little time as possible to recover from such errors while keeping as many stats as possible.
These corruptions are a litmus test. A wake-up call. Silencing them would be akin to silencing a smoke alarm. The fire will still get you even if you make the fire alarm reset itself. This is data loss, and it needs to be prevented by the operator. If the operator cannot make sure their node does not reset abruptly, perhaps they should not be hosting a node. It's the bare minimum that is required: keep it running. Even a graceful restart of the storage appliance is an ordeal that should virtually never happen, much less an abrupt one.
Lol. Those who don't power cycle their nodes and don't kill the storagenode process don't run into these issues… so… how about… don't do it!?
That is three times too many. What changes did you implement to prevent this in the first place? Or at the very least, after the first time?
It is against the ToS to prevent the node from using the specified amount of memory. If the disk subsystem is insufficiently performant, the node can use an abnormal amount of memory; either the disk subsystem needs to be fixed, or more memory provided. Under no circumstances shall the node be killed.
Also, don't power cycle compute devices, much less storage appliances. If you want to do that, you need to ensure all writes are atomic, but if you do, you won't be able to run the node, as you will run out of IOPS even at the current utilization.
Ensure that a graceful OS shutdown waits long enough for the process to exit, and does not kill it abruptly (this is often an issue on Windows).
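On Linux with systemd this is just configuration; a sketch for a systemd-managed node (the unit name "storagenode" and the 15-minute limit are assumptions, not Storj defaults):

```bash
sudo mkdir -p /etc/systemd/system/storagenode.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/storagenode.service.d/override.conf
[Service]
# allow up to 15 minutes for the node to flush and exit before systemd gives up
TimeoutStopSec=900
EOF
sudo systemctl daemon-reload
```

For docker-managed nodes the equivalent knob is the -t/--time value passed to docker stop.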
Do those SSDs support PLP? How will putting databases on an SSD help with abrupt power resets? Data in flight will still get lost.
Advice not to use BTRFS was posted on the forum many times too…
If you are hoping that snapshots will save you, they won't. You can't snapshot an open database and expect it to be in a consistent state in the snapshot. If you want to back up databases, you need to either have all client connections closed and the disks synced at the time of snapshotting, or you'll need to back up data exported from the live databases.
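For SQLite specifically, the supported way to copy a live database is to let sqlite itself do it, e.g. (the paths are hypothetical):

```bash
sqlite3 /mnt/ssd/storj-db/bandwidth.db ".backup '/mnt/backup/bandwidth.db'"
# or, with sqlite 3.27+:
sqlite3 /mnt/ssd/storj-db/bandwidth.db "VACUUM INTO '/mnt/backup/bandwidth-copy.db'"
```

Both produce a consistent copy even while the node keeps writing, unlike a raw file copy or a snapshot of an open database.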
I strongly believe automatic repair shall not be implemented. Data loss is a catastrophe: it shall be noticed, measures shall be taken to prevent it from occurring again, and only then shall the node be started again. Otherwise it's a massive waste of time.
I now understand why it's an arrogant rabbit ;p
This is data loss, but selectively in the databases. These nodes have never missed out on an audit, so I doubt it has the same root cause. Also because 2 of the 3 times, there were no *.db-{wal,shm} files.
At the beginning, I was tweaking some things concerning networking. Besides, in the end you sometimes end up rebooting your PC due to some update or something. Cheap advice isn't hard to give…
Kind of annoying, but they all have at least 2 GiB assigned. In this case it was a hung process on the host, which is why one qemu VM was killed.
No, but it's just normal rebooting of the system that is already happening. So I hypothesized it might be due to hung/slow processes taking longer than the timeout; maybe worsened by the fact that the data and the databases were on the same (hard) drive. Besides, an SSD is faster at writes, so the data spends less time in flight.
Besides, I did a full memtest, which turned up no issues.
I even pinned my VMs to CPUs, so they would be less likely to stall each other.
So you advise me to stop the node for a while?
No, but these btrfs snapshots are filesystem-consistent, so no syncing troubles. Besides, the *.db-{wal,shm} files are snapshotted too. Both times I used this approach, sqlite was able to pick them up again. And otherwise there is another snapshot from 6h before…
But again, I've never heard of or seen an application with so many sqlite database troubles. And welcome to real life, where updates sometimes mess with your connections and force you to reboot, where there may be power glitches whatever you do to prevent them, hardware defects or other unforeseeable events, none of which are signs of bad SNO-ship per se. In those situations it would be helpful if the databases weren't giving you trouble.
I believe there is a difference between node data and stats. This probably doesn't have the same root cause as data loss from the node, which you would notice by missing out on audits. And as I already argued, I am almost sure it's a different thing, because I never missed an audit (or at least the score is still 100%). So I believe it should be implemented, and otherwise I'll be doing it myself. Not least because of the strong opinions conveyed in this thread.
Besides, the relation between data loss of the blobs and the malformed-database issue isn't mentioned in the forum or in the manual.
Dangit! My cover's been blown! Quick! Regroup and disappear into the sunset…
But even if I exaggerated, it was just a tiny bit.
Databases are written to constantly. Node data - once, ever. Also, audit is very weak in terms of detection; it's a last resort: if it catches anything, the node is probably already dead. You can have a massive amount of objects missing and not have an audit failure for years.
Databases, on the other hand - those you'll see immediately.
Data loss is data loss. It does not matter what is lost, because if anything is lost at all - game over, the system is broken.
Great that you brought it up. Journaling only works if the underlying filesystem does not lie about write atomicity. But if you keep sync writes on, your node will choke: too many IOPS. So the sensible advice is to disable sync on the mount where storj keeps its data, and especially the databases. This unlocks tremendous performance, but now you can't afford a reset. It's a fair trade-off.
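What that looks like in practice depends entirely on the filesystem; on ZFS (which is an assumption here, and the pool/dataset name is made up - other filesystems have no direct equivalent) it would be:

```bash
zfs set sync=disabled tank/storj     # fsync() returns immediately; dirty data stays in RAM until the next transaction group
zfs get sync tank/storj              # verify the setting
```

With that set, an abrupt reset can throw away the last few seconds of writes, which is exactly why it only makes sense if resets never happen.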
On reboot, the OS shall send a graceful exit signal to processes (varies by OS) and wait for them to exit. Patiently. Storagenode can take a long time to exit - limited by drive IOPS. Desktop OSes may have lower thresholds for patience and kill the process. That's a problem.
Less is not zero. Data loss probability must be eliminated, not merely reduced.
Those are once-in-a-lifetime events, each of them. You said you had three of them in a row. That means your setup is inadequate (no offense intended here).
A stable system does not need updates more often than once per quarter. And it does not need to be rebooted if one subsystem fails. A UPS keeps it powered forever and manages graceful shutdowns when power is lost. A graceful shutdown waits for apps to exit and syncs the disks before halting. There is zero opportunity for data loss of any kind. This is the reality I'm experiencing.
CPU is not a problem here, only disk latency.
Yes. Or export data from the live database, if you rightfully don't want to stop it. That's done atomically. But none of that would be required if your hardware and software did not allow data loss in the first place.
See above: you can lose quite a lot of data and never fail an audit, because node data is write-once; databases are constantly rewritten, so corruption is immediately evident. If the databases are on an SSD without PLP, your chances of losing data are much higher, by an order of magnitude, because of the read-modify-write nature of SSDs. Instead of putting the databases on an SSD, I would turn off sync for the mountpoint and ensure there are no abrupt resets.
I completely understand your point and wholeheartedly disagree with it: time will be better spent, and with a better outcome, on preventing data loss than on concocting wicked recovery techniques.
Why would that matter? Data loss of any kind is bad and shall not happen. If it does, the appliance in its current form is not suitable for storing data.
In one of the threads I suggested keeping the databases on tmpfs: it evaporates on reboot and will always be consistent. The metrics are useless anyway; I personally never look at them.
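A rough illustration of that idea (the mount point, the size and the use of storage2.database-dir are assumptions to check against your own config.yaml):

```bash
mkdir -p /mnt/storj-dbs
mount -t tmpfs -o size=2G tmpfs /mnt/storj-dbs
# then point the node at it in config.yaml:
#   storage2.database-dir: /mnt/storj-dbs
```

The databases then start empty after every reboot, which is the whole point: no stats, but also nothing to corrupt.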
But this will be with the understanding that your storage is unreliable crap, pardon, just reliable enough to host storage node data, where some data loss is acceptable and accounted for in the design.
But there is no other use case that can tolerate such a total-garbage, low-reliability storage service; which means you are running the node on hardware configured specifically to run a node, which goes against the recommendation to only use unused resources.
So, my node is on an array that guarantees data consistency, where all my other data lives. The databases are on the same array. I have never had data loss. Not with storj since day one last year, nor ever in the past 15 years of running servers at home… (and I'm not planning to).
This is why any data loss is a catastrophe if you are running a storage node as designed.
This is quite a bit of exaggerating indeed. Take my smallest node, with about 250 GB of data now (started one week ago), on which I currently get 25 audits daily. If you lose a substantial share of files like 0.1%, 0.5% or even 1%, the chance it won't be detected within a week is 84, 41 and 17% (= {0.999, 0.995, 0.99}^(7 × 25)) respectively, and 47, 2 and 0.05% (= {0.999, 0.995, 0.99}^(30 × 25)) within a month. For bigger nodes with the same relative amount of data loss, these figures are considerably lower, because they get more audits.
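The general form behind those percentages, with f the fraction of pieces lost, a the audits per day and d the number of days:

$$P(\text{undetected after } d \text{ days}) = (1 - f)^{a \cdot d}$$

For example, f = 0.005 and a = 25 gives 0.995^175 ≈ 0.41 after a week, matching the 41% above.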
Well, it actually does matter, because data loss of the blobs is taken care of and accounted for in the STORJ design. That's also the reason why it's advised not to run STORJ nodes with RAID > 0.
The reason I brought it up is that if the node has made a clean exit, the *.db-wal and *.db-shm files are cleaned up and merged back into the databases, and you end up with only *.db files. That's also the reason why I doubt it has to do with the timeout running out or something.
I'm working on Debian (without a desktop). The timeout is given by docker -t 300, so 5 minutes, which is the same for the qemu VM. Usually more than enough, because it usually takes less than 30 s with about 10 drives.
That's the opinionated part I spoke about, because that isn't even advice from STORJ itself. I refer you to the part stated above. It also won't ever become zero, because everything has its lifetime.
Oh, the first month I was for sure fiddling around and trying to optimize the whole thing. So I may have rebooted sometimes 10 times a day in the first weeks. But even then, I have never had any application using sqlite (of which there are very many around) that I ran into trouble with. For example, Syncthing, Plex Media Server and Home Assistant are running on the same systems, using much bigger databases (and not just one table per file) with the same sqlite driver. Those applications have seen many more reboots and even unanticipated power cycles, but I've never had any problem with them.
So yeah, I could be the problem here. As I see it, SNOs are often the problem and I won't be an exception. Also because there are not always clear guides on how to make a rock-solid node, so you sometimes have to tweak a little bit. But there's also a peculiarity going on at the side of STORJ, if you ask me.
Fine, but IO can be quite stalling for the CPU. Besides, pinning the whole process to one CPU keeps the system from moving memory around. And I can tell you, since I did it the CPU load has decreased by over 20%.
I can imagine this, although it might be written to disk when the node exits (and maybe every now and then in the meantime).
I myself was thinking of installing anything-sync-daemon, which is kind of the same thing.
This is making assumptions. As I wrote before in this reply, Home Assistant, Open Media Vault, Syncthing and Plex Media Server are also running on these systems, aside from some VPN servers. And even if I were using dedicated hardware, what would be the problem if I saw it as a hobby, or lived in a place where I could earn money with it because electricity and hardware aren't the same price everywhere?
That's great. But it's also against the recommendations, as I wrote before. I also never suffered any data loss on Home Assistant, Open Media Vault, Syncthing, and so on. Because of having them in RAID1 in the first place, but I'm also tempted to think these applications are quite a bit more stable.
In my world there is a difference between an irritation, a little problem, a problem, a big problem, a small disaster, a disaster, a big disaster, a small catastrophe, a catastrophe… STORJ never rose above a problem, data loss never above a little problem.
Therefore, every day 0.02% of files get checked, at random. This also happens to be the probability of detecting one single corrupted block in a day. Therefore, corruption can stay undetected on average for 2500 days, or about 6 years.
Yes. For storj, it's OK to lose some of the blob data. But that's it. We are talking about your appliance, which also runs storj. And your appliance demonstrably loses data. Full stop. Fix that.
This is misguided advice; it contradicts the guidelines of re-using existing hardware, and has nothing to do with the present discussion: RAID only addresses rot, bad sectors and other media failures, not filesystem promises. See below.
You are making a few implicit assumptions here that may not be true (and likely aren't, seeing that you see corruption). The most common ones are:
broken file locking: Docker had (still has?) flock not working across bind mounts, which breaks the assumptions sqlite makes.
broken fsync: same deal, with docker not fully implementing this.
So no, a clean journal means nothing unless you satisfy these prerequisites and promises. And if you use docker, you already don't.
Again, we are talking about your appliance, which you also happen to use for storj. You manage to lose data, so until you get to the bottom of it, you cannot trust your appliance to host important data.
It absolutely does become zero if configured correctly. Literally zero. Correct data or no data. Never corrupted data.
This actually confirms the point. Syncthing, Plex, and Home Assistant write very little to their databases. But just look at the write traffic storj sends to its own. It's massive; it's the largest contributor to IOPS from the storagenode. So drastically lower traffic is one reason you haven't yet seen an issue with them.
Another reason could be how you configure and run them (without docker, with correctly configured mounts, etc)
And yet, it does not matter. No amount of absence of issues proves anything, but even one failure proves the existence of a problem. So either storj has a bug in its use of sqlite (you can review the code) or your appliance is violating the assumptions and misbehaves.
I would start with throwing away docker. Run storagenode directly on your host OS. There is no benefit in containerization for go applications.
This is a brand new sentence that makes zero sense. Zero. All CPU cores have access to the same memory controller and shared cache.
CPU load is irrelevant, and many other things could have changed to contribute to the apparent load reduction, including bugs in monitoring software.
SQLite does this internally anyway. You are suggesting reinventing the same wheel around the database, instead of fixing the underlying issues that cause the corruption in the first place. The suggestion of a ramdisk was to get rid of 100% useless IOPS.
Different write pressure, so irrelevant.
Not sure what you mean. You are running it on a shared appliance, and that appliance allows SQLite, of all things, to lose data under a moderate load that happens to come from storj. You need to root-cause that and fix it.
Again, storj losing data is not a problem. Storj here is a canary showing you that your appliance is capable of losing data. In my world there are two types of storage devices: those that lose data and those that don't. If the device lost 1 byte of data, I need to know why and prevent it, or I cannot trust this appliance any more.
You seem to sleep well at night knowing that your storage appliance is misconfigured to the point of losing data under moderate conditions; kudos to you. I can't. I need to find the culprit and fix it. Not for storj, but for all my other data; and I would be immensely grateful to storj for generating the usage pattern that uncovered this vulnerability.
It's not. Your example is just one corrupted block of data, and I was talking about 0.1, 0.5 or 1.0% of the data. That's comparing apples and pears. Or to put it otherwise: if you lose 0.1, 0.5 or 1.0% of the data today, the chance you won't see it in the audit score even within a day is {0.999, 0.995, 0.99}^2482 = {8, 3E-4, 1.5E-9}%, and it is about zero within a week. But that one piece might take forever… (if not deleted before being audited).
Since STORJ is made fault-tolerant precisely to cope with data loss, this is real nonsense. Like even using RAID > 0, which isn't stated anywhere as an obligation or even a recommendation.
Whoever wrote the SNO handbook, I don't know. But the advice on Hardware Requirements - STORJ SNO Book even contradicts the official advice not to use RAID5. So, if you want to RAID all the data, that's fine by me. I choose to start some additional storage nodes over time.
My appliances run other things apart from STORJ in their own VMs, and those haven't suffered any data loss whatsoever. That's the whole point. Besides, the database of Home Assistant is over 3 GiB, and it writes about 10 GiB a day (measured as the increase in TBW of the SSD, which differs a bit from the real amount of data written). That's really an awful lot more than the STORJ database, which in my case contributes less than 1 GiB a day to the TBW per node (remember: databases and data are on different disks in my case). Don't worry about data loss there: it's RAID1 and backed up every day to the openmediavault server (also RAID1, on other drives).
So indeed, different write pressure. But is it then that unbelievable that STORJ manages to fail on me, while Home Assistant, for example, doesn't?
It will never become zero. For example, the chance of both drives failing on the same day, assuming a lifetime of 10 years: that would be 7.5E-6%, assuming these are independent variables. In practice it is much higher, because these drives are more often than not about the same age. And external influences like a fire, lightning strike, flooding, war, a nuclear bomb, … total earth destruction whatsoever will probably make them fail together. After all, a RAID isn't a back-up. For that I'm using Syncthing, syncing my really important files to two other locations (family members living elsewhere). But even then, my chance of losing the data isn't 0%. It's small, but never zero…
I already cited the sqlite "how to make it crack" manual some posts before.
But not using docker is a really good point, especially since I'm running those storage nodes in separate VMs anyway. Is there any manual lingering around on how to do this on Debian/Ubuntu(-derived) Linux? I actually can't find a recent one online, only this oldie.
Yup, as long as it only pertains to a hobby project, I'm really fine with it. And considering the whole topic, I'm increasingly convinced it's a STORJ issue. Also because I find the Plex Media Server database to be over 5 GiB and the Syncthing database over 3.5 GiB (which turns out to be a LevelDB, BTW), and those are rebooted / power cycled the same way as the storage nodes. Aside from the already mentioned Home Assistant, with a bigger file size and higher write pressure on the database.
Yes, let's stop talking about this: I said in every post that storj does not care about data loss, but it losing data means your appliance can lose data, which means it's not safe to store your important data either.
Let's dig into this further. Does each service run in a separate VM? How is the storage that hosts the databases provided to the VM? Disk passthrough or some mount point? I.e. who manages the filesystem on that device - the VM or the host? Is the situation the same for Home Assistant and storj?
I was (and still am) talking about the possibility of returning bad data. If both drives died, the appliance would return no data. Remember: either correct data or no data. Never corrupted data.
Funny how you said RAID isn't backup (correct, it isn't), but then you say you use Syncthing for backup (which isn't either). But let's not go into that off-topic.
This might be the root of your problem. File locks do not work across kernel boundaries. Using docker instead of a VM here would actually be an improvement.
There is no need for a manual. Storagenode is a command-line utility; run it with --help and it will tell you what to do. If you want step by step - here is my "tutorial" in the form of a script for FreeBSD, which is literally the list of things I did manually based on the --help output, written into a shell file so I don't have to do it manually again. You can adapt it easily to systemd on Linux: freebsd_storj_installer/install.sh at 93c882b08e4ee724b63114d4c84598640dd6b7eb · arrogantrabbit/freebsd_storj_installer · GitHub
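A very rough systemd adaptation of the same idea (the binary path, user and config directory are assumptions; run storagenode setup --config-dir ... once before enabling it):

```bash
cat <<'EOF' | sudo tee /etc/systemd/system/storagenode.service
[Unit]
Description=Storj storage node
After=network-online.target

[Service]
User=storj
ExecStart=/usr/local/bin/storagenode run --config-dir /var/lib/storagenode
Restart=on-failure
TimeoutStopSec=900

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload && sudo systemctl enable --now storagenode
```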
You are missing the point. Unless you fully understand why the storj databases on your device get corrupted, you can't be sure the same underlying issue does not affect your other data.
I'm increasingly convinced it's your configuration issue. If this were a storj issue, everyone would have been affected, not just a minority of users. After all, it's really hard to screw up the sqlite API, especially since they just use Go bindings from a single process. But it's very easy to screw up everything else around it in the filesystem and environment.
Plex barely writes to its database. Storj writes all the time. If the issue affects writes, you are unlikely to see it with Plex or Syncthing, but you will see it with storj.
I'm not familiar with Home Assistant, but I doubt it writes gigabytes of data daily. Where does that data come from? If you are judging by fast SSD life decrease, it could be write amplification due to a sector size mismatch (did you force a 4k sector size when adding the SSD to the pool?). But if it genuinely writes gigabytes of garbage per day, and you share the SSD the same way as with storj (across kernel boundaries) - run a check on its databases too. I bet they're corrupted as well.
Usually database corruption happens when the write cache is enabled but the node was abruptly stopped, or if you use an unstable filesystem like BTRFS, or network filesystems of any kind, or some configurations of Unraid.
Of thousands of SNOs, only dozens have problems with database corruption; this suggests a non-optimal configuration in these cases.
For example, I have not had a database corruption since I started in 2019, but my nodes run on NTFS and ext4, no RAID. The NTFS ones have a UPS, the ext4 one does not (RPi3). One Windows node is a binary node, the two others are Docker for Windows (so even worse: these two use a network filesystem (p9) to access the disks).
So, depends on the setup.
This is not considered a normal situation; it requires investigation, because missing databases are only the smallest of the problems in such a case.
Dumb automation doesn't help, it will just hide the problem. And we would likely get posts like "I found a bug - my stats suddenly disappeared!"
Again, it's not a common and normal situation. I am against such automation, which hides problems.
Unlikely. Usually you also have data corruption in other places, just not discovered yet. Audits do not check the entire dataset; they are random checks of random parts of pieces, used only to determine whether the satellite can trust your node, not to verify the integrity of all the data. If you have corruption, it will be caught eventually and your node could be disqualified.
This suggests a slow disk subsystem or other hardware issues, if the process cannot be stopped normally even after a 300-second timeout - assuming your OS respects this timeout at all.
Yep, you may read the Unraid forum for SQLite corruption in any application. So, it depends. And not always on the application; it also depends on the underlying setup. Why have I not had an SQLite corruption for the last 4 years? And the remaining thousands of SNOs?
Then you probably should place the databases back on the data disk, if it's more reliable than having them on the SSD.
Each process is running in a separate VM, using qemu.
Partitions are being passed through, so no caching (it isn't supported anyway), and the filesystems are all managed by the guests.
The situation is the same for Home Assistant and for the storage nodes: the filesystems are all in RAID1 (so two partitions on two different drives are passed through for the root filesystem), formatted as BTRFS in order to have snapshots and scrubbing.
Two nodes reside on exactly the same drives as Home Assistant; the other four nodes are on different systems but have the same configuration.
Root file systems are all on internal drives.
Most data drives for STORJ are USB drives.
Same stance here for personal data, therefore RAID + back-up of essential data.
What do you mean? Are you pointing to the fact that it's not used on a frozen (but running) filesystem, or something? Just out of curiosity.
As far as I'm concerned, the Syncthing use case is that all our phones are synced in real time so photos aren't lost accidentally. And since we're using multiple systems, our documents are the same on each system (of which we're using one at a time anyway). The Syncthing system is snapshotted at regular intervals, going back up to 3 months. Besides, once a day (at night) the whole system is synced with the NASes of two family members. And once every one or two months, I sync an external drive with all the data. The biggest concern in this is that Syncthing doesn't validate the data itself, which it easily could do, but that has been turned down as something the filesystem should be doing.
Home Assistant is a platform you can use to connect smart home devices from different brands that usually don't work together. The whole point is that every state update of all systems is stored, which in my case is about 300-500/min, aside from the logs that are written to it. Many complaints about wear-out of micro-SDs and SSDs have come up over the last years. It used to write 40 GiB/day per drive before I implemented some measures (such as reducing logs and more selective event logging); since then it's 10 GiB/day. Almost fully attributable to database writes.
Apart from commit time, no other caching is enabled, since these are passthrough block devices.
The first one was running on ext4 and probably choked on too few IOPS, because it was an external hard drive a few years old. A reboot often took >3 min. After one reboot, almost all databases turned out to be corrupted. Since I also lost some data on it, and saw so many database errors on the forum, I decided to move the databases to the internal SSD drive (BTRFS in RAID). Besides, I had some other crappy drives lingering around and decided to combine them into one storage node using mergerfs with func.create=pfrd to distribute the IOPS. That makes this storage node a "testing" node, since I consider it the most unreliable one (multiple old drives, accumulating the chance of failure). If a modification doesn't knock out this node, it probably won't knock out the other nodes either. Funnily enough, this node is getting about the most ingress of them all now.
The second one already started with the DB on the data disk (an external SSD this time), which I formatted as XFS. It was running fine until one day, after a reboot, three databases turned out to be corrupted. All has been running fine since I moved the databases to the internal drive.
The third one was already on the internal drive. But one day a process on the host caused an OOM situation, in which the node process was killed. After that, the databases turned out to be corrupted (my other post, in which I overlooked the database error).
For sure, but as @arrogantrabbit already correctly postulates, some single failures will take ages to detect.
Great, that's something I could underline.
I really doubt whether you see them all. For example, I already had three occurrences and only reported one on the forum. People who don't care about their stats just throw away the databases and will have a fully running and functional node afterwards in most situations.
don't cross kernel boundaries (i.e. don't host the databases on the host when the node runs in a VM)
If you do want to back up these databases, you need to export the data from them and back up that data. But personally I don't see a point in saving them. They are purely cosmetic. I would love an option to avoid generating them; I don't really care about nice graphs in the dashboard.