I'm done... 4th time the database has randomly corrupted FTS. I'm out!

Which will have exactly the same problem with the operating system.

I would recommend fixing the problem in the operating system: How to Enable or Disable Disk Write Caching in Windows 10 - Make Tech Easier


There are options to test OS specific fsync…

pg_test_fsync is intended to give you a reasonable idea of what the fastest wal_sync_method is on your specific system, as well as supplying diagnostic information in the event of an identified I/O problem.

And tailor fsync parameters accordingly…

wal_sync_method ( enum )

Method used for forcing WAL updates out to disk.
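
For anyone who wants a rough feel for what this kind of test measures, here is a minimal sketch in Python. It is not pg_test_fsync and makes no attempt to match its methodology; the function name `time_fsync`, the block size, and the write count are all arbitrary choices for illustration. It simply times a series of small writes, each followed by `os.fsync`.

```python
import os
import tempfile
import time


def time_fsync(directory, writes=50, block=8192):
    """Return the average seconds per write+fsync of one block in `directory`."""
    payload = b"\0" * block
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        for _ in range(writes):
            os.write(fd, payload)
            os.fsync(fd)  # ask the OS to flush this file's data to the device
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
        os.remove(path)
    return elapsed / writes


if __name__ == "__main__":
    # Assumption: the temp directory lives on the drive you care about;
    # pass a directory on the storage drive instead if it does not.
    avg = time_fsync(tempfile.gettempdir())
    print(f"average write+fsync: {avg * 1000:.3f} ms")
```

A very low average on a spinning disk may be a hint that a flush is being acknowledged from a cache rather than from the platters, but a number like this only suggests that; it does not prove it.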

sqlite is a great choice for plenty of projects. However, a robust fault tolerant data storage system should probably have a system specific database which can be optimized for the host OS and its various quirks.

However, there is a basic problem with the SNO software and possible database corruption during unexpected power failures of the host OS. The sqlite documentation refers to the precise problem.
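
To check whether a database file has already been damaged, SQLite's built-in `PRAGMA integrity_check` can be run from Python's standard `sqlite3` module. This is only a sketch: the helper name `check_database` is made up for illustration, and it is safest to run it against a copy of the file while the node is stopped.

```python
import sqlite3
import sys


def check_database(path):
    """Run SQLite's integrity check; return a list of reported problems."""
    conn = sqlite3.connect(path)
    try:
        rows = conn.execute("PRAGMA integrity_check").fetchall()
    except sqlite3.DatabaseError as exc:
        # A badly damaged file can fail before the check even completes.
        return [str(exc)]
    finally:
        conn.close()
    messages = [row[0] for row in rows]
    # A healthy database reports a single row containing "ok".
    return [] if messages == ["ok"] else messages


if __name__ == "__main__":
    # With no argument this checks an empty in-memory database.
    path = sys.argv[1] if len(sys.argv) > 1 else ":memory:"
    problems = check_database(path)
    print("ok" if not problems else "\n".join(problems))
```

An empty result means SQLite found nothing wrong with the file at the time of the check; it says nothing about whether the next power failure will be survived.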

Which will find out that the operating system is very fast at writing something to disk. I don’t see how that will deal with the problem. On the next power outage the operating system will show the same behavior and drop data.

Again, I recommend fixing the issue in the operating system. That will allow sqlite3 to work just fine.

I haven’t looked at the SNO installation documentation recently… So, I’m unsure if this recommended method of attempting to make the implementation more fault tolerant is included in those instructions.

It’s also a bit unclear that any OS specific testing has been performed on the expected database corruption problem during power failures.

It is written directly in the postgres documentation: Corruption - PostgreSQL wiki

Hard disk drives with write-back cache enabled, and an unexpected power loss

That is exactly the same reason as with sqlite3. Switching to postgres will not fix it. Disabling the cache will fix it.


It’s not that db corruption is not possible with postgresql… it’s rather that db corruption can be less likely using a tailored postgres configuration which is not possible using sqlite.

However, arguing over db choice is hardly ever productive. Of course, Storj has decided on sqlite as the db, and any change is unlikely in the future due to sunk costs associated with development.

But… it would be a nice idea to ensure that all SNOs understand, by reading the documentation upfront on the installation pages, that database corruption is possible on various OSes and hardware combinations… and that a possible workaround is fiddling with the disk write caching parameters.

As in our last conversation, I recommend you open a pull request for postgres support. Feel free to prove your statements. Right now I have the information from the postgres wiki, and it tells me that under the conditions in which sqlite3 is getting corrupted right now, postgres will get corrupted as well.

And I’m pointing out that this very thread is proof that SNOs do not understand what this means, nor how to fix it.

The installation documentation should be written such that an SNO clearly understands that database corruption is possible due to how sqlite writes to the drive in relation to how the underlying host OS writes to the drive… and that, if seemingly random database corruption is occurring, the SNO’s underlying host OS and hardware can possibly be reconfigured in an attempt to work around the problem.

I have the capability to write it myself… but my time is rather limited.

My discussion here is intended to get a reasonable technical response rather than an “it’s user error” dead-end thread. In this case, it’s very likely not “user error”… the problem is due to a very low-level OS/hardware issue which is not easily discoverable for the general user and is also not well documented in the SNO installation guide as a potential problem.


I ran a full chkdsk /r and found no issues; SMART is all green.

Ok, I had that option checked; now it’s off. Despite my pessimistic, venting post, I might still have a few retries left in me. Maybe unchecking that caching option should be mentioned in the setup guide as a “best practice”.

Now just to wait for a new registration code.

BTW beast and littleskunk, mad respects for your insight.

As a storj employee I will leave that decision to @Alexey @Dylan @Knowledge

As a Windows user myself, I have to say that we are talking about basics. I have been disabling my hard drive cache since Windows XP because it hit me once and I learned from my mistake. I made a lot of other mistakes during that time. That is how it works. I would get disqualified one way or the other anyway. So what is the point of adding all possible mistakes to our documentation? In the end, the documentation will be confusing for most users because a small group of people doesn’t know the basics? For that reason I vote no. Let them get disqualified and learn from their mistakes like we all did and still do.

Fun fact: The biggest risk for me is port forwarding. Sometimes I reconfigure my virtual network adapter and always forget to double-check the port forwarding in my router. It might point to the old virtual network adapter that doesn’t exist anymore. That is my personal problem, and I don’t expect the documentation to tell me how to deal with it.

I think disabling write cache should go into the installation documentation for Windows.
Judging from many threads in here, most people are not familiar with anything beyond knowing how to use a PC, and I wouldn’t rely on people knowing the “basics” (a term that means something different to everyone).


Database corruption is a common enough problem:

Storj Forum Search Results

…that it should be addressed in the base documentation. The Storj network benefits from having a continuously growing value for available network drive space… even if that drive space is unreliable. So, addressing a few common problems within the documentation would likely result in many of the less technical SNOs remaining onboard rather than writing about quitting the network on the forums.

Hopefully the OP’s problem is now solved for the near future… but rather than the “sink-or-swim” approach, it might be better to provide a thorough “Best Practices and Common Troubleshooting” section in the documentation.


You shouldn’t expect our Windows users, especially, to know about any of this.
I use write cache on Windows myself without any problems and didn’t immediately think of it as a possible source of errors (especially as the OP didn’t mention Windows as the OS). No typical Windows user will know about those “basics”, or think about them even if they do know.


Port forwarding is also a common problem. That doesn’t necessarily mean we have to explain it to everyone, including the thousands of storage node operators that didn’t have that problem.

Well, it’s your company, but the typical company explains everything so that even the biggest idiots know what to do, because the company profits from that and avoids frustration and support time.
If Storj likes to handle it differently, that is certainly your choice.

Do you understand the downsides? Worst case, we end up with a guide that will make users like me freak out and stop because it contains way too much useless information. I don’t want Storj or any other company to have a copy of my router documentation. To mitigate the risk with my virtual network adapter, that documentation would have to explain much more than the original router documentation does.
Losing a few “big idiots” (keeping your wording here) doesn’t affect our profit. So the question here is how many operators would benefit and how many operators would get confused.


Yes, it certainly is your choice which group of participants you put your focus on. By providing a Windows installer you certainly opened the door to more users with little technical knowledge. Although even before that there were enough users who didn’t even manage to install Docker on Linux but still badly wanted to run a storagenode (because of profit, I guess…).

However, explaining how to disable write cache on Windows will hardly make the documentation as big as any chapter of your router’s documentation.

May I remind those of you that insist we should put all the possible questions people may have in the basic installation documentation, that we do have a Knowledge Base which you can search for answers to those questions. It can be found at https://support.storj.io
If there are specific questions that are frequently asked in support tickets, we will add them to our knowledge base articles. So it is not like the info is not available at all, you just need to remember to look for it in the right places (of course, right here on the forum you can also find answers to many of these FAQs)

We may be able to put some tips in the guide and link to the FAQ. I don’t think we want the initial experience to be one where the user thinks they set everything up properly and then finds that something Windows has on by default is going to ruin their day. And we can’t expect them to read through every FAQ entry to know what to expect either. But Skunk’s point is also taken that we don’t need to have a single guide filled with a lot of tips, because it would get unwieldy. We’ll try to strike a balance that works for everyone.
