I'm done... 4th time the database has randomly corrupted FTS. I'm out!

kajar9 · October 28, 2019, 7:10pm

My “bandwidth” database is corrupt… yay… file is not a database. GREAT
Utter bullcrap.

I’m out!

Edit: This was written in anger when my setup failed out of the blue. Mostly because of the responses here, and people like beast and littleskunk, I still seem to have a few more tries in me.

What was pointed out to change on my next try is this:

Sorry for the initial charged post, but in the end I’m happy to have posted it, as people like these truly inspire me to keep at it.

nerdatwork · October 28, 2019, 7:15pm

Quitting is easy. Finding the problem and applying the solution takes time. Are you using the same disk and operating system as in the last 3 attempts?

kajar9 · October 28, 2019, 7:28pm

Like I said, it’s the 4th time this has happened. I did all the steps, I spent a long time with support trying to figure stuff out… all times… oh well, nothing more to do than request a new auth token and start over.

It’s about some definition of insanity of doing something over and over while expecting different results.

I am aware of the sqlite3 fix, dump, and recreate steps. None of them have worked.
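For readers who have not seen the check and dump/recreate steps mentioned above, here is a minimal sketch using Python's standard `sqlite3` module. It is not the official support procedure; the function names and file paths are made up for illustration, and on a file that SQLite reports as "not a database" the dump step is expected to fail as well.

```python
import sqlite3


def check_integrity(path):
    """Return the rows of PRAGMA integrity_check, or the error text on failure."""
    conn = sqlite3.connect(path)
    try:
        rows = conn.execute("PRAGMA integrity_check").fetchall()
    except sqlite3.DatabaseError as exc:
        # e.g. "file is not a database" when the header is unreadable
        return [str(exc)]
    finally:
        conn.close()
    return [row[0] for row in rows]


def dump_and_recreate(src_path, dst_path):
    """Replay a SQL dump of src_path into a new database at dst_path.

    dst_path is assumed not to exist yet (hypothetical path chosen by the caller).
    """
    src = sqlite3.connect(src_path)
    dst = sqlite3.connect(dst_path)
    try:
        # iterdump() yields the SQL statements needed to rebuild the database
        dst.executescript("\n".join(src.iterdump()))
    finally:
        src.close()
        dst.close()
```

A healthy database returns a single `ok` row from `check_integrity`; anything else suggests the file needs a closer look before the node is restarted. Run this only while the node is stopped and on a copy of the file.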

littleskunk · October 28, 2019, 7:30pm

All of my databases have survived 1 power outage and at least 5 hard resets (out of memory and not responding to restart attempts). I never had a single database corruption even under these bad circumstances. I am not sure how you managed to corrupt your database, but it can’t be the storage node itself, because then it would have hit me as well.

peem · October 28, 2019, 7:33pm

classic admin answer “works for me”

littleskunk · October 28, 2019, 8:08pm

Did you have 1 power outage and 5 hard resets? I am asking a serious question. My storage node should be the one that gets corrupted, but it is not!

kevink · October 28, 2019, 9:33pm

What did you do during the other 3 db corruptions? Wait until it happens again or did you try something to fix the problem?

I also had some hard resets, HDD disconnects (because I use 2 external HDDs), killing the nodes, etc. Still working fine.
Doesn’t mean it can’t happen, but 4 times? There has got to be more to it than just “randomness”.

Alexey · October 28, 2019, 10:10pm

@kajar9, please tell us: how is your HDD connected to the host running the storagenode?
What is your host OS?
What is the filesystem on your HDD?

JoshGarza · October 28, 2019, 10:10pm

“If an application crash, or an operating-system crash, or even a power failure occurs in the middle of a transaction, the partially written transaction should be automatically rolled back the next time the database file is accessed. The recovery process is fully automatic and does not require any action on the part of the user or the application.”
Extracted from https://www.sqlite.org/howtocorrupt.html

Power outages and hard resets are not the worst things.
More likely it is memory or hard disk sync issues related to the hardware of the storagenode.
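The rollback behaviour described in the quoted SQLite passage can be illustrated with a small Python sketch. This is a simplification: closing the connection without committing stands in for a crash, whereas real crash recovery relies on SQLite's journal files. The table and satellite names are invented for the example.

```python
import os
import sqlite3
import tempfile


def demo_rollback():
    """Show that an uncommitted write is gone when the database is reopened."""
    path = os.path.join(tempfile.mkdtemp(), "demo.db")

    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE usage (satellite TEXT, bytes INTEGER)")
    conn.execute("INSERT INTO usage VALUES ('sat-a', 100)")
    conn.commit()  # this row is durable

    # Start a second write but never commit it (stand-in for a crash).
    conn.execute("INSERT INTO usage VALUES ('sat-b', 200)")
    conn.close()  # the pending transaction is discarded

    conn = sqlite3.connect(path)
    rows = conn.execute("SELECT satellite, bytes FROM usage").fetchall()
    conn.close()
    return rows


print(demo_rollback())
```

Only the committed row survives, which is the behaviour the quote promises as long as the disk and operating system honour their side of the contract.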

kajar9 · October 28, 2019, 10:29pm

Regular SATA3 HDD, few months old, fully checked for bad blocks, single disk, single node.
Windows 10 Enterprise, NTFS

kajar9 · October 28, 2019, 10:32pm

It started completely randomly at ~140+ h uptime. No discernible change.
Then it gradually got worse, then it stopped working at all. Periodic errors, but only from 18UWP

Most uploads / downloads completed correctly.

Although I use my computer regularly, this HDD is used only by STORJ. I have 32GB of RAM and typically use only about 10-15GB.

kajar9 · October 28, 2019, 10:35pm

I have had 1 crash that happened more than 3 weeks ago. Otherwise fine. UPS protected.

In my opinion these should be ideal conditions, yet this has happened over and over.

kevink · October 29, 2019, 6:25am

Periodic errors, but only from 18UWP

What does that mean? Errors about the hard drive or database are not satellite-dependent… What kind of errors?

JoshGarza · October 29, 2019, 6:25am

Storage Node Recovery mode is advisable.

kajar9 · October 29, 2019, 7:31am

I might have missed them on others, but it looked like this satellite’s messages caused the most issues,
the ones that tried to access the bandwidth.db file.

kajar9 · October 29, 2019, 7:32am

What commands do I need to use to enter that recovery mode since it certainly did not automatically do that for me.

kevink · October 29, 2019, 7:55am

Do you have error messages you can present? Otherwise it’s just an empty claim nobody will be able to help you with.
What made you think that you have periodic errors?

it looked like this satellite’s messages caused the most issues,
the ones that tried to access the bandwidth.db file.

Every incoming and outgoing message of your node accesses the bandwidth.db file, so if the satellite 18UWP was the most active, then of course this one caused the most issues.

JoshGarza · October 29, 2019, 3:16pm

There is no recovery mode so far. I suggested it would be a nice feature to avoid losing Storage Node data because of a db corruption.

anon27637763 · October 29, 2019, 5:06pm

This may be the underlying problem:

SQLite Atomic Commit Documentation

For this reason, SQLite does a “flush” or “fsync” operation at key points. SQLite assumes that the flush or fsync will not return until all pending write operations for the file that is being flushed have completed. We are told that the flush and fsync primitives are broken on some versions of Windows and Linux. This is unfortunate. It opens SQLite up to the possibility of database corruption following a power loss in the middle of a commit. However, there is nothing that SQLite can do to test for or remedy the situation. SQLite assumes that the operating system that it is running on works as advertised. If that is not quite the case, well then hopefully you will not lose power too often.
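To make the “flush” and “fsync” step in the quoted passage concrete, here is a minimal Python sketch of a write that asks the operating system to push data to the device before returning. The helper name `durable_write` is hypothetical and this is not SQLite's own code; it only shows the primitive SQLite depends on. If the OS or drive acknowledges the sync without actually persisting the data, nothing at this level can detect it, which is the point the documentation makes.

```python
import os


def durable_write(path, data):
    """Write bytes to path and request that they reach stable storage."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # move Python's buffered data into the OS
        os.fsync(f.fileno())  # ask the OS to flush its cache to the device
```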

A solution may be to rewrite the SNO software to be database-agnostic and to suggest by default that the node operator use a more robust database such as postgresql.

I made this suggestion a few weeks ago and have been looking at the problem myself on and off since then. There are various libraries that could be used to connect and utilize postgresql databases rather than the current hardwired sqlite.

It was brought up that use of postgresql would require a network connection. However, this is not the case on GNU/Linux hosts. A connection can be accomplished using Unix Domain sockets. Here’s a sample implementation in Go. Unix Domain sockets do not utilize networking. The connection is made through the filesystem and so would not have any networking overhead. This is the default connection method for postgresql local users on GNU/Linux systems and it is very fast.
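As a rough illustration of the point about Unix Domain sockets, the following Python sketch sends a message between a server and a client through a socket file on the filesystem, with no TCP/IP involved. It is a generic echo example using only the standard library, not a PostgreSQL client, and it assumes a Unix-like host; the socket path and function names are made up for the demonstration.

```python
import os
import socket
import tempfile
import threading


def echo_once(server):
    """Accept one connection and send back whatever was received."""
    conn, _ = server.accept()
    with conn:
        conn.sendall(conn.recv(1024))


def roundtrip(message):
    """Send bytes through a Unix Domain socket and return the echoed reply."""
    sock_path = os.path.join(tempfile.mkdtemp(), "demo.sock")

    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    server.bind(sock_path)  # creates a socket file, not a network port
    server.listen(1)
    worker = threading.Thread(target=echo_once, args=(server,))
    worker.start()

    client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    client.connect(sock_path)  # addressed by filesystem path
    client.sendall(message)
    reply = client.recv(1024)
    client.close()

    worker.join()
    server.close()
    os.unlink(sock_path)
    return reply
```

A database client library that supports socket-file connections would connect in the same way, by path rather than by host and port.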

I will continue to look at how to put it all together for a more robust DB connection… but, I’m just another SNO and have other responsibilities in the real world.

foximulder · October 29, 2019, 5:14pm

I also have problems with one drive. Try CrystalDiskInfo; it’s free and you can read all the SMART parameters. Perhaps it’s the drive and not the db.