Ordersdb error: database is locked

What do you propose we do to solve this? If we skipped the lock, nodes could potentially not get paid for a few transfers. Please propose an alternative solution on the GitHub issue.

Thanks for your reply, @heunland!

I proposed two solutions on GitHub. The first is easy to implement and saves time for the dev team; the second is better for SNOs planning long-term cooperation (when used space grows beyond 10 TB), but it would cost the dev team a lot more time.

Thanks for paying attention to this issue!

2 Likes

Thanks for helping us find potential issues. We are already following up on GitHub. Despite how it might appear sometimes to users here on the forum, our team is paying attention to all important matters that are brought up here, and issues in the backlog will be added to sprints according to their urgency/impact on the system functioning properly. We appreciate all user input and thank you for your patience.

3 Likes

It's the same kind of usage as when you run a script to see how much your node has made, or when you read through a long list of logs over SSH.

Damn, I have the same issue after 0.27.1.
The node was updated properly, ran for about 20 hours, and then it just happened for no apparent reason.

+1 on “ordersdb error: database is locked” :smiley:

This 0.27.1 is a “killer” release! :smiley:

I haven't had any problems. I'm running 3 nodes and 1 on Windows, none of which have had any kind of issue.

Well, I have a 500 MB log of errors since the upgrade a couple of days ago… I had to redeploy the container today :frowning:

I’ve had no problems since the update:

$ docker ps

...IMAGE                      ...  CREATED             STATUS           
...storjlabs/storagenode:beta ...  43 hours ago        Up 43 hours  

My largest sqlite database is orders.db and that’s just under 200MB.
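In case anyone wants to check their own node, this is roughly how I looked at the sizes; the path is only an example (the databases sit in the storage subfolder of the node's data directory), so adjust it to your own setup:

# list the sqlite databases, largest first (path is an example, not a universal default)
$ ls -lhS /mnt/storagenode/storage/*.db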

How much data is on your node?

About 3.5TB right now

Total used space in the partition:

  • 1.3 TB as reported by Dashboard
  • 1.3 TB as reported by OS

I have the same error (failed to add order {“error”: “ordersdb error: database is locked”}).

The node has been running on a Raspberry Pi 4 (4 GB) since July, with 1.4 TB used of 3 TB.

Under low system load I only see some failed uploads (“error”: “piecestore protocol: context canceled”), but when the load increases I get a lot of database locked errors.
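A quick way to gauge how often it happens is to count the occurrences in the container logs; the container name here is just an assumption, so use whatever yours is called:

# count "database is locked" errors in the docker logs (container name is an assumption)
$ docker logs storagenode 2>&1 | grep -c "database is locked"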

The database access seems to be consuming a lot of CPU resources. While my node is running, the largest sqlite file has seven concurrent file descriptors:

# lsof orders.db
COMMAND    PID USER   FD   TYPE DEVICE  SIZE/OFF      NODE NAME
storageno 7230 root   27ur  REG    8,1 209620992 213939424 orders.db
storageno 7230 root   29ur  REG    8,1 209620992 213939424 orders.db
storageno 7230 root   30ur  REG    8,1 209620992 213939424 orders.db
storageno 7230 root   89ur  REG    8,1 209620992 213939424 orders.db
storageno 7230 root   90ur  REG    8,1 209620992 213939424 orders.db
storageno 7230 root  100ur  REG    8,1 209620992 213939424 orders.db
storageno 7230 root  108ur  REG    8,1 209620992 213939424 orders.db

On my node with 1.3 TB, that’s a 200 MB file opened 7 times by the same process. This behavior is going to cause significant performance issues on resource-limited machines once a node’s stored data grows past 1 or 2 TB or so…

EDIT: It looks like orders.db could be split into two separate files, since the tables don’t seem to overlap:

$ sqlite3 orders.db ".fullschema"
CREATE TABLE unsent_order (
    satellite_id  BLOB NOT NULL,
    serial_number BLOB NOT NULL,

    order_limit_serialized BLOB      NOT NULL, -- serialized pb.OrderLimit
    order_serialized       BLOB      NOT NULL, -- serialized pb.Order
    order_limit_expiration TIMESTAMP NOT NULL, -- when is the deadline for sending it

    uplink_cert_id INTEGER NOT NULL,

    FOREIGN KEY(uplink_cert_id) REFERENCES certificate(cert_id)
);
CREATE TABLE order_archive_ (
    satellite_id  BLOB NOT NULL,
    serial_number BLOB NOT NULL,

    order_limit_serialized BLOB NOT NULL,
    order_serialized       BLOB NOT NULL,

    uplink_cert_id INTEGER NOT NULL,

    status      INTEGER   NOT NULL,
    archived_at TIMESTAMP NOT NULL,

    FOREIGN KEY(uplink_cert_id) REFERENCES certificate(cert_id)
);
CREATE UNIQUE INDEX idx_orders ON unsent_order(satellite_id, serial_number);
/* No STAT tables available */
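Purely to illustrate the point (not something to run against a live node, and the storagenode itself would need code changes to actually use a second file): since neither table references the other, order_archive_ could in principle be moved into its own database with ATTACH. A rough sketch, with a hypothetical file name, assuming the node is stopped and orders.db is backed up first:

# purely illustrative: split order_archive_ out into its own database file
# (file name is hypothetical; CREATE TABLE ... AS SELECT does not copy constraints or indexes)
$ sqlite3 orders.db <<'EOF'
ATTACH DATABASE 'orders_archive.db' AS archive;
CREATE TABLE archive.order_archive_ AS SELECT * FROM order_archive_;
DROP TABLE main.order_archive_;
VACUUM;
EOF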

1 Like

I have 6 nodes. Only 2 of them experience this problem, and those are the ones storing more than 3 TB.
On the smaller nodes, such problems have not been observed yet.

I’m not quite sure how to read your post…

  • 6 nodes total.
  • 2 large nodes with >3TB of stored data have problems
  • nodes with less data stored have no problems

is that correct?

What are the specs and OS of the 6 nodes?

3 IPs × 100 Mbit/s

Server 1:

  • OS: Windows Server 2016 with Hyper-V
  • 3 VMs running Debian 10
  • 2 Adaptec disk controllers
  • 2 RAID 5 arrays
  • 1 Intel SSDPEDME016T4S (SSD)
  • 16 HDD × 16 TB
  • 2 × CPU E5-2650 v3
  • 128 GB DDR4

node sizes:

          audit   uptime   used
  node04  100     99.99    65.8 GiB
  node02  100     99.01    3.41 TiB (database is locked)
  node05  100     99.99    64.9 GiB
  node06  100     100      36.2 GiB
  node03  100     97.85    285 GiB

The server is loaded at 1-2%.

Server 2:

  • OS: Debian 10, Intel Core i7-3820 CPU @ 3.60GHz
  • 2 SSD in RAID 0
  • 1 HDD 16 TB
  • 16 GB DDR3

          audit   uptime   used
  node01  100     99.93    2.18 TiB (database is locked)

The server is loaded at 1-2%.

1 Like

Colleagues, I updated the GitHub issue with my investigation. If you decide to repeat my steps, please back up your databases, with the storage node stopped, before doing anything.

1 Like

Good investigative work!

On GNU/Linux systems, it’s very simple to add a cron job that runs the sqlite3 vacuum procedure once a week or so… However, as the databases get larger, the vacuum process will take longer. How long did the vacuum process take?
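For scheduling, a crontab entry along these lines would do it; the path and schedule are only examples, and note that VACUUM needs exclusive access to the file, so it should run while the node is stopped or it may just fail with the same "database is locked" error:

# weekly vacuum, Sundays at 03:00 (path and schedule are examples only;
# stop the storagenode first, since VACUUM needs exclusive access to the file)
0 3 * * 0  sqlite3 /mnt/storagenode/storage/orders.db "VACUUM;"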

EDIT:

I ran the sqlite vacuum procedure on a copy of my current orders.db
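Roughly what I did, in case anyone wants to repeat the test (the copy location is just where I happened to put it):

# work on a copy so the running node is not affected
$ cp /mnt/storagenode/storage/orders.db /tmp/orders-copy.db
$ ls -l /tmp/orders-copy.db                     # size before
$ time sqlite3 /tmp/orders-copy.db "VACUUM;"    # how long it takes
$ ls -l /tmp/orders-copy.db                     # size after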

Here are the results:

Size before    215003136 bytes
Size after     195088384 bytes

Time (real)    0m6.351s

Size percent difference

(215003136-195088384)/((215003136+195088384)/2)
-----------------------------------------------
                9.7 %

CPU (x2): Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz

1 Like

Thanks, you already did a test vacuum (I did the same) to determine how long the procedure takes. The main goal is to defragment the file, both on the filesystem and internally.