My storage node has numerous issues

Hi,

Today I received my monthly node payout and it was really low.
My fault: I haven't checked the health status of my node in the last several months.
I have only checked that it was getting updates and that it stayed online.

No big problem, but I am here to ask for your help in fixing any possible issues with my node.

I searched the forum for the issues I report here, but I haven't fully understood the root causes and solutions, so please be patient with me; if a problem has already been reported and solved, just point me in the right direction.

List of Issues I see…

Logged ERRORs:

# tail -n 1000 node.log | grep ERROR | sed 's/^.*ERROR\t\(.*\)\t{.*$/\1/g' | sort | uniq -c
     10 piecestore	download failed
      4 piecestore	failed to add bandwidth usage
      2 piecestore	failed to add order

These errors are logged continuously:
1 - “download failed”
I can see two main types of messages related to this problem:

  • “write tcp X.X.X.X:28967->Y.Y.Y.Y:P: use of closed network connection”
  • “tls: use of closed connection”

Is this a common issue? Is it a known problem with a known fix?
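
(For reference, this is roughly how I grouped those failure reasons — just a sketch, assuming the failure reason always sits in the "error" field of the JSON part of each log line:)

# group the "download failed" lines by the reported error reason
grep "download failed" node.log | grep -oE '"error": "[^"]+"' | sort | uniq -c | sort -rn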

2 - “failed to add bandwidth usage”

This is the full reported message:

2020-04-11T17:45:37.304Z	ERROR	piecestore	failed to add bandwidth usage	{"error": "bandwidthdb error: database is locked", "errorVerbose": "bandwidthdb error: database is locked\n\tstorj.io/storj/storagenode/storagenodedb.(*bandwidthDB).Add:59\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).saveOrder:721\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).doUpload:443\n\tstorj.io/storj/storagenode/piecestore.(*drpcEndpoint).Upload:215\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func1:987\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:107\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:105\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:56\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:93\n\tstorj.io/drpc/drpcctx.(*Tracker).track:51"}

How bad is it? Does it impact the service in any way? Does it impact the bandwidth payout? I read that it is probably related to disk performance and the locking mechanism of SQLite… could you suggest any fix or mitigation?
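
In case it helps the diagnosis, here is how I would check whether the two affected databases are at least healthy. This is only a sketch: it assumes sqlite3 is installed on the host, that the node is stopped first, and that the databases live under /mnt/usb_disk/storage/ (my mount point, see below).

# stop the node so nothing writes to the databases during the check
docker stop -t 300 storagenode
# integrity check on the two databases that log "database is locked"
sqlite3 /mnt/usb_disk/storage/bandwidth.db "PRAGMA integrity_check;"
sqlite3 /mnt/usb_disk/storage/orders.db "PRAGMA integrity_check;"
# start the node again
docker start storagenode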

3 - “failed to add order”
Very similar to the previous one:

2020-04-11T17:45:54.392Z	ERROR	piecestore	failed to add order	{"error": "ordersdb error: database is locked", "errorVerbose": "ordersdb error: database is locked\n\tstorj.io/storj/storagenode/storagenodedb.(*ordersDB).Enqueue:53\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).saveOrder:714\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).doUpload:443\n\tstorj.io/storj/storagenode/piecestore.(*drpcEndpoint).Upload:215\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func1:987\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:107\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:105\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:56\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:93\n\tstorj.io/drpc/drpcctx.(*Tracker).track:51"}

Same questions here: what is its impact? Do you know of a fix or resolution?

Dashboard reported Usage:
Yesterday my dashboard reported 51.44 TB*h of disk space used. How do I convert that into the actual capacity in use? A rough computation could be 51.44 / 60 = 0.86 TB.
As an order of magnitude this also looks similar to the “780GBm” reported as “Disk Average Month” on the new payout information page.
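
In case my arithmetic above is off, the generic conversion I am assuming (I could not find it documented) is that the TB*h value is an integral of stored space over time, so dividing it by the number of hours the figure covers gives the average TB stored in that window. For example, if the 51.44 TB*h covered a single day:

# average stored space, assuming the figure covers 24 hours
echo "scale=2; 51.44 / 24" | bc
2.14

So part of my question is the right divisor: 24 for a daily figure, or the hours elapsed so far in the month for a month-to-date figure.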

The “/app/dashboard.sh” script reports a different usage:

              Available      Used
     Disk     1.2 TB         1.8 TB

My system reports yet different numbers (I set STORAGE=“3TB” for my storage node):

Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb1       3.7T  2.9T  817G  78% /mnt/usb_disk

I really hope these issues are fixable somehow; otherwise I would rather re-init my node than waste another month with just 123 STORJ as payout (~$11).

Thanks In Advance,
d4lamar

I don’t see any real issues with your node. You only get paid for egress traffic and for the amount of data stored on your hard drive, so you don't get paid for data being uploaded to your node, only for storing it.
Also, how old is your node? Remember that for a new node most of your payments are withheld; when you are about 7 months into running your node you will see much higher payments.

This happens when you start and stop your node, or when you remove your container and start a new one; it fixes itself after it has been running for a while.
Also have you been making sure your node is up to date?

I don’t see any real issues with your node

Good, do they all look as known common bug with no fix ?

You only get paid for egress traffic and for the amount of data stored on your hard drive, so you don't get paid for data being uploaded to your node, only for storing it.

OK, but I am not sure I get paid for the correct amount of data, and all the errors I reported make the accounting quite hard for me to follow, so I would like to fix them.

This happens when you start and stop your node, or when you remove your container and start a new one; it fixes itself after it has been running for a while.

That cannot be it: the errors are logged continuously, not just after a node restart, and they never go away. How long after a restart should the node be considered healthy? One hour, one day, six months?

Also have you been making sure your node is up to date?

v1.1.1, which should be the latest one.

Cheers,
d4lamar

How is your storage connected to your PC?

How is your storage connected to your PC?

USB. It could be slow, I know.
# lsusb -s 001:004 -vvv 2>/dev/null | grep bcdUSB
bcdUSB 2.10
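
For completeness, the link speed the disk actually negotiated (rather than the USB version the controller supports) should be visible in the tree view; 480M would mean USB 2.0 high speed, 5000M or more a USB 3.x link:

# show the bus topology with the negotiated speed of each device
lsusb -t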

Then there is nothing you can do. The “database is locked” error still hits some setups because of the speed of processing.

Then there is nothing you can do. The “database is locked” error still hits some setups because of the speed of processing.

So USB storage is not supported or recommended for Storj storage nodes.
What are the current recommendations?

Cheers,
d4lamar

The recommendation is always a drive directly connected to the system, but if you are running a system where you can only use USB drives, then there is nothing you can do.

The recommendation is always a drive directly connected to the system, but if you are running a system where you can only use USB drives, then there is nothing you can do.

Just a bit of a recap here:

ERRORS

  • “failed to add order”
  • “failed to add bandwidth usage”

can be solved by a hardware upgrade, going from USB 2.0 to something a bit faster.
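
Before buying anything I will try to measure how slow the current disk really is for the small synchronous writes SQLite relies on. A rough sketch (the test file path is just an example and is removed afterwards):

# write 1000 blocks of 4 KiB, syncing each write, to estimate sync-write throughput
dd if=/dev/zero of=/mnt/usb_disk/ddtest.bin bs=4k count=1000 oflag=dsync
rm /mnt/usb_disk/ddtest.bin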

What about the other two problems I reported?

  • “download failed”
  • dashboard reported disk space usage

Cheers,
d4lamar

I think the most pressing issue is the “disk space” one.

Tonight I got another kind of error, reporting “out of space”:

2020-04-12T00:13:56.839Z	ERROR	piecestore	upload failed	{"Piece ID": "VBN34TEMQKGVCEQLBDC3VQQZF2FPNGJIKFPY4IJUGJRGXHOTTCWA", "Satellite ID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "Action": "PUT_REPAIR", "error": "out of space", "errorVerbose": "out of space\n\tstorj.io/common/rpc/rpcstatus.Error:95\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).doUpload:394\n\tstorj.io/storj/storagenode/piecestore.(*drpcEndpoint).Upload:215\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func1:987\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:107\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:105\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:56\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:93\n\tstorj.io/drpc/drpcctx.(*Tracker).track:51"}

This morning the dashboard reports “5.8 GB” as Disk Space Remaining (yesterday it was 1.25 TB).

On the system, the numbers do not seem to have changed at all:

Filesystem                    Size  Used Avail Use% Mounted on
/dev/sdb1                     3.7T  2.9T  815G  79% /mnt/usb_disk

I also suspect that this warning at startup could be quite relevant:

2020-04-12T07:49:32.466Z	WARN	piecestore:monitor	Used more space than allocated. Allocating space	{"bytes": 3000000000000}

Cheers,
d4lamar

…In your allocation. How much did you allocate?
The garbage collector moves deleted pieces to the trash folder and will delete them after 7 days. This happens with delete operations submitted by customers while your node was either offline or not fast enough to delete the pieces at the time of the request.
These pieces are included in the occupied space on the dashboard and in the space monitor.

…In your allocation. How much did you allocate?

That is my question too… I do not know :slightly_smiling_face:
As you can see from my previous statements, I do not know which values I should trust.

The garbage collector moves deleted pieces to the trash folder and will delete them after 7 days.

Is it safe to delete the content of the trash folder manually?

Nobody apart from you can know the answer. You allocate the space in your docker run command. Try ps aux|grep storage.allocated-disk-space
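
If ps does not show the flag, the environment the container was started with should also reveal it (assuming the container is named storagenode):

# the allocation is passed to the container as the STORAGE environment variable
docker inspect storagenode | grep STORAGE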

You should never delete anything manually, unless told to by Storj support. The trash folder is there for a reason and will empty after 7 days.

Nobody apart from you can know the answer. You allocate the space in your docker run command. Try ps aux|grep storage.allocated-disk-space

Ok, understood. I allocated 3 TB.

# df -Pm /dev/sda1
Filesystem     1048576-blocks    Used Available Capacity Mounted on
/dev/sda1             3800108 2963950    835631      79% /mnt/usb_disk

# echo "3000000 - 2963950" | bc
36050

So roughly I could expect 36 GB free.
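
Though, thinking about it, df -Pm counts 1 MiB blocks while STORAGE=“3TB” is in decimal units, so the subtraction above is only approximate. Redoing it in consistent units (my own arithmetic, so take it with a grain of salt):

# 3 TB (decimal) expressed in the 1 MiB blocks that df -Pm reports
echo "3 * 10^12 / 1048576" | bc
2861022
# compared with the 2963950 MiB reported as used, the partition already holds more than the allocation
echo "2861022 - 2963950" | bc
-102928

If that is right, the disk already holds about 100 GiB more than the allocation, which would be consistent with the “Used more space than allocated” warning at startup.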

You should never delete anything manually, unless told to by Storj support. The trash folder is there for a reason and will empty after 7 days.

Thanks.

Just for info:

# du -sh trash
101G    trash

Looks like a consequence of “database is locked”: because of it, the storagenode is unable to update the database with the actual numbers and shows imprecise figures on the dashboard.
A few graceful restarts should resolve it:

docker restart -t 300 storagenode

I would also suggest updating your docker run command with the latest changes to the parameters: https://documentation.storj.io/setup/cli/storage-node#running-the-storage-node
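
For reference, the documented command currently looks roughly like this; the wallet, e-mail, address, image tag, and paths below are placeholders, so please copy the exact command from the page above:

docker run -d --restart unless-stopped --stop-timeout 300 \
    -p 28967:28967 \
    -p 127.0.0.1:14002:14002 \
    -e WALLET="0x..." \
    -e EMAIL="you@example.com" \
    -e ADDRESS="external.address.example.com:28967" \
    -e STORAGE="3TB" \
    --mount type=bind,source="<identity-dir>",destination=/app/identity \
    --mount type=bind,source="/mnt/usb_disk",destination=/app/config \
    --name storagenode storjlabs/storagenode:latest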

I would also suggest updating your docker run command with the latest changes to the parameters: Storage Node - Storj Docs

Thank you, I just did it.

I think that now, besides the node being full, the situation is quite aligned with reality.
Yesterday, before restarting my node, a filesystem check found a few errors.
The correction of those filesystem errors probably caused the sudden drop in available disk space.

A 30 GB difference between the actual free space and the reported one is not a big issue, I suppose.
I will avoid touching the node for the next few days and see how the situation develops.
In the meantime I will think about how to upgrade my node hardware.

Cheers,
d4lamar