Eleos
April 29, 2024, 5:48pm
1
A few days ago I noticed this starting to happen but didn't think much of it. Today I realized that the following is happening:
Data is not being retained on the node
The node shows as being very active (150GB+ ingress per day, and the disk constantly sounds busy, as if data is being written)
Stored data goes from, say, 5.97TB to 6.10TB, then randomly drops back down to 5.97TB with no additional trash (trash stays at 0.79TB)
Each time this happens, the node has restarted (uptime back to just a few minutes)
Suspension and audit scores are not affected (all 99.something%)
Is there something wrong with my node or drive?
Please let me know if any of this info is sensitive, but here are the errors I see in the log when running: docker logs storagenode | grep "ERROR"
2024-04-29T18:13:38Z ERROR piecestore upload failed {"Process": "storagenode", "Piece ID": "SDFOKDP4QXV342QFGKMMO5HJQOWSIXL6APHYXALNIKGDXPLZW4HA", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "PUT", "Remote Address": "216.66.40.82:8544", "Size": 62464, "error": "pieceexpirationdb: database is locked", "errorVerbose": "pieceexpirationdb: database is locked\n\tstorj.io/storj/storagenode/storagenodedb.(*pieceExpirationDB).SetExpiration:67\n\tstorj.io/storj/storagenode/pieces.(*Store).SetExpiration:608\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Upload.func6:487\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Upload:520\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func1:294\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:61\n\tstorj.io/common/experiment.(*Handler).HandleRPC:42\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:167\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:109\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:157\n\tstorj.io/drpc/drpcctx.(*Tracker).track:35"}
2024-04-29T18:13:22Z ERROR piecestore upload failed {"Process": "storagenode", "Piece ID": "CLGDBINULTNR3P5N2O6FNZHLV4LTS34VUUTOQY24F6HLANLLWSHA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "PUT", "Remote Address": "79.127.226.101:55460", "Size": 131072, "error": "manager closed: unexpected EOF", "errorVerbose": "manager closed: unexpected EOF\n\tgithub.com/jtolio/noiseconn.(*Conn).readMsg:225\n\tgithub.com/jtolio/noiseconn.(*Conn).Read:171\n\tstorj.io/drpc/drpcwire.(*Reader).read:68\n\tstorj.io/drpc/drpcwire.(*Reader).ReadPacketUsing:113\n\tstorj.io/drpc/drpcmanager.(*Manager).manageReader:229"}
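In case it helps narrow things down, a rough way to see which error message dominates is to group and count the error field from the log (same container name as in the command above):

# count how often each distinct error appears
docker logs storagenode 2>&1 | grep ERROR | grep -o '"error": "[^"]*"' | sort | uniq -c | sort -rn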
I am using an RPi 4 8GB, an external SSD via USB 3 for the OS + identity, an external USB 3 HDD for the node data, Ethernet, Raspberry Pi OS 64-bit Lite (CLI, headless), Docker, and No-IP. Screenshots are provided below.
bre
April 29, 2024, 7:14pm
2
Hi @Eleos ,
Any customer usage pattern is normal; we cannot predict future patterns based on prior usage.
However, it's been suggested that you look into database lockup problems.
1 Like
Eleos
April 29, 2024, 7:26pm
3
I know customer usage can change. I first thought it was just everyone backing up their data at the end of the month, but it seems I have some issues that I noticed around the same time this increase in usage started.
How exactly do I do that? I have no knowledge of database files, what they should look like, reading errors, etc.
bre
April 29, 2024, 7:41pm
4
These links cover the most common database issues encountered by SNOs.
If none of these are applicable, please let us know and we can look further:
https://support.storj.io/hc/en-us/sections/360004515252-Databases-Issues
I got it to run for the first one, piece_expiration.db, now.
I was struggling with the docker command, which I don't need at all with sqlite3 installed, and with all the /storage/ paths, which all have to be adjusted.
Hope it works soon.
Update:
Perfect, it worked for piece_expiration.db and for storage_usage.db, and the node is running fine now. The guide should still have more hints that the commands need to be adjusted and not just copied over.
This seems to have helped solve a similar issue for another user in the past: Database locked,node crash - #4 by zip
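For reference, the checks in those articles come down to running an SQLite integrity check against each database file while the node is stopped. A minimal sketch, assuming sqlite3 is installed on the host and the databases live under /mnt/storagenode/storage (a placeholder path, adjust to your own setup):

# stop the node first so the databases are not in use
docker stop -t 300 storagenode

# check one database; the expected output is "ok"
sqlite3 /mnt/storagenode/storage/piece_expiration.db "PRAGMA integrity_check;"

# start the node again afterwards
docker start storagenode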
1 Like
Eleos
April 29, 2024, 7:53pm
6
Thank you. I am currently looking into this so I will come back after I have more info on if it solved my issues.
2 Likes
Alexey
April 30, 2024, 4:27am
7
Seems it's time to move the databases either to an SSD or a flash card/stick:
How to move DB's to SSD on Docker
Before you begin, please make sure that your SSD has good endurance (MLC is preferred); I personally recommend using an SSD mirror.
Look into the official documentation and make sure that you are using the --mount type=bind parameter in your docker run string.
Prepare a folder on the mounted SSD outside of <storage-dir> from the official documentation (that is your folder with pieces).
Add a new mount option to your docker run string:
Now we have:
docker run -d --restar…
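A rough sketch of what that could look like, with placeholder paths; the /app/dbs destination and the storage2.database-dir setting are assumptions about what the linked guide uses, so double-check against the guide itself:

# hypothetical SSD folder for the databases, created outside of <storage-dir>
mkdir -p /mnt/ssd/storagenode-dbs

# extra option added to your existing docker run string
#   --mount type=bind,source=/mnt/ssd/storagenode-dbs,destination=/app/dbs

# and in config.yaml, point the node at the new database location
#   storage2.database-dir: dbs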
Please also check your logs for errors related to a retain process. Also, is lazy mode enabled? Did you disable the used-space-filewalker on start or not?
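For example, something along these lines will surface them (same container name as used earlier in the thread):

# show retain / filewalker related errors from the node log
docker logs storagenode 2>&1 | grep -i error | grep -Ei "retain|filewalker"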
Eleos
April 30, 2024, 4:44am
8
I was looking into that earlier and moved them over to the external USB 3 SSD that I have the OS on, which has a 300TBW rating. Things have been running well for about 8 hours now, and I will continue to monitor it over the next few days. I am guessing that the HDD could not keep up with the databases at a 160GB-per-day ingress rate but could at 80GB a day.
Running docker logs storagenode | grep "ERROR"
still shows a lot of the following error:
2024-04-29T21:27:40Z ERROR piecestore upload failed {"Process": "storagenode", "Piece ID": "SZLQBO6XHDTMTIJVJ7KGN6BP3Z2WR2HLDBGK3CX6OEXEM4345BEA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "PUT", "Remote Address": "79.127.226.98:52044", "Size": 196608, "error": "manager closed: unexpected EOF", "errorVerbose": "manager closed: unexpected EOF\n\tgithub.com/jtolio/noiseconn.(*Conn).readMsg:225\n\tgithub.com/jtolio/noiseconn.(*Conn).Read:171\n\tstorj.io/drpc/drpcwire.(*Reader).read:68\n\tstorj.io/drpc/drpcwire.(*Reader).ReadPacketUsing:113\n\tstorj.io/drpc/drpcmanager.(*Manager).manageReader:229"}
I am not sure about the lazy filewalker, but this is what I have for it in the config file. I am guessing it is disabled since it's commented out:
# run garbage collection and used-space calculation filewalkers as a separate subprocess with lower IO priority
# pieces.enable-lazy-filewalker: true
1 Like
Alexey
April 30, 2024, 4:55am
9
Yes, your node was too slow for this customer, so they canceled the upload.
Since it's commented out, that means it's unchanged. The default value is true, so lazy mode is enabled.
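If you ever want to set it explicitly rather than rely on the default, you would uncomment the line in config.yaml and restart the container, roughly like this:

# in config.yaml
pieces.enable-lazy-filewalker: true

# then restart so the change takes effect
docker restart -t 300 storagenode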
Eleos
April 30, 2024, 5:19am
10
For this error, is there anything I can do to make the node faster for this customer, so that some of the data does not get canceled?
Like I mentioned above, I am using an RPi 4 8GB, an external SSD via USB 3 for the OS + identity (and the databases now), an external USB 3 HDD for the node data, Ethernet, Raspberry Pi OS 64-bit Lite (CLI, headless), Docker, and No-IP.
I think not. You cannot win all the races and be close to every customer in the world.