Node keeps crashing after 48h+

Hi, I have a problem: my node keeps crashing. I set it to restart automatically in Docker, but it lost part of the data it was receiving.
Which files do you need to see to form an opinion on where the failure might be?
System:
OS: Unraid + Docker
Parity: TOSHIBA_MG09ACA18TE
Disk 1: WDC_WD120EMFZ
Disk 2: WDC_WD80EMAZ
Motherboard: ASUS ROG STRIX B450-F
CPU: 5700G
RAM: 16 GiB DDR4
I have these extra parameters on my Docker container:
--mount type=bind,source="/mnt/user/appdata/storj/identity/storagenode/",destination=/app/identity --mount type=bind,source="/mnt/user/storj/",destination=/app/config --mount type=bind,source="/mnt/user/storj-database/",destination=/app/dbs --restart unless-stopped
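For context, these slot into the standard docker run command from the Storj docs roughly like this (the wallet, e-mail, address, and storage values below are placeholders, not my real ones):

docker run -d --restart unless-stopped --stop-timeout 300 \
    -p 28967:28967/tcp -p 28967:28967/udp -p 14002:14002 \
    -e WALLET="0x..." -e EMAIL="user@example.com" \
    -e ADDRESS="my.external.address:28967" -e STORAGE="1TB" \
    --mount type=bind,source="/mnt/user/appdata/storj/identity/storagenode/",destination=/app/identity \
    --mount type=bind,source="/mnt/user/storj/",destination=/app/config \
    --mount type=bind,source="/mnt/user/storj-database/",destination=/app/dbs \
    --name storagenode storjlabs/storagenode:latest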

Perhaps I'm wrong, but from this I gather my config file should be in /mnt/user/storj/. There is a file there, but it doesn't show the right info: my e-mail and allocated storage are set on the Docker container, yet they never get written into the config. Is that a problem?

My config is below; I think it's just the standard, unedited one, as far as I remember:

# how frequently bandwidth usage rollups are calculated
# bandwidth.interval: 1h0m0s

# how frequently expired pieces are collected
# collector.interval: 1h0m0s

# use color in user interface
# color: false

# server address of the api gateway and frontend app
# console.address: 127.0.0.1:14002

# path to static resources
# console.static-dir: ""

# the public address of the node, useful for nodes behind NAT
contact.external-address: ""

# how frequently the node contact chore should run
# contact.interval: 1h0m0s

# Maximum Database Connection Lifetime, -1ns means the stdlib default
# db.conn_max_lifetime: -1ns

# Maximum Amount of Idle Database connections, -1 means the stdlib default
# db.max_idle_conns: 20

# Maximum Amount of Open Database connections, -1 means the stdlib default
# db.max_open_conns: 25

# address to listen on for debug endpoints
# debug.addr: 127.0.0.1:0

# expose control panel
# debug.control: false

# provide the name of the peer to enable continuous cpu/mem profiling for
# debug.profilername: ""

# If set, a path to write a process trace SVG to
# debug.trace-out: ""

# open config in default editor
# edit-conf: false

# how often to run the chore to check for satellites for the node to exit.
# graceful-exit.chore-interval: 15m0s

# the minimum acceptable bytes that an exiting node can transfer per second to the new node
# graceful-exit.min-bytes-per-second: 5.0 KB

# the minimum duration for downloading a piece from storage nodes before timing out
# graceful-exit.min-download-timeout: 2m0s

# number of concurrent transfers per graceful exit worker
# graceful-exit.num-concurrent-transfers: 5

# number of workers to handle satellite exits
# graceful-exit.num-workers: 4

# path to the certificate chain for this identity
identity.cert-path: identity/identity.cert

# path to the private key for this identity
identity.key-path: identity/identity.key

# if true, log function filename and line number
# log.caller: false

# if true, set logging to development mode
# log.development: false

# configures log encoding. can either be 'console' or 'json'
# log.encoding: console

# the minimum log level to log
log.level: info

# can be stdout, stderr, or a filename
# log.output: stderr

# if true, log stack traces
# log.stack: false

# address(es) to send telemetry to (comma-separated)
# metrics.addr: collectora.storj.io:9000

# application name for telemetry identification
# metrics.app: storagenode

# application suffix
# metrics.app-suffix: -release

# instance id prefix
# metrics.instance-prefix: ""

# how frequently to send up telemetry
# metrics.interval: 1m0s

# path to log for oom notices
# monkit.hw.oomlog: /var/log/kern.log

# maximum duration to wait before requesting data
# nodestats.max-sleep: 5m0s

# how often to sync reputation
# nodestats.reputation-sync: 4h0m0s

# how often to sync storage
# nodestats.storage-sync: 12h0m0s

# operator email address
operator.email: ""

# operator wallet address
operator.wallet: ""

# whether or not preflight check for database is enabled.
# preflight.database-check: true

# whether or not preflight check for local system clock is enabled on the satellite side. When disabling this feature, your storagenode may not setup correctly.
# preflight.local-time-check: true

# how many concurrent retain requests can be processed at the same time.
# retain.concurrency: 5

# allows for small differences in the satellite and storagenode clocks
# retain.max-time-skew: 72h0m0s

# allows configuration to enable, disable, or test retain requests from the satellite. Options: (disabled/enabled/debug)
# retain.status: enabled

# public address to listen on
server.address: :28967

# log all GRPC traffic to zap logger
server.debug-log-traffic: false

# if true, client leaves may contain the most recent certificate revocation for the current certificate
# server.extensions.revocation: true

# if true, client leaves must contain a valid "signed certificate extension" (NB: verified against certs in the peer ca whitelist; i.e. if true, a whitelist must be provided)
# server.extensions.whitelist-signed-leaf: false

# path to the CA cert whitelist (peer identities must be signed by one these to be verified). this will override the default peer whitelist
# server.peer-ca-whitelist-path: ""

# identity version(s) the server will be allowed to talk to
# server.peer-id-versions: latest

# private address to listen on
server.private-address: 127.0.0.1:7778

# url for revocation database (e.g. bolt://some.db OR redis://127.0.0.1:6378?db=2&password=abc123)
# server.revocation-dburl: bolt://config/revocations.db

# if true, uses peer ca whitelist checking
# server.use-peer-ca-whitelist: true

# total allocated bandwidth in bytes (deprecated)
storage.allocated-bandwidth: 0 B

# total allocated disk space in bytes
storage.allocated-disk-space: 1.0 TB

# how frequently Kademlia bucket should be refreshed with node stats
# storage.k-bucket-refresh-interval: 1h0m0s

# path to store data in
# storage.path: config/storage

# a comma-separated list of approved satellite node urls (unused)
# storage.whitelisted-satellites: ""

# how often the space used cache is synced to persistent storage
# storage2.cache-sync-interval: 1h0m0s

# how soon before expiration date should things be considered expired
# storage2.expiration-grace-period: 48h0m0s

# how many concurrent requests are allowed, before uploads are rejected. 0 represents unlimited.
# storage2.max-concurrent-requests: 0

# how frequently Kademlia bucket should be refreshed with node stats
# storage2.monitor.interval: 1h0m0s

# how much bandwidth a node at minimum has to advertise (deprecated)
# storage2.monitor.minimum-bandwidth: 0 B

# how much disk space a node at minimum has to advertise
# storage2.monitor.minimum-disk-space: 500.0 GB

# how long after OrderLimit creation date are OrderLimits no longer accepted
# storage2.order-limit-grace-period: 24h0m0s

# length of time to archive orders before deletion
# storage2.orders.archive-ttl: 168h0m0s

# duration between archive cleanups
# storage2.orders.cleanup-interval: 1h0m0s

# maximum duration to wait before trying to send orders
# storage2.orders.max-sleep: 5m0s

# timeout for dialing satellite during sending orders
# storage2.orders.sender-dial-timeout: 1m0s

# duration between sending
# storage2.orders.sender-interval: 1h0m0s

# timeout for sending
# storage2.orders.sender-timeout: 1h0m0s

# allows for small differences in the satellite and storagenode clocks
# storage2.retain-time-buffer: 48h0m0s

# how long to spend waiting for a stream operation before canceling
# storage2.stream-operation-timeout: 30m0s

# file path where trust lists should be cached
# storage2.trust.cache-path: config/trust-cache.json

# list of trust exclusions
# storage2.trust.exclusions: ""

# how often the trust pool should be refreshed
# storage2.trust.refresh-interval: 6h0m0s

# list of trust sources
# storage2.trust.sources: https://tardigrade.io/trusted-satellites

# address for jaeger agent
# tracing.agent-addr: agent.tracing.datasci.storj.io

# application name for tracing identification
# tracing.app: storagenode

# application suffix
# tracing.app-suffix: -release

# buffer size for collector batch packet size
# tracing.buffer-size: 0

# how frequently to send up telemetry
# tracing.interval: 0s

# buffer size for collector queue size
# tracing.queue-size: 0

# how frequently to send up telemetry
# tracing.sample: 0

# Interval to check the version
# version.check-interval: 15m0s

# Request timeout for version checks
# version.request-timeout: 1m0s

# server address to check its version against
# version.server-address: https://version.storj.io

# directory to store databases. if empty, uses data path
# storage2.database-dir: ""
storage2.database-dir: "dbs"

Do you have the logs showing why it's crashing? That's where you should have started.

Where should the log be when it crashes?

I can't find any .log file, only .db files. I can look at the Docker log without problems while it's running, but that's just not much use when it crashes.

Well, if you're running Docker and the storagenode was running, there would be a log:
docker logs storagenode
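You can also pull just the recent lines, or dump everything to a file (container name storagenode assumed; the node writes to stderr, hence the 2>&1):

docker logs --tail 100 storagenode
docker logs storagenode > /mnt/user/storagenode.log 2>&1

The log survives container restarts (which is what --restart unless-stopped does), so the entries from before a crash should still be in there; it is only lost if the container is removed and re-created.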

Well, as I pointed out, there ARE logs when it's running…

It will show logs even if it's failing to start, though.

If you'd like the log from a run where the crash has NOT happened, I can give you that without a problem … I just don't see how it would help much.

But it has no problems starting, so…

I can see how that could be misunderstood; I take that on me.
It has zero problems starting or restarting.
The problem comes after 2-3 days, when it crashes. The logs I can see are only from after the crash, so they really don't show much beyond the normal errors when it's too slow to serve a file; you all know those.

Edit: I got it to write to a file, so now it's just a matter of waiting for it to crash. I will post the log when it happens.
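(If anyone wants to do the same: per the commented default in the config above, log.output accepts a filename. Pointing it at a path inside the config mount, e.g.

log.output: /app/config/node.log

makes the file appear under /mnt/user/storj/ on the host, since that's what /app/config is bound to. The file name here is just an example.)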

Mine stopped crashing after editing the config.yaml file with:

[screenshot of the config.yaml changes]

Thanks, I'll try that and report back whether it stops.


My nodes that were crashing were actually crashing due to CPU load, not the HDD.
They are HP MicroServer N36L and N40L machines.
They're quite good at the job, but the CPU is old and very modest; it easily hits 100% load.
Not sure if FreeNAS on them would help; I'd have to test (they're on Windows now).

OK, it's for sure not that in my case.

I hold >20 TB worth of nodes on my HP MicroServer N36L and CPU usage is around 30-40%. I'm running it on Linux, though.

Oh yeah, that definitely makes the difference.
For now I'll leave it as it is.
I'm planning to migrate the data to a single 16 TB HDD anyway.
For now it will stay on Windows.

It crashed early today. I have a 900 MB log file; I've copied the entries around the FATAL lines:

2023-04-26T05:01:07.017+0200	ERROR	piecestore:cache	error getting current used space: 	{"Process": "storagenode", "error": "readdirent config/storage/blobs/qstuylguhrn2ozjv4h2c6xpxykd622gtgurhql2k7k75wqaaaaaa/7y: structure needs cleaning; readdirent config/storage/blobs/v4weeab67sbgvnbwd5z7tweqsqqun7qox2agpbxy44mqqaaaaaaa/la: structure needs cleaning", "errorVerbose": "group:\n--- readdirent config/storage/blobs/qstuylguhrn2ozjv4h2c6xpxykd622gtgurhql2k7k75wqaaaaaa/7y: structure needs cleaning\n--- readdirent config/storage/blobs/v4weeab67sbgvnbwd5z7tweqsqqun7qox2agpbxy44mqqaaaaaaa/la: structure needs cleaning"}
2023-04-26T05:01:07.017+0200	ERROR	services	unexpected shutdown of a runner	{"Process": "storagenode", "name": "piecestore:cache", "error": "readdirent config/storage/blobs/qstuylguhrn2ozjv4h2c6xpxykd622gtgurhql2k7k75wqaaaaaa/7y: structure needs cleaning; readdirent config/storage/blobs/v4weeab67sbgvnbwd5z7tweqsqqun7qox2agpbxy44mqqaaaaaaa/la: structure needs cleaning", "errorVerbose": "group:\n--- readdirent config/storage/blobs/qstuylguhrn2ozjv4h2c6xpxykd622gtgurhql2k7k75wqaaaaaa/7y: structure needs cleaning\n--- readdirent config/storage/blobs/v4weeab67sbgvnbwd5z7tweqsqqun7qox2agpbxy44mqqaaaaaaa/la: structure needs cleaning"}
2023-04-26T05:01:07.018+0200	INFO	piecestore	upload canceled	{"Process": "storagenode", "Piece ID": "77WTIWGRNEESFEB5C52KLTL3N3VUGTRO6JVWIFFNUKE4Q4W663CA", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "PUT", "Size": 65536, "Remote Address": "184.104.224.99:20256"}
2023-04-26T05:01:07.019+0200	INFO	piecestore	upload canceled	{"Process": "storagenode", "Piece ID": "LOC2J6RQXWJ7ZCIPQRNCOMDTGAXA3P4AYZXTQ6MNSKUZIBH4WLTQ", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "PUT", "Size": 131072, "Remote Address": "72.52.83.202:52492"}
2023-04-26T05:01:07.020+0200	INFO	piecestore	download canceled	{"Process": "storagenode", "Piece ID": "MIZK5BPS6HE7S2ICUUMOW56TM7CDLK25R7T7UFOYXITSP4BWAL2Q", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET", "Offset": 0, "Size": 2105344, "Remote Address": "5.161.121.132:43994"}
2023-04-26T05:01:07.022+0200	INFO	piecestore	download canceled	{"Process": "storagenode", "Piece ID": "3FSYGRLDMDWUVZMYLW6EAMI43MP24WI55US6NQHZBSUO5YTWB3GA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET", "Offset": 0, "Size": 1218560, "Remote Address": "216.66.40.82:50498"}
2023-04-26T05:01:07.039+0200	FATAL	Unrecoverable error	{"Process": "storagenode", "error": "readdirent config/storage/blobs/qstuylguhrn2ozjv4h2c6xpxykd622gtgurhql2k7k75wqaaaaaa/7y: structure needs cleaning; readdirent config/storage/blobs/v4weeab67sbgvnbwd5z7tweqsqqun7qox2agpbxy44mqqaaaaaaa/la: structure needs cleaning", "errorVerbose": "group:\n--- readdirent config/storage/blobs/qstuylguhrn2ozjv4h2c6xpxykd622gtgurhql2k7k75wqaaaaaa/7y: structure needs cleaning\n--- readdirent config/storage/blobs/v4weeab67sbgvnbwd5z7tweqsqqun7qox2agpbxy44mqqaaaaaaa/la: structure needs cleaning"}
2023-04-26T05:01:09.715+0200	INFO	Configuration loaded	{"Process": "storagenode", "Location": "/app/config/config.yaml"}
2023-04-27T23:18:40.360+0200	INFO	piecestore	upload started	{"Process": "storagenode", "Piece ID": "QISHTHKE5W3ZDVI2MMSZG23BT2ODRZ6VPJ7I7U6XGR6CWP25DYEA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "PUT", "Available Space": 1441162677313, "Remote Address": "5.161.146.178:40320"}
2023-04-27T23:18:40.478+0200	INFO	piecestore	uploaded	{"Process": "storagenode", "Piece ID": "QISHTHKE5W3ZDVI2MMSZG23BT2ODRZ6VPJ7I7U6XGR6CWP25DYEA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "PUT", "Size": 36608, "Remote Address": "5.161.146.178:40320"}
2023-04-27T23:18:40.715+0200	ERROR	piecestore:cache	error getting current used space: 	{"Process": "storagenode", "error": "readdirent config/storage/blobs/qstuylguhrn2ozjv4h2c6xpxykd622gtgurhql2k7k75wqaaaaaa/7y: structure needs cleaning; readdirent config/storage/blobs/v4weeab67sbgvnbwd5z7tweqsqqun7qox2agpbxy44mqqaaaaaaa/la: structure needs cleaning", "errorVerbose": "group:\n--- readdirent config/storage/blobs/qstuylguhrn2ozjv4h2c6xpxykd622gtgurhql2k7k75wqaaaaaa/7y: structure needs cleaning\n--- readdirent config/storage/blobs/v4weeab67sbgvnbwd5z7tweqsqqun7qox2agpbxy44mqqaaaaaaa/la: structure needs cleaning"}
2023-04-27T23:18:40.715+0200	ERROR	services	unexpected shutdown of a runner	{"Process": "storagenode", "name": "piecestore:cache", "error": "readdirent config/storage/blobs/qstuylguhrn2ozjv4h2c6xpxykd622gtgurhql2k7k75wqaaaaaa/7y: structure needs cleaning; readdirent config/storage/blobs/v4weeab67sbgvnbwd5z7tweqsqqun7qox2agpbxy44mqqaaaaaaa/la: structure needs cleaning", "errorVerbose": "group:\n--- readdirent config/storage/blobs/qstuylguhrn2ozjv4h2c6xpxykd622gtgurhql2k7k75wqaaaaaa/7y: structure needs cleaning\n--- readdirent config/storage/blobs/v4weeab67sbgvnbwd5z7tweqsqqun7qox2agpbxy44mqqaaaaaaa/la: structure needs cleaning"}
2023-04-27T23:18:40.716+0200	INFO	piecestore	upload canceled	{"Process": "storagenode", "Piece ID": "HVRUQPMOSTYERBVVCKANEJ2YQCZ6HUOPSEPHHUT7DKS2HHU3FGSA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "PUT", "Size": 131072, "Remote Address": "5.161.149.40:36852"}
2023-04-27T23:18:40.717+0200	INFO	piecestore	upload canceled	{"Process": "storagenode", "Piece ID": "76LN3KGDX5GKMGHJWJC45UBQ5O3UWJPDR7N2L2CUXLHWNWYTWLTQ", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "PUT", "Size": 65536, "Remote Address": "193.62.216.32:36824"}
2023-04-27T23:18:40.717+0200	INFO	piecestore	upload canceled	{"Process": "storagenode", "Piece ID": "BBI4ZKATMU7544IDUWZ6YACS6DDJY4ZZTHCL4DVV5PSMNMMSD4QQ", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "PUT", "Size": 794624, "Remote Address": "193.62.216.32:60420"}
2023-04-27T23:18:40.718+0200	INFO	piecestore	upload canceled	{"Process": "storagenode", "Piece ID": "DGYFAG4SIZ3626KGXLYY6Y3KZZ5AUGBTMOHNAHZUKCPSC66B7K2Q", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "PUT", "Size": 1581056, "Remote Address": "193.62.216.32:54490"}
2023-04-27T23:18:40.718+0200	INFO	piecestore	upload canceled	{"Process": "storagenode", "Piece ID": "IV6JIA3XWNN6VC5NT6Q2WKBBIKDPWIFSJBECZ6WEOTTFMJ3FRWZQ", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "PUT", "Size": 1318912, "Remote Address": "193.62.216.32:58610"}
2023-04-27T23:18:40.741+0200	FATAL	Unrecoverable error	{"Process": "storagenode", "error": "readdirent config/storage/blobs/qstuylguhrn2ozjv4h2c6xpxykd622gtgurhql2k7k75wqaaaaaa/7y: structure needs cleaning; readdirent config/storage/blobs/v4weeab67sbgvnbwd5z7tweqsqqun7qox2agpbxy44mqqaaaaaaa/la: structure needs cleaning", "errorVerbose": "group:\n--- readdirent config/storage/blobs/qstuylguhrn2ozjv4h2c6xpxykd622gtgurhql2k7k75wqaaaaaa/7y: structure needs cleaning\n--- readdirent config/storage/blobs/v4weeab67sbgvnbwd5z7tweqsqqun7qox2agpbxy44mqqaaaaaaa/la: structure needs cleaning"}
2023-04-27T23:18:43.417+0200	INFO	Configuration loaded	{"Process": "storagenode", "Location": "/app/config/config.yaml"}

Thanks, I'll try reinstalling the Docker container then. I had some errors way back that I fixed, but without reinstalling Docker.

You need to stop your node, unmount the disk, and fix the filesystem with fsck until all errors are fixed. If your filesystem is xfs, you need to use a different method: https://www.2daygeek.com/repairing-xfs-file-system-in-rhel/
After that you can mount the disk back and run the node.
Docker has no relation to the corrupted filesystem, so there is no need to reinstall it.
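A minimal sketch of that procedure (the device name /dev/sdX1 and container name storagenode are placeholders; on Unraid you would normally stop the array or use Maintenance mode rather than unmounting by hand):

docker stop -t 300 storagenode   # give the node time to shut down cleanly
umount /dev/sdX1
fsck -f /dev/sdX1                # ext4: re-run until no more errors are reported
# xfs_repair /dev/sdX1           # xfs: use this instead, per the guide linked above
# remount the disk (or restart the array), then:
docker start storagenode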