Running 1.77.2 and am suddenly getting the following error every few hours causing the node to crash
2023-05-01T06:09:47.337-0400 FATAL Unrecoverable error {“error”: “piecestore monitor: timed out after 1m0s while verifying writability of storage directory”, “errorVerbose”: “piecestore monitor: timed out after 1m0s while verifying writability of storage directory\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2.1:150\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2:146\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75”}
I would first see if you have a drive issue, make sure you can write files to the device and everything is behaving as it should be. If after that, I would increase the timeout thresholds as explained above to see if that helps.
So an update on my errors. Since removing the on-start filewalker (my drives are Storj exclusive) my notes have been mostly stable. I’ve still had one node crash a couple times but it was for an out of memory condition, not surprising where I’m running three nodes on a measly Raspberry Pi 3 with only 1GB of RAM. I’m curious to see if the 1.78 lazy-filewalker ability will help me out and allow me to re-enable the on-start filewalker.
For the slow disk you may increase a writeable timeout a little bit and restart the node. Of course because of issue Storagenode 1.77.2 wont stop you can do restart only after crash.
defragmentation was running 5 days in the background, i canceled it, but it defragmented most of the data.
now i will turn on free space and wait. with timeout “2m10s”. “piecescan on startup false”
i activated ingress again restarted the node and within some hours
since im on v1.76.2 it shows something different:
-the node process was using unusual high cpu, 10x the normal amount,
-dasboard was taking unusual long to load, i restarted the node manually then dashboard was normal
-i got suspension 95% too
-still no audit errors
-uptime robot did not detect any error, (maybe because automatic restart)
-online was affected shortly after
-timeout was 1m in logs (maybe i got the line wrong, could not find an # line in .yaml for that)
so defragmentation helped not with the error, but with general performance of the disk.
I have files for chia on the same drive and no problems with it. but stroj constantly crashes. how can you even imagine that the hard drive will not respond for more than a minute ?
in the resource manager i have a response maximum of 100ms and in the logs when it crashes more than 1 minute ? a bug in the storj software
noda crashes every 2-10 hours. no problems with the disk (victoria 3.7 test), no errors in any system logs. chia mining disk requirements - 5 seconds and everything works. your software constantly crashes. if the disk does not respond 1 minute - then the disk is dead, and if it is alive - then the software does not work correctly.
i still go with some ungabunga inside code.
it is showing on windows first, since docker gets automatic restarts per default. covering it up.
full nodes not affected, so everybody thinks its fine tho.
Please try to do a defragmentation and do not disable it for the data location.
See also
The problem with not reliable readability/writeability was there before, just checkers did not have a timeout, so for partial hanging it was not effective and some such nodes were disqualified in the past.
Now the timeout is in place and crashes node in such half-hanging state.
However, you may disable it, and make it behave as before - just increase a timeout to some big number like a week.
But I think it’s better to know that there is some problem with disk subsystem than close eyes on it.
I have similar problem.
Has been running fine for a long time. Now service is crashing again an again.
OS: Windows
I get a lot of these Fatal errors:
FATAL Unrecoverable error {"error": "piecestore monitor: timed out after 1m0s while verifying readability of storage directory", "errorVerbose": "piecestore monitor: timed out after 1m0s while verifying readability of storage directory\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func1.1:142\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func1:134\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75"}
My config file looks like this:
# how frequently bandwidth usage rollups are calculated
# bandwidth.interval: 1h0m0s
# how frequently expired pieces are collected
# collector.interval: 1h0m0s
# use color in user interface
# color: false
# server address of the api gateway and frontend app
# console.address: 127.0.0.1:14002
# path to static resources
# console.static-dir: ""
# the public address of the node, useful for nodes behind NAT
contact.external-address: 62.66.145.73:28967
# how frequently the node contact chore should run
# contact.interval: 1h0m0s
# Maximum Database Connection Lifetime, -1ns means the stdlib default
# db.conn_max_lifetime: 30m0s
# Maximum Amount of Idle Database connections, -1 means the stdlib default
# db.max_idle_conns: 1
# Maximum Amount of Open Database connections, -1 means the stdlib default
# db.max_open_conns: 5
# address to listen on for debug endpoints
# debug.addr: 127.0.0.1:0
# expose control panel
# debug.control: false
# If set, a path to write a process trace SVG to
# debug.trace-out: ""
# open config in default editor
# edit-conf: false
# in-memory buffer for uploads
# filestore.write-buffer-size: 128.0 KiB
# how often to run the chore to check for satellites for the node to exit.
# graceful-exit.chore-interval: 1m0s
# the minimum acceptable bytes that an exiting node can transfer per second to the new node
# graceful-exit.min-bytes-per-second: 5.00 KB
# the minimum duration for downloading a piece from storage nodes before timing out
# graceful-exit.min-download-timeout: 2m0s
# number of concurrent transfers per graceful exit worker
# graceful-exit.num-concurrent-transfers: 5
# number of workers to handle satellite exits
# graceful-exit.num-workers: 4
# path to the certificate chain for this identity
identity.cert-path: C:\Users\administrator\Documents\identity.cert
# path to the private key for this identity
identity.key-path: C:\Users\administrator\Documents\identity.key
# if true, log function filename and line number
# log.caller: false
# if true, set logging to development mode
# log.development: false
# configures log encoding. can either be 'console', 'json', or 'pretty'.
# log.encoding: ""
# the minimum log level to log
log.level: FATAL
# can be stdout, stderr, or a filename
log.output: winfile:///D:\Storj-DB\\storagenode.log
# if true, log stack traces
# log.stack: false
# address(es) to send telemetry to (comma-separated)
# metrics.addr: collectora.storj.io:9000
# application name for telemetry identification
# metrics.app: storagenode.exe
# application suffix
# metrics.app-suffix: -release
# instance id prefix
# metrics.instance-prefix: ""
# how frequently to send up telemetry
# metrics.interval: 1m0s
# maximum duration to wait before requesting data
# nodestats.max-sleep: 5m0s
# how often to sync reputation
# nodestats.reputation-sync: 4h0m0s
# how often to sync storage
# nodestats.storage-sync: 12h0m0s
# operator email address
operator.email: Thorben@j
# operator wallet address
...
# operator wallet features
operator.wallet-features: ""
# move pieces to trash upon deletion. Warning: if set to false, you risk disqualification for failed audits if a satellite database is restored from backup.
# pieces.delete-to-trash: true
# file preallocated for uploading
# pieces.write-prealloc-size: 4.0 MiB
# whether or not preflight check for database is enabled.
# preflight.database-check: true
# whether or not preflight check for local system clock is enabled on the satellite side. When disabling this feature, your storagenode may not setup correctly.
# preflight.local-time-check: true
# how many concurrent retain requests can be processed at the same time.
retain.concurrency: 5
# allows for small differences in the satellite and storagenode clocks
# retain.max-time-skew: 72h0m0s
# allows configuration to enable, disable, or test retain requests from the satellite. Options: (disabled/enabled/debug)
# retain.status: enabled
# public address to listen on
server.address: :28967
# if true, client leaves may contain the most recent certificate revocation for the current certificate
# server.extensions.revocation: true
# if true, client leaves must contain a valid "signed certificate extension" (NB: verified against certs in the peer ca whitelist; i.e. if true, a whitelist must be provided)
# server.extensions.whitelist-signed-leaf: false
# path to the CA cert whitelist (peer identities must be signed by one these to be verified). this will override the default peer whitelist
# server.peer-ca-whitelist-path: ""
# identity version(s) the server will be allowed to talk to
# server.peer-id-versions: latest
# private address to listen on
server.private-address: 127.0.0.1:7778
# url for revocation database (e.g. bolt://some.db OR redis://127.0.0.1:6378?db=2&password=abc123)
server.revocation-dburl: bolt://D:\Storj-DB/revocations.db
# if true, uses peer ca whitelist checking
# server.use-peer-ca-whitelist: true
# total allocated bandwidth in bytes (deprecated)
storage.allocated-bandwidth: 0 B
# total allocated disk space in bytes
storage.allocated-disk-space: 2.1 TB
# how frequently Kademlia bucket should be refreshed with node stats
# storage.k-bucket-refresh-interval: 1h0m0s
# path to store data in
storage.path: E:\jStorage\
# a comma-separated list of approved satellite node urls (unused)
# storage.whitelisted-satellites: ""
# how often the space used cache is synced to persistent storage
# storage2.cache-sync-interval: 1h0m0s
# directory to store databases. if empty, uses data path
storage2.database-dir: D:\Storj-DB\
# size of the piece delete queue
# storage2.delete-queue-size: 10000
# how many piece delete workers
# storage2.delete-workers: 1
# how soon before expiration date should things be considered expired
# storage2.expiration-grace-period: 48h0m0s
# how many concurrent requests are allowed, before uploads are rejected. 0 represents unlimited.
# storage2.max-concurrent-requests: 0
# amount of memory allowed for used serials store - once surpassed, serials will be dropped at random
# storage2.max-used-serials-size: 1.00 MB
# how frequently Kademlia bucket should be refreshed with node stats
# storage2.monitor.interval: 1h0m0s
# how much bandwidth a node at minimum has to advertise (deprecated)
# storage2.monitor.minimum-bandwidth: 0 B
# how much disk space a node at minimum has to advertise
# storage2.monitor.minimum-disk-space: 500.00 GB
# how frequently to verify the location and readability of the storage directory
# storage2.monitor.verify-dir-readable-interval: 1m0s
# how frequently to verify writability of storage directory
#storage2.monitor.verify-dir-writable-interval: 5m0s
# how long after OrderLimit creation date are OrderLimits no longer accepted
# storage2.order-limit-grace-period: 1h0m0s
# length of time to archive orders before deletion
# storage2.orders.archive-ttl: 168h0m0s
# duration between archive cleanups
# storage2.orders.cleanup-interval: 5m0s
# maximum duration to wait before trying to send orders
# storage2.orders.max-sleep: 30s
# path to store order limit files in
# storage2.orders.path: C:\Program Files\Storj\Storage Node/orders
# timeout for dialing satellite during sending orders
# storage2.orders.sender-dial-timeout: 1m0s
# duration between sending
# storage2.orders.sender-interval: 1h0m0s
# timeout for sending
# storage2.orders.sender-timeout: 1h0m0s
# allows for small differences in the satellite and storagenode clocks
# storage2.retain-time-buffer: 48h0m0s
# how long to spend waiting for a stream operation before canceling
# storage2.stream-operation-timeout: 30m0s
# file path where trust lists should be cached
# storage2.trust.cache-path: C:\Program Files\Storj\Storage Node/trust-cache.json
# list of trust exclusions
# storage2.trust.exclusions: ""
# how often the trust pool should be refreshed
# storage2.trust.refresh-interval: 6h0m0s
# list of trust sources
# storage2.trust.sources: https://tardigrade.io/trusted-satellites
# address for jaeger agent
# tracing.agent-addr: agent.tracing.datasci.storj.io:5775
# application name for tracing identification
# tracing.app: storagenode.exe
# application suffix
# tracing.app-suffix: -release
# buffer size for collector batch packet size
# tracing.buffer-size: 0
# whether tracing collector is enabled
# tracing.enabled: false
# how frequently to flush traces to tracing agent
# tracing.interval: 0s
# buffer size for collector queue size
# tracing.queue-size: 0
# how frequent to sample traces
# tracing.sample: 0
# Interval to check the version
# version.check-interval: 15m0s
# Request timeout for version checks
# version.request-timeout: 1m0s
# server address to check its version against
# version.server-address: https://version.storj.io
storage2.monitor.verify-dir-writable-timeout: 4m00s