Recovering a node after migrating from a failing HDD

In F:\data\storage there is a file config.yaml1058611141
a strange extension on the file. Its contents:

# how frequently bandwidth usage rollups are calculated
# bandwidth.interval: 1h0m0s

# how frequently expired pieces are collected
# collector.interval: 1h0m0s

# use color in user interface
# color: false

# server address of the api gateway and frontend app
console.address: :14002

# path to static resources
# console.static-dir: ""

# the public address of the node, useful for nodes behind NAT
contact.external-address: ""

# how frequently the node contact chore should run
# contact.interval: 1h0m0s

# Maximum Database Connection Lifetime, -1ns means the stdlib default
# db.conn_max_lifetime: 30m0s

# Maximum Amount of Idle Database connections, -1 means the stdlib default
# db.max_idle_conns: 1

# Maximum Amount of Open Database connections, -1 means the stdlib default
# db.max_open_conns: 5

# address to listen on for debug endpoints
# debug.addr: 127.0.0.1:0

# expose control panel
# debug.control: false

# If set, a path to write a process trace SVG to
# debug.trace-out: ""

# open config in default editor
# edit-conf: false

# in-memory buffer for uploads
# filestore.write-buffer-size: 128.0 KiB

# how often to run the chore to check for satellites for the node to exit.
# graceful-exit.chore-interval: 1m0s

# the minimum acceptable bytes that an exiting node can transfer per second to the new node
# graceful-exit.min-bytes-per-second: 5.00 KB

# the minimum duration for downloading a piece from storage nodes before timing out
# graceful-exit.min-download-timeout: 2m0s

# number of concurrent transfers per graceful exit worker
# graceful-exit.num-concurrent-transfers: 5

# number of workers to handle satellite exits
# graceful-exit.num-workers: 4

# Enable additional details about the satellite connections via the HTTP healthcheck.
healthcheck.details: false

# Provide health endpoint (including suspension/audit failures) on main public port, but HTTP protocol.
healthcheck.enabled: true

# path to the certificate chain for this identity
identity.cert-path: identity/identity.cert

# path to the private key for this identity
identity.key-path: identity/identity.key

# if true, log function filename and line number
# log.caller: false

# if true, set logging to development mode
# log.development: false

# configures log encoding. can either be 'console', 'json', 'pretty', or 'gcloudlogging'.
# log.encoding: ""

# the minimum log level to log
log.level: info

# can be stdout, stderr, or a filename
# log.output: stderr

# if true, log stack traces
# log.stack: false

# address(es) to send telemetry to (comma-separated)
# metrics.addr: collectora.storj.io:9000

# application name for telemetry identification. Ignored for certain applications.
# metrics.app: storagenode

# application suffix. Ignored for certain applications.
metrics.app-suffix: -alpha

# address(es) to send telemetry to (comma-separated)
# metrics.event-addr: eventkitd.datasci.storj.io:9002

# instance id prefix
# metrics.instance-prefix: ""

# how frequently to send up telemetry. Ignored for certain applications.
metrics.interval: 30m0s

# maximum duration to wait before requesting data
# nodestats.max-sleep: 5m0s

# how often to sync reputation
# nodestats.reputation-sync: 4h0m0s

# how often to sync storage
# nodestats.storage-sync: 12h0m0s

# operator email address
operator.email: ""

# operator wallet address
operator.wallet: ""

# operator wallet features
operator.wallet-features: ""

# move pieces to trash upon deletion. Warning: if set to false, you risk disqualification for failed audits if a satellite database is restored from backup.
# pieces.delete-to-trash: true

# file preallocated for uploading
# pieces.write-prealloc-size: 4.0 MiB

# whether or not preflight check for database is enabled.
# preflight.database-check: true

# whether or not preflight check for local system clock is enabled on the satellite side. When disabling this feature, your storagenode may not setup correctly.
# preflight.local-time-check: true

# how many concurrent retain requests can be processed at the same time.
# retain.concurrency: 5

# allows for small differences in the satellite and storagenode clocks
# retain.max-time-skew: 72h0m0s

# allows configuration to enable, disable, or test retain requests from the satellite. Options: (disabled/enabled/debug)
# retain.status: enabled

# public address to listen on
server.address: :28967

# whether to debounce incoming messages
# server.debouncing-enabled: true

# if true, client leaves may contain the most recent certificate revocation for the current certificate
# server.extensions.revocation: true

# if true, client leaves must contain a valid "signed certificate extension" (NB: verified against certs in the peer ca whitelist; i.e. if true, a whitelist must be provided)
# server.extensions.whitelist-signed-leaf: false

# path to the CA cert whitelist (peer identities must be signed by one these to be verified). this will override the default peer whitelist
# server.peer-ca-whitelist-path: ""

# identity version(s) the server will be allowed to talk to
# server.peer-id-versions: latest

# private address to listen on
server.private-address: 127.0.0.1:7778

# url for revocation database (e.g. bolt://some.db OR redis://127.0.0.1:6378?db=2&password=abc123)
# server.revocation-dburl: bolt://config/revocations.db

# enable support for tcp fast open
# server.tcp-fast-open: true

# the size of the tcp fast open queue
# server.tcp-fast-open-queue: 256

# if true, uses peer ca whitelist checking
# server.use-peer-ca-whitelist: true

# total allocated bandwidth in bytes (deprecated)
storage.allocated-bandwidth: 0 B

# total allocated disk space in bytes
storage.allocated-disk-space: 2.00 TB

# how frequently Kademlia bucket should be refreshed with node stats
# storage.k-bucket-refresh-interval: 1h0m0s

# path to store data in
# storage.path: config/storage

# a comma-separated list of approved satellite node urls (unused)
# storage.whitelisted-satellites: ""

# how often the space used cache is synced to persistent storage
# storage2.cache-sync-interval: 1h0m0s

# directory to store databases. if empty, uses data path
# storage2.database-dir: ""

# size of the piece delete queue
# storage2.delete-queue-size: 10000

# how many piece delete workers
# storage2.delete-workers: 1

# how many workers to use to check if satellite pieces exists
# storage2.exists-check-workers: 5

# how soon before expiration date should things be considered expired
# storage2.expiration-grace-period: 48h0m0s

# how many concurrent requests are allowed, before uploads are rejected. 0 represents unlimited.
# storage2.max-concurrent-requests: 0

# amount of memory allowed for used serials store - once surpassed, serials will be dropped at random
# storage2.max-used-serials-size: 1.00 MB

# a client upload speed should not be lower than MinUploadSpeed in bytes-per-second (E.g: 1Mb), otherwise, it will be flagged as slow-connection and potentially be closed
# storage2.min-upload-speed: 0 B

# if the portion defined by the total number of alive connection per MaxConcurrentRequest reaches this threshold, a slow upload client will no longer be monitored and flagged
# storage2.min-upload-speed-congestion-threshold: 0.8

# if MinUploadSpeed is configured, after a period of time after the client initiated the upload, the server will flag unusually slow upload client
# storage2.min-upload-speed-grace-duration: 10s

# how frequently Kademlia bucket should be refreshed with node stats
# storage2.monitor.interval: 1h0m0s

# how much bandwidth a node at minimum has to advertise (deprecated)
# storage2.monitor.minimum-bandwidth: 0 B

# how much disk space a node at minimum has to advertise
# storage2.monitor.minimum-disk-space: 500.00 GB

# how frequently to verify the location and readability of the storage directory
# storage2.monitor.verify-dir-readable-interval: 1m0s

# how long to wait for a storage directory readability verification to complete
# storage2.monitor.verify-dir-readable-timeout: 1m0s

# how frequently to verify writability of storage directory
# storage2.monitor.verify-dir-writable-interval: 5m0s

# how long to wait for a storage directory writability verification to complete
# storage2.monitor.verify-dir-writable-timeout: 1m0s

# how long after OrderLimit creation date are OrderLimits no longer accepted
# storage2.order-limit-grace-period: 1h0m0s

# length of time to archive orders before deletion
# storage2.orders.archive-ttl: 168h0m0s

# duration between archive cleanups
# storage2.orders.cleanup-interval: 5m0s

# maximum duration to wait before trying to send orders
# storage2.orders.max-sleep: 30s

# path to store order limit files in
# storage2.orders.path: config/orders

# timeout for dialing satellite during sending orders
# storage2.orders.sender-dial-timeout: 1m0s

# duration between sending
# storage2.orders.sender-interval: 1h0m0s

# timeout for sending
# storage2.orders.sender-timeout: 1h0m0s

# if set to true, all pieces disk usage is recalculated on startup
# storage2.piece-scan-on-startup: true

# allows for small differences in the satellite and storagenode clocks
# storage2.retain-time-buffer: 48h0m0s

# how long to spend waiting for a stream operation before canceling
# storage2.stream-operation-timeout: 30m0s

# file path where trust lists should be cached
# storage2.trust.cache-path: config/trust-cache.json

# list of trust exclusions
# storage2.trust.exclusions: ""

# how often the trust pool should be refreshed
# storage2.trust.refresh-interval: 6h0m0s

# list of trust sources
# storage2.trust.sources: https://www.storj.io/dcs-satellites

# address for jaeger agent
# tracing.agent-addr: agent.tracing.datasci.storj.io:5775

# application name for tracing identification
# tracing.app: storagenode

# application suffix
# tracing.app-suffix: -release

# buffer size for collector batch packet size
# tracing.buffer-size: 0

# whether tracing collector is enabled
# tracing.enabled: true

# how frequently to flush traces to tracing agent
# tracing.interval: 0s

# buffer size for collector queue size
# tracing.queue-size: 0

# how frequent to sample traces
# tracing.sample: 0

# Interval to check the version
# version.check-interval: 15m0s

# Request timeout for version checks
# version.request-timeout: 1m0s

# server address to check its version against
version.server-address: https://version.storj.io

How did you redirect the logs, by the way? There's nothing about it in config.yaml.

Found it:

docker run -d --restart unless-stopped --stop-timeout 300 -p 28971:28967/tcp -p 28971:28967/udp -p 127.0.0.1:14004:14002 -e WALLET="xxx" -e EMAIL="xxx@xxx" -e ADDRESS="xxx:28971" -e STORAGE="11TB" --mount type=bind,source="F:\Identity\storagenode2\",destination=/app/identity --mount type=bind,source="F:\data\",destination=/app/config --mount type=bind,source="D:\Storj\Logs\node2.log",destination=/app/logs/node2.log --name storagenode2 storjlabs/storagenode:latest --log.output=/app/logs/node2.log
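
So the redirection comes from the --log.output=/app/logs/node2.log flag at the end, plus the bind mount of D:\Storj\Logs\node2.log into the container. One caveat: with --mount type=bind, Docker requires the source to already exist on the host, so the log file has to be created before the first run. A minimal sketch, assuming a WSL2 shell where D:\ is mounted as /mnt/d:

# create the host log file once, before starting the container
touch /mnt/d/Storj/Logs/node2.log
# afterwards the redirected log can be followed from the host
tail -f /mnt/d/Storj/Logs/node2.log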

I renamed config.yaml1058611141 to config.yaml
and additionally copied it to F:\data
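
For reference, the rename and copy from a WSL2 shell would look roughly like this (a sketch, assuming F:\ is mounted as /mnt/f):

# restore the proper file name and put a copy one level up
cd /mnt/f/data/storage
mv config.yaml1058611141 config.yaml
cp config.yaml /mnt/f/data/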

A new error then appeared in the logs:

2023-04-30T05:22:12.762Z	ERROR	contact:service	ping satellite failed 	{"Process": "storagenode", "Satellite ID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB", "attempts": 3, "error": "ping satellite: check-in ratelimit: node rate limited by id", "errorVerbose": "ping satellite: check-in ratelimit: node rate limited by id\n\tstorj.io/storj/storagenode/contact.(*Service).pingSatelliteOnce:143\n\tstorj.io/storj/storagenode/contact.(*Service).pingSatellite:102\n\tstorj.io/storj/storagenode/contact.(*Chore).updateCycles.func1:87\n\tstorj.io/common/sync2.(*Cycle).Run:99\n\tstorj.io/common/sync2.(*Cycle).Start.func1:77\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75"}

This error occurs when the node tries to check in too often, which happens when its external address (the one declared in ADDRESS) is not reachable.
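
A quick way to confirm that is to probe the port (a sketch using netcat; xxx stands for the same placeholder as in the ADDRESS variable above):

# run this from a machine outside your network, otherwise NAT loopback can mislead you
nc -vz xxx 28971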

I also noticed that the F:\data folder contains a lot of trust-cache.json files.
Inside each one:

{
  "entries": {
    "https://www.storj.io/dcs-satellites": [
      {
        "SatelliteURL": {
          "id": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S",
          "host": "us1.storj.io",
          "port": 7777
        },
        "authoritative": false
      },
      {
        "SatelliteURL": {
          "id": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs",
          "host": "eu1.storj.io",
          "port": 7777
        },
        "authoritative": false
      },
      {
        "SatelliteURL": {
          "id": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6",
          "host": "ap1.storj.io",
          "port": 7777
        },
        "authoritative": false
      },
      {
        "SatelliteURL": {
          "id": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE",
          "host": "saltlake.tardigrade.io",
          "port": 7777
        },
        "authoritative": false
      },
      {
        "SatelliteURL": {
          "id": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB",
          "host": "europe-north-1.tardigrade.io",
          "port": 7777
        },
        "authoritative": false
      },
      {
        "SatelliteURL": {
          "id": "12tRQrMTWUWwzwGh18i7Fqs67kmdhH9t6aToeiwbo5mfS2rUmo",
          "host": "us2.storj.io",
          "port": 7777
        },
        "authoritative": false
      }
    ]
  }
}

Only one of them is needed, with the proper name and extension.
By the way, try deleting all copies of trust-cache.json and restarting the node; it should download the file again.
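
A minimal sketch of that cleanup from a WSL2 shell (the glob is an assumption; check the actual file names first):

docker stop -t 300 storagenode2
# removes the cache together with any misnamed copies
rm /mnt/f/data/trust-cache.json*
docker start storagenode2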

Apparently it was a combination of the wrong config.yaml file name and the port forwarding settings.

I fixed the config.yaml name and switched to a different port, and the node started.

There are errors in the logs:

ERROR	piecestore	upload failed	{"Process": "storagenode", "Piece ID": "RMMZNELQBADH6M37RMEKIVYDDTO34MMCKF3TKAJPSGJA5TKFASKA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "PUT", "error": "context canceled", "errorVerbose": "context canceled\n\tstorj.io/common/rpc/rpcstatus.Wrap:75\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Upload.func5:498\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Upload:504\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func1:243\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:61\n\tstorj.io/common/experiment.(*Handler).HandleRPC:42\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:124\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:66\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:114\n\tstorj.io/drpc/drpcctx.(*Tracker).track:35", "Size": 14848, "Remote Address": "172.17.0.1:44182"}
ERROR	piecestore	download failed	{"Process": "storagenode", "Piece ID": "PMOFC7522WXEKE5VJCSLE2OGV3NYVPXF3ODPENVU2OGHYPTJFM2A", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET", "Remote Address": "172.17.0.1:43730", "error": "untrusted: unable to get signee: trust: rpc: tcp connector failed: rpc: dial tcp: lookup us1.storj.io: operation was canceled", "errorVerbose": "untrusted: unable to get signee: trust: rpc: tcp connector failed: rpc: dial tcp: lookup us1.storj.io: operation was canceled\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).VerifyOrderLimitSignature:140\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).verifyOrderLimit:62\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:599\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:251\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:61\n\tstorj.io/common/experiment.(*Handler).HandleRPC:42\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:124\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:66\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:114\n\tstorj.io/drpc/drpcctx.(*Tracker).track:35"}

In any case, the node is up and running, which is wonderful.

I hope it doesn't get disqualified.

The problem isn't fully solved yet: the node that used to run on this port is stopped, and still only 3 of the 4 nodes are running.
I hope it's a port forwarding issue rather than a WSL2 setting or something else unknown.
I'll get to the installation site today and dig into the router settings.

Also, apparently even before the migration, two hefty files appeared in the F:\data\storage folder:
DUMP2db2.tmp and DUMP29fe.tmp. Can they be deleted?

In any case, I wouldn't have figured it out myself without Alexey's invaluable help.
Thank you, Alexey!
As usual, you are my hero :wink:

Files like that shouldn't be there, so yes, they can be deleted.
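
For example, from a WSL2 shell with the container stopped:

rm /mnt/f/data/storage/DUMP*.tmp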

That's strange. The node couldn't get an IP address from DNS.

By the way, you can switch back to the original port; just make sure it's the one specified in the forwarding rule.

I'll keep an eye out over the next few days to see whether this error recurs.

In the antivirus it's definitely allowed; I also checked the router this week and the port was enabled in the IP Firewall, but the Mikrotik router was installed only a month ago and is still hard for me to grasp, so I may have missed something.
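
For what it's worth, the forwarding can be sanity-checked from the RouterOS terminal (a sketch assuming a standard dst-nat rule; adjust the matchers to your setup). The to-addresses field must point at the machine the node actually runs on:

/ip firewall nat print detail where protocol=tcp dst-port=28971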

As expected, it was my own mistake:
I had assigned 28971/tcp to a different local IP.
Corrected it; now 4 of 4 nodes are running.

Quest complete.

At the moment the disk is working and the node is active, but I received this message:

Your Node is suspended since 2023-04-30 12:00:44.911493 +0000 UTC: This is a reminder that your StorageNode is suspended on Satellite 12tRQrMTWUWwzwGh18i7Fqs67kmdhH9t6aToeiwbo5mfS2rUmo

This is temporary, right, until the node has been up for a certain number of hours?
The suspension arrived right after the node restart, and about 10 days have passed since then. The scores aren't improving and the amount of used space isn't growing.
The other nodes are barely gaining in their stats either.

A suspension can happen in two cases:

  • the suspension score is below 60%
  • the online score is below 60%

To recover from the first, your node needs to start passing audits; the suspension score recovers with each passed audit.
To recover from the second, your node needs to stay online for the next 30 days. Each downtime event requires an additional 30 days of online time.
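
To watch the scores recover without opening the web dashboard, you can poll the node's own API (a sketch using the 127.0.0.1:14004 mapping from the docker run command above; /api/sno/satellites is the endpoint the dashboard itself reads):

curl -s http://127.0.0.1:14004/api/sno/satellites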

On us2.storj.io:7777 the current score is 59.29%; I'll wait until May 30.
Understood, thank you!
