Fatal Error on my Node / Timeout after 1 min

That kind of workload is heavy on drives and needs to be optimized where possible.
Also, drives will not get smaller, only bigger. The biggest drives I know of currently are 22 TB, and manufacturers are working on even larger ones.

1 Like

How big is your trash according to the dashboard? Mine is 207 GB, growing ~2 GB a day.

Can you get the trash folder size from Windows File Explorer, or is it suspiciously slow?

Suspiciously slow, yes. The node is half full in terms of drive capacity, not the Storj allocation: 5.5 of 12 TB.

(me trying to find similarities)
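If Explorer is too slow to total it, PowerShell can sum the trash folder directly; a minimal sketch, assuming a hypothetical data path of D:\storagenode\storage\trash (adjust it to your node):

  # Sum the size of all files under the trash folder (the path is an example)
  $trash = Get-ChildItem -Path "D:\storagenode\storage\trash" -Recurse -File -ErrorAction SilentlyContinue
  "{0:N2} GB" -f (($trash | Measure-Object -Property Length -Sum).Sum / 1GB)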

Using "Get-Content “C:\Program Files\Storj\Storage Node\storagenode.log” | sls “fatal error” | select -last 10 " does not show errors.

I have given the node free space again, to see whether the problem of the node shutting down reappears.

It took a bit longer, but in the end the node crashed after about 6 hours.

Before getting it back up and running, I connected the hard drive internally to a SATA port on the motherboard, just to be sure. (I expected it to fail anyway, but it cost nothing to try.) After another 6 hours, the error returned. I have moved it back to the external eSATA connection.

In the end I chose to downgrade storagenode.exe from 1.75.2 to 1.74.1. Since that version does not enforce the verification timeout, the service does not stop and works correctly. Let's hope it stays that way. Now I am waiting for a new version that lets us choose whether this verification is enforced or not.

I got a 6 TB HDD and tried to copy my 3 TB HDD onto it. Partition Magic ended up producing about 180 GB of logs full of errors, but there are no SMART errors and no errors from the Windows disk check tool. I am running a surface scan now to understand more.

Adding "# storage2.piece-scan-on-startup: false " does not remove the file scan on startup.

Have I entered the wrong text in config.yaml?

The hard drive has been at 98% usage for many hours since I started the node again.

Delete the # before it, then it will work.

1 Like

The # marks a comment; you need to remove the # for the line to actually be applied.
https://www.educative.io/answers/how-to-write-comments-in-yaml
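For illustration, the difference in config.yaml looks like this (the value is just the one from this thread):

  # commented out - the node ignores this line and keeps the default behaviour (scan enabled):
  # storage2.piece-scan-on-startup: false

  # active - the node applies it and skips the piece scan on startup:
  storage2.piece-scan-on-startup: false

The service needs a restart after editing config.yaml for the change to take effect.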

Thank you very much, I’ll try removing #.

I already wrote that I had this problem with the node stopping. As @Alexey advised, I increased the timeout up to 3 minutes (in +30 second steps), but the error persisted.
But since I had already planned to move this node to new hardware, I did not keep monitoring it.
Now the node is on the new hardware, and I have not changed any timeout settings.
It has been running for 36 hours without stopping.
So I am inclined to believe that the error was caused by the hardware, which is what @Alexey warned about.
Version v1.75.2
New hardware: P8H61-M LX3, i3-3220, 8 GB RAM, WIN10 LTSC
In the future I plan to replace the motherboard with one that has 6 SATA ports.
If the problem comes back, I will write here.

1 Like

I got the “unexpected shutdown of a runner” error too, approximately 26 hours ago. First one node, then after half an hour or so the second node restarted.
Synology NAS 216+ with 1 GB RAM, Docker, 2 nodes on 2 disks, IronWolf 8 TB, with 6.2 TB and 3 TB occupied. Both ext4, record file access time set to never, FW disabled, tcp fastopen 3 on the host, both on version 1.75.2. So it is happening on ext4 Linux too, but on very limited resources. Windows is more demanding, and maybe that is why it crashes on stronger hardware than Linux.

2 Likes

I think the filesystem on Linux is more optimized than on Windows, and Linux caching likely works better than Windows caching as well.
So perhaps it all comes down to the RAM available for OS caching (the node itself usually doesn't require a lot of RAM unless your disk subsystem is too slow).

The node suddenly crashes; it has already happened 3 times. In the image you can see the error it gives me. How can I solve this problem?


Hello, the problem you are having is the same as the one already discussed in this thread.

Fatal Error on my Node - Node Operators / troubleshooting - Storj Community Forum (official)

3 Likes

The problem is related to disk slowness. We have not come to a common conclusion yet, but disks are usually slow on writes if they are:

  • SMR
  • used for something else (for example, more than one node on the same disk, or using the system disk for the node, etc.)
  • network connected drive/network filesystem (SMB/CIFS, NFS, etc.)
  • external disks

The current suggestion is to check the disk for errors and fix them. If there are no issues with the disk itself and it is just slow, you may increase the write timeout a little (the current default is 1m0s; you may try increasing it in 30s steps):
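A minimal sketch of what that could look like in config.yaml, assuming the option names storage2.monitor.verify-dir-writable-timeout and storage2.monitor.verify-dir-readable-timeout (please check the names against your own config before applying):

  # assumed option names - raise the writability check timeout from the default 1m0s in 30s steps
  storage2.monitor.verify-dir-writable-timeout: 1m30s
  # the readability check has a matching timeout that can be raised the same way
  storage2.monitor.verify-dir-readable-timeout: 1m30s

Restart the storagenode service after editing config.yaml.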

No new bug was found in the new version; instead we fixed an old bug where the readability and writability checkers did not have a timeout at all and could hang forever in the background if the disk is too slow, too busy, or your setup has hardware issues. These checkers were designed to protect your node from disqualification if your disk is not writable or not readable, but they did not work in the case of partial hangs. Now we have added a timeout, and this disk unavailability has become visible.

2 Likes

For Windows users, isn't there a way to set up an automatic restart of the storagenode service after a crash? I don't understand well what all those parameters in Task Scheduler do, or whether it can be set up that way, but maybe a script would be useful. I believe there are many scripts for crypto miners that could be adapted for Storj. Maybe someone will find a solution and link it / describe it here.

You may do so in the Services applet on the Recovery tab of the storagenode service.
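The same can be scripted from an elevated prompt; a sketch, assuming the service is named storagenode (restart after 60 seconds on each failure, failure counter reset after one day):

  # Configure automatic restart of the storagenode service on failure (run as Administrator)
  sc.exe failure storagenode reset= 86400 actions= restart/60000/restart/60000/restart/60000

Keep in mind that an automatic restart only masks the underlying writability timeout; the slow or failing disk still needs to be addressed.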

1 Like

Everything was working fine, but this week the node stopped and showed this error, so I updated to the latest version and it is still crashing. I am leaving the latest lines of the logs here:
2023-04-02T11:50:52.243-0600 INFO piecestore upload canceled {“Piece ID”: “QQMXZMQOPDL6XGUGWFIJQOMAY3IECMYQMS2B77BLJAAKTBD7X7YA”, “Satellite ID”: “12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S”, “Action”: “PUT”, “Size”: 0, “Remote Address”: “216.66.40.83:24410”}
2023-04-02T11:50:52.243-0600 INFO piecestore upload canceled {“Piece ID”: “ELPD2TYZHXIJRZPVDP2UQ5HSAF2TH6RZ6VTJBZRG3ONEGK37LD3A”, “Satellite ID”: “12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs”, “Action”: “PUT”, “Size”: 0, “Remote Address”: “5.75.175.75:50364”}
2023-04-02T11:50:52.243-0600 INFO piecestore upload canceled {“Piece ID”: “V4DI4Q7NUJCRI6ETAHOO5OLZMMS5ZA6EQBEXM6L5RWIC3QT4HQHA”, “Satellite ID”: “12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs”, “Action”: “PUT”, “Size”: 0, “Remote Address”: “128.140.6.9:17348”}
2023-04-02T11:50:52.243-0600 INFO piecestore upload canceled {“Piece ID”: “FDIMDHYGCJTSP4FIIN7ED7VR24QLOX3FPJQJIMTVDHYY7FHTFOUA”, “Satellite ID”: “12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs”, “Action”: “PUT”, “Size”: 0, “Remote Address”: “128.140.6.12:62212”}
2023-04-02T11:50:52.243-0600 INFO piecestore upload canceled {“Piece ID”: “EPOMB2OI36PBLSJ4BWY3PQKTB4IGBQMMXUYRDURZ2AHR2PYVV7HQ”, “Satellite ID”: “12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs”, “Action”: “PUT”, “Size”: 0, “Remote Address”: “184.104.224.98:20176”}
2023-04-02T11:50:52.243-0600 INFO piecestore upload canceled {“Piece ID”: “EYWR2SOBNKSCLP3B4J7JZ76JYFZ37CK27FANZAWWUXM4KXEBQEOA”, “Satellite ID”: “12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs”, “Action”: “PUT”, “Size”: 0, “Remote Address”: “128.140.5.247:31688”}
2023-04-02T11:50:52.243-0600 INFO piecestore upload canceled {“Piece ID”: “5SGMPLGVO2FBCYNZRUYDNLLF52Y5A5FSZ5LYIFMQCKIE3S7ECTRA”, “Satellite ID”: “12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs”, “Action”: “PUT”, “Size”: 0, “Remote Address”: “184.104.224.99:58276”}
2023-04-02T11:50:52.243-0600 INFO piecestore upload canceled {“Piece ID”: “75HN2DEDYFCYE2QBYQFKMRJC4D3MZPL5OSOVZIQUMV5556PV4FIA”, “Satellite ID”: “12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs”, “Action”: “PUT”, “Size”: 0, “Remote Address”: “23.88.99.251:21468”}
2023-04-02T11:50:52.243-0600 INFO piecestore upload canceled {“Piece ID”: “UAF2LZ3KBJZN2P7A53CBTF3HUIYMKSIGL3EDMVUYK3T4X5QJFUZA”, “Satellite ID”: “12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs”, “Action”: “PUT”, “Size”: 0, “Remote Address”: “128.140.6.2:20572”}
2023-04-02T11:50:52.243-0600 INFO piecestore upload canceled {“Piece ID”: “PKGR3UQBGBAXQGGZRZ2PCQ3HONJY6E3EG23BFQXPHM56BCDKC7XA”, “Satellite ID”: “12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs”, “Action”: “PUT”, “Size”: 0, “Remote Address”: “23.88.99.251:28596”}
2023-04-02T11:50:52.243-0600 INFO piecestore upload canceled {“Piece ID”: “X6XMW636JEOFZTC77VOGNMMO267FO2VNPZLAL5Q6XOE3GGJF6ECQ”, “Satellite ID”: “12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs”, “Action”: “PUT”, “Size”: 0, “Remote Address”: “159.69.199.209:31482”}
2023-04-02T11:50:52.243-0600 INFO piecestore upload canceled {“Piece ID”: “23YXWKITUYM3OK2UDJ563P27QXV7CPKLPEXNTYGEI6ZYDEZGLFXQ”, “Satellite ID”: “12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S”, “Action”: “PUT”, “Size”: 0, “Remote Address”: “216.66.40.83:14638”}
2023-04-02T11:50:52.243-0600 INFO piecestore upload canceled {“Piece ID”: “5QZZ5VGEKFGM3G5Q5Q6B3B5MSLDAZ2CNGSEV6FTACQXYAIFNLYGA”, “Satellite ID”: “12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs”, “Action”: “PUT”, “Size”: 0, “Remote Address”: “216.66.40.82:52022”}
2023-04-02T11:50:52.244-0600 INFO piecestore upload canceled {“Piece ID”: “ZQDGHURXGF67DT2FJ4ASJSTZFN227NROU66ZQG2GQF5MIE35NLTQ”, “Satellite ID”: “12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs”, “Action”: “PUT”, “Size”: 0, “Remote Address”: “72.52.83.202:22478”}
2023-04-02T11:50:52.244-0600 INFO piecestore upload canceled {“Piece ID”: “6TCFM7MMEMI2XNXZQD4ENC2B3N4Q6MVQDF5R6F2R3JQC7WKYW6UQ”, “Satellite ID”: “12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S”, “Action”: “PUT”, “Size”: 0, “Remote Address”: “216.66.40.82:53962”}
2023-04-02T11:50:52.244-0600 INFO piecestore upload canceled {“Piece ID”: “3KJFYLDL452B4JCYCFQIN5HCMYM4G6X4MZEFLUZG32ICIDL2QQUA”, “Satellite ID”: “12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB”, “Action”: “PUT_REPAIR”, “Size”: 0, “Remote Address”: “168.119.234.81:50596”}
2023-04-02T11:50:52.244-0600 INFO piecestore upload canceled {“Piece ID”: “4SXYRLBJKV6TTRCNYEKGB66WIK7Y6HSRDHNYEFY7BW6QFKIVGK7A”, “Satellite ID”: “12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S”, “Action”: “PUT_REPAIR”, “Size”: 0, “Remote Address”: “5.161.74.111:36608”}
2023-04-02T11:50:52.244-0600 INFO piecestore upload canceled {“Piece ID”: “NXM2KPBMUW4CIU646P5BWD3OBSPYRI7IZVIY25WJAF4C6PWGLOJQ”, “Satellite ID”: “12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S”, “Action”: “PUT_REPAIR”, “Size”: 0, “Remote Address”: “5.161.44.25:47830”}
2023-04-02T11:50:54.868-0600 INFO piecestore uploaded {“Piece ID”: “PA5LTNXOB5KXFZQ3FFIWDI5SGHEVXRIVC3Z2HND5HQYMR4D7ZPUQ”, “Satellite ID”: “12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S”, “Action”: “PUT”, “Size”: 239104, “Remote Address”: “47.184.58.67:57756”}
2023-04-02T11:50:55.921-0600 FATAL Unrecoverable error {“error”: “piecestore monitor: timed out after 1m0s while verifying writability of storage directory”, “errorVerbose”: “piecestore monitor: timed out after 1m0s while verifying writability of storage directory\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2.1:150\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2:146\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75”}

Welcome! Windows node? Big trash folder, maybe? A several-month-old node that is not full / half full? Let's move to:

1 Like

Yes, it is a Windows node. I checked the disk space and left a margin: it is a 4 TB disk and I only use 3.2 TB for Storj. I did a clean installation some hours ago, but something is still killing the node.