Storj node on Linux "bad sectors or a corrupted file system"

Lukasz · January 16, 2024, 9:49pm

Hi there,

Sorry if I created a duplicate topic, but I could not find a resolution for my situation and there is not much time left.
About 1h after starting, my node stops responding to QUIC (it is online and you can see the dashboard, but without any data) and QUIC is shown in red as “misconfigured”. It is configured correctly, as I did not change anything and the node has been working for 1-2 years. I checked that both ports 14002 and 28967 are open. After 1h, 28967 shows as closed.
It already went from 97% (yellow) on all servers to 95% (red). This is not an issue with port forwarding.
It is something with the disk or some corrupted data. The logs point to “… This is most likely due to disk bad sectors or a corrupted file system. Check your disk for bad sectors and integrity; filestore error: config/storage/trash/ukfu6bhbboxilvt7jrwlqk7”.
So should I stop the node and run some program on Linux to check file integrity? Scan for bad sectors, or remove some DB files and resync? That will take a lot of time. What can I do to quickly fix the issue so that I do not get permanently disqualified?
I believe it is probably just a corrupted file, not a bad disk, but how can I verify that?
p.s. The disk is correct, not SMR.

Please help

I cut all sensitive data, I think.
Logs and errors:

daki82 · January 17, 2024, 12:07am

You are not in a hurry: you can be offline for 12 out of 30 days.
So just stop the node and run the appropriate program for a filesystem check.
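
Which program is appropriate depends on the filesystem on the data disk. Below is a minimal sketch that maps a filesystem type to a checker. The function name `suggest_checker` and the mapping are illustrative assumptions, not an exhaustive list. As far as I know, `ntfsfix` on Linux only does limited repairs, so for NTFS the sketch points at `chkdsk` on Windows.

```shell
#!/bin/sh
# Sketch: suggest a filesystem checker for a given filesystem type.
# suggest_checker is a hypothetical helper; the mapping is illustrative only.
suggest_checker() {
    case "$1" in
        ext2|ext3|ext4) echo "e2fsck" ;;      # run on the unmounted partition
        ntfs|ntfs3)     echo "chkdsk" ;;      # run on Windows
        fuseblk)        echo "ambiguous" ;;   # often NTFS via ntfs-3g; confirm with: lsblk -f
        xfs)            echo "xfs_repair" ;;
        *)              echo "unknown" ;;
    esac
}

# Usage (not run here; the path is a placeholder):
#   docker stop -t 300 storagenode
#   suggest_checker "$(findmnt -n -o FSTYPE --target /path/to/storage)"
```

Stopping the node first matters because a checker should not run on a filesystem that is still being written to.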

Lukasz · January 17, 2024, 10:05am

Thank you @daki82, you give me hope. I have now checked the file system, and that HDD is formatted as NTFS. So that means I can take it to a Windows PC and just run error-checking. That should not mess up any stored files if the HDD is OK. Is that the correct way and the appropriate program?

daki82 · January 17, 2024, 10:52am

I think so.

Maybe wait a month for the online score to recover.
If the node does not run fine after that,

consider copying the data to a safe place, reformatting the drive to ext4 and copying the data back, then run the node with the right parameters for the data path.

NTFS and Linux is an odd combo, though I don't know if it matters for 2 TB.
Think about leaving 10% free space on the disk too.

Lukasz · January 17, 2024, 4:10pm

Hi @daki82,

The disk was tested on Windows; there were some errors that were fixed automatically after the scan. It is installed back in the node and working for now, and I do not see those errors anymore. I will check again after 3-5h and let you know, but it looks like it works again.

Thank you again

Alexey · January 18, 2024, 4:52am

If you use Windows 10 or newer, please disable deduplication and compression for the disk before processing it under Windows; otherwise it could become incompatible with Linux, and you will get corruption every time you use it under Linux.
For example:

Lukasz · January 18, 2024, 10:24am

Hi there, all seems to be working correctly, thank you @daki82. The discussion can be closed. Hi @Alexey, thank you for the article. I need to check that, as all my Storj nodes have NTFS partitions for customer data. I did that at the beginning because I thought it would be easier to connect the disk to Windows for a scan or any other HDD maintenance. I do not know Linux and its disk-related command-line tools very well, which is why I chose NTFS. Again, thank you both for the support and help.

Lukasz · January 22, 2024, 1:41pm

Hi guys,

Sorry to bother you again. Unfortunately, it worked for a few days, and now the node is again showing offline 1h after I restart it.
It shows the error below:

I read "How to fix a “database disk image is malformed”". I was able to find all of the “db” files on my node, tested them, and got the output “OK”.
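
For anyone repeating that test, a minimal sketch of checking every database file in one go is shown here. It assumes the `sqlite3` command-line tool is installed and the node is stopped; `check_dbs` is a hypothetical helper name and the path in the usage comment is a placeholder.

```shell
#!/bin/sh
# Sketch: run SQLite's integrity check on every *.db file in a directory.
# Assumes the sqlite3 CLI is installed and the node is stopped.
check_dbs() {
    dir="$1"
    for db in "$dir"/*.db; do
        [ -e "$db" ] || continue      # skip when the glob matched nothing
        printf '%s: ' "$db"
        sqlite3 "$db" "PRAGMA integrity_check;"
    done
}

# Usage (not run here; the path is a placeholder):
#   check_dbs /path/to/storage
```

A healthy database prints `ok` for this pragma; anything else is a list of the problems SQLite found.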

Then another issue occurred when I restarted the node:

I read everything in "Linux, “failed to sufficiently increase receive buffer size”" and in "How to increase the UDP Receive Buffer Size in the docker".

So I set net.core.rmem_max=2500000 and rebooted the node, but still got the error.

I also see this error in the logs:

For that I tried this, but did not see any difference: INFO http: Accept error: accept tcp XXX.XXX.XXX.XXX:28967: accept

The only thing I did not try that could be related is the topic "How to fix database: file is not a database error".

I started wondering whether the HDD is SMR instead of CMR, but it looks like it is CMR, as it should be.

I also spotted that my other node has a lower version; the one with the issue has v1.95.1.

Any help is welcome
Thank you,

Alexey · January 23, 2024, 3:47am

“Database is locked” means that your disk cannot keep up with the load; you likely need to move the databases to an SSD.

Please also check the current value:

sysctl net.core.rmem_max
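
A minimal sketch of comparing that value against a target is shown here. The target 2500000 is the value mentioned earlier in this thread; `rmem_ok` is a hypothetical helper name. The commented commands for applying and persisting the setting need root and assume `/etc/sysctl.conf` is read at boot on your distribution.

```shell
#!/bin/sh
# Sketch: compare the current UDP receive buffer ceiling with a target value.
# rmem_ok is a hypothetical helper; 2500000 is the value used in this thread.
rmem_ok() {
    current="$1"
    wanted="${2:-2500000}"
    if [ "$current" -ge "$wanted" ]; then
        echo "ok: $current >= $wanted"
    else
        echo "too low: $current < $wanted"
        return 1
    fi
}

# Usage (not run here):
#   rmem_ok "$(sysctl -n net.core.rmem_max)"
# To apply now and keep the value across reboots (as root):
#   sysctl -w net.core.rmem_max=2500000
#   echo "net.core.rmem_max=2500000" >> /etc/sysctl.conf
```

If the value reverts after a reboot, the setting was most likely applied only at runtime and not persisted.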

Lukasz · February 4, 2024, 12:48pm

Hi @Alexey ,

After 2 weeks and reading all the resources you pointed to, I hope I have finally fixed all the issues.
- DB issue: fixed, 100% done
- buffer issue: 100% done
- too many open files: not sure yet, we will see after a few hours of running the node.

The only thing I can see in the logs now is:

Regards,

Lukasz · February 4, 2024, 9:23pm

Hi guys,

Is this the correct way to use the flag "--ulimit nofile=262144:262144"?

docker run -d --restart unless-stopped \
    -p 28967:28967/tcp -p 28967:28967/udp -p 14002:14002 \
    -e WALLET="XXX" \
    -e EMAIL="xxx" \
    -e ADDRESS="XXX" \
    -e BANDWIDTH="100TB" \
    -e STORAGE="1.5TB" \
    --mount type=bind,source="/home/Identity/storagenode",destination=/app/identity \
    --mount type=bind,source="/home/Storj_Users_Data",destination=/app/config \
    --mount type=bind,source="/media/stroj_storage_db2/",destination=/app/dbs \
    --ulimit nofile=262144:262144 \
    --name storagenode storjlabs/storagenode:latest \
    --storage2.database-dir=dbs

Probably not, as I still get the issue:
2024-02-04T18:32:58Z INFO http: Accept error: accept tcp [::]:28967: accept4: too many open files; retrying in 1s {"process": "storagenode"}

Ensure that the gateway binary is in /usr/local/bin and is named gateway (or edit the ExecStart line below to reflect the name and location of your binary)

I tried to follow the “scripts/gateway.service” instructions, but I’m doing something wrong, as I do not have an entry at /usr/local/bin/ for the gateway.

Ensure that you’ve run setup and have edited the configuration appropriately prior to starting the service with this script and adjust the --config-dir option in the ExecStart line if necessary

It looks like I do not have access to that file. I tried the command below. How should I edit that command to get access?

I also tried to manually edit the line “#DefaultLimitNOFILE=16777216”.

I think I’ve done that much, but unfortunately, looks like it is still not enough

Regards,

Alexey · February 5, 2024, 1:00am

You do not need to install a gateway; it was just an example of a similar configuration.
The parameter --ulimit nofile=262144:262144 in your docker run command should work.
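
To confirm that the limit is applied and see how close the node gets to it, something like the sketch below could help. It assumes a Linux host with `/proc`, a container named `storagenode` as in the command above, and that the image ships a shell; `count_fds` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: count the open file descriptors of a process via /proc (Linux only).
# count_fds is a hypothetical helper; it prints 0 if the PID cannot be read.
count_fds() {
    ls "/proc/$1/fd" 2>/dev/null | wc -l
}

# Usage (not run here):
#   docker exec storagenode sh -c 'ulimit -n'   # limit as seen inside the container
#   count_fds "$(pgrep -x storagenode)"         # run as root on the host; assumes one matching process
```

If the count sits near the limit shortly before the node goes offline, the limit is probably still too low.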

Lukasz · February 5, 2024, 10:10am

Hi Alexey,

That flag helped: I restarted the node at 21:35 and it went offline at 00:01, according to the ping timeout on xxx:28967.

It ran longer than before when I used that command. Unfortunately, it looks like there is still some issue. I attached below all the errors I could find in the logs:

Kind Regards,

Alexey · February 6, 2024, 2:02am

So it seems this value is not enough. Try increasing it to a bigger value, like 2097152 (because the default was likely 1048576).

Lukasz · February 8, 2024, 9:13pm

Hi Alexey,

It helped, I will be monitoring it, just now I’m worried about my low online status. Hopefully, it will pick up after a few weeks.

Thank you for all your help.

Kind Regards

Alexey · February 9, 2024, 5:08am

You are welcome!
The online score will fully recover after 30 days online. Each downtime requires an additional 30 days to recover.
I’m more worried that the audit score is not 100%. Please check your logs for errors related to audits:

It can recover only if all pieces are intact.
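
A minimal sketch of such a check is shown here. It assumes audit and repair downloads are logged with the actions GET_AUDIT and GET_REPAIR and that failures contain the word "failed"; `count_audit_failures` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: count failed audit/repair downloads in node log text read from stdin.
# Assumes the actions are logged as GET_AUDIT / GET_REPAIR and failures say "failed".
count_audit_failures() {
    grep -E 'GET_AUDIT|GET_REPAIR' | grep -c 'failed'
}

# Usage (not run here), assuming the container is named "storagenode":
#   docker logs storagenode 2>&1 | count_audit_failures
```

A count of zero over a long log window suggests the stored pieces are being served correctly.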