CPU usage high?

I will give it a go and report back in a week or so. Thanks!


I’ve never seen 2.5" CMR disks bigger than 2TB, so I’d be pretty sure your disk is SMR. Even most 1TB disks are SMR these days, so manufacturers can cut prices down.

Personally, I’m using the --storage2.max-concurrent-requests option to go easy on such drives. Not ideal for the network, but that’s the only way I found to make a single SMR drive cope with the load. Without this option, my disk stalled so much it got my node suspended because it was unresponsive with all its databases locked.
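For reference, the option can be set either as a flag appended after the image name in the docker run command, or in config.yaml. A minimal sketch of the config.yaml form (the value 10 is only an illustrative starting point, not a recommendation):

```yaml
# config.yaml — cap how many requests the node handles at once.
# 0 means unlimited; pick a value your SMR drive can actually sustain.
storage2.max-concurrent-requests: 10
```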

This option should be used as a last resort I guess, so if you can go with 2 disks maybe it’ll be alright, even though the new node won’t be very active at first.
But not all SMR drives are created equal, so… worth a try :slight_smile:

i’ve used max concurrent for a long time, and my main issue with it was that it would create its own errors, simply caused by having a max set…
but ofc a few software problems rather than wearing a hdd out fast… :smiley: that’s an easy choice…

i would also recommend the dual node or triple node approach… after x3 the iops performance gains start to diminish so much it’s not worth it… i mean adding a 2nd node is +100% iops performance, while adding a 3rd is 2 nodes + 1 and thus only a 50% iops performance increase…

so to extend hdd life i would recommend anyone running storj to run 2 or 3 hdd / nodes or similar setups.

by doing almost everything i can to remove iowait, my storagenode errors are basically non-existent…

this morning’s log successrate:

i know my uploads are slacking a bit, but i hope to fix that when i install my new ssd which works as a write cache… the current two SSDs i got can’t keep up… which is weird because we are uploading… but apparently that causes a ton of writes… and ofc i’m running everything with sync writes for various reasons.

afaik anyways…

========== AUDIT ==============
Critically failed:     0
Critical Fail Rate:    0.000%
Recoverable failed:    0
Recoverable Fail Rate: 0.000%
Successful:            316
Success Rate:          100.000%
========== DOWNLOAD ===========
Failed:                0
Fail Rate:             0.000%
Canceled:              4
Cancel Rate:           0.018%
Successful:            21878
Success Rate:          99.982%
========== UPLOAD =============
Rejected:              0
Acceptance Rate:       100.000%
---------- accepted -----------
Failed:                0
Fail Rate:             0.000%
Canceled:              5
Cancel Rate:           0.058%
Successful:            8675
Success Rate:          99.942%
========== REPAIR DOWNLOAD ====
Failed:                0
Fail Rate:             0.000%
Canceled:              0
Cancel Rate:           0.000%
Successful:            8829
Success Rate:          100.000%
========== REPAIR UPLOAD ======
Failed:                0
Fail Rate:             0.000%
Canceled:              0
Cancel Rate:           0.000%
Successful:            2104
Success Rate:          100.000%
========== DELETE =============
Failed:                0
Fail Rate:             0.000%
Successful:            13100
Success Rate:          100.000%

the iowait spikes are from a bad harddrive that i still haven’t been able to fix… it keeps turning itself off, but it’s not too bad when it doesn’t get too much traffic… :smiley:

oh yeah, you should check that APM in the HDD firmware is turned off; it has helped my latency… the drive is still broken… but it works better, and even all my other hdds seemed to reduce latency / iowait in general.
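For anyone wanting to try this on Linux, APM can usually be inspected and turned off with hdparm — a sketch only, assuming the drive’s firmware honors it (/dev/sdX is a placeholder, and the setting may not persist across reboots):

```shell
# Read the current APM level (1-254; "APM_level = off/not supported" means disabled)
sudo hdparm -B /dev/sdX

# Turn APM off entirely (255 = disabled; 254 = max performance with APM still on)
sudo hdparm -B 255 /dev/sdX
```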

To be precise, there is one Toshiba CMR model with 3TB; all other 2.5in drives above 2TB use SMR.

Sure, but personally what I really dislike about this option is that it doesn’t feel like the best approach to me. Ideally it should be dynamic: the node software should adjust it in real time, rejecting uploads (ingress) when the disk can’t keep up (for instance, when the average write speed of pieces falls below some threshold?), but going back to accepting more pieces in parallel when the disk is “calm” again.

Because right now, it statically caps the number of files to be written in parallel even though there are moments where SMR disks can cope with way more than that.


Thanks everyone for your feedback.

I have been trying all night to get my second node up. I am hosting it on a second machine with this command:

sudo docker run -d --restart unless-stopped --stop-timeout 300 \
-p 28967:28967 \
-p 14002:14002 \
-e EMAIL="XXXXX@gmail.com" \
-e ADDRESS="XXXX.ddns.net:28968" \
--mount type=bind,source=/home/XXX/storj/Identity/storagenode,destination=/app/identity \
--mount type=bind,source=/mnt/storj2,destination=/app/config \
--name storagenode2 storjlabs/storagenode:latest

and I currently have my port forwarding set up like this (I have tried many different combos)

I have been reading this guide (Setting up second machine with storj on same network) and i am trying to use:

node2: outside world => 28968 => router => 28967 => node machine2 [ => 28967 => docker container]

I think I have set it up right but Storj is showing offline? Happy to try another way if there is a better way. I might even stick it on the same box if you think that is better / easier

The port tester at https://www.yougetsignal.com/ is saying port 28968 for my IP is open, so I think the port forwarding is set up OK?

This setup only works if you’re running the second node in a different machine. Hence why it says node machine2.

If you’re running them on the same machine you will need to do this.
node2: outside world => 28968 => router => 28968 => node machine [ => 28967 => docker container]

Meaning the docker run command will have -p 28968:28967 and the router forwards public port 28968 to 28968 on the node machine.
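In other words, for the same-machine case the relevant flags would look roughly like this — a sketch only, with EMAIL, the identity/storage mounts, and other flags omitted; 14003 for the second dashboard is my assumption so it doesn’t clash with the first node:

```shell
# Hypothetical second node on the SAME machine:
# host port 28968 maps to the container's internal 28967.
sudo docker run -d --restart unless-stopped --stop-timeout 300 \
  -p 28968:28967 \
  -p 14003:14002 \
  -e ADDRESS="XXXX.ddns.net:28968" \
  --name storagenode2 storjlabs/storagenode:latest
```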

I am running the second Storj node on a second box. So I have specified:

-p 28967:28967

and set my router to send incoming data from 28968 to 28967

Please, check the second identity: https://documentation.storj.io/dependencies/identity#confirm-the-identity
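Per that documentation page, the confirmation boils down to counting certificates in the identity files; a sketch, with the path borrowed from the docker command earlier in the thread (adjust to wherever the second identity actually lives):

```shell
# A signed identity has 2 certificates in ca.cert and 3 in identity.cert;
# lower counts mean the identity was never authorized with a token.
grep -c BEGIN /home/XXX/storj/Identity/storagenode/ca.cert
grep -c BEGIN /home/XXX/storj/Identity/storagenode/identity.cert
```

If the counts come back lower than 2 and 3 respectively, the identity still needs to be authorized with a fresh token.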
Also, it should be unique (new) identity and new authorization token to sign it.

You are right. I generated a new cert but used the same email as my first node. I thought it would be fine, but I think I needed to use a new email address? Even though the key after the email address did not match my original auth token. It is up and running now.

You do not need a different email address, but the different authorization token. Perhaps you didn’t authorize it before.

it’s actually not a bad idea… to put some sort of latency tracking on the pieces so that the software can detect when the disk falls behind… but then one is into rejecting uploads, and presently the only option would be to reject downloads, because there is no noticeable ingress…

that means the data is inaccessible to the network, and thus if it falls below the threshold repair will be started, which storj wants to avoid…

so tho it’s a local problem, it’s not a global problem… and fixing the local problem makes it a global problem i guess… so i suppose in theory it’s not easy… maybe that’s why it hasn’t been done yet

As far as I know, the --storage2.max-concurrent-requests option is only capping uploads (ingress).
It does not impact clients’ downloads, so data would still be available to the network.

dunno how max concurrent works these days, but when i used it 3-4 months ago it had a lot of issues… even if i ran it at 20-40, the incoming deletions and other such node work commands / cleanup or whatever it was would run into the max concurrent limit and cause the db locked issue… but i do believe a lot of effort has gone into smoothing that out…

yeah, it would make sense if it only affects uploads; uploads don’t matter, someone else can take those…
i was happy to get away from it, and tho i didn’t see it at the time, it was a cause of much grief… which vanished when i finally got my system running fast enough that the storage could keep up with whatever the storagenode was doing.

not like my setup was crazy slow to begin with… but had some bad configuration issues, like i was mixing sas and sata disks in the same vdev’s or arrays whatever we want to call it…

after i reduced my hdd latency and set max concurrent to infinite, my system started to run nearly error free, maybe 1 per 24 hours on avg, and sometimes days without a single one

Well, this is interesting. My original node has just been disqualified, about 1 week after bringing up a new node on a separate machine but the same IP.

Disqualification can happen only for failed audits. An audit can fail if the storagenode is unable to provide a requested piece for audit, either because it’s lost, inaccessible, or corrupted, or because of 4 timeouts on the same piece.
Please search for GET_AUDIT and failed in the same line in your logs.

Do you have the command I need to get the logs?

You can use scripts from this article:

I get zero results when I run docker logs storagenode 2>&1 | grep GET_AUDIT | grep failed

I do get results if I run docker logs storagenode 2>&1

Last night the node was showing offline. I left it on all day and the Storj node is now showing online, but I am disqualified from one satellite. Can I leave it running, or should I turn the node off?