How I fixed my RPi overloading because of 100% HDD usage

GiantJack · November 4, 2022, 4:36am

Today my node at home, running on an old RPi 3, went offline after about 1 month without problems.

After a reboot I noticed:
syslog:
Nov 4 00:20:46 raspberrypi kernel: [12930.050559] INFO: task storagenode:1647 blocked for more than 120 seconds.
storagenode:
2022-11-04T03:44:39.603Z ERROR piecestore failed to add bandwidth usage {"Process": "storagenode", "error": "bandwidthdb: database is locked", "errorVerbose": "bandwidthdb: database is locked\n\tstorj.io/storj/storagenode/storagenodedb.(

The load average was going crazy, like 50 or even 90.

After about 2 hours it stopped responding: ping still worked, but SSH did not, and the XCH (Chia) miner was shown as offline too.

The easy fix for me was to limit the IOPS that Docker can use on the USB HDD. First, the disk layout:

root@raspberrypi:~# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/root        29G  8.0G   20G  30% /
devtmpfs        422M     0  422M   0% /dev
tmpfs           455M     0  455M   0% /dev/shm
tmpfs           455M  7.1M  448M   2% /run
tmpfs           5.0M  4.0K  5.0M   1% /run/lock
tmpfs           455M     0  455M   0% /sys/fs/cgroup
/dev/mmcblk0p6  253M   49M  204M  20% /boot
/dev/sdh2       7.3T  7.3T  1.3G 100% /p7
/dev/sdf1       4.6T  4.6T  2.6G 100% /p3
/dev/sda1       7.3T  7.3T  1.4G 100% /p4
/dev/sdd2       7.3T  7.3T  2.1G 100% /p6
/dev/sdc1       7.3T  7.3T   51G 100% /pool
/dev/sdb1       7.3T  7.3T  1.2G 100% /p5
/dev/sdg1       9.1T  9.1T  1.8G 100% /p8
/dev/sde1       4.6T  1.5T  3.0T  34% /store

My node data is on /dev/sde1 (mounted at /store).

Remove the old storagenode container and add this to the docker command: --device-write-iops /dev/sde:300 --device-read-iops /dev/sde:300

My full docker command looks like this:
docker run -d --restart unless-stopped --stop-timeout 300 \
    --device-write-iops /dev/sde:300 --device-read-iops /dev/sde:300 \
    -p 28967:28967/tcp -p 28967:28967/udp -p 127.0.0.1:14002:14002 \
    -e WALLET="mywallet" -e EMAIL="mymail" -e ADDRESS="myip:28967" -e STORAGE="4TB" \
    --user $(id -u):$(id -g) \
    --mount type=bind,source="/root/.local/share/storj/identity",destination=/app/identity \
    --mount type=bind,source="/store/real",destination=/app/config \
    --name storagenode storjlabs/storagenode:latest
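
To check that the limits from the command above were actually recorded for the container, something like the following might help. It is only a sketch: the helper name show_iops_limits is made up for illustration, and it assumes the container is called storagenode as in the post.

```shell
# Print the IOPS throttles Docker recorded for a container.
# show_iops_limits is a hypothetical helper name; the default container
# name "storagenode" is taken from the docker command in the post.
show_iops_limits() {
  container="${1:-storagenode}"
  docker inspect \
    --format 'read={{json .HostConfig.BlkioDeviceReadIOps}} write={{json .HostConfig.BlkioDeviceWriteIOps}}' \
    "$container"
}
```

Running show_iops_limits with no arguments should list /dev/sde with a rate of 300 for both read and write if the flags took effect.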

HDD usage is down to 60-80% (from 100% all the time).
Load average: 4.82, 4.30, 4.23 (was 50 or more sometimes).

I guess it will help the HDD not die too fast; it can't be good to sit at 100% all the time.
It's a 5TB 2.5" HDD. I guess with a faster 3.5" drive you can use more than 300 IOPS.

I will watch it for a few days to see if this keeps working, but it looks very good now compared to before.

Bivvo · November 4, 2022, 5:11am

SMR drive, right? If so, that’s the reason. Change to CMR drives.

Pac · November 4, 2022, 9:45pm

I would guess so too.

In any case, I’m not sure limiting iops will help. I mean it will ensure the system stays responsive, but the node will only run fine as long as it has enough RAM: the node application is known to stack up received ingress pieces in RAM while the disk is busy. This is usually just for a short period of time as a buffer, but when ingress comes in faster than what the disk can cope with (and 2.5" SMR disks usually perform poorly in that regard), data starts stacking up in RAM up to the point where the OOM Killer has to shut the node down abruptly, which isn’t good…
My guess is that limiting iops might make this issue happen even faster.

I may be wrong, but I’d keep an eye on the container uptime and the logs, to be sure it doesn’t get killed and restarted regularly (which also could explain a continuous 100% usage because of the filewalker that gets triggered at node start up - this can be disabled now by the way).
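
One possible way to keep an eye on uptime and logs, as a rough sketch only: the helper names node_health and count_db_locked are made up, the container name storagenode comes from the first post, and the 24h window is an arbitrary choice.

```shell
# Show when the container last started, how often Docker restarted it,
# and whether the last exit was an OOM kill.
# node_health and count_db_locked are hypothetical helper names.
node_health() {
  container="${1:-storagenode}"
  docker inspect \
    --format 'started={{.State.StartedAt}} restarts={{.RestartCount}} oom_killed={{.State.OOMKilled}}' \
    "$container"
}

# Count log lines from the last 24 hours that mention a locked database,
# the error quoted in the first post.
count_db_locked() {
  container="${1:-storagenode}"
  docker logs --since 24h "$container" 2>&1 | grep -c 'database is locked'
}
```

A restart count that keeps growing, or a start time that is always recent, would suggest the node is being killed and restarted regularly.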

GiantJack · November 6, 2022, 8:42am

Indeed it's an SMR drive, I did not know… The node is down again. I'm trying to copy the data to another HDD now, but it's really slow, about 700 MB/min… It will take 37 hours for the 1.7TB using Clonezilla.
Any idea why it's that damn slow?
edit: speed is getting even lower, about 500MB/min now with 50 h remaining
edit²: speed is getting better, 700MB/min again, guess I'll just wait
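
As a rough sanity check on those estimates, the copy time follows directly from size and throughput. This is a small sketch with a made-up helper name (eta_hours), assuming decimal units (1 GB = 1000 MB) and a constant rate.

```shell
# eta_hours SIZE_GB RATE_MB_PER_MIN
# Prints the estimated copy time in hours with one decimal.
# Assumes 1 GB = 1000 MB and a constant transfer rate.
eta_hours() {
  awk -v gb="$1" -v rate="$2" 'BEGIN { printf "%.1f\n", gb * 1000 / rate / 60 }'
}

eta_hours 1700 700   # prints 40.5
eta_hours 1700 500   # prints 56.7
```

That is in the same range as the 37 h and 50 h Clonezilla reported; the gap could come from data already copied or from different size units.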

Alexey · November 6, 2022, 9:29am

It’s slow by design. You may also use (almost) online migration: How do I migrate my node to a new device? - Storj Node Operator Docs
It will be much slower of course, but with almost no downtime. However, it’s better to set the allocated space to 500GB to stop uploads and reduce load on your slow HDD.
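
A minimal sketch of what such an almost-online migration could look like with rsync. The linked guide is the authoritative source. The destination path /mnt/newdisk/real/ and the helper names sync_pass and final_pass are placeholders; the source path and container name are taken from the docker command earlier in the thread.

```shell
# SRC is the node data directory from the docker command in the first post.
# DST is a hypothetical mount point for the new disk.
SRC="/store/real/"
DST="/mnt/newdisk/real/"

# Copy pass that can run while the node is still online.
# Repeat it until a pass finishes quickly.
sync_pass() {
  rsync -aP "$SRC" "$DST"
}

# Final pass: stop the node first, then mirror deletions as well,
# so the copy matches the source exactly.
final_pass() {
  docker stop -t 300 storagenode &&
    rsync -aP --delete "$SRC" "$DST"
}
```

After the final pass, the container would be removed and recreated with its config mount pointing at the new location. Lowering STORAGE in the docker command beforehand, as suggested above, keeps new uploads from landing on the slow disk while the passes run.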

GiantJack · November 6, 2022, 9:36am

Thanks, I will wait the ~37 hours for now… Online migration won't work because the node goes down after some hours anyway (stupid SMR drives). Maybe it would work with lower allocated space as you said, but I won't try it.

As it's running in a VM, I can use my PC normally alongside.

GiantJack · November 7, 2022, 2:22am

What a stupid tool that Clonezilla is:

There are no bad sectors… Back to rsync, but I got the node running on my main PC for now, with the allocated space set so that no new files are uploaded.

Toyoo · November 7, 2022, 3:46pm

Wouldn’t be surprised if this “badblock” was caused by a bad cable. I used to have a cable that was silently corrupting data on my USB HDD at a rate of around 1 bad block per ~60 GB of data. Mostly undetectable unless you run a tool like clonezilla.

GiantJack · November 7, 2022, 11:38pm

I guess not: rsync is at ~400GB now, faster than I expected.