Why such low ingress

YuriyGavrilov · June 30, 2024, 10:11am

This report is from the node with regular restarts: the node shows 3.82 TB used and 7.13 TB free, but the OS reports only 3.3T available (81% used) on /volume2.

rootfs 3.9G 63M 3.8G 2% /
tmpfs 3.9G 0 3.9G 0% /tmp
/dev/md0 2.0G 361M 1.5G 20% /volume0
/dev/loop0 951K 12K 919K 2% /share
/dev/md1 19T 15T 4.0T 79% /volume1
/dev/md1 19T 15T 4.0T 79% /volume1/.@iscsi
/dev/md1 19T 15T 4.0T 79% /volume1/.@plugins
/dev/md2 17T 14T 3.3T 81% /volume2
/dev/md2 17T 14T 3.3T 81% /volume2/.@iscsi
cgroup 3.9G 0 3.9G 0% /sys/fs/cgroup

The second node, the one with “0 ingress”, has some traffic today, so it seems to be a normal situation.

It is just strange that it is so discrete: some days have zero traffic, while others have roughly 300-500 GB.

For this node the OS currently reports 380G available, while the node UI reports 0.67 TB free.

Filesystem Size Used Avail Use% Mounted on
udev 1.9G 0 1.9G 0% /dev
tmpfs 388M 728K 387M 1% /run
/dev/mmcblk1p7 57G 16G 40G 28% /
tmpfs 1.9G 0 1.9G 0% /dev/shm
tmpfs 5.0M 4.0K 5.0M 1% /run/lock
tmpfs 1.9G 0 1.9G 0% /sys/fs/cgroup
/dev/mmcblk1p6 112M 884K 111M 1% /boot/efi
/dev/sda1 2.3T 1.8T 461G 80% /mnt/disk
/dev/sdb4 9.1T 8.2T 380G 96% /mnt/disk1
tmpfs 388M 0 388M 0% /run/user/1000
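
Part of the gap between the OS numbers and the node UI can come from units. Here is a minimal sketch of the conversion, assuming that `df -h` prints binary units (1T = 1024^4 bytes) and that the node dashboard uses decimal units (1 TB = 10^12 bytes). The second assumption is not confirmed anywhere in this thread, and the function name is only illustrative.

```python
# Convert df -h style sizes (binary units) to decimal terabytes.
# Assumption: the dashboard counts 1 TB as 10**12 bytes.
BINARY_UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def df_size_to_tb(size: str) -> float:
    """Convert a df -h size such as '380G' or '3.3T' to decimal TB."""
    value, unit = float(size[:-1]), size[-1].upper()
    return value * BINARY_UNITS[unit] / 10**12


if __name__ == "__main__":
    # Values taken from the two df listings above.
    for mount, size in [("/volume2", "3.3T"), ("/mnt/disk1", "380G")]:
        print(f"{mount}: {size} is about {df_size_to_tb(size):.2f} TB")
```

Under these assumptions 3.3T is about 3.63 TB and 380G is about 0.41 TB, so units alone do not account for the 7.13 TB and 0.67 TB shown as free in the node UI.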

Alexey · June 30, 2024, 10:28am

This depends on the customers’ activity, so, kind of expected.

YuriyGavrilov · June 30, 2024, 11:38am

Yep, it seems to be so. There is only one thing left to figure out: the cause of the regular restarts on the first node…

Today there is a new kind of error:

| Time | Level | Logger | Message | Fields |
|---|---|---|---|---|
|2024-06-30T11:25:33Z|ERROR|services|unexpected shutdown of a runner|{Process: storagenode, name: piecestore:monitor, error: piecestore monitor: timed out after 1m0s while verifying writability of storage directory, errorVerbose: piecestore monitor: timed out after 1m0s while verifying writability of storage directory\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2.1:175\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2:164\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78}|
|2024-06-30T11:27:16Z|ERROR|piecestore|upload failed|{Process: storagenode, Piece ID: SNAJ57UF4APURAIB2U4K6I7Q5W3GBKDR4ODWD6SOBX5CYTOLQIMA, Satellite ID: 12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs, Action: PUT, Remote Address: 79.127.226.102:43904, Size: 1048576, error: manager closed: unexpected EOF, errorVerbose: manager closed: unexpected EOF\n\tgithub.com/jtolio/noiseconn.(*Conn).readMsg:225\n\tgithub.com/jtolio/noiseconn.(*Conn).Read:171\n\tstorj.io/drpc/drpcwire.(*Reader).read:68\n\tstorj.io/drpc/drpcwire.(*Reader).ReadPacketUsing:113\n\tstorj.io/drpc/drpcmanager.(*Manager).manageReader:229}|

It is OK if it is slow, but why does it shut down and restart again?

docker events shows “die”:


admin@DATAHUB:/volume1/home/admin $ sudo docker events --since 2024-06-30

2024-06-30T11:03:55.610637494+03:00 network connect 0cccdaa13d5b7ca1e9fc45bc65f73bede06f9935df0b1d1b838d9443831319da (container=8acefa9f583e559a98909e0c26f8c462f594a3e61f46e46e4a3212472c860905, name=bridge, type=bridge)

2024-06-30T11:03:57.239151055+03:00 container start 8acefa9f583e559a98909e0c26f8c462f594a3e61f46e46e4a3212472c860905 (image=storjlabs/watchtower, io.storj.watchtower=true, name=watchtower)

2024-06-30T13:05:32.404215804+03:00 network connect 0cccdaa13d5b7ca1e9fc45bc65f73bede06f9935df0b1d1b838d9443831319da (container=aa554f32a1de09840c43fc66fc85cada077d397a8d4ab24e69667acacaf2bfda, name=bridge, type=bridge)

2024-06-30T13:05:34.969124229+03:00 container start aa554f32a1de09840c43fc66fc85cada077d397a8d4ab24e69667acacaf2bfda (image=storjlabs/storagenode:latest, name=storagenode)

2024-06-30T13:05:34.969146886+03:00 container restart aa554f32a1de09840c43fc66fc85cada077d397a8d4ab24e69667acacaf2bfda (image=storjlabs/storagenode:latest, name=storagenode)

2024-06-30T13:36:47.003787291+03:00 network disconnect 0cccdaa13d5b7ca1e9fc45bc65f73bede06f9935df0b1d1b838d9443831319da (container=aa554f32a1de09840c43fc66fc85cada077d397a8d4ab24e69667acacaf2bfda, name=bridge, type=bridge)

2024-06-30T13:36:47.120687588+03:00 container die aa554f32a1de09840c43fc66fc85cada077d397a8d4ab24e69667acacaf2bfda (execDuration=1870, exitCode=0, image=storjlabs/storagenode:latest, name=storagenode)

2024-06-30T13:36:48.095620100+03:00 network connect 0cccdaa13d5b7ca1e9fc45bc65f73bede06f9935df0b1d1b838d9443831319da (container=aa554f32a1de09840c43fc66fc85cada077d397a8d4ab24e69667acacaf2bfda, name=bridge, type=bridge)

2024-06-30T13:36:49.628211361+03:00 container start aa554f32a1de09840c43fc66fc85cada077d397a8d4ab24e69667acacaf2bfda (image=storjlabs/storagenode:latest, name=storagenode)

2024-06-30T14:09:17.465333210+03:00 network disconnect 0cccdaa13d5b7ca1e9fc45bc65f73bede06f9935df0b1d1b838d9443831319da (container=aa554f32a1de09840c43fc66fc85cada077d397a8d4ab24e69667acacaf2bfda, name=bridge, type=bridge)

2024-06-30T14:09:17.615534404+03:00 container die aa554f32a1de09840c43fc66fc85cada077d397a8d4ab24e69667acacaf2bfda (execDuration=1946, exitCode=0, image=storjlabs/storagenode:latest, name=storagenode)

2024-06-30T14:09:18.582641937+03:00 network connect 0cccdaa13d5b7ca1e9fc45bc65f73bede06f9935df0b1d1b838d9443831319da (container=aa554f32a1de09840c43fc66fc85cada077d397a8d4ab24e69667acacaf2bfda, name=bridge, type=bridge)

2024-06-30T14:09:19.823206505+03:00 container start aa554f32a1de09840c43fc66fc85cada077d397a8d4ab24e69667acacaf2bfda (image=storjlabs/storagenode:latest, name=storagenode)

2024-06-30T14:29:38.561572544+03:00 network disconnect 0cccdaa13d5b7ca1e9fc45bc65f73bede06f9935df0b1d1b838d9443831319da (container=aa554f32a1de09840c43fc66fc85cada077d397a8d4ab24e69667acacaf2bfda, name=bridge, type=bridge)

2024-06-30T14:29:38.712520405+03:00 container die aa554f32a1de09840c43fc66fc85cada077d397a8d4ab24e69667acacaf2bfda (execDuration=1217, exitCode=0, image=storjlabs/storagenode:latest, name=storagenode)

2024-06-30T14:29:39.748786389+03:00 network connect 0cccdaa13d5b7ca1e9fc45bc65f73bede06f9935df0b1d1b838d9443831319da (container=aa554f32a1de09840c43fc66fc85cada077d397a8d4ab24e69667acacaf2bfda, name=bridge, type=bridge)

2024-06-30T14:29:40.770352145+03:00 container start aa554f32a1de09840c43fc66fc85cada077d397a8d4ab24e69667acacaf2bfda (image=storjlabs/storagenode:latest, name=storagenode)
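
To see how long the container ran before each “die”, the events above can be summarized with a small script. This is a rough sketch that assumes the line layout shown in this listing (timestamp, `container die`, container ID, attributes in parentheses) and that `execDuration` is in seconds, which is consistent with the start and die timestamps above. The function and variable names are illustrative only.

```python
import re
from datetime import datetime

# Matches lines such as:
# 2024-06-30T13:36:47.120687588+03:00 container die <id> (execDuration=1870, exitCode=0, ...)
DIE_RE = re.compile(r"^(?P<ts>\S+) container die \S+ \((?P<attrs>.*)\)\s*$")

# The three "die" events and one "start" event copied from the listing above.
SAMPLE_EVENTS = """\
2024-06-30T13:36:47.120687588+03:00 container die aa554f32a1de09840c43fc66fc85cada077d397a8d4ab24e69667acacaf2bfda (execDuration=1870, exitCode=0, image=storjlabs/storagenode:latest, name=storagenode)
2024-06-30T13:36:49.628211361+03:00 container start aa554f32a1de09840c43fc66fc85cada077d397a8d4ab24e69667acacaf2bfda (image=storjlabs/storagenode:latest, name=storagenode)
2024-06-30T14:09:17.615534404+03:00 container die aa554f32a1de09840c43fc66fc85cada077d397a8d4ab24e69667acacaf2bfda (execDuration=1946, exitCode=0, image=storjlabs/storagenode:latest, name=storagenode)
2024-06-30T14:29:38.712520405+03:00 container die aa554f32a1de09840c43fc66fc85cada077d397a8d4ab24e69667acacaf2bfda (execDuration=1217, exitCode=0, image=storjlabs/storagenode:latest, name=storagenode)
""".splitlines()


def parse_die_events(lines):
    """Return (timestamp, exec_seconds, exit_code, name) for each 'container die' line."""
    events = []
    for line in lines:
        match = DIE_RE.match(line.strip())
        if not match:
            continue  # skip start, restart and network events
        attrs = dict(item.split("=", 1) for item in match["attrs"].split(", "))
        # Docker prints nanoseconds; datetime keeps microseconds, so trim the rest.
        ts = re.sub(r"(\.\d{6})\d+", r"\1", match["ts"])
        events.append((
            datetime.fromisoformat(ts),
            int(attrs["execDuration"]),
            int(attrs["exitCode"]),
            attrs.get("name", ""),
        ))
    return events


if __name__ == "__main__":
    for ts, seconds, code, name in parse_die_events(SAMPLE_EVENTS):
        print(f"{ts:%H:%M:%S} {name}: ran {seconds / 60:.1f} min, exit code {code}")
```

For the listing above this gives runs of roughly 20 to 32 minutes, each ending with exit code 0.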

And here are some uniq -c counts from the logs:

| Count, logger (or message) | Message (or fields) |
|---|---|
|      5 Version is up to date|{Process: storagenode-updater, Service: storagenode}|
|      5 Version is up to date|{Process: storagenode-updater, Service: storagenode-updater}|
|      2 bandwidth|Persisting bandwidth usage cache to db|
|      2 collector|collect|
|      2 collector|error during collecting pieces: |
|      2 db.migration|Database Version|
|      2 failure during run|{Process: storagenode, error: piecestore monitor: timed out after 1m0s while verifying writability of storage directory, errorVerbose: piecestore monitor: timed out after 1m0s while verifying writability of storage directory\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2.1:175\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2:164\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78}|
|      1 gracefulexit:chore|error retrieving satellites.|
|      8 pieces:trash|emptying trash finished|
|      8 pieces:trash|emptying trash started|
|    753 piecestore|download canceled|
|     17 piecestore|download failed|
|   2452 piecestore|download started|
|   1694 piecestore|downloaded|
|     35 piecestore|error sending hash and order limit|
|    482 piecestore|upload canceled|
|   9536 piecestore|upload canceled (race lost or node shutdown)|
|     12 piecestore|upload failed|
|  41159 piecestore|upload started|
|  31608 piecestore|uploaded|
|      1 piecestore:cache|error getting current used space: |
|      2 piecestore:monitor|Disk space is less than requested. Allocated space is|
|      2 preflight:localtime|local system clock is in sync with trusted satellites' system clock.|
|      2 preflight:localtime|start checking local system clock with trusted satellites' system clock.|
|      8 reputation:service|node scores updated|
|      2 retain|Prepared to run a Retain request.|
|      2 retain|retain pieces failed|
|      2 server|enable with: sysctl -w net.ipv4.tcp_fastopen=3|
|      2 server|kernel support for server-side tcp fast open remains disabled.|
|      1 servers|service takes long to shutdown|
|      1 servers|slow shutdown|
|      3 services|service takes long to shutdown|
|      1 services|slow shutdown|
|      1 services|unexpected shutdown of a runner|
|      2 trust|Scheduling next refresh|
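
The piecestore counts above can be turned into rough success rates. This is only a sketch of the arithmetic: the counts come from a log window, so the categories do not sum exactly to the number of started transfers, and the dictionary and helper names are made up for illustration.

```python
# Counts copied from the uniq -c table above.
COUNTS = {
    "upload started": 41159,
    "uploaded": 31608,
    "upload canceled": 482,
    "upload canceled (race lost or node shutdown)": 9536,
    "upload failed": 12,
    "download started": 2452,
    "downloaded": 1694,
    "download canceled": 753,
    "download failed": 17,
}


def share(part: int, total: int) -> float:
    """Return part as a percentage of total."""
    return 100.0 * part / total


if __name__ == "__main__":
    up = COUNTS["upload started"]
    down = COUNTS["download started"]
    canceled_up = (COUNTS["upload canceled"]
                   + COUNTS["upload canceled (race lost or node shutdown)"])
    print(f"uploads completed:    {share(COUNTS['uploaded'], up):.1f}%")
    print(f"uploads canceled:     {share(canceled_up, up):.1f}%")
    print(f"uploads failed:       {share(COUNTS['upload failed'], up):.2f}%")
    print(f"downloads completed:  {share(COUNTS['downloaded'], down):.1f}%")
    print(f"downloads canceled:   {share(COUNTS['download canceled'], down):.1f}%")
    print(f"downloads failed:     {share(COUNTS['download failed'], down):.2f}%")
```

Roughly 77% of started uploads and 69% of started downloads completed in this window; most of the rest were canceled rather than failed.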

Alexey · June 30, 2024, 12:31pm

Because of this error: “piecestore monitor: timed out after 1m0s while verifying writability of storage directory”.

You need to optimize your disk subsystem. It’s too slow to write even a small file within the 1m0s timeout.
Or you may increase this timeout if you know that there are no hardware issues and it’s just slow.
It will not solve the slowness, but it would let the node tolerate it.
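
To get a feel for how slow the storage really is, the time to write a small file can be measured directly. The sketch below is not how the storagenode monitor implements its check; it is only a rough approximation (write a small file, flush it to disk, delete it), and `PROBE_DIR` is a hypothetical environment variable used here to point the probe at the storage directory.

```python
import os
import tempfile
import time


def time_small_write(directory: str, size: int = 1024) -> float:
    """Write a small temp file in directory, fsync it, delete it; return elapsed seconds."""
    start = time.monotonic()
    fd, path = tempfile.mkstemp(prefix="write-probe-", dir=directory)
    try:
        os.write(fd, os.urandom(size))
        os.fsync(fd)  # force the data to the device, not just the page cache
    finally:
        os.close(fd)
        os.remove(path)
    return time.monotonic() - start


if __name__ == "__main__":
    # PROBE_DIR is a made-up variable name; default to the system temp directory.
    target = os.environ.get("PROBE_DIR", tempfile.gettempdir())
    samples = [time_small_write(target) for _ in range(5)]
    print(f"{target}: min {min(samples):.3f}s, max {max(samples):.3f}s")
```

Running it while the node is under load is more telling than running it on an idle disk, since the timeout is hit when the disk is busy with other I/O.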

YuriyGavrilov · June 30, 2024, 4:23pm

Yep, it seems the filesystem should be optimized. But it’s funny that the second node regularly takes 300-600 GB of uploads daily, and 750 GB today.

And it has no such restart issue, even though it is based on a cheap RockPro64 SoC, while the restarting first node runs on Asustor hardware, which in theory should be faster.

Alexey · July 1, 2024, 4:53am

It’s more related to the disk subsystem than to CPU power. However, adding RAM, if possible, may solve the disk slowness in many cases on Linux.
But I believe that adding RAM to a SoC is not an option, so you need to increase the check timeout a little bit.

YuriyGavrilov · July 1, 2024, 7:00pm

Funny that the SoC node works normally, but the Asustor NAS is restarting. So I need to dig into the details.

Alexey · July 2, 2024, 4:49am

If it’s restarting, then it is failing one of the checks: writability, readability, or both. Other FATAL or Unrecoverable errors are also possible; they all result in a node restart too.

YuriyGavrilov · July 4, 2024, 6:25pm

Hm… I raised the timeout from the “timed out after 1m0s while verifying writability” error to 2 minutes, and the node has run for 27 hours without restarting… so it seems to be working. Maybe an SSD would help more, but I’m a bit afraid of data loss: I already lost one node when an SSD died, and I don’t want that again. Maybe I will add an SSD cache for reads only, which is safer.

Alexey · July 5, 2024, 7:47am

This is a band-aid, not a solution. Your storage is able to save a small file only with a 2m0s timeout.
This needs to be addressed: it could be that your disk is dying, or just that the filesystem is not optimized.
However, in some cases it could be considered OK, because you would only lose upload races, which does not affect audits.
Failed reads are much more dangerous, because they would affect audits too.