High read load and upload cancelled on a full system

Hi

On Dec. 2, my node started showing 100% disk busy time, 100% of the time.
It started after a reboot of my system and after adding two disks to an unrelated pool, which freed up two drives that I’m now using for two other nodes on the same system.

I don’t think there is any correlation to my issue; it’s just full disclosure that these changes happened on my system around the time the issue started.

Both docker stats and iostat confirm the “heavy” load on the read side: around 1-2 MB/s, high iowait, and high utilization.

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.25    0.00    0.60   52.89    0.00   46.25

Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dkB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
vda             27.00   1212.80   218.90  89.02  292.43    44.92    1.60    354.40    89.90  98.25  191.75   221.50    0.00      0.00     0.00   0.00    0.00     0.00    0.40   30.00    8.21  96.32
CONTAINER ID   NAME          CPU %     MEM USAGE / LIMIT   MEM %     NET I/O          BLOCK I/O         PIDS
105048dfb1cd   watchtower    0.00%     14.05MiB / 5.8GiB   0.24%     1.69kB / 0B      10.8MB / 10.7MB   10
ad01c7ab94f9   storagenode   1.38%     2.021GiB / 5.8GiB   34.84%    11.9MB / 260MB   4.32GB / 74MB     46
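
For reference, the snapshots above came from something like the following (the 5-second interval is an arbitrary choice; the second iostat sample reflects the current load rather than the since-boot average):

iostat -x 5 2              # extended per-device stats, two samples 5 seconds apart
docker stats --no-stream   # one-shot container CPU/memory/IO usage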

That would be great if I were being paid for it, but I’m not (this kind of read load would translate into 100GB+ of egress per day).

If I stop the docker container, the load stops.

Looking at the logs, I see two issues.

  1. Lots of upload started and upload canceled entries, which is weird to me since the node is full per its settings (6.3 TB allocated). A quick way to count these is shown after the log excerpts below.
2023-10-18T13:40:45Z    INFO    piecestore      upload started  {"process": "storagenode", "Piece ID": "Y3CRVWYNY4XF7EZ5MH2QHJHLX627OQNFPT7NQSDSGLIJLOA7SC7Q", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "PUT", "Available Space": 1193176610432, "Remote Address": "5.161.192.57:32442"}
2023-10-18T13:40:45Z    INFO    piecestore      upload canceled (race lost or node shutdown)    {"process": "storagenode", "Piece ID": "5QF3I4GNRO4WN4EM64D4JV7OLZOJMZREXBL6BVZOZM2JNS4CQVTQ"}
2023-10-18T13:40:45Z    INFO    piecestore      upload started  {"process": "storagenode", "Piece ID": "UN7ORTS5UB6FFRJLWR2TVB7UOWEVF6R6TMZ24HBES52XVYTQIH7A", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "PUT", "Available Space": 1193176596608, "Remote Address": "5.161.220.231:45664"}
2023-10-18T13:40:45Z    INFO    piecestore      upload canceled (race lost or node shutdown)    {"process": "storagenode", "Piece ID": "Y3CRVWYNY4XF7EZ5MH2QHJHLX627OQNFPT7NQSDSGLIJLOA7SC7Q"}
2023-10-18T13:40:46Z    INFO    piecestore      upload canceled (race lost or node shutdown)    {"process": "storagenode", "Piece ID": "JDDL5VROUAODKRZWHT47CCY667L5PIYLHDEZM6LOR3E36FWHHOWQ"}
2023-10-18T13:40:46Z    INFO    piecestore      upload started  {"process": "storagenode", "Piece ID": "N7MVRVOHZA4WRHPBUM7URUSLBWT3JRXWT6HZMGPQ7CTFSS7ZR76A", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "PUT", "Available Space": 1193176582528, "Remote Address": "23.237.232.130:34448"}
2023-10-18T13:40:46Z    INFO    piecestore      upload started  {"process": "storagenode", "Piece ID": "VMN5UAG6F4RFMA6TIUXAGSRONGRRCSGPBR3ZMVBYOHCRHMS2GTAA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "PUT", "Available Space": 1193176572288, "Remote Address": "5.161.117.79:44630"}
2023-10-18T13:40:46Z    INFO    piecestore      upload canceled (race lost or node shutdown)    {"process": "storagenode", "Piece ID": "UN7ORTS5UB6FFRJLWR2TVB7UOWEVF6R6TMZ24HBES52XVYTQIH7A"}
2023-10-18T13:40:46Z    INFO    piecestore      upload canceled (race lost or node shutdown)    {"process": "storagenode", "Piece ID": "N7MVRVOHZA4WRHPBUM7URUSLBWT3JRXWT6HZMGPQ7CTFSS7ZR76A"}
2023-10-18T13:40:46Z    INFO    piecestore      upload canceled (race lost or node shutdown)    {"process": "storagenode", "Piece ID": "VMN5UAG6F4RFMA6TIUXAGSRONGRRCSGPBR3ZMVBYOHCRHMS2GTAA"}
2023-10-18T13:40:46Z    INFO    piecestore      upload canceled {"process": "storagenode", "Piece ID": "MPIIDWWBR4JE22KDUI5L5DCSGGVDGNSGG7QFPR3CVZU7CLKIY4KA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "PUT", "Size": 65536, "Remote Address": "72.52.83.203:46234"}
  2. The uploads keep being attempted because, judging by the “Available Space” field, the node apparently still thinks there is free space available.
2023-10-18T14:52:54Z    INFO    piecestore      upload started  {"process": "storagenode", "Piece ID": "JPPI2AYVVGE3A4XBV4OF2CP7Q2MJFOJCQMVVUHQ5QUVGCXVCBQ4A", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "PUT", "Available Space": 1192009931392, "Remote Address": "23.237.232.146:55970"}
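
To put numbers on the started/canceled ratio, grepping the container logs is enough (container name assumed to be storagenode, as in the docker stats output above):

docker logs storagenode 2>&1 | grep -c "upload started"
docker logs storagenode 2>&1 | grep -c "upload canceled"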

df confirms the disk usage and shows that there is still physical space available on the disk:

 df -T --si
Filesystem     Type      Size  Used Avail Use% Mounted on
udev           devtmpfs  3.1G     0  3.1G   0% /dev
tmpfs          tmpfs     623M  603k  623M   1% /run
/dev/vda2      ext4      7.4T  6.4T  607G  92% /
tmpfs          tmpfs     3.2G     0  3.2G   0% /dev/shm
tmpfs          tmpfs     5.3M     0  5.3M   0% /run/lock
/dev/vda1      vfat      536M  7.8M  529M   2% /boot/efi
overlay        overlay   7.4T  6.4T  607G  92% /var/lib/docker/overlay2/6a97fefe52f0f138985176c9690db0ba0d9c96cf88c2ff5e556d4857873f59ce/merged
tmpfs          tmpfs     623M     0  623M   0% /run/user/1000
overlay        overlay   7.4T  6.4T  607G  92% /var/lib/docker/overlay2/5d10a35cf3b7082f947c9727335641436acf97f7da409d2895e5d31b946fae1b/merged

I’m also getting download canceled entries, which makes sense since my node is now very slow to respond to download requests while the disk is 100% busy.

Which brings me back to the question: why do I have a high read load when hardly anything is being served to anyone? It feels very much like a DoS attack from the satellite, or some bug, but I just can’t find the cause.

Anyone got a clue where I should look next to troubleshoot this?

So I just found these two threads.

They seem to be in line with the issues I’m having.

I have changed storage2.piece-scan-on-startup from true to false.
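
In case anyone wonders how: the option can be set either in config.yaml or appended as a flag to the docker run command. Roughly like this (exact placement depends on your setup):

# in config.yaml
storage2.piece-scan-on-startup: false

# or appended after the image name on the docker run command line
--storage2.piece-scan-on-startup=false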

That did something, but it didn’t fix the issue.

iowait is down to less than 30% and utilization is below 100%, but still 95%+.

I can see the filewalker isn’t running now.

Just to clarify: it might have gotten better, but it is still a far cry from what it used to be.

I recommend installing your OS on a separate drive from the one where you store your storagenode data.

Sure, but there are benefits to both approaches, and I don’t see the relation to my issue.

This is some sort of virtual machine, judging from the vda device. What does it look like from the host? Let’s try eliminating at least one variable here.

Sure, it’s a Debian VM running on bhyve on TrueNAS CORE.

You can see the disk busy for ada1, which is the single disk this VM runs on and which holds the Storj data/config; the graph data is from TrueNAS. iostat -x and docker stats are in the first post.

OK. Let’s just make sure the device performs correctly with as little software in between as possible. Can you unmount it from the VM and run some simple benchmarks from the host? For example, sequential and random read benchmarks with fio directly on the block device? You can use these examples, they’ll be good enough. Do not run write tests on the block device, or you will lose data.

For a hard disk drive you should see around 250 IOPS in the random read test and >100 MB/s in the sequential read test.
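
Roughly something along these lines (/dev/XXX is a placeholder for your block device; --readonly is a safety net that refuses any writes):

fio --name=Random4kRead --filename=/dev/XXX --readonly --direct=1 --rw=randread --bs=4k --numjobs=64 --runtime=120 --time_based --group_reporting
fio --name=Seq1mRead --filename=/dev/XXX --readonly --direct=1 --rw=read --bs=1M --numjobs=4 --runtime=120 --time_based --group_reporting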

Sure thing, here are the results:

Random4kRead: (groupid=0, jobs=64): err= 0: pid=40383: Tue Dec 12 01:08:31 2023
  read: IOPS=191, BW=766KiB/s (785kB/s)(79.5MiB/106218msec)
    clat (msec): min=6, max=1748, avg=324.60, stdev=187.99
     lat (msec): min=6, max=1748, avg=324.60, stdev=187.99
    clat percentiles (msec):
     |  1.00th=[   32],  5.00th=[   80], 10.00th=[  118], 20.00th=[  174],
     | 30.00th=[  218], 40.00th=[  257], 50.00th=[  296], 60.00th=[  338],
     | 70.00th=[  388], 80.00th=[  447], 90.00th=[  558], 95.00th=[  667],
     | 99.00th=[  936], 99.50th=[ 1070], 99.90th=[ 1401], 99.95th=[ 1536],
     | 99.99th=[ 1687]
   bw (  KiB/s): min=  413, max= 2490, per=100.00%, avg=793.22, stdev= 6.04, samples=12170
   iops        : min=   60, max=  593, avg=158.10, stdev= 1.55, samples=12170
  lat (msec)   : 10=0.05%, 20=0.33%, 50=1.94%, 100=5.12%, 250=31.08%
  lat (msec)   : 500=47.07%, 750=11.37%, 1000=2.32%, 2000=0.72%
  cpu          : usr=0.20%, sys=0.00%, ctx=28317, majf=0, minf=64
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=20345,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=256

Run status group 0 (all jobs):
   READ: bw=766KiB/s (785kB/s), 766KiB/s-766KiB/s (785kB/s-785kB/s), io=79.5MiB(83.3MB), run=106218-106218msec
Seq1mRead: (groupid=0, jobs=4): err= 0: pid=40792: Tue Dec 12 01:16:26 2023
  read: IOPS=262, BW=263MiB/s (275MB/s)(30.8GiB/120015msec)
    clat (msec): min=2, max=201, avg=15.23, stdev=15.25
     lat (msec): min=2, max=201, avg=15.23, stdev=15.25
    clat percentiles (msec):
     |  1.00th=[    4],  5.00th=[    6], 10.00th=[    6], 20.00th=[    8],
     | 30.00th=[    8], 40.00th=[    8], 50.00th=[    9], 60.00th=[   10],
     | 70.00th=[   12], 80.00th=[   22], 90.00th=[   38], 95.00th=[   47],
     | 99.00th=[   72], 99.50th=[   84], 99.90th=[  133], 99.95th=[  157],
     | 99.99th=[  199]
   bw (  KiB/s): min=108213, max=532214, per=99.81%, avg=268284.57, stdev=29814.37, samples=952
   iops        : min=  104, max=  519, avg=259.99, stdev=29.11, samples=952
  lat (msec)   : 4=2.36%, 10=63.49%, 20=10.73%, 50=19.71%, 100=3.46%
  lat (msec)   : 250=0.25%
  cpu          : usr=0.04%, sys=0.16%, ctx=31818, majf=0, minf=1024
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=31504,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=256

Run status group 0 (all jobs):
   READ: bw=263MiB/s (275MB/s), 263MiB/s-263MiB/s (275MB/s-275MB/s), io=30.8GiB(33.0GB), run=120015-120015msec

Since you have a discrepancy between the reported used space and the actual used space, you must not disable the filewalker. It should complete its calculation, and then the retain process should move the garbage to the trash. While the discrepancy exists, your node will keep being offered uploads: the satellites only know the reported used space (the left graph) and the allocation (the right graph), so they do not know that your node is actually full. They only receive this information when customers try to upload, hence the canceled uploads.

What’s the filesystem on this drive?
Do you have errors for gc-filewalker, lazyfilewalker, or retain?
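
(A quick way to check, assuming the container is named storagenode:)

docker logs storagenode 2>&1 | grep -E "gc-filewalker|lazyfilewalker|retain" | grep -iE "error|failed" | tail -n 20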

Okay, I will enable the filewalker again then. The filesystem in the VM is ext4, on top of a ZFS pool (one disk).

That’s the reason, unfortunately: the VM is adding even more latency. ZFS without an SSD as a special device is sometimes very slow. Using ZFS directly instead of ext4 on top of it might work better.
However, if the filewalker can finish its work (which may take several days in this setup), things should improve, especially if you have enough RAM on the host for the cache. A common rule of thumb is 1GB of RAM per TB of stored data. I’m only afraid that with virtualization the cache cannot be used effectively.

I understand your concern, but it has been running flawlessly for a year: low latency, low busy time, low IO requests, and generally low IO. The disk itself can perform, as the fio results show.

Here are graphs from just before Dec. 2.

Then on Dec. 2 everything spikes like crazy and stays elevated, most metrics by a factor of 10 or more.

ZFS has plenty of memory available and hardly uses it. If it had simply been poor performance, sure, I could agree with you, but there is an actual load here that has changed and seems to be hammering the disk, causing high disk busy time, higher outstanding IO (OIO), and high disk latency.

[graph: disk busy, outstanding IO, and latency before and after Dec. 2]

I will start the filewalker again once I have pulled the Docker log file off the VM, so I can search it faster. But at a read rate of around 1 MB/s it will take at least 70 days for the filewalker to finish.

Can I run the filewalker without having the node online?

Looks OK. So now I would check how quickly you can do metadata operations. With the usual size of pieces stored on a node, and with ext4, du on the node’s config directory should take less than 15 minutes per terabyte of stored data.
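
For example (the path is a placeholder for wherever the node’s data directory lives; du only reads metadata, so this is a decent proxy for filewalker speed):

time du -sh /mnt/storagenode/storage/blobs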

Thanks for the help. After some digging, it seems the issue is that the ext4 filesystem is corrupt.
I have a bad superblock; I tried to fix it and the filesystem mounts, but I’m getting other issues now, which point to data corruption :frowning: The filesystem gets mounted read-only now :frowning:

Seems like the best path forward is to start all over again :frowning:

I’m contemplating whether I should let fsck run, but it will take at least 15 days to finish, at which point the node will have very poor uptime, and there’s no guarantee that the data is recoverable.

Anyway, I’m guessing Storj will have “marked” all the data as stale after such an extended time period.

I might try to see if I can copy it to a new location and build a new container from scratch, but with the current data and config (not in a VM).

I think we can close this case here.


I would suggest fixing the filesystem before abandoning the node. The online score recovers after 30 days online, so you may try to fix the issues and bring the node back.
As long as the node is not disqualified on all satellites, it can keep working and will be paid for the usage.

You need the data and its identity to bring the node back.


fsck on one of my 18TB ext4 file systems runs for a few hours, not 15 days. Besides, trying to recover a node might be a good exercise, and worst case you will start from scratch a few days later. If you have time to spend on this fight, I would try.
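
If the bad superblock is what is blocking the check, e2fsck can be pointed at one of the backup copies. A rough sketch (device path taken from the earlier df output; 32768 is a typical backup location for 4K-block ext4, adjust to what the dry run reports; run it with the filesystem unmounted):

mke2fs -n /dev/vda2          # dry run: only prints where the backup superblocks would be, writes nothing
e2fsck -b 32768 /dev/vda2    # run the check against a backup superblock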


You might be right, but looking at the speed at which fsck was reading the disk, that’s my best guess. (There seemed to be a lot of read errors; just mounting the disk at boot took around 10 minutes.)

I have spare space on another pool, so I’m copying all the data there now; then I will clean the disk and copy it back. It looks like it will take about 3 days to copy the data.

I’m wondering if I can use a symlink to the remote pool while I’m copying the data back and then remove the symlink once the data is there again, or I might just change the Docker config to point to the new path once the data is copied back (see the sketch below).
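
If I go the Docker route, it should just be a matter of pointing the bind mounts at the new path when recreating the container. A rough sketch (paths are placeholders for my layout; the usual -p/-e options are omitted):

docker run -d --name storagenode \
    --mount type=bind,source=/mnt/newpool/storagenode/identity,destination=/app/identity \
    --mount type=bind,source=/mnt/newpool/storagenode/data,destination=/app/config \
    storjlabs/storagenode:latest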

We will see


Because you have bypassed all the optimizations ZFS could do by hiding everything under a virtual disk. ZFS can’t optimize access to it; it does not know what it is. It’s just a blob of data you randomly read from and write to.

Remove your VM. Copy the node data to a dataset on your pool. Run the node in a jail. All your issues will be gone.


That’s the plan. But I’m not running ZFS for Storj’s sake, so I don’t need or want ZFS to optimize performance; the disk itself delivers plenty of performance for that.

But the main pool has NVMe, benefits from the RAM, and has all my other jails running on it.

The main reason for the VM was that I was running k8s; TrueNAS CORE doesn’t support that natively, and SCALE at the time didn’t support a lot of the other stuff I’m running.

There are always trade-offs.