Hi Alexey. I ran chown
on the entire folder the node uses. I do not run as root. The space still resets to zero. Same behaviour as I described before.
Please enable the debug log level in your config.yaml (log.level: debug), save the config, and restart the node. Then check for any errors related to databases.
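If editing by hand is inconvenient, the same change can be scripted; a minimal sketch using a throwaway file (the real config path, e.g. /app/config/config.yaml inside the container, depends on your setup):

```shell
# Demonstrated on a throwaway copy; point CONFIG at the real config.yaml
# on your node (path varies per setup).
CONFIG=$(mktemp)
printf 'log.level: info\n' > "$CONFIG"   # stand-in for the real config.yaml

# Switch the log level to debug in place.
sed -i 's/^log.level:.*/log.level: debug/' "$CONFIG"

cat "$CONFIG"   # → log.level: debug
```

After saving, restart the node (e.g. docker restart storagenode) for the new level to take effect.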
After updating to 0.27.1, my main node (1.6TB as per the config), which was full, reported zero disk usage. I did not check the dashboard at first after updating, just the container logs, and all seemed fine.
However, what worries me is that the node has apparently started accepting new data.
I have a dedicated hard drive for its mount point, and the mount works fine. The storage directory lists successfully and is the expected size (1546388M as of writing this). However, according to the node dashboards (both the CLI and the web-based GUI) I have used 18.9GB, and the logs reveal plenty of uploads. The timestamps of files within the storage dir are recent, which means that the container still uses this very directory.
The configs were not changed in any way, nor was the launch script that I use.
I am also running a second node with another hard-drive dedicated to it (since the first one was full), and that worked just fine.
I am at a complete loss here. I am concerned about effectively all of the stored data, and about filling the drive to capacity, since the old files are there but not taken into account.
What do I do guys?
E: FWIW, the successrate checker script doesn't report anything out of the ordinary either:
========== AUDIT ==========
Successful: 35
Recoverable failed: 0
Unrecoverable failed: 0
Success Rate Min: 100.000%
Success Rate Max: 100.000%
========== DOWNLOAD ==========
Successful: 5059
Failed: 199
Success Rate: 96.215%
========== UPLOAD ============
Successful: 22679
Rejected: 0
Failed: 245
Acceptance Rate: 100.000%
Success Rate: 98.931%
========== REPAIR DOWNLOAD ===
Successful: 330
Failed: 26
Success Rate: 92.697%
========== REPAIR UPLOAD =====
Successful: 113
Failed: 0
Success Rate: 100.000%
Hi fragamemnon,
I have encountered a strikingly similar situation; however, my circumstances are slightly different, and for me this happened while running v0.26.2. I have not updated to v0.27.1, as I have still not resolved the issue.
Update: After a graceful reboot, the node will not start due to an insufficient disk space error (the drive is 2TB, with 288GB currently free).
Makes sense, since the minimum space requirement is 500GB and the node can't see that the used space is actually being used by storj.
When I updated and ran the node for the first time, the disk had ~300GB free, which is still well below the minimum. It ran successfully.
This seems exactly like the pattern of behaviour I saw. Although I have more than enough free space, on the first restart, when my used space went back to zero, the node did not detect that my allocated space exceeded the actual free space, and gave no errors to that effect. On subsequent restarts my node did detect that the allocated space was greater than the free space, logged the warning, and reduced the allocated space to match. (All verified in the logs.)
I am still unable to get the node to read the old database.
I'd be reluctant to delete the possibly abandoned data, because I risk deleting something else by accident (timestamps do not give me absolute certainty).
Still looking for advice; downtime is growing…
@Alexey , could you chime in?
Please enable the debug log level for your storagenode (log.level: debug in your config.yaml), save the config, and restart the node.
Please check the node's logs for any errors related to a database.
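One way to do that filtering (a sketch; the container name storagenode is an assumption, and the sample log line below only demonstrates the filter):

```shell
# On a live node:
#   docker logs storagenode 2>&1 | grep -iE 'error.*(database|db)'
# Demonstrated here on one sample log line:
printf '2019-12-16T07:06:11.335Z ERROR piecestore:cacheUpdate error persisting cache totals to the database\n' \
  | grep -iE 'error.*(database|db)'
```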
Thanks for chiming in, Alexey.
Here's the log output:
2019-12-16T07:06:10.201Z INFO Configuration loaded from: /app/config/config.yaml
2019-12-16T07:06:10.202Z DEBUG debug server listening on 127.0.0.1:33127
2019-12-16T07:06:10.219Z INFO Operator email: fragamemnon@overclocked.net
2019-12-16T07:06:10.219Z INFO operator wallet: 0x86e846f759da3025b65732ec133f41c16bc5aa6b
2019-12-16T07:06:10.675Z DEBUG Binary Version: v0.27.1 with CommitHash 08a5cb34f1f43e6c851dabbd3f9e9d1510534dcd, built at 2019-12-11 12:51:29 +0000 UTC as Release true
2019-12-16T07:06:11.289Z DEBUG version allowed minimum version from control server is: v0.26.0
2019-12-16T07:06:11.289Z INFO version running on version v0.27.1
2019-12-16T07:06:11.290Z DEBUG telemetry Initialized batcher with id = "128oWRvkvoWtetkJ6ntVzU9KJhPvkrJnwqArHxFZVCxvyDQxb5J"
2019-12-16T07:06:11.296Z INFO db.migration Database Version {"version": 26}
2019-12-16T07:06:11.297Z DEBUG gracefulexit:chore checking pending exits
2019-12-16T07:06:11.297Z INFO contact:chore Storagenode contact chore starting up
2019-12-16T07:06:11.297Z INFO Node 128oWRvkvoWtetkJ6ntVzU9KJhPvkrJnwqArHxFZVCxvyDQxb5J started
2019-12-16T07:06:11.297Z INFO Public server started on [::]:28967
2019-12-16T07:06:11.297Z INFO Private server started on 127.0.0.1:7778
2019-12-16T07:06:11.298Z INFO bandwidth Performing bandwidth usage rollups
2019-12-16T07:06:11.298Z INFO pieces:trashchore Storagenode TrashChore starting up
2019-12-16T07:06:11.298Z DEBUG pieces:trashchore starting EmptyTrash cycle
2019-12-16T07:06:11.298Z DEBUG orders cleaning
2019-12-16T07:06:11.299Z DEBUG orders sending
2019-12-16T07:06:11.299Z DEBUG gracefulexit:chore no satellites found
2019-12-16T07:06:11.303Z DEBUG orders no orders to send
2019-12-16T07:06:11.329Z DEBUG orders cleanup finished {"items deleted": 0}
2019-12-16T07:06:11.333Z INFO piecestore:monitor Remaining Bandwidth {"bytes": 59926975982336}
2019-12-16T07:06:11.334Z WARN piecestore:monitor Disk space is less than requested. Allocating space {"bytes": 340815607040}
2019-12-16T07:06:11.334Z ERROR piecestore:monitor Total disk space less than required minimum {"bytes": 500000000000}
2019-12-16T07:06:11.334Z ERROR piecestore:cacheUpdate error getting current space used calculation: {"error": "context canceled"}
2019-12-16T07:06:11.334Z ERROR version Failed to do periodic version check: version control client error: Get https://version.storj.io: context canceled
2019-12-16T07:06:11.335Z ERROR piecestore:cacheUpdate error persisting cache totals to the database: {"error": "piece space used error: context canceled", "errorVerbose": "piece space used error: context canceled\n\tstorj.io/storj/storagenode/storagenodedb.(*pieceSpaceUsedDB).UpdateTotal:115\n\tstorj.io/storj/storagenode/pieces.(*CacheService).PersistCacheTotals:82\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run.func1:68\n\tstorj.io/storj/private/sync2.(*Cycle).Run:87\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:63\n\tstorj.io/storj/storagenode.(*Peer).Run.func6:445\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}
2019-12-16T07:06:11.362Z FATAL Unrecoverable error {"error": "piecestore monitor: disk space requirement not met", "errorVerbose": "piecestore monitor: disk space requirement not met\n\tstorj.io/storj/storagenode/monitor.(*Service).Run:118\n\tstorj.io/storj/storagenode.(*Peer).Run.func2:433\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}
This is the consequence of losing the database…
The only way is to free up some space on that drive so that the node can run. Since it doesn't recognize its own pieces, it can only see the free space. There must be at least 500GB free to be able to start the node.
Please stop and remove the container, then check the piece_spaced_used.db with this instruction:
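A common way to check a suspect storagenode database is sqlite3's built-in integrity check; a sketch on a throwaway database (the sqlite3 CLI and the path to piece_spaced_used.db are assumptions — run it against a copy of the real file while the node is stopped):

```shell
# Empty stand-in database; point DB at a *copy* of the real
# .../storage/piece_spaced_used.db on your node instead.
DB=$(mktemp)
sqlite3 "$DB" 'CREATE TABLE t (x INTEGER);'

# Prints "ok" if the file is structurally sound; anything else
# indicates corruption.
sqlite3 "$DB" 'PRAGMA integrity_check;'
```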
So it seems I might have been a bit impatient when I reported back initially. After replacing the owner on the entire node working folder, I started the node and the used space started at 0 again. After letting the node run for about 60s, the used space had gone up to about 100 MB. I declared the problem unfixed and logged out; however, I had forgotten to stop the node. When I remembered about an hour later, I logged back in and saw that the node was showing 1.9 TB of used space. I tried restarting the node, and the used space remained at 1.9 TB. This was all while the node was running v0.26.2. (This all happened Dec 14, but I haven't been able to report back until now.)
I forgot to dump the logs before watchtower updated to v0.27.1, so unfortunately I can't get any info about what may have fixed the problem. I guess it was the ownership issue, although I don't know what could have changed, as the node had been running fine for months. And it seems that the node was properly tracking all of the pieces during this, since I have not seen any unrecoverable failed audits.
Thanks for your help @deathlessdd and Alexey. Looks like I am back up and running.
I've followed the database repair steps. However, the HDD has a 2TB capacity, and I have 1.6TB of node data.
How should I go about freeing up space?
Please open the config.yaml in your data folder and try to uncomment (remove the #) this line in the config:
# storage2.monitor.minimum-disk-space: 500.0 GB
then replace the 500.0 with 300.0, save the configuration file, and restart the container:
docker restart -t 300 storagenode
Then check whether it is able to start and recognize the space.
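For reference, the uncomment-and-replace step can be scripted as well; a sketch on a stand-in file (the real config.yaml path is setup-specific):

```shell
# Stand-in for the real config.yaml; point CONFIG at yours.
CONFIG=$(mktemp)
printf '# storage2.monitor.minimum-disk-space: 500.0 GB\n' > "$CONFIG"

# Uncomment the line and lower the minimum in one pass.
sed -i 's/^# *\(storage2.monitor.minimum-disk-space:\) 500.0 GB/\1 300.0 GB/' "$CONFIG"

cat "$CONFIG"   # → storage2.monitor.minimum-disk-space: 300.0 GB
```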
2 posts were split to a new topic: Storagenode had corrupted data in over 400 blocks on the hard-drive storage
Hello all,
I'm quite new to the storj community, and I have the same problem.
Each time the docker container is restarted, the Disk Space Remaining is reset.
I checked the logs and saw this line:
2020-01-09T19:46:03.215Z ERROR piecestore:cache error getting current space used calculation: {"error": "lstat config/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/ys/r4gkofts26bk5m3gflnvyvkhjttix5tods2577zl2onse6fceq.sj1: structure needs cleaning"}
Is it something I can fix?
Thanks in advance!
This suggests file system issues. It would be a good idea to run e2fsck to try to fix it.
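For anyone unfamiliar with it, here is a sketch of an e2fsck run against a small file-backed ext4 image, so nothing real is touched; on the node you would stop the container, unmount the drive, and run it against the actual device (e.g. e2fsck -f /dev/sdX1, where the device name is an assumption for your setup):

```shell
# Build a tiny ext4 image as a safe stand-in for the real partition.
IMG=$(mktemp)
dd if=/dev/zero of="$IMG" bs=1M count=8 status=none
mkfs.ext4 -q -F "$IMG"   # -F: allow formatting a regular file

# -f forces a full check even if the fs looks clean; -n answers "no"
# to all repair prompts, making this a read-only dry run. Drop -n
# (and add -y if you want automatic fixes) for a real repair.
e2fsck -fn "$IMG"
```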
Are you saying the storj node corrupted a clean new disk in less than a month?
That doesn't seem plausible to me. As far as I understand (and see in the dashboard), new nodes don't get much traffic. What are they doing with the disk if it got corrupted that fast?
To me this looks more like some software error.
For @Dylan, it's "almost" exactly as in this document: https://documentation.storj.io/setup/cli/storage-node
docker run -d --restart unless-stopped -p 28967:28967 \
-p 14002:14002 \
-e WALLET="0xXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" \
-e EMAIL="user@example.com" \
-e ADDRESS="ip:28967" \
-e BANDWIDTH="10TB" \
-e STORAGE="3TB" \
--mount type=bind,source="/media/drive/identity/storagenode",destination=/app/identity \
--mount type=bind,source="/media/drive/storj",destination=/app/config \
--name storagenode storjlabs/storagenode:beta