Yes. Use an external journal and place it on your NVMe storage (assuming you have at least a half-decent NVMe device).
It’s a T-FORCE TM8FP7001T. I chose it for its very high write endurance rating.
I don’t know how to do that. And if the NVMe dies, the journal dies with it; can that affect the storage drives? If I replace the broken NVMe and reinstall the OS, can the journal be recreated without destroying the storage data?
A journal is effectively only a very short-term backup of data that will land somewhere else on the filesystem anyway a few seconds later. It is never read from except when recovering from an unclean shutdown, and a clean shutdown leaves the journal empty.
If the NVMe fails while the HDD still works, the filesystem can still be shut down cleanly. The journal can also be moved back to the main block device at any time.
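For reference, a rough sketch of the commands involved; the device names (/dev/nvme0n1p1 for the NVMe partition, /dev/sda1 for the HDD filesystem) are placeholders for your own setup, so double-check them before running anything:
# create a dedicated journal device on the NVMe partition (block size must match the data filesystem, usually 4096)
sudo mke2fs -O journal_dev -b 4096 /dev/nvme0n1p1
# either point at it when formatting the HDD:
sudo mkfs.ext4 -J device=/dev/nvme0n1p1 /dev/sda1
# or attach it to an existing, unmounted ext4 filesystem:
sudo tune2fs -O ^has_journal /dev/sda1
sudo tune2fs -J device=/dev/nvme0n1p1 /dev/sda1
# and to move the journal back onto the HDD later:
sudo tune2fs -O ^has_journal /dev/sda1
sudo tune2fs -j /dev/sda1
tune2fs may ask you to run e2fsck -f on the unmounted filesystem first; that is expected.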
In my opinion, use gdisk to create a partition, then just use
sudo mkfs.ext4 /dev/sda1
or set up LVM first (to get several useful features without a huge performance impact), and format the logical volume as usual ext4 with default parameters.
And add this parameter to mkfs:
-e remount-ro
And add this parameter to your fstab entry:
errors=remount-ro
This will switch the filesystem to read-only mode if any filesystem error occurs. If that happens, stop the node, unmount the filesystem and run fsck on it to fix the error.
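For example, a full fstab entry could look something like this (the UUID and mount point are placeholders, not values from this thread):
UUID=<filesystem-uuid> /mnt/storagenode ext4 defaults,noatime,errors=remount-ro 0 2
The last two numbers are the dump and fsck-pass fields discussed just below.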
Should I set “defaults,noatime 0 0” or “0 2” in fstab?
Should I let it run the filesystem check at boot?
Does it take a long time to run when the HDD is almost full?
I tried the “0 2”, and after a reboot there was some intense activity on both drives for 5-10 seconds, and then just continuous “blinking” for minutes without stopping, until I changed it to “0 0” and rebooted.
Now there is just the intense 5-10 sec boot activity on them.
I wonder if that was the fs scan, or whether some heads were just “unsettled”.
The last two columns are “dump” (not used, so always zero) and the filesystem check pass (0 = don’t check, 2 = check after everything marked 1 has been checked, which should only be the root filesystem).
Yep, I got that, but what would you recommend? Does the scan take long?
Perhaps it shouldn’t, unless you have abrupt restarts.
Since most distros migrated to systemd, systemd decides whether to check the filesystems on boot if they are marked dirty (i.e. not cleanly unmounted). For a few (stupid) reasons this sometimes fails, which drops the boot process into recovery (i.e. if your node is remote, you have to get there to fix it). I prefer the system to come back online as soon as possible; then I remote into the system, stop the nodes, unmount the filesystems and check them manually.
Man, when you switch from a GUI OS, especially Windows, to a CLI one like Ubuntu Server, you start learning so much stuff from under the hood.
I spent hours and hours reading and trying to understand each command and each parameter, just to get the basic install done.
@Toyoo Would it reduce directory fragmentation if storj put all new blocks into a single directory? A new directory could be created when a max file limit is reached or a new week begins.
I mention this idea here:
File adds would only happen in the latest directory, and in all other directories only file delete operations (move to trash) would be executed.
This is a trade-off resulting in longer piece expiration processes and downloads for low-memory nodes.
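Just to make the rotation idea concrete, a tiny sketch in Go (the directory naming and function name are invented here, this is not how storagenode lays out blobs):
package pieces

import (
	"fmt"
	"path/filepath"
	"time"
)

// weeklyDir returns the directory new pieces would land in: everything
// written during the current ISO week goes into one directory, and older
// directories only ever see deletes / moves to trash.
func weeklyDir(base string, now time.Time) string {
	year, week := now.UTC().ISOWeek()
	return filepath.Join(base, fmt.Sprintf("%04d-w%02d", year, week))
}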
What information does this process require?
If the entry size is 124 bytes, then an in-memory dictionary/array would require around 122 MB of memory per 1 million entries. Most likely the entry size is less than that.
Then I read in the forum that there was already some kind of sqlite lookup db in place,
but it had understandable issues with locking.
A simple journal with a near-1:1 memory layout and append-only operations would already be enough to act as persistent storage for the in-memory table.
To me it simply makes little sense that storj needs a dedicated HDD, or needs to hold all fs metadata in the system cache, just to be able to operate or to share the disk with some other process like a btc node.
As far as I know, it goes through all records in the expiration database and deletes them one by one. This is repeated every 24 hours since the node started.
Satellite ID, piece ID, expiration time.
Ok, and?
If it is append only, this means that pieces deleted by GC are not removed from this journal? Or, well, expired pieces?
It is simple, that’s the key benefit. There were ideas to make it more complex for speed. More complexity is risky though.
@Toyoo I think the first implementation used an ACID data store (sqlite) and ran into the well-known lock contention issue.
Then storj switched to the file-walker, which created the random-access IO issue,
and everyone suggests overcoming it by using a filesystem cache in memory or on flash.
There are no cheap, simple and optimal options for that, at least I could not find one.
The optimal way to go would be to hold the required data in a better-suited data store.
A journal would be the best fit. It does not require locking the ACID way. All operations like add, update, delete or “expire” are appended to the end of the file. The consumer then needs to read all operations in sequence and apply them to the (in-memory) state. Another name for this would be “event sourcing”.
Most implementations use a hybrid approach where the state is periodically written to disk to reduce load time. On restart this snapshot is loaded, and all operations after its timestamp are read from the journal and applied to reproduce the latest state.
Entries made before the snapshot can be purged at any time.
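A minimal sketch of the pattern in Go, just to make the idea concrete; the record layout, operation names and file format here are made up for illustration, not how storagenode stores anything:
// Package journal is a toy sketch of the append-only idea above.
package journal

import (
	"bufio"
	"encoding/json"
	"os"
	"time"
)

// Op is one appended entry: the operation plus the three values mentioned
// earlier (satellite ID, piece ID, expiration time).
type Op struct {
	Kind       string    `json:"kind"` // "add", "delete" or "expire"
	Satellite  string    `json:"satellite"`
	PieceID    string    `json:"piece"`
	Expiration time.Time `json:"expiration"`
}

// Append writes one operation to the end of the journal file.
func Append(path string, op Op) error {
	f, err := os.OpenFile(path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		return err
	}
	defer f.Close()
	line, err := json.Marshal(op)
	if err != nil {
		return err
	}
	_, err = f.Write(append(line, '\n'))
	return err
}

// Replay reads all operations in sequence and rebuilds the in-memory state
// (piece ID -> expiration). Deletes and expirations drop the entry again.
func Replay(path string) (map[string]time.Time, error) {
	state := make(map[string]time.Time)
	f, err := os.Open(path)
	if os.IsNotExist(err) {
		return state, nil // no journal yet, empty state
	}
	if err != nil {
		return nil, err
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		var op Op
		if err := json.Unmarshal(sc.Bytes(), &op); err != nil {
			return nil, err
		}
		switch op.Kind {
		case "add":
			state[op.PieceID] = op.Expiration
		case "delete", "expire":
			delete(state, op.PieceID)
		}
	}
	return state, sc.Err()
}
The snapshot from the hybrid variant would then just be this map serialized to disk every so often together with the journal offset it covers; on restart you load the snapshot and replay only the tail.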
I would never recommend implementing your own database engine,
but the journal is more of a pattern and can be implemented with nearly all languages and system-provided features.
Instead of the in-memory state, sqlite could be used again; the journal would then in a sense be the “write buffer”.
Sorry, I do not see this solution myself. You would probably have to explain how you would implement all the operations a storage node performs on this data structure, e.g. how you remove a piece that was deleted before its expiration time, etc. Without that I can’t discuss it without filling in the missing parts by guessing. Besides, 122 MB per 1 M entries is a lot. We have nodes which handle upwards of 70M pieces now, which would suggest spending close to 10 GB of RAM just for piece expiration.
@Toyoo This would be the index/table of all files, used for locating the chunk/file and for other operations like scavenging. The data structure does not need to be held fully in memory.
A boltdb or the already-used sqlite could still be applied,
but the main idea is to use a journal in front of it to decouple the read side from the write side.
The db itself could then be relocated to flash storage, and a filesystem cache would not be required anymore.
The node can use the db for all its operations, mainly scavenging, statistics and as a file locator. Only content read/write operations would require access to the chunk storage.
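To illustrate the decoupling, a rough sketch in Go (the names and the Store interface are invented for this example; in practice sqlite or boltdb would sit behind it): the write path only ever appends, and a background loop pushes batches into the database.
// Package writebuffer sketches the "journal in front of the db" idea.
package writebuffer

import (
	"sync"
	"time"
)

// Op mirrors the journal entry from the earlier sketch.
type Op struct {
	Kind, Satellite, PieceID string
	Expiration               time.Time
}

// Store stands in for whatever database sits behind the buffer
// (sqlite, boltdb, ...); only batch writes ever reach it.
type Store interface {
	ApplyBatch(ops []Op) error
}

// Buffer collects operations from the write path and flushes them to the
// Store in batches, so readers of the db never contend with single writes.
// In a real setup the pending slice would be backed by the on-disk journal.
type Buffer struct {
	mu      sync.Mutex
	pending []Op
	store   Store
}

// New starts a background flusher that drains the buffer periodically.
func New(store Store, every time.Duration) *Buffer {
	b := &Buffer{store: store}
	go func() {
		for range time.Tick(every) {
			_ = b.Flush()
		}
	}()
	return b
}

// Append is the only thing the write path does: record the operation.
func (b *Buffer) Append(op Op) {
	b.mu.Lock()
	b.pending = append(b.pending, op)
	b.mu.Unlock()
}

// Flush applies everything collected so far to the database in one batch.
func (b *Buffer) Flush() error {
	b.mu.Lock()
	batch := b.pending
	b.pending = nil
	b.mu.Unlock()
	if len(batch) == 0 {
		return nil
	}
	return b.store.ApplyBatch(batch)
}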
A related example would be bitcoin-core; it has a similar approach with its index files, but there the journal would be the block chunks themselves.