Best practice for using Storj in a Python environment

Hi everyone!

I’ve been using Storj for quite a while now in both professional/technical work in my business and, more recently, for private data storage and backups. So far, I’m really happy with it.

My original idea was to use Storj as a full alternative to common object stores like AWS S3 or Azure Blob Storage, which are both expensive and lock you in with their egress policies. For the purpose of just storing large amounts of data that need only occasional retrieval or simple access patterns, Storj works very well.
However, I recently decided to use Storj in a more comprehensive way for “everyday” processes, including model development, data analytics, etc. For this, I need to be able to flexibly access data in different locations/buckets programmatically. In other words: access and interact with Storj buckets the way you’d interact with S3 buckets through boto3, or with Azure blobs through their Python API. For now, I’m primarily talking about uploading/downloading/copying/moving data, not so much managing permissions and the like.

I’m sure I’m not the first person to ask about these things but I couldn’t really find good answers, so I decided to write this post. To be more concise, here are my main questions:

  1. In general, what is the recommended way to interact with Storj programmatically from Python? Is uplink-python the go-to option? Are there other packages? Or should I just shell out to the Uplink CLI via subprocesses?
  2. Storj is advertised as being S3-compatible, and I’ve found a few tutorials and articles mentioning boto3, s3fs and “native” integration in pandas, for example. Nevertheless, I find very little on how to actually use Storj in a Python environment. Do tools developed for S3 (like boto3 or s3fs) also work with Storj without the need for a Storj-specific package? How far does this S3 compatibility really go?
  3. It’s honestly surprising that Storj has multiple Uplink implementations but only a third-party version for Python that hasn’t seen any updates in years. Python is one of the most popular languages around, so I’m wondering why there is no well-maintained Python API.
  4. Is uplink-python still usable and reliable? It appears to have been dormant since 2022, and I’m not comfortable relying on a package that’s obviously unmaintained.

I hope you can help and give me some advice. Thank you all in advance!

2 Likes

Thumbs up for your post @mpw. I was also advocating for better integration with programming environments, most recently in April 2023 I believe. Unfortunately, not much has changed since then. Apart from Python, I would also add Julia to the list, and perhaps some lower-level languages as well.

3 Likes

For #2, take a look at the compatibility table (S3 Compatibility - Storj Docs). There are also instructions for configuring s3fs (Connecting s3fs to Storj - Storj Docs). For boto3, the hardest part might be configuring it to use the Storj endpoint, https://gateway.storjshare.io (see Service-specific endpoints - AWS SDKs and Tools).

3 Likes

Hi @mpw!

I’m excited that you’re liking Storj. I agree with you that our guidance about what to do within Python is currently lacking. This is broadly due to our internal focus on providing drop-in S3 compatibility instead of our own bespoke libraries. That said, there is specific guidance I think I can provide.

  • For access management, we recently launched a Python-native library: GitHub - storj/access-python: access grant management in Python. This library’s purpose is to support access grant restriction and manipulation. This is similar to the functionality provided in our Go library for access grant management (uplink package - storj.io/uplink - Go Packages)

  • Once you have an access grant you like, you can register it with either of the above libraries for use with our edge services. Registering an access grant with our edge services gives you S3 configuration and credentials that can be used with boto3.

  • Best practice currently is to use boto3 for object management. As @pwilloughby points out, you’ll need to tell boto not only about the access key and secret key you get by registering an access grant, but also about our gateway endpoint, https://gateway.storjshare.io

  • The third-party uplink-python library does need an update to use the latest release of our C bindings (GitHub - storj/uplink-c: Uplink C library). We haven’t had the time lately to help support uplink-python, but the Python bindings to our C library themselves shouldn’t need much change. The only reason to care about these native bindings instead of boto3 is if you want your application to establish TCP connections directly to the nodes in our network instead of going through our gateway (you might! Depending on your network topology and other details, this can be faster).
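For the analytics side of the original question, the same gateway credentials can be reused with s3fs, which is what pandas uses under the hood for s3:// URLs. This is a sketch with placeholder bucket names and credentials; the only Storj-specific piece is the `endpoint_url` passed through `client_kwargs`.

```python
import pandas as pd

# Placeholder credentials from an access grant registered with the
# Storj edge services; only endpoint_url is Storj-specific.
storage_options = {
    "key": "YOUR_ACCESS_KEY",
    "secret": "YOUR_SECRET_KEY",
    "client_kwargs": {"endpoint_url": "https://gateway.storjshare.io"},
}

# pandas forwards storage_options to s3fs, so s3:// paths resolve
# against the Storj gateway instead of AWS, e.g.:
# df = pd.read_csv("s3://my-bucket/data/example.csv",
#                  storage_options=storage_options)
# df.to_parquet("s3://my-bucket/data/example.parquet",
#               storage_options=storage_options)
```

The same `storage_options` dict also works with `s3fs.S3FileSystem(**...)` directly if you want filesystem-style listing and copying rather than DataFrame I/O.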

Does this help? I can probably get you Python-specific hints about configuring boto if you like.

5 Likes

I wasn’t aware of those developments; I must have missed the announcement on the forum. Still, for Python I believe it’s crucial to endorse and maintain the full chain; otherwise, few will be inclined to take the risk during development. As for Julia, if your team lacks the necessary skills, I recommend posting on the Julia Discourse, expressing genuine interest in the language and asking for assistance. You could also directly contact the maintainers of AWS.jl and Minio.jl. If you’re confident in your technology, especially its reliability, consider reaching out to the JuliaHub team. The co-founders are approachable, and the Director of Customer Success is usually receptive. Participating in Julia Community team meetings could also be beneficial [Community].

Why? If their customers are asking for, and paying for, “drop-in S3 compatibility”… nothing about that is specific to Python. So why is it crucial to maintain Python libraries? It seems like they’re being given the appropriate level of attention.

For most customers (especially large ones) tailoring any part of their workflow to custom Storj libraries would be a poor business decision, as they’d lose the ability to easily move to any other S3 provider.

4 Likes