The current status of decentralized data storage

Data storage can be a vulnerability in blockchain applications. If the app is decentralized, its data should be too, or the app will be less decentralized than it claims.

Introduction

I want to start by explaining the two interesting terms: decentralized and storage. Storage is not ambiguous. It means keeping a file (here, anything from an image to a JSON object) or a set of data saved on a computer. You dedicate a certain amount of disk space to storing some files over some time. Decentralization might require a bit more explanation. There are various meanings, but they all boil down to one thing: no single point of control or failure for any action. When a system is decentralized, everyone is in sync, participants have similar power and rights in the network, and the rules for participation are agreed on by the network itself. Based on this, decentralized storage means being able to store your files on any computer within the decentralized network and read them from any other. It’s a way of ensuring data is always readily available and does not rely on a single location for retrieval.

How do digital storage solutions work today? We mostly use cloud-based solutions such as AWS, Google Cloud or Microsoft Azure. These services store files on remote computers hosted in centralized locations (us-west-2 and so on). These locations are secure, cloud providers have gotten better at storing and delivering files, they’re not very expensive and they scale well. So what is the problem with them, and what gave rise to decentralized storage? On rare occasions, a cloud server goes down. We’ve had outages where cloud servers were unavailable, even if only for a few moments. When that happens, you are unable to read your data because the only source of that data is unavailable. This is a single point of failure for a blockchain (decentralized) application, where unavailable data can have serious consequences. On a blockchain, where everything is open, available and supposed to be decentralized, the data the application needs should be stored in a decentralized way too. This is why most NFT projects keep their metadata on decentralized storage platforms like IPFS.

Location vs Content

Availability and potential downtime are not the only issues. How the files are addressed matters too. In cloud-based services, files are addressed by their location. To access a file stored in an AWS S3 bucket, you visit a URL that is the digital location of the file. The URL points to where the file should be, but there’s no assurance that the location still exists or still holds the same file. In decentralized storage, we use an approach called content addressing, as opposed to location addressing, to retrieve stored files.

What does that mean? Let me use the example of a wandering wizard who hears news of a legendary book. The legend says the book is in a library called Dressrosa and describes the shelf where it can be found. This is a very descriptive location and anyone can follow it. In the best case, you get to the library and the shelf is there. But if you get to the library and cannot find the shelf, the legendary book is a lost cause. That is exactly how retrieval works in centralized cloud storage: you store your files in the cloud and you get a URL pointing to the location of the file.

With content addressing, you ask for the book by what it is rather than where it is. It’s as if, instead of going to look for the book, you know a magic spell that names the book exactly. Anyone who holds a copy, or even some of its pages, can hand it over, and you can check that what you received is the real thing. The pieces are then put back together into the original file, and retrieval does not depend on any single server being up. The way we access files in decentralized storage is via content addressing, and one technology that makes this possible is IPFS (InterPlanetary File System).
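
The difference between the two addressing styles can be sketched in a few lines of Python. This is a toy in-memory model, not how S3 or IPFS are actually implemented. The class names `LocationStore` and `ContentStore` and the shelf path are made up for illustration, and a plain SHA-256 hex digest stands in for a real CID.

```python
import hashlib


class LocationStore:
    """Location addressing: the key says where the data lives, not what it is."""

    def __init__(self):
        self._files = {}

    def put(self, path: str, data: bytes) -> str:
        self._files[path] = data  # silently replaces whatever was at this path
        return path

    def get(self, path: str) -> bytes:
        return self._files[path]  # raises KeyError if the location is gone


class ContentStore:
    """Content addressing: the key is derived from the data itself."""

    def __init__(self):
        self._blocks = {}

    def put(self, data: bytes) -> str:
        address = hashlib.sha256(data).hexdigest()  # stand-in for a CID
        self._blocks[address] = data
        return address

    def get(self, address: str) -> bytes:
        data = self._blocks[address]
        # The reader can verify that the bytes really match the address.
        if hashlib.sha256(data).hexdigest() != address:
            raise ValueError("content does not match its address")
        return data


if __name__ == "__main__":
    by_location = LocationStore()
    by_location.put("dressrosa/shelf-7/book", b"legendary book")
    by_location.put("dressrosa/shelf-7/book", b"something else")
    print(by_location.get("dressrosa/shelf-7/book"))  # same address, new content

    by_content = ContentStore()
    address = by_content.put(b"legendary book")
    print(address, by_content.get(address))
```

The point of the sketch is that the same path can end up holding different bytes over time, while a content address can only ever refer to one exact sequence of bytes.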

IPFS is a distributed system for storing, retrieving and sharing files between peers. It is a technology that allows you to content-address files saved on other people’s computers. How it does this is powerful: when data is saved, a string of characters called the CID (Content Identifier) is derived from it, and in practice no two distinct pieces of data have the same CID. IPFS is built on three main concepts, which we will describe briefly:

  • Content Addressing - This is the most obvious concept of IPFS. Files saved on IPFS do not get a location URL. Instead, a string of characters is derived from their content, and it could look like this (ipfs://bafkreiav7yuf36u3rr). The characters after the // are the CID (Content Identifier), also known as the content address. Every file stored on IPFS gets a CID that, in practice, no other file can produce. The CID can stand in for the data because just by knowing it, you can retrieve the file from any node that is storing it. One major thing to point out is that a CID is not the same as a plain hash. A CID contains a hash of the content, but it also carries prefixes describing the CID version, the content type and the hash function used. So if you took a bare hash with the same number of characters as a valid CID, it would not return a file, and if it ever did, it would be by pure luck.

  • Directed Acyclic Graphs, DAGs - The DAG is a key part of how IPFS works. Earlier I mentioned that a bare hash will not return a file. That is partly because of the DAG. When a file is stored on IPFS, it is broken down into smaller pieces (called blocks) and each of these blocks has its own CID, also stored on IPFS. This feature makes IPFS very similar to a peer-to-peer network like BitTorrent. In BitTorrent, peers can download files in pieces. So you could download the last 10 minutes of a movie from peer 1 and the first 30 minutes from peer 2. You can do the same with IPFS data. IPFS content makes use of a Merkle DAG. A Merkle DAG is like a family tree: at the top is one root node, and every node is linked to its children by their hashes. In the Ethereum blockchain, Merkle trees are used to verify the contents of a block. In IPFS, a Merkle DAG links the different blocks of a file to the root node, and the CID of the root is the CID of the file.

  • Distributed Hash Tables, DHTs - The DHT is the final piece of the puzzle. It is how peers discover each other on the IPFS network and find out which peers hold the blocks for a given CID. It is part of a library called libp2p, the same library that ETH 2.0 uses for its peer-to-peer networking.
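
To make the first two concepts more concrete, here is a minimal Python sketch of how a CID can be assembled and how blocks can be linked under a root. It rests on several assumptions: it builds a CIDv1 with the raw codec and SHA-256 in base32, it uses a 256 KiB chunk size, and the root block is simply a list of child CIDs. Real IPFS nodes typically wrap files in UnixFS nodes, so the CIDs a node reports for the same file can differ from what this prints. The function names are hypothetical.

```python
import base64
import hashlib

CID_VERSION = 0x01        # CIDv1
RAW_CODEC = 0x55          # multicodec code for raw bytes
SHA2_256 = 0x12           # multihash code for sha2-256
CHUNK_SIZE = 256 * 1024   # assumed chunk size for this sketch


def cid_v1_raw(block: bytes) -> str:
    """Build a base32 CIDv1 string for a single raw block."""
    digest = hashlib.sha256(block).digest()
    multihash = bytes([SHA2_256, len(digest)]) + digest
    cid_bytes = bytes([CID_VERSION, RAW_CODEC]) + multihash
    encoded = base64.b32encode(cid_bytes).decode("ascii").lower().rstrip("=")
    return "b" + encoded  # "b" is the multibase prefix for base32


def build_dag(data: bytes, chunk_size: int = CHUNK_SIZE):
    """Split data into blocks and link them under one root (simplified)."""
    blocks = {}
    child_cids = []
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        cid = cid_v1_raw(chunk)
        blocks[cid] = chunk
        child_cids.append(cid)
    # Simplification: the root block is just the ordered list of child CIDs.
    root_block = "\n".join(child_cids).encode("ascii")
    root_cid = cid_v1_raw(root_block)
    blocks[root_cid] = root_block
    return root_cid, blocks


def read_file(root_cid: str, blocks: dict) -> bytes:
    """Reassemble the file by following the links stored in the root block."""
    child_cids = blocks[root_cid].decode("ascii").splitlines()
    return b"".join(blocks[cid] for cid in child_cids)


if __name__ == "__main__":
    root, blocks = build_dag(b"hello decentralized storage" * 1000, chunk_size=4096)
    print(root, len(blocks))
    print(read_file(root, blocks)[:27])
```

Changing a single byte of the file changes the CID of the block that contains it, which changes the root block and therefore the root CID. That is why the root CID alone is enough to identify and verify the whole file.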

Using IPFS

To store content on IPFS, you need to have an IPFS node running. This is software that you can download and run on your computer. With the node (also called a daemon or peer) running, you can interact with it through command line instructions. Here is a JavaScript implementation of an IPFS node. You can follow the installation guide to get started.

A CID can be read in the browser by prefixing it with https://ipfs.io/ipfs/ (it will look like this: https://ipfs.io/ipfs/<CID>). According to the IPFS docs, you can swap out https://ipfs.io for your own HTTP-to-IPFS gateway, but you are then responsible for keeping it running. A gateway URL is different from the usual location of a file on a server. The gateway is only an entry point that fetches the content from whichever nodes are storing it, so the same CID works on any gateway. Knowing the CID is usually a surefire way to retrieve data from IPFS, as long as some node still stores it. Protocol Labs keeps a list of IPFS gateways, and a very helpful tool is IPFS Companion, a browser extension that can find IPFS gateways and retrieve your stored files through them. I use IPFS Companion, and all I need to do is paste my CID into my browser and the extension handles the gateway part. Really convenient.
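
As a small illustration, the gateway URL pattern described above can be wrapped in a helper. This is only a sketch: `gateway_url` and `fetch` are hypothetical names, and the local gateway address in the usage example assumes a node exposing a gateway on port 8080.

```python
from urllib.request import urlopen

DEFAULT_GATEWAY = "https://ipfs.io"


def gateway_url(cid_or_uri: str, gateway: str = DEFAULT_GATEWAY) -> str:
    """Turn a CID or an ipfs:// URI into a path-style gateway URL."""
    value = cid_or_uri.strip()
    if value.startswith("ipfs://"):
        value = value[len("ipfs://"):]
    if not value:
        raise ValueError("empty CID")
    return f"{gateway.rstrip('/')}/ipfs/{value}"


def fetch(cid_or_uri: str, gateway: str = DEFAULT_GATEWAY, timeout: float = 30) -> bytes:
    """Download the content behind a CID through an HTTP gateway."""
    with urlopen(gateway_url(cid_or_uri, gateway), timeout=timeout) as response:
        return response.read()


if __name__ == "__main__":
    print(gateway_url("ipfs://bafkreiav7yuf36u3rr"))
    # Same CID, different gateway (assumes a local node with a gateway on 8080).
    print(gateway_url("bafkreiav7yuf36u3rr", "http://127.0.0.1:8080"))
```

Only the gateway part of the URL changes between the two calls. The CID stays the same, which is the practical upside of content addressing.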

Limitations of IPFS

IPFS is made to be a peer-to-peer file-sharing system: anyone can run an IPFS node and store files. As long as someone in the network is storing your files, they can be retrieved. But because IPFS is so decentralized, retrieving your files can get tricky. There is no obligation for a node to keep storing your files, and retrieval might be slow at times. Many people use pinning to deal with this. Pinning tells a node to keep a file instead of discarding it, so the file stays available and is faster to retrieve. However, pinning services can be quite expensive (Pinata plans are as much as $100/month), and the alternative is to run your own node. In summary, an IPFS node storing your files is not obliged to keep storing them, and it’s expensive to pin to nodes and ensure they always store your files. IPFS is awesome, but it can be better.
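
If you run your own node, pinning can be scripted. The sketch below assumes a local Kubo (go-ipfs) node with its HTTP RPC API listening on the default address 127.0.0.1:5001, and uses its `pin/add` and `pin/ls` commands. Other implementations and hosted pinning services expose different APIs, so treat the helper names and the address as assumptions.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed default address of a local Kubo RPC API.
RPC_BASE = "http://127.0.0.1:5001/api/v0"


def rpc_request(command: str, **params: str) -> Request:
    """Build a POST request for an RPC command such as "pin/add"."""
    url = f"{RPC_BASE}/{command}"
    if params:
        url += "?" + urlencode(params)
    return Request(url, data=b"", method="POST")


def pin(cid: str) -> dict:
    """Ask the local node to keep the content behind this CID."""
    with urlopen(rpc_request("pin/add", arg=cid), timeout=60) as response:
        return json.load(response)


def list_pins() -> dict:
    """List the content the local node has pinned recursively."""
    with urlopen(rpc_request("pin/ls", type="recursive"), timeout=60) as response:
        return json.load(response)


if __name__ == "__main__":
    # Building the request works without a node; sending it needs one running.
    print(rpc_request("pin/add", arg="bafkreiav7yuf36u3rr").full_url)
```

Pinning this way only protects the copy on your own node. If that node goes offline and nobody else holds the data, the CID still cannot be resolved.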

What if we had a way to ensure that nodes (I’ll call them storage providers from now on) do not just abandon stored data, and also to get them to reduce their prices? This is exactly why Protocol Labs, the same company that created IPFS, developed Filecoin. Filecoin is meant to be an open economy around IPFS. To put it better, Filecoin is a marketplace for decentralized storage. Storage providers on Filecoin are incentivized to consistently provide proof of storage for the data they store, and are rewarded in FIL tokens for doing so. Storage providers that cannot provide proof of storage, or that misbehave in the system, get punished.

One important thing to note is that, while IPFS is a protocol that lets participants exchange files in a peer-to-peer and distributed manner, Filecoin is a blockchain network that adds incentives and an economic market around decentralized file storage systems like IPFS. Storage providers on Filecoin store files for as long as the storage deal is active. To store a file on Filecoin, you first store it on IPFS and get a CID, which we call the Data CID. This data can then be taken to a storage provider, and once a deal is made another CID is provided, called the Deal CID. The storage provider receives payment in FIL tokens for storing the files, and is also rewarded in FIL tokens for providing proof of storage.

On Filecoin, two types of deals can be made: storage deals and retrieval deals. To store a file with a storage provider, you make a storage deal, and to get the file back from a storage provider, you make a retrieval deal. Filecoin, being a blockchain, also uses cryptographic proofs to ensure that storage providers are actually storing the files they received. You can probably see how zero-knowledge proofs come in here. The two types of proofs that Filecoin uses are:

  • Proof of Replication (PoRep) - This is the proof a storage provider produces to demonstrate that they have received all the data to be stored and have encoded it uniquely.

  • Proof of Spacetime (PoSt) - This is the proof a storage provider produces after storing the data. The aim is to repeatedly prove, for as long as the deal lasts, that the data is still being stored, ideally without having to reveal the files.
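
To get a feel for the general idea of proving that data is still held, here is a toy challenge-response sketch built on a Merkle tree. It is emphatically not Filecoin’s PoRep or PoSt, which rely on sealed sectors and zero-knowledge proofs. In this simplified version the client keeps only a Merkle root, challenges a random block, and the provider must answer with that block and its Merkle path. Unlike the real proofs, the challenged block is revealed. All function names are hypothetical, and the sketch expects at least one block.

```python
import hashlib
import secrets


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_levels(chunks):
    """Return every level of the Merkle tree, leaves first, root last."""
    level = [h(chunk) for chunk in chunks]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:  # odd level: pair the last hash with itself
            level = level + [level[-1]]
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def prove(chunks, index):
    """Provider side: return the challenged block and its Merkle path."""
    path = []
    position = index
    for level in merkle_levels(chunks)[:-1]:
        sibling = position ^ 1
        if sibling >= len(level):
            sibling = position  # this node was paired with itself
        path.append(level[sibling])
        position //= 2
    return chunks[index], path


def verify(root, index, chunk, path):
    """Client side: recompute the root from the block and the path."""
    node = h(chunk)
    position = index
    for sibling in path:
        node = h(node + sibling) if position % 2 == 0 else h(sibling + node)
        position //= 2
    return node == root


if __name__ == "__main__":
    stored = [b"block-0", b"block-1", b"block-2"]
    root = merkle_levels(stored)[-1][0]        # all the client needs to keep
    challenge = secrets.randbelow(len(stored))  # unpredictable block index
    block, path = prove(stored, challenge)
    print(verify(root, challenge, block, path))
```

Because the challenge is random, a provider that has thrown away part of the data will eventually fail to answer. Repeating such challenges over time is the intuition behind proving storage continuously.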

It’s important to also point out that even though Filecoin and IPFS might seem different, they are closely related. Filecoin builds on the same foundations as IPFS, such as content addressing with CIDs and the libp2p networking library.

Using Filecoin

To save files to Filecoin, you need access to a Filecoin node, a Filecoin address and some FIL tokens. There are various implementations of the Filecoin protocol, but the most popular one, maintained by Protocol Labs itself, is Lotus. Examples of tools and services you can use to store data on Filecoin include:

  • Web3 Storage

  • NFT storage

  • Estuary

  • Chainsafe Storage

  • Lotus Client

Some of these work via the command line, like the Lotus client, some have a visual UI where users can upload files and interact with them, and most have an SDK that can be used to manage data programmatically. The best choice for you will depend on your use case. If you’re building a dapp, then an SDK is probably good enough. With Lotus, you can hook into the Filecoin network itself, submit transactions to the network and retrieve storage deals as well.

Filecoin is a blockchain and, being a blockchain, it is also programmable. A recent innovation is the introduction of the Filecoin VM (FVM) and the Filecoin EVM (FEVM), which allow the creation of actors on the Filecoin network. Actors are basically smart contracts: they are deployed onto the network and act in response to the transactions sent to them. The FVM and FEVM serve the same purpose, except that FEVM is better suited to developers with expertise in EVM-based tools such as Hardhat, EthersJS, Web3JS and so on. With these VMs, users can now create dapps that rely on the Filecoin network. This has opened up a world of possibilities, with people building data DAOs, DEXes, DeFi implementations and even NFT projects on the Filecoin network. This is possible because Filecoin is fundamentally a blockchain network and operates as such. This is another key difference between Filecoin and IPFS.

A project that’s very similar to Filecoin in terms of the idea and the problem tackled is Crust Network. Crust is a blockchain built with the Substrate framework that also provides an economy around decentralized storage provision. Crust is currently built around IPFS, but it is meant to work with any decentralized storage solution. Other solutions include Arweave, Sia (Siacoin) and Storj, among others. While Filecoin is probably the most innovative and active, other networks are doing similar work, and these decentralized storage solutions are here to stay. One persistent challenge is improving the speed of data retrieval. On IPFS, pinning your files to certain nodes makes retrieval easy and fast, but pinning is expensive. That is why networks like Filecoin and Crust have come up to build a market around storage, create competition between providers and effectively drive down the price of storage.

Summary

Cloud-based storage is efficient, but it does not work well with blockchain applications: the data lives with a single provider, availability depends on that provider, costs can add up, and there’s no assurance that the data will exist forever. The most useful way to store files or data for blockchain applications is a decentralized storage solution such as IPFS, where data is addressed not by its location but by a CID derived from the content itself. IPFS on its own is a great technology, but it has limitations around data persistence, the cost of pinning data and the lack of incentive and punishment schemes. This is why Protocol Labs came up with the Filecoin network, a blockchain network built to serve as a marketplace for decentralized storage.