A couple of days ago, I came across a blog post that compares IPFS and BitTorrent. The blog post does a pretty good job of explaining BitTorrent, but gets a bunch of things about IPFS wrong.
I don't blame the author, because I think historically, the IPFS project hasn't done a great job of clearly communicating the different subsystems that make up IPFS. What's more, IPFS and its related suite of projects, e.g. IPLD, favour optionality and evolution in their design, which leads to complexity.
The How IPFS Works page, which the post references and which was overhauled last year, has a pretty good high-level summary of how IPFS works.
This blog post aims both to share my own spin on what IPFS is and to correct some of the things the blog post got wrong.
IPFS is a modular system for addressing, routing, and transferring data based on the principles of content addressing and peer-to-peer networking.
Let's break that down with an example.
Data is pretty abstract, and while IPFS works with many different kinds of data, let's start with a file as our data (after all, the FS in IPFS stands for file system, and a file is the basic unit of file systems).
Here's a picture of my dog Kaputzi:
So what can IPFS do for this picture of my dog?
Addressing (or what's the CID of my 🐶 jpeg)
IPFS uses hashes of data to address it using CIDs. CIDs are pretty much just glorified hashes. You may be familiar with this from Git: every commit is a hash of the commit's contents.
Adding Kaputzi's picture to IPFS will return the following CID:
bafkreia2xtwwdys4dxonlzjod5yxdz7tkiut5l2sgrdrh4d52d3qpstrpy. This string is pretty much just a SHA-256 hash with some extra metadata about what kind of data it is. This gets more complicated (enter UnixFS) once you start representing larger files, directories, and other kinds of data, but in principle, everything is just a hash.
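To make the "glorified hash" idea concrete, here's a toy sketch in Python. It is not a real CID implementation (real CIDs follow the multiformats specs, with multicodec and multihash prefixes); it only shows that the identifier is derived deterministically from the content itself:

```python
import hashlib

def toy_cid(data: bytes) -> str:
    """Toy illustration: this 'CID' is just a SHA-256 digest with a
    made-up prefix standing in for the real multicodec/multihash
    metadata. Real CIDs follow the multiformats specs."""
    digest = hashlib.sha256(data).hexdigest()
    return "toy-sha256-" + digest

print(toy_cid(b"woof"))                       # same input, same identifier
print(toy_cid(b"woof") == toy_cid(b"woof"))   # True: deterministic
```

The key property carries over to real CIDs: anyone hashing the same bytes derives the same identifier, without any central authority assigning names.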
Note: CIDs are cool in that they can be encoded in different ways; for example, there's even a base256emoji encoding of the same CID. Most efficiently, CIDs can be encoded in binary. Most commonly, CIDs are encoded with base32.
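The "same bytes, many encodings" idea can be illustrated with Python's standard library. The encodings below are analogous to, but not byte-for-byte identical with, real multibase encodings of a CID:

```python
import base64
import hashlib

digest = hashlib.sha256(b"woof").digest()

# The same 32 bytes rendered in different textual encodings, much like
# one CID can be written in base32, base58, hex, or kept as raw binary.
print(digest.hex())                               # hex rendering
print(base64.b32encode(digest).decode().lower())  # base32 rendering
print(len(digest))                                # raw binary: 32 bytes
```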
Content addressing is cool, because it means you can get the data of a CID from anyone (or multiple people simultaneously for faster retrieval of large files!) who claims to have it. You don't need to trust them because you can verify the data they send you matches the CID.
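That verification step can be sketched in a few lines. Assume `want` is the digest you extracted from a CID and `received` is whatever an untrusted peer sent you (both names are illustrative):

```python
import hashlib

def verify(expected_digest: bytes, received: bytes) -> bool:
    # No trust needed: recompute the hash locally and compare it
    # with the digest embedded in the CID you already hold.
    return hashlib.sha256(received).digest() == expected_digest

want = hashlib.sha256(b"dog photo bytes").digest()
print(verify(want, b"dog photo bytes"))  # True: content matches the CID
print(verify(want, b"cat photo bytes"))  # False: wrong or tampered data
```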
Another way to put it, is that IPFS will do the magic of finding who has the CID. That magic is called content routing.
Content Routing (or I have the CID, who has the 🐶 jpeg?)
So if multiple people have the picture addressed by a CID, how do you find them? This is what content routing solves.
Concretely, for a given CID, I want to know a list of IPs that I can connect to and ask for the file.
Note: In reality, things are slightly more complicated. In IPFS, nodes/peers have a unique Peer ID, and each peer can have multiple network addresses, i.e. one for each combination of IP, port, and protocol.
In IPFS nomenclature, we call peers who have data for a CID, providers.
Each CID can have multiple providers, and each provider can have multiple addresses.
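As a mental model, the result of content routing looks roughly like the mapping below. The peer IDs and multiaddress-like strings are made up for illustration:

```python
# Hypothetical snapshot of what content routing resolves for one CID:
# one CID maps to several providers (peer IDs), and each provider
# advertises several addresses.
providers = {
    "bafkrei-dog-photo-cid": {
        "12D3KooWPeerA": [
            "/ip4/203.0.113.7/udp/4001/quic-v1",
            "/ip4/203.0.113.7/tcp/4001",
        ],
        "12D3KooWPeerB": ["/ip6/2001:db8::1/udp/4001/quic-v1"],
    },
}

# Two providers for this CID, reachable over three addresses in total.
print(len(providers["bafkrei-dog-photo-cid"]))  # 2
```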
IPFS supports multiple ways to find providers for a given CID:
- Kademlia Distributed Hash Table (DHT). Amino is the main public DHT (but it's possible to run separate and even private DHTs)
- Delegated routing over HTTP (somewhat similar to BitTorrent trackers)
- Bitswap (technically Bitswap is a data transfer protocol but it's also used to discover data through WANT requests)
To learn more about how Kademlia works, check out this podcast with Petar Maymounkov, the co-author of the Kademlia paper.
One important note about the Amino DHT is that it's public and open (with at least 4 language implementations), averaging around 25k online DHT servers. It's also global, meaning that all provider records are in a shared namespace. Perhaps most fascinating is that 99% of lookups take under 1.7 seconds (admittedly, publishing provider records to the DHT is slower and a common source of content routing problems).
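Kademlia's core trick, storing a key's records on the peers whose IDs are "closest" to it, rests on the XOR distance metric. Here's a minimal sketch; the peer IDs are just hashes of made-up labels:

```python
import hashlib

def xor_distance(a: bytes, b: bytes) -> int:
    # Kademlia's metric: XOR the two IDs and read the result as an
    # integer. A smaller value means "closer" in the keyspace.
    return int.from_bytes(a, "big") ^ int.from_bytes(b, "big")

key = hashlib.sha256(b"some-cid").digest()
peers = [hashlib.sha256(n).digest() for n in (b"peer-1", b"peer-2", b"peer-3")]

# Provider records for `key` would live on the peers closest to it.
closest = min(peers, key=lambda p: xor_distance(key, p))
```

Because hashes are uniformly distributed, responsibility for records spreads roughly evenly across the peer ID space.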
Delegated routing over HTTP is not a routing system but a general API to offload routing work. This is useful in browsers and other constrained environments where it's infeasible to be a DHT server. More broadly, it enables experimentation and innovation in content routing while maintaining interoperability.
Transferring — let the bytes flow
We started with the CID of the dog. Content routing found you the IPs of all the providers of the CID. Now it's time for you to fetch it.
In the spirit of optionality and modularity, IPFS supports more than one transfer protocol. The two most broadly used transfer protocols are:
- Bitswap: a message-based protocol for exchanging blocks (addressed by CIDs) of data. With Bitswap, an IPFS node can download blocks of a large file or directory from multiple peers simultaneously.
- IPFS Gateways: a request/response HTTP API for fetching CIDs. Like delegated routing over HTTP, gateways are useful on the web and in constrained or short-lived environments.
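The idea underlying block-based exchange like Bitswap can be modeled in a few lines, though nothing here matches the real wire protocol: split a file into hash-addressed blocks, ask for the blocks you want, and reassemble them verifiably no matter who served each one:

```python
import hashlib

# Split a file into fixed-size blocks, each addressed by its hash.
file_bytes = b"imagine these are the bytes of the dog photo " * 20
blocks = [file_bytes[i:i + 256] for i in range(0, len(file_bytes), 256)]

# A stand-in for "some peer, somewhere, holds this block".
store = {hashlib.sha256(b).digest(): b for b in blocks}

# Our "want list" is the ordered block hashes; any mix of peers can
# serve them, and each block is verifiable against its own hash.
wanted = [hashlib.sha256(b).digest() for b in blocks]
fetched = b"".join(store[h] for h in wanted)
assert fetched == file_bytes
```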
That's it for my overview of IPFS. It skims over many details for the sake of brevity.
What the blog post gets right about IPFS
It’s a bit different to the torrent, in that all files exist in a global namespace that anyone can publish to. It’s like one big share drive.
IPFS requires that nodes republish the content ID’s that they are hosting every 24h to the peers. This is quite network-intensive.
IPFS-in-practice is developing more centralized solutions to this problem, like IPFS Network Indexers (IPNI).
That is true. As mentioned above, IPFS favours modularity and the network indexer is an example of that.
What the blog gets wrong about IPFS
There are generally two classes of IPFS nodes - those that are public resources, where you can publish files and it will host them for free (Cloudflare runs a node for example) - and nodes that only host content they are interested in, which is referred to as “pinning”.
This distinction isn't very helpful. It's true that there are free public-good IPFS gateways, like ipfs.io and cloudflare-ipfs.com, that can be used to fetch pretty much any available CID, but you can't publish to them or use them for free hosting. They may cache a CID's contents for a period, but none of them provide guarantees about this, and cached content may get garbage collected at any time.
Content is hosted in the style of the Kademlia DHT, where it is replicated on the closest 10 peers. The “closeness” proximity metric is the XOR function of Kademlia - ie. the dist(content, peer) = content_hash XOR peer_id - so generally speaking content is distributed uniformly across the peers.
It's not content that is hosted on the DHT. The DHT only holds provider and peer records that help you find who has a CID's contents, i.e. the blocks, and how to connect to them, i.e. their IP addresses.
IPFS also features a “mutable file system” in the form of IPNS.
IPNS is better thought of as a mutable link system.
The way this works is that your “IPNS name” is the hash of your public key used for publishing.
Public keys are only hashed if they are too long. IPNS supports Ed25519 keys, which are short enough to fit into a CID. For example, the IPNS name k51qzi5uqu5dhp48cti0590jyvwgxssrii0zdf19pyfsxwoqomqvfg6bg8qj3s has the public key inlined.
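The inline-versus-hash rule can be sketched like this. The 42-byte threshold mirrors the libp2p peer ID convention, but treat the exact numbers as illustrative:

```python
import hashlib

def name_digest(pubkey: bytes) -> bytes:
    # Sketch of the rule above: short keys (e.g. 32-byte Ed25519) are
    # inlined verbatim (an "identity" multihash); longer keys
    # (e.g. RSA) are hashed first.
    if len(pubkey) <= 42:
        return pubkey
    return hashlib.sha256(pubkey).digest()

ed25519_key = b"\x01" * 32   # stand-in for a real Ed25519 public key
rsa_key = b"\x02" * 270      # stand-in for a ~2048-bit RSA public key
assert name_digest(ed25519_key) == ed25519_key  # inlined as-is
assert len(name_digest(rsa_key)) == 32          # hashed to 32 bytes
```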
IPFS runs [peer discovery] over TCP. Theoretically, BT finds peers much quicker due to the overhead in TCP handshaking + libp2p encryption.
This may be true, but it's hard to compare results from different benchmarks of decentralised networks. I will say that, thanks to libp2p, connections in IPFS are transport agnostic, and the most widely supported transport on IPFS is QUIC, which runs over UDP.
Broadly-speaking, IPFS exhibits a federated network architecture, whereas BitTorrent is more maximally decentralized.
Actually, IPFS with the DHT and BitTorrent with the mainline DHT are pretty similar in that respect. However, the modular nature of IPFS does allow for parts of the network to be more "federated".
An IPFS node is largely more intensive in every way by default.... BitTorrent nodes by comparison, are extremely lightweight.
Some BitTorrent implementations may be more efficient in some respects, but there isn't anything fundamental about IPFS nodes that makes them more intensive. Granted, Bitswap is very chatty. But if you are using BitTorrent with the mainline DHT, you still have the overhead of interacting with the DHT.
In IPFS, you are providing a chunk of your storage to everyone for free (the commons).
This is just wrong. The confusion may arise from the fact that IPFS uses one global namespace, but that namespace only concerns addressing: every CID is addressable by anyone. Running an IPFS node doesn't donate your storage to others; a node only stores and provides the content it has explicitly added or pinned.
In practice, IPFS content gets banned from certain nodes and gateways
Public gateways are a double-edged sword. On one hand, they provide a bridge that anyone can use from their browser. On the other, they easily become targets of abuse.
Seeders on BitTorrent also regularly disappear due to legal actions following copyright infringement. After all, the IPs on both BitTorrent and IPFS are public, and seeding or providing copyrighted content can have legal consequences depending on the jurisdiction you (or the node) are in.
the IPFS protocol is hugely resource-intensive.
IPFS is not just one protocol. Some of the specs/protocols are indeed resource-intensive. There's ongoing work to improve many facets of these challenges that are well understood at this point.
One team who was building an alternate high-performance IPFS implementation, iroh, has since broken rank and moved in a new direction for many of these same reasons.
I'm very excited about Iroh and have deep respect for the number0 team. I should note that there's a bit more nuance to this. They are experimenting with building a new kind of IPFS system. They still use CIDs, and while the interoperability story with other implementations isn't fully fleshed out just yet, they are actively participating in the IPFS implementors working group. What's more, they are demonstrating just how compelling content addressing may be for syncing bytes.
Liam, if you read this, thank you for writing that blog post. It's a useful resource and it provoked a lot of thoughts, despite some errors.
The IPFS ecosystem can be complex to navigate. But there are a lot of exciting things happening. The IPFS Community Calendar is a good way to find out about working groups and events.