Blockchain for the Lab

Most people are still sceptical about Blockchain, and to them, the talk about the "Blockchain Revolution" or "the new Internet" is just a means to further hype cryptocurrencies. However, laboratories, research and development, pharma, chemistry and biotech should look into the topic, because the technology of decentralization has tremendous potential, particularly in regulated areas.

I have to admit: until a few months ago, I had no idea about cryptocurrencies and little idea of blockchain technology.

What sparked my interest sometime in 2016 was the idea of using a blockchain as the basis for a decentralized system for storing laboratory data and results.

Researchers have an enormous fear of entrusting their data to anyone else. I do not want to go into detail about whether this makes sense, but of course the concerns are reasonable to some degree. Data is the capital of each and every researcher.

But what impact does this fear have on innovation in the scientific software market? Quite simply, software vendors have to offer solutions that can be installed locally in the researcher's domain - and that's expensive. Developing and maintaining on-premise solutions, managing different installations and multiple codebases, and providing additional support is much more costly than maintaining a single cloud application. That's why it's harder for young, innovative companies to offer low prices in this market than it would be in other markets.

For example: a software product offered purely as a cloud solution (one codebase, one release plan, one support channel ...) comes at a price of 10 € per user per month. However, renowned and valuable customers can only be won by offering an on-premise solution. The mere effort of maintaining this solution can make it necessary to increase prices to 15 € per user per month - not to mention slower release cycles and the considerable effort of initially developing the on-premise version. For software that is actually meant to facilitate collaboration across organizational boundaries, there are also significant implications for functionality, since features for exchanging data between one standalone installation and another are nearly impossible to implement - at least not if an affordable price level is to be kept.

Of course, these high prices and constraints make it more difficult to reach a critical mass of paying customers, especially in the academic field, where budgets are scarce. I personally believe that many good, innovative ideas fail because of this dilemma.

That's why every innovation-driven company in the scientific software community faces this burning question: how can the on-premise problem be solved? What can convince researchers, pharma, chemistry and biotech companies, as well as large scientific institutions, to trust storing their data in the cloud?

Blockchain technologies could be a possible answer: ambitious blockchain startups such as Sia, Storj, MaidSafe and Shift have been around for quite some time now, aiming to decentralize cloud storage and to make it more secure and even cheaper using blockchain technologies. There are also approaches to revolutionize data storage in general, such as the "Interplanetary File System" (IPFS) or the Swarm protocol for the Ethereum blockchain.

All of these approaches have the potential to make the research community rethink its extremely conservative views on data storage, thus opening the door to faster innovation cycles in scientific software.

Decentralized cloud storage

Actually, the history of decentralized cloud storage begins with peer-to-peer (P2P) sharing, Napster & Co. Of course, this Wild West period of the web has shaped the perception of an entire generation of Internet users: data is accessible to everyone, mostly on dubious websites.

I’m mentioning this because this generation may now have to deal with the evolution of P2P networks and blockchain technology, perhaps in very serious areas such as the research industry. And I strongly suspect that the perception shaped in the early days of P2P networks makes it difficult to accept blockchain technology, cryptocurrencies etc. as a serious approach to solve problems in a serious industry.

In contrast to the former hosting solutions, P2P networks always had a decisive advantage: they were much more decentralized and provided unparalleled redundancy. That's why it was - and still is - so difficult or impossible to stop these networks.

In my opinion, the perception of P2P networks as a legal gray area comes from the fact that the benevolent "sharing is caring" ideology, from which these networks originated, wears off eventually. At some point there are many "leechers", who only download data without giving anything back in return. By comparison, there are only a few "seeders" who host data and keep it available. And that's exactly what leads to centralization. In addition, the seeders who still participate in such an environment are no longer driven by ideology, but develop dubious methods to finance themselves (some keywords: gaming, violence, nudity). And that's what shapes the perception of P2P networks.

In the meantime, big hosters like AWS have also developed decentralized infrastructure: the data is stored redundantly in many data centers. However, the key difference to P2P lies in the "contract" between the data provider and the hoster: in the case of AWS and others, there is only one third party involved, only one party you have to trust.

And that's actually what researchers fear: I have to entrust my data, my most important asset, to someone else - with an emphasis on "one".

This is exactly the problem that blockchain-based P2P cloud storage providers try to tackle: to create incentives so that P2P hosters are paid for providing storage space and keeping data accessible. They get Bitcoins, Filecoins or other cryptocurrencies for seeding data. These incentives are meant to prevent centralization as it happened in Napster and BitTorrent times. Instead of paying € 10,000 a month to AWS, € 1 per month could be paid to 10,000 seeder nodes, and at the same time a much higher level of security and independence could be achieved. The decentralized cloud storage providers also promise to be much cheaper than the large hosters.

Technology-wise, documents (even large amounts of data) are encrypted locally and then stored on different computers of the P2P network. This ensures access control and security.
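
To make the principle tangible, here is a minimal sketch of the client-side encryption step, using Python's `cryptography` package. The symmetric key and the sample document are my own illustrative choices; real providers such as Sia or Storj use their own, more elaborate schemes.

```python
# Minimal sketch: a document is encrypted locally before it is
# handed to the peer network, so hosting nodes only ever see
# ciphertext. Requires: pip install cryptography
from cryptography.fernet import Fernet

key = Fernet.generate_key()        # stays with the data owner
cipher = Fernet(key)

document = b"Sample 42: pH 7.4, yield 82%"   # stand-in for a lab file
ciphertext = cipher.encrypt(document)

# Only the ciphertext is distributed across the peers; without the
# key, a hosting node sees nothing but random-looking bytes.
assert cipher.decrypt(ciphertext) == document
```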

The integrity of the data is guaranteed through the blockchain's consensus mechanism: the data (or its hashes, i.e. the unique, cryptographic representation of the data) is checked at each event that involves the file (a "transaction"). If, after the transaction, the hash on one or more nodes doesn't match the majority of the hashes, the data on those "malicious" nodes is overwritten with the correct data (consensus mechanism). Depending on the system, malicious nodes can also be expelled from the network or have to pay a penalty. Thus, unlike in the Wild West period of Napster & Co., the participants of the network enter into a kind of multilateral contract that is created, maintained and enforced by immutable lines of code.
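
As a toy illustration of this consensus check (my own simplification, not any particular provider's protocol): each node reports the hash of its copy, the majority hash wins, and deviating replicas are flagged for repair.

```python
# Toy integrity check: hash every replica, take the majority hash
# as the consensus, and flag nodes whose copies deviate.
import hashlib
from collections import Counter

replicas = {
    "node-a": b"Sample 42: pH 7.4",
    "node-b": b"Sample 42: pH 7.4",
    "node-c": b"Sample 42: pH 9.9",   # corrupted or malicious copy
}

hashes = {node: hashlib.sha256(data).hexdigest()
          for node, data in replicas.items()}
majority_hash, _count = Counter(hashes.values()).most_common(1)[0]

for node, h in hashes.items():
    if h != majority_hash:
        print(f"{node} deviates from consensus -> overwrite and penalize")
```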

Recap: Now, what is our use case again?

One thing we need to clarify at this point: the use case described in the beginning (how to make the cloud more attractive to researchers?) involves more than just the decentralized storage of data or documents. We want to store and run an entire program, which then works with decentrally stored data. Therefore, we're looking at the use cases "decentralized data storage" plus "decentralized application".

For decentralized data storage in science, we want to

  • store large documents, or at least allow access to large files over the network,
  • precisely control access to the documents,
  • efficiently search for documents, which requires documents to be stored in a structured manner,
  • quickly upload and download documents.

We expect a decentralized application for the management of research to

  • work with decentrally stored data, i.e. to create and edit new data and documents and possibly also to delete them, in cases where good scientific practice or regulations are not violated,
  • allow us to search for data and aggregate data,
  • provide everything offered by a centralized application, i.e. a great user experience, good performance, etc.

To recapitulate the problem one more time: we have these use cases only because ...

  1. science, pharma, chemistry and other regulated areas are still extremely afraid of cloud applications and cloud storage,
  2. this results in the need for on-premise solutions, local, shielded installations, etc., which then in turn results in
  3. increased complexity, higher costs and slower pace of innovation.

My hypothesis is that new blockchain-based or, more generally, decentralized approaches can help overcome this fear, reduce complexity and thereby cut costs and increase the pace of innovation.

What approaches are there, and what solutions do startups offer?

I have to admit that even after pretty intense reading and research, it is still rather unclear to me which approaches are really relevant and which are not. Many approaches touch on the topic but are actually intended as solutions for other areas. As in the early days of the Internet, many startups focus on the consumer sector, and it's difficult to tell which ideas and concepts can be applied in the B2B sector. I'm sure that everyone is in this situation of vagueness to some degree, because the topic is still very new. Once you spend some time studying the subject, however, you realize its true game-changing potential, and find yourself generating thousands of new ideas and potential opportunities.

I believe all those ideas and opportunities will consolidate eventually. At present, however, I think it makes sense to share an unconsolidated, broad view of all the approaches out there.

In that respect, which approaches have crossed my path?

Sia, Storj, MaidSafe, Shift: To my understanding, these are all decentralized alternatives to established cloud storage solutions such as Dropbox, Box & Co. However, the storage is not provided by a central organization, but by many peers in a network. Unlike in traditional P2P networks, these peers are rewarded for providing storage space and access to files, so centralization should be prevented. Cryptographic methods are used to ensure a high service level and reliability. These approaches are mainly intended for the storage of static documents, so they can be used to solve use case 1, but they won't help much with implementing a complex, decentralized application for working with decentrally stored documents (use case 2).

IPFS, Interplanetary File System. The name of this project is pretty awesome, and after reading extensively about it I would definitely not describe the name as megalomaniac! The main idea behind IPFS is that documents in the network no longer have a URL - that is, an address that points to the storage location on a specific server or server cluster (e.g., AWS). Instead, they get a unique cryptographic name, a hash. But this is more than a mere renaming: the hash is a unique ID for each document that is uploaded. After upload, the document is copied and distributed to different peers / nodes / "mini-hosters", and all copies retain the exact same hash. When requesting or using the file, your computer no longer looks for a specific address (an IP address), but looks for the mini-hoster closest to you that lists a document with that ID. In that way, the file is nowhere specifically, but still comes from somewhere as soon as you request it. You cannot decide from which node you get a copy of the file; the network decides based on proximity to the hoster. This is why the project creators call it "Interplanetary": when we become an interplanetary species, it's going to be enormously important to find the copy of a document that is closest to you instead of following a specific address that may be too far away to be quickly accessed, e.g. on a distant planet while a space station closer to you hosts the same file. The current Internet is quite inefficient in that respect because it works with many unnecessary copies of documents, all of which have different addresses.
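
The difference between location addressing and content addressing can be shown in a few lines. The bare SHA-256 digest below is only a stand-in: real IPFS encodes hashes as multihash-based content IDs (CIDs), and the URLs are made up for illustration.

```python
# Content addressing in a nutshell: the "address" is derived from
# the bytes themselves, not from where the file happens to live.
import hashlib

document = b"Western blot, lane 3, anti-GFP"
content_id = hashlib.sha256(document).hexdigest()

print("location addressing: https://myserver.example/files/blot.png")
print("content addressing : ipfs://" + content_id)

# Every node holding an identical copy advertises the same ID, so
# the network can serve the request from whichever copy is closest.
```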

To prevent centralization, the same methodology is used as in Sia & Co.: everyone can provide storage space, and for providing storage space, reliability, etc., the hosters get crypto money, in this case Filecoin.

Furthermore, the method for exchanging files is based on BitTorrent technology, meaning that not complete files, but only small chunks of large files are exchanged between nodes. This bypasses bottlenecks and handles large documents extremely efficiently.
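
A rough sketch of that chunking idea, assuming simple fixed-size pieces (IPFS's default chunker also cuts files into 256 KiB blocks, which it then links into a Merkle DAG; the linking is omitted here):

```python
# Split a large file into fixed-size chunks, each addressable by
# its own hash, so different pieces can be fetched from different
# nodes in parallel.
import hashlib

CHUNK_SIZE = 256 * 1024   # 256 KiB

def chunks(data: bytes, size: int = CHUNK_SIZE):
    for offset in range(0, len(data), size):
        piece = data[offset:offset + size]
        yield hashlib.sha256(piece).hexdigest(), piece

large_file = b"\x00" * (3 * CHUNK_SIZE + 100)   # stand-in for a dataset
for chunk_hash, piece in chunks(large_file):
    print(chunk_hash[:12], len(piece), "bytes")
```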

One drawback I've read about but have not verified in detail is that anyone who knows the hash or file ID has access to the file. To address this, the authors of the article suggest layering a blockchain in between, which controls access to the hashes on a second level.

However, IPFS is all about the exchange of files, that is, unstructured, static documents that are no longer being worked on. Thus, relevant for use case 1, but (currently) not relevant for use case 2.

Swarm. This protocol is essentially the equivalent of IPFS for the Ethereum blockchain, with the same pros and cons. Thus, currently only relevant for use case 1.

DaMaHub (Data Management Hub) is an implementation of IPFS for the decentralized storage of scientific data. This seems to be an interesting community that has already gained the British Library as a partner. However, as it is based on IPFS, the same assessment applies: relevant for use case 1, but (currently) not for use case 2.

Arweave is a new competitor to the previously mentioned decentralized file storage players like Sia and Storj (I was about to write "competitor to the established players", but obviously, nothing in this space is really established yet...). A claimed advantage over the services mentioned above is that Arweave does not reward providing storage space, but making a file accessible when it is requested. This leads to more efficient use of storage space, so that Arweave doesn't charge a monthly subscription fee based on storage space used, but a one-time fee for the permanent storage of a file. I'm mentioning Arweave separately because, first, there is something pretty interesting for the scientific community: Arweave cooperates with Charité Berlin, Europe's largest university hospital. Biomedical researchers from Charité want to use the Arweave protocol to set up a new scientific journal to tackle the reproducibility crisis. A second reason why I find Arweave pretty interesting is that the team has addressed the issues of decentralized collaboration on data and decentralized authorship. In any case, I see a clear relevance for our use case 1, and with the concepts for decentralized collaboration and authorship, I see a lot of potential relevance for use case 2.

BigChainDB claims to be a database with blockchain characteristics that provides decentralized control. I'm not enough of a computer scientist to fully evaluate the possibilities and limitations; however, this article here suggests the concept fails in one essential aspect, namely that you would need to be able to trust all the peers across which the database is distributed. Every node has permission to write into the database, without a consensus mechanism controlling which writes are right and which are wrong. This is not a failure of the BigChainDB implementation as such. The article refers to the CAP theorem, and if I got this right, decentralized work on data in a database is not possible without sacrificing either availability (A) or consistency (C) - in a distributed system that must tolerate network partitions (P), you can only fully guarantee one of the other two. With a secure, performant and decentralized database, both use cases would be perfectly dealt with; however, the question seems to be: "Is a decentralized database feasible at all?" And if so: "How?" ... and in case we can clarify the "How?", we're still left with the "When?", "At what cost?", and "Under which conditions?" to validate this decentralized database in terms of our use cases.

Cryptowerk is a German startup that addresses the scaling problem of public blockchains like Ethereum by compressing encrypted customer data, increasing the number of transactions to up to 600,000 per second. For comparison: the Bitcoin blockchain manages 7 tps, Ethereum 15 tps. As the use cases presented by Cryptowerk suggest, their application was not designed to handle large, dynamic files (use case 2), as would be necessary for collaboration on decentrally stored laboratory data. As I have already presented a number of solutions for the decentralized storage of large files (use case 1), my decision to cover Cryptowerk was based on the fact that the innovative lab tech startups Essentim and Innome use the Cryptowerk solution to not only transfer sensor data into their software platforms, but to store it on a public blockchain at the same time. This makes it possible to immutably store measurement data as well as the history of data transfers (in this case, the Bitcoin network is used). The transactions include only small chunks of data (sensor measurements), which of course remain static, so the main benefit is an immutable proof of data integrity. Even though this is not really relevant to either of our use cases, I think it is really interesting because one could elaborate on this approach in terms of an integrated laboratory audit trail. Within an IoT-based laboratory environment of connected instruments and software, a consistent, immutable history of all the data transfers between the various peers of the network (devices, software, storage, etc.) would be extremely helpful.
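
To illustrate the general anchoring pattern (this is my own sketch, not Cryptowerk's actual API): hash each measurement, combine the hashes into a single root, and store only that root on a public chain, so any measurement can later be proven unchanged.

```python
# Sketch: condense many sensor readings into one root hash that
# could be anchored on a public blockchain as proof of integrity.
import hashlib
import json

measurements = [
    {"sensor": "temp-01", "value": 21.3, "ts": 1718000000},
    {"sensor": "temp-01", "value": 21.4, "ts": 1718000060},
]

leaf_hashes = [
    hashlib.sha256(json.dumps(m, sort_keys=True).encode()).hexdigest()
    for m in measurements
]
root = hashlib.sha256("".join(leaf_hashes).encode()).hexdigest()

print("anchor this root on-chain:", root)
# To audit a reading later, recompute its leaf hash and the root;
# any tampering with the stored data changes the anchored value.
```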

IOTA. Because I covered Cryptowerk, I'll also cover IOTA. IOTA is actually a cryptocurrency that does not use a traditional blockchain, but a so-called "tangle". I will not deep-dive into the technology, but as with Cryptowerk, the Tangle technology allows many more transactions per second than traditional blockchains. Therefore, IOTA is designed to enable rapid communication and payment between machines in the context of IoT, e.g. for the collection and processing of sensor data. The approach seems to be very promising, since by now Volkswagen, Bosch and InnoEnergy have started cooperating with IOTA.

Since I've already deviated from covering approaches that tackle the original use case anyway, I'll also quickly mention the think tank "Blockchain for Science". In this article, the researchers Sönke Bartling and Benedikt Fecher summarize how blockchain-based systems could help in the distribution of grants, the reproducibility of research results and the awarding of credit.

What are the roadblocks for blockchain for scientific applications?

To be precise, one should ask what roadblocks exist for secure decentralization. Some of the approaches covered above don’t use conventional blockchains, but rather similar or derived technologies.

Actually, looking at those approaches tells us a lot about the main hurdles at the current state of the technology: as already mentioned, the classic blockchain is slow - Bitcoin is the most prominent example. But it's meant to be: it is slow by design! Blockchain technology was not designed to be fast or to handle large chunks of data, but to be extremely secure without the need for a centralized instance. This tradeoff is totally fine for the original use case of blockchain technology, namely handling transactions containing the information "who has sent how much money to whom?". However, as soon as you want to handle large files (e.g. in decentralized file storage) or more frequent events (e.g. a lot of sensor data), you run into a scalability problem.

Each of the new approaches presented above tries to solve this scalability problem in order to apply decentralization technology to different business areas.

Conclusion 1: Large, static data: yes. Small, dynamic data: yes. Large, dynamic data: (currently) no.

In brief: for storing and working on large, dynamic documents, as well as searching for encrypted content in a network, there is currently no real alternative to trusting the big hosters such as AWS. This means that for a research data management tool like labfolder, we have to wait for the infrastructure for building decentralized applications to mature. But looking just at the technical side of things, my gut feeling tells me we won't have to wait much longer.

For small, less complex data - whether static or dynamic - many approaches work even with traditional blockchains. For large, static documents, blockchain-based or related technologies are also a serious alternative.

What's definitely worth elaborating on is the topic of an integrated, blockchain-based audit trail: each device, piece of software, electronic laboratory notebook and analysis tool could be a node in this "labchain" and help to validate the data transmissions within the network.
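
A hash-chained audit log is the simplest form such a "labchain" could take; the device names and event fields below are purely illustrative:

```python
# Sketch of a hash-chained audit trail: every entry commits to its
# predecessor, so rewriting any past record breaks all later hashes.
import hashlib
import json

def append_event(trail: list, event: str) -> None:
    prev_hash = trail[-1]["hash"] if trail else "0" * 64
    entry = {"event": event, "prev": prev_hash}
    entry["hash"] = hashlib.sha256(
        json.dumps(entry, sort_keys=True).encode()
    ).hexdigest()
    trail.append(entry)

trail: list = []
append_event(trail, "balance-07 -> ELN: weight 1.203 g")
append_event(trail, "HPLC-02 -> analysis-sw: run 881 finished")

# Tampering with an earlier entry invalidates every subsequent
# hash, which the other nodes in the network would detect.
```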

Decentralized file storage could also be a valid solution for archiving completed experimental data. This data is inherently static, and whatever storage system you choose, it needs to ensure the data cannot be modified after archiving. In this respect, the new decentralized file storage solutions such as Storj, Arweave and IPFS are certainly a very serious alternative to local data storage.

Conclusion 2: Wrong Perception Inhibits Mindshift

The second part of my conclusion is that a real mindshift needs to happen in order to harvest the already existing potential and, despite the hurdles, begin to explore this new internet.

This mindshift doesn't only need to happen in "low-tech" areas, but definitely also in high-tech areas such as science.

To illustrate my hypothesis on the current mindset, here's an illustration of how cloud storage is currently being perceived in the lab community:

Researcher: I don't want to entrust my data to a third party. I want to remain in "physical possession" of my data; there is no trusted third party (e.g. a hosting provider) who can take better care of it than me or the IT department of my organization.

Here's how decentralized data storage or blockchain inspired technologies should be perceived:

Researcher: I entrust many third parties with my securely encrypted data. These third parties mutually control each other, and if anyone behaves maliciously or data is corrupted, the defect is corrected automatically and the "black sheep" are expelled or penalized. There is no single third party that controls this mechanism; rather, the system is governed by a "benevolent, autonomous machine that never forgets".

Perhaps that's the most important thing for everyone to understand: blockchain is no longer about trusting a human or a group of people - it's about trusting a machine or a network of machines. But since we are humans and know nothing other than trusting other humans via complex legal systems, I guess that's why we find this important fact so hard to comprehend.

Therefore, this is my hypothesis on how decentralization is actually being perceived by 99% of the population:

Conclusion

In particular, researchers are certainly not too stupid to understand the technology, and I believe that initiatives such as "Blockchain for Science", DaMaHub and Arweave show that some scientists, or organizations with ties to the scientific world, have understood the technology and begun to capitalize on its potential.

Nevertheless, the majority of the research and laboratory community also displays a tendency not to apply what they haven't understood 100% themselves. The strength of the research community is its ability to grasp complex matters, but this is also a serious roadblock: as long as researchers haven't understood something, they argue against it with everything they've got.

So let’s hope for more content that makes the topic even more understandable - I'll definitely keep writing about it!

Good Heise article from November 2017: "Distributed storage: documents on the blockchain" (German): https://www.heise.de/developer/artikel/Verteilter-Speicher-Dokumente-an-der-Blockchain-3880570.html?page=all

Great article on the advantages and disadvantages of decentralized file storage and distributed apps (Sep 2017): https://github.com/TiesNetwork/ties-docs/wiki/Where-do-Decentralized-applications-store-their-data%3F

Another recently published article (Jun 2018), which deals with the disillusionment around Sia, Storj & Co. - but hypes Filecoin and IPFS: https://hackernoon.com/a-crypto-traders-diary-week-11-storage-coins-51da93530623

Article about Arweave (Jun 2018): https://www.forbes.com/sites/shermanlee/2018/06/08/blockchain-is-critical-to-the-future-of-data-storage-heres-why/#7526096433e9

Theory of IPFS, video (German, Jan 2018): https://d.tube/v/sempervideo/a0r7uhpf - Fun fact: the video is made available via DTube, the first "crypto-decentralized video streaming platform", which also employs IPFS (https://about.d.tube/). I watched the video once from Berlin - no problem. Two days later I tried to watch it from the vicinity of Solingen and got the following message: "The media could not be loaded, either because the server or network failed or because the format is not supported." - Maybe a hint that we're still pretty far from "Interplanetary".

Great video on IPFS (Oct 2017): https://youtu.be/BA2rHlbB5i0

Post on combining IPFS and blockchain, even using a lab record / healthcare data as an example (Feb 2018): https://medium.com/@mycoralhealth/learn-to-secure-share-files-on-the-blockchain-with-ipfs-219ee47df54c

About Ethereum Swarm and what it's used for (2016): https://ethereum.stackexchange.com/questions/375/what-is-swarm-and-what-is-it-used-for

