Down the Rabbit Hole – DNA Data Storage, James Banal

James Banal is a Postdoctoral Researcher at Massachusetts Institute of Technology (MIT) in Professor Mark Bathe’s laboratory. While Banal is a chemist by training, one of his current projects is harnessing nucleic acids for some unusual purposes, such as DNA data storage. Banal is also the Technical Founder of a biotech start-up, Cache DNA, which aims to provide a unique solution for century-scale storage and access to nucleic acid samples.

Please note the transcript has been edited for brevity and clarity.

FLG: Hello, everyone, and welcome to the latest interview in the Down the Rabbit Hole series. Today, we’re going to be talking about a really cool area, which is DNA data storage with James Banal. So, James, if you could just introduce yourself and tell everyone a little about what you do as well.

James: I’m James Banal. I am a postdoc, currently at MIT in Professor Mark Bathe’s lab. I am a chemist by training and tried to dabble a little bit with biology. I’m here now at MIT working on some of the wacky stuff we can do with biology, like quantum computing and the topic today, DNA data storage.

FLG: What does your main research focus on at the moment? Or, more importantly, what is your favourite part of what you do?

James: Yeah, so what’s funny is that DNA data storage, this project, was my backburner project, it wasn’t my main project. So, the actual main project I’m working on was how we can use DNA to imitate what photosynthetic systems look like. And essentially trying to figure out whether there are any quantum effects in there. DNA data storage was this backburner project that I took over from another postdoc and we were trying to figure out how we build essentially a Google version for data storage.

The most exciting thing I think, for me being a chemist by training, is how powerful biology is. I’m always very overwhelmed by the fact that you can do so much in biology. In chemistry, we have to mix chemicals, but with biology you have another hand that can do a lot of stuff for you essentially, because you have bacteria and stuff like that. That’s the most exciting thing for me, being able to re-engineer the biology for some unusual purpose.

FLG: As you mentioned you are a chemist by training, so what has it been like shifting to biology? You must have to learn on the job.

James: Yeah, yes, definitely learning on the job. It was a struggle at the beginning, simply because the language we use is different. The language of chemistry versus biology is sort of different, and the methods are different. I’ve also worked a little bit on physics. One of the craziest things is you’re talking to computer scientists, you’re talking to biologists, you’re talking to physicists, and you have to somehow bridge the vernacular between the different areas. And so, it was quite tough in the beginning, but like you said, you’ve got to learn when you are on the job, and having the chemistry background did help. But it was a struggle in the beginning.

FLG: Nucleic acids are often considered in relation to encoding our genes and proteins, and the building blocks of life. But the past year, particularly with mRNA vaccines, has shown us that they have other applications. What are some of the exciting ways in which nucleic acids are being used for therapeutics?

James: Actually, our lab is very active in that area. One thing to consider is the way we think about nucleic acids. It is basically beads on a string, and you have four different beads of different colours. But, in reality, you can also think about it as a polymer, like a chicken wire that you can bend in different ways. And that sort of thinking actually led us to a pretty smart fella named Nadrian Seeman at NYU, who discovered a way in which we can structure DNA using some of the motifs found in biology. He created this novel idea of how we can structure DNA in different shapes. Then, Paul Rothemund at Caltech developed a very cool way to fold DNA. That’s what we call DNA origami. You can make DNA nanoparticles of different shapes, sizes and dimensions. When you think about that, it’s a scaffold. And so, that’s what my work on photosynthetic systems was about, exploring how I could mimic some of the structure and nanoscale organisation of stuff using a DNA scaffold.

From a therapeutic sense, they look like viruses. We actually made this DNA origami that looks like a virus; it’s like an icosahedron. If you play D&D, you can actually make some of those dice using DNA origami. Then, because it’s DNA, you can synthetically modify it at base resolution, because of all the advances made in DNA synthesis. You can present antigens and control the different orientations of those antigens, which can actually create a whole different spectrum of immune response, or you could probably deliver CRISPR. What people don’t know yet is that this is a burgeoning area: how we can use DNA origami as a therapeutic vehicle. We’ve heard about liposomes and mRNA. But what if we could structure the mRNA and just essentially deliver it to the cell? I think this is the most exciting thing, thinking about that as a vehicle. So that’s an alternative way.

There are also a lot of different ways you can use DNA that are not therapeutic. We are going to talk about DNA data storage later on, but I think the other exciting thing is how we can use it to spatially position stuff on the nanoscale. This is something I’m very actively working on in the lab. The ability to structure DNA very, very precisely through this Watson-Crick-Franklin base pairing allows us to programme exactly the desired interactions. And then we know the mechanics of DNA. And so, we can have these 2D structures, which you can use to build up materials, and there’s just a lot of activity in trying to build electronics from the ground up. So that’s actually one of the other exciting areas besides therapeutics.

FLG: What are some of the challenges in this area of nucleic acid-based therapeutics?

James: There are a fair few challenges. One of the challenges is with scalability. Can you manufacture on a large scale? There’s been work in our lab that shows that we can create the nucleic acids that are essentially the scaffold. So how DNA origami works is you take a really, really long strand, several thousand bases, and you fold it using small strands of DNA. The long strand of DNA, that’s not very easy to make. You have to have specific sequences. The innovation in the lab is how we can make that at milligram scale with nearly arbitrary sequence specificity, on anything from 1,000 to 10,000 base pairs, using bacteria. And so, with that, the scalability issue should be solved.

Then, your next challenge would be how do you actually attach stuff on the DNA origami at scale. There’s a lot of chemistry that’s been involved in that. There’s a lot of the click chemistry that people are probably very familiar with. And so, a colleague in my lab has actually developed a way to functionalise the DNA origami particles using click chemistry and demonstrated in a very beautiful way how you can put different stuff on the DNA origami.

The next challenge then is the stability. Stability is a little bit challenging; I think it’s very early stages. Because it’s DNA, if we put it into the body, it might get degraded by some nucleases or it could be hidden somehow. There’s a lot of different things we don’t know yet about these DNA nanoparticles for delivery for therapeutic purposes. I think we need to make a huge effort on stability. And the next step is what therapeutic targets do we actually go after.

FLG: Another exciting area is using nucleic acids for data storage. For those who don’t know, would you be able to summarise what this is?

James: In a nutshell, DNA data storage is essentially converting the zeros and ones, the binary digits or bits of digital data, into the four letters of DNA: the A’s, C’s, T’s and G’s. Now, that’s an oversimplification of all the processes involved, but essentially, at the core, they’re using the beads in a strand to hold data. That has a tremendous impact on how you can actually store a lot of data in a very tiny space. So, that’s why there was a lot of motivation to think about the data. If we take a step back and ask why people are actually working on this, it is because the amount of data we’re generating is just immense. We’re going to run out of space to store it. With all the internet and TikTok, especially last year, it’s insane how much data we’re going to generate. It’s just going to grow exponentially.
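
As a rough illustration of the bits-to-bases idea described above, the sketch below packs two bits into each nucleotide. The particular mapping (00 to A, 01 to C, 10 to G, 11 to T) and the function names are assumptions made for this example, not the encoding used by any specific group; as James notes, the real process involves considerably more than this.

```python
# Hypothetical 2-bits-per-base mapping, chosen only for illustration.
BASE_FOR_BITS = {"00": "A", "01": "C", "10": "G", "11": "T"}
BITS_FOR_BASE = {base: bits for bits, base in BASE_FOR_BITS.items()}


def bytes_to_dna(data: bytes) -> str:
    """Turn each byte into 8 bits, then each pair of bits into one base."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BASE_FOR_BITS[bits[i:i + 2]] for i in range(0, len(bits), 2))


def dna_to_bytes(sequence: str) -> bytes:
    """Reverse the mapping: bases back to bits, bits back to bytes."""
    bits = "".join(BITS_FOR_BASE[base] for base in sequence)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


if __name__ == "__main__":
    encoded = bytes_to_dna(b"cat")
    print(encoded, dna_to_bytes(encoded))
```

Under this mapping every byte becomes four bases, so the three-byte string `cat` becomes a 12-base sequence that decodes back to the same bytes.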

FLG: What are some of the processes involved in DNA data storage?

James: For example, you take a picture of a cat, and you want to convert that into DNA. The process essentially is you convert that to a DNA sequence. Now, there’s a lot of technical detail there. Because we are limited in the length of strands by the current limitations of DNA synthesis, you need to figure out a way to segment the data into multiple strands and index them so you know the order in which they should be read. And then you add addressing barcodes so that you know what file it is, e.g. is it a picture of a cat, is it a video of a TikTok dance or something like that. That’s how we get to the DNA sequence. Next is synthesis. There are a lot of different ways to do synthesis. One of the most common ways is to use a microarray. Here, you grow the DNA polymer from the surface one letter at a time. They can do that in a very parallelised way using a lot of different technologies. Actually, a lot of it is from printer technology; most people don’t think about it as being like inkjet printing. And so, growing those strands one letter at a time is how you go from sequence to the actual physical DNA. And you store them. Once you’ve stored them, the way most people currently access them is the polymerase chain reaction. You attach primers, a bunch of Velcro molecules that only stick on specific sequences. It primes and amplifies the DNA, and you get many copies, and everything that doesn’t have those primers won’t get amplified, and that’s what you put through a sequencer. Then, there’s two flavours: one is Illumina and the other one is Nanopore sequencing. You read the sequence and then you convert it back to the digital file.
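
The segment, index, barcode and retrieve steps above can be sketched in a few lines. This is a toy model under stated assumptions: the `Strand` record, the barcode strings and the payload length are all hypothetical, and PCR selection is reduced to an exact match on the barcode rather than any real primer chemistry.

```python
from typing import List, NamedTuple


class Strand(NamedTuple):
    barcode: str   # hypothetical file address (stands in for a primer site)
    index: int     # position of this piece within the file
    payload: str   # the slice of the file's DNA sequence


def segment(sequence: str, barcode: str, payload_len: int) -> List[Strand]:
    """Cut a long sequence into short indexed strands that share one barcode."""
    starts = range(0, len(sequence), payload_len)
    return [
        Strand(barcode, i, sequence[start:start + payload_len])
        for i, start in enumerate(starts)
    ]


def retrieve(pool: List[Strand], barcode: str) -> str:
    """Keep only strands carrying the barcode, then reorder them by index."""
    selected = [strand for strand in pool if strand.barcode == barcode]
    return "".join(s.payload for s in sorted(selected, key=lambda s: s.index))


if __name__ == "__main__":
    cat = "ACGTACGTAC"
    dance = "TTGGCCAATT"
    # Strands from both files share one unordered pool.
    pool = segment(dance, "DANCE", 4)[::-1] + segment(cat, "CAT", 4)[::-1]
    print(retrieve(pool, "CAT"))
```

The pool is deliberately stored out of order to show that the index, not the storage position, determines how the file is reassembled.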

Because there’s so many steps, there’s a lot of things that can go awry. One of them is when you actually try to synthesise it, there’s still errors when you write it. And so, there’s a lot of different ways to mitigate that. There’s a lot of error correcting codes. Because you don’t synthesise only one molecule of DNA, you synthesise many molecules of the same strand, so you can win by that. But then there will be some errors. So, you need to figure out a way to correct for those using some error-correcting codes. So, essentially, you need to add that, and all those error-correcting codes will help downstream. For example, say you lose that one strand that is important, you can figure out the pieces that you need to fill in those blanks once you actually start decoding it. There’s a lot of technical details there. But essentially, that’s currently how it’s perceived to be done.
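
Two of the ideas in this answer, winning through many copies and filling in a lost strand, can be illustrated with very simple stand-ins. The sketch below assumes substitution errors only and equal-length reads for the majority vote, and uses a single XOR parity strand in place of the stronger error-correcting codes used in practice; all names are hypothetical.

```python
from collections import Counter
from typing import List

BASES = "ACGT"


def consensus(reads: List[str]) -> str:
    """Majority vote at each position across noisy copies of one strand."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*reads))


def xor_strands(strands: List[str]) -> str:
    """XOR equal-length strands position by position (A=0, C=1, G=2, T=3)."""
    values = [0] * len(strands[0])
    for strand in strands:
        for i, base in enumerate(strand):
            values[i] ^= BASES.index(base)
    return "".join(BASES[v] for v in values)


if __name__ == "__main__":
    # Many copies: isolated substitution errors are outvoted.
    print(consensus(["ACGTAC", "ACGTAC", "ACCTAC", "AGGTAC"]))

    # One lost strand: XOR of the survivors and the parity strand restores it.
    data = ["ACGT", "TTGA", "CCAG"]
    parity = xor_strands(data)
    print(xor_strands([data[0], data[2], parity]))
```

A single parity strand can only restore one missing strand per group, which is why real systems rely on more capable codes.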

FLG: What are the benefits of specifically using DNA to store data?

James: DNA is a pretty cool material. I actually was super impressed by it. One, we’re storing bits of data on the molecular level; each nucleotide is roughly 20-30 atoms. If you think about current devices, like in your phone, that’s a lot of atoms, it’s 1,000 atoms easily for a few bits. That’s how you get the density: you go down from silicon, which is big, to something that’s really tiny.

The other thing is the longevity of DNA. DNA is such a sturdy molecule. We’ve been able to extract nucleic acids from fossils. And those fossils, by the way, have experienced the harshest environments you can think of, and the data still survives. Think about your phone, or your disk drive, you put them in that same environment, they won’t survive that. They’d be gone. So, the durability and longevity of data that can be retained is very unique to DNA.

The other thing that I actually really like is that you can create a lot of copies very, very quickly. I’ve talked about PCR, but there are other things you can use, like bacteria. We’re not talking about just 2, 3, 4, or 5 copies, we are talking about trillions of copies, on the order of Avogadro’s number. I think that’s one of the most compelling things. The reason why you want to do that is because if you want to preserve data, you want to have multiple copies of it. Maybe you want to put it somewhere on Mars, ask Elon to ship some of our internet to Mars, so all the tweets from Trump or whoever get preserved forever. Put it everywhere! You can actually do that with DNA, because it’s durable and you can make a lot of copies. But if you start to do that with other devices, then you can’t really do it very easily because they won’t be able to survive and making copies is just too much.

The other thing is computation. That’s actually something that our lab is very interested in. The way we think about computation is very two dimensional: you move electrons around. With molecules, you’re taking advantage of the fact that these molecules are jiggling in solution, trying to find each other through diffusion. And that’s really fascinating. Because you basically have the ability, in theory, if you do it right, to have a number of CPUs on the order of Avogadro’s number in a very tiny space. It’s hard to find something like that in silico. I think, to me, this will be the biggest reason why DNA is probably going to be the next data storage to look at.

FLG: What are our current capabilities in terms of what we can store? What are we hoping to store in the future?

James: Currently, because of the exorbitant cost of DNA synthesis, it’s limited to pictures, videos, clips. Someone has already done a music mp3 file. Someone has actually stored a computer virus too, and a whole operating system, which is quite exciting. And so, we can literally preserve anything now, in the sense that all those files can in theory be converted to DNA. In practice, the only limitation is the cost of DNA synthesis.

The vision that we published on recently was, could you actually make it searchable? That’s something that I actually think is going to be important. Google’s value didn’t rise because of their ability to write data on the internet. It wasn’t that. Their value was being able to search stuff. So, I think the random access bit is important, which we worked on and are still working on to make it faster or better. It’s essentially like Google with its search engine. And so, we wanted to do the same thing and reach that level of complexity in how to search stuff. And that’s random access, because in theory, if we stored all the internet on DNA, you don’t want to sequence the entire DNA just to find a picture of a cat. That’s just not going to be a practical thing to do in terms of time and money.

Then, finally, on the sequencing side, there’s a lot of challenges there too. It’s still not cheap. We’re not talking about $10 per run, it’s probably still $100 or something like that per run. That is not counting the steps to actually make it compatible with the sequencer. Speed is another problem. Writing data on DNA is still slow. I think we’re in the megabytes now. Who knows, we don’t know about other companies. But I think we’re close to having two megabytes per day. Not per second but per day, which is just horrendous to think about. So, we’re not there yet.

The search thing is also slow. We’re at gigabytes per second now. But if you’re going to look at a whole exabyte, that’s going to take you forever. So, we have to make that faster, finding the needle out of a haystack needs to be faster. I think we can get there actually a lot easier compared to the other problems. Then, Nanopore and Illumina sequencing reading is also still quite slow. So, we need to make that fast. There are a lot of obvious ways to make that faster that I think these companies are working on.
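
To put the quoted speeds in perspective, here is a back-of-the-envelope calculation. It assumes decimal units, takes "two megabytes per day" as the write rate and reads "gigabytes per second" as exactly 1 GB per second for the search rate; both rates are rough figures from the conversation, not measurements.

```python
SECONDS_PER_DAY = 86_400
DAYS_PER_YEAR = 365.25

MEGABYTE = 10**6
GIGABYTE = 10**9
EXABYTE = 10**18


def days_to_write(size_bytes: float, bytes_per_day: float = 2 * MEGABYTE) -> float:
    """Days needed to write a file at the quoted synthesis rate."""
    return size_bytes / bytes_per_day


def years_to_scan(size_bytes: float, bytes_per_second: float = GIGABYTE) -> float:
    """Years needed to search through an archive at the assumed search rate."""
    seconds = size_bytes / bytes_per_second
    return seconds / SECONDS_PER_DAY / DAYS_PER_YEAR


if __name__ == "__main__":
    print(f"Writing 1 GB: {days_to_write(GIGABYTE):.0f} days")
    print(f"Scanning 1 EB: {years_to_scan(EXABYTE):.1f} years")
```

On these assumptions a single gigabyte takes 500 days to write, and scanning an exabyte takes a little over 31 years, which is the "forever" James refers to.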

FLG: You and your team at MIT recently developed a new retrieval technique. Would you be able to discuss why you developed this technique and how it works?

James: Yeah, the inspiration is actually Google. So how Google manages information is basically it has a file and underneath that file is a lot of metadata, like cat is orange, cat is domestic, cat is from New York or from Cambridge, UK. The metadata is how it actually populates the dataset because it matches the language or query that you put in the query box. That’s exactly what we wanted to do, we wanted to see if we could implement the same thing. One of the biggest things that we had to figure out was how do we separate the data file, the main data DNA sequence, from the metadata DNA sequence. Currently, as I mentioned with PCR, the metadata and main data are all on the same strand. The problem with that is, as you scale from exabytes to yottabytes, you’ll have some challenges with how many barcodes you can actually use. You have a limited length of DNA, so your real estate is very limited. So, you have to figure out how many barcodes will uniquely identify it.

We wanted to separate that problem altogether, because we want to be able to attach as many barcodes as we want without the restrictions of DNA synthesis. And so, that’s where the encapsulation came in. I’m a very bioinspired researcher, I’ve been doing photosynthetic stuff, so why not take how the cell actually manages our nucleic acids? The cell puts the main data, basically the DNA data, in our genome in the nucleus, and outside is where all the fancy signalling happens. And so, we thought, can we do the same thing synthetically? And essentially, that’s what we did. We encapsulated the DNA in silica, which is a very hardy material; it is glass essentially. We put multiple barcodes on the surface, different barcode combinations. We took them from this pool of 240,000 DNA barcodes that we know are orthogonal, so we only have to do the computation to do that. If you think about that 240,000, just 10 combinations of those barcodes can actually get you to a combinatorial explosion. You can basically, in theory, store every atom in the observable universe with that number of combinations.
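
The scale of that combinatorial explosion can be checked with a short calculation. The sketch assumes one simple reading of the claim: each capsule is labelled with an unordered set of k distinct barcodes drawn from the pool of 240,000. The figure of 10^80 atoms in the observable universe is a commonly quoted rough estimate, and the function names are hypothetical.

```python
import math

POOL_SIZE = 240_000            # barcode pool size quoted in the interview
ATOMS_IN_UNIVERSE = 10**80     # rough, commonly quoted estimate (assumption)


def label_count(k: int, pool: int = POOL_SIZE) -> int:
    """Number of distinct unordered sets of k barcodes from the pool."""
    return math.comb(pool, k)


def barcodes_needed(target: int, pool: int = POOL_SIZE) -> int:
    """Smallest k for which the number of barcode sets reaches the target."""
    k = 1
    while label_count(k, pool) < target:
        k += 1
    return k


if __name__ == "__main__":
    digits = len(str(label_count(10)))
    print(f"Sets of 10 barcodes: a {digits}-digit number")
    print(f"Barcodes per capsule to pass 10^80: {barcodes_needed(ATOMS_IN_UNIVERSE)}")
```

Under this particular counting scheme, 10 barcodes per capsule give on the order of 10^47 distinct labels, and the count passes 10^80 at 18 barcodes per capsule. The figure in the conversation may rest on a different way of counting, but the point stands either way: a few more barcodes per capsule multiply the address space enormously.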

That was a really powerful idea because that was the problem for DNA storage in terms of random-access scalable indexing. But we wanted to take that to the next level and actually build a file system. Because the analogy here is, it doesn’t matter if you can read and write data, if you can’t do anything with it other than just read and write. If you can compute it, then that becomes the most powerful. What we demonstrated essentially is we can do Boolean search logic, basically, how would you search something on Google, but now on DNA. We’ve done it experimentally, not theoretically. Initially, we had a lot of issues with it, we had to figure out a lot of different stuff to make it happen. But then, when we figured it out, when we did the first search, I was so happy that we made it a reality. Because everyone thought it couldn’t be done. But here we are.
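
The logic of a Boolean search over barcoded capsules can be modelled with ordinary sets. This sketch captures only the logic, not the laboratory procedure: the capsule names, the metadata tags and the `boolean_search` function are all made up for illustration, and a physical selection step is reduced to set operations.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical capsules: each file carries a set of metadata barcodes on its surface.
CAPSULES: Dict[str, Set[str]] = {
    "cat_orange.jpg": {"cat", "orange", "domestic"},
    "cat_black.jpg": {"cat", "black", "domestic"},
    "tiger.jpg": {"cat", "orange", "wild"},
    "dog.jpg": {"dog", "domestic"},
}


def boolean_search(
    capsules: Dict[str, Set[str]],
    all_of: Iterable[str] = (),
    any_of: Iterable[str] = (),
    none_of: Iterable[str] = (),
) -> List[str]:
    """Return capsules matching AND (all_of), OR (any_of) and NOT (none_of)."""
    required, optional, excluded = set(all_of), set(any_of), set(none_of)
    hits = []
    for name, barcodes in capsules.items():
        if not required <= barcodes:
            continue  # AND: every required barcode must be present
        if optional and not (optional & barcodes):
            continue  # OR: at least one optional barcode must be present
        if excluded & barcodes:
            continue  # NOT: no excluded barcode may be present
        hits.append(name)
    return sorted(hits)


if __name__ == "__main__":
    print(boolean_search(CAPSULES, all_of=["cat"], none_of=["wild"]))
```

Only the capsules that pass the query would then need to be opened and sequenced, which is what keeps the search from touching the whole archive.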

FLG: Once sequencing and DNA synthesis costs decline and the techniques are refined, how will this be made accessible to the everyday person? How are people going to use it?

James: There are a lot of models for this, like the Dropbox model. You essentially just put in your archival data. One thing I should say is that we will never be able to read out DNA faster than your computer in terms of searching and that. But if you allow for latency, it will be limited to data that’s archival, your backups or things that you don’t actually use frequently. There’s a lot of examples of that, like your photos from 2012, which you don’t even look at often. But essentially, what would happen is that a normal person would just upload something on the web, and then it gets converted to DNA and encapsulated and barcoded, based on the features, like how Google would do it. But in a very different way, in a sense that you have to print the DNA and store it archivally. I think that’s the way the average person would use it. Then, when you want to query you basically have to allow for say 24 hours for return of your file. The model for that is actually already there like AWS Glacier. Basically, if you want to store something archivally using the current paradigm of data centres, then you put it there but it’s not always hot. So, it really depends on how often you will be accessing it. I think that’s how other folks would figure out which one to do. Because it’s a lot cheaper to save on DNA because you don’t have to pay monthly fees like you would do for other services.

FLG: What role do you think machine learning will play in this area?

James: A lot actually. There’s two ways to think about it: how machine learning will affect DNA and how DNA can actually affect machine learning. So, let’s start with the first one, machine learning on DNA. There’s a lot of interesting ways of thinking about how we can design DNA sequences that are orthogonal. Designing orthogonality is actually not a trivial problem; most people don’t know that. There’s actually a lot of thermodynamics and a lot of physics involved. And doing that ab initio, from first principles, is computationally expensive. We know from reality that there are sequences that are orthogonal, and we basically have experimental models that can predict that to a certain extent. Merging those two together, machine learning can generate a library of orthogonal barcodes very easily, because it’s just computationally faster than doing it ab initio using models. I think there are people working on that area.
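
To give a feel for what screening barcodes for orthogonality involves, the sketch below uses a deliberately crude proxy: it keeps a candidate only if it differs in enough positions from every barcode already chosen, from their reverse complements, and from its own reverse complement. Real designs rest on the thermodynamics James mentions, which this ignores entirely; the threshold and function names are assumptions for illustration.

```python
from itertools import product
from typing import List

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """The sequence that would pair with seq, read in the same direction."""
    return seq.translate(COMPLEMENT)[::-1]


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def far_enough(candidate: str, other: str, min_distance: int) -> bool:
    """Crude proxy: candidate resembles neither other nor other's partner strand."""
    return (
        hamming(candidate, other) >= min_distance
        and hamming(candidate, reverse_complement(other)) >= min_distance
    )


def pick_barcodes(length: int, min_distance: int, limit: int) -> List[str]:
    """Greedily collect up to `limit` barcodes that pass the distance screen."""
    chosen: List[str] = []
    for letters in product("ACGT", repeat=length):
        candidate = "".join(letters)
        if hamming(candidate, reverse_complement(candidate)) < min_distance:
            continue  # would tend to pair with copies of itself
        if all(far_enough(candidate, c, min_distance) for c in chosen):
            chosen.append(candidate)
            if len(chosen) == limit:
                break
    return chosen


if __name__ == "__main__":
    print(pick_barcodes(length=6, min_distance=3, limit=8))
```

Even this toy version has to compare each candidate against everything already chosen, which hints at why doing the job properly, with physical models, gets expensive for hundreds of thousands of barcodes.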

The other one is more exciting, in my view: how could this parallelisation of reactions actually help machine learning? One thing to think about with DNA is that it’s two molecules that bind to each other at a certain strength, and that strength can be tuned. And so, what’s really interesting is that’s basically like machine learning. The relationships between, say, two different images or files can be connected together by some relationship, and you can also do that on DNA. And so, if you can do that, then all of a sudden, you can now use molecules. The reason why you want to do that is because, like I’ve said, it is this jiggling of molecules that you won’t be able to do in an in-silico computer. I think this is where it gets very interesting. Because one, it’s more energy efficient because they’re just using water and just jiggling molecules. And two, there’s the parallelisability, the amount of things you can put in that pot all together. If you want to make bigger and bigger files, you just make a bigger and bigger pot, not necessarily bigger data centres or faster CPUs. So, I think that’s an exciting area. I think that’s where I see the industry going in the next few years, beyond, of course, the technical issues, but I think those can be solved and companies are trying to solve them.

FLG: You are involved in the start-up Cache DNA – why was this set up? What are the ambitions of the company?

James: The conversation about building Cache DNA started back in 2019. That’s when we actually had the first validation of the approach and the technology, and of how we were going to show it was going to be scalable. We convened some of the pioneers in the field: George Church; Paul Blainey, who is a microfluidicist and a really awesome all-around person to talk to; Jeremiah Johnson, a chemist at MIT; and my PI, Professor Mark Bathe. We convened and talked about how there is a compelling technology here. There is a problem, but we think we can solve it. Basically, the proof of concept is there, it’s all about just scaling it up, making it faster, making it more efficient, outside of academia. So, we talked about building a start-up. That’s how it started, as a conversation. We became Cache DNA LLC in February 2020. That’s the genesis of that.

The overall vision of Cache is about indexing the world. I made that analogy about Google’s value being its ability to index information. That’s where I see our technology being very important. We actually bridge the gap between DNA writing and reading for data storage, and being able to index all that information is going to be the most important thing. Imagine a scenario where you don’t have that, and you have to sequence the whole yottabyte of information every time. That’s it. So, for me, it’s not even just the ability to do random access but being able to do some computation using the metadata, which is very unique. I think there’s a lot of power in there. I want to be the Microsoft to the IBM; that’s where I see Cache being a key player in the area. And essentially making data storage a reality, helping those companies who are working on the other problems, essentially making it useful to the average person.

Beyond just waiting for those technologies to come in to write and read, you can think about how there’s already prewritten DNA. Our DNA is already pre-written. There’s a lot of value in consolidating them and being able to search them. That opens up a whole new area. If you think you don’t need to rely on synthetic DNA, why don’t you just capture DNA from everyone and use it for medical purposes, like epigenetics? Or biotech surveillance. What if we could stop the next pandemic before it even starts? Because we are able to have a library of DNA sequences from the environment and be like, ‘Oh, look, we found a bat there that could have a virus that could be fatal to humans’, so we should probably figure that out and maybe send it to Moderna and say, ‘Here’s the sequence before it even starts’. Maybe we could stop the next pandemic that way. Having the ability to survey the environment for those is probably going to be important the more we try to destroy environments, because I think it’s going to keep happening. So, I think having something like that is important.

FLG: How do you think the DNA storage field will evolve?

James: The next exciting thing is actual people using it. I think when we get there, with all the pessimists and all the people who were like, ‘It’s never going to work. You have a lot of problems’, that’s going to be the most exciting thing, just showing them, like, ‘Yeah, we made it work!’ For me, it’s sort of a parallel with silicon and germanium during the early days of transistors, when everyone said, ‘Oh, silicon is too expensive’. But look who won! I’m very optimistic about my peers being able to solve these problems and get us to a point where we can start storing information, not just in toy systems but real data from people who need it. I see maybe in the next five to 10 years, you will have like a Dropbox for your data and then you upload your file and it’s going to be in DNA. You can have a copy of it, and we can send a copy to Mars if you want, forever, so aliens can see your information.

I think the most exciting area is that, and being able to search it. I think of the ability to store information forever so that you never have to delete again. It’s the grandest vision. It’s going to be hard to get there but I think we’re going to get there, optimistically.

FLG: Thank you so much for joining me today, James. This field is so exciting, and I really hope that you prove all those people who didn’t believe you wrong!

James: Thank you so much, Shannon.