In June, Beth Schechter from the Open Cannabis Project (OCP) published an article on Medium titled, “There’s a light and a dark side to everything (And other takeaways from my time at Open Cannabis Project).” In May, the OCP board of directors had made the painful decision to shutter in reaction to the fallout from the Phylos Affair, and Schechter felt compelled to share her thoughts and feelings about the experience.
Toward the end of the piece, in a list of things she hoped would happen, she wrote, “…it would be amazing to see a public marker analysis of everything in the [Phylos] Galaxy. That would both be helpful to growers, and also helpful for people to understand precisely how useful (or unuseful) it is to the breeding project.”
Unknown to Schechter, Bertrand Vick and Cody Markelz, two Oakland, California-based plant scientists unrelated to the Phylos controversy and at that point unknown to anyone from the Open Cannabis Project, read and then heeded her call to action. Now, three months later, Rev Genomics, their plant-focused biotech company, is releasing to the public under a Creative Commons license a SNP map, or genetic marker dataset, built from OCP data.
Rev Genomics does cannabis breeding with a focus on developing rare cannabinoids such as THC-V, CBG, CBC, and CBD-V. “We essentially license strains to cultivators,” said Vick. “Given the competitive environment in the industry these days, we want to help the farms we partner and work with to help differentiate them from all the other cannabis companies out there, and we think our strains can help them do that. Of course, as our name implies, we have a strong data science-slash-genomics component to what we do.”
“Articles published by Future Cannabis Project and others concerning the Open Cannabis Project were of considerable interest to us,” he added. “Then we read one article in which Beth Schechter from the OCP specifically wanted someone to pick up the ball and do a global SNP analysis of the data that had been collected. So, that’s what we’ve done, and we’re making it freely available to the cannabis industry.”
The team is not just regurgitating data but repurposing it for further use. “We’ve done a bunch of analysis, provided a bunch of statistics, and aggregated all the data,” explained Markelz, who did the actual analysis. “We’ll link back to the raw data if people want to do their own thing, but it’s really the analysis that we’ve done and the synthesis that we’ve done that we want to release.”
“This is not about us getting customers, but doing something good for the community,” said Vick. “I can’t say it was not partly in response to the Phylos debacle because some of the blowback affects companies like us—bioscience companies that on the surface seem like they’re maybe doing some of the same stuff, or similar things—and we just wanted to do solid for the community.”
“To do marker-assisted breeding, you need a lot of data,” he added. “You can’t just get data from one strain but from a lot of strains, and there are certain companies in the value chain that are set up particularly well to collect that data, potentially, if we’re thinking of it in that way. And I think it should definitely be known to people who are participating that their data is being collected, just like we all believe now that if we’re going to take part in Facebook, we need to know that our data is being used in a particular way. And so yeah, definitely, data and its use or misuse is a huge deal for all society, and it’s starting to come to bear on the cannabis industry.”
No Strings Attached
“Welcome to version 1.0 of the free Cannabis genetic marker dataset,” reads an introductory README file released with the dataset. “This genetic marker dataset was derived by analyzing publicly available Cannabis sequence data available for free here. The sequencing data was mapped to the publicly available Purple Kush genome resource available here published as part of Laverty et al. 2019. We provide these 23500+ SNP and 2200+ InDel molecular markers from 1358 cultivars to the cannabis community to facilitate innovation in breeding efforts and the creation of new cultivars in this amazing plant.”
For the Rev Genomics team, these sorts of initiatives are essential if the cultivator community intends to realize its potential. “I feel like the people who submitted their strains for sequencing and gave information to OCP are particularly aggrieved,” said Vick. “They have gotten some sort of general analysis of their strains, but I know they didn’t get a SNP map, which acts as a sort of a molecular fingerprint that discerns their strain from other strains.
“For instance,” he continued, “there could be a breeder out there who came up with a great strain and wants to sign a licensing deal with a big company that wants to commercialize the strain. The SNP map could help them identify the strain down the road. If they suspect that there’s been some foul play with their strains, it can help identify their strain, and it can also help identify strains that come from their strains.”
“The SNP map is just an ordering of the SNPs onto the publicly available genome,” added Markelz. “Rather than just having all these SNPs exist somewhere in the genome, it’s actually the ordering of them. So, if you have 10 SNPs on a chromosome, then you could say SNP one is next to SNP two, is next to SNP three, and so on.
“You’re literally ordering them,” he continued. “Then, for the strain samples that were submitted to OCP, and with some of the other publicly available resources, you can genotype them and be able to say, for instance, ‘Purple Banana Kush has a SNP at this location that’s different from Blue Dream at the same location.’ What you can conclude from that, if you were doing a big analysis, is that they’re different at that location in the genome, and therefore the genes that are around that SNP are also probably different.”
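For readers who think in code, the ordering and comparison Markelz describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the positions, alleles, and genotype calls below are invented, the `Snp` type and function names are hypothetical, and nothing here is taken from the Rev Genomics release or its methods.

```python
from typing import NamedTuple


class Snp(NamedTuple):
    chrom: str  # chromosome or scaffold name (hypothetical)
    pos: int    # 1-based position on that chromosome
    ref: str    # reference allele
    alt: str    # alternate allele


# Invented markers, deliberately listed out of order.
snps = [
    Snp("chr1", 5200, "A", "G"),
    Snp("chr1", 1300, "C", "T"),
    Snp("chr2", 800, "G", "A"),
]

# Invented genotype calls, keyed by strain name and then by (chrom, pos).
genotypes = {
    "Purple Banana Kush": {("chr1", 1300): "T/T", ("chr1", 5200): "A/A", ("chr2", 800): "G/A"},
    "Blue Dream": {("chr1", 1300): "C/C", ("chr1", 5200): "A/A", ("chr2", 800): "G/A"},
}


def order_snps(markers):
    """Order markers by chromosome name, then by position along it.

    Simplification: names are compared as plain strings, so "chr10"
    would sort before "chr2" in a real genome.
    """
    return sorted(markers, key=lambda s: (s.chrom, s.pos))


def differing_sites(strain_a, strain_b, ordered):
    """Return the (chrom, pos) keys where both strains are called and differ."""
    sites = []
    for snp in ordered:
        key = (snp.chrom, snp.pos)
        call_a = genotypes[strain_a].get(key)
        call_b = genotypes[strain_b].get(key)
        if call_a is not None and call_b is not None and call_a != call_b:
            sites.append(key)
    return sites


ordered = order_snps(snps)
diffs = differing_sites("Purple Banana Kush", "Blue Dream", ordered)
print(diffs)
```

In this toy data the two strains differ at a single site on the first chromosome, which is the kind of location a breeder might then examine for nearby genes.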
Of course, the DNA alone provides an incomplete picture that needs to be supplemented. “We have a big burden because the raw sequence data is not useful at all unless you have lots of phenotypic data,” explained Markelz. “But I’ve assembled everything into a SNP map, and we plan to release all the raw data as well so people can do whatever they want with it. And I think that this will be helpful for breeders if they want to develop molecular markers around different locations in the genome. So, let’s say for example they discovered a gene that was involved in botrytis resistance or something like that. What you could do in the dataset is you could say, ‘Oh, well, these 25 strains have the resistance gene, so I’m going to try to find genetics that have this resistance gene and then develop a marker for it and use that in my breeding program so that we’re only selecting for individuals that have that sequence.’ That’s what is referred to as marker-assisted breeding.”
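The selection step in that example can be sketched as a simple filter. The sketch assumes a breeder already knows which allele at a given marker travels with the trait, which in practice requires the phenotypic data Markelz mentions. The strain names, genotype calls, and the choice of "G" as the linked allele are all made up for illustration.

```python
# Invented genotype calls at one hypothetical marker linked to a trait.
# "./." stands for a missing call.
calls_at_marker = {
    "Strain A": "G/G",
    "Strain B": "A/G",
    "Strain C": "A/A",
    "Strain D": "./.",
}


def carries_allele(genotype, allele):
    """True if at least one copy of the allele appears in the genotype call."""
    return allele in genotype.replace("|", "/").split("/")


def select_carriers(calls, allele):
    """Return, in sorted order, the strains carrying at least one copy."""
    return sorted(name for name, gt in calls.items() if carries_allele(gt, allele))


def select_homozygous(calls, allele):
    """Return the strains in which every called copy is the given allele."""
    return sorted(
        name
        for name, gt in calls.items()
        if set(gt.replace("|", "/").split("/")) == {allele}
    )


carriers = select_carriers(calls_at_marker, "G")
fixed = select_homozygous(calls_at_marker, "G")
print(carriers, fixed)
```

A breeding program might start from the carriers and prefer the homozygous strains as parents, since they pass the allele to all offspring. That preference is a general genetics point, not something stated by the Rev Genomics team.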
“The main thing is that aggregated data is really powerful,” said Markelz when asked about the ultimate value of the work they’re releasing to the public. “Any one individual sample that someone submitted is not really very valuable, but the aggregated data is quite valuable. So in my example of the disease resistance thing, if someone wanted to look through this, and they knew that their strain was disease resistant and the disease resistance was caused by this gene at this location, they could go through and select individuals from this overall data set that have the same gene. So, you know, this is really the backbone of a breeding program, I would say.”
The dataset and tutorial are licensed under a Creative Commons Attribution 4.0 International License, which essentially stipulates that people are allowed to:
Share — copy and redistribute the material in any medium or format.
Adapt — remix, transform, and build upon the material for any purpose, even commercially.
“The thing that’s nice about it is that each of the SNPs will have a location on the genome that it’s mapped to, so if people want to go in and say, ‘Oh, this SNP is useful for me and not for anyone else,’ but then I can go to the genome and I can design PCR-based primers around that location,” said Markelz. “You know, people will probably use it in many ways that we can’t even conceive.”
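Because each marker has a genome coordinate, the first step toward a PCR assay is pulling out the sequence on either side of the SNP. The sketch below shows only that window extraction on an invented ten-base sequence. The function name is hypothetical, and actual primer design (melting temperature, specificity, and so on) is normally handed to dedicated software and is not attempted here.

```python
def flanking_window(sequence, pos, flank):
    """Return (left_flank, base_at_snp, right_flank) for a 1-based position.

    Flanks are truncated when the SNP sits near either end of the sequence.
    """
    idx = pos - 1  # convert the 1-based coordinate to a 0-based index
    if not 0 <= idx < len(sequence):
        raise ValueError("position lies outside the sequence")
    left = sequence[max(0, idx - flank):idx]
    right = sequence[idx + 1:idx + 1 + flank]
    return left, sequence[idx], right


# Invented reference fragment, not real Cannabis sequence.
reference = "ACGTACGTAC"

middle = flanking_window(reference, 5, 3)
edge = flanking_window(reference, 1, 3)
print(middle, edge)
```

Primers would then be chosen from within the left and right flanks so that the amplified fragment spans the SNP.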
There was ample thought put into how exactly to release the dataset and analysis, and whether to make it available to everyone. But both Markelz and Vick believe that doing it this way helps to tip the scale at least a little to the benefit of the small cultivator and breeder.
“It’s always a double-edged sword with this sort of stuff, because companies can still use it, but so can everyone else,” said Markelz. “I would say that as far as huge companies go, even if I release it, they’re going to have people on staff that can do this work, too. They’re going to do their own analysis. It levels the playing field much more in the direction of the regular growers and breeders that have been doing this for a long time.”
They see their work as helping growers across the spectrum. “We like working with cultivators of all sizes,” said Vick. “We have been working mostly with medium-sized greenhouse cultivators, and that’s maybe based on the strains that we’ve been making. For instance, we’re coming out with a potent Sativa with a six-week flowering time that’s of obvious interest to indoor cultivators. Ultimately, however, we see ourselves working with outdoor cultivators and smaller cultivators in Mendocino and Humboldt. We’re open to all comers and see ourselves as the genetic partner to the industry. We think what we’re working on can help lift all boats, so to speak.”
The current dataset and analysis release underscores Rev Genomics’ philanthropic intentions, concurred plant patent attorney Dale Hunt. “If they release this material without restrictions, it’s a dedication to the public for the public benefit, and even though they could choose to copyright their original analysis, and even though they have a copyright on their original analysis, they’re choosing to dedicate that without restriction to the public by making this posting the way they have, and in doing so giving up rights they could choose to retain. It’s extremely altruistic of them.”
The current release is referred to as v1 because there could be more work required on the data depending on the wishes of the industry. “We’ll release summary statistics files, which are .txt files,” said Markelz. “Then, we will have variant calling format files that are .vcf files, a common format for this type of data. I’ve also made a bunch of summary plots of the data that we will release with it, in addition to a file set that says this individual Gorilla Cookies, for instance, relates to this .vcf file and also this raw piece of data that came from the NCBI database. So, if people want to go back and do their own analysis, they can, and it will be much easier for them to do. We’re going to use this to design experiments for our own focus on rare cannabinoids, but that doesn’t prevent other people from using it however they want. The most important thing is just to make the data open; that’s what we mean.”
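For anyone who has not opened a .vcf file before, it is tab-separated text: header lines start with "#", and each data line holds a chromosome, a position, the reference and alternate alleles, and one genotype column per sample. The sketch below reads a tiny invented example using only the Python standard library. The sample names and records are made up, the layout of the actual Rev Genomics files is not known here, and the SNP versus InDel test is a simplification based only on allele length.

```python
# A tiny invented VCF, not taken from the released dataset.
VCF_TEXT = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSampleA\tSampleB\n"
    "chr1\t1300\t.\tC\tT\t50\tPASS\t.\tGT\t1/1\t0/0\n"
    "chr1\t5200\t.\tA\tAGG\t40\tPASS\t.\tGT\t0/1\t0/0\n"
)


def parse_vcf(text):
    """Parse minimal VCF text into a list of dictionaries, one per variant."""
    samples = []
    records = []
    for line in text.splitlines():
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the nine fixed columns
            continue
        chrom, pos, _id, ref, alt = fields[:5]
        gt_index = fields[8].split(":").index("GT")  # find GT within FORMAT
        calls = {
            name: column.split(":")[gt_index]
            for name, column in zip(samples, fields[9:])
        }
        # Simplification: single-base REF and ALT alleles are treated as a SNP,
        # anything else as an InDel.
        is_snp = len(ref) == 1 and all(len(a) == 1 for a in alt.split(","))
        records.append({
            "chrom": chrom,
            "pos": int(pos),
            "ref": ref,
            "alt": alt,
            "kind": "SNP" if is_snp else "InDel",
            "genotypes": calls,
        })
    return records


records = parse_vcf(VCF_TEXT)
for rec in records:
    print(rec["chrom"], rec["pos"], rec["kind"], rec["genotypes"])
```

Real files at this scale are usually handled with established bioinformatics tools rather than hand-written parsers; the point here is only to show what the format contains.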
Beth Schechter was contacted for comment but is away on vacation.