Thursday, 24 December 2015

12th genome of Christmas: The platypus

In 1799 George Shaw, the head of the Natural History Museum in London, received a bizarre pelt from a Captain in Australia: a duck bill attached to what felt like mole skin. Shaw examined the specimen and wrote up a description of it in a scientific journal,  but he couldn't help confessing that it was "impossible not to entertain some doubts as to the genuine nature of the animal, and to surmise that there might have been practised some arts of deception in its structure." Hoaxes were rife at the time, with Chinese traders stitching together parts of different animals - part bird, part mammal - to make artful concoctions that would trick European visitors. Georgian London was becoming rather skeptical of these increasingly fantastical pieces of taxidermy.

But the duckbilled platypus is no hoax. It is one of the last extant remnants of the egg-laying mammals, monotremes (along with the far more commonplace, but less exotic, Echidnas - hedgehog-like mammals). Duckbilled platypuses have a bizarre set of features: they are the only mammal with a specialised venom organ (one or two shrews have developed poisonous saliva; the duckbill has a specialised claw), with venom strong enough to incapacitate a human. They hunt in muddy streams and have a sixth sense: electrosensation, as in fish, thought to be sensed via their leathery bill. And of course the female lays leathery eggs, out of which hatch tiny little (and super-cute) “hatchlings”, which will first feed from milk patches on the female.

So in the middle of the first decade of this century, when genome sequencing was becoming marginally more routine, it seemed obvious that at least one monotreme should be on the list. The duckbill simply had to take a star turn. Echidnas are also cute, but, frankly... far less weird.

And the genome did not disappoint. One complicating factor was platypus sex chromosomes. Even before the genome was sequenced, it was clear that platypus sex was no simple affair. Like all other mammals, platypuses have sex chromosomes, but there are 10 of them in five pairs, rather than the usual two sex chromosomes in one pair. This could lead to 2^5 (that is, 32) possible sexes, but it doesn’t seem like there's much diversity in platypuses. As it turns out, at the key point in meiosis (the process of making sperm and eggs) the five X chromosomes all line up together with the five Y chromosomes in a spectacular act of chromosomal ballet, and divide as one, such that each sperm gets either five X chromosomes or five Y chromosomes, rather than a mixture of, say, 3X and 2Y in one direction and 2X and 3Y in the other. This means that each sperm is either all X or all Y.
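The counting behind this can be sketched in a few lines of Python. This is only an illustration under a simplifying assumption: each of the five pairs is treated as contributing either its X or its Y member to a sperm. The function names are made up for the example.

```python
from itertools import product

N_PAIRS = 5  # five pairs of sex chromosomes, as described above


def independent_gametes(n_pairs=N_PAIRS):
    """All X/Y combinations if each pair segregated independently (hypothetical case)."""
    return ["".join(combo) for combo in product("XY", repeat=n_pairs)]


def chained_gametes(n_pairs=N_PAIRS):
    """The two combinations left when all the chromosomes divide as one unit."""
    return ["X" * n_pairs, "Y" * n_pairs]


print(len(independent_gametes()))  # 32, i.e. 2**5
print(chained_gametes())           # ['XXXXX', 'YYYYY']
```

Independent segregation would give 32 combinations, including mixtures such as XXXYY; dividing as one unit leaves only the all-X and all-Y cases.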

The genome sequence was even more surprising. Birds also have genetic sex determination (in contrast, many reptiles and fish do not). However, the avian system is on a different part of the genome (there is no standard way of doing sex chromosomes). It's the other way around from mammals; females have the different chromosomes (ZW) whereas the males are the homogeneous set (ZZ). (An oddity in the naming system for sex chromosomes is that they are always called either XY or ZW, even though the X chromosome in mammals has absolutely no relationship to X in, say, fruitflies. It’s just a convention.) The bizarre split-five-ways sex chromosome in platypus is mainly similar to the Z chromosome (which is similar to the human chromosome 9). It's as if the platypus, genetically, is some mixture of bird and mammal: a bird-like sex determination, but flipped the other way around like mammals.

The platypus genome also definitively put milk as the major mammalian innovation, before bearing live young. The thin milk produced by platypuses was unclear in its origin (it is also quite hard to study this, as the female is protective of her hatchlings) but the milk casein gene is clearly in the same location in platypus as in placental mammals.

For me, working on the platypus genome drove home both the diversity of life (egg laying, milk producing, weird sex chromosome mammalian poisoner, anyone?) and its arbitrary nature. Perhaps there is an alternative planet Earth where egg-laying mammals are the dominant species, and the live-young-bearing placental mammals were the oddities. If we had these bizarre sex chromosomes, I am sure there would be all manner of speculation about how this system was somehow linked to our intelligence or dominance.

But on this Earth, platypuses are the strangers, and serve every day as a reminder that biology is far, far more imaginative than we are. And we are all the richer for that.

Wednesday, 23 December 2015

11th genome of Christmas: Us

Ever since the discovery of DNA as the molecule responsible for genetics, in particular when it became clear that the ordering of the chemical components in this polymer was the information that DNA stored, scientists have dreamt about determining the full sequence of the human genome. For Francis Crick, who co-discovered the structure of DNA (along with James Watson, using data from Rosalind Franklin) this would be the final step towards unifying life and chemistry: demystifying the remarkable process that leads to us and all other living creatures. Back in 1953 this was a fantasy, but slowly and steadily over the ensuing decades it became a reality.

The first step was developing a routine way to determine the order of the chemicals in the DNA polymer: sequencing. Fred Sanger, a gifted scientist and the only person with two Nobel prizes in chemistry under his belt, developed dideoxy-sequencing (a.k.a. “Sanger sequencing”) at the LMB in the 1970s. His laboratory, along with neighbouring LMB labs including Sydney Brenner’s, produced a new generation of scientists: John Sulston, Bart Barrell, Roger Staden and Alan Coulson, who forged ahead towards the seemingly unobtainable goal of sequencing whole organisms – with human in their sights. First, they did the different bacteriophages (see my First Genome of Christmas). Then, in the 1980s John Sulston and colleagues started on mapping then sequencing the worm (see the Second Genome of Christmas).

Of course this was not just a UK effort; many US scientists were involved in genomics. A scientist and technology developer, Lee Hood, looked at how to remove the radioactivity that came with Sanger sequencing, and created fluorophore-based terminators. These were far safer and, importantly, amenable to automation. This led to the ABI company's production of automated sequencers, which featured a scanning laser-based readout. Back in the UK, Alec Jeffreys made a serendipitous discovery: microsatellites – highly variable regions in the human genome that provided easy-to-determine genetic markers. This led to the rise of forensic DNA typing (first done for a criminal case near Alec’s native Leicester to provide evidence in a double murder case). A group of enterprising geneticists in France, led by Jean Weissenbach, used these microsatellites to generate the first genome-wide genetic map, based around Mormon families in Utah, who had kept impeccable family records. Clinician scientists were starting to use genetics actively: the first genetic diseases to be characterised molecularly were a set of haemoglobinopathies (blood disorders such as sickle cell anaemia). In these cases, the clinicians were lucky that it was easy to track the protein itself as a genetic marker. A landmark breakthrough, by Francis Collins and colleagues, was the cloning of the gene for cystic fibrosis, using only DNA-based “positional” techniques, without knowing the actual defective protein. This was, at last, a clear, practical application of genomics.

From 1985 through the first part of the 1990s, all of these technologies and uses of DNA were improving, and it became increasingly clear that it was at least possible to consider sequencing the entire genome. However, this was still more of a sheer cliff than a gentle slope to climb. The human genome has three billion letters, a million-fold larger than bacteriophages and 30 times larger than the worm. If the human genome was going to be tackled, it was going to take a substantial, coordinated effort. Debates raged about the best technologies and approaches, the right time to invest in production vs developing better technology, and who, worldwide, would do what.

By the mid 90s things had settled down. The step-by-step approach used in the worm was clearly going to succeed, and there was no reason not to see the same approach working in human. The approach of mapping first, then sequencing was also compatible with international coordination, whereby each chromosome could be worked on separately without people treading on each other's toes. There was some jostling about which groups should do which chromosomes (the small ones were claimed first, unsurprisingly), and some grumbling about people reaching beyond their actual capacity, but it was all on track to deliver around 2010.

Five large centres offered the biggest capacity:
  • The Sanger Centre (now the Sanger Institute), led by John Sulston with Jane Rogers and David Bentley as key scientists, funded by the Wellcome Trust, a UK charity;
  • US Department of Energy (DOE)-funded groups around the Bay Area in California (now the Joint Genome Institute, JGI), with Rick Myers in the early stages and Eddy Rubin pulling the configuration together;
  • Three US National Institutes of Health (NIH) centres, with oversight from Francis Collins, director of the NIH's National Human Genome Research Institute:
      • The Washington University genome center in St Louis, led by Bob Waterston with Richard Wilson and Elaine Mardis as key scientists (this was the Sanger's sister group on the worm as well);
      • Mathematician-turned-geneticist (and part-time entrepreneur), Eric Lander, who formed the Whitehead Genome centre as part of MIT (now the Broad Institute);
      • An Australian transplanted into Texas, Richard Gibbs, at the Baylor genome centre.
Two other groups claimed a chromosome in its entirety: Genoscope in France, led by Jean Weissenbach, had its sights on Chromosome 14, and a Japanese-led consortium took on Chromosome 21. 

Very often, the genome would be depicted with tiny little flags superimposed, as if it had territories to claim. But happily there was an early landmark agreement, the Bermuda Principles, that stipulated all data would be put into the public domain within 24 hours.

For a few years, the Human Genome Project followed a steady rhythm: large-scale physical mapping followed by sequencing. Chromosome 22 was the first to be sequenced, by the Dunham team at the Sanger Centre. I remember poring over the sequence and gene models of this tiny human chromosome and thinking just how big the task ahead of us was. Chromosome 21 was heading to completion, and many other larger chromosomes were slowly being wrangled into shape.

Then, the sequencing world was turned upside down.

Craig Venter, a scientist/businessman, had been around the academic genomic world for some time, and realised perhaps better than anyone else the potential impact of automation. He had already published the first whole-genome shotgun sequence of a bacterium and, inspired by a paper from Gene Myers (a computer scientist working on text analysis, and converting to biology), realised that a similar approach could work on human. Craig assembled an excellent set of scientists - Gene Myers, Granger Sutton and Mark Adams among others - and persuaded leading technology company ABI to set up a new venture to sequence the human genome - privately. This was at the end of the 1990s, at the start of the dotcom boom when it was anyone's guess what a viable business model would be. Certainly, holding a key piece of information for biomedical research 10 years before the public domain effort looked a pretty good bet. Celera was born, raised a substantial amount of money on the US stock market and purchased a massive fleet of sequencers and computers.

Naturally, this was quite a shock to the academic project. I remember John Sulston gathering all of the Sanger Centre employees in the auditorium (I was a PhD student at the time) and telling us that this was a good thing - but complex. Behind the scenes there were all manner of discussions, best read about in one of the numerous books that came out. By my own recollection, there was a sneaking respect for Craig's sheer chutzpah, coupled with a massive sense that one simply couldn't have one organisation - and certainly not a company - own this key information.

I later discovered that the Wellcome Trust, the large UK charity behind the Sanger Centre, took the important step of backing John Sulston to sequence the entire genome if necessary, to ensure it would be put into the public domain (the US academic components were being asked whether their effort was value for money for the taxpayers). The ability of this charity to "buy in" the genome sequence to the public domain was critical to keeping the genome open (in fact, the US academic projects continued, but it is unclear what would have happened had this stance not been taken). More publicly, there were some quite unseemly spats, for example on the feasibility of the whole-genome shotgun approach.

The academic project also responded to the new, higher-pressure timeline. Rather than keeping with the map-first, sequence second approach, people switched to sequence-and-map as one scheme, but still with mid-size pieces (BACs - around 100,000 letter regions) rather than reads (only 500 letters at a time). This was a half-way point towards whole-genome shotgun and, critically, allowed the five major centres to accelerate their production rate. The nice map with flags across the genome basically disappeared (though each chromosome would then be mapped and finished) and the five centres ploughed onwards, leaving footprints all over the nice, tidy, well-laid plan.

But this acceleration of rate caused another problem: bottlenecks in the downstream informatics. Celera started to crow a bit about their depth of human talent in computer science and the size of their computer farm. This became a real issue. The public project was facing a very real headache of having thousands of fragments of the genome without any real way to put them together. My supervisor, Richard Durbin, was the lead computational person at Sanger and stepped up along with other academic groups, notably the creative, enthusiastic computer scientist David Haussler in Santa Cruz. David and Richard had worked on and off on all sorts of things, bringing in parts of computer science methods into biology, and they - with us, their groups - began to try and crack this problem.

The first problem was assembly. Previously, we were guided by a "physical map" and assembly was effectively done by hand on a computer-based workbench. This needed to change. David was joined by ex-computer-gaming programmer Jim Kent, who felt he could do this. I remember discussing the details of assembly methods and concepts on a phone call, with Jim enthusiastically claiming it was doable and everyone agreeing that Jim should come to the Sanger Centre for a while to absorb the details of overlaps, dispersed repeats and other Sanger genome lore. He packed his bags and left that day, appearing 12 hours later in Hinxton: a jovial, very definitely west-coast American, ready to get to work. Back in Santa Cruz, Jim worked constantly for about six months solid to create the "golden path assembler", which provided the sequence for the public projects. Jim also created the UCSC Browser, which remains one of the premier ways to access the human genome (though of course I am partial to a different, leading browser...).

And it didn't stop there. The public project and the private Celera project were now really swapping insults in public, and Celera said that even if the public project could assemble their genome, they wouldn't be able to find the genes in this sequence. Thankfully, three of us - Michele Clamp, Tim Hubbard and myself - had already started a sort of 'skunk-works' project at Sanger to be able to automatically annotate the genome. The algorithmic core was a program I had written, GeneWise, which was accurate and error-tolerant but insanely computationally expensive. Tim had a (in-retrospect, bonkers) cascading file system to try to match the raw computation with the arrival of data in real time. Michele was the key integrator. She was able to take Tim's raw computes, craft the right approximation (described as "Mini-seq") and pass it into GeneWise. This started to work, and we made a website around it: the Ensembl project, which provided another way to look at the genome. (Mini-seqs and GeneWise still hum away in the middle of Ensembl gene builds, and are responsible for the majority of vertebrate and many other gene sets.)

Even more surreally for me, the corresponding Celera annotation project was also using GeneWise (I had released it open source, as I would do everything), so I would have a list of bugs and issues from Michele and Ensembl during the day, and then a list of bugs and issues from Mark Yandell and colleagues from Celera overnight. The friendliness and openness of the Celera scientists - Gene, Mark Adams and Mark Yandell - was completely at odds with the increasingly bitter public stance between the two groups.

It was an intense but fun time. Michele and I worked around the clock to provide a sensible model of the genome and features (using - radically at the time - an SQL backend), and there were constant improvements to how we computed, stored and displayed information. We'd often work all day, flat out, and then head back to Cambridge, often to Michele's house where we'd snatch a quick bite and watch the latest set of compute jobs fan out across the new, shiny compute farm bought to beef up Ensembl's computational muscle. Michele's partner (now husband) James ran the high-end computers, so if anything went wrong, from system through algorithm to integration, one of us was on hand to fix it. As the first jobs came back successfully, we would slowly relax, and eventually reward ourselves with a gin and tonic as we continued to keep one eye on the compute farm.

Eventually it became clear that both projects were going to get there - pretty much - in a dead heat. Given that the public project's data could be integrated into the private version, Celera switched data production efforts to mouse, much to Gene Myers' annoyance as he wanted to show that he could make a clean, good assembly from a pure whole-genome shotgun. There was a brokering of a joint statement between Celera and the public project, and this led to a live announcement from the White House by Bill Clinton, flanked by Craig Venter (private) and Francis Collins (public), with a TV link to Tony Blair and John Sulston in the UK.

One figure in this announcement came from our work: the number of human genes in the genome. This is a fun story in itself - I can't do justice to it now - involving wild over-estimation for over two decades followed by extensive soul-searching as the first human chromosomes came out. I ended up running a sweepstake for the number whereby, in effect, we showed that in the absence of good data, even 200 scientists can be completely wrong. For the press release, it was our job to come up with an estimate of the number of human genes, so Michele launched our best-recipe-at-the-time compute. Bugs were found and squashed, and I remember hanging around, providing coffee and chocolate to Michele as needed (there is no point really in trying to debug someone else's code in a pressurised environment). Eventually an estimate popped out: around 26,000 protein-coding genes.

We looked at each other and shook our heads - clearly too low, we thought, and went into the global phone conference where the good and the great of genomics said "too low" as well. So we went back and calculated all sorts of other ways there could be more protein coding genes (after all, a biotech called Incyte had been selling access to 100,000 human genes for over five years). We ended up with the rather clumsy phrase, "We have strong evidence for around 25,000 protein-coding genes, and there may be up to 35,000."

In retrospect, Michele and I would have been better sticking to our guns, and going with the data. In fact, we now know there are around 20,000 protein-coding genes (though there are enough complex edge cases not to have a final number, even today).

The human genome was done in a rush, with enthusiasm, twice, in both cases in such a complex way that no other genome would be done like this again. In fact, Gene Myers was right. Whole-genome shotgun was "pretty good" (though purists would always point out that if you wanted the whole thing, it wouldn't be adequate). The public project, John Sulston above all, was right that this information was for all of humanity, and should not be controlled by any one organisation. 

With all the excitement and personality of the "race" for the human genome, it is easy to forget what the lasting impact was. As with all of genomics, it is not the papers, nor the flourishes of biology or speculation about the future that makes the impact, but two features of this data: the genome is finite, and all biology, however complex, can be indexed to it. This doesn't mean that knowing the genome somehow provides you with all the biology - quite the opposite is true. It is often the starting point for efforts to unravel biology. But this was a major phase change in molecular biology, between not knowing the genome sequence and knowing it.

I was very lucky to be at the right place at the right time to be a part of this game-changing time for human biology. Crazy days.

Tuesday, 22 December 2015

10th genome of Christmas: The laboratory mouse

After human, the most studied animal, by a long margin, is mouse. Or, more strictly, the laboratory mouse, which is a rather curious creation of the last 200 years of breeding and science. 

Laboratory mice originate mainly from circus mice and pet “fancy” mice kept by wealthy American and European ladies in the 18th century. Many of these mice had their roots in Japan and China, where their ancestors would have been kept by rich households. Unsurprisingly, the selection of which mice to breed over the centuries came down to habituation to humans and coat colour rather than scientific principles. 

The founding genetic material for the lab mouse was not just one species, the European house mouse (Mus musculus domesticus), but three: Mus musculus domesticus, Mus musculus musculus (mainly Asian) and Mus musculus castaneus. Because mice have been following humans around for thousands of years, the history of these three species or strains (everything gets a bit murky here, as mice mate if they meet - but Asia to Europe is quite a distance if you are a mouse) is complex, to say the least.

Mice got their start in the genetics laboratory in a rather eccentric collaboration between a Harvard Geneticist (W. E. Castle) and a fancy-mouse breeder (Abbie Lathrop), who provided a series of mice with specific traits, such as Japanese Waltzing mice. Abbie arguably ran the world’s first-ever mouse house on her farm in Massachusetts. A student of Castle, C.C. Little, got involved in studying mice and transformed a small hamlet on the coast of Maine, Bar Harbor, into a research laboratory, later named the “Jackson Laboratory” after a generous donor. The Jackson lab (shortened to “Jax”) is still one of the world’s premier mouse research sites.

Mice are excellent mammalian models: they really do have all the cell types, tissues and organs that human has, and so many features (though not all) of human biology, from cellular to physiological, can be replicated and studied in this animal. But it is the detailed control we have over the mouse genome that makes it an exceptional species for helping us understand biology. This control is thanks to two key developments. First, because mouse embryonic stem cells can be produced so easily, there are mouse cells (which you can keep in a petri dish) that can be coaxed into making viable embryos. These embryos can be implanted in pseudopregnant mice, and become full grown individuals. Second, one can swap pieces of DNA in and out in these stem cell lines at will - almost as easily as in yeast (and certainly more easily than in fly or worm). 

The ability to swap, not just insert, DNA segments (“homologous recombination”) is key. This unique-in-animals genomic control of genetics means there are elegant, precise experiments that are only feasible in mouse. For example, one can 'humanise' specific genes (i.e. swap the human copy in for the mouse copy), or trigger the deletion of a gene at a particular developmental time-point by using a variety of control elements, ending up with molecular 'cutters' that will turn on only when you want them to. Mice are far more than just a 'good' model for human - they are arguably the premier multi-cellular organism over which we have the most experimental control.

Given its importance to a massive community of researchers, mouse was clearly going to be the most important genome to sequence, after human.

The Black6 strain (Full name: C57BL/6) from the original breeding of C.C. Little was chosen as the strain to sequence, because it was the most inbred and the one most often used in experiments. Indeed, in the public/private race to the human genome (more on this in a later post), the company Celera switched to sequencing mouse when it was clear that the public human genome project was matching the Celera production rate. 

Both the Celera mouse data and the public mouse genome data were based on a whole-genome shotgun sequencing approach. This was standard fare for Celera, but signalled the start of whole-genome shotgun sequencing for 'big' genomes academically (at least for 'reasonable' draft genomes). The inbred nature of mice, Black 6 in particular, simplifies the assembly problem for whole genome shotgun. It’s bad enough trying to put together a 3 billion-letter-long genome from 500 letter fragments - it’s even worse when you have two near-but-not-quite-identical 3 billion-letter-long genomes to reconstruct. 
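To make the shotgun assembly problem concrete, here is a toy sketch in Python of greedy overlap merging on a made-up 15-letter "genome". It is an illustration only, not the assembler used by Celera or by the public project: the function names, the example sequence and the minimum overlap of three letters are all assumptions, and it ignores sequencing errors, repeats and the two-copies problem described above.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that is also a prefix of `b` (0 if below min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest overlap until none overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicate reads, keep order
    while len(reads) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            break  # nothing left to join; remaining pieces stay separate
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Hypothetical toy data: three overlapping fragments of one short sequence.
genome = "ATGGCGTACGTTAGC"
reads = ["ATGGCGTA", "GCGTACGT", "TACGTTAGC"]
print(greedy_assemble(reads))  # ['ATGGCGTACGTTAGC']
```

With an inbred strain there is effectively one sequence to reconstruct, so overlapping fragments agree with one another. With two near-identical copies, fragments from the same region can disagree at a few letters, which a simple exact-match scheme like this one cannot handle.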

But in many ways, the mouse genome brought us into a new era of genome sequencing: one of routine, 'pretty good' drafts from whole-genome shotgun, with fairly routine automated annotation. This was in stark contrast to the step-by-step approach taken with previous genomes, coupled with a more involved, manual annotation. 

Given the importance of mouse to researchers, both the genome and the annotation have been regularly upgraded. Though the first draft had broken the back of the big-genome quandary, as with many problems, the last 10% of the work, sorting things out, has turned out to be as annoying and involved as the first 90% of the job. After the first draft mouse genome, the next five years were about nailing down the frustrating ~10% of the genome that wasn't easy to assemble from shotgun, and attending to all the details.

Mouse is also likely to lead us in future to a more graph-based view of reference genomes. As there are inbred lines of mice, one can really talk about "individual" genomes in a solid way, knowing that others can 'order up' the same strain and work on them. Thomas Keane and colleagues have been building out the set of mouse strains beyond Black6, and doing increasingly independent assemblies, strain by strain. The resulting set of individual sequences absolutely shows the complex origin of laboratory mice; at any point, some mouse strains are as divergent as two species, and some are more like two individuals from a population. This complex web is best represented as a graph of sequences, rather than a set of edits from one reference, which is the current mode. 

In 1787 Chobei Zenya (from Kyoto) wrote a book, "The Breeding of Curious Varieties of the Mouse", which apparently had "recipes" for making particular coat colours for breeding strategies. There are far earlier documents from China on mouse strains, including the "waltzing" mouse (which we now know is a neurological condition). In some sense this is both the rootstock of this laboratory species and part of the motivation for and discovery of evolution and genetics (though Darwin spent more time looking at pigeons than mice). 

Given the laboratory mouse's flexible genetic manipulation, we will be studying this species for at least another 200 years.

Monday, 21 December 2015

9th genome of Christmas: Medaka and friends

My ninth genome of Christmas is a bit of an indulgence: the gentlemanly, diminutive Medaka fish, or Japanese rice paddy fish.

When Mendel’s laws were rediscovered in the 1900s, many scientists turned to local species they could keep easily to explore this brave, new world of genetics. In America, Thomas Hunt Morgan chose the fruit fly. Scientists in Germany explored the guppy and guinea pigs. In England, crop plants were the focus of early genetics. In Japan, researchers turned to the tiny Medaka fish, a common addition to many of the ornamental ponds maintained in Japanese gardens.

Medaka fish are regular tenants of rice paddies and streams all through east Asia, from Shanghai through the Korean peninsula and the islands of Japan, with the exception of the very northern set of islands in the Japanese archipelago. (Naturally, every country has a different name for this fish, but it is most widely used for study in Japan so I am using the Japanese terms.) Fishing for Medaka is as common for Japanese children as fishing for guppies or fry is for European children, and is widely depicted in 19th century Japanese wood blocks.

Medaka also has the honour of being the first organism to show us that cross-over on the sex chromosomes does occur. We now know this to be commonplace, but at the time of its discovery this was a novel observation.

As genetics developed, Japanese researchers continued to inbreed Medaka fish, creating one of the most diverse sets of inbred individual vertebrates from a single species in the world. Being fish, they have all the cell types and nearly all the organs that a mammal has: tiny, two-chambered hearts, livers, kidneys, muscles, brains, bones and eyes. Conveniently, one can keep lots and lots of them, far more cheaply than mice, and they reproduce regularly, with a generation time of around three months.

But then a different fish rose to prominence in molecular biology in the 1980s. Zebrafish, a native of the Ganges, was chosen by the influential Christiane Nüsslein-Volhard as the basis for redoing her Nobel-Prize-winning forward genetic screens in Drosophila, this time in a vertebrate.

I’ve not yet asked Christiane whether she ever thought about using Medaka rather than Zebrafish, but I am sure that a couple of details of husbandry made Zebrafish very attractive: it lays 1000 eggs at a time, providing for excellent single-female progeny, and is transparent during its embryonic stage, allowing for easy light microscopy of the developing fish. In contrast, Medaka lay only around 30 eggs, and they stick to the female rather than being spurted out, so harvesting them is somewhat complex. Plus, the eggs have an opaque glycoprotein layer, which skilled scientists can remove but which again makes it harder to study the embryo.

So why am I so interested in Medaka? Well, I was having a beer with my colleague Jochen Wittbrodt, who is one of the rare Medaka specialists outside of Japan, and we were discussing the next stage of experiments. Medaka fish has a neat trick by which one can introduce foreign DNA (e.g. human) coupled to a reporter (green fluorescent protein from jellyfish is a favourite - easy to pick up using a microscope). Even on the first injection, the foreign DNA will often go into every cell. For most other species, you have to get lucky for the foreign DNA to go into the germline, and then hope it will breed true. Jochen had done a number of successful reporter experiments based on designs from my group, and we were discussing whether we could draw on the long history of Medaka research with its rich tapestry of inbred lines to explore the impact of natural variation on these reporter experiments. So, I asked him how many inbred Medaka lines there were, and Jochen nonchalantly replied that he had no idea - after all, his colleague, Kiyoshi Naruse, made one or two new lines from the wild every year or so.

My jaw hit the floor. From the wild? I checked. Jochen confirmed. And then I explored some more, and discovered that there was a whole protocol for creating inbred individual Medaka from the wild.

This might sound trivial, but it is not. Keeping vertebrates in a laboratory is hard. Keeping them in a laboratory when they are inbred, such that their diploid genome is identical everywhere, is extremely difficult. Doing this routinely from the wild is basically unheard of (although this “self’ing” happens all the time in plant genetics).

Standard theory holds that every individual, whatever the species, has a number of recessive lethal alleles, which will kill the animal if both copies are the same. The trick to making an inbred line that is truly the same everywhere (i.e. homozygous) is regular brother-sister mating and an awful lot of patience, as at some point you have to find the combination of alleles in an individual that does not have a lethal effect. Normal animal husbandry lore would have it that this was such hard work, particularly with wild individuals, that it would be best to just continue propagating the hard work carried out by the original founders of whichever organism you are using.
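To put a rough number on the patience involved: the textbook expectation for repeated brother-sister mating is that the inbreeding coefficient F (the chance that the two copies at a locus are identical by descent) follows F(t) = (1 + 2F(t-1) + F(t-2)) / 4. The sketch below simply iterates that recurrence. It assumes unrelated, non-inbred founders and no selection at all, so it ignores the recessive lethals that make real lines so much harder; the function name is mine, purely for illustration.

```python
def sib_mating_inbreeding(generations):
    """Expected inbreeding coefficient after each generation of full-sib mating.

    Assumes unrelated, non-inbred founders and no selection (a neutral,
    textbook idealisation, not a model of any real Medaka line).
    """
    f_prev2, f_prev1 = 0.0, 0.0  # the founders, and their (sibling) offspring
    coefficients = []
    for _ in range(generations):
        f_now = (1 + 2 * f_prev1 + f_prev2) / 4
        coefficients.append(f_now)
        f_prev2, f_prev1 = f_prev1, f_now
    return coefficients


if __name__ == "__main__":
    for generation, f in enumerate(sib_mating_inbreeding(20), start=1):
        print(f"generation {generation:2d}: F = {f:.3f}")
```

Under these idealised assumptions F is 0.25 after one generation of sib mating and 0.5 after three, and it takes roughly twenty generations to pass 0.98. Real lines have to survive every one of those generations.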

Now, this theory does not hold true for plants, and plant geneticists have enjoyed making inbred lines from the earliest days. And Trudy Mackay, looking at the tricks you can play, created a set of inbreds from wild Drosophila lines. One can study developmental changes by looking at different individuals from the same genetic line, but it has to be at different times. One can study the interaction of genes and environment by raising genetically identical individuals in different environments, but it must be done across a panel of strains that represent a wild population. The model plant Arabidopsis has been used by geneticists to do this for decades; fly geneticists are just starting to. 

This kind of work would have been considered madness in vertebrates. You can’t even keep one or two laboratory zebrafish lines fully inbred - you often need to add back a bit of diversity. There are established inbred laboratory mice, but from a weird multi-species hybrid. Single, wild-derived mouse strains have been established, but not at scale - not least because of the complications inherent to keeping mouse facilities pathogen-free, which makes everyone a bit paranoid about wild mice in a laboratory setting.

But in Medaka, it could be doable. Impressive.

Jochen introduced me to Felix Loosli, the best Medaka breeder outside of Japan, and Kiyoshi Naruse, one of the leading breeders in Japan. The four of us have undertaken to generate and characterise a Medaka inbred panel from a single wild population (unsurprisingly, very close to Kiyoshi’s lab, in Nagoya). 

The Medaka genome has of course been sequenced, in a relatively standard, somewhat quirky way by a Japanese group. It is a pretty standard fish genome, around a third the size of the human genome. Medaka are close to some other evolutionarily interesting fish: the stickleback, beloved of ecologists thanks to the numerous species that form in different river and lake systems; cichlids, with a similarly diverse set of species living around the African lakes; and Fugu, loved by sushi gourmands (because of the powerful neurotoxin which, so long as it is only in trace amounts, produces an intriguing taste) and by genomicists as the vertebrate with the smallest genome.

Together, these four funky fish will, I hope, push forward research into vertebrate genetics with evolution, ecology, and environment. Our own contribution is in creating the first ever inbred-from-the-wild panel in vertebrates.

Watch this space.

Sunday, 20 December 2015

8th genome of Christmas: the greatest chemists in the world.

You might think that the best chemists on earth are humans, living perhaps in Cambridge, Heidelberg, Paris, Tokyo or Shenzhen, beavering away in laboratories filled with glassware, extraction hoods and other human-made things. But then you would be discounting a multitude of bacteria that have cracked all sorts of chemistry problems over the course of their long evolution, and that still harbour secrets about how they manipulate molecules. One inventive clade of bacteria, the cyanobacteria, quite literally changed the world, and built the foundations of modern life.

Some 2.5 billion years ago, the ancestor of present-day cyanobacteria made a radical chemical innovation to improve the way they supplied the electrons that feed through various photosynthetic systems. Rather than drawing on more exotic sources of electrons, they used the ubiquitous water molecule. Stripping out the electrons and hydrogens from water could release molecular oxygen: a powerful, reactive molecule, which of course drifted away as a gas. For the first 200 million years or so after this innovation, this gas reacted with reduced inorganic things, for example iron deposits. We can see the resulting change in earth's oxidation state today by drilling down through sediments. But eventually all those sinks were used up, and oxygen started to accumulate in the atmosphere.

This was a massive change to our planet. Molecular oxygen (O2) is thermodynamically unstable; the vast majority of the time it wants to form molecules with other atoms (though the kinetics of these processes gave some opportunities). As oxygen built up in the atmosphere, pumped out by cyanobacteria, every other living organism had to either adapt to cope with (and often exploit) this radical new oxidising agent, or hide itself away in any anaerobic place it could find, which was usually deep inside the Earth. There was no middle ground.

Most life forms adapted. Indeed, they exploited the presence of this oxygen, particularly when it let them control the oxidation of other molecules (such as carbon) to capture energy. Cyanobacteria brought about the source of energy for most living organisms, by enabling carbon capture in combination with various creative uses of oxygen.

The cyanobacteria themselves had to adapt. It’s quite possible, too, that this oxygen crisis triggered some of the most successful collaborations on the planet: alphaproteobacteria worked out how to use oxygen productively, only to be engulfed by the bigger, more motile archaea-like proto-eukaryotes, emerging as mitochondria. Then, these eukaryotes joined forces with ancestors of cyanobacteria to form algae and plants, with the ancestral cyanobacteria becoming the chloroplast, which collects light energy and fixes CO2 for growth.

There is pretty much nothing in our current world, from the diversity of life through the energy we use every day, that is not dependent on cyanobacteria's great innovation.

This is just one of many chemical innovations brought to us by bacteria. Billions of years before Fritz Haber worked out how to capture gaseous molecular nitrogen and convert it into the very useful ammonia, bacteria had worked out how to crack into the kinetically resistant N2 gas. Interestingly, even after intense, concerted efforts we still don’t understand how bacteria pull this off at room temperature. (Many scientists are still at work to crack this; we know the genes involved and have some sense of the awesome redox potential needed, but how it actually works is still a mystery.) Some bacteria produce hydrogen, which is consumed by other bacteria; some bacteria eke energy out of the redox shifts between the oxidation states of metals - everything from iron through to uranium. Bacteria can live in the weirdest environments, from the “hot smokers” of volcanoes underground to the clouds drifting above us.

Bacteria are usually pretty efficient organisms. They live life close to the margin, and every carbon they don't spend on growth is considered a carbon wasted. They have far smaller genomes than the sloppy, energy-rich eukaryotes - and these days it is almost a trivial task to sequence bacterial genomes. But the challenge is neither the size nor the complexity of each genome, but rather the simply incredible diversity of bacteria. They are everywhere, finding any possible option for growth. The first bacterium sequenced for the purpose of understanding its chemistry (rather than its laboratory behaviour, or to target it as an infectious agent against humans) was probably Synechocystis in 1997 by a Japanese group. But so many more have been sequenced (over 10,000) that it is impossible even for the naming systems to keep up.

Bacterial genomes don't magically tell us how they perform such innovative chemistry, but they do give us the building blocks of the proteins involved, and allow us to start to study them - and sometimes use them - separately. And we have only really started to explore bacterial diversity.

We often consider ourselves and our mammalian cousins as the apogee of evolution, but really the greatest success stories on this planet belong to bacteria, which have radically changed the world.

Saturday, 19 December 2015

7th genome of Christmas: Bread, Beer and Wine.

When you first think of domesticated organisms, dogs might come to mind (our earliest domestication), or perhaps wheat, or cattle or rice. But you might easily overlook single-celled yeast: the key active ingredient in both bread and alcohol, and a great enabler of the agricultural revolution in Europe.

Wild yeast lives on fruit and seeds, and is dispersed by the wind. The earliest use of these wild organisms involved capturing them to make alcohol (wine) and to make sourdough bread rise. For the routine production of beer and bread, brewers and bakers kept cultures of 'good' yeast, eventually selecting for specific strains of Saccharomyces cerevisiae: a single-celled fungus that can live both in aerobic (oxygen present) and anaerobic (no oxygen) conditions.

As genetics and molecular biology took shape, researchers fell in love with this miniature fungus. It is a eukaryote, with a nucleus, signalling pathways, cell division and other conserved features. From the laboratory husbandry point of view, it is closer to bacteria: you grow it on media plates, and its commonest life cycle stage is haploid (one copy of the genome) rather than the more commonplace diploid. Despite its mainstay 'growth' haploid mode, yeast also has a sex life (becoming diploid), which you can manipulate and use for genetics.

After E. coli, it probably has the most manipulable DNA, letting you swap in or out any piece of DNA (you can even insert entire chunks of DNA from other species if you want to, making “YAC”, Yeast Artificial Chromosomes).

So many basic molecular discoveries have their origins in yeast that it is impossible to list them all. Everything from understanding the cell cycle (though a separate African brewer’s yeast, S. pombe, took a star turn as well), through mapping intracellular signalling pathways, to laying down the fundamental aspects of transcription (making RNA from DNA) and translation (making proteins from RNA). Each discovery shows in some way that the vast majority of the cellular machinery one can study in yeast is pretty much at work - sometimes gene-for-gene - in each and every one of our own cells. 

So it's not surprising that yeast was an early target for genome sequencing. This life form was sequenced by a consortium of individual labs all over the world, using the early, more manual technologies. There was some automation and factory-like sequencing, but a lot of the work was done Old School: individual postdocs and technicians pouring gels and reading off each piece of DNA in a bespoke fashion. This was much in the tradition of crafting brewers-yeast-based beers, and as such can be considered an artisanal genome sequence. In 1996 it was published: the first eukaryotic genome.

Yeast is one of the most engineered of all species. One can order up any gene knockout, or a collection of all yeast genes knocked out, with barcodes. One can have any protein tagged, and use that tag for cellular imaging or mass spectrometry. Huge systematic crosses, directed evolution or growing in controlled environments are feasible on a robotic scale with yeast - and the genome sequence provides a pivotal part of the infrastructure for this kind of work.

So raise a glass to yeast! Its genome sequence, coupled with its amazing genome engineering and ease of growth, has placed it firmly in the premier spot as an organism for both basic cellular biology and biotechnology. 

Friday, 18 December 2015

6th genome of Christmas: the deadly Plasmodium... plant?

If humans have an arch enemy, it might well be the tiny, blood-borne parasite Plasmodium falciparum. This nasty beast causes most of the malaria in sub-Saharan Africa and, together with its cousins, in many tropical zones throughout the world. It kills huge numbers of children every year, and constantly cycles through the bloodstreams of its many survivors. It has been with us since our explosive migration out of east Africa, and in fact many genetic diseases (including sickle-cell anaemia and thalassemias) are tolerated by human populations because they confer an advantage against this nasty parasite.

This intimate, long-standing, dysfunctional relationship makes it all the more weird that Plasmodium falciparum is, in part, ancient, degenerate algae.

Genomics made it possible to untangle this story. As people homed in on the DNA of the Plasmodium parasite, they noticed that the genome was very biased: there were far more A+T than G+C pairings. (The base-pairing rule says there must be the same amount of A as T, and of G as C, because of the double-stranded nature of DNA, but the ratio of A+T to G+C can be different.) This bias caused all sorts of issues, but there was one bit of DNA that looked very different.
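The base-pairing point is easy to check on a toy example. The sketch below counts bases on an invented, A-rich single strand, then adds its reverse complement to mimic the double-stranded molecule. The sequence and the function names are made up for illustration and are not real Plasmodium data.

```python
# Map each base to its Watson-Crick partner.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the opposite strand, read in its own 5'-to-3' direction."""
    return seq.translate(COMPLEMENT)[::-1]


def base_counts(seq):
    """Count each of the four bases in an upper-case DNA string."""
    return {base: seq.count(base) for base in "ACGT"}


def at_fraction(seq):
    """Fraction of A and T among the A, C, G and T bases of the sequence."""
    counts = base_counts(seq)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return (counts["A"] + counts["T"]) / total


# An invented, A-rich single strand (not real parasite sequence).
strand = "AAATAAAGCAAATAAGAAAA"
duplex = strand + reverse_complement(strand)

print("single strand:", base_counts(strand))
print("both strands: ", base_counts(duplex))
print("A+T fraction: ", at_fraction(duplex))
```

On one strand the A and T counts need not match, but across both strands they must, as must G and C. Nothing in the pairing rule constrains the A+T fraction itself, which is where the bias shows up.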

In the 1970s and 80s, people thought this must be the mitochondrial DNA of the parasite. (Mitochondria, the power plants of cells, have their own tiny genome, a remnant of the ancient merging of their ancestors as free-living bacteria with eukaryotic cells. Plasmodium, being a eukaryote, must have mitochondria.) But PCR experiments on classic mitochondrial conserved regions did not turn up anything to support this hypothesis.

In the mid 1990s this "anomalous" part of the Plasmodium genome was cloned by a group from NIMR, the MRC institute in north London now merging to become part of the Crick. A real surprise was that this was not a mitochondrial genome at all - it was a plastid genome, that is to say, the genome of the photosynthetic organelle found in all plants and algae (look for the plastid in another Christmas-genome post). The chloroplast, too, was once a free-living bacterium before symbiosing with eukaryotes to give rise to plants and algae as we now know them. Furthermore, the whole set of parasites had this degenerate plastid (“apicoplast”), and so were promptly renamed “apicomplexans”.

Quite why a presumably free-living-algae-related organism decided to chuck in a photosynthetic, light-powered life to become one of the world’s deadliest parasites to many species, one can only speculate. 

The apicoplast seems to be important in the parasitic life cycle. One might imagine that an organelle specialised for light gathering and carbon fixation would be pretty superfluous for an endo-parasite (but apparently it's not). This does clear up why a number of anti-malarials, such as Quinine, also act as herbicides; their anti-plastid action hurts both plants and malaria (but of course not all herbicides can be anti-malarial drugs).

Given the importance of Plasmodium, sequencing and assembling the full genome was a priority. However, the traditional step-by-step approach, by which individual pieces of genome were cloned within E. coli, did not work. E. coli spat out the A+T rich DNA most of the time, or (even worse) chopped it up and rearranged it. 

So Bart Barrell and colleagues at the Sanger took to chromosomal sorting and whole-genome shotgun sequencing to sort out the Plasmodium genome - another epic undertaking for its time. With its extreme A+T richness, this genome was a weird beast. You could almost predict coding-sequence regions by eye, as there had to be more G+C to support the amino acids that were clearly present. Furthermore, the ends of chromosomes ('sub-telomeric regions', in the jargon) were freakishly similar, and full of a sophisticated molecular “chaff” (called 'Rifins') that sits on the outside of the parasite as an ever-changing coating, to confuse the host immune system and prevent an effective response from the host.
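The "by eye" trick can be mimicked with a sliding window: slide along the sequence, measure G+C in each window, and flag the windows that stand out from the A+T-rich background. This is only a sketch of the idea. The window size, step, threshold and toy sequence are all arbitrary choices of mine, and real gene finders use far more than base composition.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in a DNA string."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return (seq.count("G") + seq.count("C")) / len(seq)


def windowed_gc(seq, window=20, step=10):
    """Return (start, G+C fraction) for each full window along the sequence."""
    return [
        (start, gc_fraction(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


def gc_rich_windows(seq, window=20, step=10, threshold=0.3):
    """Start positions of windows whose G+C fraction reaches the threshold."""
    return [start for start, gc in windowed_gc(seq, window, step) if gc >= threshold]


# Invented toy sequence: A+T-only flanks around a short, G+C-richer stretch.
at_rich = "ATATTTAATA" * 4
gene_like = "ATGGCTGAAGGTCCAGCTGGTAAAGCTTGC"
toy = at_rich + gene_like + at_rich

for start, gc in windowed_gc(toy):
    print(f"{start:3d}  {gc:.2f}")
print("candidate windows start at:", gc_rich_windows(toy))
```

On the toy sequence only the windows lying wholly inside the G+C-richer stretch clear the threshold, which is the visual impression the raw Plasmodium sequence apparently gave.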

The community is continuing to sequence other species of Plasmodium, most notably in light of its specialisation to a specific host (i.e., us) - other Plasmodia are less fussy about their host (a far broader choice of mammals will do for most parasites), but also less deadly. Furthermore, understanding variation in this parasite, in particular those variations that affect drug resistance, is a monumental, on-going effort at the heart of our struggle to defeat this malicious plant.

Thursday, 17 December 2015

5th genome of Christmas: The Fly

The humble fruit fly – Drosophila melanogaster, to be specific – has played a central role in the history of genetics and molecular biology and continues to be important in research. Championed by the legendary Thomas Morgan at the start of the 20th Century, Drosophila provided a practical foundation for genetics – long before the discovery of DNA as the vehicle for passing down heritable information through generations. Morgan and colleagues developed the concepts of 'gene' and 'linkage', and so we have 'Morgans' (and more commonly, centi-Morgans, cM) as the basic units of genetic maps.

You could argue that even the modern approach to genetics and molecular biology research was formed around this creature. The fly has influenced the way laboratories choose a direction of study and the way they share materials and data internationally, which was as critical to the success of early genetics as it is now.

After this strong start, Drosophila kept its momentum during the discovery of DNA, molecular biology and early DNA cloning. Performing large-scale, 'forward genetic' screens, where (one hopes) every possible gene has been knocked out at least once so one can look for specific phenotypes, has unearthed a rich seam of genes involved in development. These days the innovation continues with, amazingly, fly-brain manipulation at a neuronal level.

You can see the footprints of Drosophila research everywhere. The playful Drosophila naming scheme allows for gene names such as “tinman” (mutant flies that don’t have a heart), “dunce” (unable to navigate simple fruit-fly mazes), and “Antennapedia” (antennae are swapped for legs), which permeate biology. The human gene “Sonic Hedgehog” is named after its “hedgehog” ortholog in fly. The “polycomb” in “polycomb repressive complex” (one of the key genome-switching mechanisms) comes from the subtle mutation that adds more bristles (i.e., a comb) onto the fruit-fly's back. Fly molecular biologists are part of a long and great tradition, and are understandably proud of their community’s impact and continuing influence.

This goes some way to explaining why fly genomicists were feeling a bit frustrated in the late 90s, when it became clear that the worm – usually a bit of a 'junior partner' in the metazoan model-organism world – was going to have its genome completed well before the fly. The fly genome project had done quite a bit of groundwork: a century of research had produced excellent genetic maps, helped by a clever trick involving the salivary gland chromosomes (which, bizarrely, duplicate so much that you can see them easily under a microscope). But the project had not committed to the same step-by-step sequencing efforts that the worm community had.

And then came a golden opportunity.

Craig Venter had aligned both investors and technologists to “overtake” the public human genome project with a privately funded project led by the company Celera. To do so, he had assembled a group of scientists including the brilliant computer scientist Gene Myers, who claimed the piecemeal approach taken by the worm and human projects was not necessary. Instead, he posited that a whole-genome shotgun approach was computationally feasible (more on this in another 'Christmas genome' post). Many people didn’t believe him. Others who might have given him the benefit of the doubt found it to be too risky a strategy. Craig and team were ready to bet on it.

But they needed a test project - a genome that was not as big as human, but complex and worth doing.

So the Great and the Good of Drosophila, notably Gerry Rubin and Michael Ashburner, pitched the fly to Celera. In 1998/1999, its genome was 'shotgunned' and the genome became the first large, whole-genome shotgun assembly – published in 2000. 

Although shotgun assembly and automatic (computational) annotation are now commonplace, at the time this was radical stuff. There was talk of the largest computational farm ever assembled for biology at Celera, of this whole upstart world of bioinformatics and computational biology being poised to revolutionise biology. This was the dot-com era, so at the same time people were talking about new business models, and how the internet was changing everything.

The Drosophila genome work was happening when I was just ending my PhD. I went to Celera for the Drosophila genome jamboree, and GeneWise - my insanely computationally expensive software for error-tolerant protein or protein HMM alignment - was run across the genome. I also met and chatted with Gene for a while, which was my first exposure to the guts of the assembly problem. But perhaps most of all I realised that the geeks were definitely at the top table - designing and creating the experiments, not just processing the data.

Wednesday, 16 December 2015

4th genome of Christmas: the hexaploid bread wheat genome.

The first technological innovation to radically change human society was agriculture. The ability to cultivate – rather than hunt or pick – food had a profound effect on everything from our immune system to our societal structures. It encouraged specialisation, favoured robust, complex inter-generational knowledge transmission and enabled the explosive growth of this bipedal ape.

Arguably, the centrepiece of agricultural innovation is wheat. If you look at the ancestral grass from which it was bred, wheat looks just like… grass. With tiny seeds sticking out of its head at harvest time. Some 10,000 years ago in Anatolia, enterprising farmers bred the biggest, most consistent of these grasses year after year. By selecting for the size of the wheat ears, they brought about changes in the genome that gave rise to larger and larger wheat. One type of change was duplication, by which two individuals (often different subspecies) were bred together without first splitting their genomes in half. In ancient times, single duplications like this gave rise to varieties like emmer, or durum wheat, and the first duplication looks like a wild, commonplace process. Later, in more "modern" times (around 5000 years ago), a second merger with a third subspecies gave rise to bread wheat, making its genome three times the size of the basic grass genome. (In case you’re new to genomics, the genome comes in pairs: one maternal and one paternal, so this three-way increase is described as hexaploid, three times the normal diploid.)

The basic Anatolian grass genome is about twice the size of the human genome: around 6 billion bases (6 Gigabases), and the 3-fold hexaploid wheat is around 16 Gb. Annoyingly, every bit of DNA in wheat has three pretty similar copies, even when the strain is completely inbred (for outbred wheat plants, one expects 6 distinct copies, 2 at each of the three triplicated loci). In technical terms, this is described as a “nightmare” for genome assembly and analysis. For a long time, the wheat genome was a seemingly unobtainable goal for agricultural genomics research.

It’s not uncommon for plants to duplicate their genome in the wild (indeed, this seems like the starting point of the ancient wheat), but it’s a regular practice in agriculture when selecting for larger fruits/seeds. Strawberries are octoploid (four duplicates). Brassicas (cabbages, broccoli, cauliflower and friends) are all tetraploid strains from different mixtures and tweaks of three different baseline genomes. It makes my head hurt just thinking about the genetics. Commercial sugar cane is duodecaploid (6 duplications) and, as we propagate it using cuttings, its cells have completely lost the desire to even keep track of their chromosomes.
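If the naming makes your head hurt too, the bookkeeping is just multiplication: each merged ancestral genome brings a maternal and a paternal set of chromosomes, so the ploidy is twice the number of merged genomes. A tiny sketch, with function names of my own invention:

```python
# Standard names for the ploidy levels mentioned in the text.
PLOIDY_NAMES = {
    2: "diploid",
    4: "tetraploid",
    6: "hexaploid",
    8: "octoploid",
    12: "dodecaploid",  # "duodecaploid" in the text
}


def chromosome_sets(merged_genomes):
    """Each merged ancestral genome contributes a maternal and a paternal set."""
    if merged_genomes < 1:
        raise ValueError("need at least one genome")
    return 2 * merged_genomes


def ploidy_name(merged_genomes):
    """Name the ploidy level, falling back to 'N-ploid' for unlisted levels."""
    sets = chromosome_sets(merged_genomes)
    return PLOIDY_NAMES.get(sets, f"{sets}-ploid")


for crop, genomes in [("durum wheat", 2), ("bread wheat", 3),
                      ("strawberry", 4), ("sugar cane", 6)]:
    print(f"{crop}: {genomes} merged genomes, "
          f"{chromosome_sets(genomes)} sets, {ploidy_name(genomes)}")
```

The crop-to-count pairs are simply the figures quoted in this post; the code only does the doubling and the naming.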

Despite its fiendish complexity, the community has finally, slowly and steadily, tamed the big, bad bread-wheat genome. First-off-the-mark survey skims of the wheat genome were generated, then compared against the smaller Brachypodium grass genome. The Barley genome (far saner but still annoyingly big) followed. Heroically slogging through chromosomal sorting, people started to tease apart the specific components of the genome. And just recently, excellent work by Matt Clarke and colleagues at TGAC brought us a solid, draft assembly using a clean sequencing protocol, a custom-tweaked assembly algorithm for wheat and a very large computer.

I am delighted that this has happened for many reasons. First, the work is a tour-de-force by Matt and his team. Second, I know a good draft genome will unleash a whole series of experiments – from diversity panels to ChIP-seq. Third, it allows wheat to come into the fold of species-we-have-a-reasonable-genome-for, so we don’t need to treat it like a special case any longer with tricky, bespoke systems (though there is still a need for these, given wheat’s endless annoyances - for example, it is very important to know the relationship between the 3 copies).

This draft bread-wheat genome is just a step along the way to a very high-quality wheat resource. After all, wheat is far too important to the health of people on this planet to skimp on quality. But it is a great step forward, and I hope will be a transformative one in our long and fine tradition of innovating – starting with this very first piece of technology, agriculture.