Digital Adventures


What’s at the bottom of the biodiversity data mine? by kewdmt

As the quantity of data available online reaches ever greater volumes, particularly structured or ‘linked’ data, questions of what value can be derived from that data, and how much that might be, are increasingly interesting.

Working at the Royal Botanic Gardens, Kew, I’m particularly interested in biodiversity data, what it might be used for, by whom, and to what end. This is both an academic interest and a pressing need given the impending crisis that threatens biodiversity around the world.

Superabundant data

Many people, including Professor Nigel Shadbolt, Professor of Artificial Intelligence at Southampton University, describe the supply of online data as a ‘superabundant’ deluge of information. In a paper on semantic responses, Professor Shadbolt and his colleagues estimated that the amount of data generated in 2010 would be around 1.2 million petabytes. To put that into some context: if you tried to read through this data, assuming an average reading speed of say 1,000 characters (or approx 1 kilobyte) per minute, then it would take you about 2 trillion years! So techniques of data mining (not a new term, it has been around for several decades) are increasingly essential in locating and making sense of this incredible data mountain.
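The reading-time figure can be checked with a few lines of arithmetic. The sketch below is only a back-of-the-envelope check, and it rests on assumptions that are not stated in the post: decimal units (1 petabyte = 10^15 bytes, 1 kilobyte = 1,000 bytes), one byte per character, and reading around the clock without a break. The function name is purely illustrative.

```python
# Back-of-the-envelope check of the reading-time estimate.
# Assumptions (not from the post): decimal units, one byte per character,
# and non-stop reading with no breaks.

BYTES_PER_PETABYTE = 10**15
MINUTES_PER_YEAR = 60 * 24 * 365  # ignores leap years


def years_to_read(petabytes: float, bytes_per_minute: float = 1_000) -> float:
    """Return the years needed to read `petabytes` of data without a break."""
    total_bytes = petabytes * BYTES_PER_PETABYTE
    minutes = total_bytes / bytes_per_minute
    return minutes / MINUTES_PER_YEAR


years = years_to_read(1.2e6)  # 1.2 million petabytes
print(f"{years:.2e} years")
```

Under these assumptions the result comes out a little above 2 trillion years, which agrees with the rounded figure quoted above.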

Biodiversity data is a subset of this deluge. There is no reliable estimate of the quantity of biodiversity data available, but it is huge – GBIF (the Global Biodiversity Information Facility) reported in 2010 that it had 216 million records of primary biodiversity data available through its portal, and it estimated that the data records available at partner institutions run into several billion. Kew alone has tens of millions of data records.

Scientific discovery

So what is the role of data mining in making sense out of the information we have recorded about the planet’s biodiversity?

At Kew, some important aspects of scientific discovery rely on identifying patterns from large data sets. Biodiversity data usually includes accurate location-based information (for example the location where a specimen was collected), providing a powerful opportunity to mine data by location. A good example is the work conducted recently to assess the risk to plant life around the world, expressed as the Sampled Red List Index (SRLI) for plants. Researchers at Kew, the Natural History Museum, ZSL and the International Union for Conservation of Nature (IUCN) took a representative sample of 7,000 species from around the world. Using a combination of bespoke and existing tools such as Google Earth, they mined data from the partners’ collections, remote sensing data from satellites, and other sources such as GBIF to arrive at the final assessment.
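To make the idea of mining by location a little more concrete, here is a minimal sketch that counts records per species inside a latitude/longitude bounding box. It is not the method used for the SRLI assessment: the `Record` type, the `species_in_box` function and the sample records are all hypothetical and made up for illustration.

```python
from collections import Counter
from typing import NamedTuple


class Record(NamedTuple):
    """A hypothetical, heavily simplified specimen record."""
    species: str
    lat: float  # decimal degrees, north positive
    lon: float  # decimal degrees, east positive


def species_in_box(records, south, west, north, east):
    """Count records per species inside a latitude/longitude bounding box.

    Assumes the box does not cross the 180-degree meridian.
    """
    return Counter(
        r.species
        for r in records
        if south <= r.lat <= north and west <= r.lon <= east
    )


# Made-up records for illustration only; not real collection data.
records = [
    Record("Species A", -18.9, 47.5),
    Record("Species A", -19.2, 46.9),
    Record("Species B", -18.1, 48.0),
    Record("Species B", 51.5, -0.3),
]

counts = species_in_box(records, south=-26.0, west=43.0, north=-12.0, east=51.0)
print(counts.most_common())
```

A real analysis would also have to deal with imprecise or missing coordinates, synonymous names and uneven collecting effort, which is where specialist knowledge comes in.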

Explore the state of plant life

Interactive map displaying IUCN Sampled Red List Index (SRLI) for plants data

Another important use of biodiversity data is to derive models that can be used to make ecological predictions, for example when modelling climate scenarios. Projects such as TRY work through a global partnership of institutions that provide primary biodiversity data, which is mined to derive traits used in these models.

More miners – and mines

In a growing number of cases, people are mining data sets that nominally have nothing to do with biodiversity, to reveal new information. A fascinating example is that of a citizen scientist from Maine who used tourist images from Flickr to track the migration journey of a humpback whale from Brazil to Madagascar, publishing her results in the Royal Society’s Biology Letters. With Facebook now reaching over 500 million people, there is bound to be some useful biodiversity information to mine.

From eBird to iSpot, there is no shortage of opportunities for citizen scientists to invest time in documenting biodiversity – and they are doing it in large numbers. Increasingly, data providers are also finding ways of making their data available to these groups. In the UK, the National Biodiversity Network offers a set of web services that enable use of data by developers creating applications, and GBIF similarly offers a number of services into its global data. The Encyclopedia of Life (EOL) aggregates biodiversity data aimed at a broader range of audiences, for which there is now an API (application programming interface) that can be used to build apps.
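As a rough illustration of what using such a web service can look like, the sketch below builds a query for GBIF's public occurrence search. It assumes the current GBIF API (version 1 at api.gbif.org), which is newer than the services described here, so treat the endpoint and parameters as an assumption rather than what was available at the time. The helper names `build_search_url` and `fetch_occurrences` are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: GBIF's current public occurrence-search endpoint (API v1).
GBIF_OCCURRENCE_SEARCH = "https://api.gbif.org/v1/occurrence/search"


def build_search_url(scientific_name: str, country: str, limit: int = 20) -> str:
    """Build a search URL for georeferenced records of one species in one country."""
    params = {
        "scientificName": scientific_name,
        "country": country,       # ISO 3166-1 alpha-2 code, e.g. "GB"
        "hasCoordinate": "true",  # only records that carry coordinates
        "limit": limit,
    }
    return f"{GBIF_OCCURRENCE_SEARCH}?{urlencode(params)}"


def fetch_occurrences(url: str) -> list:
    """Fetch one page of results. Requires network access."""
    with urlopen(url, timeout=30) as response:
        payload = json.load(response)
    return payload.get("results", [])


url = build_search_url("Quercus robur", "GB", limit=5)
print(url)
```

Calling `fetch_occurrences(url)` would return a list of record dictionaries that an app could then plot or summarise. Only the URL is built here, so the example runs without a network connection.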

Screenshot from eBird website

Although we’re not inundated with applications based on this kind of data, there are signs that both specialists and amateurs are approaching the data more creatively. There are, for example, some good illustrations of what is possible using data visualisation, such as species heatmaps or Google Earth layers showing species distribution. And there are certainly developers keen to get their hands on new data. Take the realm of civil data, for example, where organisations like MySociety create hugely popular apps out of freely available data.

So why are there not more biodiversity apps? Well, the data is certainly harder to decipher, and in some cases includes concepts that simply don’t make sense to a non-specialist. So perhaps closer partnership between data publishers and app developers might stimulate more activity – maybe in the form of hack days or so-called ‘crowd-sourced’ projects.

If this happened, what would they build? Perhaps field guides compiled on the fly for a user-defined region, food-chain or ecological modelling, or visualisation of the effects of man-made structures such as roads on habitats? The possibilities may be endless, and in some cases could prove genuinely insightful.

The value of more diverse communities using this data may be in the serendipity that it creates. The example of whale tracking via Flickr is a case in point. Not only will different communities look to new data sets with which to combine the primary data (even perhaps social networks such as Facebook or Twitter), but they may also approach the problem from new angles.

A partnership of miners

Although I believe that getting a broader base of people interested in biodiversity data could have significant benefits, I suspect that the cutting edge of mining biodiversity data will remain with the specialists.

Without expert involvement, mining data can lead to misinterpretation or false conclusions, especially where the data is complex and opaque. Only within the bioinformatics communities do you find the combination of taxonomic, GIS and regional expertise needed to make major breakthroughs in understanding from this data. In fact, many of the potential apps imagined above would probably need expert input to create genuinely valuable products.

And there is an important footnote – the data does not digitise itself, curate itself and offer itself up for use. It is an expensive (although valuable) function to create and maintain usable datasets that can be mined. Kew and other institutions are having to consider how to ‘biocurate’* their data for future use.

But where I think we could all benefit is by creating more opportunities for citizen scientists, experts and the public to engage together with this critically important data.

- Mike Saunders -**

—————————————————————————————–

References & Notes

* For a review of biocuration, see Howe, D. et al. (2008) Big data: The future of biocuration. Nature 455, 47–50.

** This article was originally published on nature.com blog, 7 March 2011.

4 Comments so far

Hi,

Thank you Mr Saunders for this post.

I am a strong believer in citizen science. I have been working very hard for the last two years on the development of a web application called the BIOAPP, a tool to give institutions, researchers, collectivities or individuals easy access to a network of observers personalized and adapted to their needs. As I see it, it’s not only about data, it’s also about outreach and establishing communication between the scientific community and all the keen observers.

Today, three websites use the BIOAPP: the Underwater Observers Network (http://www.rosm.ca), the Shark Observation Network (http://geerg.ca/sharksonline/autres/index.php) and the Plant Watch Network (http://rspee.glu.org/), and all have been very useful to their owners.

I know that important discoveries can be made from these observations and, to me, an observation that is not logged somewhere is a lost observation!

– Blaise

Comment by Blaise Barrette

GBIF, EoL, IUCN and others have the advantage of using the Catalogue of Life as a common index to species. This improves the ability to cross map information among data holders. These projects are now partners in i4Life http://www.i4life.eu/ and are working together to further develop a shared list of life on Earth. This should further help data mining.

Comment by Alastair Culham

Mike, you ask “So why are there not more biodiversity apps?”, suggesting that making apps is hard. I’d argue the bigger problem is that data providers, including Kew Gardens, insist on locking data up behind restrictive licenses, the recent Plant List being an example (see iphylo.blogspot.com/2011/01/why-won-plant-list-won-let-me-do-this.html). Until people are prepared to set the data free (and I mean genuinely free), a rich ecosystem of apps, etc., won’t happen. Compare this with genomics, where people are moving terabytes of data around and creating vast numbers of derived tools and databases, in large part because the data is in the public domain.

Comment by Roderic Page

Rod, many thanks for your comments. Kew really is trying to make its data available for research purposes. We’re actively reviewing options on how we publish data, and we welcome a range of views in doing so. Our aim is to provide free access to all our data for research purposes. In most cases we find that it is useful – both for Kew and for the user – if we know who they are, so that we can notify them of replacements etc. In some cases, the data includes contributions from partners whose views also need to be taken into account – and sometimes we simply don’t have the permission to make it available.

Comment by Mike Saunders



