Case Study. Northern Spotted Owls, a federally protected species, live in the Oregon forests, which are affected by human activity such as logging. The areas where the owls live are protected from the loggers, but which areas are they? Data can show the way, enabling sustainable timber harvesting and other human activity, and even protecting the owls from displacement by competing species.
Recorded wildlife sounds, such as birdsong, are stored with Quobyte file, block and object software and interpreted using AI. This data shows where the owl habitats are located.
Forests of Douglas Fir, Ponderosa Pine, Juniper and other mixed conifers cover over 30.5 million acres of Oregon, almost half of the state. Finding out where the owls live is quite a task. The Center for Quantitative Life Sciences (CQLS) at Oregon State University, working with the USDA Forest Service, has deployed and is tracking 1500 autonomous sound and video recording units in the forests, gathering real-time data. The aim is to monitor the behaviour of wildlife living in the forests of Oregon to ensure that the logging industry’s impact is managed, and to allow for other interventions.
The CQLS creates around 250 terabytes of audio and video data a month from the recording units, and maintains around 18PB of data at any given time. It keeps taking data off and reusing the space to avoid buying infinite storage.
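For a sense of scale, the monthly figure can be turned into an average ingest rate. The sketch below is illustrative arithmetic only: it assumes decimal units (1TB = 10^12 bytes) and a 30-day month, neither of which CQLS states, and the function name is hypothetical.

```python
# Rough conversion of "250 terabytes a month" into an average ingest rate.
# Assumptions (not from CQLS): decimal units and a 30-day month.

TB = 10**12  # bytes in a decimal terabyte
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # assumed 30-day month


def average_ingest_rate(tb_per_month: float) -> tuple[float, float]:
    """Return the average rate as (megabytes per second, gigabits per second)."""
    bytes_per_second = tb_per_month * TB / SECONDS_PER_MONTH
    return bytes_per_second / 10**6, bytes_per_second * 8 / 10**9


mb_per_s, gbit_per_s = average_ingest_rate(250)
print(f"{mb_per_s:.0f} MB/s, or about {gbit_per_s:.2f} Gbit/s, averaged over the month")
```

Under those assumptions the average is a little under 100MB/s. An average hides bursts, so the peak rate the storage has to absorb may be considerably higher.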
Over an 18-month period it devised an algorithm to parse the audio recordings and identify different animal species. The algorithm creates spectrograms from the audio, and processes those spectrograms through a convolutional neural net based on the video. It can identify about thirty separate species, distinguish male from female, and even spot behavioural changes within a species over time.
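CQLS has not said how its spectrograms are computed. A common approach is a short-time Fourier transform, which slices the audio into overlapping frames and measures the strength of each frequency in each frame. The following is a minimal pure-Python sketch of that general technique, not CQLS's code: the function name, frame size, hop length and synthetic test tone are all assumptions, and the neural network classification step is not shown.

```python
# Illustrative only: a tiny short-time Fourier transform in pure Python.
# Not CQLS's code; frame size, hop and the test tone are assumptions.
import cmath
import math

FRAME_SIZE = 64  # samples per frame (assumed)
HOP = 32         # samples between frame starts (assumed)


def spectrogram(samples, frame_size=FRAME_SIZE, hop=HOP):
    """Return a list of frames, each a list of magnitudes for bins 0..frame_size//2."""
    # A Hann window reduces leakage between neighbouring frequency bins.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_size)
              for n in range(frame_size)]
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_size)]
        # Naive DFT, O(N^2) per frame; a real pipeline would use an FFT library.
        magnitudes = []
        for k in range(frame_size // 2 + 1):
            total = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                        for n in range(frame_size))
            magnitudes.append(abs(total))
        frames.append(magnitudes)
    return frames


# Synthetic quarter-second "call": a pure 1kHz tone sampled at 8kHz.
SAMPLE_RATE = 8000
tone = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(2000)]

spec = spectrogram(tone)
# Find the strongest frequency bin in the first frame and convert it to Hz.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * SAMPLE_RATE / FRAME_SIZE
print(len(spec), "frames; strongest frequency in first frame:", peak_hz, "Hz")
```

Each row of the result is one time slice and each column one frequency band, so the grid can be handed to an image classifier in the same way as a picture.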
The compute work takes place on an HPC setup comprising IBM Power System AC922 servers, collectively containing more than 6000 processors across 20 racks in two server rooms that serve 2500 users. The AC922 architecture puts AI-optimised GPU resources directly on the northbridge bus, much closer to the CPU than conventional server architectures.
CQLS needed a file system and storage solution able to keep massive datasets close to compute resources, because swapping data in and out from external scratch resources doubled processing time.
At first it looked at public cloud storage options, but the associated costs were considered “outrageously expensive”.
CQLS checked a variety of storage alternatives and settled on Quobyte running on commercial off-the-shelf (COTS) hardware, rejecting more expensive storage appliance alternatives which could need expensive support arrangements.
The sizes of individual files vary from tiny to very large and everything in between. The Quobyte software is good when dealing with large files, as opposed to millions of highly similar small files. This is advantageous when working on AI training, where TIFF files can range from 20 to 200GB in size.
Concurrently, those files may need to be correlated with data from sensors, secondary cameras, microphones, and other instruments. Everything must flow through one server, which puts a massive load on compute and storage.
Quobyte’s software uses four Supermicro servers with two Intel Xeon E5-2637 v4 CPUs @ 3.50GHz and 256GB RAM (DDR4 2400). There are LSI SAS3616 12Gbit/s SAS controllers running two 78-disk JBODs. These are filled with Toshiba MG07ACA14TA 14TB, SATA-6Gbit/s, 7200rpm, 3.5-inch disk drives.
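The raw disk capacity implied by that hardware list can be worked out directly. The description does not say whether the two JBODs are the entire disk pool or whether each of the four servers drives its own pair, so the sketch below shows both readings as labelled assumptions. It uses decimal terabytes and ignores any replication or erasure-coding overhead; the constant and function names are hypothetical.

```python
# Raw (unformatted) disk capacity implied by the hardware list above.
# Assumptions: decimal terabytes, no allowance for replication or erasure coding.

DISKS_PER_JBOD = 78
DISK_TB = 14
JBODS_PER_SET = 2
SERVERS = 4


def raw_capacity_tb(jbod_sets: int) -> int:
    """Raw capacity in TB for a given number of two-JBOD sets."""
    return jbod_sets * JBODS_PER_SET * DISKS_PER_JBOD * DISK_TB


# Reading 1 (assumption): the two JBODs are the whole disk pool.
print("two JBODs in total:", raw_capacity_tb(1), "TB")
# Reading 2 (assumption): each of the four servers drives its own pair.
print("two JBODs per server:", raw_capacity_tb(SERVERS), "TB")
```

Only the second reading leaves room for the 2.6PB said to be loaded on the Quobyte setup, although the source does not confirm either layout.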
The entire HPC system is Linux-based and everything is mounted through the Quobyte client for x86-based machines and NFS for the PPC64LE (AC922) servers.
Many groups of users access the system. A single group could have hundreds or millions of files, depending on the work they do. Most groups use over 50TB each and currently there is 2.6PB loaded on the Quobyte setup.
Data ingest detail
Christopher Sullivan, Assistant Director for Biocomputing at CQLS, said: “We have all kinds of pathways for data to come into the systems. First off, all research buildings at OSU are connected at a minimum of 40Gbit/sec network, and our building and incoming feed to the CGRB (Center for Genome Research and Biocomputing) is 100Gbit/sec, and there is a 200Gbit/sec network link between OSU and HMSC (Hatfield Marine Science Center) at the coast.
“To start, some of the machines in our core lab facility (not the sequencers) do drop live data onto the system through SAMBA or NFS-mounted pathways. Next, we have users moving data onto the servers via a front-end machine, again providing services like SAMBA and SSH with a 40Gbit/sec network connection for data movement.
“This allows for users to have machines around the university move data automatically or by hand onto the systems. For example, we have other research labs moving data from machines or data collected in greenhouses and other sources. The data line to the coast mentioned above is used to move data onto the Quobyte for the plankton group as another example.”
What about backup?
Sullivan said: “Backup is something we need on a limited basis since we can generally generate the data again cheaper than the costs of paying for backing up that large amount of space. Most groups back up the scripts and final output (ten per cent) of the space they use for work. Some groups take the original data and if needed by grants keep the original data on cold drives on shelves in a building a quarter-mile away from the primary. So again we do not need a ton of live backup space.”
Quobyte vs public cloud
Sullivan said: “We found that using public clouds was too expensive since we are not able to get the types of hardware in spot instances and data costs are crazy expensive. Finally, researchers cannot always tell what is going to happen with their work or how long it needs to run, etc.
“This makes the cloud very costly and on-premises very cost-effective. My groups buy big machines (256 thread count with 2TB RAM and 12x GPUs) that last 6–7 years and can do anything they need. That would be paid for five times over in that same time frame in the cloud for that hardware. Finally, the file space is too expensive over the long haul, and hard to move large amounts of data on and off. We have the Quobyte to reduce our overall file space costs.”
Seeing the wood for the trees
This is a complicated and sizeable HPC setup which does more than safeguard the Northern Spotted Owl’s arboreal habitats. That work is done in an ingenious way — one that doesn’t involve human bird-spotters trekking through the forests looking for the creatures.
Instead, AI algorithms analyse and interpret recorded bird songs, recognise those made by the owls and then log where and when the owls are located. That data can be used to safeguard those areas of forest, allowing sustainable logging and reducing the impact of other human activity and competing species. And the owls can live in peace.