A Pipeline for Deep Learning with Specimen Images in iDigBio - Applying and Generalizing an Examination of Mercury Use in Preparing Herbarium Specimens

Bibliographic Details
Title: A Pipeline for Deep Learning with Specimen Images in iDigBio - Applying and Generalizing an Examination of Mercury Use in Preparing Herbarium Specimens
Authors: Matthew Collins, Gaurav Yeole, Paul Frandsen, Rebecca Dikow, Sylvia Orli, Renato Figueiredo
Source: Biodiversity Information Science and Standards 2: e25699
Publication Information: Zenodo, 2018.
Publication Year: 2018
Subject Terms: Spark, iDigBio, deep learning, General Medicine, image
Description: iDigBio (Matsunaga et al. 2013) currently references over 22 million media files and stores approximately 120 terabytes of those media files co-located with our compute infrastructure. Using these images for scientific research is a logistical and technical challenge. Transferring large numbers of images requires programming skill, bandwidth, and storage space. While simple image transformations such as resizing and generating histograms are approachable on desktops and laptops, the neural networks commonly used for learning from images require server-based graphics processing units (GPUs) to run effectively. Using the GUODA (Global Unified Open Data Access) infrastructure, we have built a model pipeline for applying user-defined processing to any subset of the images stored in iDigBio. This pipeline runs on servers located in the Advanced Computing and Information Systems lab (ACIS), alongside the iDigBio storage system. We use Apache Spark, the Hadoop Distributed File System (HDFS), and Mesos to perform the processing. In front of this architecture we have placed a Jupyter notebook server, which gives end users an accessible environment, with deep learning libraries for Python already loaded, in which to write their own models. Users can access the stored data and images, manipulate them according to their requirements, and make their work publicly available on GitHub. As an example of how this pipeline can be used in research, we applied a neural network developed at the Smithsonian Institution to identify herbarium sheets that were prepared with hazardous mercury-containing solutions (Schuettpelz et al. 2017). The model was trained with Smithsonian resources on their images and then transferred to the GUODA infrastructure hosted at ACIS, which also houses iDigBio. We then applied this model to additional images in iDigBio to classify them, illustrating how these techniques can be applied to broad image corpora and potentially used to notify other data publishers of contamination.
We present the results of this classification not as a verified research result, but as an example of the collaborative and scalable workflows this pipeline and infrastructure enable.
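The workflow described above (select a subset of image records, distribute them across workers, and apply a user-supplied model to each image) can be illustrated with a minimal sketch. This is not the authors' code and it does not use Spark, HDFS, or Mesos; it only imitates the per-partition processing pattern in plain Python. The record layout, every function name, and the toy brightness "model" are hypothetical stand-ins, and the toy model has nothing to do with the Smithsonian mercury classifier.

```python
from typing import Callable, Dict, Iterable, Iterator, List

# Hypothetical record shape: {"uuid": str, "family": str, "pixels": [int, ...]}
Record = Dict[str, object]
Model = Callable[[List[int]], float]


def partition(records: List[Record], num_partitions: int) -> List[List[Record]]:
    """Split records round-robin, loosely imitating how a cluster spreads work."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def classify_partition(
    records: Iterable[Record],
    load_model: Callable[[], Model],
    threshold: float,
) -> Iterator[Dict[str, object]]:
    """Score every record in one partition with a user-supplied model."""
    model = load_model()  # loaded once per partition, not once per image
    for rec in records:
        score = model(rec["pixels"])
        yield {"uuid": rec["uuid"], "score": score, "flagged": score >= threshold}


def run_pipeline(
    records: List[Record],
    keep: Callable[[Record], bool],
    load_model: Callable[[], Model],
    num_partitions: int = 2,
    threshold: float = 0.5,
) -> List[Dict[str, object]]:
    """Select a subset of records, process it partition by partition, merge results."""
    subset = [r for r in records if keep(r)]
    results: List[Dict[str, object]] = []
    for part in partition(subset, num_partitions):
        results.extend(classify_partition(part, load_model, threshold))
    return sorted(results, key=lambda r: r["uuid"])


def load_toy_model() -> Model:
    """Stand-in model (assumption): mean pixel brightness scaled to [0, 1].

    Expects a non-empty list of 0-255 pixel values.
    """
    return lambda pixels: sum(pixels) / (255 * len(pixels))


if __name__ == "__main__":
    demo = [
        {"uuid": "a", "family": "Asteraceae", "pixels": [255, 255]},
        {"uuid": "b", "family": "Asteraceae", "pixels": [0, 0]},
        {"uuid": "c", "family": "Poaceae", "pixels": [255, 255]},
    ]
    for row in run_pipeline(demo, lambda r: r["family"] == "Asteraceae", load_toy_model):
        print(row)
```

Loading the model once per partition rather than once per image reflects a common design choice when model start-up is expensive relative to scoring a single image. In a real deployment the partitioning and merging would be handled by the cluster framework rather than by hand.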
File Description: text/html
Access URL: https://explore.openaire.eu/search/publication?articleId=doi_dedup___::ff613da8258dd73f41e4e31ba67fa5e6
https://zenodo.org/record/1309069
Rights: OPEN
Accession Number: edsair.doi.dedup.....ff613da8258dd73f41e4e31ba67fa5e6
Database: OpenAIRE