A graph kernel for protein-protein interaction extraction

June 28, 2017 | Author: Tapio Pahikkala | Category: Protein-Protein Interaction, Cross Validation, Dependence Graph


Description

Antti Airola, Sampo Pyysalo, Jari Björne, Tapio Pahikkala, Filip Ginter and Tapio Salakoski
Turku Centre for Computer Science and Department of IT, University of Turku
Joukahaisenkatu 3-5, 20520 Turku, Finland
[email protected]

Abstract

In this paper, we propose a graph kernel based approach for the automated extraction of protein-protein interactions (PPI) from scientific literature. In contrast to earlier approaches to PPI extraction, the introduced all-dependency-paths kernel has the capability to consider full, general dependency graphs. We evaluate the proposed method across five publicly available PPI corpora, providing the most comprehensive evaluation done for a machine learning based PPI-extraction system. Our method is shown to achieve state-of-the-art performance with respect to comparable evaluations, achieving 56.4 F-score and 84.8 AUC on the AImed corpus. Further, we identify several pitfalls that can make evaluations of PPI-extraction systems incomparable, or even invalid. These include incorrect cross-validation strategies and problems related to comparing F-score results achieved on different evaluation resources.

1 Introduction

Automated protein-protein interaction (PPI) extraction from scientific literature is a task of significant interest in the BioNLP field. The most commonly addressed problem has been the extraction of binary interactions, where the system identifies which protein pairs in a sentence have a biologically relevant relationship between them. Proposed solutions include both hand-crafted rule-based systems and machine learning approaches (see e.g. Bunescu et al., 2005). A wide range of results have been reported for the systems, but as we will show, differences in evaluation resources, metrics and strategies make direct comparison of these numbers problematic. Further, the results gained from the BioCreative II evaluation, where the best performing system achieved a 29% F-score (Hunter et al., 2008), suggest that the problem of extracting binary protein-protein interactions is far from solved.

The public availability of large annotated PPI corpora such as AImed (Bunescu et al., 2005), BioInfer (Pyysalo et al., 2007a) and GENIA (Kim et al., 2008) provides an opportunity for building PPI extraction systems automatically using machine learning. A major challenge is how to supply the learner with the contextual and syntactic information needed to distinguish between interactions and non-interactions. To address the ambiguity and variability of the natural language expressions used to state PPI, several recent studies have focused on the development, adaptation and application of NLP tools for the biomedical domain. Many high-quality domain-specific tools are now freely available, including full parsers such as that introduced by Charniak and Lease (2005). Additionally, a number of conversions from phrase structure parses to dependency structures that make the relationships between words more directly accessible have been introduced. These include conversions into representations such as the Stanford dependency scheme (de Marneffe et al., 2006) that are explicitly designed for information extraction purposes. However, specialized feature representations and kernels are required to make learning from such structures possible. Approaches such as subsequence kernels (Bunescu and Mooney, 2006), tree kernels (Zelenko

BioNLP 2008: Current Trends in Biomedical Natural Language Processing, pages 1–9, Columbus, Ohio, USA, June 2008. © 2008 Association for Computational Linguistics

[Figure: dependency graph fragment for the phrase "interaction of P1 and P2", with edge labels prep_of, prep_of and conj_and.]

Further, we rigorously assess our method on five publicly available PPI corpora, providing the first broad cross-corpus evaluation with a machine learning approach to PPI extraction. Finally, we discuss the effects that different evaluation strategies, choice of corpus and applied metrics have on measured performance, and conclude.
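One way an evaluation strategy can distort measured performance is through the way cross-validation folds are formed. Candidate protein pairs taken from the same document share much of their context, so shuffling individual pairs into folds can place closely related examples on both sides of a train/test split. The following is a minimal sketch of document-level fold assignment under stated assumptions; it is not the authors' procedure. The function name `document_level_folds`, the `(doc_id, pair)` tuple layout, the greedy balancing heuristic and the toy data are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def document_level_folds(examples, n_folds):
    """Group example indices into folds so that a document never spans two folds.

    `examples` is a list of (doc_id, payload) tuples; this layout is an
    assumption made for the sketch. Returns a list of n_folds index lists.
    """
    by_doc = defaultdict(list)
    for index, (doc_id, _payload) in enumerate(examples):
        by_doc[doc_id].append(index)

    folds = [[] for _ in range(n_folds)]
    # Greedy balancing heuristic: place the largest documents first,
    # each into whichever fold currently holds the fewest examples.
    for doc_id in sorted(by_doc, key=lambda d: (-len(by_doc[d]), d)):
        smallest = min(range(n_folds), key=lambda f: len(folds[f]))
        folds[smallest].extend(by_doc[doc_id])
    return folds


# Toy data: candidate protein pairs tagged with a hypothetical source document.
examples = [
    ("doc1", ("P1", "P2")),
    ("doc1", ("P1", "P3")),
    ("doc1", ("P2", "P3")),
    ("doc2", ("P4", "P5")),
    ("doc3", ("P6", "P7")),
    ("doc3", ("P6", "P8")),
]

folds = document_level_folds(examples, n_folds=2)
for test_indices in folds:
    held_out = set(test_indices)
    test_docs = {examples[i][0] for i in held_out}
    train_docs = {examples[i][0] for i in range(len(examples)) if i not in held_out}
    # No document contributes pairs to both the training and the test side.
    assert test_docs.isdisjoint(train_docs)
```

Each fold then serves once as the test set while the remaining folds are used for training. Grouping by sentence rather than by document would follow the same pattern with a different grouping key.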
