Deep feature selection: theory and application to identify enhancers and promoters

By National Research Council Canada

DOI	https://doi.org/10.1089/cmb.2015.0189
Author	Li, Yifeng; Chen, Chih-Yu; Wasserman, Wyeth W.
Format	Text, Article
Subject	Deep feature selection; Deep learning; Enhancer; Promoter
Abstract	Sparse linear models approximate target variable(s) by a sparse linear combination of input variables. Since they are simple, fast, and able to select features, they are widely used in classification and regression. Essentially they are shallow feed-forward neural networks that have three limitations: (1) incompatibility to model nonlinearity of features, (2) inability to learn high-level features, and (3) unnatural extensions to select features in a multiclass case. Deep neural networks are models structured by multiple hidden layers with nonlinear activation functions. Compared with linear models, they have two distinctive strengths: the capability to (1) model complex systems with nonlinear structures and (2) learn high-level representation of features. Deep learning has been applied in many large and complex systems where deep models significantly outperform shallow ones. However, feature selection at the input level, which is very helpful to understand the nature of a complex system, is still not well studied. In genome research, the cis-regulatory elements in noncoding DNA sequences play a key role in the expression of genes. Since the activity of regulatory elements involves highly interactive factors, a deep tool is strongly needed to discover informative features. In order to address the above limitations of shallow and deep models for selecting features of a complex system, we propose a deep feature selection (DFS) model that (1) takes advantage of deep structures to model nonlinearity and (2) conveniently selects a subset of features right at the input level for multiclass data. Simulation experiments convince us that this model is able to correctly identify both linear and nonlinear features. We applied this model to the identification of active enhancers and promoters by integrating multiple sources of genomic information. Results show that our model outperforms elastic net in terms of size of discriminative feature subset and classification accuracy.
Publication date	2016-01-22
Publisher	Mary Ann Liebert
In	Journal of Computational Biology 23, no. 5: 322–336.
Language	English
Peer reviewed	Yes
NPARC number	23000378
Record identifier	a3e7d7b1-dc03-4ce9-8d7e-0f7608d711ae
Record created	2016-07-12
Record modified	2020-03-16

Date modified: 2024-08-31