Jian GUO, Ph.D.

[Office] Room 3901, CFC Building, 5 Shihua Road, Futian District, Shenzhen, China
[Email] xxxyyyy@idea.edu.cn (where xxx is my last name and yyyy is my first name)
Openings

We are hiring research scientists/engineers (full-time or intern) at all levels, as well as postdoc researchers. We also welcome prospective students to apply for a Ph.D. under my supervision at the Hong Kong University of Science and Technology (Guangzhou). Please feel free to email me if you are interested in any of these positions.

  • Research Scientists and Research Engineers: we are hiring full-time research scientists/engineers as well as interns working on deep learning, reinforcement learning, quantitative finance, financial machine learning, financial NLP models/systems, high-throughput/low-latency finance systems, etc. The positions are located in Shenzhen.
  • Prospective Ph.D. Students: we are looking for talented candidates in computer science/engineering/statistics to pursue a Ph.D. degree at the Hong Kong University of Science and Technology (Guangzhou). Ph.D. students will work with advisors at both HKUST (Guangzhou) and IDEA. In particular, the most outstanding students working on AI finance and deep learning will be co-advised by Dr. Harry Shum (Founder and Chairman of IDEA, former Executive Vice President of Microsoft Corporation, Foreign Member of the US National Academy of Engineering) and me.
  • Postdoc Researchers: we are hiring postdoc researchers on deep learning, reinforcement learning, NLP, knowledge graph reasoning and computer system as well as their application in finance and investment.
  • Research Assistants: we are looking for research assistants who graduated from prestigious universities and plan to do 1-2 years of research before pursuing a higher degree (e.g., a Ph.D.).

Introduction

As a founding member, Dr. Jian Guo was the first technical expert to assist Dr. Harry Shum in founding the International Digital Economy Academy (IDEA), where he serves as Executive President as well as Chief Scientist of AI Finance and Deep Learning. He is the founder of IDEA's AI Finance and Deep Learning Research Center (IDEA-FinAI), leading a number of research projects including FinAI-X (a.k.a. the Finance AI Brain), financial behavioral knowledge graphs and related knowledge-reasoning techniques, large-scale deep learning models for finance applications, and reinforcement-learning-based simulation of financial market microstructure. In addition, Dr. Guo serves as Adjunct Professor of Artificial Intelligence at the Hong Kong University of Science and Technology (Guangzhou), Affiliated Professor at the Shanghai Advanced Institute of Finance (SAIF) at Shanghai Jiaotong University, and Professor of Practice at Tsinghua University (Shenzhen International Graduate School).

Prof. Guo began his academic career as a tenure-track professor at Harvard University, working in data science and machine learning. He has published a number of research papers on statistical machine learning, deep learning, reinforcement learning, probabilistic graphical models and bioinformatics, and his research has been applied to areas including recommendation systems, search engines, computational advertising, gene sequence analysis and genetic disease prognosis, credit risk analysis, and quantitative investment. Dr. Guo is also one of the pioneering researchers who explored the application of deep learning and reinforcement learning techniques in financial investment and risk management, and he has over five years of quantitative investment and entrepreneurship experience in financial markets. Dr. Guo received his Ph.D. from the Department of Statistics at the University of Michigan (majoring in machine learning) in the US, and completed his undergraduate study at Tsinghua University (majoring in mathematics and applied mathematics) in China.

Research Interests and Research Projects

My mission for the next 20 years is to build an "AI super brain" for the finance industry. My current research interests include:

  • Researching and developing new deep/reinforcement learning algorithms and frameworks for financial time series prediction, trade execution, quantitative investment and risk management
  • Researching new models/algorithms/systems for reasoning over large-scale financial knowledge graphs, economic behaviors and economic events
  • Large-scale deep learning sequence modeling (targeting 100B+ parameters) for financial market analysis and prediction
  • System 1/System 2 hybrid modeling and cognitive machine learning driven by scenarios, applications and problems in finance


Selected Publications/Manuscripts
Jian Guo, Saizhuo Wang, Lionel M. Ni, Heung-Yeung Shum (2022) Quant 4.0: Engineering Quantitative Investment with Automated, Explainable and Knowledge-driven Artificial Intelligence. arXiv preprint.

Quantitative investment (quant) has become one of the mainstream investment methodologies over the past decades. In this paper, we introduce Quant 4.0 and provide an engineering perspective for next-generation quant. Quant 4.0 has three key components: automated AI, explainable AI and knowledge-driven AI. We also discuss how to build a system that practices the Quant 4.0 concept, and propose ten challenging research problems for quant research in the future.

Jiashuo Sun, Chengjin Xu, Lumingyuan Tang, Saizhuo Wang, Chen Lin, Yeyun Gong, Lionel M. Ni, Heung-Yeung Shum, Jian Guo (2023) Think-on-Graph: Deep and Responsible Reasoning of Large Language Model on Knowledge Graph. International Conference on Learning Representations (ICLR 2024).

Although large language models (LLMs) have achieved significant success in various tasks, they often struggle with hallucination problems, especially in scenarios requiring deep and responsible reasoning. These issues could be partially addressed by introducing external knowledge graphs (KG) in LLM reasoning. In this paper, we propose a new LLM-KG integrating paradigm which treats the LLM as an agent to interactively explore related entities and relations on KGs and perform reasoning based on the retrieved knowledge. We further implement this paradigm by introducing a new approach called Think-on-Graph (ToG), in which the LLM agent iteratively executes beam search on KG, discovers the most promising reasoning paths, and returns the most likely reasoning results.
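The iterative exploration loop described above can be sketched roughly as follows. This is a minimal, illustrative sketch: `graph`, `score`, and the toy entities are our own stand-ins, and in the actual ToG system the scoring and pruning of candidate paths are performed by LLM calls rather than a fixed function.

```python
# Sketch of a ToG-style beam search over a knowledge graph.
# `graph` maps an entity to (relation, neighbor) pairs; `score` stands in
# for the LLM's relevance judgment. Both are illustrative assumptions.

def beam_search_kg(graph, start, score, width=2, depth=2):
    """Keep the `width` highest-scoring reasoning paths at each expansion step."""
    beams = [([start], 0.0)]                      # (path, cumulative score)
    for _ in range(depth):
        candidates = []
        for path, s in beams:
            for rel, nxt in graph.get(path[-1], []):
                candidates.append((path + [rel, nxt], s + score(rel, nxt)))
        if not candidates:                        # no further expansion possible
            break
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:width]                # prune to the beam width
    return beams

# Toy usage: explore outward from "Paris" and keep the single best path.
graph = {
    "Paris": [("capital_of", "France"), ("located_in", "Europe")],
    "France": [("member_of", "EU")],
}
score = lambda rel, ent: {"capital_of": 1.0, "located_in": 0.5, "member_of": 0.8}[rel]
best = beam_search_kg(graph, "Paris", score, width=1, depth=2)
```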

Tong Wu, Zhihao Fan, Xiao Liu, Yeyun Gong, Yelong Shen, Jian Jiao, Hai-Tao Zheng, Juntao Li, Zhongyu Wei, Jian Guo, Nan Duan, Weizhu Chen (2023) AR-Diffusion: Auto-Regressive Diffusion Model for Text Generation. The Thirty-seventh Annual Conference on Neural Information Processing Systems (NeurIPS 2023).

We introduce Auto-Regressive Diffusion (AR-Diffusion). AR-Diffusion ensures that the generation of tokens on the right depends on the generated ones on the left, a mechanism achieved through employing a dynamic number of denoising steps that vary based on token position. This results in tokens on the left undergoing fewer denoising steps than those on the right, thereby enabling them to generate earlier and subsequently influence the generation of tokens on the right.
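The position-dependent schedule can be illustrated with a toy function: tokens further to the right are assigned more denoising steps, so left tokens finish earlier and can condition the generation of right tokens. The linear form and the specific step counts below are our own assumptions for illustration, not the paper's exact schedule.

```python
def denoise_steps(seq_len, min_steps=2, max_steps=10):
    """Assign each token position a number of denoising steps that grows
    left-to-right. Left tokens complete in fewer steps, mimicking the
    auto-regressive left-to-right information flow described above."""
    if seq_len == 1:
        return [min_steps]
    return [round(min_steps + (max_steps - min_steps) * i / (seq_len - 1))
            for i in range(seq_len)]
```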

Hang Zhang, Yeyun Gong, Xingwei He, Dayiheng Liu, Daya Guo, Jiancheng Lv, Jian Guo (2023) Noisy Pair Corrector for Dense Retrieval. EMNLP 2023.

In this paper, we explore an interesting and challenging problem in dense retrieval: how to train an effective model with mismatched-pair noise. To solve this problem, we propose a novel approach called Noisy Pair Corrector (NPC), which consists of a detection module and a correction module. The detection module estimates noise pairs by calculating the perplexity between annotated positive and easy negative documents. The correction module utilizes an exponential moving average (EMA) model to provide a soft supervised signal, aiding in mitigating the effects of noise. We conduct experiments on the text-retrieval benchmarks Natural Questions and TriviaQA and the code-search benchmarks StaQC and SO-DS.
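The EMA idea behind the correction module can be sketched in a few lines: the teacher's parameters are an exponential moving average of the student's, producing a smoother "soft" supervision signal. The parameter names and the decay value below are illustrative, not the paper's exact settings.

```python
# Sketch of an EMA teacher update: teacher <- decay * teacher + (1 - decay) * student,
# applied per parameter. A high decay makes the teacher track the student slowly,
# smoothing out noisy updates.

def ema_update(teacher, student, decay=0.999):
    """Return the EMA-updated teacher parameters (dicts of name -> value)."""
    return {k: decay * teacher[k] + (1 - decay) * student[k] for k in teacher}

# Toy usage: a constant student slowly pulls the teacher toward its value.
teacher = {"w": 1.0}
student = {"w": 0.0}
for _ in range(1000):
    teacher = ema_update(teacher, student, decay=0.99)
# after many steps, teacher["w"] has decayed close to the student's 0.0
```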

Haohan Zhang, Fengrui Hua, Chengjin Xu, Jian Guo, Hao Kong, Ruiting Zuo (2023) Unveiling the Potential of Sentiment: Can Large Language Models Predict Chinese Stock Price Movements?. IJCAI 2023.

We provide a rigorous and encompassing benchmark as well as a standardized back-testing framework aiming at objectively assessing the efficacy of various types of LLMs in the specialized domain of sentiment factor extraction from Chinese news text data. To illustrate how our benchmark works, we reference three distinctive models: 1) the generative LLM (ChatGPT), 2) the Chinese language-specific pre-trained LLM (Erlangshen-RoBERTa), and 3) the financial domain-specific fine-tuned LLM classifier (Chinese FinBERT).

Saizhuo Wang, Hang Yuan, Leon Zhou, Lionel M. Ni, Heung-Yeung Shum, Jian Guo (2023) Alpha-GPT: Human-AI Interactive Alpha Mining for Quantitative Investment. arXiv preprint.

One of the most important tasks in quantitative investment research is mining new alphas (effective trading signals or factors). Traditional alpha mining methods, either hand-crafted factor synthesizing or algorithmic factor mining (e.g., search with genetic programming), have inherent limitations, especially in implementing the ideas of quants. In this work, we propose a new alpha mining paradigm by introducing human-AI interaction, and a novel prompt engineering algorithmic framework to implement this paradigm by leveraging the power of large language models. Moreover, we develop Alpha-GPT, a new interactive alpha mining system framework that provides a heuristic way to "understand" the ideas of quant researchers and outputs creative, insightful, and effective alphas. We demonstrate the effectiveness and advantage of Alpha-GPT via a number of alpha mining experiments.

Xiao-Yang Liu, Z. Xia, H. Yang, J. Gao, D. Zha, M. Zhu, Christina D. Wang, Zhaoran Wang, and Jian Guo (2023) Dynamic Datasets and Market Environments for Financial Reinforcement Learning. Machine Learning Journal.

Building high-quality market environments for training financial reinforcement learning (FinRL) agents is difficult due to major factors such as the low signal-to-noise ratio of financial data, survivorship bias of historical data, and model overfitting. In this paper, we present FinRL-Meta, a data-centric and openly accessible library that processes dynamic datasets from real-world markets into gym-style market environments and has been actively maintained by the AI4Finance community. FinRL-Meta: 1) provides hundreds of market environments through an automatic data curation pipeline; 2) provides homegrown examples and reproduces popular research papers as stepping stones for users to design new trading strategies; 3) provides dozens of Jupyter/Python demos organized into a curriculum and a documentation website to serve the rapidly growing community.

Feng Li, Hao Zhang, Shilong Liu, Jian Guo, Lionel M. Ni, Lei Zhang (2022) DN-DETR: Accelerate DETR Training by Introducing Query DeNoising. The IEEE / CVF Computer Vision and Pattern Recognition Conference (CVPR 2022). A more comprehensive version is published in the journal IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI).

We present in this paper a novel denoising training method to speedup DETR (DEtection TRansformer) training and offer a deepened understanding of the slow convergence issue of DETR-like methods. We show that the slow convergence results from the instability of bipartite graph matching which causes inconsistent optimization goals in early training stages. Our method is universal and can be easily plugged into any DETR-like methods by adding dozens of lines of code to achieve a remarkable improvement.

Xiao-Yang Liu, Zechu Li, Zhuoran Yang, Jiahao Zheng, Zhaoran Wang, Anwar Walid, Jian Guo, Michael Jordan. (2021) ElegantRL-Podracer: Scalable and Elastic Library for Cloud-Native Deep Reinforcement Learning. Data-Centric AI Workshop, 35th Conference on Neural Information Processing Systems (NeurIPS 2021), Sydney, Australia.

In this paper, we present GPU-podracer, a scalable and elastic library for cloud-native deep reinforcement learning, which efficiently utilizes millions of GPU cores to carry out massively parallel agent-environment interactions. Our GPU-podracer library features high scalability, elasticity and accessibility by following the development principles of containerization, microservices and MLOps.

Zechu Li, Xiao-Yang Liu, Jiahao Zheng, Zhaoran Wang, Anwar Walid, Jian Guo. (2021) FinRL-Podracer: High Performance and Scalable Deep Reinforcement Learning for Quantitative Finance. 2nd ACM International Conference on AI in Finance (ICAIF'21), November, 2021, Virtual Event, USA

In this paper, we propose an RLOps in finance paradigm and present a FinRL-Podracer framework to accelerate the development pipeline of deep reinforcement learning driven trading strategy and to improve both trading performance and training efficiency. FinRL-Podracer is a cloud-native microservices-based solution that features high performance and high scalability and promises continuous training, continuous integration, and continuous delivery of DRL-driven trading strategies, facilitating a rapid transformation from algorithmic innovations into a profitable trading strategy.

Xiao-Yang Liu, Jingyang Rui, Jiechao Gao, Liuqing Yang, Hongyang Yang, Zhaoran Wang, Christina Dan Wang, Jian Guo. (2021) FinRL-Meta: A Universe of Near-Real Market Environments for Data-Driven Deep Reinforcement Learning in Quantitative Finance. Deep RL Workshop, 35th Conference on Neural Information Processing Systems (NeurIPS 2021), Sydney, Australia.

In this paper, we present the NeoFinRL framework, which includes tens of near-real market environments for data-driven financial reinforcement learning. First, NeoFinRL separates financial data processing from the design pipeline of deep-reinforcement-learning-based strategies and provides open-source data engineering tools. Second, NeoFinRL provides tens of standard market environments for various trading tasks. Third, NeoFinRL enables massively parallel simulations by exploiting thousands of GPU cores.

Qianggang Ding, Sifan Wu, Hao Sun, Jiadong Guo and Jian Guo. (2020) Hierarchical Multi-Scale Gaussian Transformer for Stock Movement Prediction. Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI-20) Special Track on AI in FinTech, Yokohama, Japan. pp. 4640-4646.

We improved the Transformer model and applied it to stock movement prediction tasks. Firstly, we propose a Multi-Scale Gaussian Prior to enhance the locality of the Transformer. Secondly, we develop an Orthogonal Regularization to avoid learning redundant heads in the multi-head self-attention mechanism. Thirdly, we design a Trading Gap Splitter for the Transformer to learn hierarchical features of high-frequency financial data. Compared with the conventional Transformer and LSTM, the proposed method is better able to mine extremely long-term dependencies from financial time series.
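A Gaussian locality prior on self-attention, in the spirit of the Multi-Scale Gaussian Prior above, can be sketched as an additive bias on the attention logits before the softmax. The function below is our own simplified reading with made-up values, not the paper's exact formulation.

```python
import math

def gaussian_biased_attention(logits, sigma):
    """Bias square attention logits toward nearby positions with a Gaussian
    penalty -(i - j)^2 / (2 sigma^2), then apply a row-wise softmax.
    Smaller sigma means stronger locality."""
    n = len(logits)
    biased = [[logits[i][j] - (i - j) ** 2 / (2 * sigma ** 2) for j in range(n)]
              for i in range(n)]
    out = []
    for row in biased:
        m = max(row)                              # subtract max for stability
        exps = [math.exp(x - m) for x in row]
        s = sum(exps)
        out.append([e / s for e in exps])
    return out
```

With uniform logits, the resulting weights for each query position peak at the query itself and decay with distance, which is exactly the locality the prior is meant to inject.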

Ying Jin, Weilin Fu, Jian Kang, Jiadong Guo, Jian Guo. (2020) Bayesian Symbolic Regression. The Ninth International Workshop on Statistical Relational AI at the 34th AAAI Conference on Artificial Intelligence (AAAI-2020), New York, USA

We propose a new method to fit symbolic regression (SR) under a Bayesian framework. Firstly, the Bayesian model can naturally incorporate prior knowledge (e.g., preferences over basis functions, operators and raw features) to improve the efficiency of fitting SR. Secondly, to improve the interpretability of expressions in SR, we aim to capture concise but informative signals. To this end, we assume the expected signal has an additive structure, i.e., a linear combination of several concise expressions, whose complexity is controlled by a well-designed prior distribution. In our setup, each expression is characterized by a symbolic tree, and the proposed SR model can be solved by sampling symbolic trees from the posterior distribution using an efficient Markov chain Monte Carlo (MCMC) algorithm. Finally, compared with genetic programming (GP), the proposed BSR (Bayesian Symbolic Regression) method saves computer memory, with no need to keep an updated 'genome pool'.
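In symbols (our notation, not necessarily the paper's), the assumed additive structure is:

```latex
y \;=\; \sum_{m=1}^{M} \beta_m \, g_m(x) \;+\; \epsilon ,
```

where each $g_m$ is a concise expression represented by a symbolic tree, and the prior keeps both the number of components $M$ and the complexity of each tree small; MCMC then samples the trees $g_m$ from the posterior.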

Jian Guo, Jie Cheng, Elizaveta Levina, George Michailidis, Ji Zhu (2015) Estimating Heterogeneous Graphical Models for Discrete Data with An Application to Roll Call Voting. Annals of Applied Statistics. 9(2): 821–848.

We consider the problem of jointly estimating a collection of graphical models for discrete data, corresponding to several categories that share some common structure. An example for such a setting is voting records of legislators on different issues, such as defense, energy, and healthcare. We develop a Markov graphical model to characterize the heterogeneous dependence structures arising from such data. The model is fitted via a joint estimation method that preserves the underlying common graph structure, but also allows for differences between the networks. We apply the method to describe the internal networks of the U.S. Senate on several important issues. We also establish consistency of the proposed method both for parameter estimation and model selection.

Jian Guo, Elizaveta Levina, George Michailidis, Ji Zhu (2015) Graphical Models for Ordinal Data. Journal of Computational and Graphical Statistics. 24(1): 183–204.

A graphical model for ordinal variables is considered, where it is assumed that the data are generated by discretizing the marginal distributions of a latent multivariate Gaussian distribution. The relationships between these ordinal variables are then described by the underlying Gaussian graphical model and can be inferred by estimating the corresponding concentration matrix. Direct estimation of the model is computationally expensive, but an approximate EM-like algorithm is developed to provide an accurate estimate of the parameters at a fraction of the computational cost.
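The latent-Gaussian data-generating mechanism described above can be illustrated with a short simulation: correlated Gaussian draws are discretized into ordinal levels by fixed thresholds. The thresholds, correlation, and sample size below are made-up illustrative values.

```python
import random

def sample_ordinal(rho=0.8, cuts=(-0.5, 0.5), n=1000, seed=0):
    """Draw n pairs of ordinal variables (levels 0, 1, 2) by discretizing a
    bivariate Gaussian with correlation rho at the given cut points."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        z1 = rng.gauss(0, 1)
        # conditional draw so (z1, z2) has correlation rho
        z2 = rho * z1 + (1 - rho ** 2) ** 0.5 * rng.gauss(0, 1)
        level = lambda z: sum(z > c for c in cuts)   # count thresholds exceeded
        data.append((level(z1), level(z2)))
    return data
```

Because the latent variables are positively correlated, the observed ordinal levels agree far more often than they would under independence, and it is this dependence that the concentration matrix of the latent Gaussian captures.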

Wenqiong Xue, Jian Kang, F. DuBois Bowman, Tor D. Wager, and Jian Guo (2014) Identifying Functional Co-activation Patterns in Neuroimaging Studies via Poisson Graphical Models. Biometrics. 70(4): 812–822.

Studying the interactions between different brain regions is essential to achieve a more complete understanding of brain function. In this paper, we focus on identifying functional co-activation patterns and undirected functional networks in neuroimaging studies. We build a functional brain network, using a sparse covariance matrix, with elements representing associations between region-level peak activations. We adopt a penalized likelihood approach to impose sparsity on the covariance matrix based on an extended multivariate Poisson model. Conducting a meta-analysis of 162 functional neuroimaging studies on emotions, our model identifies a functional network that consists of connected regions within the basal ganglia, limbic system, and other emotion-related brain regions.

Jian Guo (2011) Class-specific Variable Selection for Multicategory Support Vector Machines. Statistics and Its Interface. 4: 19–26.

This paper proposes a class-specific variable selection method for multicategory support vector machines. Unlike existing variable selection methods, the proposed method not only captures the important variables for classification, but also identifies the discriminable and nondiscriminable classes so as to enhance the interpretability of multicategory classification problems. It minimizes the hinge loss of the SVM coupled with a pairwise fusion penalty, which identifies nondiscriminable classes by shrinking their associated coefficients in the decision functions to a common value.

Jian Guo, Elizaveta Levina, George Michailidis, Ji Zhu (2011) Joint Estimation of Multiple Graphical Models. Biometrika. 98 (1): 1–15.

Gaussian graphical models explore dependence relationships between random variables, through the estimation of the corresponding inverse covariance matrices. In this paper we develop an estimator for such models appropriate for data from several graphical models that share the same variables and some of the dependence structure. In this setting, estimating a single graphical model would mask the underlying heterogeneity, while estimating separate models for each category does not take advantage of the common structure. We propose a method that jointly estimates the graphical models corresponding to the different categories present in the data, aiming to preserve the common structure, while allowing for differences between the categories. This is achieved through a hierarchical penalty that targets the removal of common zeros in the inverse covariance matrices across categories.

Note: This paper won the 2010 INFORMS Best Student Paper Award (First Place), sponsored by the Data Mining Section of the Institute for Operations Research and the Management Sciences. It also won the 2010 ASA Student Paper Competition Award, sponsored by the Statistical Learning and Data Mining Section of the American Statistical Association.
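In a simplified form (our notation, not a verbatim reproduction of the paper's), the hierarchical penalty decomposes each off-diagonal precision entry into a common factor and a category-specific factor:

```latex
\omega_{jj'}^{(k)} = \theta_{jj'} \, \gamma_{jj'}^{(k)}, \qquad \theta_{jj'} \ge 0,
\qquad
\text{penalty: } \;
\lambda_1 \sum_{j \neq j'} \theta_{jj'}
\;+\;
\lambda_2 \sum_{j \neq j'} \sum_{k=1}^{K} \bigl| \gamma_{jj'}^{(k)} \bigr| .
```

Setting $\theta_{jj'} = 0$ zeroes the $(j,j')$ entry in all $K$ precision matrices simultaneously (a common missing edge), while the $\gamma_{jj'}^{(k)}$ terms allow edges to differ across categories.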

Bee-Chung Chen, Jian Guo, Belle Tseng, Jie Yang (2011) User Reputation in A Comment Rating Environment. Proceeding of the 17th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD-2011), San Diego, California, USA.

We find that the quality of a comment judged editorially is almost uncorrelated with the ratings that it receives, but can be predicted using standard text features, achieving accuracy as high as the agreement between two editors! However, extracting a pure reputation signal from ratings is difficult because of data sparseness and several confounding factors in users’ voting behavior. To address these issues, we propose a novel bias-smoothed tensor model and empirically show that our model significantly outperforms a number of alternatives based on Yahoo! News, Yahoo! Buzz and Epinions datasets.

Liangjie Hong, Dawei Yin, Jian Guo, Brian D. Davison (2011) Tracking trends: incorporating term volume into temporal topic models. Proceeding of the 17th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD-2011), San Diego, California, USA

Text corpora with documents from a range of time epochs are natural and ubiquitous in many fields, such as research papers, newspaper articles and a variety of types of recently emerged social media. People would not only like to know what kinds of topics can be found in these data sources but also wish to understand the temporal dynamics of these topics and predict certain properties of terms or documents in the future. In this paper, we introduce a real-world task, tracking trends of terms, to which temporal topic models can be applied. Rather than building a general-purpose model, we propose a new type of topic model that incorporates the volume of terms into the temporal dynamics of topics and optimizes estimates of term volumes. We combine state-space models incorporating term volumes with a supervised learning model, enabling us to effectively predict future volumes, even without new documents.

Jian Guo, Gareth James, Elizaveta Levina, George Michailidis, Ji Zhu (2010) Principal Component Analysis with Sparse Fused Loadings. Journal of Computational and Graphical Statistics. 19(4): 947–962.

In this article, we propose a new method for principal component analysis (PCA), whose main objective is to capture natural “blocking” structures in the variables. Further, the method, beyond selecting different variables for different components, also encourages the loadings of highly correlated variables to have the same magnitude. These two features often help in interpreting the principal components. To achieve these goals, a fusion penalty is introduced and the resulting optimization problem is solved by an alternating block optimization algorithm.

Jian Guo (2010) Simultaneous variable selection and class fusion for high-dimensional linear discriminant analysis. Biostatistics. 11(4): 599–608.

In many high-dimensional microarray classification problems, an important task is to identify subsets of genes that best discriminate the classes. Nevertheless, existing gene selection methods for microarray classification cannot identify which classes are discriminable or not by these selected genes. In this paper, we propose an improved linear discriminant analysis (LDA) method that simultaneously selects important genes and identifies the discriminable classes. Specifically, a pairwise fusion penalty for LDA is used to shrink the pairwise differences of the class centroids for each variable and to fuse the centroids of indiscriminable classes together.

Jian Guo, Elizaveta Levina, George Michailidis, Ji Zhu (2010) Pairwise variable selection for high-dimensional model-based clustering. Biometrics. 66(3): 793–804.

Variable selection for clustering is an important and challenging problem in high-dimensional data analysis. Existing variable selection methods for model-based clustering select informative variables in a “one-in-all-out” manner; that is, a variable is selected if at least one pair of clusters is separable by this variable and removed if it cannot separate any of the clusters. In many applications, however, it is of interest to further establish exactly which clusters are separable by each informative variable. To address this question, we propose a pairwise variable selection method for high-dimensional model-based clustering using a new pairwise penalty for regularization.

Note: This paper won the 2009 ASA Student Paper Competition Award, sponsored by the Statistical Computing and Graphics Sections of the American Statistical Association. It also won the 2009 ENAR Distinguished Student Paper Award, sponsored by the International Biometric Society.

Qiaozhu Mei, Jian Guo, Dragomir Radev (2010) DivRank: The Interplay of Prestige and Diversity in Information Networks. Proceeding of the 16th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD-2010), Washington D.C., USA. pages 1009-1018.

We propose a novel ranking algorithm, DivRank, based on a reinforced random walk in an information network. This model automatically balances the prestige and the diversity of the top ranked vertices in a principled way. DivRank not only has a clear optimization explanation, but also well connects to classical models in mathematics and network science. We evaluate DivRank using empirical experiments on three different networks as well as a text summarization task. DivRank outperforms existing network-based ranking methods in terms of enhancing diversity in prestige.
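The reinforced random walk can be sketched in a few lines: the probability of transitioning into a node is boosted by how often the walk has already visited it (a rich-gets-richer effect), which concentrates mass on a few prestigious yet mutually diverse nodes. The teleport constant, normalization, and update schedule below are our own simplifications, not the paper's exact formulation.

```python
def divrank(adj, alpha=0.25, iters=200):
    """Toy DivRank-style scores for a directed graph adj: {node: [neighbors]}.
    Each iteration mixes uniform teleportation with a visit-count-reinforced
    random walk, then adds the new scores to the cumulative visit counts."""
    nodes = list(adj)
    n = len(nodes)
    score = {v: 1.0 / n for v in nodes}
    visits = {v: 1.0 for v in nodes}              # cumulative visit counts
    for _ in range(iters):
        new = {}
        for v in nodes:
            rank = 0.0
            for u in nodes:
                if v in adj[u]:
                    # reinforced transition u -> v, weighted by v's visit count
                    denom = sum(visits[w] for w in adj[u])
                    rank += score[u] * visits[v] / denom
            new[v] = (1 - alpha) / n + alpha * rank
        score = new
        for v in nodes:
            visits[v] += score[v]                 # reinforce visited nodes
    return score
```

On a star graph, for instance, the hub accumulates visits fastest and ends up with the highest score, while the leaves share the remainder.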

Man-Wai Mak, Jian Guo, and Sun-Yuan Kung (2008) PairProSVM: Protein Subcellular Localization Based on Local Pairwise Profile Alignment and SVM. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 5 (3): 416–422.

This paper introduces a new method—PairProSVM—to automatically predict the subcellular locations of proteins. The profiles of all protein sequences in the training set are constructed by PSI-BLAST, and the pairwise profile alignment scores are used to form feature vectors for training a support vector machine (SVM) classifier. It was found that PairProSVM outperforms the methods that are based on sequence alignment and amino acid compositions even if most of the homologous sequences have been removed. The overall accuracies on these data sets reach 75.3 percent and 91.9 percent, respectively, which are higher than or comparable to those obtained by sequence alignment and composition-based methods.

Xian Pu, Jian Guo, Howard Leung, Yuanlie Lin (2007) Prediction of membrane protein types from sequences and position-specific scoring matrices. Journal of Theoretical Biology, 247 (2): 259–265.

This paper introduces an integrative approach to classify membrane proteins based on protein sequences and protein profiles. Several feature-extraction modules extract the amino acid composition of the whole profiles, the amino acid composition of N-terminal and C-terminal profiles, the amino acid composition of profile segments, and the dipeptide composition of the whole profiles. In the computational experiments, the overall accuracy of the proposed approach is comparable with that of the functional-domain-based method. In addition, the performance of the proposed approach is complementary to the functional-domain-based method for different membrane protein types.

Jian Guo, Yuanlie Lin, Xiangjun Liu (2006) GNBSL: A New Integrative System to Predict Subcellular Location for Gram-negative Bacteria Proteins. Proteomics, 6 (19): 5099–5105.

This paper proposes a new integrative system (GNBSL – Gram-negative bacteria subcellular localization) specialized for predicting the subcellular localization of Gram-negative bacteria proteins. First, the system generates a position-specific frequency matrix (PSFM) and a position-specific scoring matrix (PSSM) for each protein sequence by searching the Swiss-Prot database. Then different features are extracted by four modules from the PSFM and the PSSM. The features include whole-sequence amino acid composition, N- and C-terminus amino acid composition, dipeptide composition, and segment composition. Four probabilistic neural network (PNN) classifiers are used to classify these modules. To further improve the performance, two modules trained by support vector machine (SVM) are added in this system. One module extracts the residue-couple distribution from the amino acid sequence and the other module applies a pairwise profile alignment kernel to measure the local similarity between every two sequences. Finally, an additional SVM is used to fuse the outputs from the six modules.

Jian Guo, Yuanlie Lin (2006) TSSub: Eukaryotic Protein Subcellular Localization by Extracting Features from Profiles. Bioinformatics, 22 (14): 1784–1785.

This paper introduces a new subcellular localization system (TSSub) for eukaryotic proteins. This system extracts features from both profiles and amino acid sequences. Four different features are extracted from profiles by four probabilistic neural network (PNN) classifiers, respectively (the amino acid composition from whole profiles; the amino acid composition from the N-terminus of profiles; the dipeptide composition from whole profiles and the amino acid composition from fragments of profiles). In addition, a support vector machine (SVM) classifier is added to implement the residue-couple feature extracted from amino acid sequences.

Jian Guo, Xian Pu, Yuanlie Lin, Howard Leung (2006) Protein subcellular localization based on psi-blast and machine learning. Journal of Bioinformatics and Computational Biology, 4(6): 1181–1195.

Subcellular location is an important functional annotation of proteins. An automatic, reliable and efficient prediction system for protein subcellular localization is necessary for large-scale genome analysis. This paper describes a protein subcellular localization method which extracts features from protein profiles rather than from amino acid sequences. The protein profile represents a protein family, discards part of the sequence information that is not conserved throughout the family, and is therefore more sensitive than the amino acid sequence. The amino acid compositions of the whole profile and of the N-terminus of the profile are extracted, respectively, to train and test the probabilistic neural network classifiers.

Jian Guo, Man-Wai Mak, Sun-Yuan Kung (2006) Eukaryotic protein subcellular localization based on local pairwise profile alignment SVM. Proceedings of 2006 IEEE International Workshop on Machine Learning for Signal Processing (MLSP-2006), Maynooth, Ireland, pp. 391–396.

This paper studies the use of profile alignment and support vector machines for subcellular localization. In the training phase, the profiles of all protein sequences in the training set are constructed by PSI-BLAST and the pairwise profile-alignment scores are used to form feature vectors for training a support vector machine (SVM) classifier. During testing, the profile of a query protein sequence is computed and aligned with all the profiles constructed during training to obtain a feature vector for classification by the SVM classifier. Tests on Reinhardt and Hubbard's eukaryotic protein dataset show that the total accuracy can reach 99.4%, which is significantly higher than those obtained by methods based on sequence alignments and amino acid composition. It was also found that the proposed method can still achieve a prediction accuracy of 96% even if none of the sequence pairs in the dataset contains more than 5% identity. This paper also demonstrates that the performance of the SVM is proportional to the degree to which its kernel matrix meets Mercer's condition.

Jian Guo, Hu Chen, Zhirong Sun, Yuanlie Lin (2004) A novel method for protein secondary structure prediction using dual-layer SVM and profiles. Proteins: Structure, Function and Bioinformatics, 54: 738–743.

A high-performance method was developed for protein secondary structure prediction based on the dual-layer support vector machine (SVM) and position-specific scoring matrices (PSSMs). SVM is a new machine learning technology that has been successfully applied in solving problems in the field of bioinformatics. The SVM's performance is usually better than that of traditional machine learning approaches. The performance was further improved by combining PSSM profiles with the SVM analysis. The PSSMs were generated from PSI-BLAST profiles, which contain important evolution information. The final prediction results were generated from the second SVM layer output. On the CB513 data set, the three-state overall per-residue accuracy, Q3, reached 75.2%, while segment overlap (SOV) accuracy increased to 80.0%. On the CB396 data set, the Q3 of our method reached 74.0% and the SOV reached 78.1%.