Deciphering complex patterns of class-I HLA-peptide cross-reactivity via hierarchical grouping

June 30, 2017 | Author: Sumanta Mukherjee | Category: Immunology, Structural Biology, MHC class I, Immunoinformatics


Description

Immunology and Cell Biology (2015), 1–11 © 2015 Australasian Society for Immunology Inc. All rights reserved 0818-9641/15 www.nature.com/icb

THEORETICAL ARTICLE

Deciphering complex patterns of class-I HLA–peptide cross-reactivity via hierarchical grouping

Sumanta Mukherjee1, Jim Warwicker2 and Nagasuma Chandra1,3

T-cell responses in humans are initiated by the binding of a peptide antigen to a human leukocyte antigen (HLA) molecule. The peptide–HLA complex then recruits an appropriate T cell, leading to cell-mediated immunity. More than 2000 HLA class-I alleles are known in humans, and they vary only in their peptide-binding grooves. The polymorphism they exhibit enables them to bind a wide range of peptide antigens from diverse sources. HLA molecules and peptides present a complex molecular recognition pattern, as many peptides bind to a given allele and a given peptide can be recognized by many alleles. A powerful grouping scheme that not only provides an insightful classification, but is also capable of dissecting the physicochemical basis of recognition specificity is necessary to address this complexity. We present a hierarchical classification of 2010 class-I alleles by using a systematic divisive clustering method. All-pair distances of alleles were obtained by comparing binding pockets in the structural models. By varying the similarity thresholds, a multilevel classification was obtained, with 7 supergroups, each further subclassifying to yield 72 groups. An independent clustering performed based only on similarities in their epitope pools correlated highly with pocket-based clustering. Physicochemical feature combinations that best explain the basis of clustering are identified. Mutual information calculated for the set of peptide ligands enables identification of binding site residues contributing to peptide specificity. The grouping of HLA molecules achieved here will be useful for rational vaccine design, understanding disease susceptibilities and predicting risk of organ transplants.

Immunology and Cell Biology advance online publication, 24 February 2015; doi:10.1038/icb.2015.3

1IISc Mathematics Initiative, Indian Institute of Science, Bangalore, India; 2Manchester Institute of Biotechnology, The University of Manchester, Manchester, UK and 3Department of Biochemistry, Indian Institute of Science, Bangalore, India
Correspondence: Professor N Chandra, Department of Biochemistry, Indian Institute of Science, Bangalore 560012, Karnataka, India. E-mail: [email protected]
Received 4 November 2014; revised 20 December 2014; accepted 22 December 2014

Human leukocyte antigens (HLAs) play an important role in presenting peptide antigens to T lymphocytes to elicit cell-mediated immunity. Located on the surfaces of antigen-presenting cells, HLA molecules can bind a range of peptide fragments and present them to appropriate T cells, leading to the formation of peptide–HLA–T-cell receptor complexes and subsequent generation of immune responses.1,2 HLA genes are highly polymorphic, resulting in a high degree of sequence variation at the peptide-binding grooves. This enables them to recognize a large number of different peptides. More than 2000 HLA class-I alleles are reported in the International ImMunoGeneTics (IMGT) database.3 The peptide-binding groove is a concave depression created between the α1 and α2 domains that fold into a pair of parallel helices.1 Although polymorphic regions have been identified, genetic diversity does not map one-to-one onto functional diversity. Function here refers to the peptide repertoire that an allele can recognize. This is because although peptide recognition is highly specific, multispecificity is commonly seen from the perspective of both the HLA molecule and the peptide, rendering HLA–peptide recognition a highly complex problem. A systematic epitope prediction exercise suggests that some HLA molecules bind to a large variety of peptides, whereas many others bind to very few for a given affinity cutoff. Peptide ligands exhibit a similar trend, as some peptides are predicted to bind to a large number of HLA alleles whereas some can bind only to one specific HLA allele. Available experimental data also reflect the same.

A powerful classification scheme that explains complex patterns of peptide recognition becomes important for several reasons. First, it is useful for estimating potential antigenicity of a peptide and thereby of a whole pathogenic genome. Second, it helps in explaining differences in disease susceptibilities and disease severity in different individuals, and this can ultimately lead to the development of predictive models. Third, it is useful for comparing alleles and HLA genotypes between different individuals for predicting organ graft rejection. Finally, it provides a platform to integrate and interpret numerous experimental results reported in literature on HLA–peptide binding and allele profiling.

Different methods have been attempted to describe HLA molecules and to organize them into groups. Serotyping, the oldest among them, groups alleles together based on their reaction to HLA antisera.4 As sequences of various alleles became available, grouping HLA molecules on the basis of similarities in their sequences was explored.5 This grouping identified broad serotype classes, but could not resolve many ambiguities that existed regarding functional similarities among diverse classes. This is quite understandable: although HLA sequences provide accurate detailed information about different alleles, they do not directly reflect the antigen pools they can recognize. In an effort to overcome this limitation, the concept of supertyping was proposed by Sette and Sidney6 in 1999; it uses sequence motifs derived from structural knowledge to obtain a broad idea of the antigen profiles that alleles can recognize. Using this approach, it has been suggested that HLA alleles can be grouped into nine supertypes.6

Various attempts have been made over the years to develop methods for predicting peptides that can bind to different HLA alleles. Most of these methods are typically data driven and utilize various bioinformatics approaches.7–10 In recent years, methods based on machine learning have become increasingly reliable9,11 and have been explored to obtain functional clusters of major histocompatibility complex (MHC) molecules. Although several allele-specific features have been obtained from these studies, understanding coverage of the total peptide-binding pools for HLAs and the extent of overlap between them remains an open problem. More than finding a pattern, in the long term it is important to be able to get specific answers as to why a certain pattern is observed, and this continues to pose a challenge. Current methods used for classification are based on sequence or structural motifs with known associations with experimentally determined antigens that they recognize. Given the complexity of the problem, newer approaches are necessary first to capture the recognition specificities and then to compare them efficiently.

The patterns of peptide–HLA associations, though complex, are amenable to study using network approaches that utilize well-established graph-theoretical methods. Networks are being increasingly used to address a variety of problems in biology, particularly for the analysis of large-scale molecular data from genomics and related technologies.12,13

A number of clustering algorithms are available that can be used for clustering various alleles or peptide pools into groups.
They vary in their abilities to achieve biologically meaningful grouping, making it important to explore and use an approach appropriate for the data and the problem at hand. Here we explore a divisive clustering approach for classifying HLA molecules in a hierarchical manner. Many clustering methods are based on agglomerative approaches that suffer from lack of stability in the clusters.14 This is particularly true when the inherent noise in the data is high and separation is low, as is the case typically with large-scale biological data. Divisive clustering methods overcome this drawback and provide robust clustering. To obtain distances (or extent of dissimilarity) between pairs of HLA molecules, we focus on the structures of the binding sites and use a well-validated comparison algorithm PocketMatch (PM),15 developed previously in our laboratory. All-to-all distances obtained are further used for clustering, yielding a multistep classification of the whole HLA repertoire. Using a new feature-finding algorithm, we then identify discriminating features for each group that help in rationalizing the observed peptide recognition specificities.

RESULTS

A multistep classification of HLA molecules using binding site structure comparisons

The data set studied consists of structural models of 2010 HLA class-I molecules and includes 150 crystallographically determined structures (of 24 distinct HLA alleles) of HLA molecules bound to peptides. Given that all HLA molecules adopt the same fold and belong to the same structural superfamily and family, modeling their atomic structures is fairly straightforward and provides high confidence models. Binding pockets of 57 residues each are identified from the structures (see Methods). Figure 1 shows a sequence logo representing a multiple structure-based sequence alignment of the binding pockets in all 2010 alleles from A, B and C loci. The height of the letter indicates the extent of conservation of the given amino acid in that position in all alleles. Of the 57 residue positions in the extended pocket, it is clear from the figure that many are conserved. We used structure-based sequence alignments, as they are more sensitive and hence superior16 to the more common sequence-based methods for obtaining multiple sequence alignments. Only 21 residues were found to vary considerably, corresponding to a positional information entropy ⩾ 0.3, whereas other positions are conserved.

All-to-all pairwise comparison of extended pockets was carried out using PM. The distances obtained from PM, which capture site dissimilarities quantitatively, were all observed to lie between 0.0 and ∼0.4 on a scale of 0 to 1, where 0 indicates identity and 1 indicates total dissimilarity. In other words, a baseline similarity of ∼60% is seen in any given pair. From this, it is clear that the binding sites have an overall similar structure in all molecules. The task therefore narrows down to identifying subtle differences in their peptide-binding grooves, so as to explain differences in their recognition specificities.
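The variability criterion above can be illustrated with a short sketch. It assumes Shannon entropy computed per alignment column and normalized by the logarithm of the 20-letter amino-acid alphabet so that values fall in [0, 1]; the text here does not state the logarithm base or normalization actually used, so that choice, the function names and the toy alignment are all assumptions made for illustration.

```python
import math
from collections import Counter


def column_entropy(column, alphabet_size=20):
    """Shannon entropy of one alignment column, normalized to [0, 1].

    The normalization by log(alphabet_size) is an assumption of this sketch.
    """
    counts = Counter(column)
    total = len(column)
    h = -sum((n / total) * math.log(n / total) for n in counts.values())
    return h / math.log(alphabet_size)


def variable_positions(aligned_pockets, threshold=0.3):
    """Return 0-based column indices whose entropy is >= threshold."""
    n_cols = len(aligned_pockets[0])
    if any(len(seq) != n_cols for seq in aligned_pockets):
        raise ValueError("all aligned pocket sequences must have equal length")
    variable = []
    for i in range(n_cols):
        column = [seq[i] for seq in aligned_pockets]
        if column_entropy(column) >= threshold:
            variable.append(i)
    return variable


# Hypothetical four-residue "pockets"; real pockets have 57 positions.
toy = ["MVRD", "MVHY", "MTRS", "MTHN"]
print(variable_positions(toy))
```

In this toy case only the last column, which holds four different residues, crosses the 0.3 cutoff; columns with one or two residue types stay below it.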
Grouping of the alleles based on binding site similarities is represented as a dendrogram and shown in Figure 2a. Seven distinct branches are observed, of which five are major with many alleles each and two are minor, containing very few alleles. The branching pattern implies that the alleles within a branch show high similarities to each other in terms of the composition and spatial disposition of residues in their binding pockets. To understand the extent of similarity within a cluster, we project each cluster as a network where the graph has HLA molecules as nodes and binding site similarities between them as edges. Only edges greater than a PM score of 0.9 (corresponding to ∼ 90% similarity) are shown (Figure 2b). The network clearly has distinct components corresponding to the branches in the dendrogram. Furthermore, strongly connected components are evident within each cluster as well. This suggests that although molecules within a cluster share strong similarities to each other, reasonably distinct subclusters exist in them, thus warranting further clustering in a hierarchical fashion.
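The network view described above can be sketched as follows: keep only edges whose similarity exceeds 0.9 and read off the connected components. This is a minimal illustration using a union-find structure; the allele names and similarity scores are invented, and the function name is hypothetical rather than part of PocketMatch or of the authors' pipeline.

```python
def connected_components(names, similarity, threshold=0.9):
    """Group nodes joined by edges whose similarity exceeds the threshold."""
    parent = {n: n for n in names}

    def find(x):
        # Follow parent links to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if similarity[(a, b)] > threshold:
                parent[find(a)] = find(b)

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical alleles and pairwise similarity scores (1 - PM distance).
names = ["allele_1", "allele_2", "allele_3", "allele_4"]
similarity = {
    ("allele_1", "allele_2"): 0.95, ("allele_1", "allele_3"): 0.70,
    ("allele_1", "allele_4"): 0.65, ("allele_2", "allele_3"): 0.72,
    ("allele_2", "allele_4"): 0.60, ("allele_3", "allele_4"): 0.93,
}
components = connected_components(names, similarity)
print(components)
```

Raising the threshold breaks large components into smaller ones, which is the intuition behind clustering the alleles at more than one level.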

Figure 1 A sequence logo representing a multiple sequence alignment of the extended pocket of all 2010 class-I HLA molecules. The alignment was carried out by superposing the structural models using MATT,24 from which sequences at equivalent positions are output as a sequence logo using Weblogo software.35 The height of the letter is proportional to the extent of conservation in the entire set of alleles. Residue numbers correspond to those in A*0201 allele (1A07). A full color version of this figure is available at the Immunology and Cell Biology journal online.


Figure 2 HLA classification based on binding site structures. (a) A dendrogram presenting first-level classification of HLA molecules. Distinct branches are seen, where the branch length is proportional to the distance between them. Distribution of predominant alleles in each cluster in the dendrogram is shown as pie charts. Pie charts are sized on the basis of the number of distinct alleles in the cluster. (b) Representation of binding site similarities among all pairs of HLA molecules as a network illustrating the need for multilevel clustering. Nodes represent HLA alleles and edges represent ⩾ 90% structural similarities in their binding sites. Distinct clusters are seen in this network as shown in the inner (big) circle. The outer (smaller) circles are enlarged views of some of the large clusters from the inner circle as shown, indicating they can be further subdivided. A full color version of this figure is available at the Immunology and Cell Biology journal online.

To achieve this systematically, we use the divisive method of clustering, where the depth of clustering is decided adaptively based on the data. Thus, a given cluster gets subdivided into further clusters until convergence. Figure 3 indicates the hierarchical classification obtained from the recursive clustering exercise. It is clear from the pattern that there are seven first-level supergroups labeled S1 to S7, of which the first five are major and the other two are minor clusters, based on the number of alleles contained in them. Further subclustering is very clearly seen in all but the S5 cluster. The depth of clustering, which indicates the number of subdivisions in a cluster, varies for different clusters with a minimum of 1 and a maximum of 5. First-level clusters are numbered and preceded by the letter ‘S’. Subsequent levels are labeled in lexical and numerical order at alternate levels. For example, the label S3.a.1 denotes subcluster 1 of subcluster a of supergroup S3. To render the classification more accessible to the community and to facilitate interactive querying, we have provided a web server available at http://proline.biochem.iisc.ernet.in/hlaclassify/. For any of the 2010 alleles, the precise classification can be obtained from this server. It must be noted that the clustering pattern is achieved solely from structural comparison of binding sites. To understand the biological relevance of the clustering pattern, we study two aspects: (1) consistency with the previously reported supertype classification of Sidney et al.17 and (2) ability to explain peptide specificities in a large comprehensive peptide library, discussed in the following section.
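The recursive subdivision and the alternating labels can be sketched in a few lines. This is not the authors' algorithm, whose details are given in the Methods: the sketch assumes that each level splits a cluster into connected components under a progressively tighter distance cutoff and stops when a cutoff no longer splits the cluster. The cutoffs, the toy distances and all function names are hypothetical.

```python
from string import ascii_lowercase


def components(members, dist, cutoff):
    """Connected components of the graph linking members with dist <= cutoff."""
    remaining, groups = list(members), []
    while remaining:
        stack, group = [remaining.pop(0)], []
        while stack:
            a = stack.pop()
            group.append(a)
            linked = [b for b in remaining if dist[frozenset((a, b))] <= cutoff]
            for b in linked:
                remaining.remove(b)
            stack.extend(linked)
        groups.append(sorted(group))
    return groups


def child_label(parent, depth, index):
    """S1, S2, ... at depth 0; then letters and numbers at alternate depths."""
    if depth == 0:
        return f"S{index + 1}"
    # Letters run out after 26 children; enough for this sketch.
    suffix = ascii_lowercase[index] if depth % 2 == 1 else str(index + 1)
    return f"{parent}.{suffix}"


def divide(members, dist, cutoffs, depth=0, parent=""):
    """Recursively split a cluster; return a flat {label: members} mapping."""
    if depth >= len(cutoffs):
        return {}
    groups = components(members, dist, cutoffs[depth])
    if depth > 0 and len(groups) == 1:
        return {}  # the cutoff no longer splits this cluster: stop here
    tree = {}
    for i, group in enumerate(groups):
        label = child_label(parent, depth, i)
        tree[label] = group
        tree.update(divide(group, dist, cutoffs, depth + 1, label))
    return tree


# Toy data: five hypothetical items placed on a line.
pos = {"p": 0.0, "q": 0.05, "r": 0.2, "s": 0.8, "t": 0.85}
dist = {frozenset((a, b)): abs(pos[a] - pos[b]) for a in pos for b in pos if a != b}
tree = divide(list(pos), dist, cutoffs=[0.3, 0.1])
print(tree)
```

Because each cluster decides for itself whether a tighter cutoff still splits it, the depth of the resulting tree varies from branch to branch, in the spirit of the adaptive depth described above.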

Comparison with earlier classification schemes

We present here a hierarchical classification scheme that is derived purely using three-dimensional structural information. To analyze how our scheme performs in comparison with other studies, we have carried out a systematic comparison against (1) the known supertypes of Sidney et al.17 and (2) MHCcluster11 distances. The clusters obtained from this study are compared with the previously reported supertypes. The result of this comparative analysis is shown in Table 1. Clusters S1 and S3 are seen to be in strong agreement with supertypes A24 and B58 respectively. The A02 and B62 supertypes appear to be predominantly but not totally in agreement with clusters S2 and S3 respectively. For other supertypes, however, correspondence is seen with more than one cluster. Viewed from the point of the supergroups obtained from our classification, S4 and S5 correspond to single supertypes whereas the others are seen to contain alleles classified under more than one supertype in the earlier classification. The S3 cluster, for example, consists of HLA molecules from six different supertypes. These members of S3 indeed show similar epitope specificities, indicating that the classification obtained here fares better in explaining the observed cross-reactivities. The second exercise pertains to the comparison of distances obtained from our classification to those obtained from MHCcluster for corresponding allele pairs. Based on the availability of data from the latter, 65 distinct HLA alleles leading to 2080 allele pairs could be tested. We observe a significant correlation (ρ≈0.75), showing an

Figure 3 Hierarchical clustering tree depicting the final multidepth classification. In the radial-tree layout shown, at the first level, seven supergroups are seen labeled S1 to S7. In subsequent depths, nodes that reflect subclusters are labeled with alternating lexical and numerical characters. At all levels, nodes that contain predominantly ‘A’ locus alleles are shown as squares, whereas the predominant ‘B’ and ‘C’ loci nodes are shown as filled and hollow circles respectively. An interactive version is made available as a web application, hlaclassify, http://proline.biochem.iisc.ernet.in/hlaclassify/.

agreement of ∼94%. A detailed list of those that do not agree as well as an additional figure are made available on the hlaclassify web server.

Clustering validity by analyzing commonality in peptide pools

Members of a cluster in principle should be capable of recognizing the same peptide. We therefore investigated what fraction of a cluster recognizes a given peptide. For this, we use a comprehensive data set of 23 858 epitopes that we predicted from the large combinatorial library. Given that one peptide can be recognized by multiple alleles and one allele can bind multiple peptides, the extent of cross-reactivity among peptides and alleles is very high. In all, 23 858 peptides make 1 117 345 strong binding interactions with the set of alleles. Thus, comparing commonality in peptide pools is not a simple task. To address this, we measure similarity between all pairs of HLA molecules by considering the ratio of the number of peptides they share to the total number of peptides in both. The similarity is then projected as a distance measure. Figure 4a illustrates a correlation plot between distances obtained by analyzing commonalities in the peptide pool with those obtained from site structural comparison. A clear trend is seen from the plot, indicating that as the distance between two structures increases, commonality in their peptide pools decreases. Agreement is seen at different levels of hierarchy. We project the data as a network where HLA alleles form the nodes and edges between pairs of them indicate the extent of commonality in their respective peptide pools. Upon visualization of the network as shown in Figure 4b, the presence of different groups becomes immediately apparent. Different supergroups as described in the previous section are indicated in different shades of gray. Grouping of the HLA molecules

Table 1 Confusion matrix between supertype classification of human leukocyte antigen (HLA) alleles17 and the first-level supergroups obtained from this study Supertypes

Supergroups S1

A01

S2

S3

S4

S5

21

A02 A03

34

A24

24

S6 18

48 2

3 4

B07 B08

74 3

B27 B44

44 47

B58 B62

11 44

13 1 13

3 5 4

Numbers represent the number of alleles that are in agreement between the two classification schemes. Any given row represents the total number of alleles in the corresponding supertype,16 whereas any column reflects the total number of alleles in the supergroup, from the same set of alleles from this work.

based on similarity in their epitopes is thus seen clearly, consistent with the structure-based clustering. To analyze this further in a systematic manner, a clustering exercise based on distances between peptide pools was carried out using the same multidepth divisive approach, and the clustering pattern obtained was compared with that described in Figure 3. It is clear that there is high agreement between the two clustering patterns. With increasing depth of the cluster hierarchy, agreement between clusters obtained from both methods improves. This means that subclusters in Figure 3 have similar peptide specificities and thus validates the clustering pattern. Thus, classification obtained purely from binding site structures is capable of explaining binding specificities of peptide pools.

Cross-reactivity between clusters

Despite obtaining well-segregated clusters, given the pleiotropic nature of HLA–peptide binding, it is inevitable that some extent of overlap in the epitope pools or cross-reactivity is seen among clusters. At initial levels, low to moderate extents of overlap are seen, reflecting that some peptides but not many in the total pool are common for a given pair of clusters. However, as the depth of clustering increases, clusters become better defined. As a result, the individual groups at the full depth of clustering do not show any significant cross-reactivity with other groups at that level. This is also evident from Figure 4a, which shows only a few dark cells away from the diagonal in the correlation plot. In the first level, within-group commonality in the peptide pools is substantial, whereas across groups the commonality is almost negligible. The S6 supergroup in particular is an exception: it contains alleles from all three loci and is seen to be the most promiscuous, sharing some overlap with peptide pools from other supergroups.
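The peptide-pool comparison used for this validation can be sketched as a set computation. The sketch reads "the ratio of the number of peptides they share to the total number of peptides in both" as the Jaccard index (shared peptides over the union of the two pools) and takes the distance as one minus that ratio; both readings are assumptions, and the allele names and peptide identifiers are placeholders.

```python
from itertools import combinations


def pool_similarity(pool_a, pool_b):
    """Shared peptides divided by all distinct peptides in the two pools."""
    union = pool_a | pool_b
    return len(pool_a & pool_b) / len(union) if union else 0.0


def pool_distances(pools):
    """Distance (1 - similarity) for every pair of alleles, keyed by sorted names."""
    return {
        (a, b): 1.0 - pool_similarity(pools[a], pools[b])
        for a, b in combinations(sorted(pools), 2)
    }


# Hypothetical predicted binder pools for three alleles.
pools = {
    "allele_1": {"pep_001", "pep_002", "pep_003"},
    "allele_2": {"pep_002", "pep_003", "pep_004"},
    "allele_3": {"pep_004"},
}
distances = pool_distances(pools)
for pair, d in distances.items():
    print(pair, round(d, 3))
```

A matrix of such distances can be fed to the same divisive procedure as the structure-based distances, which is what makes the two clusterings directly comparable.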

Figure 4 Comparison of supergroups of Figure 3 with epitope-based classification. (a) A correlation plot illustrating extent of commonality in peptide pools for all-pair combinations of the final 72 groups. A gray-scale gradient is used where darker shades indicate strong overlap between the epitope pools. (b) A network representation of HLA alleles to reflect consistency with peptide-based classification. Each node in the network represents a specific HLA molecule; nodes are sized proportional to size of their epitope pools (see Methods). An edge is drawn between two nodes if there is an overlap between their peptide pools. At least 10% of identical peptides in their pools are necessary to qualify for an overlap. The edges are weighted based on the extent of overlap in their peptide pools. The nodes are indicated in shades of gray representing the supergroups obtained from the structure-based clustering (S1 to S7 in Figure 3). The network is visualized using Cytoscape36. A spring-embedded layout, based on a ‘force-directed’ paradigm,37 implemented in Cytoscape is used, showing clustering of the seven supergroups. A full color version of this figure is available at the Immunology and Cell Biology journal online.


Site and epitope signatures of the clusters

Earlier reports on HLA classification in literature indicate that alleles of the three loci fall into separate groups. This is indeed the general trend from our analysis as well. There are, however, a few exceptions that perhaps serve to explain similarities in peptide specificities better. At the first depth of clustering, clusters S1 and S2 are seen to predominantly contain A locus alleles, whereas clusters S3 to S5 contain B locus alleles and the minor S7 cluster has C locus alleles. S6 contains mainly C alleles, but also has alleles from A and B loci that, at further depths of clustering, separate out into distinct subclusters (additional information available at http://proline.biochem.iisc.ernet.in/hlaclassify). The distance between subclusters S6.a and S6.b, containing A and B loci alleles respectively, is quite low, consistent with them being subclusters of a common cluster.

In order to study whether specific sequence signatures can be derived for different clusters, we analyzed the multiple structure-based sequence alignment of the alleles. Figure 5 illustrates the alignment in the form of a sequence logo. Of the 57 site residues, 24 are totally conserved and 33 are significantly variable, of which 21 are found to be hypervariable, giving rise to differences in peptide-binding abilities. Similarly constructed sequence logos for members within a cluster should indicate much higher conservation of site residues. Indeed, this is what we observe. Sequence logos for all seven clusters are shown in Figure 5. Those of their corresponding peptide pools are also shown in the figure, illustrating (1) specific sequence signatures of the site residues within that cluster and (2) the most preferred residues, which typically turn out to be in the anchor positions in the pool of peptides that the alleles in the cluster recognize. As evident from Figure 5, maximum variation in the sites is seen mainly at residues 45, 67, 114 and 116. Cluster S1, containing alleles of the A locus, predominantly has residues M, V, R and D at these positions, whereas cluster S2 has residues M, V, H and Y. Similarly, clear preferences are seen for other clusters as well, as summarized in Figure 5. It is clear that different combinations of residues at positions 9, 24, 62, 69, 70, 71, 74, 97 and 163 make up different classes of sites. Corresponding epitope signatures for all alleles are made available on the ‘hlaclassify’ web server.

Unambiguous clustering and subgroup assignment was seen for 1470 out of 2010 HLA alleles. At the first level, only 3, all belonging to the HLA-C class, are not assigned a group. At the subsequent levels, there are a few that do not fit well into any of the second- or third-level groups despite emerging from the same first-level clusters. As they are too few in each case to become separate subgroups, they are marked as ‘x’ subgroups and treated as outliers. Upon structural analysis, we observe that these outliers in general contain variations in positions that are predominantly conserved in all other alleles.

Capturing binding sites as feature vectors

Structure comparison of binding sites performed using PM allows us to find overall similarities in terms of geometry and chemical properties but does not pinpoint the precise property that differentiates a given pair of binding sites. It is therefore useful to represent binding sites using different physicochemical features. AAindex is a useful repository that lists ∼544 indices for all 20 amino acids. Each residue at the site can therefore be represented by a set of features. Major advantages of using features are that (1) data can be represented in multiple dimensions, each dimension capturing a specific physicochemical property, (2) feature vectors, being alignment invariant, allow sites to be compared fairly efficiently and (3) more importantly, they readily provide insights about the physicochemical basis for classification.
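A feature-vector representation of this kind can be illustrated with a single property. The sketch below uses the Kyte-Doolittle hydropathy scale as a stand-in for one physicochemical index (the features actually selected in the study are described in the Methods), rescales it linearly to the range −1 to +1, and encodes the residues that clusters S1 and S2 predominantly carry at positions 45, 67, 114 and 116. The function names are hypothetical.

```python
# Kyte-Doolittle hydropathy values, used here as one example property.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def scale(table):
    """Rescale a property table linearly so that values span [-1, +1]."""
    lo, hi = min(table.values()), max(table.values())
    return {aa: 2 * (v - lo) / (hi - lo) - 1 for aa, v in table.items()}


def site_vector(residues, scaled):
    """Encode a string of site residues as a list of property values."""
    return [scaled[r] for r in residues]


scaled = scale(KYTE_DOOLITTLE)
POSITIONS = (45, 67, 114, 116)
s1 = site_vector("MVRD", scaled)  # predominant residues in cluster S1
s2 = site_vector("MVHY", scaled)  # predominant residues in cluster S2

# Positions at which the two clusters differ in this one property.
differing = [p for p, a, b in zip(POSITIONS, s1, s2) if a != b]
print(differing)
```

With several properties stacked per residue, the same comparison indicates not only where two sites differ but in which physicochemical dimension.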

Identification of residues and features that impart binding specificity

As many of the physicochemical properties are correlated with each other, an analysis was carried out to identify independent feature combinations sufficient to describe a site, as described in the Methods section. Five feature classes were identified as independent. Each HLA residue is described in terms of these five feature classes. From the large combinatorial peptide library generated as described in the Methods section, 1.1 million peptide–HLA associations were predicted. From these association data, mutual information between putative peptide ligands and HLA-binding site residues was computed, such that (21 × 5 features) × 9 values are obtained for the associations among 21 HLA site residues and 9 positions of the peptide ligands. Each value thus obtained is then analyzed to identify whether a specific feature stands out as important for that association, either in a positive or in a negative sense. Figure 6 shows the information association of physicochemical properties of the residues in the HLA-binding site with the residue positions in the peptide ligands. The analysis shows that the second and ninth positions in the peptide ligands have the maximum biases for binding. These positions are well known to serve as ‘anchor residues’ from numerous experimental studies in literature.18 Amino-acid selection at these positions is skewed in the peptide ligands, as indicated in the form of individual histograms in the first panel of Figure 6. The contribution from other residues, 3 and 8 in particular, although smaller than that of the two anchors, is shown to be significant. The top 15 associations as indicated by mutual information are shown in Figure 6. Of the 21 variable site residue positions, 10 show strong biases or, in other words, contribute toward binding specificity. These are positions 9, 45, 62, 67, 70, 97, 114, 116, 156 and 163.
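The quantity behind this analysis can be sketched for one site feature and one peptide position. The count of values follows from the text: (21 × 5) × 9 = 945 feature-by-position associations. The sketch below computes mutual information, in bits, between two discrete variables from paired observations; discretizing a site feature into classes such as "polar" and "nonpolar" is an assumption made for illustration, and the observations are invented.

```python
import math
from collections import Counter


def mutual_information(pairs):
    """Mutual information (bits) between two discrete variables.

    `pairs` is a list of (x, y) observations, e.g. (site feature class,
    amino acid at one peptide position) for each peptide-HLA association.
    """
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(x for x, _ in pairs)
    right = Counter(y for _, y in pairs)
    return sum(
        (c / n) * math.log2(c * n / (left[x] * right[y]))
        for (x, y), c in joint.items()
    )


# Hypothetical associations: the site class fully determines the peptide residue.
dependent = [("polar", "N"), ("polar", "N"), ("nonpolar", "L"), ("nonpolar", "L")]
# Hypothetical associations: the site class tells nothing about the peptide residue.
independent = [("polar", "N"), ("polar", "L"), ("nonpolar", "N"), ("nonpolar", "L")]

mi_dependent = mutual_information(dependent)
mi_independent = mutual_information(independent)
print(mi_dependent, mi_independent)
```

High values flag site positions whose properties constrain the choice of residue at a given peptide position, which is how anchor-related site residues stand out.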
Residue 9, for example, which is almost at the center of the binding groove, affects the ligand specificity primarily because of its extent of hydrophobicity. In another example, residue 62 was observed to influence the choice of residues for the peptide at positions 3 and 5. The flexibility and polarity of this residue affect the binding in a negative and positive manner respectively. This means that the presence of a polar residue such as [EQR] at position 62 will favor binding to peptides with N and T residues at positions 3 and 5, whereas the presence of nonpolar residues at that position will prevent binding of many peptides. Of these 10 HLA positions, 7 can be associated with specific subgroups from the classification obtained in this study, listed in Table 2.

Rationalizing HLA classification in terms of features

Given the extent of information accumulated in this study on HLA alleles, their peptides and various physicochemical properties, it should in principle be possible to rationalize the hierarchical clustering pattern of HLA alleles obtained in this study. Therefore, an independent exercise was carried out to identify discriminatory residues and features that explain the observed classification. Weights were assigned to the features and the maximally discriminatory feature set was identified as described in the Methods. Different residues contribute differently at each level of this hierarchical classification. At the first level, the distinction between the seven classes is mainly because of variations at positions 9, 24, 62, 69, 70, 71, 74, 97 and 163 in the HLA molecule (Figure 7). Those at positions 9, 62 and 97 affect peptide-binding specificities. In subsequent levels, discriminatory signatures become more specific, with as few as one to two residues in some cases. Thus, for S1.b, [EQ]62[KN]66[EA]76[IT]80[DY]116 serves as a discriminatory signature, whereas for S6.e, the [RW]97[DRW]156 signature suffices to discriminate it from other members at the indicated level of hierarchy.
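Signatures written in this bracket notation are straightforward to check mechanically. The following sketch parses a signature into a mapping from residue position to allowed amino acids and tests whether a binding site satisfies it. The two signatures are taken from the text; the example sites and the function names are hypothetical.

```python
import re


def parse_signature(signature):
    """Turn '[EQ]62[KN]66' into {62: {'E', 'Q'}, 66: {'K', 'N'}}."""
    return {
        int(pos): set(allowed)
        for allowed, pos in re.findall(r"\[([A-Z]+)\](\d+)", signature)
    }


def matches(site, signature):
    """True if the site has an allowed residue at every signature position."""
    return all(site.get(pos) in allowed for pos, allowed in signature.items())


sig = parse_signature("[EQ]62[KN]66[EA]76[IT]80[DY]116")  # S1.b, from the text

# Hypothetical sites given as {residue position: amino acid}.
site_a = {62: "E", 66: "K", 76: "A", 80: "T", 116: "D"}
site_b = {62: "E", 66: "K", 76: "A", 80: "T", 116: "S"}

print(matches(site_a, sig), matches(site_b, sig))
```

A position missing from the site is treated as a mismatch, so partial sites are never reported as satisfying a signature.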


Figure 5 Site and corresponding epitope signatures at the supertype level. For clusters S1 to S7 from the current classification, sequence logos indicating the sequence profiles of the 30 most variable residues among members of the indicated cluster are shown. Corresponding peptide sequence profiles from the large pool of epitopes for all members of that cluster are shown as epitope signatures. Clear preferences for residue types are seen in most cases. The height of each letter indicates the percentage of members in that group having that residue at that position. A full color version of this figure is available at the Immunology and Cell Biology journal online.

DISCUSSION In this study, we analyze 2010 HLA alleles by structural comparison of their binding sites, identify similarities among them and thus obtain a hierarchical classification of these alleles. The analysis brings out very clearly that all known HLA molecules group into merely seven broad supergroups, of which five are major groups with more than a hundred alleles in each. All member alleles in a subcluster or type are potentially capable of recognizing the same set of epitopes. This analysis would have been incomplete without considering all alleles, as the available crystal structures represent only a portion of the binding site space of HLA molecules. Hence, the study of structural models of all alleles was also included. Unique features of this work are: (1) presenting a new classification scheme covering all known HLA class-I alleles based on structural-level detail that explains both one-to-many and many-to-one interactions between HLA molecules and their peptide ligands; (2) extraction of discriminating features for each type at atomic-level detail, providing a mechanistic basis for classification; (3) obtaining HLA site signatures and corresponding epitope signatures in the form of sequence logos and probability matrices; (4) extensive validation by comparing with a large library of experimentally known epitopes for 48 different alleles and then extending to all high-confidence epitope predictions for 2010 alleles, besides comparisons with an earlier classification; and (5) rationalization of peptide recognition properties within clusters and understanding the basis for cross-presentation between clusters. This work presents a significant advance over previous reports of HLA classification, not only because of the superior resolution of the clusters but also because classification is no longer a black box, as it is based on binding site structures and further supported by identification of specific discriminating features.

The idea of comparing structures of binding pockets has been explored before.19,20 However, these methods have considered either structure-derived sequence motifs6 or structure-derived physicochemical descriptors of the pockets, and have typically studied a much smaller number of alleles. Although these capture the broad essence of the structures of the binding sites, they can be approximate at best and do not have the power to resolve between closely related yet different sites. This is because only selected features of the sites are preidentified and compared, presumably because of the difficulty in carrying out atomic-level structural comparisons. In recent years, our group as well as others have developed algorithms to carry out such large-scale comparisons of structures of binding pockets. PM, for example, has been useful to analyze the pocketome at a genome scale and obtain a variety of insights ranging from function annotation to new drug discovery.21 PM has the advantage of being extremely efficient and sensitive to small changes either in the geometric disposition of the residues or in their chemical properties, and has been extensively validated.15 Here we use this algorithm to compare binding pockets in class-I HLA molecules, first for the set of alleles with known crystal structures and subsequently for all known HLA alleles. For the latter, molecular models of the remaining alleles were obtained. Structural models of HLA molecules obtained using a standard method and binding site identifications in them are both guided by the available crystal structures and their multiple structure alignments, making them highly reliable. In order to understand the physical basis for such classification, we project binding sites as feature vectors. Features are weighted to explain the classification obtained from site-structure comparisons.

Figure 6 Mutual information analysis between occurrence of HLA-binding site residues and peptide residues for detection of residues that influence peptide-binding specificities. For ∼1.1 million peptides predicted to bind strongly to one or more HLA alleles, mutual information is computed (please see text). Positional frequency of amino acids at the nine positions of the peptide is shown as histograms in the leftmost panel. The five physicochemical properties taken represent the five clusters obtained from the independent component analysis: average flexibility, residue volume, hydrophobicity, buriability and polarity, respectively. The rightmost panel shows the average property value at different HLA residues at the binding site. All properties are scaled linearly between −1 and +1. Mutual information is computed for each peptide–HLA pair association. Strong association is seen between residues 66, 67, 79, 71, 77 and 80 of HLA and residue positions 2, 8 and 9 of the peptide, represented in the middle panel as a bipartite network.


Table 2 Residues contributing significantly to peptide binding, based on mutual information computed for different peptide–HLA associations

HLA residue | Physicochemical property | Peptide sequence specificity | Groups
9 | Hydrophobicity | 4[FW] 5[AVW] 6[FP] 7[FPY] | Root label, S6, S6.d
45 | Residue volume | 2[MRS] | S3.a.1.b, S3.b.2, S3.c, S4, S6.b
62 | Flexibility | 3[N]4[I]5[T] | S1.a, S1.b, S1, S3.a.1.b
97 | Buriability | 5[LP]7[FQ] | Root label, S1.a.1, S3.b, S6.c.3, S6.d
114 | Polarity | 4[MP]5[IN] | S3.b.1.a
156 | Hydrophobicity | 6[KL] | S1.a.1.a, S1.b.1, S3.c.2, S6, S6.d
163 | Buriability | 1[GY] | S1.a.1, S3.d, S6, S6.c, S6.d

Abbreviation: HLA, human leukocyte antigen. Also listed are the HLA subgroups in which this property is found to be important. Position of the site residues, their corresponding physicochemical properties and the residue on the peptide ligand are listed.

Figure 7 Structural views of portions of HLA molecules containing the binding sites shown as cartoon representations. Starting from the root (R) at level 0, the classification divides into seven supertypes, of which only S1 and S6 are shown as examples. These supergroups are further subdivided into subclusters, of which two examples of each are shown and labeled. Binding site residues that stand out as most discriminatory between clusters are shown as sequence logos above each structural diagram, along with labeling of their sequence positions. In the cartoon diagrams of the structures, regions that maximally contribute to clustering are marked on the structure. The site sequence signatures are identified through an optimization exercise (please see Methods). A full color version of this figure is available at the Immunology and Cell Biology journal online.

Feature fingerprints are then derived for different classes. Features that constitute different sites are then interpreted in terms of the physicochemical properties they reflect. Signatures in the form of weighted sets of features are written out for each class of HLA molecules. Besides providing atomic-level explanations for understanding peptide-binding specificities and similarities among diverse alleles, the classification obtained here is useful for a number of applications. To begin with, it provides a platform to interpret numerous experimental results on HLA–peptide binding and allele profiling. Second, in peptide-based vaccine discovery, it is useful for estimating the potential antigenicity of a peptide and indeed of whole genomes of pathogens. Third, the classification can be expected to be useful for understanding disease susceptibilities as well. Another major application of this work could be in HLA compatibility testing during organ transplant, so as to ultimately reduce graft rejection rates. Finally, the grouping information of different HLA alleles can be readily used during selection and prioritization of peptides for use in epitope-based vaccine discovery.

METHODS Data resources Protein sequences for all 2010 class-I HLA alleles were obtained from the IMGT/HLA database, a comprehensive resource that is part of the IMGT project and is used extensively by the community, containing sequence information of HLA molecules along with WHO (World Health Organization) nomenclature designations.3 For 31 alleles, experimentally determined three-dimensional structures have been reported; these were obtained from the Protein Data Bank (PDB).22 Structural models for the other alleles were built using Modeller,23 with the closest sequence neighbor among the 31 alleles in PDB as the structural template (a list of all templates is available online at http://proline.biochem.iisc.ernet.in/hlaclassify). The set of ~14 500 distinct, experimentally mapped epitopes for different alleles was obtained from 'the immune epitope database 2.0' (IEDB).24

Structure and site comparisons Structure-based multiple sequence alignments using all residues of the α1 and α2 domains were performed on all available crystal structures using MATT.16 Binding sites were identified from the crystal structures of peptide–HLA complexes by considering all residues within 4.5 Å of atoms of the bound peptides. Depending upon the exact peptide, the number of residues in this zone can vary, making comparisons difficult. To overcome this problem, we use an extended definition of the binding site, as follows. First, a structure-based sequence alignment is obtained using MATT. The binding site residues in each structure are flagged. All flagged columns are included in the extended site of all structures, hereafter referred to as the 'binding pocket' or the 'binding site'. Thus, binding sites in different structures have the same number of residues. We have used PM15,21,25 to calculate the distances between pairs of binding sites. PM provides the similarity between two binding sites on a scale of 0–1, where 0 signifies no similarity between the binding sites and 1 signifies identical binding sites. We define the PM distance as 1 − PM score. Binding site trees are constructed from all-to-all PM distances and visualized using FigTree.26
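The extended binding site definition and the PM distance can be illustrated with a minimal Python sketch. It assumes the alignment is available as equal-length strings and that the flagged columns are given as 0-based alignment indices; the structure names, toy sequences and function names are hypothetical and are not taken from the study.

```python
from typing import Dict, List, Set


def extended_site_columns(flagged: Dict[str, Set[int]]) -> List[int]:
    """Union of alignment columns flagged as peptide-contacting in any structure."""
    columns: Set[int] = set()
    for cols in flagged.values():
        columns |= cols
    return sorted(columns)


def extract_site(aligned_seq: str, columns: List[int]) -> str:
    """Residues of one aligned sequence at the extended-site columns."""
    return "".join(aligned_seq[c] for c in columns)


def pm_distance(pm_score: float) -> float:
    """Distance defined as 1 - similarity, for a similarity score in [0, 1]."""
    if not 0.0 <= pm_score <= 1.0:
        raise ValueError("similarity score must lie between 0 and 1")
    return 1.0 - pm_score


# Hypothetical toy data: columns within the contact cutoff in two crystal structures.
flagged = {"structure_A": {1, 2}, "structure_B": {2, 4}}
# Hypothetical aligned sequences; model_C has no bound peptide of its own.
alignment = {"structure_A": "GYFMT", "structure_B": "GHFKT", "model_C": "GYSMS"}

columns = extended_site_columns(flagged)
sites = {name: extract_site(seq, columns) for name, seq in alignment.items()}
```

Because every sequence is read at the same set of columns, all sites in `sites` have the same length, which is the property the extended definition is meant to guarantee.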

Divisive clustering A divisive algorithm is used to estimate clusters. The igraph package is used for this purpose.27 This clustering method first considers all-to-all distance pairs, sorts them and progressively constructs neighbor topology graphs. Adding a distance pair generates one edge between two nodes in the graph. The algorithm adds distance pairs incrementally in ascending order until the graph becomes connected. This process adaptively identifies the distance threshold for deriving appropriate clusters. Once a single-component graph structure is achieved, graph community detection algorithms are employed to detect strongly connected components. For stable clustering, we use both the Walktrap algorithm28 and the fast-greedy algorithm29,30 and take the consensus of the two. In some cases, the graph remains disconnected even after half of the edges have been added. In that case, incremental addition is stopped and the graph is considered to have well-separated clusters. Each cluster can be further divided into several levels. Divisive clustering is performed recursively until the cluster size reduces to 10 or no further division is feasible.
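The adaptive thresholding step can be sketched in plain Python as follows. This is only an illustration of the edge-addition loop, assuming distances are supplied as a dictionary keyed by node pairs; it uses a union-find structure rather than igraph, and the function name and the `max_fraction` parameter are hypothetical. The community detection step that follows is not reproduced here.

```python
from typing import Dict, List, Optional, Tuple

Pair = Tuple[int, int]


def connectivity_threshold(
    n: int, dist: Dict[Pair, float], max_fraction: float = 0.5
) -> Tuple[List[Pair], Optional[float], bool]:
    """Add edges in ascending order of distance until the graph on n nodes is
    connected, or until max_fraction of all pairs has been added.

    Returns (edges added, largest distance added, whether the graph is connected).
    """
    parent = list(range(n))

    def find(i: int) -> int:
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    pairs = sorted(dist.items(), key=lambda item: item[1])
    limit = int(max_fraction * len(pairs))
    components = n
    edges: List[Pair] = []
    threshold: Optional[float] = None
    for k, ((i, j), d) in enumerate(pairs):
        if components == 1 or k >= limit:
            break
        edges.append((i, j))
        threshold = d
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
            components -= 1
    return edges, threshold, components == 1


# Toy example: four nodes that become connected after three edges.
toy = {(0, 1): 0.1, (2, 3): 0.15, (1, 2): 0.4, (0, 2): 0.5, (0, 3): 0.6, (1, 3): 0.7}
edges, threshold, connected = connectivity_threshold(4, toy)
```

When the function returns with `connected` set to `False`, the components of the partial graph correspond to the well-separated clusters described above.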

Feature space representation of binding sites The binding sites are represented as feature vectors based on the amino-acid residues present in them. Knowledge-based indices capturing various physicochemical properties are available in a public resource, the AAindex server.31 Of the various indices listed in this resource (~544), 216 are relevant for our analysis. Of these, some features are heavily correlated with each other, and the feature set was therefore pruned to retain only independent features. To identify the independent features, a graph is constructed in which each node indicates a distinct property index. An edge is drawn between two nodes only if there is very low correlation between the pair of indices ($-0.05 < \rho < 0.05$), where $\rho$ represents Spearman's rank correlation. Maximal complete subgraphs of this graph therefore identify groups of descriptors that are completely independent of each other. The computation is carried out using the igraph package implemented in R.27 For feature weighting, the distance between a pair of HLA molecules in the feature vector representation is given as $(X_1 - X_2)^{T}\,\mathrm{Diag}(W)\,(X_1 - X_2)$, where $X_1$ and $X_2$ signify feature vectors and $W$ is the weighting vector. The process of weight determination is framed as an optimization problem. The silhouette index of a cluster is a good measure for obtaining well-separated cluster assignments; it ranges from −1 to 1, where 1 signifies an accurate and unambiguous cluster assignment. Weights assigned to features can influence the final evaluation of a silhouette index. To obtain the weights so as to maximize cluster separation, we take the average silhouette index over all data points as our objective function. This optimization framework does not by default restrict the assignment of high weights to features that do not contribute significantly to clustering; this is overcome by parameter regularization, which is achieved by minimizing the norm of the weight vector along with maximization of the silhouette index. The final objective function, maximized over the weights, thus combines the scaled average silhouette index, $\lambda \cdot \operatorname{avg}(\operatorname{Sil}(X, W))$, with a regularization term on the weight vector. We have considered the λ value as 100.

We have used the NLopt library to carry out the optimization. A derivative-free principal axis method is used for optimal weight determination. This process yields appropriate feature weighting resulting in the desired partitioning. Features important for resolving clusters therefore automatically attain higher weights.
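To make the role of the weights concrete, the following Python sketch evaluates the weighted distance and the average silhouette index for a given weight vector. It is a minimal illustration under stated assumptions: the data, labels and function names are hypothetical, the quadratic form is used directly as the dissimilarity, and neither the regularization term nor the NLopt optimization is reproduced.

```python
from typing import Dict, List, Sequence


def weighted_distance(x1: Sequence[float], x2: Sequence[float], w: Sequence[float]) -> float:
    """Quadratic form (x1 - x2)^T Diag(w) (x1 - x2)."""
    return sum(wi * (a - b) ** 2 for a, b, wi in zip(x1, x2, w))


def average_silhouette(
    X: Sequence[Sequence[float]], labels: Sequence[int], w: Sequence[float]
) -> float:
    """Mean silhouette value over all points under the weighted distance."""
    n = len(X)
    scores: List[float] = []
    for i in range(n):
        by_cluster: Dict[int, List[float]] = {}
        for j in range(n):
            if j != i:
                by_cluster.setdefault(labels[j], []).append(
                    weighted_distance(X[i], X[j], w)
                )
        own = by_cluster.pop(labels[i], None)
        if not own or not by_cluster:
            # Singleton cluster, or only one cluster in total: score 0 by convention.
            scores.append(0.0)
            continue
        a = sum(own) / len(own)  # mean distance to the point's own cluster
        b = min(sum(d) / len(d) for d in by_cluster.values())  # nearest other cluster
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / n


# Hypothetical toy data: feature 0 separates the two clusters, feature 1 is noise.
X = [[0.0, 5.0], [0.1, -5.0], [1.0, 5.0], [1.1, -5.0]]
labels = [0, 0, 1, 1]
informative = average_silhouette(X, labels, [1.0, 0.0])
noisy = average_silhouette(X, labels, [0.0, 1.0])
```

In this toy case, putting all weight on the discriminating feature gives a silhouette close to 1, whereas weighting only the noise feature gives a negative value, which is the behavior an optimizer over the weights would exploit.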

Generating a peptide library and epitope detection We generate an unbiased exhaustive data set of epitopes for different alleles by using an orthogonal testing strategy. Using D-optimal design,32 we generate an exhaustive set of peptides such that, at any three positions, all possible permutations of the 20 amino acids occur. This process leads to the generation of a peptide library containing ~74 000 unique peptides. From this library, possible epitopes for different alleles are then detected. Most HLA epitope prediction algorithms are restricted to the study of those alleles for which some experimentally characterized peptides are available. Thus, prediction of binding peptides is only possible for ~89 alleles. However, the prediction tool NetMHCpan is capable of predicting peptide binding for a much larger class of alleles by using HLA sequence profiles.9 In this study, we use NetMHCpan to predict binder peptides for all 2010 HLA alleles studied here. Peptides with a predicted binding affinity of <50 nM were considered to be strong binders33 and used in this study. This process yielded 23 858 unique epitopes for 1830 alleles, of which 6306 epitopes are in the experimentally known epitope list obtained from IEDB.
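The coverage property required of the peptide library can be checked with a short Python sketch. This does not implement D-optimal design; it only tests whether a given library contains every combination of residues at every choice of three positions. The function name is hypothetical, and the library is assumed to contain only fixed-length peptides over the given alphabet.

```python
from itertools import combinations, product
from typing import Iterable


def covers_all_triples(peptides: Iterable[str], alphabet: str, length: int) -> bool:
    """True if, for every choice of three positions, every residue triple occurs."""
    peptides = list(peptides)
    required = len(alphabet) ** 3
    for positions in combinations(range(length), 3):
        seen = {tuple(p[i] for i in positions) for p in peptides}
        if len(seen) < required:
            return False
    return True


# Toy check with a two-letter alphabet; the real case would use 20 residues
# and nine peptide positions.
full_library = ["".join(p) for p in product("AB", repeat=4)]
sparse_library = ["AAAA", "BBBB"]
```

For nine positions and 20 residues, each of the 84 position triples must show all 8000 residue combinations, so any library satisfying the property needs at least 8000 peptides.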

Grouping HLA molecules based on their cognate peptide pools As each allele can recognize many epitopes, it is meaningful to compare the entire set of peptides for different alleles. The set of all high-affinity peptides for a given allele is considered as its epitope pool. An all-to-all comparison of the epitope pools was carried out and similarities were described by using a set-theoretic distance measure, $1 - \frac{|H_1 \cap H_2|}{\min(|H_1|, |H_2|)}$, where $H_1$ and $H_2$ are the epitope pools of the two alleles. With this metric, the distance between two sets is 0 when one set is a subset of the other. All-pair distances were then used to group HLA molecules into clusters, using the divisive clustering methodology described in the earlier section.
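The epitope-pool distance can be sketched in a few lines of Python, reading the measure as one minus the size of the intersection divided by the size of the smaller pool, which is consistent with the statement that the distance is 0 when one pool is a subset of the other. The function name and the example peptides are hypothetical.

```python
from typing import Set


def pool_distance(h1: Set[str], h2: Set[str]) -> float:
    """1 - |H1 ∩ H2| / min(|H1|, |H2|) for two non-empty epitope pools."""
    if not h1 or not h2:
        raise ValueError("both epitope pools must be non-empty")
    return 1.0 - len(h1 & h2) / min(len(h1), len(h2))


# Hypothetical 9-mer pools for three alleles.
pool_a = {"SLYNTVATL", "GILGFVFTL", "NLVPMVATV"}
pool_b = {"SLYNTVATL", "GILGFVFTL"}          # subset of pool_a
pool_c = {"KRWIILGLN", "GILGFVFTL", "RPPIFIRRL", "TPRVTGGGA"}
```

Here `pool_distance(pool_a, pool_b)` is 0 because `pool_b` is contained in `pool_a`, while two pools with no peptide in common would be at distance 1.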

Cluster comparison The previous set of steps provides two types of clustering, the first based on structural similarity of the binding sites of HLA molecules and the second based on similarity in the set of peptides that they recognize. Divisive clustering yields a hierarchical tree structure that partitions the whole HLA set into smaller subsets recursively. In order to find the similarity between any pair of subtrees, we use a set-theoretic distance measure, $1 - \frac{|T_1 \cap T_2|}{\min(|T_1|, |T_2|)}$, where $T_1$ is the subtree from the first classification and $T_2$ is the subtree from the second classification. Consistency or accordance between the two classification schemes is then computed and used for defining end points of partitioning.

Detection of discriminatory site residues using mutual information Next, we identified the set of residues in each HLA molecule that maximally contribute to binding the pool of peptides, by computing mutual information and entropy.34 By comparing such residues in different HLA alleles, we identify discriminatory residues that contribute to the observed peptide specificities. As described in the earlier section, we generate ∼1.5 million pairs of peptide–HLA associations. If a particular HLA residue causes a bias in the selection of an epitope residue at a particular position, a strong influence will be seen in their occurrences, captured using the mutual information content $\sum_{x}\sum_{y} p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)}$, where $x$ denotes the occurrence of a particular amino acid at a particular position in the peptide ligand and $y$ denotes the occurrence of a particular amino acid at a specific HLA-binding site. In addition, to understand the physicochemical basis for binding, we represent the HLA-binding site residues using distinct physicochemical properties. This analysis allows identification of prime residues at the HLA-binding site and their physicochemical properties that induce peptide ligand specificities.
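The mutual information between one peptide position and one HLA site position can be estimated from observed residue pairs, as in the following Python sketch. The probabilities are taken as empirical frequencies, the logarithm base (2, giving bits) is an assumption because the text does not state it, and the function name and toy observations are hypothetical.

```python
from collections import Counter
from math import log2
from typing import List, Tuple


def mutual_information(pairs: List[Tuple[str, str]]) -> float:
    """Sum over (x, y) of p(x, y) * log2(p(x, y) / (p(x) * p(y))).

    Each pair is (peptide residue at one position, HLA residue at one site
    position) for one peptide-HLA association.
    """
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    mi = 0.0
    for (x, y), count in joint.items():
        pxy = count / n
        mi += pxy * log2(pxy / ((px[x] / n) * (py[y] / n)))
    return mi


# Hypothetical observations: the peptide residue is fully determined by the
# HLA residue in the first list and independent of it in the second.
coupled = [("N", "E"), ("N", "E"), ("L", "V"), ("L", "V")]
independent = [("N", "E"), ("N", "V"), ("L", "E"), ("L", "V")]
```

With these toy inputs the coupled list carries one bit of mutual information and the independent list carries none, which mirrors how a strongly biasing site residue would stand out in the analysis.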

Web server Interactive querying from the perspective of different alleles as well as different epitopes has been facilitated through a web server, made publicly available at http://proline.biochem.iisc.ernet.in/hlaclassify. The web server has been implemented using the LAMP (Linux Apache MySQL PHP) architecture. For interactive querying, figures are generated using jQuery (http://jquery.com/).

CONFLICT OF INTEREST The authors declare no conflict of interest.

ACKNOWLEDGEMENTS We thank Department of Biotechnology, Government of India, for financial support. We also acknowledge travel support from UK-India Education and Research Initiative (UKIERI). Author contributions: SM is a graduate student who carried out this work under the guidance of his advisor NC and JW. SM and NC wrote the paper. All three have approved the final manuscript.

1 Murphy KP, Travers P, Walport M, Janeway C. Janeway's Immunobiology (7th ed). Garland Science, New York, 2008.
2 Neefjes J, Jongsma MLM, Paul P, Bakke O. Towards a systems understanding of MHC class I and MHC class II antigen presentation. Nat Rev Immunol 2011; 11: 823–836.
3 Robinson J, Halliwell JA, McWilliam H, Lopez R, Parham P, Marsh SGE. The IMGT/HLA database. Nucleic Acids Res 2013; 41: D1222–D1227.
4 Tan J, Tang X, Xie T. Comparison of HLA class I typing by serology with DNA typing in a Chinese population. Transplant Proc 2000; 32: 1859–1861.
5 Hoppe B, Salama A. Sequencing-based typing of HLA. Methods Mol Med 2007; 134: 71–80.
6 Sette A, Sidney J. Nine major HLA class I supertypes account for the vast preponderance of HLA-A and -B polymorphism. Immunogenetics 1999; 50: 201–212.
7 Parker KC, Bednarek MA, Coligan JE. Scheme for ranking potential HLA-A2 binding peptides based on independent binding of individual peptide side-chains. J Immunol 1994; 152: 163–175.
8 Rammensee H, Bachmann J, Emmerich NP, Bachor OA, Stevanović S. SYFPEITHI: database for MHC ligands and peptide motifs. Immunogenetics 1999; 50: 213–219.
9 Hoof I, Peters B, Sidney J, Pedersen LE, Sette A, Lund O et al. NetMHCpan, a method for MHC class I binding prediction beyond humans. Immunogenetics 2009; 61: 1–13.
10 Hertz T, Yanover C. Identifying HLA supertypes by learning distance functions. Bioinformatics 2007; 23: e148–e155.
11 Thomsen M, Lundegaard C, Buus S, Lund O, Nielsen M. MHCcluster, a method for functional clustering of MHC molecules. Immunogenetics 2013; 65: 655–665.
12 Luscombe NM, Babu MM, Yu H, Snyder M, Teichmann SA, Gerstein M. Genomic analysis of regulatory network dynamics reveals large topological changes. Nature 2004; 431: 308–312.
13 Yu D, Kim M, Xiao G, Hwang TH. Review of biological network data and its applications. Genomics Inform 2013; 11: 200–210.
14 Steinbach M, Ertoz L, Kumar V (eds). The challenges of clustering high-dimensional data. In: New Vistas in Statistical Physics: Applications in Econophysics, Bioinformatics, and Pattern Recognition. Springer-Verlag, Berlin/New York, 2003.
15 Yeturu K, Chandra N. PocketMatch: a new algorithm to compare binding sites in protein structures. BMC Bioinformatics 2008; 9: 543.
16 Menke M, Berger B, Cowen L. Matt: local flexibility aids protein multiple structure alignment. PLoS Comput Biol 2008; 4: e10.
17 Sidney J, Peters B, Frahm N, Brander C, Sette A. HLA class I supertypes: a revised and updated classification. BMC Immunol 2008; 9: 1.
18 Fruci D, Rovero P, Falasca G, Chersi A, Sorrentino R, Butler R et al. Anchor residue motifs of HLA class-I-binding peptides analyzed by the direct binding of synthetic peptides to HLA class I alpha chains. Hum Immunol 1993; 38: 187–192.
19 Doytchinova IA, Guan P, Flower DR. Identifiying human MHC supertypes using bioinformatic methods. J Immunol 2004; 172: 4314–4323.
20 Harjanto S, Ng LFP, Tong JC. Clustering HLA class I superfamilies using structural interaction patterns. PLoS ONE 2014; 9: e86655.
21 Anand P, Yeturu K, Chandra N. PocketAnnotate: towards site-based function annotation. Nucleic Acids Res 2012; 40: W400–W408.
22 Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H et al. The Protein Data Bank. Nucleic Acids Res 2000; 28: 235–242.
23 Sali A, Blundell TL. Comparative protein modelling by satisfaction of spatial restraints. J Mol Biol 1993; 234: 779–815.
24 Vita R, Zarebski L, Greenbaum JA, Emami H, Hoof I, Salimi N et al. The immune epitope database 2.0. Nucleic Acids Res 2010; 38: D854–D862.
25 Nagarajan D, Chandra N (eds). PocketMatch (version 2.0): A parallel algorithm for the detection of structural similarities between protein ligand binding-sites. Parallel Computing Technologies (PARCOMPTECH), 2013 National Conference on February 2013.
26 FigTree-graphical viewer of phylogenetic trees. http://tree.bio.ed.ac.uk/software/figtree/.
27 Csardi G, Nepusz T. The igraph software package for complex network research. InterJournal, Complex Systems 1695. 2006.
28 Pons P, Latapy M. Computing communities in large networks using random walks. JGAA 2004; 10: 284–293.
29 Girvan M, Newman MEJ. Community structure in social and biological networks. Proc Natl Acad Sci USA 2002; 99: 7821–7826.
30 Newman MEJ. Fast algorithm for detecting community structure in networks. Phys Rev E 2004; 69: 066133.
31 Kawashima S, Ogata H, Kanehisa M. AAindex: amino acid index database. Nucleic Acids Res 1999; 27: 368–369.
32 Atkinson AC, Donev AN. Optimum Experimental Designs. Clarendon Press, Oxford, 1992.
33 Roomp K, Antes I, Lengauer T. Predicting MHC class I epitopes in large datasets. BMC Bioinformatics 2010; 11: 90.
34 Shannon CE. The mathematical theory of communication. 1963. MD Comput 1997; 14: 306–317.
35 Crooks GE, Hon G, Chandonia J-M, Brenner SE. WebLogo: a sequence logo generator. Genome Res 2004; 14: 1188–1190.
36 Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 2003; 13: 2498–2504.
37 Kamada T, Kawai S. An algorithm for drawing general undirected graphs. Inf Process Lett 1989; 31: 7–15.
