Human cytosolic sulfotransferase database mining: identification of seven novel genes and pseudogenes


The Pharmacogenomics Journal (2004) 4, 54–65 © 2004 Nature Publishing Group All rights reserved 1470-269X/04 $25.00 www.nature.com/tpj

ORIGINAL ARTICLE

Human cytosolic sulfotransferase database mining: identification of seven novel genes and pseudogenes

RR Freimuth1,4, M Wiepert2, CG Chute2, ED Wieben3, RM Weinshilboum1

1 Department of Molecular Pharmacology and Experimental Therapeutics, Mayo Graduate School-Mayo Clinic, Rochester, MN, USA; 2 Department of Health Sciences Research, Mayo Graduate School-Mayo Clinic, Rochester, MN, USA; 3 Department of Biochemistry and Molecular Biology, Mayo Graduate School-Mayo Clinic, Rochester, MN, USA

Correspondence: Dr R Weinshilboum, Department of Molecular Pharmacology and Experimental Therapeutics, Mayo Graduate School-Mayo Clinic, Rochester, MN 55905, USA. Tel: 507 284 2246 E-mail: [email protected]

ABSTRACT

A total of 10 SULT genes are presently known to be expressed in human tissues. We performed a comprehensive genome-wide search for novel SULT genes using two different but complementary approaches, and developed a novel graphical display to aid in the annotation of the hits. Seven novel human SULT genes were identified, five of which were predicted to be pseudogenes, including two processed pseudogenes and three pseudogenes that contained introns. Those five pseudogenes represent the first unambiguous SULT pseudogenes described in any species. Expression-profiling studies were conducted for one novel gene, SULT6B1, and a series of alternatively spliced transcripts were identified in the human testis. SULT6B1 was also present in chimpanzee and gorilla, differing at only seven encoded amino-acid residues among the three species. The results of these database mining studies will aid in studies of the regulation of these SULT genes, provide insights into the evolution of this gene family in humans, and serve as a starting point for comparative genomic studies of SULT genes.

The Pharmacogenomics Journal (2004) 4, 54–65. doi:10.1038/sj.tpj.6500223 Published online 16 December 2003

Keywords: sulfotransferase; SULT; sulfation; gene; pseudogene; annotation; database mining

4 Current address: Department of Medicine, Division of Molecular Oncology, Washington University School of Medicine, St Louis, MO 63110, USA

Received: 28 May 2003 Revised: 30 September 2003 Accepted: 13 October 2003 Published online 16 December 2003

INTRODUCTION

Sulfate conjugation is an important pathway in the biotransformation of many exogenous and endogenous compounds including drugs, other xenobiotics, neurotransmitters, and hormones.1,2 The cytosolic SULT enzymes that catalyze these reactions are members of a gene superfamily. Cytosolic SULTs are usually active as homodimers and share two highly conserved regions of the amino-acid sequence, termed Regions I and IV.1,3 These regions of sequence are important in binding the cosubstrate sulfate donor molecule 3′-phosphoadenosine 5′-phosphosulfate (PAPS).4–6 In addition, Region IV forms part of the dimerization interface region for these enzymes.7 SULTs can be classified into families whose members share at least 45% amino-acid sequence identity, and subfamilies that include enzymes that share at least 60% amino-acid sequence identity.3,8 It should be noted that the membrane-bound sulfotransferases, which are located in the Golgi and catalyze the sulfation of macromolecules, form a superfamily distinct from that of the cytosolic SULTs,9 and were not included in the present study.
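The family and subfamily cutoffs described above can be illustrated with a short sketch. Only the 45% and 60% thresholds come from the text; the function names and the simple gap-skipping identity calculation over pre-aligned sequences are assumptions made for illustration, not the procedure used by the nomenclature authors.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent identity over two pre-aligned, equal-length sequences.

    Alignment columns in which either sequence has a gap ('-') are skipped.
    This is an illustrative simplification, not a full alignment method.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for x, y in pairs if x == y)
    return 100.0 * matches / len(pairs)


def classify_relationship(identity: float) -> str:
    """Apply the 45% (family) and 60% (subfamily) identity cutoffs."""
    if identity >= 60.0:
        return "same subfamily"
    if identity >= 45.0:
        return "same family"
    return "different family"
```

Under these cutoffs, a pair of enzymes sharing 52% identity would fall in the same family but in different subfamilies.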

Novel human SULT gene identification RR Freimuth et al


In all, 10 cytosolic SULT genes are presently known to be expressed in human tissues: SULTs 1A1, 1A2, 1A3, 1B1, 1C1, 1C2, 1E1, 2A1, 2B1, and 4A1.3,10–15 While the cDNAs for some of those genes were cloned after purification and sequencing of the expressed protein, many were subsequently identified and cloned using homology-based techniques that took advantage of the highly conserved sequences in Regions I and IV. Eight of the human genes were characterized using PCR and DNA sequencing methods,12,16–21 but the genes for SULT1B1 and 4A1 were characterized by annotation of publicly available genomic sequence data generated by the Human Genome Project (unpublished observations22). All 10 of these genes share very similar gene structures, both with regard to the number and length of coding exons as well as the locations of splice junctions within the translated amino-acid sequence (Figure 1). In the present study, we took advantage of the genomic sequence data available in public databases, and set out to perform a comprehensive genome-wide search for novel human SULT genes and complete annotation of the known SULT gene clusters. We used two different but complementary approaches to search the public sequence database. The first approach utilized explicit SULT query sequences and BLAST23 searches, followed by the use of two different, more sensitive sequence alignment programs. The second approach used the HMMER24 and GeneWise25 programs, both of which used a hidden Markov model (HMM) profile26 as a query. In addition, we developed a novel graphical display to view the results of the searches, thereby improving the efficiency of the manual annotation step. A total of seven novel SULT genes were identified in addition to the 10 already known to be expressed in humans. Five of the seven were predicted to be pseudogenes based either on sequence mutations or exon rearrangement.

[Figure 1 image. Panel title: Human SULT Genes. Genes and overall lengths: SULT1A1 ~4.4 kb; SULT1A2 ~5.1 kb; SULT1A3 ~7.3 kb; SULT1B1 ~28 kb; SULT1C1 ~20.1 kb; SULT1C2 ~10.1 kb; SULT1E1 ~18 kb; SULT2A1 ~16 kb; SULT2B1 ~48 kb; SULT4A1 ~38 kb. Exon lengths are labeled in bp.]

Figure 1 Previously-characterized human SULT gene structures. Structures for the 10 SULT genes that are known to be expressed in humans and that have been reported previously are illustrated. Black boxes represent coding region, white boxes represent UTR, and hatched boxes represent alternatively spliced regions. Lengths are listed for each exon (in bp) and for the overall gene (in kb).

Three of those five contained introns and two were processed pseudogenes—the first SULT processed pseudogenes to be described in any species. The remaining two genes, SULT1C3 and SULT6B1, contained no obvious inactivating mutations. Expression-profiling studies failed to detect any SULT1C3 transcripts; so this gene is probably more accurately described as a ‘putative’ gene. A series of alternatively spliced SULT6B1 transcripts were identified in the testis, confirming the expression of this gene in humans. SULT6B1 is also highly conserved in two primate species closely related to humans, the chimpanzee and gorilla. The results of this database mining exercise will aid in future studies of the regulation of these genes, will provide insights into the evolution of this gene family in humans, and will serve as a starting point for future comparative genomic studies of SULT genes.

RESULTS

BLAST and Sequence Alignment Searches

To perform the first search of the human genome sequence database, 17 SULT protein sequences were selected for use as query sequences in a TBLASTN search. Those query sequences included the 11 SULT isoforms encoded by the 10 genes known to be expressed in humans (SULT2B1 encodes two isoforms as a result of alternative splicing), four non-human SULT sequences that represented families or subfamilies that contained no known human orthologs or paralogs, and two different consensus sequences. These 17 TBLASTN searches produced a total of 1765 hits that were each annotated manually. In total, 996 hits (56%) corresponded to one of the 10 SULT genes known to be expressed in humans, 422 (24%) were hits to novel (previously uncharacterized) SULT genes, 71 (4%) were determined to be nonspecific, and 276 (16%) were ambiguous and retained for further analysis.

The 276 ambiguous hits were reduced to 195 nonredundant hits that represented different regions of genomic sequence, as well as one hit from each of the 10 known human SULT genes. A 500 kb region surrounding each of those 195 hits was then searched using the FrameSearch and TFastA programs. Approximately 10 360 searches were performed with those two programs, generating 39 125 hits (29 395 from TFastA and 9730 from FrameSearch). To aid in the annotation of those hits, a filter was developed that removed 32 798 hits (84%), leaving 6327 for further analysis. This filter is described in detail in the Materials and methods section. After plotting and analyzing those hits, seven novel SULT genes were identified and annotated.

After the novel SULT genes were annotated, we performed a second round of searches, using four of the seven novel SULT sequences as queries for TBLASTN. In all, 303 hits were returned and annotated. Of those, 256 (84%) corresponded to known SULT genes and eight hits were retained for further analysis. The follow-up analysis included 440 queries performed between the FrameSearch and TFastA programs, producing 1737 hits. Altogether, 1492 (86%) of those hits were eliminated by the filter. No additional SULT genes were identified after annotating the remaining 245 hits.
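The hit counts reported for the first round of searches can be checked with a few lines of arithmetic. The sketch below uses only numbers stated in the text; the category labels and the helper name `tally` are illustrative assumptions.

```python
def tally(counts: dict[str, int]) -> dict[str, int]:
    """Return each category's share of the total as a rounded percentage."""
    total = sum(counts.values())
    return {name: round(100 * n / total) for name, n in counts.items()}


# Manual annotation of the 1765 first-round TBLASTN hits.
first_round = {"known": 996, "novel": 422, "nonspecific": 71, "ambiguous": 276}
shares = tally(first_round)

# Follow-up searches: TFastA hits plus FrameSearch hits, then the filter.
total_hits = 29395 + 9730
remaining = total_hits - 32798
```

The four categories sum to 1765, and the filtered total of 6327 matches the number of hits carried forward for plotting.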


HMMER and GeneWise Searches

In parallel with the first approach, a separate but complementary strategy that utilized HMMs was also used to search for novel SULT genes. Initially, a HMM profile was constructed using the sequences of the 11 human SULT isoforms known to be expressed in humans. That profile was calibrated and used with the HMMER program to search a six-fold translation of the human genome sequence, producing 181 hits. Annotation revealed that 111 hits (61%) corresponded to one of the 10 known or seven novel SULT genes identified using the initial strategy, 54 hits (30%) were nonspecific, and 16 (9%) were not readily classifiable and were included in the next step. To verify the HMMER results and refine those sequence alignments, the GeneWise program was used to search all genomic contigs identified by the HMMER search that contained either one of the 17 SULT genes or one of the 16 unclassified hits. GeneWise produced a total of 384 hits. A total of 119 (31%) hits corresponded to SULT genes and 265 (69%) were nonspecific. All seven of the novel SULT genes that had been found during the BLAST-based approach were also identified using the HMM strategy, but no additional novel genes were observed.

Annotation of SULT Genes and Genomic Organization of SULT Gene Clusters

Following identification of the seven novel human SULT genes, each of those genes was annotated at the sequence level using a manual homology-based approach. Local sequence alignments, as well as the knowledge of conserved SULT gene structures (Figure 1), were employed during that annotation process. Of the total of 17 SULT genes that are now known, 13 are found in gene clusters on chromosomes 2q, 4q, 16p, and 19q (Table 1). Each of the gene structures will be described subsequently, organized on the basis of chromosomal location. Annotations for the seven novel genes were deposited in the GenBank Third Party Annotation database under TPA accession numbers BK001431–BK001437.
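The HMMER search described above was run against a six-fold translation of the genome, that is, the three reading frames of each strand translated into protein. The following is a minimal sketch of that translation step using the standard genetic code; the function names are assumptions, and this is not the code or tooling used in the study.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons; unknown codons (e.g. with N) become 'X'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(dna: str) -> dict[str, str]:
    """Translate all three frames of both strands, keyed '+1'..'+3', '-1'..'-3'."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames
```

For example, `six_frame_translation("ATGGCCTAA")["+1"]` gives `"MA*"`, where `*` marks a stop codon.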
The Chromosome 2 Cluster and SULT6B1

The four human members of the SULT1C subfamily are located in a cluster at 2q12, and the SULT6B1 gene maps to 2p22.3. The gene structures for SULT1C1 (approximately 20 kb long) and SULT1C2 (about 10 kb in length) have been reported previously.21 Both the SULT1C1 and SULT1C2 genes contain seven exons that encode the protein (Figure 1). However, the SULT1C1 gene contains an additional 5′ noncoding exon as well as a rarely expressed alternatively spliced exon located within intron 3. The structures for both of those genes were confirmed during the present study. In addition to those two genes, this study also identified two other genes located within the SULT1C gene cluster: SULT1C3 and SULT1C1P.


Table 1 Human SULT gene chromosomal localizations

Gene                                   Chromosomal localization
SULT1A1, SULT1A2, SULT1A3              16p11.2–12.1
SULT1B1, SULT1E1, SULT1D1P             4q13
SULT1C1, SULT1C2, SULT1C3, SULT1C1P    2q12.2
SULT2A1, SULT2B1, SULT2A1P             19q13.33–13.4
SULT1D2P                               3q22.2
SULT3A1P                               14q12
SULT6B1                                2p22.3
SULT4A1                                22q13.31

The chromosomal localizations for each of the 17 human SULT genes are listed. Obvious pseudogenes are indicated by a 'P' after their name. In most cases, experimental evidence supports the map locations listed in the table (see text for details). However, the locations of several of the novel genes are based only on map locations for the genomic contigs from which they were annotated.
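The statement that 13 of the 17 genes lie in clusters on 2q, 4q, 16p, and 19q can be reproduced from the table. The sketch below groups genes by chromosome arm, which is a simplifying assumption made for illustration (a cluster is treated here as two or more genes on the same arm); the helper names are hypothetical.

```python
import re
from collections import defaultdict

# Gene-to-locus mapping transcribed from Table 1 (ASCII hyphen in ranges).
LOCI = {
    "SULT1A1": "16p11.2-12.1", "SULT1A2": "16p11.2-12.1", "SULT1A3": "16p11.2-12.1",
    "SULT1B1": "4q13", "SULT1E1": "4q13", "SULT1D1P": "4q13",
    "SULT1C1": "2q12.2", "SULT1C2": "2q12.2", "SULT1C3": "2q12.2", "SULT1C1P": "2q12.2",
    "SULT2A1": "19q13.33-13.4", "SULT2B1": "19q13.33-13.4", "SULT2A1P": "19q13.33-13.4",
    "SULT1D2P": "3q22.2", "SULT3A1P": "14q12", "SULT6B1": "2p22.3", "SULT4A1": "22q13.31",
}


def chromosome_arm(locus: str) -> str:
    """Return chromosome plus arm, e.g. '16p11.2-12.1' -> '16p'."""
    match = re.match(r"(\d+|[XY])([pq])", locus)
    if match is None:
        raise ValueError(f"unrecognized locus: {locus!r}")
    return match.group(1) + match.group(2)


def clusters(loci: dict[str, str], min_size: int = 2) -> dict[str, list[str]]:
    """Group genes by chromosome arm and keep arms with at least min_size genes."""
    groups = defaultdict(list)
    for gene, locus in loci.items():
        groups[chromosome_arm(locus)].append(gene)
    return {arm: genes for arm, genes in groups.items() if len(genes) >= min_size}
```

Grouping by arm keeps SULT6B1 (2p) apart from the SULT1C cluster (2q), in line with the text.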

The nine putative protein-coding exons of SULT1C3 span approximately 18 kb (Figure 2a). While SULT1C3 shares many features with other SULT1 family genes (Figure 1), the structure of this gene also contains unusual characteristics. First, the initial coding exon for SULT1C3 contains three in-frame ATG codons that would result in proteins 304, 294, or 284 amino acids in length, lengths similar to those of other known SULT isoforms. The sites that would produce the shortest two proteins conformed slightly better to the canonical translation initiation sequence27 than did the third site. Second, exons 7 and 8 are duplicated in SULT1C3 (Figure 2a); so alternative splicing might occur for exons 7a, 7b, 8a, and 8b. The peptide sequences encoded by exons 7a and 7b are 61.7% identical to each other, and the peptide sequences encoded by exons 8a and 8b are 78.4% identical. Each exon is flanked by GT-AG dinucleotides and maintains the ORF, so alternative splicing could potentially produce four SULT1C3 proteins. Each of these four potential allozymes was named according to the exons 7 and 8 that it contained: SULT1C3a (exons 7a and 8a), SULT1C3b (7a/8b), SULT1C3c (7b/8a), and SULT1C3d (7b/8b). A transcript containing exons 7b and 8a would require a nonstandard splicing process due to the genomic arrangement of those exons and, as a result, would seem quite unlikely. The three putative proteins encoded by SULT1C3a–d are 60–68% identical to the rat SULT1C1 sequence and 53–58% identical to human SULT1C1 and SULT1C2.

SULT1C1P contains at least seven exons and spans approximately 18 kb (Figure 2b). However, this gene has undergone rearrangement, since exons 2–5 are located 3′ to exons 6–8. In addition, this gene lacks an in-frame ATG start

[Figure 2 image. Panel (a), Novel SULT Genes: SULT1C3 (~18 kb) and SULT6B1 (~29 kb). Panel (b), Novel SULT Pseudogenes: SULT1C1P (~18 kb), SULT1D1P (~22 kb), SULT2A1P (~153 kb), SULT1D2P (contains 5 stop codons and 3 frameshifts), and SULT3A1P (contains 4 stop codons and a frameshift). Exon and intron lengths are labeled in bp.]

Figure 2 Novel human SULT gene structures. Structures for the seven novel human SULT genes annotated in this study are illustrated. In addition to the information given in Figure 1, the lengths of introns (in bp), as well as the locations of mutations (arrows and checkered regions), are shown. Panel (a): Structures of the two novel genes that were not obvious pseudogenes are depicted. RACE was only performed for SULT6B1. Therefore, UTR data are available only for that gene. The 5′ untranslated exons for SULT6B1 are indicated by a dashed line and are depicted in Figure 3. Panel (b): Structures for the five pseudogenes identified during this study are illustrated. The two halves of SULT2A1P are on opposite strands, as indicated by the dashed line.
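Several of the mutations marked in Figure 2 fall on the dinucleotides at intron boundaries, where a canonical intron begins with GT and ends with AG. A minimal sketch of that check follows; the function names, the 0-based coordinate convention, and the toy sequence in the usage note are assumptions for illustration only.

```python
def intron_boundaries(genomic: str, exon_end: int, next_exon_start: int) -> tuple[str, str]:
    """Return the first and last two bases of the intron between two exons.

    Coordinates are 0-based: exon_end is the index just past the upstream
    exon, and next_exon_start is the index of the downstream exon's first base.
    """
    intron = genomic[exon_end:next_exon_start].upper()
    if len(intron) < 4:
        raise ValueError("intron too short to contain both splice sites")
    return intron[:2], intron[-2:]


def is_canonical(donor: str, acceptor: str) -> bool:
    """True when the intron is flanked by the canonical GT...AG dinucleotides."""
    return (donor, acceptor) == ("GT", "AG")
```

With a toy sequence such as `"ATGGCC" + "GTAAGTTTTCAG" + "GGATAA"` and coordinates 6 and 18, the function returns `("GT", "AG")`; changing the donor to GA makes `is_canonical` return `False`.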

codon near the expected location in the predicted exon 2 sequence, and the splice junctions at the 3′ ends of exons 3, 6, and 7 are not GT. There is also a premature stop codon within exon 8, possibly resulting from a 1 bp deletion in a tyrosine codon conserved in most SULT genes. For these reasons, this gene was designated a pseudogene (indicated by the trailing P in the name) and, since its hypothetical translated sequence was most similar to SULT1C1, it was given SULT1C1 as its base name. The order of the four genes in the SULT1C cluster is cen-SULT1C3-SULT1C1-SULT1C1P-SULT1C2-ter. The genes are located approximately 23, 5, and 45 kb apart, respectively.

SULT6B1 is located near 2p22.3, on the opposite arm of chromosome 2 relative to the SULT1C gene cluster. SULT6B1 contains seven putative protein-coding exons that span approximately 21 kb, and has a structure very similar to those of other SULT genes (Figure 2a). Homology-based prediction of the location of the ATG start codon in exon 2 indicated that this exon is at least 199 bp in length. However, RACE and the expression-profiling experiments described subsequently indicated that the probable start codon was located much farther downstream than predicted, and revealed that exon 2 was only 106 bp in length. Those experiments also identified four noncoding exons upstream of exon 2, bringing the total length of SULT6B1 to approximately 29 kb. In addition, experiments designed to detect alternatively spliced transcripts identified a second exon 4-like sequence (termed exon 4b) located between exons 4 and 5. The predicted protein sequence of SULT6B1 was most similar to that of chicken SULT6A1 (52% identity, 63% similarity) and, based on the nomenclature that has been suggested for SULT genes,8 this novel SULT was classified as the first member of the SULT6B subfamily. Portions of this gene were subsequently annotated by NCBI's automated gene prediction software as ‘similar to sulfotransferase family’.

The Chromosome 4 Cluster

Three SULT genes are located at chromosome 4q13: SULT1E1 and SULT1B1, both of which encode previously characterized proteins, and SULT1D1P, a gene identified during the present database searches. The gene structure for SULT1E1 was reported previously.17 SULT1E1 contains seven exons that encode the protein, as well as an additional upstream noncoding exon, and is approximately 18 kb in length (Figure 1). SULT1B1 also contains seven protein-coding exons that span 28 kb (Figure 1), but since no RACE data were reported for this gene, no noncoding regions could be identified.21

SULT1D1P shares homology with the conserved gene structures for other SULT1 family genes (Figure 2b), and is about 22 kb in length. A potential TATA box (TATTAA) is located 60 bp upstream of the putative translation initiation codon. Only one other human SULT gene, SULT1E1, contains a TATA box.17 A stop codon is present in exon 8 near the expected location, and seven polyadenylation signals are present within the subsequent 1800 bp.
While those features suggested that this gene might be active, exon 2 contains two in-frame stop codons after the predicted ATG start codon, and the GT dinucleotides that comprise the anticipated splice sites at the ends of exons 2 and 4 have been mutated to GA and AT, respectively. Examples of GA–AG splicing have been reported.28,29 However, while it is known that introns may (rarely) splice using AT–AC pairs,29,30 to our knowledge, there are no examples of introns in humans that utilize AT–AG sequences.29 Therefore, SULT1D1P was classified as a pseudogene and assigned the name 1D1 because of the high homology of the hypothetical protein product to mouse SULT1D1 (80% identical, 85% similar). The SULT1D1P gene is found approximately 36 kb telomeric to SULT1B1, and the SULT1E1 gene is located about 28 kb telomeric to SULT1D1P.

The Chromosome 16 Cluster

Three SULT genes are located on chromosome 16p11.2–12.1: SULT1A1, SULT1A2, and SULT1A3. The structures of all three of these genes have been reported previously,16,19,20 and those structures were confirmed during the present studies. All three genes contain multiple 5′-noncoding exons and seven exons that encode protein, with exon lengths that match the conserved gene structures found among other SULT1 family genes (Figure 1). The SULT1A genes are approximately 4.4, 5.1, and 7.3 kb in length, respectively. Therefore, they are much shorter than most other SULT genes. The assembly of this region of chromosome 16 is still incomplete, so no conclusions can presently be drawn with regard to the orientation of these genes relative to each other.

The Chromosome 19 Cluster

Two members of the human SULT2 family are located at chromosome 19q13.33. SULT2A1 is about 16 kb long, while SULT2B1 is approximately 48 kb in length (Figure 1). Both of these genes have six exons that encode the protein, but the SULT2B1 gene has an alternative initial coding exon that is alternatively transcribed and is spliced to a site within the ORF (Figure 1). Both of these gene structures were reported previously,12,18 and both were confirmed in the course of this study.

The third member of the human SULT2 family is the novel SULT2A1P pseudogene, located at 19q13.4. This gene contains an in-frame ATG codon at the expected start site in exon 2, but it lacks an identifiable exon 7 sequence (Figure 2b). Exon 6 contains a frame shift and is truncated by a MER85 repetitive sequence element, deleting the 5′ terminus of the highly conserved region IV sequence found at this location in other SULT genes. Two copies of exon 4 were found that are 68% identical (76% similar) to each other on the basis of predicted amino-acid sequence. The splice site located at the 5′-end of exon 4a is CG rather than AG, and exon 4b contains a stop codon. Finally, the most recent assembly of this region indicates that exons 2 and 3 are located on the opposite strand and approximately 131 kb away from exons 4b, 4a, 5, and 6.
Therefore, this gene was classified as a pseudogene and named 2A1P, since the hypothetical translation of the truncated gene is 56% identical (66% similar) to human SULT2A1. SULT2A1P is about 153 kb long and is located approximately 7.76 Mb telomeric to SULT2B1, which is in turn located about 665 kb telomeric to SULT2A1, making the SULT2 ‘cluster’ by far the longest of the SULT gene clusters. Unlike the other SULT gene clusters, the orientation of the genes in the SULT2 cluster is variable. The SULT2A1 gene and exons 4–6 of SULT2A1P are on the same strand, opposite that of exons 2–3 of SULT2A1P and SULT2B1.

SULT4A1

SULT4A1 is located on chromosome 22q13.31. The gene spans 38 kb, making it one of the longest human SULT genes (Figure 1). Like SULT1B1, the structure of SULT4A1 had not been reported previously, but it was readily annotated during our studies. While SULT4A1 contains seven coding exons and shares some homology with the conserved SULT gene structures, the lengths of several of the exons differ significantly from those found in other SULT genes. 5′- and 3′-RACE studies have not been reported for this gene, so the number of noncoding exons is not known for SULT4A1.

SULT1D2P and SULT3A1P

SULT1D2P and SULT3A1P are located on chromosomes 3 and 14, respectively (Figure 2b). Both are processed pseudogenes and represent the first reported SULT processed pseudogenes in any species. SULT1D2P contains five stop codons and three frame shifts, as well as a 349 bp Alu element inserted approximately 297 bp into the sequence. The predicted encoded amino-acid sequence of SULT1D2P shares 69% identity (75% similarity) with the predicted translation of the human SULT1D1P gene. Human SULT3A1P contains four stop codons and a single frame shift. The predicted translation of the SULT3A1P pseudogene is 51% identical and 63% similar to that of mouse SULT3A1.

Expression Profiling of SULT1C3 and SULT6B1

Of the seven novel SULT genes which we identified, only SULT1C3 and SULT6B1 do not contain any obvious inactivating mutations (Figure 2a). Therefore, both computational and experimental approaches were used to perform preliminary expression profiling studies for these two genes. Initially, the predicted peptide sequences encoded by both genes were used as queries in a TBLASTN search of the EST database. No ESTs were identified, indicating that, at present, there is no in silico evidence for the expression of either SULT1C3 or SULT6B1. Therefore, we also searched for evidence of expression by using PCR and cDNAs from 20 different human tissues. No SULT1C3 products were detected, even after performing nested amplifications. Therefore, on the basis of our data, the SULT1C3 gene is more accurately described as a putative active gene. However, SULT6B1 products were detected when testicular cDNA was used as template. Therefore, cDNA from this tissue was used to perform 5′- and 3′-RACE analysis.

SULT6B1 RACE

Marathon-Ready™ human testis cDNA was used to perform RACE for SULT6B1.
Analysis of 15 3′-RACE clones indicated that there were three functional polyadenylation signals, resulting in 3′-UTR lengths of 52, 80, and 113 bp. Initial attempts at 5′-RACE failed, so a new set of primers was designed that hybridized farther 3′-downstream from the predicted ATG translation initiation site. Products were generated successfully with the new primers, and 12 5′-RACE products were cloned and sequenced. Four exons were found upstream of the first predicted coding exon (Figure 3), one of which (exon 1B) utilized two different 3′ splice sites, creating a ‘short’ and ‘extended’ form of that exon. Three different transcription initiation sites were identified (upstream of exons 1D, 1C, and 1A). Finally, six different splicing patterns were observed, all of which spliced into the same location within the predicted exon 2 sequence (Figure 3). That splice site was located 3′ of both the computationally predicted translation initiation site and the first set of 5′-RACE primers, indicating that the in silico prediction of exon 2 extended too far upstream and that an alternate translation initiation site was used. Examples of each of the six splicing patterns have been submitted to GenBank (accession numbers AY289770–AY289775).

SULT6B1 Transcript Characterization

To characterize full-length SULT6B1 transcripts, primers located in the 5′- and 3′-UTRs were used to perform PCR with human testicular cDNA as template. Analysis of the products indicated that alternative splicing and exon skipping were common. In all, 11 different transcripts were characterized from 19 clones. ‘Complete’ transcripts consisting of coding exons 2–8 were found downstream of exons 1A (two clones), 1C–1B (two clones), 1D (three

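[Figure 3 image. Panel title: Human SULT6B1 5′-RACE Results. Exons 1D, 1C, 1B, 1A, and 2 are drawn with exon and intron lengths labeled in bp, and the ATG is marked in exon 2.]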

Figure 3 SULT6B1 5′-RACE results. The four exons identified by performing 5′-RACE with human testicular cDNA are indicated by white boxes, and the lengths of exons and introns are listed in bp. Six distinct splicing patterns were observed from at least three different sites of transcription initiation (exons 1D, 1C, and 1A). The location of the first in-frame ATG translation initiation codon in exon 2 is also shown.
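Exon skipping of the kind observed among the SULT6B1 transcripts preserves the ORF only when the total length of the skipped or added exons is a multiple of three. The sketch below illustrates that rule. The 84 bp length of exon 4b is stated in the text; the lengths used for exons 4, 5, and 6 (90, 127, and 95 bp) are read from the Figure 2 labels and should be treated as assumptions, as should the function names.

```python
# Exon lengths in bp. Exon 4b is from the text; exons 4, 5, and 6 are
# assumed values read from the Figure 2 labels for SULT6B1.
EXON_LENGTHS = {"4": 90, "4b": 84, "5": 127, "6": 95}


def frame_offset(changed_exons: list[str], lengths: dict[str, int] = EXON_LENGTHS) -> int:
    """Return the net frame offset (0, 1, or 2) from skipping or adding exons."""
    return sum(lengths[exon] for exon in changed_exons) % 3


def stays_in_frame(changed_exons: list[str], lengths: dict[str, int] = EXON_LENGTHS) -> bool:
    """True when the skipped or added exons leave the reading frame unchanged."""
    return frame_offset(changed_exons, lengths) == 0
```

Under these assumed lengths, skipping exons 4, 5, and 6 together removes 312 bp and keeps the frame, whereas skipping exon 5 or exon 6 alone does not. Keeping the frame is not sufficient for a functional product: an inserted exon can be a multiple of three in length and still carry a stop codon.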


clones), and 1D–1B (one clone). Exon 5 was skipped (1D–1B–2–3–4–6–7–8) in a single transcript, as was exon 6 (1A–2–3–4–5–7–8). In addition, two transcripts were isolated that lacked exons 4, 5, and 6. Three clones contained exons 1D–2–3–7–8 and two clones contained exons 1D–1B–2–3–7–8. Finally, an additional exon (termed exon 4b) was identified. Exon 4b was flanked by GT–AG dinucleotides and was 84 bp in length. It maintained the open reading frame, but it also contained a stop codon. Exon 4b was identified in three different splice forms, two of which simultaneously lacked exon 6: 1A–2–3–4–4b–5–6–7–8 (two clones), 1A–2–3–4–4b–5–7–8 (one clone), and 1C–1B–2–3–4–4b–5–7–8 (one clone). Among all of these transcripts, only those that contained exons 2–8 or exons 2–3–7–8 remained in frame. Even though the transcripts that skip exons 4–6 include the conserved Region I and IV sequences common to SULT enzymes, it seems unlikely that the resulting peptide will have SULT activity.

Cloning SULT6B1 from Other Primate Species

Each exon of the SULT6B1 gene was amplified from chimpanzee and gorilla genomic DNA using PCR. The predicted amino-acid translations were compared to the predicted human sequence (Table 2). A total of 13 ORF nucleotide differences were found among the three species. Nine amino acids were altered, while four were unchanged. The predicted sequences of the human, chimpanzee, and gorilla enzymes were 97% identical to each other. The exon sequences from chimpanzee and gorilla have been deposited in GenBank (accession numbers AY289776–AY289789).

DISCUSSION

Sulfation is an important pathway in the biotransformation of many drugs as well as endogenous compounds such as hormones and neurotransmitters.1,2 Therefore, it is important to explore the possibility of pharmacogenetic variation in this pathway for drug metabolism. That process began

Table 2 SULT6B1 sequence differences among primates

Location in human SULT6B1              Nucleotide sequence
Nucleotide          Amino acid         Human    Gorilla    Chimpanzee
A148G               LYS50GLU           A        A          G
G274A               GLU92LYS           G        A          A
A314C               LYS105THR          A        C          A
A321G               THR107THR          A        G          G
G336C               LEU112PHE          G        C          C
C390G               PHE130LEU          C        G          C
T400C               PHE134LEU          T        T          C
G429C               ARG143SER          G        C          C
GCT(538–540)CCC     ALA180PRO          GCT      CCC        GCT
G609A               ALA203ALA          C        A          A
C636T               HIS212HIS          C        C          T
A651C               PRO217PRO          A        C          A
A697G               SER233GLY          A        A          G

The locations and sequences of the differences found in the predicted ORF sequences of human, gorilla, and chimpanzee SULT6B1 are listed. Nucleotide and amino-acid locations correspond to the human sequence.
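The split between amino-acid-altering and silent differences in Table 2 can be tallied directly from the amino-acid labels. The sketch below is illustrative only; the parsing helper and its name are assumptions, and the labels are transcribed from the table.

```python
import re

# Amino-acid labels transcribed from Table 2 (reference residue, position,
# residue observed in at least one other species).
CHANGES = [
    "LYS50GLU", "GLU92LYS", "LYS105THR", "THR107THR", "LEU112PHE",
    "PHE130LEU", "PHE134LEU", "ARG143SER", "ALA180PRO", "ALA203ALA",
    "HIS212HIS", "PRO217PRO", "SER233GLY",
]


def parse_change(label: str) -> tuple[str, int, str]:
    """Split a label such as 'LYS50GLU' into ('LYS', 50, 'GLU')."""
    match = re.fullmatch(r"([A-Z]{3})(\d+)([A-Z]{3})", label)
    if match is None:
        raise ValueError(f"unrecognized label: {label!r}")
    ref, position, alt = match.groups()
    return ref, int(position), alt


def count_changes(labels: list[str]) -> tuple[int, int]:
    """Return (altered, unchanged) counts of amino-acid positions."""
    altered = sum(1 for ref, _, alt in map(parse_change, labels) if ref != alt)
    return altered, len(labels) - altered
```

Applied to the 13 entries, `count_changes(CHANGES)` returns `(9, 4)`, matching the nine altered and four unchanged residues reported in the text.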


nearly two decades ago when biochemical genetic studies of SULT1A1 and SULT1A3 demonstrated that levels of activity for both of these SULT isoforms in the human platelet had high heritability.31–33 More recently, a series of gene resequencing and functional genomic studies have resulted in the identification of functionally significant genetic polymorphisms for SULT1A1,34 SULT1A3,35 SULT1E1,36 and SULT2A1.37 One of the SULT1A1 polymorphisms has already been reported to be of pharmacogenetic significance for variation in the survival of breast cancer patients treated with tamoxifen.38 Therefore, it is important for pharmacogenetics that all human genes encoding cytosolic SULTs be identified and characterized. The present studies represent the first comprehensive database mining project conducted to identify and annotate SULT genes in humans. We identified seven novel SULT genes in addition to the 10 that were previously reported to be expressed in humans. We also found evidence for the expression of SULT6B1, one of the two novel genes that appeared most likely to be expressed in human tissues. The strategy used to search the human genome database involved two different but complementary approaches: one that used explicit SULT sequences as queries, and the other that utilized HMM profiles. The first approach began by using the TBLASTN program to perform an initial screen of the genome. Genomic contigs containing hits of interest were then scanned using two additional programs that employ different alignment algorithms. The TFastA program was chosen for speed and, in some cases, for its increased sensitivity over TBLASTN. The FrameSearch program was used for its ability to perform a Smith and Waterman local alignment using a protein sequence as a query and all possible codons in a nucleotide database. 
To aid in the annotation of the large number of hits generated by those two programs, a graphical display was created that conveyed multiple levels of information about each hit on a single graph. As shown in Figure 4, both programs identified most of the exons in all SULT genes. However, many hits were identified by only one of the two programs. For example, TFastA found most of the exons in the SULT1C1P gene, but FrameSearch found none. Therefore, our decision to use multiple alignment programs based on different algorithms, and then to combine the datasets prior to analysis, greatly aided in the identification and annotation of novel SULT genes.

The second approach used to search the human genome relied on HMM profiles rather than explicit amino-acid sequences. HMM profiles use position-specific scores that capture the degree of conservation of each residue, making HMM searches more sensitive than pair-wise alignment methods like those used in the initial approach. An HMM profile was created using the sequences of the 11 human SULT enzymes known to be expressed (SULT2B1 produces two isoforms), and that profile was used with the HMMER program to search a six-fold translation of the human genome database. Genomic contigs containing hits of interest were identified and subsequently searched using the GeneWise program. GeneWise also uses HMM profiles, but it is able to search a nucleotide database


Figure 4 Graphical display of sequence alignment results. A novel graphical display was developed to facilitate the annotation process. The figure illustrates the SULT1C cluster as an example of this display, with each hit obtained by the FrameSearch (shaded points) or TFastA (open points) programs plotted. Each hit was graphed according to its position within the genomic contig (x-axis) and the exon that was used as a query sequence to generate the hit (y-axis). Points above the x-axis are hits on the forward strand, while points below the x-axis represent hits on the reverse strand. The relative size of each data point indicates either the percent identity of the hit to the query exon sequence (top graph) or the length of the overall alignment (bottom graph). On these plots, hits corresponding to SULT genes form clear diagonals (circled in green) that are readily distinguishable from the nonspecific hits that are scattered throughout the plot. One exception is the SULT1C1P gene, which does not form a diagonal. That is a result of rearrangement of the exons in SULT1C1P (see Figure 2).
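The layout described in this caption can be sketched in Python as a mapping from hits to plot coordinates. The record fields (position, exon, strand, program, identity, length) and the function name are hypothetical; the original plots were produced with Microsoft Excel and a Visual Basic program, not with this code.

```python
def bubble_coordinates(hits, size_by="identity", scale=1.0):
    """Convert alignment hits into bubble-plot points.

    x      -> position of the hit within the genomic contig
    y      -> query exon number, positive for forward-strand hits and
              negative for reverse-strand hits
    size   -> percent identity or alignment length, chosen by `size_by`
    filled -> True for FrameSearch hits (shaded), False for TFastA (open)
    """
    points = []
    for hit in hits:
        sign = 1 if hit["strand"] == "+" else -1
        points.append({
            "x": hit["position"],
            "y": sign * hit["exon"],
            "size": scale * hit[size_by],
            "filled": hit["program"] == "FrameSearch",
        })
    return points
```

Plotted this way, consecutive exons of one gene fall on a diagonal because both the contig position and the exon number increase together along the gene.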

directly. In addition, GeneWise can compensate for gaps in the alignment due to either sequencing error or introns. Those searches resulted in the identification of several novel SULT genes, including SULT6B1, the first human SULT gene in family 6 and only the second member of that family to be identified in any species. A series of alternatively spliced SULT6B1 transcripts was identified in the testis, confirming the expression of this gene and forming the basis for future functional studies. In addition to SULT6B1, a previous report39 suggested that a third member of the SULT1C family existed in humans on the basis of partial exon sequence of a putative human homolog of rat SULT1C1, although the full sequence and structure of that gene remained unknown. The present studies have successfully identified and annotated the structure of that gene, SULT1C3. While SULT1C3 has the potential to be active on the basis of nucleotide sequence, no experimental evidence was found that supported the transcription of that gene in the tissues in the cDNA panel or in the MTE array. However, it is possible that SULT1C3 is expressed at a different time during development, or under different physiological conditions, than those represented by the tissue samples used to prepare the cDNA panel and the MTE array. Therefore, the question of whether this gene is expressed in humans remains open until SULT1C3 transcripts are detected. After we had identified and studied these two genes, both of them were deposited in GenBank as a result of patent filing WO 0164904 for SULT6B1 and patent filing WO 9172977 for SULT1C3.


While experimental data supported the computationally predicted gene structure for SULT6B1, 5′-RACE revealed that exon 2 was approximately half as long as originally predicted. This observation emphasizes the fact that even manually curated gene annotations require experimental confirmation and validation. 5′-RACE experiments identified four noncoding exons upstream of SULT6B1 exon 2, three of which served as sites of transcription initiation (Figure 3). We also obtained evidence for the existence of several alternatively spliced mRNAs, but further studies will be required to determine the function, if any, of the encoded protein. Nonetheless, several observations can be made with regard to the predicted SULT6B1 protein sequence. First, the initial in-frame ATG in exon 2 matches the Kozak consensus sequence for translation initiation,27 but that ATG would produce a protein that is approximately 20 amino acids shorter at the N-terminus than other human SULTs. Second, the predicted amino-acid sequence of SULT6B1 contains two gaps relative to other human SULTs, one located approximately 20 amino acids downstream of Region I and the other about five residues upstream of Region IV (Figure 5). The first gap is aligned with gaps

1A1   ~~~~~~~~~~ MELIQDTSRP PLEYVKG.VP LIKYFAEALG PLQS.FQARP DDLLISTYPK SGTTWVSQIL
1B1   ~~~~~~~~~~ MLSPKDILRK DLKLVHG.YP MTCAFASNWE KIEQ.FHSRP DDIVIATYPK SGTTWVSEII
1C1   ~~~~~~MALT SDLGKQI.K. .LKEVEGTLL QPA.TVDNWS QIQS.FEAKP DDLLICTYPK AGTTWIQEIV
1C2   MALHEMEDFT FDGTKRL.S. .VNYVKG.IL QPTDTCDIWD KIWN.FQAKP DDLLISTYPK AGTTWTQEIV
1C3a  MAKIEKNAPT MEKKPELFN. .IMEVDG.VP TLILSKEWWE KVCN.FQAKP DDLILATYPK SGTTWMHEIL
1C3d  MAKIEKNAPT MEKKPELFN. .IMEVDG.VP TLILSKEWWE KVCN.FQAKP DDLILATYPK SGTTWMHEIL
1E1   ~~~~~~~~~~ MNSELDYYEK .FEEVHG.IL MYKDFVKYWD NVEA.FQARP DDLVIATYPK SGTTWVSEIV
2A1   ~~~~~~~~~~ ~~~~~~MSDD FLWFEGIAFP TMGFRSETLR KVRDEFVIRD EDVIILTYPK SGTNWLAEIL
2B1a  ~~~~~MASPP PFHSQKLPGE YFRYKGVPFP VGLYSLESIS LAENTQDVRD DDIFIITYPK SGTTWMIEII
4A1   ~~MAESEAET PSTPGEFESK YFEFHGVRLP P..FCRGKME EIAN.FPVRP SDVWIVTYPK SGTSLLQEVV
6B1   ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~MCTSETFQ AL.DTFEARH DDIVLASYPK CGSNWILHIV

1A1   DMIYQGGDLE KCHRAPIFMR VPFLEFKAPG I.PSGMETLK DTPAPRLLKT HLPLALLPQT LLDQKVKVVY
1B1   DMILNDGDIE KCKRGFITEK VPMLEMTLPG LRTSGIEQLE KNPSPRIVKT HLPTDLLPKS FWENNCKMIY
1C1   DMIEQNGDVE KCQRAIIQHR HPFIEWARP. PQPSGVEKAK AMPSPRILKT HLSTQLLPPS FWENNCKFLY
1C2   ELIQNEGDVE KSKRAPTHQR FPFLEMKIP. SLGSGLEQAH AMPSPRILKT HLPFHLLPPS LLEKNCKIIY
1C3a  DMILNDGDVE KCKRAQTLDR HAFLELKFPH KEKPDLEFVL EMSSPQLIKT HLPSHLIPPS IWKENCKIVY
1C3d  DMILNDGDVE KCKRAQTLDR HAFLELKFPH KEKPDLEFVL EMSSPQLIKT HLPSHLIPPS IWKENCKIVY
1E1   YMIYKEGDVE KCKEDVIFNR IPFLECRKEN LM.NGVKQLD EMNSPRIVKT HLPPELLPAS FWEKDCKIIY
2A1   CLMHSKGDAK WIQSVPIWER SPWVESEI.. ....GYTALS ETESPRLFSS HLPIQLFPKS FFSSKAKVIY
2B1a  CLILKEGDPS WIRSVPIWER APWCETIV.. ....GAFSLP DQYSPRLMSS HLPIQIFTKA FFSSKAKVIY
4A1   YLVSQGADPD EIGLMNIDEQ LPVLEYPQP. ....GLDIIK ELTSPRLIKS HLPYRFLPSD LHNGDSKVIY
6B1   SELIYAVSKK KYK....YPE FPVLEC.... GDSEKYQRMK GFPSPRILAT HLHYDKLPGS IFENKAKILV

1A1   VARNAKDVAV SYYHFYHMAK VHPEPGTWDS FLEKFMVGEV SYGSWYQHVQ EWWELSRTHP VLYLFYEDMK
1B1   LARNAKDVSV SYYHFDLMNN LQPFPGTWEE YLEKFLTGKV AYGSWFTHVK NWWKKKEEHP ILFLYYEDMK
1C1   VARNAKDCMV SYYHFQRMNH MLPDPGTWEE YFETFINGKV VWGSWFDHVK GWWEMKDRHQ ILFLFYEDIK
1C2   VARNPKDNMV SYYHFQRMNK ALPAPGTWEE YFETFLAGKV CWGSWHEHVK GWWEAKDKHR ILYLFYEDMK
1C3a  VARNPKDCLV SYYHFHRMAS FMPDPQNLEE FYEKFMSGKV VGGSWFDHVK GWWAAKDMHR ILYLFYEDIK
1C3d  VARNPKDCLV SYYHFHRMAS FMPDPQNLEE FYEKFMSGKV VGGSWFDHVK GWWAAKDMHR ILYLFYEDIK
1E1   LCRNAKDVAV SFYYFFLMVA GHPNPGSFPE FVEKFMQGQV PYGSWYKHVK SWWEKGKSPR VLFLFYEDLK
2A1   LMRNPRDVLV SGYFFWKNMK FIKKPKSWEE YFEWFCQGTV LYGSWFDHIH GWMPMREEKN FLLLSYEELK
2B1a  MGRNPRDVVV SLYHYSKIAG QLKDPGTPDQ FLRDFLKGEV QFGSWFDHIK GWLRMKGKDN FLFITYEELQ
4A1   MARNPKDLVV SYYQFHRSLR TMSYRGTFQE FCRRFMNDKL GYGSWFEHVQ EFWEHRMDSN VLFLKYEDMH
6B1   IFRNPKDTAV SFLHFHNDVP DIPSYGSWDE FFRQFMKGQV SWGRYFDFAI NWNKHLDGDN VKFILYEDLK

1A1   ENPKREIQKI LEFVGRSLPE ETVDFMVQHT SFKEMKKNPM TNYTTVPQEF MDHSISPFM. RKGMAGDWKT
1B1   ENPKEEIKKI IRFLEKNLND EILDRIIHHT SFEVMKDNPL VNYTHLPTTV MDHSKSPFM. RKGTAGDWKN
1C1   RDPKHEIRKV MQFMGKKVDE TVLDKIVQET SFEKMKENPM TNRSTVSKSI LDQSISSFM. RKGTVGDWKN
1C2   KNPKHEIQKL AEFIGKKLDD KVLDKIVHYT SFDVMKQNPM ANYSSIPAEI MDHSISPFM. RKGAVGDWKK
1C3a  KNPKHEIHKV LEFLEKTWSG DVINKIVHHT SFDVMKDNPM ANHTAVPAHI FNHSISKFM. RKGMPGDWKN
1C3d  KDPKREIEKI LKFLEKDISE EILNKIIYHT SFDVMKQNPM TNYTTLPTSI MDHSISPFM. RKGMPGDWKN
1E1   EDIRKEVIKL IHFLERKPSE ELVDRIIHHT SFQEMKNNPS TNYTTLPDEI MNQKLSPFM. RKGITGDWKN
2A1   QDTGRTIEKI CQFLGKTLEP EELNLILKNS SFQSMKENKM SNYSLLSVDY VVD.KAQLL. RKGVSGDWKN
2B1a  QDLQGSVERI CGFLGRPLGK EALGSVVAHS TFSAMKANTM SNYTLLPPSL LDHRRGAFL. RKGVCGDWKN
4A1   RDLVTMVEQL ARFLGVSCDK AQLEALTEHC H...QLVDQC CNAEALP... ........V. GRGRVGLWKD
6B1   ENLAAGIKQI AEFLGFFLTG EQIQTISVQS TFQAMRAKSQ DTHGAV.... .....GPFLF RKGKVALWKN

1A1   TFTVAQNERF DADYAKKMAG C..SLTFRSE L~~~~
1B1   YFTVAQNEKF DAIYETEMSK T..ALQFRTE I~~~~
1C1   HFTVAQNERF DEIYRRKMEG T..SINFCME L~~~~
1C2   HFTVAQNERF DEDYKKKMTD T..RLTFHFQ F~~~~
1C3a  HFTVALNENF DKHYEKKMAG S..TLNFCLE I~~~~
1C3d  YFTVAQNEEF DKDYQKKMAG S..TLTFRTE I~~~~
1E1   HFTVALNEKF DKHYEQQMKE S..TLKFRTE I~~~~
2A1   HFTVAQAEDF DKLFQEKMAD LP.RELFPWE ~~~~~
2B1a  HFTVAQSEAF DRAYRKQMRG MP...TFPWD E~~~~
4A1   IFTVSMNEKF DLVYKQKM.G KC.DLTFDFY L~~~~
6B1   LFSEIQNQEM DEKFKECLAG TSLGAKLKYE SYCQG

Figure 5 Human SULT sequences. An amino-acid sequence alignment of all SULT enzymes known to be expressed in humans. The highly conserved Regions I and IV are highlighted in blue and red, respectively. The dimerization interface region overlaps Region IV, and is underlined. The putative SULT1C3 gene might produce four separate isoforms as a result of alternative splicing. The sequences for two of those isoforms (1C3a and 1C3d) are shown, and those sequences can be used to assemble the sequences for 1C3b and 1C3c. The SULT2B1 gene produces two isoforms (2B1a and 2B1b) that differ only at the N terminus, so only SULT2B1a is shown. The SULT6B1 amino-acid sequence is based on the predicted translation of cDNAs.


present in members of families 2 and 4, while the latter aligns with the gap present in human SULT4A1, which is predicted to enlarge the opening of the active site. Third, the dimerization motif present in other SULT sequences7 is altered in two places in human SULT6B1. Specifically, the Glu residue that forms an ion pair in other SULTs is a Gln residue in SULT6B1 (Figure 5). In addition, the conserved Val residue, which extends into the hydrophobic pocket in the dimer interface region, is a Glu residue in SULT6B1. This residue is also Glu in mouse SULT1E1, an isoform known to exist as a monomer,7 and studies have shown that this change prevents dimerization.7 Therefore, SULT6B1 may exist as a monomer.

Among the seven novel genes identified as a result of these studies, two, SULT1D2P and SULT3A1P, were processed pseudogenes, the first SULT processed pseudogenes described in any species. Furthermore, the SULT1C1P and SULT2A1P genes are the first unambiguous intron-containing SULT pseudogenes to be described in any species. While our expression profiling and RACE experiments for SULT1C3 and SULT6B1 were being performed, the structure of the SULT1D1P gene was reported following an independent database search.22 Our annotation confirms and refines the gene structures, map locations, and orientations described in that report. The SULT1D1P gene contains two stop codons in the putative exon 2 sequence. However, as a result of sequence divergence at the N-terminus of SULT enzymes, the initial coding exons of SULT genes are very difficult to predict accurately, so it is possible that an additional, functional exon 2 sequence exists farther upstream. Nonetheless, we concluded that SULT1D1P is a pseudogene. However, it must have been active in germline tissue at one time, since it has a corresponding processed pseudogene on chromosome 3.
Unlike SULT1D1P, the ‘active’ SULT3A1 gene corresponding to the SULT3A1P processed pseudogene was not found, despite the fact that we searched with both the mouse SULT3A1 and predicted human SULT3A1 sequences. It is likely that the gene that gave rise to SULT3A1P has been deleted in the course of evolution. Most human SULT genes are found in clusters, presumably formed through gene duplication events, although several solitary genes were also found. The orientation of all the genes in those clusters is head-to-tail, with the exception of the SULT2 cluster, which appears to have been subjected to at least one inversion. The annotations described in this study will provide additional insight into the evolution of the SULT gene family in humans and may aid in future studies of the regulation of these genes. In addition, these results will serve as a starting point for future comparative genomic studies of SULT genes in other species.

MATERIALS AND METHODS

Query Sequences
The protein query sequences used during the initial database searches were translated from the published cDNA sequences for human SULTs 1A1 (GenBank accession number L19999),40 1A2 (U28169),41 1A3 (L19956),42 1B1 (U95726),43 1C1 (U66036),11 1C2 (AF055584),44 1E1 (U08098),45 2A1 (U08024),46 2B1a (U92314),12 2B1b (U92315),12 and 4A1 (AF188698).47 To increase the specificity of the searches, the terminal 54 amino acids of SULT2B1, which compose the proline-rich tail unique to that isoform, were removed. Additional query sequences were mouse 1D1 (U32371),48 mouse 3A1 (AF026075), mouse 5A1 (AF026074), chicken 6A1 (AF033189),49 and the PFAM database50,51 consensus sequence for cytosolic sulfotransferases (pfam00685). A second consensus sequence was constructed from the 11 human SULT sequences using the GCG program Profilemake.

BLAST and Sequence Alignment Searches
To perform the initial screen of the human genome, 17 SULT sequences were selected as query sequences in a TBLASTN23 search using a local copy of the RefSeq database (October 2001). The query sequences included the 11 SULT isoforms known to be expressed in humans, four non-human SULT sequences (mouse 1D1, mouse 3A1, mouse 5A1, and chicken 6A1), and the two different consensus sequences. For those searches, the BLOSUM62 scoring matrix was used and the expect parameter was set to 50. The results were annotated manually. Hits that did not correspond to known or novel SULT genes, or which were not obviously nonspecific (alignments that were very short or shared very low similarity, or those that should have encompassed portions of the highly conserved Region I and IV sequences but did not display any homology over those regions), were retained for further analysis. Redundancy was eliminated from the set of retained hits by preserving only a single ‘marker’ hit for each SULT gene and, since a 500 kb window of sequence was searched in the subsequent step, only the best hit in any given 100 kb window of genomic sequence was retained when a single query produced multiple hits within that window. A 500 kb window of sequence was identified for each of the nonredundant hits.
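The redundancy-elimination step described above can be illustrated with a short Python sketch. Two details are assumptions made for the example, since the text does not specify them: windows are treated as fixed, non-overlapping 100 kb bins along each contig, and the "best" hit is taken to be the one with the highest alignment score. The record fields and the function name are hypothetical.

```python
def best_hit_per_window(hits, window=100_000):
    """Keep only the best-scoring hit per query, contig, and 100 kb bin.

    Each hit is a dict with the (hypothetical) keys 'query', 'contig',
    'start' (0-based position in the contig), and 'score'.
    """
    best = {}
    for hit in hits:
        key = (hit["query"], hit["contig"], hit["start"] // window)
        if key not in best or hit["score"] > best[key]["score"]:
            best[key] = hit
    # Return the survivors in a stable, readable order.
    return sorted(best.values(), key=lambda h: (h["query"], h["contig"], h["start"]))
```

Hits from different queries are never merged, which mirrors the statement that the reduction applied when a single query produced multiple hits within a window.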
To increase the specificity of the searches, the query sequence that produced each of those hits was then divided into exon fragments (for the human SULTs only). Each window was then searched with either each of the exon sequences (human query sequences) or the intact peptide sequence (non-human or consensus sequences) as the query sequence, using two programs, TFastA and FrameSearch. Both programs are from the GCG sequence analysis package,52 are more sensitive sequence alignment programs than TBLASTN, and are designed to search a nucleotide database for similarity to a protein query sequence. The TFastA program employs the Pearson and Lipman k-tuple algorithm, and searches each reading frame independently. FrameSearch utilizes the Smith and Waterman algorithm for local alignments, and compares the query sequence to all possible codons on both strands of the nucleotide sequence, enabling it to identify significant sequence similarity even when frame shifts are present.

The number of hits produced by the two alignment programs was too large to annotate manually. Therefore, a filter was constructed using empirical data to reduce the number of hits that required annotation. The filter was designed after searching genomic contig sequences containing known SULT genes (positive controls) or random genomic sequence (negative controls), and analyzing the hits produced. Specifically, four SULT sequences (SULT1A1, SULT2A1, SULT4A1, and pfam00685) were each used to search three genomic contigs (NT_004610, NT_007874, and NT_008101) chosen at random, each approximately 500 kb in length. A total of 3080 hits were produced from those searches, and different methods of filtering the data were tested. The final filter criteria were found to be capable of eliminating most of the nonspecific hits, while retaining an adequate number of intermediate hits to allow the identification of novel SULT genes. In order for a hit to be retained for further analysis, it had to satisfy at least one of the three filter criteria. First, all hits at least 100 bp in length were retained so that long sequences with low percent identity were preserved. Second, all hits that shared at least 50% sequence identity to the query sequence and were at least 39 bp long were retained. That dual requirement eliminated the very short (but high percent identity) nonspecific hits, while still keeping all hits containing at least seven out of 13 matching amino acids. The third criterion, the product of the length and percent identity (LID), was designed to retain hits of both intermediate length and identity. Any hit with a LID score of at least 40 was retained. Of the 3080 hits from the random genomic contigs, 429 (14%) satisfied at least one of the three filter criteria. Therefore, approximately 12 hits are expected in every 500 kb of genomic sequence. This filter successfully reduced the number of hits that required manual annotation to a reasonable number, while still retaining a sufficient number of hits to allow the identification of novel SULT genes.
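The three filter criteria above can be expressed as a small Python sketch. The units of the LID score are not stated in the text; the example assumes that length is measured in bp and that identity is expressed as a fraction (0 to 1), so that, for instance, a 90 bp hit at 45% identity has a LID of 40.5. The class, field, and function names are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    length_bp: int    # length of the aligned region in nucleotides
    identity: float   # fraction of identical residues, 0.0 to 1.0 (assumed unit)


def lid_score(hit):
    """Product of length and identity (LID), under the assumed units."""
    return hit.length_bp * hit.identity


def passes_filter(hit, min_long=100, min_len=39, min_identity=0.50, min_lid=40.0):
    """Return True if the hit satisfies at least one of the three criteria."""
    long_enough = hit.length_bp >= min_long                                # criterion 1
    similar = hit.identity >= min_identity and hit.length_bp >= min_len   # criterion 2
    intermediate = lid_score(hit) >= min_lid                               # criterion 3
    return long_enough or similar or intermediate
```

A hit is kept as soon as any one criterion is met, so long low-identity hits and intermediate hits both survive, while very short hits are discarded even when their identity is high.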
A visual display was developed to improve the efficiency of the manual annotation step. For each 500 kb window of sequence, all hits from the two searches that met or exceeded the filter criteria were plotted to illustrate graphically the quality and context of those hits (Figure 4). Four types of information were conveyed on each graph. The location of each hit within the sequence window was indicated on the x-axis. The exon used as a query sequence and whether the hit was on the forward or reverse strand were indicated on the y-axis. The program that produced the hit (color of data point) and either the length or percent identity of the hit (size of the data point) were also indicated. On these plots, hits corresponding to a single gene tend to cluster together and form a diagonal, whereas nonspecific hits are readily apparent due to their isolation and random scatter (Figure 4). Microsoft Excel was used to create the plots, and a Visual Basic program was used to automate the process for all searches.

Following analysis of the first round of hits, seven novel SULT genes were identified. Since these sequences were not included as queries, the predicted amino-acid translations of four of them (SULTs 1C3, 1D1P, 6B1, and 3A1P) were used as query sequences for a second round of searches of the genome sequence database, using the same method described above. Those four sequences were chosen because they either represented a member of a new subfamily in humans or because they appeared to be functional genes.

HMMER and GeneWise Searches
A complementary approach to the BLAST-based method described above was also used. That strategy relied on HMMs rather than individual and explicit query sequences. The HMMER (version 2.2 g)24 and GeneWise (version 2.2.0)25 programs were chosen to perform those searches. A local HMM profile26 was constructed from the sequences of the 11 human SULTs that are known to be expressed. Since HMMER requires a protein database, a six-fold translation of the human genome database was performed. The E and domE parameters were set to 0.01. After manual annotation of the HMMER hits, genomic contigs were identified that contained either a SULT gene or a hit that could not be unambiguously classified as nonspecific. GeneWise was then used to search those contigs in an attempt to refine the HMMER sequence alignments. The hits from the GeneWise searches were also annotated manually.

Annotation of SULT Genes
To annotate novel SULT genes, we took advantage of the high degree of conservation among SULT gene structures.3 Specifically, the expected lengths of exons, the locations of splice junctions within the predicted cDNA, and the presence of GT/AG dinucleotides flanking exons were considered when annotating exons. Each hit corresponding to a SULT exon was refined using the GCG local sequence alignment program Gap. This process was used to annotate and assemble each of the gene structures manually. The map location and structure of each gene was verified using the most recent assembly of the NCBI human genome database (build 32). The structures of previously published genes were confirmed in a similar fashion. For those genes, the BLAST and Gap programs were used to compare the genomic sequence for each gene to reference mRNA sequences.
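The six-fold translation used for the HMMER searches above can be sketched in Python as follows. This is a minimal illustration using the standard genetic code; the function names are hypothetical, and the text does not state which tool performed the actual translation of the genome database.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; unknown codons (e.g. containing N) become X."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def six_frame_translation(dna):
    """Return translations for frames +1, +2, +3 and -1, -2, -3."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames
```

Stop codons are kept as `*` rather than used to split the sequence, so downstream code would still need to decide how to handle them.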
Expression Profiling of SULT1C3 and SULT6B1
Since there were no obvious inactivating mutations in the ORF or splice junctions of either SULT1C3 or SULT6B1, they appeared to be potentially functional genes. In an attempt to find evidence for their expression in humans, the predicted amino-acid sequences of SULT1C3 (isoforms SULT1C3a and SULT1C3d) and SULT6B1 were used as queries for TBLASTN searches of the human EST database.53 The results of the searches were filtered such that only those hits that were at least 75% identical or that were more than 30 bp long with at least 90% similarity were retained. All alignments that met those criteria were examined manually. To test experimentally for evidence of expression, 20 cDNAs (fetal lung, fetal brain, mammary gland, adrenal gland, and Multiple Tissue cDNA panels I and II) (Clontech, Palo Alto, CA) were used as template for the PCR with primers designed to hybridize with the predicted ORF sequences of SULT1C3 and SULT6B1 (primer sequences are available upon request). For SULT6B1, forward primers located in exons 2 and 3 were paired with reverse primers in exons 8 and 7, respectively, to perform nested PCR. Products were detected with testicular cDNA as template. Therefore, Marathon-Ready™ human testis cDNA (Clontech) was used as template to characterize the transcripts further. For SULT1C3, the primers were located in exons 7 and 8, but no products were observed following PCR amplification using cDNA from 20 human tissues. Therefore, overlap extension54 was used to construct a probe consisting of exons 5, 6, 7b, and 8b. The probe was radioactively labeled with 32P-dCTP by random priming (Oligolabeling Kit, Pharmacia, Piscataway, NJ, USA) and used to probe a Human Multiple Tissue Expression (MTE™) Array (Clontech) containing RNA from 76 tissues. No signal was detected.

SULT6B1 RACE and Transcript Characterization
To identify the site(s) of transcription initiation for SULT6B1, Marathon-Ready™ human testis cDNA was used as template for 5′- and 3′-RACE. For 5′-RACE, reverse primers were designed near the 3′-end of the predicted exon 2 sequence and within exon 3. 3′-RACE was performed using a series of nested primers located in the predicted sequences of exons 7 and 8. The RACE products were cloned into pCR2.1 (Invitrogen, Carlsbad, CA, USA). To characterize the splicing patterns in full-length SULT6B1 transcripts, forward primers designed in exons 1D, 1C, and 1A were paired with reverse primers located in the 3′-UTR. The PCR products were cloned into pCR 3.1 Uni (Invitrogen). The Wizard Plus Miniprep DNA Purification System (Promega, Madison, WI, USA) was used to isolate the DNA prior to sequencing. Primer sequences are available upon request.

Cloning SULT6B1 From Other Primate Species
To determine if SULT6B1 was present in other primate genomes, intron-based primers were designed that flanked each exon of the human gene sequence (sequences available upon request).
Those primers were used to PCR amplify each exon from gorilla and chimpanzee genomic DNA (Coriell Cell Repositories Primate Panel, repository numbers NG05251 and NG06939, respectively). The products were sequenced and alignments with human sequences were constructed with the GCG program Pileup.

DNA Sequencing and Analysis
DNA sequencing was performed in the Mayo Clinic Molecular Biology Core Facility with Applied Biosystems Model 377 DNA sequencers and BigDye™ (PerkinElmer) terminator sequencing chemistry. The University of Wisconsin GCG software package,52 Version 10.2, was used to analyze the nucleotide and protein sequences.

ACKNOWLEDGEMENTS
We thank Luanne Wussow for her assistance with the preparation of this manuscript, Michael Heldebrant and Harold Solbrig for their assistance with the GeneWise program, and Dr Rebecca Raftogianis Blanchard, Emily Aaronson, Richard Hwang, and Ray Mak for their contributions to the preliminary experimental characterization of novel genes. This work was supported in part by NIH grants RO1 GM35720 (RMW), UO1 GM61388 (RMW, CGC, EDW), and the Washington University Genome Analysis Training Program (NIH T32-HG00045) (RRF).

DUALITY OF INTEREST
Dr. Weinshilboum has either provided consulting services or presented seminars at Abbott Laboratories, Bristol-Myers Squibb, Eli Lilly, Johnson and Johnson, Roche and Merck, Inc. All fees and honoraria for these services and/or seminars were paid to Mayo Foundation. In addition, Drs. Wieben and Weinshilboum currently hold a peer-reviewed grant from Eli Lilly.

ABBREVIATIONS
HMM   hidden Markov model
MTC   multiple tissue cDNA
ORF   open reading frame
PAPS  3′-phosphoadenosine-5′-phosphosulfate
RACE  rapid amplification of cDNA ends
SULT  sulfotransferase
UTR   untranslated region

REFERENCES
1 Weinshilboum R, Otterness D. Conjugation–deconjugation reactions in drug metabolism and toxicity. Chapter 2, Sulfotransferase enzymes. In: Kauffman FC (ed) Handbook of Experimental Pharmacology Series, Vol 112. Springer-Verlag: Berlin, Heidelberg 1994; pp 45–78.
2 Falany CN. Enzymology of human cytosolic sulfotransferases. FASEB J 1997; 11: 206–216.
3 Weinshilboum RM, Otterness DM, Aksoy IA, Wood TC, Her C, Raftogianis RB. Sulfotransferase molecular biology: cDNAs and genes. FASEB J 1997; 11: 3–14.
4 Komatsu K, Driscoll WJ, Koh Y, Strott CA. A P-loop related motif (GxxGxxK) highly conserved in sulfotransferases is required for binding the activated sulfate donor. Biochem Biophys Res Commun 1994; 204: 1178–1185.
5 Driscoll WJ, Komatsu K, Strott CA. Proposed active site domain in estrogen sulfotransferase as determined by mutational analysis. Proc Natl Acad Sci USA 1995; 92: 12328–12332.
6 Marsolais F, Varin L. Identification of amino acid residues critical for catalysis and cosubstrate binding in the flavonol 3-sulfotransferase. J Biol Chem 1995; 270: 30458–30463.
7 Petrotchenko EV, Pedersen LC, Borchers CH, Tomer KB, Negishi M. The dimerization motif of cytosolic sulfotransferases. FEBS Lett 2001; 490: 39–43.
8 Raftogianis RB, Freimuth RR, Buck J, Weinshilboum RM, Coughtrie MWH. A proposed nomenclature system for the cytosolic sulfotransferase (SULT) superfamily. Pharmacogenetics 2004; in press.
9 Negishi M, Pedersen LG, Petrotchenko E, Shevtsov S, Gorokhov A, Kakuta Y et al. Structure and function of sulfotransferases. Arch Biochem Biophys 2001; 390: 149–157.
10 Fujita K, Nagata K, Ozawa S, Sasano H, Yamazoe Y. Molecular cloning and characterization of rat ST1B1 and human ST1B2 cDNAs, encoding thyroid hormone sulfotransferases. J Biochem 1997; 122: 1052–1061.
11 Her C, Kaur GP, Athwal RS, Weinshilboum RM. Human sulfotransferase SULT1C1: cDNA cloning, tissue-specific expression, and chromosomal localization. Genomics 1997; 41: 467–470.
12 Her C, Wood TC, Eichler EE, Mohrenweiser HW, Ramagli LS, Siciliano MJ et al. Human hydroxysteroid sulfotransferase SULT2B1: two enzymes encoded by a single chromosome 19 gene. Genomics 1998; 53: 284–295.
13 Sakakibara Y, Yanagisawa K, Katafuchi J, Ringer DP, Takami Y, Nakayama T et al. Molecular cloning, expression, and characterization of novel human SULT1C sulfotransferases that catalyze the sulfonation of N-hydroxy-2-acetylaminofluorene. J Biol Chem 1998; 273: 33929–33935.
14 Wang J, Falany JL, Falany CN. Expression and characterization of a novel thyroid hormone-sulfating form of cytosolic sulfotransferase from human liver. Mol Pharmacol 1998; 53: 274–282.
15 Falany CN, Xie X, Wang J, Ferrer J, Falany JL. Molecular cloning and expression of novel sulphotransferase-like cDNAs from human and rat brain. Biochem J 2000; 346: 857–864.
16 Aksoy IA, Weinshilboum RM. Human thermolabile phenol sulfotransferase gene (STM): molecular cloning and structural characterization. Biochem Biophys Res Commun 1995; 208: 786–795.
17 Her C, Aksoy IA, Kimura S, Brandriff BF, Wasmuth JJ, Weinshilboum RM. Human estrogen sulfotransferase gene (STE): cloning, structure, and chromosomal localization. Genomics 1995; 29: 16–23.
18 Otterness DM, Her C, Aksoy S, Kimura S, Wieben ED, Weinshilboum RM. Human dehydroepiandrosterone sulfotransferase gene: molecular cloning and structural characterization. DNA Cell Biol 1995; 14: 331–341.
19 Her C, Raftogianis R, Weinshilboum RM. Human phenol sulfotransferase STP2 gene: molecular cloning, structural characterization, and chromosomal localization. Genomics 1996; 33: 409–420.
20 Raftogianis RB, Her C, Weinshilboum RM. Human phenol sulfotransferase pharmacogenetics: STP1 gene cloning and structural characterization. Pharmacogenetics 1996; 6: 473–487.
21 Freimuth RR, Raftogianis RB, Wood TC, Moon E, Kim U-J, Xu J et al. Human sulfotransferases SULT1C1 and SULT1C2: cDNA characterization, gene cloning, and chromosomal localization. Genomics 2000; 65: 157–165.
22 Meinl W, Glatt H. Structure and localization of the human SULT1B1 gene: neighborhood to SULT1E1 and a SULT1D pseudogene. Biochem Biophys Res Commun 2001; 288: 855–862.
23 Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol 1990; 215: 403–410.
24 Eddy SR. HMMER: Profile Hidden Markov Models for Biological Sequence Analysis 2001. http://hmmer.wustl.edu.
25 Birney E, Copley R. Wise2 Software Package 2001. http://www.ebi.ac.uk/Wise2/.
26 Eddy SR. Profile hidden Markov models. Bioinformatics 1998; 14: 755–763.
27 Kozak M. An analysis of 5′-noncoding sequences from 699 vertebrate messenger RNAs. Nucleic Acids Res 1987; 15: 8125–8148.
28 Twigg SRF, Burns HD, Oldridge M, Heath JK, Wilkie AOM. Conserved use of a non-canonical 5′ splice site (/GA) in alternative splicing by fibroblast growth factor receptors 1, 2 and 3. Hum Molec Genet 1998; 7: 685–691.
29 Burset M, Seledtsov IA, Solovyev VV. Analysis of canonical and noncanonical splice sites in mammalian genomes. Nucleic Acids Res 2000; 28: 4364–4375.
30 Tarn W-Y, Steitz JA. A novel spliceosome containing U11, U12, and U5 snRNPs excises a minor class (AT–AC) intron in vitro. Cell 1996; 84: 801–811.
31 Reveley AM, Carter SMB, Reveley MA, Sandler M. A genetic study of platelet phenolsulfotransferase activity in normal and schizophrenic twins. J Psychiatr Res 1982/1983; 17: 303–307.
32 Price RA, Cox NJ, Spielman RS, Van Loon J, Maidak BL, Weinshilboum RM. Inheritance of human platelet thermolabile phenol sulfotransferase (TL PST) activity. Genet Epidemiol 1988; 5: 1–15.
33 Price RA, Spielman RS, Lucena AL, Van Loon JA, Maidak BL, Weinshilboum RM. Genetic polymorphism for human platelet thermostable phenol sulfotransferase (TS PST) activity. Genetics 1989; 122: 905–914.
34 Raftogianis RB, Wood TC, Otterness DM, Van Loon JA, Weinshilboum RM. Phenol sulfotransferase pharmacogenetics in humans: association of common SULT1A1 alleles with TS PST phenotype. Biochem Biophys Res Commun 1997; 239: 298–304.


Thomae BA, Rifki OF, Theobald MA, Eckloff BW, Wieben ED, Weinshilboum RM. Human catecholamine sulfotransferase (SULT1A3) pharmacogenetics: common functional genetic polymorphism. J Neurochem 2003; 87: 809–819.
Adjei AA, Thomae BA, Prondzinski JL, Eckloff BW, Wieben ED, Weinshilboum RM. Human estrogen sulfotransferase (SULT1E1) pharmacogenomics: gene resequencing and functional genomics. Br J Pharmacol 2003; 139: 1373–1382.
Thomae BA, Eckloff BW, Freimuth RR, Wieben ED, Weinshilboum RM. Human sulfotransferase SULT2A1 pharmacogenetics: genotype-to-phenotype studies. Pharmacogenomics J 2002; 2: 48–56.
Nowell S, Sweeney C, Winters M, Stone A, Lang NP, Hutchins LF et al. Association between sulfotransferase 1A1 genotype and survival of breast cancer patients receiving tamoxifen therapy. J Natl Cancer Inst 2002; 94: 1635–1640.
Nagata K, Yoshinari K, Ozawa S, Yamazoe Y. Arylamine activating sulfotransferase in liver. Mutat Res 1997; 376: 267–272.
Wilborn TW, Comer KA, Dooley TP, Reardon IM, Heinrikson RL, Falany CN. Sequence analysis and expression of the cDNA for the phenol-sulfating form of human liver phenol sulfotransferase. Mol Pharmacol 1993; 43: 70–77.
Zhu X, Veronese ME, Iocco P, McManus ME. cDNA cloning and expression of a new form of human aryl sulfotransferase. Int J Biochem Cell Biol 1996; 28: 565–571.
Zhu X, Veronese ME, Bernard CC, Sansom LN, McManus ME. Identification of two human brain aryl sulfotransferase cDNAs. Biochem Biophys Res Commun 1993; 195: 120–127.
Wang J, Falany JL, Falany CN. Expression and characterization of a novel thyroid hormone-sulfating form of cytosolic sulfotransferase from human liver. Mol Pharmacol 1998; 53: 274–282.
Sakakibara Y, Yanagisawa K, Katafuchi J, Ringer DP, Takami Y, Nakayama T et al. Molecular cloning, expression, and characterization of novel human SULT1C sulfotransferases that catalyze the sulfonation of N-hydroxy-2-acetylaminofluorene. J Biol Chem 1998; 273: 33929–33935.
Aksoy IA, Wood TC, Weinshilboum R. Human liver estrogen sulfotransferase: identification by cDNA cloning and expression. Biochem Biophys Res Commun 1994; 200: 1621–1629.
Otterness DM, Wieben ED, Wood TC, Watson WG, Madden BJ, McCormick DJ et al. Human liver dehydroepiandrosterone sulfotransferase: molecular cloning and expression of cDNA. Mol Pharmacol 1992; 41: 865–872.
Falany CN, Xie X, Wang J, Ferrer J, Falany JL. Molecular cloning and expression of novel sulphotransferase-like cDNAs from human and rat brain. Biochem J 2000; 346: 857–864.
Sakakibara Y, Yanagisawa K, Takami Y, Nakayama T, Suiko M, Liu MC. Molecular cloning, expression, and functional characterization of novel mouse sulfotransferases. Biochem Biophys Res Commun 1998; 247: 681–686.
Cao H, Agarwal SK, Burnside J. Cloning and expression of a novel chicken sulfotransferase cDNA regulated by GH. J Endocrinol 1999; 160: 491–500.
Sonnhammer EL, Eddy SR, Durbin R. Pfam: a comprehensive database of protein domain families based on seed alignments. Proteins 1997; 28: 405–420.
Bateman A, Birney E, Cerruti L, Durbin R, Etwiller L, Eddy SR et al. The Pfam protein families database. Nucleic Acids Res 2002; 30: 276–280.
Devereux J, Haeberli P, Smithies O. A comprehensive set of sequence analysis programs for the VAX. Nucleic Acids Res 1984; 12: 387–395.
Boguski MS, Lowe TM, Tolstoshev CM. dbEST – database for ‘expressed sequence tags’. Nat Genet 1993; 4: 332–333.
Ho SN, Hunt HD, Horton RM, Pullen JK, Pease LR. Site-directed mutagenesis by overlap extension using the polymerase chain reaction. Gene 1989; 77: 51–59.
