Dr.  Lhoëst  G.J.J.


Protein-Protein, Protein-Drug Interactions and MS

1. Introduction

Proteins can form multicomponent complexes that carry out specific functions.  These functional units may be as simple as dimeric transcription-factor complexes, or they may be large multimeric complexes.  The fact that proteins in higher organisms contain larger numbers of functional domains suggests that many of these proteins have multiple associations and that all proteins bind or interact with at least one other protein.  Understanding how protein complexes work is essential to understanding how cell systems work, and the first step in understanding these systems is to identify their components.  When the functions of newly discovered genes are investigated, a key clue is the association of the corresponding proteins with other proteins of known function.  Many protein kinase signaling complexes involve the association of a kinase, a phosphatase and regulatory proteins together with scaffolding proteins.  Whereas the basic biochemical functions of the kinases and phosphatases can be ascertained by identifying their catalytic domains, the involvement of the accessory proteins in the functional kinase signaling complex is inferred essentially from a demonstrated association of these proteins.

1.1  Three Dimensional Structure and Function of Proteins

Proteins can be described as chains of amino acids joined by peptide bonds in a specific sequence.  The chain is not simply linear but is folded into a compact shape that contains coils, zigzags, turns and loops.  Over the last decades the three-dimensional shapes, or conformations, of more than a thousand proteins have been elucidated.  A conformation is a spatial arrangement of atoms that depends on the rotation of a bond or bonds.  The conformation of a protein molecule can change without breaking covalent bonds, whereas the configuration of a molecule can be changed only by breaking and re-forming covalent bonds.  Since each amino acid residue has a number of possible conformations, and since a protein contains many residues, each protein has a very large number of potential conformations.  Nevertheless, under physiological conditions, each protein folds into a single stable shape known as its native conformation.  Hydrogen bonds and other weak interactions between amino acid residues constrain rotation around the covalent bonds of a polypeptide chain in its native conformation.  The biological function of a protein depends entirely on its native conformation.

A protein may be a single polypeptide chain or it may be composed of several polypeptide chains bound to each other by weak interactions.  Although exceptions exist, each polypeptide chain is encoded by a single gene.


Proteins from E. coli cells separated by two-dimensional electrophoresis.  In the first dimension the proteins are separated in a pH gradient, in which each protein migrates to its isoelectric point.  The second dimension separates the proteins by size on an SDS-polyacrylamide gel.  Each spot corresponds to a single polypeptide.

































The size of genes and of the polypeptides they encode can vary by more than an order of magnitude.  Some polypeptides contain only 100 amino acid residues and have a relative molecular mass of about 11 000, since the average relative molecular mass of an amino acid residue is 110.  Very large polypeptide chains contain more than 2000 amino acid residues (M = 220 000).  In some species, the size and sequence of every polypeptide can be deduced from the sequence of the genome.  There are about 4000 different polypeptides in the bacterium Escherichia coli, with an average size of 300 amino acid residues (M = 33 000).  Humans and other mammals have about 40 000 different polypeptides.  The study of large sets of proteins, such as the entire complement of proteins produced by a cell, is part of an emerging field called proteomics.
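The arithmetic behind these mass estimates can be written out as a minimal sketch.  It assumes only the average residue mass of 110 quoted above; the function name is illustrative, and real masses depend on the actual amino acid composition of each polypeptide.

```python
# Rough estimate of the relative molecular mass of a polypeptide from its
# length, using the average residue mass of 110 quoted in the text.
AVERAGE_RESIDUE_MASS = 110


def estimate_relative_mass(n_residues: int) -> int:
    """Return the approximate relative molecular mass for n_residues."""
    return n_residues * AVERAGE_RESIDUE_MASS


# Small, average (E. coli) and very large polypeptides from the text.
for n in (100, 300, 2000):
    print(n, "residues ->", estimate_relative_mass(n))
```

The estimate is only a first approximation, but it is convenient for judging where a polypeptide of known length should migrate on a size-based separation.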


Proteins come in a variety of shapes.  Many are water-soluble, compact, roughly spherical macromolecules whose polypeptide chains are tightly folded.  These globular proteins characteristically have a hydrophobic interior and a hydrophilic surface.  They possess indentations or clefts that specifically recognize and transiently bind other compounds.  By selectively binding other molecules, these proteins serve as dynamic agents of biological action.  Many globular proteins are enzymes, the biochemical catalysts of cells, and about 31 % of the polypeptides in E. coli are classical metabolic enzymes.  Other types of globular proteins include various factors, carrier proteins and regulatory proteins; 12 % of the known proteins in E. coli fall into these categories.  Polypeptides can also be components of large subcellular or extracellular structures such as ribosomes, flagella and cilia, muscle and chromatin.  Fibrous proteins are a particular class of structural proteins that provide mechanical support to cells or organisms.  They are typically assembled into large cables or threads.  Examples of fibrous proteins are α-keratin, the major component of hair and nails, and collagen, the major component of tendons, skin, bones and teeth.  Other examples of structural proteins include those that make up the protein components of viruses, bacteriophages, spores and pollen.  Many proteins are either integral components of membranes or membrane-associated proteins.  This category accounts for at least 16 % of the polypeptides in E. coli and a much higher percentage in eukaryotic cells.

1.1.1  The Four Levels of Protein Structure                                                                                 

Levels of protein structure:

a) Primary structure is the linear sequence of amino acid residues.

b) Secondary structure consists of regions of regularly repeating conformations of the peptide chain, such as α-helices and β-sheets.

c) Tertiary structure describes the shape of the fully folded polypeptide chain.  The example shown has two domains.

d) Quaternary structure refers to the arrangement of two or more polypeptide chains into a multisubunit molecule.














The primary structure describes the linear sequence of amino acid residues in a protein.  Amino acid sequences are always written from the amino terminus (N-terminus) to the carboxyl terminus (C-terminus).  The three-dimensional structure of a protein is described by three additional levels: secondary structure, tertiary structure and quaternary structure.  The forces responsible for maintaining or stabilizing these three levels are primarily noncovalent.

Secondary structure refers to regularities in local conformations maintained by hydrogen bonds between amide hydrogens and carbonyl oxygens of the peptide backbone.  The major secondary structures are α-helices and β-strands (including β-sheets).  Cartoons showing the structures of folded proteins usually represent α-helical regions by helices and β-strands by broad arrows pointing in the N-terminal to C-terminal direction.

Tertiary structure describes the completely folded and compacted polypeptide chain.  Many folded polypeptides consist of several distinct globular units linked by a short stretch of amino acid residues; such units are called domains.  Tertiary structures are stabilized by the interactions of amino acid side chains in nonneighboring regions of the polypeptide chain.  The formation of tertiary structure brings distant portions of the primary and secondary structures close together.

Some proteins possess quaternary structure which involves the association of two or more polypeptide chains into a multisubunit or oligomeric protein.  The polypeptide chains of an oligomeric protein may be identical or different.


1.1.2  Methods for Determining Protein Structure





Diagram of X rays diffracted by a protein crystal


The primary structure, that is the amino acid sequence, of a polypeptide can be determined directly by chemical methods such as the Edman degradation or indirectly from the sequence of the gene.  The usual technique for determining the three-dimensional conformation of a protein is X-ray crystallography.  A beam of collimated, or parallel, X rays is aimed at a crystal of protein molecules.  Electrons in the crystal diffract the X rays, which are recorded on film or by an electronic detector.



A) Crystal of myoglobin


B) X-ray diffraction pattern of a myoglobin crystal









Mathematical analysis of the diffraction pattern produces an image of the electron clouds surrounding the atoms in the crystal.  This electron density map reveals the overall shape of the molecule and the position of each atom in three-dimensional space.  By combining these data with the principles of chemical bonding, it is possible to deduce the location of all the bonds in a molecule and hence its overall structure.  The technique of X-ray crystallography has developed to the point where it is possible to determine the structure of a protein without precise knowledge of the amino acid sequence.  Nevertheless, knowledge of the primary structure makes fitting of the electron density map much easier at the stage where the chemical bonds between atoms are determined.

The determination of protein structures is now limited mainly by the difficulty of preparing crystals of a quality suitable for X-ray diffraction.  A protein crystal contains a large number of water molecules, and it is often possible to diffuse small ligands such as substrate or inhibitor molecules into the crystal.  Often the proteins within the crystal retain their ability to bind these ligands, and in many cases they exhibit catalytic activity.  The catalytic activity of enzymes in the crystalline state demonstrates that the proteins crystallize in their in vivo native conformations.

Once the three-dimensional coordinates of the atoms of a macromolecule have been determined, they are deposited in a data bank where they are available to other scientists.  For example, data files from the Protein Data Bank (PDB) may be used.
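As an illustration of how such coordinate files can be read, the following minimal sketch parses ATOM records laid out in the fixed-column PDB text format and computes the distance between two atoms.  The two records are synthesized inside the code with made-up coordinates rather than taken from a real PDB entry, and the helper names are assumptions introduced only for this example.

```python
import math


def format_atom(serial, name, res, chain, resseq, x, y, z):
    """Build a record in the fixed-column layout of a PDB ATOM line."""
    return (f"ATOM  {serial:>5} {name:<4} {res:>3} {chain}{resseq:>4}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}")


def parse_atom(line):
    """Extract the atom name, residue, chain and coordinates by column."""
    return {
        "name": line[12:16].strip(),
        "residue": line[17:20].strip(),
        "chain": line[21],
        "resseq": int(line[22:26]),
        "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
    }


# Hypothetical coordinates (in angstroms), for illustration only.
records = [
    format_atom(1, "N", "GLY", "A", 1, 0.0, 0.0, 0.0),
    format_atom(2, "CA", "GLY", "A", 1, 1.458, 0.0, 0.0),
]

atoms = [parse_atom(r) for r in records if r.startswith("ATOM")]
distance = math.dist(atoms[0]["xyz"], atoms[1]["xyz"])
print(atoms[0]["name"], "-", atoms[1]["name"], f"distance: {distance:.3f}")
```

For real work a maintained structure-parsing library is preferable, since actual PDB files contain many other record types and optional fields that this sketch ignores.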

There are many ways of showing the three-dimensional structure of proteins:


Native FKBP, FK506 binding protein

Space-filling model









Ribbon model of the polypeptide chain of native FKBP












Space-filling models depict each atom as a solid sphere.  These images reveal the dense, closely packed nature of folded polypeptide chains.  Space-filling models are used to illustrate the overall shape of a protein and the surface exposed to aqueous solvent.  The interior of a folded protein is nearly impenetrable, even by small molecules such as water.  The structure of a protein can also be described by a simplified cartoon that emphasizes the backbone of the polypeptide.  In these models the amino acid side chains have been eliminated, making it easier to see how the polypeptide folds into a three-dimensional structure.  Such models have the advantage of allowing us to see into the interior of the protein, and they reveal elements of secondary structure such as α-helices and β-strands.


2.  Identifying Protein-Protein Interactions

The detection of the association of proteins with each other in cellular systems has come mainly from two types of experiments:

The first type of experiment involves the immunoprecipitation of a protein of interest together with any associated proteins (see figure of the dissection of a multiprotein complex by immunoprecipitation and Western-blot analysis).  The proteins are then analyzed by 1D-SDS-PAGE and electrophoretically transferred to a membrane, and the membrane is probed with antibodies against the proteins suspected of being partners of the target protein.  This approach requires that antibodies to these proteins are available.  Such antibody "pull-down" experiments are very useful tools for confirming suspected protein-protein interactions.  However, the approach precludes the detection of unanticipated members of multiprotein complexes.  In the figure shown, no antibody to the protein marked "+" is available, and it is not detected even though it is present in the complex.


The second approach is the yeast two-hybrid system.  In this experiment the interaction between two proteins is detected indirectly.

Each of the genes that code for the two proteins of interest (Pr1 and Pr2 in the figure shown) is fused to the coding sequence for one component of a transcription factor, and the pair of hybrid genes is then expressed in yeast.  The transcription-factor components (a DNA-binding domain, DBD, and an activation domain, AD) encoded by the two different hybrids will activate a reporter gene in the yeast, but only if they become associated with each other to form an active transcription factor.  This happens only when the two gene products of interest interact with each other to form a complex.  When the hybrid proteins of interest form a complex, the transcription-factor pieces are also brought together, the reporter gene is activated and a signal is detected.  This assay has done much to help establish protein-protein interactions for proteins from a variety of species.
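The logic of the two-hybrid readout can be summarized in a small sketch.  It is a toy model only: the protein names Pr1, Pr2 and Pr3 and the set of interacting pairs are hypothetical, and the real assay is a biological experiment in which the interaction set is exactly what is unknown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hybrid:
    protein: str  # protein of interest fused to the domain
    domain: str   # "DBD" (DNA-binding domain) or "AD" (activation domain)


def reporter_active(h1, h2, interactions):
    """The reporter is on only if one hybrid carries the DBD, the other
    carries the AD, and the two fused proteins interact."""
    has_both_domains = {h1.domain, h2.domain} == {"DBD", "AD"}
    proteins_interact = frozenset((h1.protein, h2.protein)) in interactions
    return has_both_domains and proteins_interact


# Hypothetical ground truth used to simulate the experiment.
interactions = {frozenset(("Pr1", "Pr2"))}

bait = Hybrid("Pr1", "DBD")
for prey_name in ("Pr2", "Pr3"):
    prey = Hybrid(prey_name, "AD")
    state = "ON" if reporter_active(bait, prey, interactions) else "off"
    print(f"{bait.protein} x {prey.protein}: reporter {state}")
```

The model makes the indirect nature of the detection explicit: the signal reports the reassembly of the transcription factor, not the protein-protein contact itself.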


2. Proteomics       

2.1 Introduction to Proteomics                                                                 

2.2 Tools of Proteomics

2.3 Applications of Proteomics

2.4 The Proteome     

2.4.1 The Proteome and the Genome

2.4.2 The Life and Death of a Protein

2.4.3 Proteins as Modular Structure

2.4.4 Functional Protein Families

2.4.5 Deducing the Proteome from the Genome

2.4.6 Gene Expression, Codon Bias and Protein Levels

3. Analytical Proteomics    

3.1 Isolation of Proteins and Peptides

3.1.1 Complex Protein and Peptide Mixtures

3.1.2  Extracting Proteins from Biological Samples

3.1.3 Protein Separation Before Digestion        

3.1.4 Protein Separation After Digestion

3.1.5 Protein Digestion Techniques




3.2 Identification of Proteins by Mass Spectrometry

3.2.1  Peptide Sequence Analysis by MS/MS

3.2.2  Peptide Ion Fragmentation in MS/MS

3.2.3  The Tandem Mass Spectrum

3.2.4  Peptides Containing Proline

3.2.5  Identification of Proteins with MS/MS Data

Identifying Proteins from ESI/MS-MS Data: SEQUEST

Other Algorithms and Software Tools for Identifying Proteins from MS-MS Data

The SALSA Algorithm


 2.  Proteomics and Mass Spectrometry      

2.1 Introduction to Proteomics       

Proteomics aims to document the overall distribution of proteins in cells, to identify and characterize individual proteins of interest and, finally, to elucidate their relationships and functional roles.  Genomics cannot adequately predict the structure or dynamics of proteins, yet it is at the protein level that most regulatory processes take place, where disease processes often occur and where most drug targets are to be found.  Proteomics and genomics investigate the molecular organization of the cell at complementary levels, proteins and genes, and the two disciplines act synergistically, each increasing the effectiveness of the other.  Until the mid-1990s cell biologists studied individual genes and proteins using the techniques then available: Northern blots for gene expression and Western blots for proteins.  The biological landscape was changed by three main developments:

a) The growth of gene, expressed sequence tag (EST) and protein-sequence databases during the late 1990s.  Complete genomic sequences of bacteria, yeast and Drosophila have been obtained, and recently the complete sequence of the human genome was accomplished.  Genomes of plants and of widely studied animals are also approaching completion.  These genome sequence databases are the catalogs from which part of our understanding of living systems will be extracted.

b) The introduction of user-friendly bioinformatics tools to extract information from these databases.  Such database search tools are integrated with other tools and databases to predict the functions of protein products on the basis of the occurrence of specific functional domains or motifs.

c) The last development is the oligonucleotide microarray.  The array contains a series of gene-specific oligonucleotides or cDNA sequences on a slide or a chip.  If a mixture of fluorescently labeled cDNAs from a sample of interest is applied to the array, the expression of thousands of genes can be probed at once.  The technique can replace thousands of Northern blot analyses, and with two-colour fluorescent probe labeling the expression of genes in two different samples can be compared directly on one slide or chip.  A yeast cDNA microarray is illustrated at the following address:    From this single array the expression of all genes in the yeast genome can be assessed, and the whole system can be seen at once.  However, the information contained in these thousands of data points is beyond our unaided powers of interpretation, and new bioinformatic tools are progressively being developed to facilitate the work of biologists.  The life and death of a cell is dictated by the expression of genes and the activities of their protein products.  Each protein, such as a transmembrane receptor, transcription factor, protein kinase or chaperone, expresses its function among the many other functions and activities expressed in the same cell.
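The two-colour comparison described above is commonly summarized, spot by spot, as a base-2 logarithm of the ratio of the two fluorescence intensities.  The following minimal sketch shows that calculation; the gene names, intensity values and the twofold threshold are hypothetical, and real analyses also require background subtraction and normalization, which are omitted here.

```python
import math


def log2_ratio(sample_intensity, reference_intensity):
    """Log2 ratio of the two dye signals measured on one spot."""
    return math.log2(sample_intensity / reference_intensity)


def classify(ratio, threshold=1.0):
    """Label a spot using an assumed twofold (log2 = 1) cut-off."""
    if ratio >= threshold:
        return "up"
    if ratio <= -threshold:
        return "down"
    return "unchanged"


# Hypothetical (sample, reference) intensities for three spots.
spots = {
    "gene_A": (1200.0, 300.0),
    "gene_B": (500.0, 500.0),
    "gene_C": (100.0, 800.0),
}

results = {gene: log2_ratio(s, r) for gene, (s, r) in spots.items()}
for gene, ratio in results.items():
    print(f"{gene}: log2 ratio {ratio:+.2f} ({classify(ratio)})")
```

Working on the log scale makes a fourfold increase and a fourfold decrease symmetric around zero, which is why this representation is generally preferred to raw ratios.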

Analyses in proteomics are directed at complex mixtures, and identification is achieved not by complete sequence analysis but by partial sequence analysis with the aid of database-matching tools.  The context is more systems biology than structural biology.  Individual protein analysis, complete sequence analysis, and structure and function determination belong to the field of protein chemistry.

Although gene microarrays offer a picture of the expression of many or all genes in a cell, the levels of mRNAs unfortunately do not necessarily predict the levels of the corresponding proteins.  Factors such as mRNA stability and differing efficiencies of translation may affect the generation of new proteins.  Also, many proteins involved in signal transduction, transcription-factor regulation and cell-cycle control are rapidly turned over as a means of regulating their activities.  Moreover, proteins are subject to many posttranslational modifications and to other modifications by environmental agents.

2.2  Tools of Proteomics

The proteome and its components may be studied nowadays because of the development and integration of four important tools providing the investigators with sensitive and specific means of identifying and characterizing proteins.

a) The first tool is the database.

Protein, EST and complete genome-sequence databases collectively provide a catalog of all proteins expressed in the organisms for which such databases are available.

b) The second tool is mass spectrometry (MS).

Modern mass spectrometry is able to analyze biomolecules such as proteins and peptides.  MS can provide accurate molecular mass measurements of intact proteins as large as 100 kDa or more.  These protein mass measurements are, however, generally of limited utility, because they are often not sensitive enough and because net mass alone is often insufficient for unambiguous protein identification.  MS can also provide accurate mass measurements of peptides from proteolytic digests.  In contrast to whole-protein mass measurements, peptide mass measurements can be made with higher sensitivity and mass accuracy.  The data from these peptide mass measurements can be searched directly against databases and frequently yield the identification of the target protein.  MS can also provide sequence analysis of peptides obtained from proteolytic digests, and mass spectrometry is now considered the state of the art in peptide sequence analysis.
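The idea of searching peptide masses against a database can be sketched as follows.  The example performs an in silico tryptic digest (cleavage after lysine or arginine, except before proline), computes monoisotopic peptide masses from standard residue masses, and counts how many observed masses match each database entry.  The two protein sequences, the "observed" masses (simulated here from one of the entries) and the tolerance are hypothetical, and the simple match count is a stand-in for the scoring schemes used by real search software.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # added once per peptide for the free termini


def peptide_mass(peptide):
    """Monoisotopic mass of a neutral, unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def tryptic_digest(sequence):
    """Cleave after K or R unless the next residue is P."""
    peptides, start = [], 0
    for i, aa in enumerate(sequence):
        next_aa = sequence[i + 1] if i + 1 < len(sequence) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides


def count_matches(observed, sequence, tolerance=0.01):
    """Number of observed masses explained by the protein's peptides."""
    theoretical = [peptide_mass(p) for p in tryptic_digest(sequence)]
    return sum(any(abs(o - t) <= tolerance for t in theoretical)
               for o in observed)


# Hypothetical two-entry database and simulated measurement.
database = {
    "protein_A": "MAGKLVESRPTWYKDDFHR",
    "protein_B": "MSTNQKGGCIELR",
}
observed = [peptide_mass(p) for p in tryptic_digest(database["protein_A"])]

best = max(database, key=lambda name: count_matches(observed, database[name]))
print("Best match:", best)
```

Because the measured masses are compared with masses predicted from sequences, the approach works only for proteins whose sequences, or close relatives, are already present in the database.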

c) The third tool for proteomics is an emerging collection of software.

Software tools exist that can match MS data with specific protein sequences in databases.  It is possible to determine the sequence of a peptide from MS data alone, but this de novo sequence interpretation is a laborious task when hundreds of spectra are to be interpreted.  These software tools take uninterpreted MS data and match them to sequences in protein, EST and genome-sequence databases with the aid of specialized algorithms.

d) The fourth essential tool is analytical protein-separation technology.

Protein separations simplify complex protein mixtures by resolving them into individual proteins or small groups of proteins.  Analytical separations also allow investigators to target specific proteins for analysis.  2D-SDS-PAGE is the technique most widely associated with proteomics, and two-dimensional gels represent one of the best techniques for resolving the proteins in a complex sample.  Nevertheless, other techniques may be used, such as high-performance liquid chromatography (HPLC), capillary electrophoresis (CE), isoelectric focusing (IEF) and affinity chromatography.  A combination of techniques, such as ion-exchange liquid chromatography (LC) in tandem with reverse-phase (RP) HPLC, is a powerful tool for resolving complex peptide mixtures.

2.3  Applications of Proteomics

Four principal applications of proteomics are in current practice: mining, protein-expression profiling, protein-network mapping and mapping of protein modifications.

a) Mining

Mining consists of the identification of all, or as many as possible, of the proteins in a sample.  The proteome is analyzed directly, and its composition is not inferred from expression data for genes (e.g. from microarrays).  Proteins are resolved to the greatest extent possible, and MS together with the associated databases and software tools is then used for identification.  Although there are several approaches to mining, what they collectively offer is the ability to confirm by direct analysis what could otherwise only be deduced from gene-expression data.

b) Protein-expression profiling

Protein-expression profiling is the identification of proteins in a particular sample either as a function of a particular state of the organism or cell (e.g. developmental state or disease state) or as a function of exposure to a drug, chemical or physical stimulus.  Protein-expression profiling is most commonly practiced as a differential analysis in which two states of a particular system are compared.  Normal and diseased cells or tissues can be compared to determine which proteins are expressed differently in one state compared to the other.  This may be considered a means of detecting potential targets for drug therapy in disease.

c) Protein-network mapping

Protein-network mapping is a specialized area of proteomics that tries to determine how proteins interact with each other in living systems.  Proteins generally carry out their functions in association with other proteins.  These interactions determine the functions of protein networks, such as signal-transduction pathways and complex biosynthetic or degradation pathways.  Much has been learned about protein-protein interactions through in vitro studies with individual purified proteins.  Through the creative pairing of affinity-capture techniques with analytical proteomics methods, more complex networks can be characterized.  Multiprotein complexes are also involved in point-to-point signal-transduction pathways in cells, and protein-network profiling would offer the ability to assess the status of all the participants in a pathway at once.  This area represents one of the most potentially powerful future applications of proteomics.

d) Mapping of protein modifications

Mapping of protein modifications is the task of identifying how and where proteins are modified.  Many posttranslational modifications govern the targeting, structure, function and turnover of proteins.  In addition, environmental chemicals, drugs and endogenous compounds give rise to reactive electrophiles that modify the structure of proteins.  Some modified proteins, such as those carrying specific phosphorylated amino acid residues, can be detected with antibodies, but the precise sequence sites of the modification then remain unknown.  The nature and sequence specificity of posttranslational modifications are best determined by proteomics approaches.  How chemical modifications of the proteome affect living systems is a further question that these approaches can address.


2.4 The Proteome     

2.4.1  The proteome and the genome

Although each of our cells contains all the information necessary to make a complete human being, not all genes are expressed in all cells.  Genes coding for enzymes essential to basic cellular functions such as glucose catabolism and DNA synthesis are expressed in practically all cells, whereas genes with highly specialized functions are expressed only in specific cell types, such as rhodopsin in the rod photoreceptor cells of the retina.  Consequently, cells may be considered to express on the one hand genes whose protein products provide essential functions and on the other hand genes whose protein products provide particular cell-specific functions.  Every organism has one genome but many proteomes, and any cell represents some subset of all possible gene products.  Any protein, although the product of a single gene, may exist in multiple forms that vary within a cell or between different cells.  Most proteins exist in several modified forms, and these modifications affect protein structure, localization, function and turnover.

Five different aspects of the proteome will be described:

a) The life-cycle of proteins may be summarized starting from their appearance as translation products in ribosomes to their modifications and degradation.

b) Proteins may be considered as modular structures classified in groups based on sequence motifs, domain structures and biochemical functions.

c) Functional families of proteins are closely related to the distribution of the genome.

d) The proteome is closely related to genomic sequences, which indicate the diversity and redundancy of functions in living systems.

e) The factors that dictate how much of any protein is present in a cell at a given time will be briefly described.


2.4.2  The Life and Death of a Protein

Proteins are synthesized by the translation of mRNAs into polypeptides on ribosomes.  The initial polypeptide-translation product undergoes some type of modification before it assumes its functional role in a living system.  These changes are defined broadly as posttranslational modifications and include a wide variety of reversible and irreversible chemical reactions.  About 200 different types of posttranslational modification have been reported, and some of them are described below to illustrate the life cycle of a protein.

A protein is formed as a ribosomal translation product of an mRNA sequence.  Folding and the oxidation of cysteine thiols to disulfides confer secondary structure on the random-coil polypeptide.  Carboxylation of glutamate residues or removal of the N-terminal methionine are modifications that can occur early in the life of a polypeptide.  Further processing in the Golgi apparatus often results in glycosylation.  Leader or signal sequences, which may subsequently be cleaved proteolytically, frequently direct the delivery of a protein to specific subcellular or extracellular compartments.  Combination with other proteins gives rise to multisubunit complexes.  Anchoring of proteins in or on membranes is favoured by palmitoylation or prenylation of cysteine residues.  These modifications, which are more or less permanent, result in the delivery of functional proteins to specific locations in cells.

At their cellular destination, proteins carry out their many functions, and the activities of many proteins are then controlled by posttranslational modifications.  Phosphorylation of serine, threonine or tyrosine residues is among the best understood of these modifications.  Phosphorylation may activate or inactivate enzymes, alter protein-protein interactions and associations, change protein structure and target proteins for degradation.  Protein phosphorylation also appears to be a key switch for rapid on-off control of signaling cascades, cell-cycle control and other key cellular functions.

Proteins may be oxidized by the free radicals and other oxidants that are ubiquitous in biological systems, which leads to oxidative protein damage.  Cysteine thiols and methionine, tryptophan, histidine and tyrosine residues are easily oxidized.  Products of lipid and carbohydrate oxidation are also able to react with proteins.  Environmental agents including radiation, chemicals and drugs can oxidize or covalently modify proteins.  These chemical modifications may inactivate proteins, and they all produce some modification of protein structure.

Degradation of proteins is often initiated by protein modifications.  Phosphorylation of some proteins is rapidly followed by conjugation with ubiquitin, which leads to degradation by the 26S proteasomal complex.  Other stimuli for protein ubiquitination exist, such as oxidative damage and other protein modifications.  Proteins may also be degraded by lysosomal enzymes.

2.4.3  Proteins as Modular Structure

Proteins may be considered as modular or mosaic structures.  Certain amino acid sequences tend to form secondary structures such as α-helices, β-sheets or random-coil structures.  Specific amino acid sequences, and the secondary structures derived from them, confer special properties and functions.  These functional building blocks or modules may be considered as segments of amino acid sequence from which Mother Nature has assembled a tool box for building proteins with diverse but related functions.















An α/β-domain structure is illustrated by triose phosphate isomerase, in which the β-strands (designated by arrows pointing in the N-terminal to C-terminal direction) are wound into a β-barrel.  The β-strands in the interior of the β-barrel are interconnected by α-helical regions of the polypeptide chain on the outside of the molecule.



The modular units in proteins that confer specific properties and functions are known as "motifs" or "domains".  They are recognizable sequences that confer similar properties or functions when they occur in a variety of proteins.  Sometimes the amino acid sequences of motifs and domains are highly conserved and do not vary from protein to protein.  In other cases some key amino acids occur in a reproducible relationship to each other in a sequence, even though various substitutions occur at other positions.  Moreover, some short sequences can confer specificity for certain modifications.  For example, proteins that undergo N-glycosylation tend to display the tripeptide sequence "Asn-X-Ser/Thr", in which the target asparagine is followed by any amino acid X and then by a serine or a threonine residue.  If X is a proline, glycosylation does not occur.  Although this sequence does not ensure glycosylation, it provides a signature motif that can offer clues to possible biochemical roles.
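Searching a sequence for the N-glycosylation signature described above is a simple pattern-matching task, as the following minimal sketch shows.  The test sequence is invented for illustration, and, as noted in the text, a match indicates only a potential glycosylation site.

```python
import re

# Asn, then any residue except Pro, then Ser or Thr.  The lookahead
# leaves the following residues unconsumed so overlapping sites are found.
SEQUON = re.compile(r"N(?=[^P][ST])")


def find_sequons(sequence):
    """Return 1-based positions of asparagines in Asn-X-Ser/Thr motifs."""
    return [match.start() + 1 for match in SEQUON.finditer(sequence)]


# Hypothetical sequence: N at 2 (N-A-S) and 10 (N-G-T) match,
# N at 6 does not because it is followed by proline.
example = "MNASQNPTLNGT"
print(find_sequons(example))
```

Reporting 1-based positions follows the usual convention for numbering amino acid residues from the N-terminus.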

Domains are often formed by longer amino acid sequences that confer specific properties or functions on a protein.  Domain structures sometimes refer to sequences that confer a bulk physical property on a segment of the polypeptide, such as transmembrane domains, which simply form helices that span a lipid bilayer membrane.  Other domain structures provide hydrogen bonding or other interactions for key enzyme substrates or prosthetic groups.  Domains are often made up of combinations of units of secondary structure, as in helix-loop-helix domains.  The significance of motifs and domains for proteomics is that they represent the translation of peptide sequence into protein function.

2.4.4  Functional Protein Families

The proteome may be subdivided into families of proteins that carry out related functions.  Proteins may participate, for example, in signaling pathways, in the synthesis of nucleic acids or in carbohydrate catabolism.  Enzymes involved in intermediary metabolism and nucleic acid metabolism account for about 15 % of the proteins represented in the proteome.  Proteins associated with structure, protein synthesis and turnover (cytoskeletal proteins, ribosomal proteins, chaperones and mediators of protein degradation) account for another 15-20 %.  Signaling proteins and DNA binding proteins account for 20-25 %.  Roughly 40 % of the genome encodes protein products with no known function, and assigning these functions is a fundamental challenge for human functional genomics.

2.4.5  Deducing the Proteome from the Genome

Full genomic sequences of several organisms have been completed, which has allowed analysts to predict the products of all of an organism's genes.  Based on the predicted amino acid sequence of each gene product, these proteins have been classified on the basis of the domains and sequence motifs they contain.  Interesting relationships may be revealed between the size of the genomes and the predicted content of the proteomes for certain organisms.  Comparison of all the predicted protein products indicated the occurrence of proteins whose sequence differed only slightly from others in the genome.  Correction for these redundant protein products, called paralogs, allowed the calculation of a core proteome for each organism.  This core proteome represents the basic collection of protein families for an organism.  It appears that the relationship between the complexity of an organism and the number of genes in its genome is not simple, because more complex regulation of the genes and of the functions of the protein products may account for the greater complexity.  The human genome sequence encodes between 30,000 and 40,000 genes.  Taking into account the great difference in complexity between the human organism and the worm, for example, it is surprising that the human genome encodes only about twice as many genes as that of the worm.  The complexity of the human organism most probably lies in the diversity of the human proteome rather than in the size of the human genome.

2.4.6  Gene Expression, Codon Bias and Protein Levels

Expression levels of proteins vary from a few copies to more than a million copies per cell, and it is important to realize in this context that the level at which a protein is expressed in a cell has little to do with its significance.  Essential enzymes of intermediary metabolism or structural proteins often are present at levels in the thousands of copies per cell, whereas protein kinases involved in cell-cycle regulation are found at only tens of copies per cell.  The level of any protein in a cell is controlled by

      a) the rate of transcription of the gene

      b) the efficiency of translation of mRNA into protein

      c) the rate of degradation of the protein in the cell

Gene expression can dictate protein levels to a certain extent, but the influence of the two other factors has to be taken into account.  Many genes are regulated by inducible transcription factors, which are regulated in turn by a wide variety of environmental influences.  An intrinsic determinant of the level of expression of many genes is a phenomenon referred to as "codon bias".  This term describes the tendency of an organism to prefer certain codons over others that code for the same amino acid in the gene sequence.  Consequently, genes containing codon variants that are less preferred tend to be expressed at a lower level.  Calculated codon bias values for yeast genes range from 0.2 to 1.0, where a value of 1.0 favors the highest level of gene expression.  Studies in yeast have compared protein levels, mRNA expression and codon bias for a number of proteins, and the following generalizations can be drawn:

1.  Genes with low codon bias values tend to be expressed at low level, whether analyzed on the basis of mRNA expression or protein levels.

2. mRNA levels correlate poorly (r<0.4) with protein levels when genes with codon bias values of 0.25 or less are considered.  Correlation between mRNA levels and protein levels is much higher (r > 0.85) for the most highly expressed genes presenting codon bias values above 0.5.

3. Longer-lived proteins appear to be present in higher abundance than short-lived proteins.

Although gene-expression measurements may indicate changes in protein levels, it is difficult to infer protein expression from gene expression.


3. Analytical Proteomics    

Proteins are macromolecules that quantitatively represent the main biochemical material in animals, contributing 2/3 of the organic substance.  The amino acid sequence, including the posttranslational modifications (alteration of the amino acid side chains, addition of carbohydrates, formation of disulfide bonds ...), is defined as the primary structure of a protein.  Recent progress in biochemistry and biotechnology is closely related to the determination of the amino acid sequences of peptides (less than 10 kDa in mass) and proteins (more than 10 kDa in mass).  Techniques for isolating, purifying and sequencing proteins and peptides (mapping) are available.  They are based on

      a) the isolation of the protein/peptide

      b) sequential chemical degradation of the molecule

      c) identification of the released amino acid residues

Considerable progress was made in protein and peptide isolation through the introduction of 1- and 2-dimensional gel electrophoresis combined with sensitive staining procedures to visualize the protein spots.  Sodium dodecylsulfate-polyacrylamide gel electrophoresis (SDS-PAGE), with an optional electro-transfer to polyvinylidene difluoride (PVDF) membranes, has become the most widely used method of preparing protein samples for primary structure analysis, mainly for proteins that can only be recovered in very small quantities (pmoles or less).

3.1  Isolation of proteins and peptides

To give the MS instruments a better opportunity to obtain useful data on the components of a mixture, two steps are needed: proteins must be converted to peptides by proteolytic enzymes, and complex mixtures of proteins or peptides must be separated into somewhat less complex mixtures.  There is no obligatory order for these two steps: one can either first separate the proteins, then digest them and separate the peptides, or digest a complex mixture of proteins to peptides and then resolve the peptides.

3.1.1  Complex Protein and Peptide Mixtures

The chances of identifying many peptides in a mixture are increased when the complexity of the mixture is decreased.  Based on the number of known human genes, a typical human cell may contain about 20,000 different expressed proteins.  Assuming an average molecular weight of 50 kDa and an average content of lysine and arginine residues, each protein would give about 30 tryptic peptides and consequently one cell's proteins would yield about 600,000 tryptic peptides.
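The order of magnitude of this estimate can be checked with a few lines of arithmetic.  The sketch below uses only the figures quoted above (about 20,000 expressed proteins, 50 kDa average mass, about 30 tryptic peptides per protein); the average residue mass of roughly 110 Da is an assumption added here as a rule of thumb to relate molecular weight to chain length.

```python
# Figures quoted in the text above.
expressed_proteins = 20_000
avg_protein_mass_da = 50_000
tryptic_peptides_per_protein = 30

# Assumed rule-of-thumb value, not taken from the text.
avg_residue_mass_da = 110

residues_per_protein = avg_protein_mass_da / avg_residue_mass_da
avg_peptide_length = residues_per_protein / tryptic_peptides_per_protein
total_peptides = expressed_proteins * tryptic_peptides_per_protein

print(f"residues per protein  : {residues_per_protein:.0f}")
print(f"average peptide length: {avg_peptide_length:.1f}")
print(f"peptides per cell     : {total_peptides:,}")
```

Under these assumptions the product of the two quoted figures is 600,000 peptides, with an average peptide length of about 15 residues.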

3.1.2 Extracting Proteins from Biological Samples

A biological sample is a piece of tissue, a plate of cultured cells, a flask of bacteria, a leaf and so on.  The sample is usually pulverized, homogenized, sonicated or otherwise disrupted to yield a soup containing cells, subcellular components and biological debris in an aqueous buffer or suspension.  Proteins are extracted from this soup by a number of techniques.  The objective for proteomic analysis is to recover as much of the protein as possible with as little contamination as possible by other biomaterials such as lipids, cellulose, nucleic acids etc.  This is done with the aid of:

a) Detergents

Detergents such as SDS, 3-([3-cholamidopropyl]dimethylammonio)-1-propane sulfonate (CHAPS), Cholate, Tween help to solubilize membrane proteins and aid their separation from lipids.

b) Reductants

Reductants such as dithiothreitol (DTT), mercaptoethanol and thiourea reduce disulfide bonds or prevent protein oxidation.

c) Denaturing agents

Denaturing agents such as urea and acids disrupt protein-protein interactions, secondary and tertiary structures by altering ionic strength and pH.

d) Enzymes

Enzymes such as DNase and RNase digest contaminating nucleic acids; other enzymes can be used to digest contaminating carbohydrates and lipids.

Different methods have been developed to extract proteins from different sample types such as cultured cells or leaves.  Inhibitors of proteases are often used to prevent proteolytic protein degradation, and some of these agents may interfere with proteomic analysis.  The serine protease inhibitor phenylmethylsulfonyl fluoride (PMSF), for example, is frequently used to prevent protein degradation during tissue processing, but residual PMSF in protein samples may inhibit the tryptic digestion needed for proteomic analysis.  Detergents may also interfere with some analytical protein separations and with proteolytic digestions.  Consequently, knowledge of the history of the sample is important for the success of the analytical scheme.

3.1.3  Protein Separation Before Digestion     

The separation of proteins before the digestion process is described first.  With intact proteins, the three main separation approaches are 1D-SDS-PAGE, 2D-SDS-PAGE and preparative isoelectric focusing (IEF).  Although these are the most widely used, alternatives are reversed phase HPLC (RP-HPLC), size exclusion, ion exchange or affinity chromatography.  Regardless of the method used, the idea behind separating intact proteins is to take advantage of their diversity in physical properties, especially isoelectric point and molecular weight.  The mixture may be separated into a small number of fractions (as in 1D-SDS-PAGE and preparative IEF) or into many fractions (many spots, as in 2D-SDS-PAGE).  The fractions are then submitted to proteolytic digestion followed either by further separation of the peptide fragments or by direct MS analysis of the peptides.

  One-Dimensional SDS-PAGE

1D-SDS-PAGE, the most widely used analytical separation method in protein chemistry, is also useful for proteomic analysis.  The protein sample is dissolved in a loading buffer that often contains a thiol reductant (mercaptoethanol or DTT) and SDS.

The separation method is based on the binding of SDS to the protein, which imparts negative charge from the SDS sulfate groups to the protein in more or less constant proportion to the molecular weight.  1D-SDS-PAGE is done on polymerized acrylamide gels in which the extent of cross-linking varies from 5 to 15 %.  Lower degrees of cross-linking allow easier passage of larger proteins through the gel.  Gradient gels can provide better resolution of a broad molecular weight range of proteins.  The resolution obtained by 1D-SDS-PAGE is low, and bands that appear to contain a single protein may actually contain multiple molecular species.  A gel slice spanning a 5 kDa range from a crude cellular extract may contain from dozens to hundreds of different proteins.  1D-SDS-PAGE analysis will often produce a single clean-looking band, whereas 2D-SDS-PAGE of the same sample will resolve it into multiple spots at the same molecular weight but with different isoelectric points.  This can reflect multiple posttranslational modifications that do not significantly affect SDS binding or migration through the polyacrylamide gel.






  Two-Dimensional SDS-PAGE

2D-SDS-PAGE remains the single best method for resolving highly complex protein mixtures.  The technique is a combination of two different types of separations.  The proteins are first resolved on the basis of the isoelectric point by IEF and afterwards the focused proteins are further resolved by electrophoresis on a polyacrylamide gel. 
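The property exploited in the first dimension, the isoelectric point, can be estimated from a sequence by finding the pH at which the calculated net charge is zero.  The sketch below is only an illustration of that idea: the pKa values are an assumed set (published tables differ), the function names and the two short sequences are hypothetical, and posttranslational modifications, which shift the isoelectric point of real proteins, are ignored.

```python
# Assumed pKa values for ionizable groups (illustrative; published tables differ).
POSITIVE_PKA = {"K": 10.8, "R": 12.5, "H": 6.5}
NEGATIVE_PKA = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}
N_TERM_PKA, C_TERM_PKA = 8.6, 3.6


def net_charge(sequence, ph):
    """Henderson-Hasselbalch estimate of the net charge at a given pH."""
    charge = 1.0 / (1.0 + 10 ** (ph - N_TERM_PKA))    # free N-terminus
    charge -= 1.0 / (1.0 + 10 ** (C_TERM_PKA - ph))   # free C-terminus
    for residue in sequence.upper():
        if residue in POSITIVE_PKA:
            charge += 1.0 / (1.0 + 10 ** (ph - POSITIVE_PKA[residue]))
        elif residue in NEGATIVE_PKA:
            charge -= 1.0 / (1.0 + 10 ** (NEGATIVE_PKA[residue] - ph))
    return charge


def isoelectric_point(sequence, tolerance=0.001):
    """Bisect between pH 0 and 14 for the pH where the net charge crosses zero."""
    low, high = 0.0, 14.0
    while high - low > tolerance:
        mid = (low + high) / 2.0
        if net_charge(sequence, mid) > 0:
            low = mid    # still positively charged: pI lies at a higher pH
        else:
            high = mid
    return (low + high) / 2.0


# Hypothetical sequences, ordered as they would focus along a pH gradient.
samples = {"acidic": "DDEEDAG", "basic": "KKRKAG"}
for name, seq in sorted(samples.items(), key=lambda kv: isoelectric_point(kv[1])):
    print(f"{name:7s} pI ~ {isoelectric_point(seq):.1f}")
```

Bisection works here because the calculated net charge decreases steadily as the pH rises, so there is a single crossing point.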

The proteins are resolved in the first dimension by isoelectric point and in the second dimension by molecular weight.  Dedicated 2D-SDS-PAGE systems have been introduced that use immobilized pH gradient (IPG) strips and relatively foolproof hardware to facilitate the transfer of proteins from the IPG strip into the SDS-PAGE slab gel.  The IPG strip is based on the use of immobilized pH gradients in which polycarboxylic ampholytes are immobilized on supports to reproducibly create a stable pH gradient.  IPG strips that afford reproducible separations over a variety of wide and narrow pH ranges can be purchased from major suppliers.  Proteins with similar isoelectric points are best separated using the narrow pH ranges.  The steps in an IEF separation are the following:

a) The strip is hydrated with a buffer and the protein is slowly loaded into the strip under voltage.

b) The voltage is increased to achieve focusing. 

Commercially available systems provide temperature control as well as accurate voltage or current control, facilitating reproducible separations.

c) After the focusing step, the strip is treated with a buffer that contains a thiol reductant and SDS, and the strip is then joined to the SDS-PAGE slab gel.  The IPG strip containing the focused proteins may be considered as the equivalent of a "stacking gel" in a 1D-SDS-PAGE system.

d) The proteins are then resolved in the same manner as for 1D-SDS-PAGE.

  Visualization of the protein

Proteins separated by 2D gels are visualized by conventional staining techniques including silver, Coomassie blue and amido black staining.  Silver staining and newer fluorescent dyes are the most sensitive.  Not all of the staining protocols are compatible with subsequent analysis of the proteins.  Silver staining with formalin fixation tends to fix proteins in the gel, preventing both their digestion and the recovery of any peptides formed.  Consequently, it is important to use staining protocols that are compatible with subsequent digestion and elution steps.






      SDS gel coloured by Coomassie blue

  Problems encountered with 2D-SDS-PAGE

Despite the superiority of 2D-SDS-PAGE over other methods  as a means of resolving complex protein mixtures, the technique presents the following problems:

a) Reproducible 2D-SDS-PAGE analysis is difficult to achieve.  This aspect is important when one wishes to use 2D-SDS-PAGE to compare two samples by comparing the images of the stained gels.  Differences in protein migration in either dimension could be mistaken for differences in the levels of certain proteins between the two samples.

b) Relative incompatibility of some proteins with the first dimension IEF step may occur.  Large hydrophobic proteins do not behave well in this type of analysis.

c) Another problem with 2D-SDS-PAGE is the relatively small dynamic range of protein staining used as a detection technique, since spot densities reflect at best about a 100-fold range of protein concentrations.  Consequently, only abundant proteins are visualized by staining of 2D gels and the less abundant proteins are not detected.  The relationship between gene expression measured by mRNA transcripts and protein levels measured by incorporation of radiolabeled methionine has been studied in yeast.  Yeast express about 2/3 of their 6000 genes, and 2D-SDS-PAGE analysis with visualization by silver staining revealed a maximum of about 1000 proteins.  Consequently, of the approximately 4000 expressed genes, the products of about 3000 were not detected in the 2D-SDS-PAGE analysis.  Most of the proteins detected were products of genes with high codon bias and thus with a tendency towards higher expression.  2D-SDS-PAGE seems to be the best technique for analysis of abundant long-lived proteins.  However, many proteins of considerable interest in biology are expressed at relatively low levels and are rapidly turned over.  For these proteins other analytical methods are often necessary.

  Preparative IEF

The pH gradient is generated with soluble ampholytes (polycarboxylic acid compounds) that form a stable pH gradient when voltage is applied across the focusing cell.  The protein sample is added, voltage is applied and the proteins are separated according to their isoelectric point.  In a commercial apparatus, the Bio-Rad Rotofor cell, the focusing cell is divided by permeable membranes into a series of chambers.  After the focusing step, the chambers are simultaneously emptied by a vacuum sipper that draws the contents of each section of the cell into a separate tube, so that the entire protein mixture is finally separated into 12-20 fractions.  Milligrams to grams of total protein can be treated per run, which represents a large sample capacity.  The ampholytes can be removed from the fractionated samples by dialysis or gel filtration prior to further processing of the protein.  Recovery of proteins from solution-phase IEF exceeds 85-90 %.

  High-Performance Liquid Chromatography (HPLC)

Although HPLC of intact proteins has not become a widely used technique for analytical proteomics, it is nevertheless applicable as an initial step to fractionate protein mixtures.  Reverse phase (RP), anion and cation exchange, size exclusion and affinity chromatography are available possibilities.  HPLC is as useful as preparative IEF for resolving protein mixtures into fractions, and the advantage of HPLC is the diversity of separation modes available.  Tandem HPLC separations combine two different types of chromatography: strong cation exchange followed by RP, for example, combines two completely different separation modes.

3.1.4 Protein Separations After Digestion

In this approach the proteins in the sample are first digested into a mixture of peptides, which are separated prior to analysis.  The use of microcapillary HPLC with special control adaptations and automated MS instrument control allows the acquisition of MS data on hundreds or thousands of peptides in a single run.  If this approach is chosen, the number of available methods to separate the peptides is more limited, since 1D- and 2D-SDS-PAGE are not useful in resolving peptides because of their more limited range of pI and molecular weight.  The method of choice is HPLC.

  Tandem LC Approaches for Peptide Analysis

The diversity of stationary phases and separation modes gives HPLC considerable resolving power.  The combination of HPLC separation modes is one of the most effective tools in analytical proteomics, and the use of combined separation modes in series is referred to as "Tandem HPLC".  The combination of dissimilar separation modes allows a greater resolution of peptides in a mixture.  The major HPLC separation modes and the properties on which they separate are:

a) RP:  hydrophobicity

b) Strong cation exchange :  net positive charge

c) Strong anion exchange:  net negative charge

d)  Size exclusion : peptide size / molecular weight

e) Affinity :  Interaction with specific functional groups

Of these separation modes, all but size exclusion are likely to be useful for peptide separation.  The resolving power of available size-exclusion media is not sufficient to separate peptides in the molecular-weight range that results from proteolytic digests.  Microcapillary columns linked in series and eluted directly into the mass spectrometer have been used to analyze complex peptide mixtures, and the term "MudPIT" (Multidimensional Protein Identification Technology) was introduced to describe this approach.





Peptides are first applied to a strong cation exchange (SCX) column, to which they adsorb with affinities that are proportional to the overall number of positive charges (e.g., ionized nitrogens) on each peptide.  The peptides are eluted by a step gradient of increasing salt concentration.  Each step releases a group of peptides which then pass onto the RP column downstream of the SCX column.  Each peptide group is then separated by an RP-HPLC gradient that resolves the peptides on the basis of their hydrophobicity.  The RP column is connected to the mass spectrometer and the peptides coming out of this column are submitted to mass spectrometry analysis.  After the RP gradient is complete, the next step of the salt gradient releases more peptides from the SCX column, and these peptides are again resolved by the RP column and analyzed by MS.  Since the limits of detection of many MS instruments are below the levels at which proteins can be detected by gel staining, the described technique presents some advantages over 2D-SDS-PAGE, especially when very dilute samples are analyzed.
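The two-stage logic of this separation can be sketched in a few lines of Python.  This is a toy model, not a retention-time predictor: it assumes that the number of positive charges is simply the free N-terminus plus the lysine, arginine and histidine residues, that peptides with fewer charges are released at lower salt steps, and that the summed Kyte-Doolittle hydropathy values stand in for retention on the RP column.  The function names and the peptides are hypothetical.

```python
from collections import defaultdict

# Kyte-Doolittle hydropathy values, used as a crude proxy for RP retention.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def positive_charges(peptide):
    """Assumed charge at acidic pH: free N-terminus plus Lys, Arg and His."""
    return 1 + sum(peptide.count(residue) for residue in "KRH")


def hydrophobicity(peptide):
    """Sum of hydropathy values; higher means later elution in this toy model."""
    return sum(KYTE_DOOLITTLE[residue] for residue in peptide)


def two_dimensional_elution(peptides):
    """Group peptides by SCX salt step (charge), then order each group as on RP."""
    salt_steps = defaultdict(list)
    for peptide in peptides:
        salt_steps[positive_charges(peptide)].append(peptide)
    return [(charge, sorted(group, key=hydrophobicity))
            for charge, group in sorted(salt_steps.items())]


# Hypothetical peptides (uppercase one-letter code), for illustration only.
for charge, group in two_dimensional_elution(
        ["LLVVK", "GSDEK", "AHLIR", "GHSKR", "AAFFLK"]):
    print(f"salt step for charge +{charge}: {group}")
```

Three of the five peptides carry the same charge and would co-elute from the SCX column, and it is the second, dissimilar separation that resolves them.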

The application of tandem LC to proteomic analysis is relatively new and this promising approach will certainly undergo increasing developments and become much more widely used.
  Capillary Electrophoresis

Proteins placed in an electrical field will migrate to the point in a pH gradient where they display an overall neutral charge.  The analysis, performed in a microcapillary tube, provides the greatest resolution of all peptide analytical techniques and can be coupled directly to MS instruments.  CE has great potential as a technique for analytical proteomics.  Development of instruments for this purpose is continuing, and CE-MS may become a useful tool in proteomics analysis.

  Protein Separations Before or After Digestion?

Initial protein separation followed by digestion and analysis is the most widely used analytical proteomics approach.  This is due mainly to the preeminence of 2D-SDS-PAGE for protein separations.  The advantage of the technique is the ability of 2D gels to serve as image maps, allowing investigators to compare changes in the proteome based on changes in the patterns of spots on the gel.  Although drawbacks exist, no other technique available provides an intuitive "snapshot" of the proteome, and 2D-SDS-PAGE will most probably remain a dominant methodology in proteomics.  For lower-abundance proteins, 2D gels are not useful because of their lack of sensitivity, and in this case separation methods such as tandem LC provide a good alternative.  The most flexible comprehensive strategy for proteomic analysis may be a hybrid of methods as illustrated:


Proteins are first separated as intact species either by preparative IEF, preparative 1D-SDS-PAGE or HPLC.  The fractions obtained are submitted to enzymic digestion and the resulting peptides are separated by HPLC prior to introduction into the MS.  HPLC may involve a single separation mode (RP) or a tandem LC approach.









3.1.5 Protein Digestion Techniques

Analyzing intact proteins does not seem to be a good option at the present time, for two reasons: it may be very difficult to obtain mass measurements on very large and hydrophobic proteins, and the sensitivity of measurements of intact protein masses is not as good as that of peptide mass measurements and peptide tandem MS analysis.  The analysis of peptides rather than proteins is the approach of choice for the following reasons:

a) MS instruments now are well suited to the analysis of peptides.

b) Modern instruments can perform highly accurate mass measurements of peptides.

c) Data can be obtained from which peptide sequence can be deduced with certainty.

Moreover, the data obtained from MS analysis of peptides can be taken directly for comparison with protein sequences derived from protein and nucleotide sequence databases.  A key element of the search algorithms that assign protein identity by comparing peptide MS data with database information is the fact that certain proteolytic enzymes cleave proteins to peptides at specific sites.

  Generation of peptides from proteins

The best protein digestion protocol would cleave proteins at certain specific amino acid residues to yield fragments that are most compatible with MS analysis.  Peptide fragments of about 6-20 amino acids are ideal for MS analysis and database comparisons.  Peptides shorter than about 6 amino acids generally are too short to produce unique sequence matches in database searches.  Peptides longer than about 20 amino acids provide sequence information only with difficulty in tandem MS analyses.  Consequently, one of the goals of protein digestion is to produce the highest yield of peptides of optimal length for MS analysis.

  Proteases and their cleavage specificities

As mentioned earlier, the analysis of the primary structure of proteins is less complex and more accurate when performed on peptides derived from the larger protein.  Chemical protein degradation is based on cleavage on the carboxyl side of methionine residues with cyanogen bromide, on stepwise removal of the N-terminal residue by Edman degradation, or on cleavage at specific dipeptide linkages with formic acid or hydroxylamine.  Enzymic digestion is also used and produces a mixture of a large number of peptides, whereas chemical cleavage produces fewer but larger fragments.  What is needed for analytical proteomics are stable, well-characterized enzymes with well-defined specificities.  These enzymes must be available in quantity and high purity and must be robust enough for application in a variety of experimental conditions.  A number of reagents and proteases that have been used for proteomic analysis are listed in the table below:

                    Specific cleavage of polypeptides

  Reagent                              Cleavage site

  Chemical cleavage
  Cyanogen bromide                     Carboxyl side of methionine residues
  Phenyl isothiocyanate (Edman)        Uncharged terminal amino group of the peptide
  Hydroxylamine                        Asparagine-glycine bonds
  2-Nitro-5-thiocyanobenzoate          Amino side of cysteine residues

  Enzymic cleavage
  Trypsin                              Carboxyl side of lysine and arginine residues
  Chymotrypsin                         Carboxyl side of tyrosine, phenylalanine and tryptophan
  Endoproteinase Asp-N                 Amino side of aspartic acid and cysteic acid
  Endoproteinase Lys-C                 Amide, ester and peptide bonds at the carboxylic side of lysine

  Cleavage of Proteins by Chemical Approaches

Edman sequencing has been used for protein identification since the mid 1980s, when automated sequencers began to become available.  Edman sequencing employs stepwise chemical degradation of a protein or peptide from its N-terminus and the subsequent identification of the released and derivatized amino acids.  The Edman degradation, developed by Pehr Edman of the University of Lund in Sweden, offers an advantage over the Sanger method in that it removes the N-terminal residue and leaves the remainder of the peptide intact.  The Edman degradation is based on the labeling reaction between the N-terminal amino group and phenyl isothiocyanate.  When the labeled polypeptide is treated with acid, the N-terminal amino acid residue splits off as an unstable intermediate that undergoes rearrangement to a phenylthiohydantoin.  This last compound can be identified by comparison with phenylthiohydantoins prepared from standard amino acids.  The polypeptide chain that remains after the first Edman degradation can be submitted to another degradation in order to identify the next amino acid in the sequence.  As residues are successively removed, amino acids formed by hydrolysis during the acid treatment accumulate in the reaction mixture and interfere with the procedure.  The Edman degradation procedure has been automated in what is called a sequenator.  Each amino acid is automatically detected as it is removed, and the technique has been applied successfully to polypeptides with as many as 60 amino acid residues.

Many proteins are blocked to Edman chemistry due to modification of the N-terminal amino group and hence yield no data when the intact protein is analyzed.  Gel electrophoresis is the method of choice for protein separation, and when two-dimensional gel electrophoresis (2-DE) is used, extremely high resolution separation of complex mixtures is achievable.  The introduction of electroblotting of gel-separated proteins to a membrane (polyvinylidene difluoride, PVDF) that is compatible with Edman sequencing chemistry provided a direct link between these two methods.  Later, methods were developed for the chemical or enzymic digestion of low quantities of gel-separated proteins blotted to various membranes and for the separation/purification of the resulting peptides in preparation for Edman sequencing.  If the intact protein was blocked to Edman sequencing, this approach provided the opportunity to generate amino acid sequences from internal peptides.  Internal sequencing provided significantly more sequence coverage than N-terminal sequencing alone.  During the early 1990s efforts were directed on the one hand towards improving both membranes and digestion protocols for identification of gel-separated proteins, involving either the capture or the release of proteins and peptides, and on the other hand towards increased sensitivity of the Edman sequencers.  Edman sequencing generates amino acid sequences de novo, and there is therefore no requirement to correlate the experimentally derived data with amino acid sequence databases to assist in the identification process, as is the case with the mass spectrometry-based approaches.  Consequently, Edman sequencing continues to play an important role in the identification of proteins from species with poorly characterized genomes.

Proteins can also be cleaved with cyanogen bromide (CNBr), which cleaves proteins at methionine residues.  The reaction proceeds with a high degree of specificity, but the relative infrequency of methionine residues in most proteins means that CNBr cleavage yields a few large fragments that are often not useful for tandem MS analyses.

  Cleavage of Proteins by Enzymic Approaches

  Trypsin

In proteomic analysis, trypsin is by far the most widely used protease.  Trypsin is obtained primarily from porcine or bovine pancreas and is easily purified.  It may be purchased treated with N-tosyl-L-phenylalanine chloromethyl ketone (TPCK) to inhibit residual chymotrypsin.  Trypsin cleaves proteins on the carboxyl side of lysine and arginine residues, unless either of these is followed by a proline residue in the C-terminal direction.  The spacing of lysine and arginine residues in many proteins is such that many of the resulting peptides are of a length well suited to MS analysis.  Trypsin will cut proteins more frequently than will a protease that cuts at only one amino acid residue, and a 50 kDa protein, for example, will yield about 30 tryptic peptides.  The enzyme displays good activity in both in-solution and in-gel digestion protocols.  MS laboratories that routinely carry out proteomics analysis are familiar with trypsin autolysis fragments, which appear as by-products of tryptic digestion protocols.

  Glu-C (V8-protease)

Glu-C is an endoproteinase that cleaves at the carboxyl side of glutamate residues in either ammonium acetate or ammonium bicarbonate buffer.  In a sodium phosphate buffer the enzyme cleaves at both glutamate and aspartate residues.  Glu-C can be used for in-gel digestions.  Glu-C displays a markedly different cleavage specificity from trypsin, improving the likelihood of obtaining complementary peptide fragments of a protein.  This may be useful for analysis of proteins with regions of high lysine and arginine content, which may undergo extensive cleavage with trypsin to yield very short peptides with little sequence context.

  Other Proteases and Cleavage Reagents

A number of other enzymes are used for proteomic analysis, including Lys-C, chymotrypsin, Asp-N and nonspecific proteases.  Enzymes that cleave at only one amino acid residue tend to give larger fragments that are less useful in tandem mass spectrometry.  Chymotrypsin cleaves at tyrosine, phenylalanine and tryptophan and tends to give too many small fragments.  These proteases are nevertheless used in specific situations.

Nonspecific Proteases

These enzymes, such as subtilisin, pepsin, proteinase K and pronase, cleave proteins more or less randomly to produce multiple overlapping peptides.  Digestions must be carried out for short periods of time to prevent the digestion process from going too far.

In-Gel Digestions

A commonly used approach to the digestion of proteins separated by 1D- or 2D-SDS-PAGE is in-gel digestion.  The band or spot of interest is cut from the gel, destained and treated with a protease such as trypsin.  The enzyme penetrates the gel matrix and digests the protein to peptides, which are then eluted from the gel by washing.  The technique is closely tied to 2D-SDS-PAGE proteomics strategies.  The approach is also applicable to other proteases such as Glu-C and chymotrypsin.  The gel-staining technique used is important for successful in-gel digestion.






3.2  Identification of proteins by mass spectrometry   

The identification of proteins by correlation with sequence databases relies on the availability of constraining parameters which distinguish specific matches from all the other sequences in the database.  Identification of the residues after cleavage is mainly based on HPLC in conjunction with mass spectrometry (MS).  Mass spectrometry provides primary structural information on proteins and peptides.  MS has the ability to determine the molecular weights of peptides in mixtures resulting from protein digestions.  This provides a useful measure of the integrity, purity and overall state of modification of a peptide or protein.  MS/MS is able to provide amino acid sequence information on peptides; complete or partial information may be obtained at the femtomole to picomole level in complex mixtures, and for blocked or modified peptides that are often impossible to sequence by chemical methods.  The method of choice for biopolymer analysis is ESI/MS or MALDI/MS.  Recent developments in protein/peptide mapping are focused on miniaturization of the method.  Approaches using in-gel concentration and/or digestion procedures to exclude dilution steps have been described.  Here, proteins are fixed in primary SDS-PAGE or concentrated in secondary PAGE or agarose gels, followed by digestion.  The digestion products are then subjected to MS analysis.

3.2.1 Peptide-mass searching

After the commercial development of MALDI-MS and the demonstration that the method was capable of measuring the masses of peptides in mixtures arising from enzymic digestion of gel-separated proteins, a number of groups developed algorithms for protein identification based upon correlating measured peptide masses with theoretically calculated peptide masses derived from proteins existing in sequence databases.  This has often been applied to the identification of gel-separated proteins.

Protein identification by mass spectrometry

Experimental methods

   Intact protein (gel separated or purified by HPLC)
   digest
   Peptides
   Peptide masses measured by MALDI/MS or ESI/MS

Computational methods

   Protein-sequence database and translation of nucleotide sequence DB

   Peptide mass search
   a) For each sequence entry calculate peptide masses from given enzyme specificity.
   b) Correlate observed peptide masses with calculated peptide masses.
   c) Rank order best correlation.

   Uninterpreted fragment ion search
   a) For each sequence entry calculate peptide masses from given enzyme specificity.
   b) For each peptide whose observed mass equals the calculated one, generate a theoretical fragment ion list.
   c) Correlate observed fragment ion masses with calculated fragment ion masses.
   d) Rank order best matches.


     Internet sites with mass spectrometry-based protein identification tools :



1) Peptides are generated by digestion of the protein of interest using a specific enzyme of known cleavage specificity.

2) The masses of the peptides are accurately determined using MALDI-MS or ESI-MS.

3) Theoretical peptide masses are calculated for each sequence entry in the database using the same cleavage specificity as the reagent used experimentally.

4) A ranking is then calculated to provide a measure of fit between experimentally derived and calculated peptide masses.
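The four steps above can be sketched in a few lines of Python.  This is only a minimal illustration under stated assumptions: the two database entries (toy_A, toy_B) are invented sequences, the score is a simple count of matched masses rather than the ranking used by any published program, and the residue masses are standard monoisotopic values.

```python
# Minimal peptide-mass search sketch (hypothetical toy database, simple match count).
MONO = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
        "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
        "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
        "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931}
WATER = 18.01056   # H2O added to the sum of residue masses
PROTON = 1.00728   # mass of the charge-carrying proton


def tryptic_digest(sequence):
    """Cut after K or R unless the next residue is P (complete digestion assumed)."""
    peptides, start = [], 0
    for i, aa in enumerate(sequence):
        nxt = sequence[i + 1] if i + 1 < len(sequence) else ""
        if aa in "KR" and nxt != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides


def mh_plus(peptide):
    """Monoisotopic m/z of the singly protonated peptide, (M+H)+."""
    return sum(MONO[aa] for aa in peptide) + WATER + PROTON


def rank_proteins(observed, database, tol=0.2):
    """Count, for each entry, how many observed masses match a theoretical one."""
    scores = {}
    for name, seq in database.items():
        theoretical = [mh_plus(p) for p in tryptic_digest(seq)]
        scores[name] = sum(any(abs(m - t) <= tol for t in theoretical)
                           for m in observed)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical two-entry database and a simulated measurement of toy_A peptides.
database = {"toy_A": "MAVAGCAGARLLK", "toy_B": "MKTTPEELR"}
observed = [mh_plus("MAVAGCAGAR"), mh_plus("LLK")]
print(rank_proteins(observed, database))
```

A real search would also allow for missed cleavages and modifications, and would weight the score by database size and mass accuracy.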


The approach is well suited to genetically well-characterized organisms, especially those whose entire genomes have been determined (see the TIGR database) or those for which extensive protein or cDNA sequence databases have been established.

Protein identification by peptide mass searching depends on the correlation of several peptide masses derived from the same protein between the experimental data set and the calculated data table.  The technique is not suited for the identification of proteins in mixtures.  It is rare that all of the measured peptide masses will be matched with the sequence of the protein from which they originate.  This not only complicates the identification of the components in protein mixtures but also the identification of single, purified proteins.  The potential reasons for these unmatched peptide masses are the following:

1) The protein was identified correctly but the additional masses are due to post-translational or artifactual modifications or post-translational processing (N- or C-terminal processing).  These modifications are to be confirmed experimentally in order to reconcile these mass differences.

2) The protein was identified correctly but unspecific proteolysis occurred or a contaminating protease was present.

3) The protein was identified correctly but it was part of a mixture of proteins, since even 2-DE protein spots may consist of more than one protein.

4) The protein identified may be a sequence homolog or splice variant of that reported in the database.

5) The protein identified was a false positive; if the mass accuracy of the experimental data is not high enough, the results are difficult to confirm or disprove.

The accuracy of the peptide-mass measurement is a critical experimental parameter when attempting to identify proteins using peptide-mass data.  The greater the accuracy, the greater the confidence of the assignment.  The specificity of the enzyme or chemical reagent used is another critical parameter, since the purer the reagent the more reliable the search results will be.  Trypsin is the most commonly employed enzyme, but even highly purified trypsin can cleave at sites other than C-terminal to Lys or Arg (when not followed by Pro).
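Mass accuracy is usually expressed in parts per million (ppm).  The short sketch below shows how a ppm error and a tolerance test might be computed; the function names and the numerical values are illustrative assumptions, not taken from any particular search program.

```python
def ppm_error(observed_mz, theoretical_mz):
    """Relative mass error in parts per million."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


def within_tolerance(observed_mz, theoretical_mz, tol_ppm):
    """True if the observed value lies inside the +/- tol_ppm window."""
    return abs(ppm_error(observed_mz, theoretical_mz)) <= tol_ppm


# Illustrative values: a 0.005 unit error at m/z 1000 is a 5 ppm error.
print(round(ppm_error(1000.005, 1000.0), 3))
print(within_tolerance(1000.005, 1000.0, tol_ppm=10))
print(within_tolerance(1000.050, 1000.0, tol_ppm=10))
```

The narrower the tolerance that the instrument allows, the fewer database peptides fall inside the window and the more constraining each measured mass becomes.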






















A problem with all proteases is that they may not cleave the substrate to completion, leaving missed cleavage sites, particularly where two or more consecutive amino acids in a protein sequence are potential cleavage sites for the enzyme.
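To allow for incomplete digestion, a search can also generate peptides that span one or more uncleaved sites.  The following is a minimal sketch, assuming the tryptic rule described earlier (cut after K or R, not before P); the function name and the test sequence are hypothetical.

```python
def digest_with_missed(sequence, max_missed=1):
    """Tryptic peptides containing up to max_missed uncleaved K/R sites."""
    # Positions at which the chain may be cut, plus both chain ends.
    internal = [i + 1 for i, aa in enumerate(sequence[:-1])
                if aa in "KR" and sequence[i + 1] != "P"]
    sites = [0] + internal + [len(sequence)]
    peptides = []
    for a in range(len(sites) - 1):
        # b - a - 1 is the number of cleavage sites left uncut inside the peptide.
        for b in range(a + 1, min(a + 2 + max_missed, len(sites))):
            peptides.append(sequence[sites[a]:sites[b]])
    return peptides


# Two consecutive lysines: the KK site is a typical source of missed cleavages.
print(digest_with_missed("AAKKGGR", max_missed=0))  # ['AAK', 'K', 'GGR']
print(digest_with_missed("AAKKGGR", max_missed=1))  # ['AAK', 'AAKK', 'K', 'KGGR', 'GGR']
```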


3.2.1  Peptide Sequence Analysis by MS/MS

A human/mouse protein sequence database search for protein identification, using the m/z measurement of the tryptic peptide VGAHAGEYGAEALER from human hemoglobin alpha, produced two hits for the (M+H)+ proton adduct of m/z = 1529.7384.  These two hits correspond to hemoglobin alpha, which is highly conserved between mice and men.

                   HUMAN                       VGAHAGEYGAEALER

                   MOUSE                        IGGHGAEYGAEALER

The N-terminal segments VGA and IGG are different, and HAG and HGA have the same composition but a different order.  The two peptides are identical in mass but may be distinguished by their fragmentation pattern, which can be recognized by inducing peptide fragmentation to produce product ions in MS/MS.  Let us consider a model peptide, AVAGCAGAR, in order to illustrate key concepts of tandem MS fragmentation.
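The point can be checked numerically.  The sketch below, which assumes standard monoisotopic residue masses, computes the (M+H)+ value of both hemoglobin peptides and lists the positions at which their b-ions differ.  The calculated value may differ from the quoted 1529.7384 in the last decimals depending on the mass table used.

```python
MONO = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
        "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
        "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
        "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931}
WATER = 18.01056
PROTON = 1.00728


def mh_plus(peptide):
    """Monoisotopic m/z of the singly protonated peptide."""
    return sum(MONO[aa] for aa in peptide) + WATER + PROTON


def b_ions(peptide):
    """Singly charged b-ions b1 .. b(n-1): N-terminal residue sums plus a proton."""
    ions, total = [], PROTON
    for aa in peptide[:-1]:
        total += MONO[aa]
        ions.append(total)
    return ions


def differing_b_ions(p, q, tol=0.01):
    """Indices (1-based) of the b-ions whose m/z differs between two peptides."""
    return [i + 1 for i, (x, y) in enumerate(zip(b_ions(p), b_ions(q)))
            if abs(x - y) > tol]


human = "VGAHAGEYGAEALER"
mouse = "IGGHGAEYGAEALER"
print(round(mh_plus(human), 3), round(mh_plus(mouse), 3))  # identical precursor m/z
print(differing_b_ions(human, mouse))  # -> [1, 2, 5]
```

The precursor masses coincide because V + A has the same mass as I + G, but b1, b2 and b5 fall at different m/z values, which is what allows MS/MS to tell the two sequences apart.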

Each amino acid residue has an amide NH group at one end , a C=O group at the other and an alpha carbon with one proton in the middle.  The side chains that give each amino acid its special chemistry are attached to the alpha carbon.  The amino acid units that contain these elements are considered as residues. An extra proton must be added to the N-terminal residue and an extra OH group to the C-terminal amino acid.  The sequence AVAGCAGAR may be represented as cumulative numbers as illustrated





3.2.2  Peptide Ion Fragmentation in MS/MS

When peptide ions collide with neutral gas atoms in the collision cell of a Q-TOF, an ion trap or a triple quadrupole, the kinetic energy that is absorbed induces fragmentation.  The most significant cleavages occur along the peptide backbone, and a commonly accepted nomenclature describes peptide ion fragmentation.  The bond between the carbonyl group and the amide nitrogen is the one most commonly cleaved, producing b- and y-ions.  A y-ion is a fragment in which the positive charge is retained on the C-terminal portion of the original peptide ion, and a b-ion is a fragment in which the charge is retained on the N-terminal portion of the original peptide ion.  When a singly charged peptide ion fragments, either a b-ion or a y-ion is formed and the other half of the peptide is lost as a neutral fragment.  Doubly charged ions provide twice as much information, since both halves can retain a charge.  The a-, c-, x- and z-ions are observed only occasionally in spectra obtained on ion traps, Q-TOF and triple quadrupole instruments because their formation requires more energy than the cleavages giving b- and y-ions.  They are observed more frequently on magnetic sector tandem instruments, which use higher energies for collision-induced dissociation of peptide ions.
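As an illustration of the nomenclature, the following sketch computes the singly charged b- and y-ion m/z values of the model peptide AVAGCAGAR.  It assumes standard monoisotopic residue masses; only the residues needed for this peptide are listed.

```python
# Monoisotopic residue masses for the residues present in AVAGCAGAR only.
MONO = {"G": 57.02146, "A": 71.03711, "V": 99.06841, "C": 103.00919, "R": 156.10111}
WATER = 18.01056
PROTON = 1.00728


def fragment_ions(peptide):
    """Return (b, y): b[i-1] is the b_i ion and y[i-1] is the y_i ion (both 1+)."""
    masses = [MONO[aa] for aa in peptide]
    b, y = [], []
    for i in range(1, len(peptide)):
        b.append(sum(masses[:i]) + PROTON)            # N-terminal fragment
        y.append(sum(masses[-i:]) + WATER + PROTON)   # C-terminal fragment
    return b, y


b, y = fragment_ions("AVAGCAGAR")
for i, (bi, yi) in enumerate(zip(b, y), start=1):
    print(f"b{i} = {bi:8.3f}    y{i} = {yi:8.3f}")

# Complementary fragments from cleavage of the same bond (here G-C: b4 and y5)
# add up to the mass of the peptide plus two protons.
print(round(b[3] + y[4], 3))
```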

3.2.3  The Tandem Mass Spectrum

The predicted b- and y-ion fragmentations for the model peptide AVAGCAGAR are illustrated at the left side of the page.  The real ion trap MS2 mass spectrum of the doubly charged ion of AVAGCAGAR ([M+2H]2+, monoisotopic mass = 776.395, m/z = 776.395/2 = 388.20) is illustrated hereunder:






Starting from the N-terminus on the left, successive cleavages generate an ascending series of fragment ions corresponding to the b-series and a descending series of y-ion fragments.  For example, the b4 and y5 fragments are formed by cleavage of the G-C bond.  The b- and y-ion series in the MS-MS mass spectrum provide the sequence of the peptide.  The mass differences y7 - y6 (605.3 - 534.3 = 71 amu), y6 - y5 (534.3 - 477.0 = 57 amu) and y5 - y4 (477.0 - 374.2 = 103 amu) correspond to the residue masses of the amino acids alanine, glycine and cysteine, demonstrating the presence of an AGC motif in the peptide.  The complete y-ion series from y8 to y1 establishes the VAGCAGAR motif.  The complete b-ion series from b8 to b1 corresponds to the AVAGCAGA motif.  Combining the results of the y- and b-ion series, the MS-MS spectrum provides definitive confirmation of the sequence AVAGCAGAR.
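Reading a sequence from a y-ion ladder amounts to matching successive mass differences against the table of residue masses.  The sketch below does this for the four y-ions quoted above; the tolerance of 0.5 units is an assumption suited to nominal-mass ion trap data, and leucine and isoleucine are reported as a single entry because they are isobaric.

```python
# Monoisotopic residue masses; "L" stands for both Leu and Ile (same mass).
MONO = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
        "T": 101.04768, "C": 103.00919, "L": 113.08406, "N": 114.04293,
        "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
        "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931}


def read_ladder(ions, tol=0.5):
    """Translate the gaps of a descending ion ladder into residues ('?' if none fits)."""
    ions = sorted(ions, reverse=True)
    residues = []
    for hi, lo in zip(ions, ions[1:]):
        gap = hi - lo
        best = min(MONO, key=lambda aa: abs(MONO[aa] - gap))
        residues.append(best if abs(MONO[best] - gap) <= tol else "?")
    return "".join(residues)


# y7, y6, y5 and y4 of AVAGCAGAR as quoted in the text.
print(read_ladder([605.3, 534.3, 477.0, 374.2]))  # -> AGC
```

With low-resolution data some gaps remain ambiguous (for example K and Q differ by only 0.036 units), which is one reason why manual interpretation can be slow.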

Interpretation of the MS-MS mass spectrum is of course easier when the peptide sequence is known; with an unknown sequence, interpretation may be more difficult.  This is called de novo sequence interpretation, and it may take from half an hour to several hours depending on the quality of the spectra and the experience of the analyst.  Because LC/MS-MS analysis can generate thousands of MS-MS spectra, data-reduction algorithms and software tools have been developed to compare MS-MS data to peptide sequences in databases in order to identify the proteins from which the peptides were derived.  Sequest is one of these programs.


3.2.4  Peptides Containing Proline

Different parameters can prevent a mass spectrometer from generating an MS-MS spectrum including all the b- and y-ion series.  These parameters are:

a) Differences in the tendencies of different peptide bonds to fragment

b) Peculiar fragmentation characteristics of certain amino acids

c)  The damping effect of proline on peptide ion fragmentation

To some extent, fragmentation depends on how easily the protons in the protonated peptide ions can migrate to the various in-chain peptide amide nitrogens.  The most easily protonated sites are the most easily cleaved.  Certain acidic amino acid side chains are also able to stabilize the positive charge more easily; for example, cleavages adjacent to glutamate or aspartate residues give rise to intense fragment ions.  In MS-MS peptide spectra, peptide ions gain energy from collisions and this energy is distributed among different competing pathways.  The most intense ions are those resulting from cleavages near the middle of the peptide, and cleavages that occur easily tend to diminish the contribution of other cleavages.  Consequently, some fragment ions will be weak or absent.  Tryptic peptides easily generate doubly charged ions because of the lysine or arginine residue at the C-terminus.  The y-ion series is often more intense than the b-ion series in the MS-MS spectra of tryptic peptides because the basic side chains of lysine and arginine retain the positive charge at the C-terminus of peptide fragments.  Sometimes cleavage of a doubly charged peptide ion gives rise to a doubly charged product ion and a neutral fragment rather than to singly charged fragments.  Although the situation is infrequent, it can make interpretation of the spectra more difficult.

Certain amino acid side chains undergo specific cleavages that do not involve the peptide backbone.  Water can be eliminated from the hydroxyl-containing side chains of serine and threonine, and the series of ions generated by this water loss is sometimes more intense than the ions of the intact serine- or threonine-containing fragments.  Phosphoserine and phosphothreonine residues lose phosphoric acid (H3PO4), and the ions formed by these losses are abundant in the MS-MS spectra.  They are good indicators of the presence of phosphopeptides.  Elimination of H2S from cysteine and of ammonia from glutamine and asparagine are also characteristic side-chain losses.

Another source of ambiguity in peptide fragmentation is the occurrence of proline residues.  Proline prevents tryptic cleavage when located on the C-terminal side of either a lysine or an arginine, and it also affects MS-MS fragmentation.  This is closely tied to the structure of proline, whose cyclic side chain is attached to both the alpha carbon and the backbone nitrogen, making the latter a secondary amine.  The result is that the proline nitrogen does not have an available site for protonation, cleavage is practically suppressed, and b- or y-ions are missing where cleavages about proline residues fail to occur.

The cis configuration of the peptide bond preceding proline is much less disfavoured than for other residues because the nitrogen atom of proline is bound to two tetrahedral carbons.






3.2.5  Identification of Proteins with MS-MS Data     

The first way of identifying a protein from peptide MS-MS spectra is the de novo interpretation of the spectrum to obtain a peptide sequence, followed by a basic local alignment search tool (BLAST) search of the sequence against a sequence database to identify the protein.  This is a reasonable approach when one or two proteins are to be identified from bands of an SDS gel.  But the field of proteomics clearly relies on the identification of a large number of proteins from MS-MS spectra, and the de novo sequencing/BLAST searching approach is too slow for a large number of proteins.  This is why a second approach to protein identification with MS-MS data comes into play.  The second approach does not make use of de novo interpretation: algorithms are applied to correlate MS-MS spectral data directly with peptide sequences found in databases, without really interpreting each MS-MS spectrum individually.  This second way of working fits with the emerging database resources resulting from genome sequencing.  If an MS-MS spectrum is obtained of a peptide whose sequence exists in a database, the right algorithm should be able to make the match.  These algorithms can match MS-MS data to protein sequences or to nucleotide sequences (genome or expressed sequence tags [EST]) that are translated to protein sequences.

Identifying Proteins from ESI/MS-MS data:  SEQUEST

Sequest is an algorithm/program introduced in 1995 to identify proteins by matching MS-MS data to database sequences; similar programs also exist.  The output of these programs depends on the quality of the MS-MS data obtained and on the completeness and accuracy of the database used.  Sequest works according to the following scheme:

a) In a MS-MS scan not only the scan is recorded but also the m/z value of the precursor ion.

b) This information is stored together with the scan data.

c) When analysis is complete, the user opens the Sequest program and selects the datafile containing the MS-MS scans to be analyzed.

d) The type of enzyme (trypsin for example) used for digestion of the protein sample is provided to Sequest, and also whether singly or doubly charged ions were subjected to MS-MS.

e) The user selects a database against which the MS-MS data are to be compared

Once the program starts, all the proteins in the database are subjected to a virtual digestion with the enzyme specified by the user (trypsin for example).  This generates a master list of possible peptides for comparison to the MS-MS scans.  Each MS-MS scan is analyzed in the following way:

1.  The precursor ion of each MS-MS scan is used to select peptides in the database with the same mass, within a certain tolerance.

2. Theoretical MS-MS spectra are generated from each of the selected peptides.

3. The MS-MS spectrum being analyzed is compared with each of the theoretical MS-MS spectra generated from the database.

4. A correlation score is calculated for each match between the MS-MS scan and the theoretical MS-MS spectra.

5. The best match or matches for each MS-MS scan analyzed is reported.
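The five steps can be illustrated with a deliberately simplified sketch.  It is not the Sequest algorithm: the correlation score is replaced here by a plain count of shared peaks, the candidate list is a hypothetical three-peptide "database", and the observed peaks are the four y-ions of AVAGCAGAR quoted earlier.

```python
# Residue masses for the residues used in this toy example only.
MONO = {"G": 57.02146, "A": 71.03711, "V": 99.06841, "C": 103.00919,
        "L": 113.08406, "K": 128.09496, "R": 156.10111}
WATER = 18.01056
PROTON = 1.00728


def peptide_mass(peptide):
    """Neutral monoisotopic mass of the peptide."""
    return sum(MONO[aa] for aa in peptide) + WATER


def theoretical_fragments(peptide):
    """Singly charged b- and y-ion m/z values (the theoretical spectrum)."""
    masses = [MONO[aa] for aa in peptide]
    frags = []
    for i in range(1, len(peptide)):
        frags.append(sum(masses[:i]) + PROTON)           # b_i
        frags.append(sum(masses[-i:]) + WATER + PROTON)  # y_i
    return frags


def shared_peak_count(peaks, peptide, tol=0.5):
    """Number of observed peaks explained by a theoretical fragment."""
    theo = theoretical_fragments(peptide)
    return sum(any(abs(m - t) <= tol for t in theo) for m in peaks)


def search(precursor_mz, charge, peaks, peptides, precursor_tol=1.0, frag_tol=0.5):
    """Steps 1 to 5: select candidates by precursor mass, score, rank."""
    target = precursor_mz * charge - charge * PROTON     # neutral precursor mass
    candidates = [p for p in peptides
                  if abs(peptide_mass(p) - target) <= precursor_tol]
    scored = [(shared_peak_count(peaks, p, frag_tol), p) for p in candidates]
    return sorted(scored, reverse=True)


peptides = ["AVAGCAGAR", "AVAGACGAR", "LLK"]   # hypothetical candidate list
peaks = [605.3, 534.3, 477.0, 374.2]           # y7, y6, y5, y4 from the text
print(search(388.20, 2, peaks, peptides))
# -> [(4, 'AVAGCAGAR'), (3, 'AVAGACGAR')]
```

The scrambled peptide AVAGACGAR has the same precursor mass and so survives the first filter, but it explains one peak fewer, which is why fragment-level scoring is needed to separate candidates of equal mass.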

It must be mentioned that an MS-MS spectrum in which over half of the predicted b- and y-ions of a peptide match the major signals in the spectrum is often a correct match.  If the most prominent fragment ions do not match the b- and y-ions of a presumed peptide, the match is usually incorrect.  Sequest does not make judgments about the quality of the matches assigned.  As an aid to decision-making, a summary of the database proteins matched to MS-MS scans is presented in a browser window, listing the proteins in order of decreasing number of hits (MS-MS scan matches).  A protein with several high-quality hits on different peptide sequences is likely to be correctly identified.  The most reliable protein identifications are those in which several different sequences within the identified protein provide high-quality matches to MS-MS spectra in the datafile.


Some complications are to be mentioned that can make Sequest analyses more time-consuming or less accurate and complete.

a) Many peptides bear covalent modifications that alter the m/z values of the peptides actually analyzed.  Consequently, Sequest would use a mass that does not correspond to the unmodified peptide mass in the database, and no correct match between the MS-MS scan of the modified peptide and the database sequence would be possible because of the mass difference.  Sequest allows the user to specify particular modifications to amino acids so that the algorithm can search for both modified and unmodified variants.  With well-known modifications such as phosphorylation of serine, threonine or tyrosine this works reasonably well, but unanticipated modifications may be missed.

b) The incorrect assignment of the charge state of the precursor ion of an MS-MS spectrum is another problem for Sequest.  If a singly charged ion is incorrectly designated as doubly charged, or conversely, it will be compared to theoretical MS-MS spectra from database peptides of the wrong mass.

Other Algorithms and Software Tools for Identifying Proteins from MS-MS data

The MS-Tag program was originally developed for the analysis of post-source-decay (PSD) spectra obtained in MALDI-TOF analyses of peptides but has been modified to accommodate MS-MS data from different types of instruments.  The following parameters can be entered by the user:

a) a list of m/z values from the MS/MS spectrum to be analyzed.

b) the m/z value and charge state of the precursor ion

c) Information about the type of enzyme used for proteolytic digestion

d) Information on the instrument used to obtain the MS-MS data

The algorithm prefilters the database for peptides that match the precursor m/z of the MS-MS spectrum being analyzed.  The output provides a tabular list of matching peptides and of the fragments that matched the ions recorded in the actual MS-MS spectrum.  MS-Tag is well suited to the analysis of MALDI-TOF PSD spectra, which contain immonium ions (low m/z fragments indicating the presence of individual amino acids).

SALSA:  An Algorithm for Searching Specific Features of Tandem MS

When Sequest is used, peptide MS-MS data are supposed to be available and the question raised is: what proteins do these peptides come from?  Sequest and other programs are well suited to the task of protein identification from MS-MS data.  If we want to do something more than identify what proteins are present in a sample, the situation is different, and several cases may be considered:

1.  The sample contains many proteins but identification may be restricted to the proteins bearing some specific modifications for example posttranslational modifications such as phosphorylation or a modification by a drug.

2.  Only peptides sharing a given sequence motif are to be identified in a mixture.

3.  The sample is suspected to contain a particular protein which may be present in multiple modified forms which are to be detected.

In fact, what is to be identified are those MS-MS spectra that display specific features of interest, such as specific functional groups (phosphorylated amino acids, for example); MS-MS spectra may also be selected that display b- or y-ion series indicating a particular amino acid sequence motif.  For this kind of information an algorithm called SALSA (Searching Algorithm for Spectral Analysis) may be employed.

The SALSA Algorithm

Four specific characteristics in the MS-MS spectra are detected by SALSA:

a) The first is a product ion at a specific m/z value.  This may result from a chemical modification which is lost and appears as a charged fragment in the MS-MS spectrum regardless of the m/z of the peptide from which the fragment was lost.

b) The second feature is a neutral loss in which a neutral fragment is lost from the precursor ion.  The difference between the mass of the precursor and the product ion detected will be equal to the mass of the lost neutral fragment. For example, this may occur for doubly charged ions.

c) Another specific characteristic is a charged loss, in which a multiply charged precursor ion loses a charged fragment.  A doubly charged precursor may lose a singly charged fragment.  The formation of singly charged b- and y-ions in the MS-MS spectra of doubly charged peptides is a typical example of charged losses.

d) The fourth specific characteristic is an ion pair, that is, two signals separated by a specified m/z difference anywhere in the MS-MS spectrum.  The mass difference can indicate the presence of a specific amino acid in the peptide sequence; for example, the y-ion series of a peptide containing cysteine would exhibit a pair of product ion signals separated by 103 m/z units, the residue mass of cysteine.
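Three of these four features are easy to express as small tests on a peak list, as in the sketch below.  The function names are hypothetical and do not reproduce the SALSA implementation; the charged-loss feature is omitted for brevity.  Note that on the m/z axis a neutral loss from a multiply charged precursor appears as the loss mass divided by the charge.

```python
H3PO4 = 97.9769          # monoisotopic mass of phosphoric acid
CYS_RESIDUE = 103.00919  # monoisotopic residue mass of cysteine


def has_product_ion(peaks, target_mz, tol=0.5):
    """Feature a): a product ion at a specific m/z value."""
    return any(abs(mz - target_mz) <= tol for mz in peaks)


def neutral_loss_mz(precursor_mz, charge, loss_mass):
    """Expected m/z of the precursor after losing a neutral fragment."""
    return precursor_mz - loss_mass / charge


def has_neutral_loss(peaks, precursor_mz, charge, loss_mass, tol=0.5):
    """Feature b): a product ion at precursor m/z minus loss/charge."""
    return has_product_ion(peaks, neutral_loss_mz(precursor_mz, charge, loss_mass), tol)


def ion_pairs(peaks, gap, tol=0.5):
    """Feature d): all (high, low) peak pairs separated by the given gap."""
    ordered = sorted(peaks, reverse=True)
    return [(hi, lo) for i, hi in enumerate(ordered) for lo in ordered[i + 1:]
            if abs((hi - lo) - gap) <= tol]


# y7 to y4 of AVAGCAGAR as quoted in the text: one pair is 103 units apart.
print(ion_pairs([605.3, 534.3, 477.0, 374.2], CYS_RESIDUE))  # -> [(477.0, 374.2)]

# Illustrative doubly charged phosphopeptide precursor at m/z 500.0.
print(round(neutral_loss_mz(500.0, 2, H3PO4), 3))
```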

In principle, individual MS-MS spectra could be interpreted, but the need to examine hundreds or thousands of MS-MS spectra from a single LC-MS run makes this impractical.  The SALSA algorithm serves the need for rapid computer-assisted screening of many MS-MS spectra.  SALSA scores MS-MS scans based on the intensities of the ions that define the specific features and is able to rank the best hits.  If an intense product ion arising from a specified neutral loss (phosphoric acid from phosphoserine, for example) is observed, a high score will be attributed.  A low-abundance product ion corresponding to the same neutral loss in another scan would give that scan a low score.

A hierarchy of importance for different characteristics can be set in SALSA.  Some spectral features can be designated as primary characteristics, which are scored whenever they are detected, and others as secondary characteristics.  Secondary features are linked to particular primary characteristics and are only scored when the linked primary characteristic is detected.  Some peptides modified by carbohydrates undergo a neutral loss of water, which could be considered as a specific feature.  Nevertheless, a SALSA search for MS-MS scans displaying a neutral loss of 18 mass units from the precursor is very likely to produce many hits, because peptides containing serine or threonine residues may also lose water even if they do not contain the structural feature of interest.  The use of multiple scoring criteria in a primary-secondary scoring hierarchy increases the ability of SALSA to detect selectively MS-MS scans derived from specific peptides and their derivatives.

Amino Acid Sequence-Motif Searching with SALSA

The presence of an ion pair separated by a specified distance on the m/z axis is one of the MS-MS features detected by SALSA.  The b- and y-ion series are sources of ion pairs in MS-MS spectra.  For the peptide AVAGCAGAR a pair of ions is found at m/z 477 and m/z 374, corresponding to the y5 and y4 ions.  The mass difference is 103 m/z units, indicating the presence of a cysteine residue.  The y4 and y3 ions are separated by 71 m/z units, corresponding to an alanine residue.  SALSA can detect the MS-MS scan with an ion pair separated by 103 m/z units arising from AVAGCAGAR, but it will likely also detect the MS-MS scans from other cysteine-containing peptides in the sample.  To be more selective we could focus on the gap between y5 and y3 (174 m/z units), corresponding to the cysteine and alanine residues together; MS-MS scans of peptides containing a CA or AC dipeptide would then be picked out.  We must be aware that a single ion pair can never be a selective means of differentiating any one MS-MS scan from all the rest.  If we have a datafile with several hundred MS-MS scans, how can ion-series searching be used to find the scan of the doubly charged ion of AVAGCAGAR?  It must be recalled that in most peptide MS-MS spectra the most intense ions are those due to cleavages near the middle of the peptide.  Let us start with the tripeptide sequence GCA, corresponding to four ions as shown.


The highest mass is used as a reference; the next ion is 57 units lower, corresponding to the glycine residue mass; the third ion is 103 mass units lower than the second, corresponding to cysteine; and the fourth ion is found 71 mass units lower than the third, corresponding to alanine.  This kind of ion series would correspond to a y-ion series for a peptide containing a GCA motif.  The ion series acts as a scale with mass values which can be matched to each MS-MS spectrum in the datafile.  The "GCA" scale matches signals in the MS-MS spectrum of AVAGCAGAR.  The ions in the series must be linked together: a peptide that contains glycine, cysteine and alanine, but not in the same sequence, will not match the GCA scale even if the peptide has the same amino acid composition.  The only problem with an ion-series search for a short sequence motif is that other peptides containing the GCA motif may also be detected.  A search with an eight-ion series covering the VAGCAGA motif of AVAGCAGAR is much more likely to identify selectively the MS-MS scan of the AVAGCAGAR peptide.  An important convention in SALSA is that the ions in a series are entered from the highest to the lowest m/z.  Consequently, a search with the VAGCAGA motif starts with a gap of 99 units between the first two ions (valine), followed by losses of 71, 57 and so on, corresponding to alanine and glycine.  The ions described correspond to the y-ion series of the AVAGCAGAR peptide as illustrated.
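The idea of sliding an ion-series "scale" along a spectrum can be sketched as follows.  This is an illustrative reconstruction, not SALSA itself: each peak is tried in turn as the reference (highest) ion, the expected lower ions are obtained by subtracting residue masses in motif order, and the best-matching placement is kept.  The peak list mixes the four y-ion values quoted in the text with calculated values for the remaining y-ions.

```python
# Residue masses for the residues present in the motifs used below.
MONO = {"G": 57.02146, "A": 71.03711, "V": 99.06841, "C": 103.00919}


def match_series(peaks, motif, tol=0.5):
    """Return the peaks matched by the best placement of the motif scale.

    The motif is written from the highest to the lowest m/z, so each residue
    gives the gap down to the next expected ion.
    """
    best = []
    for ref in peaks:
        expected = [ref]
        for aa in motif:
            expected.append(expected[-1] - MONO[aa])
        matched = []
        for e in expected:
            hits = [p for p in peaks if abs(p - e) <= tol]
            if hits:
                matched.append(min(hits, key=lambda p: abs(p - e)))
        if len(matched) > len(best):
            best = matched
    return best


# y8 .. y1 of AVAGCAGAR (605.3, 534.3, 477.0, 374.2 as quoted; others calculated).
peaks = [704.4, 605.3, 534.3, 477.0, 374.2, 303.2, 246.2, 175.1]
print(match_series(peaks, "GCA"))           # -> [534.3, 477.0, 374.2, 303.2]
print(len(match_series(peaks, "VAGCAGA")))  # -> 8
```

A longer motif yields more ions that must line up at once, which is why the eight-ion search is far more selective than the four-ion GCA scale.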








In MS-MS spectra of tryptic peptides, the y-ion series is often more intense than the b-ion series although exceptions exist.   SALSA scores MS-MS spectra for ion series based on :

a)  The number of ions in the MS-MS spectrum that match the series

b) The intensities of the ions that are matched.  Intense signals that match most or all of the ions in the series will get the highest score.

The user may specify a minimum number of ions that must be found in the MS-MS spectrum for a match.
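A scoring rule combining these two criteria with a minimum-ion threshold might look like the sketch below.  The formula (summed intensity of the matched ions, weighted by the fraction of the series that was found) is an assumption chosen for illustration; the text does not give the actual SALSA scoring function.

```python
def score_series(spectrum, expected_mz, min_ions=3, tol=0.5):
    """Score a spectrum against an expected ion series.

    spectrum    : list of (m/z, intensity) pairs
    expected_mz : m/z values of the ion series, highest to lowest
    Returns 0.0 if fewer than min_ions ions of the series are found.
    """
    matched = []
    for e in expected_mz:
        hits = [inten for mz, inten in spectrum if abs(mz - e) <= tol]
        if hits:
            matched.append(max(hits))   # keep the most intense matching signal
    if len(matched) < min_ions:
        return 0.0
    # Hypothetical score: summed intensity weighted by the fraction matched.
    return sum(matched) * len(matched) / len(expected_mz)


gca_scale = [534.3, 477.3, 374.3, 303.2]   # illustrative four-ion GCA series
strong = [(534.3, 80.0), (477.0, 100.0), (374.2, 60.0), (303.2, 40.0)]
weak = [(534.3, 8.0), (477.0, 10.0)]

print(score_series(strong, gca_scale))  # -> 280.0
print(score_series(weak, gca_scale))    # -> 0.0 (only two ions found)
```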