Bioinformatics – KEGG Pathway Database (part 9-2)


Hey! Welcome back! Let’s continue with our Week 9 lectures on Ontology and Identification of Molecular Pathways. In the last unit, we learned about the important concept of ontology, and we looked closely at the Gene Ontology, which is a hierarchical, common, controlled vocabulary for describing the molecular functions, biological processes, and cellular components of genes and gene products. There is another way of organizing and representing biological knowledge: organizing the molecules into biological pathways. We’ll take the KEGG pathway database as an example. Let’s continue with Unit 2 of our lectures.
So what is a biological pathway? As you know, molecules don’t work in isolation in our body. In fact, they work together in teams, just like people in a large factory that manufactures a product. Some people have a very specialized job. They take a half-finished product from the person on their left, add a part to it, and pass it on to the person on their right, who adds another part. There are also production managers walking around, controlling the pace of production and making sure that no step is too fast and no step is too slow. There are sales managers who monitor the demand for the product on the market and pass this message on to the supply manager. The supply manager then brings more or less raw material to the production manager, who tunes the speed of production up or down.
A biological pathway is similar. It is a series of actions among molecules in a cell that leads to a certain product or a change in the cell. There are three main types of pathways: metabolic pathways, which are like a factory’s product assembly line; gene regulation pathways, which are like production management; and signal transduction pathways, which are like monitoring the market and sales and transmitting that information to the supply manager and the production manager.
Why do we need to have pathway databases?
Experimental biologists spend lots of time and effort on discovering new components of pathways, new connections between the components, and even brand new pathways. However, this knowledge used to be scattered across different papers in different formats, which made it hard to find. In the past few decades, some good-hearted bioinformatics scientists have taken the trouble to collect this knowledge into databases with graphical interfaces, so that biologists can now easily learn about hundreds of pathways by simply pointing and clicking on their computer. In addition, as we will see later in this week’s lectures, pathway databases also enable computations and analyses that can discover important patterns above the level of individual genes.
So what pathway databases are out there? We list here a few of the main ones, including KEGG, BioCarta, BioCyc, PANTHER, PID, and Reactome. These are all great resources. In this unit we’ll take KEGG as an example.
KEGG stands for Kyoto Encyclopedia of Genes and Genomes. It organizes data in several overlapping ways, including pathways, diseases, drugs, compounds, and so on. We will focus on KEGG pathways here. As of 2013 there are 450 reference pathways in KEGG. KEGG PATHWAY is divided into seven categories: metabolism, genetic information processing, environmental information processing, cellular processes, organismal systems, human diseases, and drug development. Some of these categories, such as human diseases, are relatively new and not yet comprehensive. But categories like metabolism are extremely useful. Each category is organized in a hierarchical structure. For instance, carbohydrate metabolism is a kind of metabolism, and starch and sucrose metabolism is a kind of carbohydrate metabolism.
KEGG is a collection of mostly manually drawn pathway maps like this Starch and Sucrose Metabolism pathway. The nodes marked with a rectangle are gene products, mostly proteins but sometimes RNAs.
In this pathway, most of the nodes are enzymes. Small circles denote other molecules, mostly chemical compounds such as the substrates. The pathway is also linked out to other upstream and downstream pathways. As you can see, there are many interactions between the nodes.
Let’s look at the interactions more closely. There are several types of interactions in the pathways. The first type is protein-protein interactions. These include phosphorylation and dephosphorylation, marked with a +p or -p on the arrow, and ubiquitination, glycosylation, and methylation, marked with a +u, +g, or +m on the arrow. Activation and inhibition are marked with a standard arrowhead or a T head. Other types of effect, such as indirect effect, state change, binding/association, and dissociation, are marked with different arrows. Finally, protein complexes are shown as grids. A second type of interaction in the pathways is gene expression regulation, including expression and repression, either through a chemical compound or directly, as well as indirect regulatory effects. A third type of interaction is enzyme-enzyme relationships, such as the two successive reaction steps shown here.
This is a pathway entry page that you see on the web. The pathway entry is stored in two formats. The first is a simple flat-file format very similar to what you see on the web. The second, more informative format is the KEGG Markup Language, or KGML. Here a pathway has properties such as its basic information: name, organism, number, and so on. It defines a number of entries as well as reactions and relations. The substrate and product are defined as features of reactions. In order to generate the nice graphical representation of a pathway that we looked at earlier, one feature of each entry is the graphics, including the coordinates, shape, size, and color. The KGML format is stored in the computer like this. An example of an entry is shown here, following the format on the last slide.
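To make the KGML description above more concrete, here is a minimal sketch in Python, not the course's own material. It parses a small hand-written KGML-style fragment with the standard library and pulls out the entries (with their graphics), the relations (with their arrow labels such as +p), and the reactions (with substrate and product). The pathway, gene, compound, and reaction identifiers (`toy00001`, `toy:geneA`, `cpd:X1`, `rn:R1`) and all coordinates are made up for illustration; a real KGML file carries more attributes than shown here.

```python
import xml.etree.ElementTree as ET

# Hand-written KGML-style fragment. All IDs and coordinates are hypothetical.
KGML_TEXT = """
<pathway name="path:toy00001" org="toy" number="00001" title="Toy pathway">
  <entry id="1" name="toy:geneA" type="gene" reaction="rn:R1">
    <graphics name="geneA" type="rectangle" x="100" y="50" width="46" height="17"
              fgcolor="#000000" bgcolor="#BFFFBF"/>
  </entry>
  <entry id="2" name="toy:geneB" type="gene">
    <graphics name="geneB" type="rectangle" x="200" y="50" width="46" height="17"
              fgcolor="#000000" bgcolor="#BFFFBF"/>
  </entry>
  <entry id="3" name="cpd:X1" type="compound">
    <graphics name="X1" type="circle" x="60" y="50" width="8" height="8"
              fgcolor="#000000" bgcolor="#FFFFFF"/>
  </entry>
  <entry id="4" name="cpd:X2" type="compound">
    <graphics name="X2" type="circle" x="150" y="50" width="8" height="8"
              fgcolor="#000000" bgcolor="#FFFFFF"/>
  </entry>
  <relation entry1="1" entry2="2" type="PPrel">
    <subtype name="activation" value="--&gt;"/>
    <subtype name="phosphorylation" value="+p"/>
  </relation>
  <reaction id="1" name="rn:R1" type="irreversible">
    <substrate id="3" name="cpd:X1"/>
    <product id="4" name="cpd:X2"/>
  </reaction>
</pathway>
"""


def summarize(kgml_text):
    """Collect entries, relations, and reactions from a KGML-style string."""
    root = ET.fromstring(kgml_text)

    # Each entry carries a <graphics> child with shape and coordinates.
    entries = {}
    for entry in root.findall("entry"):
        graphics = entry.find("graphics")
        entries[entry.get("id")] = {
            "name": entry.get("name"),
            "type": entry.get("type"),
            "shape": graphics.get("type"),
            "xy": (int(graphics.get("x")), int(graphics.get("y"))),
        }

    # A relation links two entries; its subtypes hold the arrow labels.
    relations = []
    for rel in root.findall("relation"):
        labels = [sub.get("value") for sub in rel.findall("subtype")]
        relations.append((
            entries[rel.get("entry1")]["name"],
            entries[rel.get("entry2")]["name"],
            rel.get("type"),
            labels,
        ))

    # Substrates and products are children of each reaction.
    reactions = []
    for rxn in root.findall("reaction"):
        reactions.append((
            rxn.get("name"),
            [s.get("name") for s in rxn.findall("substrate")],
            [p.get("name") for p in rxn.findall("product")],
        ))

    return {"title": root.get("title"), "entries": entries,
            "relations": relations, "reactions": reactions}


summary = summarize(KGML_TEXT)
print(summary["title"])
for source, target, kind, labels in summary["relations"]:
    print(source, "->", target, kind, labels)
for name, substrates, products in summary["reactions"]:
    print(name, substrates, "=>", products)
```

The point of the sketch is that rectangles, circles, arrows, and arrow labels on the map are all plain attributes in the file, so a program can read them as easily as a person reads the picture.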
KEGG pathways can be browsed through their hierarchical structure. You can also search for a pathway of interest.
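Browsing and searching a hierarchy like this can be sketched with a nested dictionary. This is only an illustration of the idea, not how KEGG itself stores its data. The tree below holds the seven top-level categories named in this unit, and only the branch leading to Starch and Sucrose Metabolism is filled in; the names `TREE` and `find_paths` are made up for this example.

```python
# A tiny slice of the category tree mentioned in the lecture.
# Only the branch down to "Starch and sucrose metabolism" is filled in.
TREE = {
    "Metabolism": {
        "Carbohydrate metabolism": {
            "Starch and sucrose metabolism": {},
        },
        "Energy metabolism": {},
    },
    "Genetic information processing": {},
    "Environmental information processing": {},
    "Cellular processes": {},
    "Organismal systems": {},
    "Human diseases": {},
    "Drug development": {},
}


def find_paths(tree, keyword, prefix=()):
    """Yield the full path to every node whose name contains the keyword."""
    for name, children in tree.items():
        path = prefix + (name,)
        if keyword.lower() in name.lower():
            yield path
        # Keep descending so that matches deeper in the tree are found too.
        yield from find_paths(children, keyword, path)


for path in find_paths(TREE, "sucrose"):
    print(" > ".join(path))
```

Browsing corresponds to walking down the dictionary one level at a time, and searching corresponds to the recursive scan, which reports where in the hierarchy each match sits.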
I mentioned earlier that an entry has a feature called graphics. A user can input a list of genes to highlight by specifying each gene ID together with a background and a foreground color, and those genes will be highlighted in the pathway map. So you see, the underlying computer representation of the pathways enables a flexible and user-friendly interface. You can’t do this kind of searching and browsing easily if you store your data in free-text files without a well-defined data structure.
Speaking of structure, KEGG also defines something similar to the Gene Ontology, called the KEGG Orthology, or KO, which describes gene functions in a hierarchical controlled vocabulary. KO has four flat levels. The top level, shown here, has a few very broad categories. If you click on “Metabolism”, you’ll see that it has a number of sub-categories such as carbohydrate metabolism, energy metabolism, and so on. If you click on “Carbohydrate Metabolism”, you’ll see a list of different types of carbohydrate metabolism. If you click on “Starch and Sucrose Metabolism”, you’ll see the bottom level of KO, which is a list of gene products that have functional roles in starch and sucrose metabolism.
To end this unit, here are some summary questions for you to think about. We have talked at length about how Gene Ontology and KEGG pathways are represented in the computer. But how are individual genes associated with Gene Ontology terms and KEGG pathways? That is the topic of our next unit. I look forward to seeing you then.
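As a small follow-up to the gene highlighting described in this unit, here is a rough sketch of the idea in Python. It is not KEGG's own tool: it simply takes a mapping from gene ID to background and foreground color and rewrites the graphics colors of matching entries in a KGML-style document held in memory. The gene IDs (`toy:geneA`, `toy:geneB`, `toy:geneZ`) and the function name `highlight` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written KGML-style fragment with made-up gene IDs.
KGML_TEXT = """
<pathway name="path:toy00001" title="Toy pathway">
  <entry id="1" name="toy:geneA" type="gene">
    <graphics name="geneA" type="rectangle" fgcolor="#000000" bgcolor="#BFFFBF"/>
  </entry>
  <entry id="2" name="toy:geneB" type="gene">
    <graphics name="geneB" type="rectangle" fgcolor="#000000" bgcolor="#BFFFBF"/>
  </entry>
</pathway>
"""


def highlight(root, colors):
    """Recolor entries in place.

    colors maps a gene ID to a (background, foreground) pair.
    Returns the gene IDs that were actually found in the pathway.
    """
    found = []
    for entry in root.findall("entry"):
        # An entry name may list several gene IDs separated by spaces.
        for gene_id in entry.get("name", "").split():
            if gene_id in colors:
                background, foreground = colors[gene_id]
                graphics = entry.find("graphics")
                graphics.set("bgcolor", background)
                graphics.set("fgcolor", foreground)
                found.append(gene_id)
                break  # one color per entry is enough
    return found


root = ET.fromstring(KGML_TEXT)
requested = {
    "toy:geneB": ("#FF0000", "#FFFFFF"),  # present in the pathway
    "toy:geneZ": ("#0000FF", "#FFFFFF"),  # not in the pathway, so ignored
}
hits = highlight(root, requested)
print("highlighted:", hits)
```

Because the colors live in a well-defined attribute of each entry, highlighting is just a lookup and two attribute updates, which would be much harder to do reliably on a free-text description of the same pathway.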

Danny Hutson
