Gene Maps


The text file 'gene2gene.idx' (not shown, confidential source) contains data showing the co-occurrence of gene pairs in a medical literature database (MEDLINE). That is, this file contains entries of the following format, for a set of 11833 genes:

GLB1	GLB1;13;ACP1;2;GLA;2;GLO1;2;GANAB;1;ACY1;1;CYP1A1;1;PPGB;1;NEU1;1;ADK;1;GUSB;1;PEPD;1;UMPH2;1;EBP;1;FUCA1;1;ELN;1;TF;1;IDH1;1;GOT1;1;GOT2;1;HEXA;1;MAN2C1;1;COL10A1;1;ACP2;1;PGK1;1;PGM2;1

The first column contains the symbol for a particular gene, and the second column contains a list of genes which appear in the same articles in the medical database, along with a count of the number of such articles containing each gene pair, sorted in order of decreasing frequency. This information can be used to construct a "gene map" of the closest neighbors to a specific gene, where such a map is embedded in "literature space". To construct this map, a matrix containing all the neighbors of a specific gene and the distances to each of those neighbors and between each of them (where known), is constructed from the data file. Here is a matrix for the gene GLB1, selected from the table at random:

26 points
GLB1, ACP1, GLA, GLO1, GANAB, ACY1, CYP1A1, PPGB, NEU1, ADK, GUSB, PEPD, UMPH2, EBP, FUCA1, ELN, TF, IDH1, GOT1, GOT2, HEXA, MAN2C1, COL10A1, ACP2, PGK1, PGM2
13, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
2, 196, 1, 47, -1, -1, -1, -1, -1, 1, -1, 1, 1, -1, -1, -1, 3, 3, 1, 5, -1, -1, -1, 12, 1, 14
2, 1, 65, 1, 1, 1, -1, -1, -1, 1, 2, 1, -1, -1, 1, -1, -1, -1, -1, 1, 1, 1, -1, -1, 5, 1
2, 47, 1, 163, -1, -1, -1, -1, -1, 1, -1, 1, 1, -1, -1, -1, 6, 1, 1, 2, -1, -1, -1, 1, 1, 14
1, -1, 1, -1, 1, -1, -1, -1, -1, -1, 1, -1, -1, -1, 1, -1, -1, -1, -1, -1, 1, 1, -1, -1, -1, -1
1, -1, 1, -1, -1, 26, -1, -1, -1, 1, 1, 1, -1, -1, -1, -1, 1, 1, 2, 1, -1, -1, -1, 3, -1, 2
1, -1, -1, -1, -1, -1, 1646, -1, -1, -1, -1, -1, -1, -1, -1, -1, 2, -1, -1, -1, -1, -1, 1, -1, -1, -1
1, -1, -1, -1, -1, -1, -1, 37, 11, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1
1, -1, -1, -1, -1, -1, -1, 11, 6090, -1, -1, -1, -1, -1, -1, 1, 67, -1, -1, -1, -1, -1, -1, -1, -1, 1
1, 1, 1, 1, -1, 1, -1, -1, -1, 22, 1, 2, -1, -1, -1, -1, -1, -1, 1, 2, -1, -1, -1, 1, 1, 2
1, -1, 2, -1, 1, 1, -1, -1, -1, 1, 39, 1, -1, -1, 1, -1, -1, 1, 1, 1, 2, 1, -1, 1, -1, 1
1, 1, 1, 1, -1, 1, -1, -1, -1, 2, 1, 16, 1, -1, -1, -1, 1, -1, 1, 2, -1, -1, -1, 1, 1, 2
1, 1, -1, 1, -1, -1, -1, -1, -1, -1, -1, 1, 9, -1, -1, -1, -1, 1, 1, -1, -1, -1, -1, 1, -1, -1
1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 138, -1, 15, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1
1, -1, 1, -1, 1, -1, -1, -1, -1, -1, 1, -1, -1, -1, 20, -1, -1, -1, -1, -1, 1, 1, -1, -1, -1, -1
1, -1, -1, -1, -1, -1, -1, -1, 1, -1, -1, -1, -1, 15, -1, 2402, 3, -1, -1, -1, -1, -1, -1, -1, -1, -1
1, 3, -1, 6, -1, 1, 2, -1, 67, -1, -1, 1, -1, -1, -1, 3, 10408, -1, -1, -1, -1, -1, -1, -1, -1, 6
1, 3, -1, 1, -1, 1, -1, -1, -1, -1, 1, -1, 1, -1, -1, -1, -1, 51, 1, -1, 1, -1, -1, 2, 1, 3
1, 1, -1, 1, -1, 2, -1, -1, -1, 1, 1, 1, 1, -1, -1, -1, -1, 1, 33, 2, -1, -1, -1, 3, -1, 3
1, 5, 1, 2, -1, 1, -1, -1, -1, 2, 1, 2, -1, -1, -1, -1, -1, -1, 2, 14, -1, -1, -1, 1, 1, 2
1, -1, 1, -1, 1, -1, -1, -1, -1, -1, 2, -1, -1, -1, 1, -1, -1, 1, -1, -1, 84, 1, -1, 1, 1, -1
1, -1, 1, -1, 1, -1, -1, -1, -1, -1, 1, -1, -1, -1, 1, -1, -1, -1, -1, -1, 1, 2, -1, -1, -1, -1
1, -1, -1, -1, -1, -1, 1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 37, -1, -1, -1
1, 12, -1, 1, -1, 3, -1, -1, -1, 1, 1, 1, 1, -1, -1, -1, -1, 2, 3, 1, 1, -1, -1, 30, -1, 2
1, 1, 5, 1, -1, -1, -1, -1, -1, 1, -1, 1, -1, -1, -1, -1, -1, 1, -1, 1, 1, -1, -1, -1, 599, 1
1, 14, 1, 14, -1, 2, -1, -1, 1, 2, 1, 2, -1, -1, -1, -1, 6, 3, 3, 2, -1, -1, -1, 2, 1, 117

Each element in the matrix contains a count of the number of medical articles which mention two of the genes in the list shown above: the element (i,j) counts the co-occurrence of the ith and jth genes in the list. Elements marked -1 refer to gene pairs which are not mentioned together in an article. Thus, the table is both symmetrical and sparse. A C program written to access the table and construct this matrix is very fast: it can launch, read and parse 11833 gene entries, and perform searches for all the neighbors of a specific gene in about 2 seconds on a 733 MHz Macintosh G4.

To convert the citation count matrix above into a distance matrix, the following conversion formula is used:

dist(i,j) = max(citations(x,y)) / citation(i,j), i != j, for all x,y

That is, each citation count is replaced by its inverse multiplied by the maximum number of off-diagonal counts (since we are interested only in counts between different genes). Distances along the diagonal (between a gene and itself) are set to 0. Thus, the resulting table contains distance entries which are between 0 and 1:

0, 0.5, 0.5, 0.5, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
0.5, 0, 1, 0.0212766, -1, -1, -1, -1, -1, 1, -1, 1, 1, -1, -1, -1, 0.333333, 0.333333, 1, 0.2, -1, -1, -1, 0.0833333, 1, 0.0714286
0.5, 1, 0, 1, 1, 1, -1, -1, -1, 1, 0.5, 1, -1, -1, 1, -1, -1, -1, -1, 1, 1, 1, -1, -1, 0.2, 1
0.5, 0.0212766, 1, 0, -1, -1, -1, -1, -1, 1, -1, 1, 1, -1, -1, -1, 0.166667, 1, 1, 0.5, -1, -1, -1, 1, 1, 0.0714286
1, -1, 1, -1, 0, -1, -1, -1, -1, -1, 1, -1, -1, -1, 1, -1, -1, -1, -1, -1, 1, 1, -1, -1, -1, -1
1, -1, 1, -1, -1, 0, -1, -1, -1, 1, 1, 1, -1, -1, -1, -1, 1, 1, 0.5, 1, -1, -1, -1, 0.333333, -1, 0.5
1, -1, -1, -1, -1, -1, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0.5, -1, -1, -1, -1, -1, 1, -1, -1, -1
1, -1, -1, -1, -1, -1, -1, 0, 0.0909091, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1
1, -1, -1, -1, -1, -1, -1, 0.0909091, 0, -1, -1, -1, -1, -1, -1, 1, 0.0149254, -1, -1, -1, -1, -1, -1, -1, -1, 1
1, 1, 1, 1, -1, 1, -1, -1, -1, 0, 1, 0.5, -1, -1, -1, -1, -1, -1, 1, 0.5, -1, -1, -1, 1, 1, 0.5
1, -1, 0.5, -1, 1, 1, -1, -1, -1, 1, 0, 1, -1, -1, 1, -1, -1, 1, 1, 1, 0.5, 1, -1, 1, -1, 1
1, 1, 1, 1, -1, 1, -1, -1, -1, 0.5, 1, 0, 1, -1, -1, -1, 1, -1, 1, 0.5, -1, -1, -1, 1, 1, 0.5
1, 1, -1, 1, -1, -1, -1, -1, -1, -1, -1, 1, 0, -1, -1, -1, -1, 1, 1, -1, -1, -1, -1, 1, -1, -1
1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0, -1, 0.0666667, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1
1, -1, 1, -1, 1, -1, -1, -1, -1, -1, 1, -1, -1, -1, 0, -1, -1, -1, -1, -1, 1, 1, -1, -1, -1, -1
1, -1, -1, -1, -1, -1, -1, -1, 1, -1, -1, -1, -1, 0.0666667, -1, 0, 0.333333, -1, -1, -1, -1, -1, -1, -1, -1, -1
1, 0.333333, -1, 0.166667, -1, 1, 0.5, -1, 0.0149254, -1, -1, 1, -1, -1, -1, 0.333333, 0, -1, -1, -1, -1, -1, -1, -1, -1, 0.166667
1, 0.333333, -1, 1, -1, 1, -1, -1, -1, -1, 1, -1, 1, -1, -1, -1, -1, 0, 1, -1, 1, -1, -1, 0.5, 1, 0.333333
1, 1, -1, 1, -1, 0.5, -1, -1, -1, 1, 1, 1, 1, -1, -1, -1, -1, 1, 0, 0.5, -1, -1, -1, 0.333333, -1, 0.333333
1, 0.2, 1, 0.5, -1, 1, -1, -1, -1, 0.5, 1, 0.5, -1, -1, -1, -1, -1, -1, 0.5, 0, -1, -1, -1, 1, 1, 0.5
1, -1, 1, -1, 1, -1, -1, -1, -1, -1, 0.5, -1, -1, -1, 1, -1, -1, 1, -1, -1, 0, 1, -1, 1, 1, -1
1, -1, 1, -1, 1, -1, -1, -1, -1, -1, 1, -1, -1, -1, 1, -1, -1, -1, -1, -1, 1, 0, -1, -1, -1, -1
1, -1, -1, -1, -1, -1, 1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0, -1, -1, -1
1, 0.0833333, -1, 1, -1, 0.333333, -1, -1, -1, 1, 1, 1, 1, -1, -1, -1, -1, 0.5, 0.333333, 1, 1, -1, -1, 0, -1, 0.5
1, 1, 0.2, 1, -1, -1, -1, -1, -1, 1, -1, 1, -1, -1, -1, -1, -1, 1, -1, 1, 1, -1, -1, -1, 0, 1
1, 0.0714286, 1, 0.0714286, -1, 0.5, -1, -1, 1, 0.5, 1, 0.5, -1, -1, -1, -1, 0.166667, 0.333333, 0.333333, 0.5, -1, -1, -1, 0.5, 1, 0

Note that while the distance matrix is computed from citation information contained in the literature, it could also be derived from taxonomic distance data, gene or protein sequence alignments (using BLOSSUM or PAM substitution values), or any other reasonable definition of "distance".

To construct a gene map in 2d, an iterative Monte Carlo approach is used by an X-Windows C program. First, each gene in the list is assigned a random point (x(i),y(i)) in 2d "literature space". Then, iterations are performed in which each gene is moved to a slightly different location in a random fashion, if the new location decreases the value of the following "error function" for the entire network of genes:

error = sum((sqrt((x(i)-x(j))^2 + (y(i)-y(j))^2) - dist(i,j))^2) / n, n for all dist(i,j) > 0

This error function computes the average squared difference between the current 2d distance between two genes in the graph, and the actual distance between the genes as specified by the matrix constructed from the table above. This form of error function is a typical one used in dynamic optimization problems. As the value of the error function falls, the positions of the genes in the 2d graph more closely approximate the actual distances between them as given by the table. When the error function reaches its minimum value, the resulting 2d graph is "optimal". Note that the optimal graph might not have an error equal to 0, since it may not be possible to correctly position all genes in 2d Euclidean space. Also, there may be more than one optimal solution, each placing some genes at different locations. Thus, the optimization process really is an approximation.

A final 2d configuration for the GLB1 gene and its 24 nearest neighbors is shown here:

The graph optimization process can also be performed in 3 dimensions, using an error function which is a direct extension of the one listed above. Performing the optimization in 3d gives each gene more "room" to move around, such that the final positions of each gene may significantly reduce the overall error. Here is a 3d gene map for GLB1. The plot can be rotated and zoomed to see it from any (altitude,azimuth) angles:

The graph optimization process can be performed in any number of dimensions (e.g. >3). The resulting errors will be correspondingly less, and the final gene positions more accurate. However, it is difficult to draw a plot having more than 3 dimensions, since some 3d subset of 4 or more dimensions must then be used. For example, there are 4 ways to choose 3 out of 4 dimensions to draw, and 2 different chiralities (orientations) for each of those. Nevertheless, since it is not known what the actual "information dimensionality" of the genome might be, it may be useful to explore higher dimensional gene map optimizations in further research.

Here is another 3d gene map, showing the nearest neighbors of GANAB:

This second gene was chosen because it shares some of the same neighbors as GLB1. Also shown in this plot are the final optimized distances between genes, and the actual distances from the table in parens. Here is a 3d map from a third gene, HEXA, which also shares some of the same neighbors as GLB1 and GANAB:

If several genes share some of the same neighbors, it should be possible to use these common genes as a set of reference guides to align two or more gene maps, and thus to extend the overall "gene atlas". By creating optimized 3d maps of each of the 11833 genes in the 'gene2gene.idx' table, it should be possible to create a single "star map" which contains the relative locations of each gene in the same coordinate system. This altas could then be used, for example, to see which genes cluster closely together, and to predict similarities in gene function which have not yet been discovered or reported in the medical literature.

To explore this possibility, another X-Windows C program was written to perform partially automated alignment of two gene maps which contain at least two genes in common. First, the final 3d coordinates of two optimized maps are written to text files, and then these coordinates are read by the alignment program. This program then performs a superimposition of both sets of genes based on their coordinates. The two central genes of each map are automatically aligned onto each other, such that the line segment between them is also aligned along the Z axis of the plot. This alignment is performed in the following steps:

  1. Both sets of gene coordinates are translated (moved) so that the first common gene is at the origin: (0,0,0).
  2. Each set of points is then rotated about the Z axis so that the line segment between the two common genes lies in the XZ plane.
  3. Each set of points is then rotated about the Y axis so that the line segment between the two common genes lies directly along the Z axis (is vertical).
  4. The second set of points is radially scaled (multilpied by) the ratio of the lengths of the common gene segments, so that the two common genes in both maps are aligned on top of each other, and all distances between genes are at the same scale.
Then, the combined plot is manually aligned by rotating the second set of points about the Z axis until one or more additional common points are visually aligned. For example, here is an initial automatic alignment of the GLB1 and GANAB maps, shown from above (looking down the Z axis). The GLB1 and GANAB genes are superimposed in both maps:

Here is the alignment after the GANAB map has been rotated so that the gene GUSB is also nearly superimposed in both maps:

Note that in this alignment (now viewed at alt=30,az=10), the genes GLA and HEXA are also aligned:

Also, although the two additional genes FUCA1 and MAN2C1 are nearly aligned, they appear to be interchaged in the two corresponding maps. This is because there are two optimal solutions to the final GANAB map, in which the FUCA1 and MAN2C1 genes are interchanged:

Here is the result of aligning the GANAB and HEXA 3d maps about GANAB, HEXA, GUSB and either FUCA1/FUCA1 or MAN2C1/FUCA1:



©Copyright Sky Coyote, 2002-5