GPMLChangeProposal
From PathVisio Wiki
Proposed Changes to GPML
Feel free to discuss or add more proposals. I would like to hold off on the next GPML update until the next major release of PathVisio.
Changes for M7
License attribute
A new "license" column was recently added to the INFO table in the MAPP format. GPML should store this information as well; the easiest way to do this is with a new attribute on the Pathway element.
Implementation:
Add attribute "License" to element "Pathway".
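A minimal sketch of what writing the new attribute could look like, assuming a namespace-free GPML fragment. The pathway name, the license string and the helper name set_license are made up for illustration; only the attribute name "License" comes from the proposal.

```python
import xml.etree.ElementTree as ET

# Hypothetical GPML fragment; real files also declare an XML namespace.
gpml = '<Pathway Name="Example pathway"><DataNode TextLabel="Gene1"/></Pathway>'


def set_license(pathway, license_text):
    """Store the license text in a "License" attribute on the Pathway element."""
    if pathway.tag != "Pathway":
        raise ValueError("expected a Pathway element")
    pathway.set("License", license_text)
    return pathway


root = set_license(ET.fromstring(gpml), "CC BY 3.0")
print(ET.tostring(root, encoding="unicode"))
```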
Arbitrary attribute-value pairs
It would be good to store arbitrary metadata, similar to Cytoscape. ... When we do this, we might also be able to replace some of the infrequently used attributes with attribute-value pairs.
Implementation:
Add a new second level element "Attribute":
- Attribute
- Key: The name of the attribute
- Value: The value of the attribute
The Attribute element can be a subelement of:
- Pathway
- DataNode
- Line
- Shape
- Group
- Link
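As a rough illustration, the key/value pairs could be read back into a dictionary per element. This is only a sketch: the keys and values below are invented, and the fragment ignores the GPML namespace.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment using the proposed Attribute element (Key/Value).
gpml = """
<Pathway Name="Example">
  <Attribute Key="curator" Value="someone"/>
  <DataNode TextLabel="Gene1">
    <Attribute Key="confidence" Value="0.8"/>
    <Attribute Key="source" Value="manual"/>
  </DataNode>
</Pathway>
"""


def attributes_of(element):
    """Collect the direct Attribute children of one element into a dict."""
    return {a.get("Key"): a.get("Value") for a in element.findall("Attribute")}


root = ET.fromstring(gpml)
pathway_attrs = attributes_of(root)
node_attrs = attributes_of(root.find("DataNode"))
print(pathway_attrs, node_attrs)
```

Only direct children are collected, so the attributes of a DataNode do not leak into those of the Pathway.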
Z Order
Add a "ZOrder" attribute to each Shape, DataNode, Label and Line element. This attribute stores the drawing order of the object. Changing the drawing order is already supported by the Java code, but is lost when saving to GPML.
Implementation:
Add attribute "ZOrder" to elements "Shape", "DataNode", "Label" and "Line"
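A sketch of how a reader could restore the drawing order from the stored values. It assumes that lower ZOrder values are drawn first and that a missing ZOrder counts as 0; neither is fixed by the proposal. The GraphId values are placeholders.

```python
import xml.etree.ElementTree as ET

gpml = """
<Pathway>
  <Shape GraphId="s1" ZOrder="2"/>
  <DataNode GraphId="d1" ZOrder="3"/>
  <Line GraphId="l1" ZOrder="1"/>
  <Label GraphId="t1"/>
</Pathway>
"""

# Elements that carry ZOrder according to the implementation note.
DRAWABLE = {"Shape", "DataNode", "Label", "Line"}


def drawing_order(pathway, default=0):
    """Return drawable elements sorted back-to-front (assumed: low ZOrder first)."""
    drawable = [e for e in pathway if e.tag in DRAWABLE]
    return sorted(drawable, key=lambda e: int(e.get("ZOrder", default)))


order = [e.get("GraphId") for e in drawing_order(ET.fromstring(gpml))]
print(order)
```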
State variables
A protein or gene can be in different states (e.g. phosphorylation, glycosylation, active/inactive, open/closed). A state variable can be represented as a small shape on the border of the DataNode that indicates the state of the DataNode instance. GPML should support a state attribute for DataNode elements to enable representation of post-translational modifications and for SBGN compatibility (see section 2.3.11 of the specification; note that I adopted the name "state variables" from SBGN).
Implementation:
Add a new second level element "StateVariable":
- StateVariable
  - subelement Graphics
    - Contains all style/color attributes from Shape->Graphics
    - RelX, RelY: the placement of the shape, relative to the parent DataNode
  - Shape: shape to display, same as Shape->Type
  - TextLabel: text label to display
  - subelement Comment
The "StateVariable" element can be a subelement of "DataNode":
<DataNode ...>
<Graphics .../>
<StateVariable TextLabel="P@132" Shape="Ellipse">
<Graphics .../>
</StateVariable>
</DataNode>
TODO: Should StateVariable have a 'type' attribute that specifies the type (phosphorylation, activation state, conformation state, etc.)? Or should this be derived from the shape type or text label?
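To make the RelX/RelY idea concrete, here is one possible interpretation as a sketch. It assumes that (0, 0) is the center of the parent DataNode and that -1 and 1 are its edges; the proposal itself does not define the coordinate convention, and the function name is hypothetical.

```python
def state_position(center_x, center_y, width, height, rel_x, rel_y):
    """Absolute center of a state variable, given the parent DataNode geometry.

    Assumed convention: rel_x/rel_y of 0 is the parent's center and
    -1/+1 are its left/right (or top/bottom) borders.
    """
    return (center_x + rel_x * width / 2.0,
            center_y + rel_y * height / 2.0)


# A state variable on the top-right corner of a 1000 x 300 DataNode.
pos = state_position(2550.0, 1050.0, 1000.0, 300.0, 1.0, -1.0)
print(pos)
```

With a relative convention like this, the state variable stays attached to the border when the parent DataNode is moved or resized.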
Pending changes
Reduce Label and DataNode to Shape
The Label element can be reduced to Shape and the DataNode element can be reduced to a subclass of Shape. Besides making the XML schema simpler and easier to understand, this would enable:
- Adding text to shapes, or changing label outline to any shape
- Customization of DataNode appearance (e.g. font, fill color or shape)
Implementation:
- Add the attributes from "Label" to "Shape":
  - TextLabel
  - FontName
  - FontStyle
  - FontDecoration
  - FontStrikethru
  - FontWeight
  - FontSize
- Alternatively, the current "Label" could become a subelement of "Shape", to group the text related attributes.
- Remove the old "Label" element
- Make "DataNode" extend "Shape" (is this possible in xml-schema?), to remove the redundant graphics and style attribute definitions.
Remove some deprecated stuff
There are some deprecated attributes that we can probably remove fairly easily as soon as we find that no pathways in WikiPathways make use of them anymore.
TODO: Make a list of deprecated attributes.
Improved definition of links
We added facilities for describing links between pathways in the past, but they are currently unused.
The element to describe a link is named "Link". Links are very similar to Labels; the only difference is that links have an optional href attribute. We haven't put a lot of thought into the href attribute yet. Is it a URL, a filename, a pathway title, or something else?
I propose to rename href to PathwayRef and make use of the stable pathway IDs. PathwayRef is better than Href because it prevents confusion with the href in HTML. Href is currently unused in both PathVisio and GenMAPP, so we should not really inconvenience anybody if we remove it. At the same time I propose to remove the separate Link element. Labels will automatically be links if they have a PathwayRef attribute. This will simplify the implementation.
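A sketch of the "a Label with a PathwayRef is a link" rule. The fragment is hypothetical and namespace-free, and "WP_pathway_id" is just a placeholder for a stable pathway ID.

```python
import xml.etree.ElementTree as ET

gpml = """
<Pathway>
  <Label TextLabel="Apoptosis" PathwayRef="WP_pathway_id"/>
  <Label TextLabel="Nucleus"/>
</Pathway>
"""


def linked_pathways(pathway):
    """Map label text to the referenced pathway ID for every Label that is a link."""
    return {
        label.get("TextLabel"): label.get("PathwayRef")
        for label in pathway.findall("Label")
        if label.get("PathwayRef")  # no PathwayRef means a plain label
    }


links = linked_pathways(ET.fromstring(gpml))
print(links)
```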
Support for reference pathways
Pathways derived from reference pathways should also record this in GPML. I would suggest storing external and internal sources in the same way, by adding a 'derived-from' child element to Pathway. This could be used to store external sources:
<derived-from source="Reactome" source-pathway="Reactome_pathway_id" date="19830402"/>
and internal ones:
<derived-from source="WikiPathways" source-pathway="WP_pathway_id" date="19830402" />
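A sketch of how a tool might read such provenance records. It assumes the date is written as yyyymmdd, which the example value suggests but the proposal does not state; the function name is hypothetical.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

gpml = """
<Pathway>
  <derived-from source="Reactome" source-pathway="Reactome_pathway_id" date="19830402"/>
</Pathway>
"""


def provenance(pathway):
    """Return one record per derived-from child element."""
    records = []
    for derived in pathway.findall("derived-from"):
        records.append({
            "source": derived.get("source"),
            "pathway": derived.get("source-pathway"),
            # Assumed date format: yyyymmdd.
            "date": datetime.strptime(derived.get("date"), "%Y%m%d").date(),
        })
    return records


records = provenance(ET.fromstring(gpml))
print(records)
```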
Categories
...
Stable IDs
...
Finished Changes to GPML
Archive of past ideas after they have been implemented.
1. GroupID
We'd like to store grouping information.
Proposal:
- each object gets an optional GroupID attribute (independent of GraphID)
- each object will also get a GroupStyle attribute with three possible values: stack, box, complex. Stack is default.
2. WindowWidth and WindowHeight
- windowwidth and windowheight are not used in PathVisio, but they are still used by GenMAPP.
Proposal: Keep the attributes around for backwards compatibility, but make them optional, with good defaults.
3. MapInfoLeft / MapInfoTop
Currently, mapinfoleft / mapinfotop and infobox.centerx / infobox.centery both store the coordinates of the "infobox". This is redundant, probably caused by a past misunderstanding about the meaning of the infobox.
Proposal: mapinfoleft and mapinfotop should be removed.
4. Availability, Maintained-By
The "copyright" column currently translates to the "availability" attribute. I think copyright is actually a more descriptive name. The pathway maintainer is in the awkwardly named attribute "Maintained-By".
Proposal: rename the attribute availability back to copyright, and rename the attribute maintained-by to maintainer.
5. linetype for both start and end
There was a request made for double-ended arrows. Line currently has the Type attribute that only allows for single-ended arrows.
Proposal: move the Type attribute to the point subelement, so we can define it for each point.
example before:
<Line Type="Arrow" Style="Solid">
<Graphics Color="0000ff">
<Point x="8100.0" y="10450.0" />
<Point x="8250.0" y="10350.0" />
</Graphics>
</Line>
example after:
<Line Style="Solid">
<Graphics Color="0000ff">
<Point x="8100.0" y="10450.0" Type="Arrow" />
<Point x="8250.0" y="10350.0" Type="Arrow" />
</Graphics>
</Line>
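A sketch of how existing files could be converted. It assumes that an old single-ended line has its arrowhead at the last point and that the remaining points get a plain type called "Line"; both are assumptions, not part of the proposal.

```python
import xml.etree.ElementTree as ET

old = """
<Line Type="Arrow" Style="Solid">
  <Graphics Color="0000ff">
    <Point x="8100.0" y="10450.0" />
    <Point x="8250.0" y="10350.0" />
  </Graphics>
</Line>
"""


def move_type_to_points(line, plain="Line"):
    """Move the Type attribute from the Line element to its Point subelements."""
    line_type = line.attrib.pop("Type", plain)
    points = line.findall("Graphics/Point")
    for point in points:
        point.set("Type", plain)
    if points:
        # Assumed: the old single-ended arrow points at the last point.
        points[-1].set("Type", line_type)
    return line


line = move_type_to_points(ET.fromstring(old))
point_types = [p.get("Type") for p in line.findall("Graphics/Point")]
print(point_types)
```

A double-ended arrow would then simply have Type="Arrow" on both points.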
6. rotation and orientation
To simplify both the GPML and programming, I merged rectangles, ellipses and ovals together by putting them together in a "shape" class. I couldn't do this for braces, because they don't define "rotation" and "height", but instead use the very similar attributes "orientation" and "picpointoffset". Orientation restricts braces to straight angles (0, 90, 180, 270). Internally in the program, braces are rotated exactly in the same manner as the other shapes and there is no need to restrict it to these straight angles. On the other hand, people might prefer to only use straight angles in their pathways but this doesn't only go for braces but for rectangles and ovals as well. About "picpointoffset": it really serves exactly the same function as "height" for other shapes, so we might prefer to be consistent and rename "picpointoffset" to "height".
Proposal: Make brace a subtype of shape. Give all shapes the ability to specify either free rotation or orientation, where orientation="top" is equivalent to rotation="0.0" etc.
example before:
<Shape Type="Rectangle">
<Graphics Color="000000" Rotation="0.0" CenterX="15300.0" CenterY="8100.0" Width="7001.354" Height="1699.317" />
</Shape>
<Brace>
<Graphics Color="000000" Orientation="top" CenterX="2900.0" CenterY="2400.0" Width="4253.105" PicPointOffset="130.0" />
</Brace>
example after:
<Shape Type="Rectangle">
<Graphics Color="000000" Orientation="top" CenterX="15300.0" CenterY="8100.0" Width="7001.354" Height="1699.317" />
</Shape>
<Shape Type="Brace">
<Graphics Color="000000" Orientation="top" CenterX="2900.0" CenterY="2400.0" Width="4253.105" Height="130.0" />
</Shape>
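The equivalence between orientation and rotation could be handled with a small lookup. Only "top" = 0.0 is given above; the names "right", "bottom" and "left", their angles, and the use of degrees are assumptions made for this sketch.

```python
# Only "top" -> 0.0 is stated in the proposal; the rest is assumed.
ORIENTATION_TO_ROTATION = {"top": 0.0, "right": 90.0, "bottom": 180.0, "left": 270.0}


def rotation_of(graphics_attrs):
    """Rotation in degrees from either a Rotation or an Orientation attribute."""
    if "Rotation" in graphics_attrs:
        return float(graphics_attrs["Rotation"]) % 360.0
    return ORIENTATION_TO_ROTATION[graphics_attrs.get("Orientation", "top")]


def orientation_of(rotation):
    """Orientation name for a straight angle, or None for a free rotation."""
    for name, angle in ORIENTATION_TO_ROTATION.items():
        if abs(rotation % 360.0 - angle) < 1e-9:
            return name
    return None


print(rotation_of({"Orientation": "top"}), orientation_of(45.0))
```

This lets the program rotate every shape the same way internally while still writing a readable Orientation value whenever the angle happens to be a straight one.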
7. fixedshape and complexshape
In the current GPML, ComplexShape is used for hexagon, pentagon, triangle, vesicle and proteincomplex. FixedShape is used for OrganA, OrganB, OrganC, Ribosome and CellA. The only difference between the two is that a FixedShape has neither a height nor a width attribute, whereas ComplexShape has only a width attribute. The size of a FixedShape can't be changed at all; a ComplexShape can be resized, but the width and height are always in the same proportion (the height is not specified because it can be calculated from the width). These restrictions are present in GenMAPP as well. Is it important to maintain them?
Proposal: lift the restriction on size for complexshape and fixedshape. Both will get height and width attributes, and both can be rescaled just like any other shape.
8. Node / Edge
We'd like to make the distinction between objects that are nodes, edges and merely annotations explicit.
Proposal 1:
Each Graphics element gets an ObjectType attribute, a sibling of ColorType, with three possible values (an enumeration): node, edge, annotation.
Proposal 2:
A node is really any object that can have data mapped to it. Therefore all GeneProducts are automatically nodes too. The only issue is with labels: they can be sometimes merely annotation, sometimes they represent a metabolite or a concept that can have data mapped to it. So what we really want is to distinguish those two label types. Another problem is that Metabolites are not GeneProducts. We could generalize the case for "any object that can have data mapped to it"
| was: | becomes |
|---|---|
| GeneProduct with type "rna", "protein" or "unknown" | DataNode with type "rna", "protein", "unknown" or "geneProduct" (=both rna or protein) |
| Label for a metabolite or concept like "apoptosis" | DataNode with type "smallMolecule" or "concept" |
| Label for annotation | remains Label |
In this model, all dataNodes are nodes, all lines that link two dataNodes are edges, and everything else is annotation. Also lines that are attached at only one end to a dataNode are annotation.
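The rule for this model can be sketched as follows. The sketch assumes that a line end is attached to a DataNode through a GraphRef attribute on the Point that holds the DataNode's GraphId; that attachment mechanism is not described in this proposal, and the fragment is hypothetical.

```python
import xml.etree.ElementTree as ET

gpml = """
<Pathway>
  <DataNode GraphId="a" Type="protein"/>
  <DataNode GraphId="b" Type="smallMolecule"/>
  <Label GraphId="c" TextLabel="note"/>
  <Line GraphId="l1"><Graphics><Point GraphRef="a"/><Point GraphRef="b"/></Graphics></Line>
  <Line GraphId="l2"><Graphics><Point GraphRef="a"/><Point/></Graphics></Line>
</Pathway>
"""


def classify(pathway):
    """Label every top-level object as node, edge or annotation."""
    node_ids = {n.get("GraphId") for n in pathway.findall("DataNode")}
    roles = {}
    for element in pathway:
        if element.tag == "DataNode":
            role = "node"
        elif element.tag == "Line":
            refs = [p.get("GraphRef") for p in element.findall("Graphics/Point")]
            # An edge must be attached to a DataNode at both ends.
            both = len(refs) >= 2 and refs[0] in node_ids and refs[-1] in node_ids
            role = "edge" if both else "annotation"
        else:
            role = "annotation"
        roles[element.get("GraphId")] = role
    return roles


roles = classify(ET.fromstring(gpml))
print(roles)
```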
Proposal 3:
Geneproducts are renamed to DataNode. DataNode can have five different types: rna, protein, unknown, geneproduct or metabolite.
All top-level objects except DataNode (Rect, Oval, Line, GeneProduct, Label) get an attribute objectType with 3 possible values: node, edge, annotation. DataNodes are always of type node of course. Annotation is default for non DataNode objects.
In the future, Labels that are really metabolites should be converted to DataNodes with type "metabolite". Labels that function as nodes but don't get data mapped to them will be Labels with objectType "node". All other labels will have objectType "annotation". In similar fashion, lines can be "edge" or merely "annotation".
9. Notes versus Comments
Notes and comments are redundant. Comments are currently used more (508k vs 203k in all GPML files together).
Proposal 1: Remove notes altogether. Upon conversion from mapp to gpml, notes and comments (or remarks) are merged together.
Example before:
<notes> See Abraham et al., 2007 </notes>
<comment> The interaction between SMURF1 and CHEAPDATE is still disputed. </comment>
Example after proposal 1:
<comment> See Abraham et al., 2007 The interaction between SMURF1 and CHEAPDATE is still disputed. </comment>
Proposal 2: Allow multiple comments, and add a "source" attribute, to get more information about how and when the comment was made. This allows several people to add their own comments, and allows comments to be added in an automated fashion without messing up previous comments (for example, a text-mining script could add a PubMed reference to a certain geneproduct). This 2nd proposal is of course more complicated to program than the 1st. Another disadvantage is that it might encourage people to use comments too much, to store all kinds of biologically relevant information that we might prefer to have in BioPAX format.
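A sketch of the merge in proposal 1. It assumes notes and comment are child elements of some parent object (a hypothetical GeneProduct here) and joins the texts with a single space.

```python
import xml.etree.ElementTree as ET

before = (
    "<GeneProduct>"
    "<notes> See Abraham et al., 2007 </notes>"
    "<comment> The interaction between SMURF1 and CHEAPDATE is still disputed. </comment>"
    "</GeneProduct>"
)


def merge_notes(parent):
    """Replace all notes and comment children by a single merged comment."""
    texts = []
    for tag in ("notes", "comment"):
        for child in parent.findall(tag):  # findall returns a list, safe to remove from
            if child.text and child.text.strip():
                texts.append(child.text.strip())
            parent.remove(child)
    if texts:
        merged = ET.SubElement(parent, "comment")
        merged.text = " ".join(texts)
    return parent


merged_parent = merge_notes(ET.fromstring(before))
print(ET.tostring(merged_parent, encoding="unicode"))
```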
Example after proposal 2:
<comment source="GenMAPP notes"> See Abraham et al., 2007 </comment>
<comment source="GenMAPP comment"> The interaction between SMURF1 and CHEAPDATE is still disputed. </comment>
<comment source="Andra's textminer"> pmid=123466, 123223, 12355 </comment>
10. mixup between geneID and name
This is really a problem within PathVisio that can be mostly solved without changing the GPML.xsd. The problem in PathVisio is that the geneId attribute is used for the label, and the name attribute is used as the GeneID. This table explains this completely:
| GPML attribute | Pathvisio variable | Pathvisio property type | property panel description | used for... |
|---|---|---|---|---|
| geneID | GmmlDataObject.geneID | PropertyType.GENEID | "Label" | text to display in geneproduct box |
| Xref | GmmlDataObject.Xref | PropertyType.XREF | "Xref" | unused, currently disabled |
| Name | GmmlDataObject.geneProductName | PropertyType.NAME | Database Identifier | Database ID. |
| GeneProduct-Data-Source | GmmlDataObject.dataSource | PropertyType.GENEPRODUCT_DATA_SOURCE | Database name | System code |
However, while thinking about this it occurred to me that this is also a stylistic issue. The problem is that currently GeneProduct has a very long list of attributes, some of which have very similar meanings. By reorganizing the attributes as outlined below, I think we can make it more intuitive.
Proposal:
- use attribute "TextLabel" instead of "GeneID" for the text to display, identical to label.
- The current xref attribute is not used much for anything, in GenMAPP nor in PathVisio. Keep it for backwards compatibility but mark as deprecated and hide it in PathVisio.
- To clearly link GeneID and SystemCode, create an element xref with attributes database and id. There would be one and only one Xref for each GeneProduct.
Before:
<GeneProduct GeneID="TNFSF10" Xref="" Type="unknown" Name="8743" BackpageHead="TRAIL" GeneProduct-Data-Source="LocusLink">
<Graphics Color="000000" CenterX="2550.0" CenterY="1050.0" Width="1000.0" Height="300.0" />
</GeneProduct>
after:
<GeneProduct TextLabel="TNFSF10" Type="unknown" BackpageHead="TRAIL" >
<Xref ID="8743" Database="LocusLink" />
<Graphics Color="000000" CenterX="2550.0" CenterY="1050.0" Width="1000.0" Height="300.0" />
</GeneProduct>
Incidentally, this proposed style is also more similar to how BioPAX does it.
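The before/after pair can be expressed as a small conversion. This is a sketch only: it writes the display text to "TextLabel" as named in the proposal list, and it drops the old Xref attribute only when it is empty, since the proposal keeps it for backwards compatibility.

```python
import xml.etree.ElementTree as ET

before = (
    '<GeneProduct GeneID="TNFSF10" Xref="" Type="unknown" Name="8743" '
    'BackpageHead="TRAIL" GeneProduct-Data-Source="LocusLink">'
    '<Graphics Color="000000" CenterX="2550.0" CenterY="1050.0" Width="1000.0" Height="300.0" />'
    '</GeneProduct>'
)


def migrate(gene_product):
    """Move the display text to TextLabel and the database reference to an Xref element."""
    attrs = gene_product.attrib
    attrs["TextLabel"] = attrs.pop("GeneID", "")
    xref = ET.Element("Xref", {
        "ID": attrs.pop("Name", ""),
        "Database": attrs.pop("GeneProduct-Data-Source", ""),
    })
    if attrs.get("Xref") == "":
        del attrs["Xref"]  # empty deprecated attribute carries no information
    gene_product.insert(0, xref)
    return gene_product


gp = migrate(ET.fromstring(before))
print(ET.tostring(gp, encoding="unicode"))
```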
11. Group Elements
There might be situations where we want to include a group into a group, which is not possible with the current GroupId. E.g. when we have a group of datanodes (in order to organize them dynamically) and a circle that we want to group with these datanodes, just because we want them to move together. We want one group with style "stack" that contains the datanodes and a group without style that contains the stack group and the circle.
A solution for including a group in another group, is to have a Group element. GroupIds now link to the Group element's name and the group itself can have a GroupId too.
Here's an example:
<Pathway ...>
<DataNode TextLabel="Gene1" ... GroupId="g1">
<Graphics ... />
<Xref ... />
</DataNode>
<DataNode TextLabel="Gene2" ... GroupId="g1">
<Graphics .../>
<Xref .../>
</DataNode>
<DataNode TextLabel="Gene3" ... GroupId="g1">
<Graphics ... />
<Xref ... />
</DataNode>
<Shape Type="Oval" ... GroupId="g2">
<Graphics ... />
</Shape>
<Group name="g1" groupId="g2" style="stacked"/>
<Group name="g2"/>
</Pathway>
I think having a Group element is more intuitive than adding two separate attributes (id and style) to every GPML element (where style should be identical for all elements in a group and is therefore redundant). I think it also leaves more room for the future, in case we want to expand the possibilities of a group (e.g. linking lines to groups, or even to a specific anchor point of a group). It also makes programming more convenient, since the group is now a real entity and you don't have to deduce the group structure from your data (a disadvantage I experienced when using the GraphIds to make multi-segment lines out of separate two-point lines).
It would also allow Group properties to be stored at the group-level, such as style and center.
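With Group as a real element, finding all groups an object belongs to becomes a simple walk. The sketch below uses the attribute spellings from the example above (GroupId on objects, name and groupId on Group) and ignores the GPML namespace; the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

gpml = """
<Pathway>
  <DataNode TextLabel="Gene1" GroupId="g1"/>
  <DataNode TextLabel="Gene2" GroupId="g1"/>
  <Shape Type="Oval" GroupId="g2"/>
  <Group name="g1" groupId="g2" style="stacked"/>
  <Group name="g2"/>
</Pathway>
"""


def group_chain(pathway, element):
    """Names of the groups containing an element, innermost first."""
    parents = {g.get("name"): g.get("groupId") for g in pathway.findall("Group")}
    chain, seen = [], set()
    current = element.get("GroupId")
    while current is not None and current not in seen:  # 'seen' guards against cycles
        chain.append(current)
        seen.add(current)
        current = parents.get(current)
    return chain


root = ET.fromstring(gpml)
gene_chain = group_chain(root, root.find("DataNode"))
shape_chain = group_chain(root, root.find("Shape"))
print(gene_chain, shape_chain)
```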
12. Literature References
Let's take a look at the BioPAX definition:
Publication Xref
Definition: An xref that defines a reference to a publication such as a book, journal article, web page, or software manual. The reference may or may not be in a database, although references to PubMed are preferred when possible. The publication should make a direct reference to the instance it is attached to.
Comment: Publication xrefs should make use of PubMed IDs wherever possible. The DB property of an xref to an entry in PubMed should use the string “PubMed” and not “MEDLINE”.
Examples: PubMed:10234245
Properties: The following properties may be used when the DB and ID fields cannot be used, such as when referencing a publication that is not in PubMed. The URL property should not be used to reference publications that can be uniquely referenced using a DB, ID pair. One reason for this is that it is expected that DB, ID pairs are more stable than URLs.
- AUTHORS - The authors of this publication, one per property value.
- SOURCE - The source in which the reference was published, such as: a book title, or a journal title and volume and pages.
- TITLE - The title of the publication.
- URL - The URL at which the publication can be found, if it is available through the Web.
- YEAR - The year in which this publication was published.
An example:
<bp:publicationXref rdf:ID="Pubmed_16262255">
<bp:ID rdf:datatype="http://www.w3.org/2001/XMLSchema#string">16262255</bp:ID>
<bp:DB rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Pubmed</bp:DB>
<bp:YEAR rdf:datatype="http://www.w3.org/2001/XMLSchema#int">2005</bp:YEAR>
<bp:TITLE rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Ubiquitination of p21Cip1/WAF1 by SCFSkp2: substrate requirement and ubiquitination site selection</bp:TITLE>
<bp:AUTHORS rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Wang, W</bp:AUTHORS>
<bp:AUTHORS rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Nacusi, L</bp:AUTHORS>
<bp:AUTHORS rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Sheaff, RJ</bp:AUTHORS>
<bp:AUTHORS rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Liu, X</bp:AUTHORS>
<bp:SOURCE rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Biochemistry 44:14553-64</bp:SOURCE>
</bp:publicationXref>
We could translate this to XML by adding a 'publicationXref' element:
<publicationXref id="16262255" database="PubMed" title="..." year="..." ...>
<author>Wang, W</author>
<author>Nacusi, L</author>
</publicationXref>
Then we have two options to link it to a GPML element:
- As a reference (similar to Group)
  - Advantages:
    - similar to BioPAX
    - less redundant (multiple elements can make use of the same reference element)
  - Disadvantages:
    - need for an extra identifier that needs to be unique within the document
- As an optional nested element (similar to Graphics)
  - Advantages:
    - Element and reference are tightly coupled (no need for identifiers)
    - May be more intuitive to read
  - Disadvantages:
    - Can be redundant when many elements have the same reference
13. Hyperlinks between pathways
Refer to other pathways with an HTML-style link. This tag would be exactly the same as TextLabel, with a couple of differences:
- the tag is named Link instead of Label
- the defaults for the text style are blue and underlined
- there is an extra "href" attribute. The href can either be a complete URL or the name of another pathway. It should be possible to refer to any local pathway by name only, without specifying the complete path.
