Thodoros Topaloglou
Gene Logic Inc.
This manuscript discusses modeling issues for human samples in a gene expression reference database. It presents one of a series of four case studies planned by the Ontologies working group for the November 17th MGED meeting.
DISCLAIMER: The modeling of human samples as presented in this manuscript is different than that of GeneExpress™ by Gene Logic. GeneExpress™ supports a very thorough representation of samples which includes sample specific parameters, pathology review, clinical profiles of donors including medical tests, therapeutic area specific assays, other genomic assays, sample processing measurements, etc. This results in a representation that far exceeds in scope and complexity the MIAME goals.
The goals of this case study are the following:
1. Model human samples following the MIAME guidelines and identify the sample qualifiers (attributes) that are candidates for “ontology support”.
2. Provide descriptions for sample attributes in MIAME and identify value and type constraints specific to the human species.
3. Use a realistic example to identify shortfalls of the current MIAME guidelines.
Perhaps the most distinctive feature of human samples is their association with human donor information, which, under certain circumstances, can have a contributing effect on gene expression. Another important feature of human tissues is the ability to assess the pathological processes in the sample by conventional pathology screening (this is also the case for tissues from other organisms).
The following example illustrates a representation structure for human samples using the OPM notation. I chose OPM as the modeling language because of its strengths in dealing with the semantic aspects of a representation. I make extensive use of the DESCRIPTION field (facet) to state the meaning of each attribute and to express informal integrity constraints.
CLASS Human_Sample
  ATTRIBUTE type : CV_SampleType REQUIRED
    DESCRIPTION: "The type of a human sample can be one of: 'tissue', 'cells', 'RNA', 'other'"
  ATTRIBUTE source (type, name, ID) : set-of (
      CV_SourceType REQUIRED,
      STRING REQUIRED,
      STRING OPTIONAL) OPTIONAL
    DESCRIPTION: "The source of a sample is modeled as a multivalued tuple attribute with components source type, name and ID. The type can be one of 'Sample', 'Cell Line', 'Biorepository', 'Institution', 'Vendor', and 'Other'. Name and ID, if available, represent the source-specific name and ID of the sample."
  ATTRIBUTE gender : CV_Gender REQUIRED
    DESCRIPTION: "The gender of the individual that the sample is taken from"
  ATTRIBUTE treatment (type, detail) : (
      CV_TreatmentType REQUIRED,
      Treatment OPTIONAL) OPTIONAL
    DESCRIPTION: "Treatment is a tuple that consists of a type and a detail description. Although the treatment specification is optional, once it is specified, at least the type must be filled."
  ATTRIBUTE anatomy (vocabulary, term, code, description) : (
      STRING REQUIRED,
      STRING REQUIRED,
      STRING OPTIONAL,
      STRING OPTIONAL) REQUIRED
    DESCRIPTION: "The anatomic site from which the sample is extracted is described by a term and code in some vocabulary, and a longer string description. The vocabulary can be either 'informal' or a formal one like 'SNOMED'. If 'informal' is specified as vocabulary, then term can be any value. If a formal vocabulary is used, then the term and code values must be valid term and code values in the vocabulary."
    PROPERTY MGED_CURATION: "Ensure existence of anatomy.term and anatomy.code in anatomy.vocabulary; Exempt if anatomy.vocabulary = 'informal'".
  ATTRIBUTE pathology (vocabulary, term, code, category, description) : (
      STRING REQUIRED,
      STRING OPTIONAL,
      STRING OPTIONAL,
      CV_PathologyCategory REQUIRED,
      STRING OPTIONAL) REQUIRED
    DESCRIPTION: "The pathological processes in the sample are described by a term and code in some vocabulary, a longer string description and a category term. The vocabulary can be either 'informal' or a formal one like 'SNOMED'. If 'informal' is specified as vocabulary, then term can be any value. If a formal vocabulary is used, then the term and code values must be valid term and code values in the vocabulary. The category is a placeholder for a broad pathology term such as 'normal', 'tumor', etc."
  ATTRIBUTE diseases (vocabulary, term, code, stage, description, status) : set-of (
      STRING REQUIRED,
      STRING REQUIRED,
      STRING OPTIONAL,
      STRING OPTIONAL,
      STRING OPTIONAL,
      STRING OPTIONAL) OPTIONAL
    DESCRIPTION: "A sample may be associated with one or more diseases via the donor, e.g., a sample may be a normal kidney of a diabetic donor who also has high blood pressure. A disease is described by a term and code in some vocabulary, a longer string description, a stage and a status. The vocabulary can be either 'informal' or a formal one like 'SNOMED'. If 'informal' is specified as vocabulary, then term can be any value. If a formal vocabulary is used, then the term and code values must be valid term and code values in the vocabulary. The stage component can be used to describe the stage of the disease, if applicable, e.g., 'early', 'late'. The status can be used to specify information such as 'contributing', 'not-contributing', 'past condition', etc."
  ATTRIBUTE qualifier_MIAME (name, value, vocabulary) : set-of (
      CV_MIAME_Qualifier REQUIRED,
      STRING REQUIRED,
      STRING OPTIONAL) OPTIONAL
    DESCRIPTION: "MIAME allows users to specify property (qualifier) value pairs. The available qualifier names are specified in the MIAME qualifiers CV."
  ATTRIBUTE qualifier_other (name, value, vocabulary) : set-of (
      STRING REQUIRED,
      STRING REQUIRED,
      STRING OPTIONAL) OPTIONAL
    DESCRIPTION: "If the MIAME predefined qualifiers are not sufficient, the user can make use of an open-ended list of qualifier-value pairs."
  ATTRIBUTE file (name, description) : set-of (
      STRING REQUIRED,
      STRING OPTIONAL) OPTIONAL
    DESCRIPTION: "A sample submitted to the public database may be associated with one or more files. The files may hold a histopathology report, tissue slide, clinical record, research paper, etc."
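As a rough illustration of how the REQUIRED/OPTIONAL markers and the MGED_CURATION property on anatomy could be read as a check on a record, here is a minimal Python sketch. It is not part of OPM or MIAME. The `VOCABULARIES` table, the `Anatomy` class and the `passes_curation` function are hypothetical names introduced for this sketch, and the single SNOMED entry reuses the term and code from this manuscript's own anatomy example without having been checked against SNOMED.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical stand-in for a vocabulary lookup service: vocabulary -> {term: code}.
# The one entry reuses the term and code from this manuscript's anatomy example.
VOCABULARIES = {
    "SNOMED": {"LEFT LOBE OF LIVER": "T51000"},
}


@dataclass(frozen=True)
class Anatomy:
    vocabulary: str                    # REQUIRED
    term: str                          # REQUIRED
    code: Optional[str] = None         # OPTIONAL
    description: Optional[str] = None  # OPTIONAL


def passes_curation(anatomy: Anatomy) -> bool:
    """Apply the MGED_CURATION rule to one anatomy value."""
    if not anatomy.vocabulary or not anatomy.term:
        return False  # both components are REQUIRED
    if anatomy.vocabulary == "informal":
        return True   # exempt: any term is allowed
    known_terms = VOCABULARIES.get(anatomy.vocabulary)
    if known_terms is None:
        return False  # assumption: an unrecognized vocabulary cannot be verified
    if anatomy.term not in known_terms:
        return False
    # The code is OPTIONAL, but when given it must match the term.
    return anatomy.code is None or anatomy.code == known_terms[anatomy.term]
```

Treating an unrecognized vocabulary as a failure is a choice made for this sketch; a curator might prefer to flag such records for manual review instead.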
CONTROLLED VALUE CLASS CV_SourceType
  { ( "Sample", S, "The sample is derived from another sample" ),
    ( "Cell Line", C, "The sample is derived from a cell line" ),
    ( "Biorepository", U, "The origin of the sample is a biorepository" ),
    ( "Institution", I, "The origin of the sample is a public institution such as a hospital" ),
    ( "Vendor", V, "The origin of the sample is a commercial vendor" ),
    ( "Other", O, "None of the above" )
  }
  DEFAULT: "Other"
  CODE_TYPE: CHAR(1)
  DESCRIPTION: "Types of sample sources"
CONTROLLED VALUE CLASS CV_Gender
  { ( "Female", F, "Female" ),
    ( "Male", M, "Male" ),
    ( "Unknown", U, "Unknown" )
  }
  DEFAULT: "Unknown"
  CODE_TYPE: CHAR(1)
  DESCRIPTION: "Gender"
CONTROLLED VALUE CLASS CV_TreatmentType
  { ( "Compound", T, "The sample/donor has been treated by some treatment agent at a given dose and time" ),
    ( "Diet", D, "The sample/donor has been subjected to a diet" ),
    ( "Surgery", S, "The sample has been surgically modified" ),
    ( "Genetic", G, "The sample has been genetically modified" ),
    ( "Other", O, "None of the above" )
  }
  DEFAULT: "Other"
  CODE_TYPE: CHAR(1)
  DESCRIPTION: "Types of sample treatments"
CONTROLLED VALUE CLASS CV_PathologyCategory
  { ( "Normal", N, "Normal" ),
    ( "Tumor", T, "The sample is a tumor; can be further detailed as malignant or benign" ),
    ( "Diseased", D, "The sample is diseased" ),
    ( "Not Reported", R, "Not Reported" )
  }
  DEFAULT: "Not Reported"
  CODE_TYPE: CHAR(1)
  DESCRIPTION: "The pathological condition of the samples in very general terms."
CONTROLLED VALUE CLASS CV_MIAME_Qualifier
  { ( "Sex", 0, "Gender of the human donor" ),
    ( "Age", 1, "Age of the donor when the sample was taken" ),
    ( "Development Stage", 2, "The developmental stage of the sample, if sample is prenatal" ),
    ( "Tissue", 3, "Tissue type, same as anatomy" ),
    ( "Cell Type", 4, "Type of cells, if sample is a cell culture" ),
    ( "In Vivo Treatment", 5, "Same as treatment, applies to human tissues" ),
    ( "In Vitro Treatment", 6, "Same as treatment, applies to human cell lines" ),
    ( "Handling", 7, "Description of the sample handling, e.g., time to freeze, condition of packaging, etc." ),
    ( "Genotype", 8, "Genetic characteristics such as disease alleles, polymorphisms, etc." )
  }
  CODE_TYPE: NUMBER
  DESCRIPTION: "Names of MIAME approved qualifiers (sort of), applicable to humans"
In this case study, we chose to use explicit attributes over qualifier-value pairs for certain human sample characteristics.
Explicit attributes were used in the following cases:
1. Structured fields, e.g., anatomy. A valid value of anatomy can be (“SNOMED”, “LEFT LOBE OF LIVER”, “T51000”, “Liver”). The attribute consists of components with specific meaning: the first component represents the vocabulary used, the second is an anatomy term, the third is a code, and the fourth is a description in plain language.
2. Structured fields where there is a need to express explicit constraints, e.g., anatomy. The constraint on the anatomy attribute is “the code and term must be valid in the vocabulary”.
3. Structured or primitive fields valued in a Controlled Vocabulary (CV) domain, e.g., treatment, gender. A valid treatment value can be (T, “Acetaminophen, 500mg, 4hrs before liver sample was taken”). In this case a value constraint on treatment type is enforceable.
4. Required fields, e.g., gender. Note that gender is
also available as a qualifier but there is no way to express that it is
mandatory.
5. Set-valued fields, e.g., source. A valid value for source can be {(I, “Stanford”, “10578”), (C, “Eisen-001”, “198”)}, meaning that the sample source is Stanford U., accession=10578, and that the sample comes from an Eisen cell line with ID 198. (Sorry, Mike.)
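To make the point about enforceable constraints concrete, the following Python sketch checks a set-valued source attribute against the CV_SourceType codes. It is only an illustration under stated assumptions: `SOURCE_TYPE_CODES` and `validate_sources` are hypothetical names, the codes are copied from the CV_SourceType definition above, and IDs are represented as strings.

```python
# Codes copied from the CV_SourceType definition in this manuscript.
SOURCE_TYPE_CODES = {
    "S": "Sample", "C": "Cell Line", "U": "Biorepository",
    "I": "Institution", "V": "Vendor", "O": "Other",
}


def validate_sources(sources):
    """Return a list of constraint violations for a set of (type, name, ID) tuples."""
    errors = []
    for source in sources:
        if len(source) != 3:
            errors.append(f"{source!r}: expected (type, name, ID)")
            continue
        type_code, name, _source_id = source  # ID is OPTIONAL, so it is not checked
        if type_code not in SOURCE_TYPE_CODES:
            errors.append(f"{source!r}: unknown source type {type_code!r}")
        if not name:
            errors.append(f"{source!r}: name is REQUIRED")
    return errors
```

With a plain qualifier-value pair, the same information would arrive as one opaque string, and neither the type code nor the required name could be checked in this way.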
Multi-valued fields can be
easily handled by the qualifier-value approach too. However, expressing
structural and value constraints becomes problematic under this approach.
Qualifiers (property-value
pairs) can be used instead of explicit attributes, if one wants to trade
semantic clarity for simplicity.
Controlled vocabularies seemed appropriate for attribute domains such as gender, treatment type, and pathology category. This case study is an illustrative example of the succinctness that simple CVs can add to a representation.
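For readers more familiar with programming languages than with OPM, a simple CV behaves much like an enumeration whose members carry a code and a description. The Python sketch below mirrors CV_Gender. The class name `CVGender`, the `from_code` helper and `DEFAULT_GENDER` are hypothetical names chosen for this illustration; only the three values, their codes and the default come from the CV definition above.

```python
from enum import Enum


class CVGender(Enum):
    # Each member carries (value, code, description), as in CV_Gender.
    FEMALE = ("Female", "F", "Female")
    MALE = ("Male", "M", "Male")
    UNKNOWN = ("Unknown", "U", "Unknown")

    def __init__(self, label: str, code: str, description: str) -> None:
        self.label = label
        self.code = code
        self.description = description

    @classmethod
    def from_code(cls, code: str) -> "CVGender":
        """Look up a member by its CHAR(1) code; reject anything else."""
        for member in cls:
            if member.code == code:
                return member
        raise ValueError(f"{code!r} is not a valid CV_Gender code")


DEFAULT_GENDER = CVGender.UNKNOWN  # mirrors DEFAULT: "Unknown"
```

Because the set of members is closed, a value outside the CV is rejected at the point of entry, which is the kind of value constraint a free-text qualifier cannot express.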
Use of ontologies can enrich sample annotations even further. Ontologies allow one to take advantage of established bodies of knowledge from well-studied subject matters such as anatomy and disease. Such advantages include taxonomic inference, e.g., “a sample from the left lobe of liver also qualifies as a liver sample”.
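The taxonomic inference mentioned here can be sketched in a few lines of Python. The `PARENT` table is a toy hierarchy invented for this illustration and is not taken from SNOMED or any other ontology; `ancestors` and `qualifies_as` are likewise hypothetical helper names.

```python
# Toy hierarchy for illustration only: each term points to its broader term.
PARENT = {
    "left lobe of liver": "liver",
    "right lobe of liver": "liver",
    "liver": "digestive system",
}


def ancestors(term):
    """Return the broader terms of `term`, nearest first."""
    found = []
    while term in PARENT:
        term = PARENT[term]
        if term in found:
            break  # guard against an accidental cycle in the table
        found.append(term)
    return found


def qualifies_as(term, query):
    """True if a sample annotated with `term` also counts as a `query` sample."""
    return term == query or query in ancestors(term)
```

A query for liver samples would then also retrieve samples annotated with the more specific term, while the reverse does not hold.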
The above modeling example, instead of adopting specific ontologies, encourages the user to reference the controlled vocabulary or ontology for the various terms used. In that case, the consumers of this information can use thesauri or other similar systems, such as UMLS, to translate from one CV or ontology to another.
This case study demonstrated that the “qualifier-value” pair modeling approach is not precise enough for unambiguous representations. Our recommendation is to add a third required component that describes the vocabulary from which the value is taken, i.e., a “qualifier-value” pair becomes a “qualifier-value-vocabulary” triple. If the vocabulary is informal, the default value “private” must be used.
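A minimal Python sketch of the proposed triple follows. The names `QualifierTriple`, `make_triple` and `PRIVATE` are hypothetical and chosen for this illustration; only the three components and the “private” default come from the recommendation above.

```python
from typing import NamedTuple, Optional

PRIVATE = "private"  # default vocabulary name proposed in the text


class QualifierTriple(NamedTuple):
    qualifier: str
    value: str
    vocabulary: str


def make_triple(qualifier: str, value: str,
                vocabulary: Optional[str] = None) -> QualifierTriple:
    """Build a triple, falling back to "private" when no vocabulary is named."""
    if not qualifier or not value:
        raise ValueError("qualifier and value are both required")
    return QualifierTriple(qualifier, value, vocabulary or PRIVATE)
```

For example, `make_triple("Tissue", "LEFT LOBE OF LIVER", "SNOMED")` names its vocabulary explicitly, whereas `make_triple("Age", "45 years")` is recorded against the “private” vocabulary.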
A list of candidate vocabularies for each MIAME mandatory sample qualifier should be developed. This case study introduced some CVs. Established standards should be used as much as possible. Since many of these standards are not widely known in this community, there is a need to catalogue all the applicable vocabularies (preferably the publicly available ones). For instance, SNOMED is widely accepted by human pathologists, but requires licensing.
A future MIAME release should include descriptions for the sample data qualifiers, because a qualifier name may be interpreted in multiple ways and its description may vary by species.