Representing Human Samples in a Gene Expression Database

 

Thodoros Topaloglou

Gene Logic Inc.

 

Introduction

 

This manuscript discusses modeling issues for human samples in a gene expression reference database. It presents one of a series of four case studies that the Ontologies working group plans to carry out at the November 17th MGED meeting.

 

DISCLAIMER: The modeling of human samples presented in this manuscript differs from that of GeneExpress™ by Gene Logic. GeneExpress™ supports a very thorough representation of samples, which includes sample-specific parameters, pathology review, clinical profiles of donors including medical tests, therapeutic-area-specific assays, other genomic assays, sample processing measurements, etc. The result is a representation that far exceeds the MIAME goals in scope and complexity.

 

Modeling of Human Samples

 

The goals of this case study are the following:

 

1.     Model human samples following the MIAME guidelines and identify the sample qualifiers (attributes) that are candidates for “ontology support”. 

 

2.     Provide descriptions for sample attributes in MIAME and identify value and type constraints specific to the human species.

 

3.     Use a realistic example in order to identify shortfalls of the current MIAME guideline.

 

Perhaps the most distinctive feature of human samples is their association with human donor information, which under certain circumstances can affect gene expression. Another important feature of human tissues is that the pathological processes in the sample can be assessed by conventional pathology screening (this is also the case for tissues from other organisms).

 

The following example illustrates a representation structure for human samples using the OPM notation. I chose OPM as the modeling language because of its strengths in dealing with the semantic aspects of a representation. I make extensive use of the DESCRIPTION field (facet) to state the meaning of each attribute and to express informal integrity constraints.

 

CLASS Human_Sample

      ATTRIBUTE type : CV_SampleType REQUIRED

            DESCRIPTION: “The type of a human sample can be one of: ‘tissue’, ‘cells’, ‘RNA’,  ‘other’”

      ATTRIBUTE source (type, name, ID): set-of (

                                 CV_SourceType REQUIRED, STRING REQUIRED, STRING OPTIONAL) OPTIONAL

            DESCRIPTION: “The source of a sample is modeled as a multivalued tuple attribute with components source

                                 type, name and ID. The type can be one of ‘Sample’, ‘Cell line’, ‘Biorepository’, ‘Institution’,

                                 ‘Vendor’, and ‘Other’. Name and ID, if available, represent the source specific name and ID

                                 of the sample.”

      ATTRIBUTE gender: CV_Gender REQUIRED

            DESCRIPTION: “The gender of the individual that the sample is taken from”

      ATTRIBUTE treatment (type, detail): (CV_TreatmentType REQUIRED, Treatment OPTIONAL) OPTIONAL

            DESCRIPTION: “Treatment is a tuple that consists of a type and a detail description. Although treatment

                                 specification is optional, once specified, at least the type must be filled in.”

      ATTRIBUTE anatomy (vocabulary, term, code, description) :(

                                 STRING REQUIRED,

                                 STRING REQUIRED,

                                 STRING OPTIONAL,

                                 STRING OPTIONAL) REQUIRED

            DESCRIPTION: “The  anatomic site from which the sample is extracted is described by a term and code in

                                 some vocabulary, and a longer string description. The vocabulary can be either

                                 ‘informal’ or a formal one like ‘SNOMED’. If ‘informal’ is specified as vocabulary, then term

                                 can be any value. If a formal vocabulary is used, then the term and code values must be

                                 valid term and code values in the vocabulary.”

            PROPERTY

                  MGED_CURATION: “Ensure existence of anatomy.term and anatomy.code in anatomy.vocabulary;

                                            Exempt if anatomy.vocabulary = ‘informal’”.

      ATTRIBUTE pathology (vocabulary, term, code, category, description): (

                                 STRING REQUIRED,

                                 STRING OPTIONAL,

                                 STRING OPTIONAL,

                                 CV_PathologyCategory REQUIRED,

                                 STRING OPTIONAL) REQUIRED

            DESCRIPTION: “The pathological processes in the sample are described by a term and code in some

                                 vocabulary, a longer string description and a category term. The vocabulary can be either

                                 ‘informal’ or a formal one like ‘SNOMED’. If ‘informal’ is specified as vocabulary, then term

                                 can be any value. If a formal vocabulary is used, then the term and code values must be

                                 valid term and code values in the vocabulary. The category is a placeholder for a broad

                                 pathology term such as ‘normal’, ‘tumor’, etc.”

      ATTRIBUTE diseases (vocabulary, term, code, stage, description, status) : set-of (

                                 STRING REQUIRED,

                                 STRING REQUIRED,

                                 STRING OPTIONAL,

                                 STRING OPTIONAL,

                                 STRING OPTIONAL,

                                 STRING OPTIONAL) OPTIONAL

            DESCRIPTION:  “A sample may be associated with one or more diseases via the donor, e.g., a sample may

                                 be a normal kidney of a diabetic donor who also has high blood pressure. A disease is

                                 described by a term and code in some vocabulary, a longer string description, a stage and

                                 a status. The vocabulary can be either ‘informal’ or a formal one like ‘SNOMED’. If ‘informal’

                                 is specified as vocabulary, then term can be any value. If a formal vocabulary is used, then

                                 the term and code values must be valid term and code values in the vocabulary. The stage

                                 component can be used to describe the stage of the disease, if applicable, e.g., ‘early’,

                                 ‘late’. The status can be used to specify information such as ‘contributing’, ‘not-contributing’,

                                 ‘past condition’, etc.”

      ATTRIBUTE qualifier_MIAME (name, value, vocabulary) : set-of (

                                 CV_MIAME_Qualifier REQUIRED,

                                 STRING REQUIRED,

                                 STRING OPTIONAL) OPTIONAL

            DESCRIPTION: “MIAME allows users to specify property (qualifier) value pairs. The available qualifier

                                 names are specified in the MIAME qualifiers CV.”

      ATTRIBUTE qualifier_other (name, value, vocabulary) : set-of (

                                 STRING REQUIRED,

                                 STRING REQUIRED,

                                 STRING OPTIONAL) OPTIONAL

            DESCRIPTION: “If the MIAME predefined qualifiers are not sufficient, the user can make use of an open-

                                 ended  list of qualifier-value pairs.”

      ATTRIBUTE file (name, description): set-of (STRING REQUIRED, STRING OPTIONAL) OPTIONAL

            DESCRIPTION: “A sample submitted to the public database may be associated with one or more files. The

                                 files may hold a histopathology report, tissue slide, clinical record, research paper, etc.”
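
The MGED_CURATION property on the anatomy attribute can be read as a small validation rule. The following Python fragment is only a sketch of that rule, not part of the OPM model: the names Anatomy, VOCABULARIES and passes_curation are hypothetical, the vocabulary table is a toy stand-in for a real (licensed) vocabulary such as SNOMED, and I assume that a missing code is acceptable because the code component is OPTIONAL.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical stand-in for a formal vocabulary: vocabulary name -> {term: code}.
# The single entry is illustrative only, not an excerpt of SNOMED.
VOCABULARIES = {
    "SNOMED": {"LEFT LOBE OF LIVER": "T51000"},
}


@dataclass(frozen=True)
class Anatomy:
    vocabulary: str                     # REQUIRED
    term: str                           # REQUIRED
    code: Optional[str] = None          # OPTIONAL
    description: Optional[str] = None   # OPTIONAL


def passes_curation(anatomy: Anatomy) -> bool:
    """Check term and code against the vocabulary; exempt if it is 'informal'."""
    if anatomy.vocabulary == "informal":
        return True
    terms = VOCABULARIES.get(anatomy.vocabulary)
    if terms is None or anatomy.term not in terms:
        return False
    # Assumption: an omitted code is allowed, a supplied code must match the term.
    return anatomy.code is None or terms[anatomy.term] == anatomy.code
```

With this sketch, Anatomy("informal", "some liver tissue") passes unconditionally, while a SNOMED entry passes only if its term (and code, when given) is found in the table.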

 

 

CONTROLLED VALUE CLASS CV_SourceType

      {     ( "Sample", S, "The sample is derived from another sample" ),

            ( "Cell Line", C, "The sample is derived from a cell line" ),

            ( "Biorepository", U, "The origin of the sample is a biorepository" ),

            ( "Institution", I, "The origin of the sample is a public institution such as a hospital" ),

            ( "Vendor", V, "The origin of the sample is a commercial vendor" ),

            ( "Other", O, "None of the above" )

      }

      DEFAULT: "Other"

      CODE_TYPE: CHAR(1)

      DESCRIPTION: "Types of sample sources"

 

CONTROLLED VALUE CLASS CV_Gender

      {     ( "Female", F, "Female" ),

            ( "Male", M, "Male" ),

            ( "Unknown", U, "Unknown" ) 

      }

      DEFAULT: "Unknown"

      CODE_TYPE: CHAR(1)

      DESCRIPTION: "Gender"

 

CONTROLLED VALUE CLASS CV_TreatmentType

      {     ( "Compound", T, "The sample/donor has been treated with some treatment agent at a given dose and time" ),

            ( "Diet", D, "The sample/donor has been subjected to a diet" ),

            ( "Surgery", S, "The sample has been surgically modified" ),

            ( "Genetic", G, "The sample has been genetically modified" ),

            ( "Other", O, "None of the above" )

      }

      DEFAULT: "Other"

      CODE_TYPE: CHAR(1)

      DESCRIPTION: "Types of sample treatments"

 

 

CONTROLLED VALUE CLASS CV_PathologyCategory

      {     ( "Normal", N, "Normal" ),

            ( "Tumor", T, "The sample is a tumor; can be further detailed as malignant or benign" ),

            ( "Diseased", D, "The sample is diseased" ),

            ( "Not Reported", R, "Not Reported" )

      }

      DEFAULT: "Not Reported"

      CODE_TYPE: CHAR(1)

      DESCRIPTION: "The pathological condition of the samples in very general terms."

 

CONTROLLED VALUE CLASS CV_MIAME_Qualifier

      {     ( "Sex", 0, "Gender of the human donor" ),

            ( "Age", 1, "Age of the donor when the sample was taken" ),

            ( "Development Stage", 2, "The developmental stage of the sample, if the sample is prenatal" ),

            ( "Tissue", 3, "Tissue type, same as anatomy" ),

            ( "Cell Type", 4, "Type of cells, if the sample is a cell culture" ),

            ( "In Vivo Treatment", 5, "Same as treatment, applies to human tissues" ),

            ( "In Vitro Treatment", 6, "Same as treatment, applies to human cell lines" ),

            ( "Handling", 7, "Description of the sample handling, e.g., time to freeze, condition of packaging, etc." ),

            ( "Genotype", 8, "Genetic characteristics such as disease alleles, polymorphisms, etc." )

      }

      CODE_TYPE: NUMBER

      DESCRIPTION: "Names of qualifiers applicable to humans, approximating the MIAME-approved list"
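
For readers more familiar with programming languages than with OPM, a controlled value class behaves much like an enumeration with a default. The sketch below mirrors CV_Gender in Python; the names Gender and parse_gender are hypothetical and only illustrate how the CODE_TYPE and DEFAULT facets could be enforced.

```python
from enum import Enum
from typing import Optional


class Gender(Enum):
    """Mirror of CV_Gender: each member's value is its CHAR(1) code."""
    FEMALE = "F"
    MALE = "M"
    UNKNOWN = "U"


def parse_gender(code: Optional[str] = None) -> Gender:
    """Map a one-character code to a CV member.

    A missing code falls back to the DEFAULT ("Unknown"); a code outside
    the controlled vocabulary raises ValueError.
    """
    if code is None:
        return Gender.UNKNOWN
    return Gender(code)
```

Calling parse_gender("F") yields Gender.FEMALE, and parse_gender("X") is rejected, which is the value constraint that a free-text field cannot provide.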

 

Discussion

Attributes vs Qualifiers

 

In this case study, we chose explicit attributes over qualifier-value pairs for certain human sample characteristics.

 

Explicit attributes were used in the following cases:

 

1.     Structured fields, e.g., anatomy. A valid value of anatomy can be (“SNOMED”, “LEFT LOBE OF LIVER”, “T51000”, “Liver”). The attribute consists of components with specific meanings: the first component is the vocabulary used, the second is an anatomy term, the third is a code, and the fourth is a description in plain language.

 

2.     Structured fields where there is a need to express explicit constraints, e.g., anatomy. The constraint on the anatomy attribute is “the code and term must be valid in the vocabulary”.

 

3.     Structured or primitive fields valued in a Controlled Vocabulary (CV) domain, e.g., treatment, gender. A valid treatment value can be (T, “Acetaminophen, 500mg, 4hrs before liver sample was taken”), where T is the CV_TreatmentType code for ‘Compound’. In this case a value constraint on the treatment type is enforceable.

 

4.     Required fields, e.g., gender. Note that gender is also available as a qualifier but there is no way to express that it is mandatory.

 

5.     Set-valued fields, e.g., source. A valid value for source can be {(I, “Stanford”, “10578”), (C, “Eisen-001”, “198”)}, meaning that the sample source is Stanford U., accession 10578, and that the sample comes from an Eisen cell line with ID 198 (sorry, Mike).

 

Multi-valued fields can be easily handled by the qualifier-value approach too. However, expressing structural and value constraints becomes problematic under this approach.

 

Qualifiers (property-value pairs) can be used instead of explicit attributes, if one wants to trade semantic clarity for simplicity.
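
The trade-off can be made concrete with a small Python sketch. It assumes a sample held as a dictionary of explicit attributes on the one hand and as a list of qualifier-value pairs on the other; the helper missing_required and the sample values (such as the age of 54) are hypothetical and for illustration only.

```python
# Attributes marked REQUIRED in the Human_Sample class above.
REQUIRED_ATTRIBUTES = ("type", "gender", "anatomy", "pathology")


def missing_required(sample: dict) -> list:
    """Return the REQUIRED attributes that are absent or empty in a record."""
    return [name for name in REQUIRED_ATTRIBUTES if not sample.get(name)]


# Explicit attributes: the structure itself reveals that gender is missing.
explicit_sample = {
    "type": "tissue",
    "anatomy": ("SNOMED", "LEFT LOBE OF LIVER", "T51000", "Liver"),
    "pathology": ("informal", "normal", None, "N", None),
}

# Qualifier-value pairs: any bag of pairs is structurally acceptable, so the
# absence of a "Sex" pair goes unnoticed unless extra rules are added.
qualifier_sample = [("Age", "54"), ("Tissue", "liver")]

print(missing_required(explicit_sample))  # the only missing attribute is gender
```

The explicit record can be checked mechanically against the class definition, whereas the qualifier list needs out-of-band conventions to express the same requirement.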

Controlled Vocabularies and Ontologies

 

Controlled vocabularies seemed appropriate for attribute domains such as gender, treatment type, and pathology category. This case study illustrates the succinctness that simple CVs can add to a representation.

 

Use of ontologies can enrich sample annotations even further. Ontologies allow one to take advantage of established bodies of knowledge from well-studied subject matters such as anatomy and disease. Such advantages include taxonomic inference, e.g., “a sample from the left lobe of the liver also qualifies as a liver sample”.
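
As a rough illustration of taxonomic inference, the Python sketch below walks a tiny parent table. The table is a made-up fragment, not an excerpt of any real anatomy ontology, and the names PARENT, ancestors and qualifies_as are hypothetical; a real system would load the hierarchy from the ontology itself.

```python
# Hypothetical fragment of an anatomy hierarchy: term -> broader term.
PARENT = {
    "left lobe of liver": "liver",
    "right lobe of liver": "liver",
    "liver": "digestive system",
}


def ancestors(term: str) -> list:
    """Return all broader terms of `term`, nearest first."""
    found = []
    while term in PARENT:
        term = PARENT[term]
        found.append(term)
    return found


def qualifies_as(term: str, target: str) -> bool:
    """True if a sample annotated with `term` also counts as a `target` sample."""
    return term == target or target in ancestors(term)
```

Under this toy hierarchy, qualifies_as("left lobe of liver", "liver") holds, while the reverse does not: a sample annotated only as "liver" cannot be assumed to come from the left lobe.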

 

The above modeling example, instead of adopting specific ontologies, encourages the user to reference the controlled vocabulary or ontology for the various terms used. Consumers of this information can then use thesauri or other similar systems, such as UMLS, to translate from one CV or ontology to another.

 

Recommendations

 

This case study demonstrated that the “qualifier-value” pair modeling approach is not precise enough for unambiguous representations. Our recommendation is to add a third required component that describes the vocabulary from which the value is taken, i.e., a “qualifier-value” pair becomes a “qualifier-value-vocabulary” triple. If the vocabulary is informal, the default value “private” must be used.
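
A minimal Python sketch of the recommended triple follows. The class name Qualifier and the sample values are hypothetical; the only element taken from the recommendation is the third component and its “private” default.

```python
from typing import NamedTuple


class Qualifier(NamedTuple):
    """A qualifier-value-vocabulary triple."""
    name: str
    value: str
    vocabulary: str = "private"   # default for informal vocabularies


# A value drawn from a formal vocabulary names that vocabulary explicitly.
tissue = Qualifier("Tissue", "LEFT LOBE OF LIVER", "SNOMED")

# An informal value carries the default marker instead of being left ambiguous.
handling = Qualifier("Handling", "frozen within 30 minutes")
```

Because the vocabulary component is always present, a consumer can tell whether a value such as “LEFT LOBE OF LIVER” is meant as a SNOMED term or as free text.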

 

A list of candidate vocabularies for each MIAME mandatory sample qualifier should be developed. This case study introduced some CVs. Established standards should be used as much as possible. Since many of these standards are not widely known in this community, there is a need to catalogue all the applicable vocabularies (preferably the publicly available ones). For instance, SNOMED is widely accepted by human pathologists, but requires licensing.

 

A future MIAME release should include a description for each sample data qualifier, because a qualifier name may be interpreted in multiple ways and its description may vary by species.