Gene Expression Omnibus (GEO)

The NCBI Gene Expression Omnibus (GEO) is a public repository for a wide range of high-throughput data. At its most basic level, GEO is organized around four types of record: Platform, Sample, Series and DataSet.

Platform

A Platform record describes the list of elements on the array (e.g. cDNAs, oligonucleotides) or, for sequencing data, the sequencer that was used. Each record is assigned a unique accession number (GEO accession) of the form GPLxxx. For example, GPL24676 corresponds to Illumina NovaSeq 6000 (Homo sapiens).

Samples

A Sample record describes the conditions under which an individual sample was handled and the abundance measurement of each element derived from it. Each Sample record is assigned a unique accession number of the form GSMxxx.

Series

A Series record defines a set of related samples and provides a description of the experiment as a whole. It may also contain tables describing extracted data, conclusions, and so on. Each record is assigned an accession number of the form GSExxx.

DataSets

A GEO DataSet (GDSxxx) is a curated set of GEO samples. A GDS record represents a collection of samples that are biologically and statistically comparable.
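
The accession prefixes described above are enough to tell the four record types apart. The following is a minimal sketch in base R, assuming every accession is one of the four prefixes followed only by digits. `geo_entity_type` is a hypothetical helper written for illustration; it is not part of GEOquery.

```r
# Hypothetical helper: map a GEO accession to its record type by prefix.
# Assumption: accessions are a known prefix followed only by digits.
geo_entity_type <- function(accession) {
  prefixes <- c(GPL = "Platform", GSM = "Sample",
                GSE = "Series",   GDS = "DataSet")
  prefix   <- toupper(substr(accession, 1, 3))
  is_valid <- grepl("^(GPL|GSM|GSE|GDS)[0-9]+$", accession, ignore.case = TRUE)
  out <- unname(prefixes[prefix])   # unknown prefixes become NA
  out[!is_valid] <- NA_character_   # malformed accessions become NA
  out
}

geo_entity_type(c("GPL24676", "GSM8649709", "GSE282742", "GDS507", "XYZ1"))
# [1] "Platform" "Sample"   "Series"   "DataSet"  NA
```

Returning NA rather than stopping with an error keeps the helper vectorized, so it can be applied to a whole column of accessions at once.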

GEOquery

GEOquery is a Bioconductor package for querying the NCBI Gene Expression Omnibus from R.

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("GEOquery")

Worked example

We will query the data for series GSE282742.

library(GEOquery)

accession <- 'GSE282742'

gse <- getGEO(accession, GSEMatrix = TRUE)
gse[[1]]

ExpressionSet (storageMode: lockedEnvironment)
assayData: 0 features, 116 samples 
  element names: exprs 
protocolData: none
phenoData
  sampleNames: GSM8649709 GSM8649710 ... GSM8649824 (116 total)
  varLabels: title geo_accession ... tissue:ch1 (46 total)
  varMetadata: labelDescription
featureData: none
experimentData: use 'experimentData(object)'
Annotation: GPL24676

gse[[1]]@experimentData@name

[1] "Yi-Long,,Huang"

gse[[1]]@experimentData@title

[1] "Transcriptomic predictors of rapid progression from mild cognitive impairment to Alzheimer's Disease"

gse[[1]]@experimentData@abstract

[1] "Background: Effective treatment for Alzheimer’s disease (AD) remains an unmet need. Thus, identifying patients with mild cognitive impairment (MCI) who are at high-risk of progressing to AD is crucial for early intervention. Methods: Blood-based transcriptomics analyses were performed using a longitudinal study cohort to compare progressive MCI (P-MCI, n=28), stable MCI (S-MCI, n=39), and AD patients (n=49). Statistical DESeq2 analysis and machine learning methods were employed to identify differentially expressed genes (DEGs) and develop prediction models. Results: We discovered a remarkable gender-specific difference in DEGs that distinguish P-MCI from S-MCI. Machine learning models achieved high accuracy in distinguishing P-MCI from S-MCI (AUC 0.93), AD from S-MCI (AUC 0.94), and AD from P-MCI (AUC 0.92). An 8-gene signature was identified for distinguishing P-MCI from S-MCI."
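
As a quick consistency check, the group sizes quoted in the abstract can be added up and compared with the 116 samples reported by the ExpressionSet. This is a small sketch in base R; the vector name `cohort` is chosen here for illustration only.

```r
# Group sizes as stated in the series abstract.
cohort <- c("P-MCI" = 28, "S-MCI" = 39, "AD" = 49)

# Total number of patients; should match the number of samples in the series.
total <- sum(cohort)
total
# [1] 116

# Share of each group in the cohort, in percent.
round(100 * cohort / total, 1)
```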

metadata <- pData(phenoData(gse[[1]]))[, 
        c('description', 'age:ch1', 'disease state:ch1', 'Sex:ch1', 'tissue:ch1')]

head(metadata)

	description	age:ch1	disease state:ch1	Sex:ch1	tissue:ch1
GSM8649709	Library name: VGH0075	69y	P-MCI	F	white blood cells
GSM8649710	Library name: VGH0089	78y	AD	F	white blood cells
GSM8649711	Library name: VGH0146	67y	P-MCI	M	white blood cells
GSM8649712	Library name: VGH0195	73y	AD	F	white blood cells
GSM8649713	Library name: VGH0203	85y	AD	M	white blood cells
GSM8649714	Library name: VGH0216	83y	AD	M	white blood cells
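
The phenotype columns arrive as character strings (ages such as "69y", names containing ":" and spaces), so some tidying is usually needed before analysis. The sketch below shows one possible way to do it in base R. So that it runs without a network connection, it rebuilds only the six rows printed above; the `clean` data frame, its column names and the factor level order are assumptions made for this illustration.

```r
# Rebuild the six rows shown by head(metadata) so the example is self-contained.
metadata <- data.frame(
  description = paste("Library name:",
                      c("VGH0075", "VGH0089", "VGH0146",
                        "VGH0195", "VGH0203", "VGH0216")),
  "age:ch1" = c("69y", "78y", "67y", "73y", "85y", "83y"),
  "disease state:ch1" = c("P-MCI", "AD", "P-MCI", "AD", "AD", "AD"),
  "Sex:ch1" = c("F", "F", "M", "F", "M", "M"),
  "tissue:ch1" = rep("white blood cells", 6),
  row.names = paste0("GSM", 8649709:8649714),
  check.names = FALSE,          # keep the original GEO column names
  stringsAsFactors = FALSE
)

# Tidy copy: syntactic column names, numeric age, factors for the groups.
clean <- data.frame(
  library_name  = sub("^Library name: ", "", metadata$description),
  age           = as.numeric(sub("y$", "", metadata[["age:ch1"]])),
  disease_state = factor(metadata[["disease state:ch1"]],
                         levels = c("S-MCI", "P-MCI", "AD")),
  sex           = factor(metadata[["Sex:ch1"]]),
  row.names     = rownames(metadata)
)

# Cross-tabulate disease state by sex for these six samples.
counts_by_group <- table(clean$disease_state, clean$sex)
counts_by_group
```

With the full `metadata` object from the series, the same transformations would apply to all 116 samples; only the six-row reconstruction at the top would be dropped.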

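Downloading supplementary files

The series matrix loaded above contains no expression values (assayData: 0 features), so the quantification tables have to be fetched from the supplementary files of the series.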

suppl.file <- getGEOSuppFiles(accession, 
                              baseDir = 'data', 
                              fetch_files = TRUE)

suppl.file

	size	isdir	mode	mtime	ctime	atime	uid	gid	uname	grname
data/GSE282742/GSE282742_Expected_count.txt.gz	6248772	FALSE	644	2024-12-07 20:03:30	2024-12-07 20:03:30	2024-12-07 20:03:26	1000	1000	smzlogoj	smzlogoj
data/GSE282742/GSE282742_TPM.txt.gz	6690730	FALSE	644	2024-12-07 20:03:35	2024-12-07 20:03:35	2024-12-07 20:03:30	1000	1000	smzlogoj	smzlogoj
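
Once downloaded, a gzipped table can be read directly into R. The sketch below is one possible way to do this, under a stated assumption: that the file is tab-separated with a header row and gene identifiers in the first column. The layout of the real GSE282742 files has not been checked here, so a toy file with made-up contents stands in for the download, and `read_count_table` is a hypothetical helper.

```r
# Toy stand-in for a downloaded supplementary file (contents are made up).
toy_path <- file.path(tempdir(), "toy_counts.txt.gz")
con <- gzfile(toy_path, "w")
writeLines(c("gene_id\tS1\tS2",
             "geneA\t10\t0",
             "geneB\t3.5\t7"), con)
close(con)

# Hypothetical helper: read a gzipped, tab-separated table into a matrix.
# Assumption: header row present, gene identifiers in the first column.
read_count_table <- function(path) {
  stopifnot(file.exists(path))
  tab <- read.delim(gzfile(path), header = TRUE, row.names = 1,
                    check.names = FALSE)   # keep sample names unchanged
  as.matrix(tab)
}

counts <- read_count_table(toy_path)
dim(counts)
# [1] 2 2
```

For the real data, the row names of `suppl.file` are the downloaded file paths, so something like `read_count_table(rownames(suppl.file)[1])` would be the starting point, provided the layout assumption holds.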