Reading 10X Spatial Data from GEO into R

Author

Mark Dunning

Published

October 23, 2025

Overview

Lots of spatial transcriptomics data is becoming available in R, but how do you load it into R for analysis? I will discuss an example of a dataset uploaded to Gene Expression Omnibus

The data

The dataset I will be looking comes from Spatial single cell analysis of tumor microenvironment remodeling pattern in primary central nervous system lymphoma

Here is an overview of the study.

If we are interested in tumour microenvironment we might want to explore the dataset for ourselves and test some hypotheses. The data are available in Gene Expression Onmibus (GEO), but unfortunately not in a form that we can immediately use.

Packages required

The following packages will handle the download and re-organisation of the data

if(!require("BiocManager")) install.packages("BiocManager")
BiocManager::install("GEOquery")
install.packages("R.utils")

Our eventual aim will be to load the data into Seurat, so if you want to start looking at the data you can download this too.

install.packages("Seurat")

Downloading using GEOquery

The GEOquery package has long been a favourite of mine for downloading data from GEO. However, don’t get too excited because it was developed during the days of microarrays and has limited functionality for sequencing datasets. The getGEO function will download some meta data about this dataset.

library(GEOquery)
geo <- getGEO("GSE230207")
geo
$GSE230207_series_matrix.txt.gz
ExpressionSet (storageMode: lockedEnvironment)
assayData: 0 features, 4 samples 
  element names: exprs 
protocolData: none
phenoData
  sampleNames: GSM7192449 GSM7192450 GSM7192451 GSM7192453
  varLabels: title geo_accession ... tissue:ch1 (43 total)
  varMetadata: labelDescription
featureData: none
experimentData: use 'experimentData(object)'
  pubMedIds: 37120690 
Annotation: GPL18573 

Since getGEO is also capable of downloading datasets that comprise multiple different technologies, the result is in the form a list. The particular dataset we want only have one type of data (spatial transcriptomics), so we subset the first item in the list. We can access the meta, or phenotypic data, using the pData function.

meta <- pData(geo[[1]])
meta[,1:5]
           title geo_accession                status submission_date
GSM7192449   hot    GSM7192449 Public on May 01 2023     Apr 20 2023
GSM7192450  cold    GSM7192450 Public on May 01 2023     Apr 20 2023
GSM7192451   IME    GSM7192451 Public on May 01 2023     Apr 20 2023
GSM7192453   IMS    GSM7192453 Public on May 01 2023     Apr 20 2023
           last_update_date
GSM7192449      May 08 2024
GSM7192450      May 08 2024
GSM7192451      May 08 2024
GSM7192453      May 08 2024

The data we are interested have been uploaded as supplementary material, and conveniently GEOquery has a function for downloading these data. Depending on the speed of your network connection this may take a while. If the download fails, you might need to increase the timeout option in R.

# You may need to un-comment the next line to increase the download timeout
#options(timeout = 10000)

getGEOSuppFiles("GSE230207")

You should now have a file called GSE230207_RAW.tar in a folder called GSE230207. This tar file is essentially a way of transferring a dataset consisting many separate files into a single download.

Note

The name tar is short for Tape ARchive and harks back to the early days of computing.

The format and its associated command-line utility were originally developed for archiving files onto magnetic tape drives on Unix systems in the early days of computing.

Although its original purpose was tape backup, the tar utility is now widely used on Unix-like systems (like Linux) as a general-purpose archive format (often called a tarball) to bundle multiple files and directories into a single file for distribution, backup, and transport, regardless of the storage medium.

file.exists("GSE230207/GSE230207_RAW.tar")
[1] TRUE

R has a built-in function, untar for extracting all the files contained in the archive to a folder of our choosing (exdir). We’ll just extract to the current working directory. After untar has finished we can list the contents of the working directory.

untar("GSE230207/GSE230207_RAW.tar", exdir = ".")
list.files("./")

You’ll notice that lots of files have appeared, and futhermore they are prefixed by the GEO IDs that we discovered using GEOquery (GSM7192449, GSM7192450, GSM7192451 and GSM7192452)

With list.files we can list everything that contains one of these IDs and notice that the files are named very predictably. This means for a given GEO ID we know what files to expect, which will be helpful to organise our files in a way that Seurat expects.

list.files(pattern = "GSM7192449")
[1] "GSM7192449_aligned_fiducials.jpg.gz"    
[2] "GSM7192449_detected_tissue_image.jpg.gz"
[3] "GSM7192449_hot.h5"                      
[4] "GSM7192449_scalefactors_json.json.gz"   
[5] "GSM7192449_tissue_hires_image.png.gz"   
[6] "GSM7192449_tissue_lowres_image.png.gz"  
[7] "GSM7192449_tissue_positions_list.csv.gz"
list.files(pattern = "GSM7192450")
[1] "GSM7192450_aligned_fiducials.jpg.gz"    
[2] "GSM7192450_cold.h5"                     
[3] "GSM7192450_detected_tissue_image.jpg.gz"
[4] "GSM7192450_scalefactors_json.json.gz"   
[5] "GSM7192450_tissue_hires_image.png.gz"   
[6] "GSM7192450_tissue_lowres_image.png.gz"  
[7] "GSM7192450_tissue_positions_list.csv.gz"

Folder structure expected by Seurat

Our eventual goal is to use a function called Load10X_Spatial in the Seurat package, which data to be organised and named in a very specific manner. For a sample called Sample_1 it should look like this:-

Sample_1
├── filtered_feature_bc_matrix.h5/ 
└── spatial/                   
    ├── tissue_hires_image.png
    ├── tissue_lowres_image.png
    ├── scalefactors_json.json
    └── tissue_positions_list.csv
    └── ... + other images files if required
    

In other words, we need to create a separate folder for each of the four samples named according to the sample name. The “h5” file should be placed here and all other image data should be in a folder named spatial. As the IDs created by GEO are completely arbitrary and not related to the underlying question we will rename to some relating to the sample groups. This labels for this are found in the title column of the meta data.

The code to rename and move files

First we’ll define a few variables

## The GEO IDs that are used in the original naming scheme
geo_ids <- rownames(meta)

## Get more meaningful names
titles <- meta$title

## Make new sample names that we will use in Seurat
new_names <- paste("TME", meta$title, sep = "_")

## These are the common names of the image files we are going to copy (each will be prefixed by a sample name)

required_files <- c("aligned_fiducials.jpg.gz", 
                   "detected_tissue_image.jpg.gz",
                   "scalefactors_json.json.gz", 
                   "tissue_hires_image.png.gz",
                   "tissue_lowres_image.png.gz",
                   "tissue_positions_list.csv.gz")

A for loop is now used to copy the files for each sample. The code is probably a bit old fashioned, but it seems to do the job and doesn’t take too long. 🤷 Maybe at some point I could rewrite using purrr….

The code also prints messages to the screen to keep track of the copying and renaming process.

## Do this for all samples

for(i in 1:length(geo_ids)){
  
  new_dir <- new_names[i]
  new_spatial_dir <- paste0(new_names[i], "/spatial")
  
  message(paste("Creating new folder", new_dir, "to copy files prefixed with", geo_ids[i]))
  dir.create(new_dir, showWarnings = FALSE)
  dir.create(new_spatial_dir, showWarnings = FALSE)
  
  
  ## Copy the h5 file
  
  h5_file <- paste0(geo_ids[i], "_",titles[i],".h5")
  new_h5_file <- paste0(new_dir, "/filtered_feature_bc_matrix.h5")
  message(paste("Copying h5 file", h5_file, "to", new_h5_file))
  file.copy(h5_file, new_h5_file)
  
  ## Copy all the image files
  
  for(j in 1:length(required_files)){
    from <- paste0(geo_ids[i], "_", required_files[j])
    to <- paste0(new_names[i], "/spatial/", required_files[j])
      
    ## Check if extracted file exists, and use gunzip if not
    if(!file.exists(gsub(".gz", "", to))) {
      message(paste("Copying", from, "to", to))

      file.copy(from, to)
      R.utils::gunzip(to)
    } else (paste(to, "has already been copied"))
  }

}

We can check for a particular sample if the correct structure is in place

list.files("TME_cold/",recursive = TRUE)
[1] "filtered_feature_bc_matrix.h5"     "spatial/aligned_fiducials.jpg"    
[3] "spatial/detected_tissue_image.jpg" "spatial/scalefactors_json.json"   
[5] "spatial/tissue_hires_image.png"    "spatial/tissue_lowres_image.png"  
[7] "spatial/tissue_positions_list.csv"

Loading into Seurat

We can load Seurat and hopefully it will be able to read the data 🤞

library(Seurat)
Loading required package: SeuratObject
Loading required package: sp

Attaching package: 'SeuratObject'
The following object is masked from 'package:BiocGenerics':

    intersect
The following objects are masked from 'package:base':

    intersect, t
tme_cold <- Load10X_Spatial("TME_cold/")

Success! 🎉 Of course, there is much that we can do with these data and this will be a topic for another time. As a quick preview we can perform QC and look at distribution of counts

VlnPlot(tme_cold, features = "nCount_Spatial", pt.size = 0.1) + NoLegend()

And what we will be interested in is how these counts are distributed across the tissue.

SpatialFeaturePlot(tme_cold, features = "nCount_Spatial")

More next time!