A549 Cells
The data used as an example here and throughout the 4D-Genome Browser project were generated for the ENCODE project by the Reddy lab at Duke University.
It includes paired-end, Hi-C sequencing libraries on A549 cells, a cancer lung cell line treated with 100 nM dexamethasone.
Specifically, the library of the first isogenic replicate, for the initial time point (ENCLB571GEP) with paired files ENCFF039FYU and ENCFF479RSE and the first isogenic replicate, for the twelve-hour time point - treated with 100 nM dexamethasone - (ENCLB870JCZ)
with paired files ENCFF668EDF and ENCFF398SQH were used to generate Hi-C contact maps via HiCExplorer.
As an example, we have included the instructions for the construction of an .h5 file of the twelve-hour timepoint for chromosome 22, below.
File gathering and preprocessing
## Set the working directory
cd ./hic-converter/data
## Activate hic-explorer environment
conda activate hicexplorerenv
## Make directories for analysis.
mkdir -p GRCh38 fastq bam bwa
## Download the two paired fastq.gz files for the first replicate.
wget https://www.encodeproject.org/files/ENCFF668EDF/@@download/ENCFF668EDF.fastq.gz -O ./fastq/ENCFF668EDF_R1_001.fastq.gz
wget https://www.encodeproject.org/files/ENCFF398SQH/@@download/ENCFF398SQH.fastq.gz -O ./fastq/ENCFF398SQH_R2_001.fastq.gz
## Download the human reference genome from the ENCODE project.
wget https://www.encodeproject.org/files/GRCh38_no_alt_analysis_set_GCA_000001405.15/@@download/GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.gz -O ./GRCh38/GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.gz
## Unzip the genome.
gunzip ./GRCh38/GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.gz
Restriction site marking and indexing
Prior to analysis we need to index the reference genome we downloaded above and then search for MobI restriction cut sites.
For this we use bwa index from the bwa suite of tools and hicFindRestSite from HiCExplorer.
## Index the human genome (this can take awhile but we only need to do it once)!
bwa index -p bwa/GRCh38_index ./GRCh38/GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta
## Find the MobI restriction sites (GATC) in the GRCh38, human reference genome.
hicFindRestSite --fasta ./GRCh38/GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta -o ./GRCh38/GRCh38_MobI_cut_sites.bed --searchPattern GATC
Building a contact map for chromosome 22
Below are the example bwa mem and hicBuildMatrix commands used to align squenced reads and build a Hi-C contact map in .h5 format.
## Align data to the human reference genome.
bwa mem -A 1 -B 4 -E 50 -L 0 -t 14 bwa/GRCh38_index ./fastq/ENCFF668EDF_R1_001.fastq.gz | samtools view -SHb - > ./bam/ENCFF668EDF_R1_001.bam
bwa mem -A 1 -B 4 -E 50 -L 0 -t 14 bwa/GRCh38_index ./fastq/ENCFF398SQH_R2_001.fastq.gz | samtools view -SHb - > ./bam/ENCFF398SQH_R2_001.bam
## Build .h5 matrix for chromosome 22.
hicBuildMatrix --samFiles ./bam/ENCFF668EDF_R1_001.bam ./bam/ENCFF398SQH_R2_001.bam \
--region chr22 --binSize 200000 \
--threads 4 --inputBufferSize 200000 \
--restrictionSequence GATC --danglingSequence GATC --restrictionCutFile ./GRCh38/GRCh38_MboI_cut_sites.bed \
--QCfolder ENCLB870JCZ.chr22.12hr --outFileName ./h5/ENCLB870JCZ.chr22.200kb.12hr.h5
Converting the .h5 contact map to .hic
Currently the 4D-Genome Workflow is only compatible with .hic files.
After generating a .h5 file as shown above, it will need to be converted.
An example of this file conversion process is shown below.
The scripts required for this conversion are available here.
## Activate hic-explorer environment
conda activate hicexplorerenv
## Change directory to tools
cd ./hic-converter/tools
## Convert a summary.txt.gz file to an .hic file for a single chromosome
./h5.to.hic.sh -m ../data/h5/ENCLB870JCZ.chr22.200kb.12hr.h5 -g ../data/sizes/GRCh38.chr22.size.bed -o ../data/hic/ENCLB870JCZ.chr22.200kb.12hr.hic