Supplementary material for the papers:


Jianjun Hu, Bin Li, and Daisuke Kihara. (2005)
Limitations and Potentials of Current Motif Discovery Algorithms, Nucleic Acids Res. 2005; 33(15): 4899–4913


Jianjun Hu, Yifeng David Yang, and Daisuke Kihara. (2006)
EMD: an ensemble algorithm for discovering regulatory motifs in DNA sequences, BMC Bioinformatics 2006; 7: 342



 

 

1. E. coli  genome data sets

 

Data

Description

Note

RegulonDB 

(local cache) obtained from RegulonDB Database

 

ecoli.genes

gene information of E. coli

 

ecoli.genome

complete E. coli genome sequence

 

ecoli.motifs.zip

Separate files for each motif group compiled from RegulonDB

Uncompress with unzip on Linux

 

 

2. ECRDB70 data sets

 

Data

Description

Note

ECRDB70.txt

70 motif groups selected from RegulonDB. Some of the records are skipped when generating the input sequence data sets

 

ECRDB70.list

A list of motif groups in ECRDB with their motif widths and other information

 

ECRDB70.stat

Some statistics of the ECRDB70 motifs

 


3. Input sequence data sets with different margins generated from ECRDB70

Please refer to the paper for the procedure used to generate the following input sequence data sets from ECRDB70.
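As a rough illustration of what a "margin" means for the data sets in this section, the sketch below cuts a binding site out of a genome string together with a fixed number of flanking bases on each side. The function name, the 0-based half-open coordinates, and the clipping at the sequence ends are assumptions made for this example; the paper describes the actual procedure.

```python
def extract_with_margin(genome: str, start: int, end: int, margin: int) -> str:
    """Return genome[start:end] plus up to `margin` bases on each side.

    Coordinates are 0-based and half-open (an assumption of this sketch).
    The window is clipped at the ends of the string rather than wrapped
    around, so a circular genome is not handled here.
    """
    if margin < 0:
        raise ValueError("margin must be non-negative")
    if not (0 <= start < end <= len(genome)):
        raise ValueError("site coordinates fall outside the genome")
    left = max(0, start - margin)
    right = min(len(genome), end + margin)
    return genome[left:right]


# Toy sequence, not real E. coli data.
toy_genome = "AAAACCCCGGGGTTTTACGTACGT"
site_start, site_end = 8, 12  # the toy "motif" GGGG

for m in (2, 4, 20):
    print(m, extract_with_margin(toy_genome, site_start, site_end, m))
```

With a margin larger than the available flank (20 in the toy run), the sketch simply returns everything up to the sequence boundary.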

 

Data

Description

Note

ECRDB62A

input sequences extracted from intergenic regions in which the motifs in ECRDB70 are located.

 

ECRDB70B-20

training sequences with margin size of 20 on both sides of motifs

 

ECRDB70B-50

training sequences with margin size of 50 on both sides of motifs

 

ECRDB70B-100

training sequences with margin size of 100 on both sides of motifs

 

ECRDB70B-200

training sequences with margin size of 200 on both sides of motifs

 

ECRDB70B-300

training sequences with margin size of 300 on both sides of motifs

 

ECRDB70B-400

training sequences with margin size of 400 on both sides of motifs

 

ECRDB70B-500

training sequences with margin size of 500 on both sides of motifs

 

ECRDB70B-800

training sequences with margin size of 800 on both sides of motifs

 

ECRDB61B-all

training sequences with margin sizes of 20, 50, 100, 200, 300, 400, 500, and 800 on both sides of motifs (8 data sets)

Redundant input sequences were removed, and motif groups left with just one input sequence after this processing were removed as well, so only 61 motif groups remain in each data set

resampling

sequence files of motif groups with at least 40 sequences, used for benchmarking how the number of sequences affects prediction performance
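The filtering noted for ECRDB61B-all and the resampling data set can be sketched as follows. This is a minimal illustration, not the authors' script: "redundant" is taken here to mean an exact duplicate sequence within a group, and the group names, the `min_size` parameter, and the fixed random seed are hypothetical choices for the example.

```python
import random


def remove_redundant(groups: dict[str, list[str]]) -> dict[str, list[str]]:
    """Drop duplicate sequences per group, then drop groups with < 2 sequences.

    Assumption: two sequences are redundant only if they are identical
    (case-insensitive). The original order within each group is kept.
    """
    cleaned = {}
    for name, seqs in groups.items():
        unique = list(dict.fromkeys(s.upper() for s in seqs))
        if len(unique) > 1:
            cleaned[name] = unique
    return cleaned


def subsample(seqs: list[str], n: int, seed: int = 0, min_size: int = 40) -> list[str]:
    """Draw n sequences without replacement from a sufficiently large group."""
    if len(seqs) < min_size:
        raise ValueError("group is smaller than min_size")
    if n > len(seqs):
        raise ValueError("cannot draw more sequences than the group holds")
    return random.Random(seed).sample(seqs, n)


# Toy input, not real ECRDB70 records.
toy_groups = {
    "groupA": ["ACGT", "acgt", "TTGA"],  # one duplicate, two sequences remain
    "groupB": ["CTGT", "CTGT"],          # collapses to one sequence, dropped
}
print(remove_redundant(toy_groups))
```

Calling `subsample` repeatedly with increasing `n` on the same group gives a series of inputs of growing size, which is the kind of experiment the resampling files are meant to support.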

 

 

4. Background sequences

Two types of background models are generated based on:
1) The whole E. coli genome sequence: Download
2) All sequence segments located in the intergenic regions of the E. coli genome: Download. This file is generated from the E. coli genome sequence and the E. coli gene information (ecoli.genes). It includes intergenic segments from both strands of the E. coli genome.
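The second background set can be pictured with the sketch below, which collects the stretches between annotated genes and adds the reverse complement of each so that both strands are represented. The gene coordinate format (0-based, half-open `(start, end)` pairs) and the function names are assumptions for this example; the actual layout of the gene information file is not described here.

```python
def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA string over the alphabet A, C, G, T."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def intergenic_segments(genome: str, genes: list[tuple[int, int]]) -> list[str]:
    """Return intergenic segments from both strands.

    `genes` holds (start, end) pairs, 0-based and half-open (an assumption).
    Overlapping genes are merged implicitly by tracking the furthest end seen.
    """
    forward = []
    cursor = 0
    for start, end in sorted(genes):
        if start > cursor:
            forward.append(genome[cursor:start])
        cursor = max(cursor, end)
    if cursor < len(genome):
        forward.append(genome[cursor:])

    both_strands = []
    for seg in forward:
        both_strands.append(seg)
        both_strands.append(reverse_complement(seg))
    return both_strands


# Toy data, not real E. coli coordinates.
toy_genome = "AACCGGTTAACC"
toy_genes = [(2, 6), (4, 8)]
print(intergenic_segments(toy_genome, toy_genes))
```

The resulting list of segments could then be fed to whatever background-model builder a given motif discovery program expects.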

 

5. Parameter settings for benchmark experiments and the minimal-parameter-tuning guideline

Following our minimal-parameter-tuning guideline, we list all the major running parameters of the five motif discovery programs used in our experiments: AlignACE, BioProspector, MDScan, MEME, and MotifSampler. Most parameters are left unset or at their default settings. Check the parameters here.

 


Supplementary material for the paper:

Jianjun Hu, Yifeng D. Yang, and Daisuke Kihara. (2006) EMD: An Ensemble Algorithm for Discovering Regulatory Motifs in DNA Sequences (submitted to BMC Bioinformatics)

1. E. coli genome data sets

2. ECRDB70 data sets

3. Input sequence data sets with different margins generated from ECRDB70

Data

Description

Note

ECRDB61C-X

training sequences with margin sizes of 20, 50, 100, 200, 300, 400, 500, 800 on both sides of motifs (8 data sets)

Modified from the ECRDB61B-X data sets: the margin sequences are artificially shuffled, while preserving the di-mer nucleotide frequency of intergenic regions of the E. coli genome
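One possible way to produce margins that follow the dinucleotide statistics of intergenic regions is to train a first-order Markov chain on intergenic sequences and sample new flanks from it, leaving the motif itself untouched. The sketch below is only an assumption about how such a step could look; it is not taken from the paper, all function names are hypothetical, and it preserves dinucleotide frequencies in expectation rather than exactly.

```python
import random
from collections import defaultdict


def train_first_order(seqs):
    """Count start bases and base-to-base transitions in the training set."""
    starts = defaultdict(int)
    transitions = defaultdict(lambda: defaultdict(int))
    for s in seqs:
        s = s.upper()
        if s:
            starts[s[0]] += 1
        for a, b in zip(s, s[1:]):
            transitions[a][b] += 1
    if not starts:
        raise ValueError("training set contains no sequence data")
    return starts, transitions


def sample_sequence(starts, transitions, length, rng):
    """Sample a sequence of the given length from the first-order model."""
    def pick(table):
        letters = list(table)
        return rng.choices(letters, weights=[table[x] for x in letters])[0]

    out = []
    while len(out) < length:
        nxt = transitions.get(out[-1]) if out else None
        # Fall back to the start distribution if no transition was observed.
        out.append(pick(nxt) if nxt else pick(starts))
    return "".join(out)


def replace_margins(seq, motif_start, motif_end, starts, transitions, rng):
    """Keep seq[motif_start:motif_end]; resample both flanks at equal length."""
    left = sample_sequence(starts, transitions, motif_start, rng)
    right = sample_sequence(starts, transitions, len(seq) - motif_end, rng)
    return left + seq[motif_start:motif_end] + right


# Toy training data, not real intergenic sequence.
rng = random.Random(0)
starts, transitions = train_first_order(["ACGTACGT", "TTGACA"])
print(replace_margins("NNNNGGGGNNNN", 4, 8, starts, transitions, rng))
```

An exact dinucleotide-preserving shuffle of each margin would be a different technique; the model-based sampling shown here was chosen because the note refers to the frequencies of the intergenic regions rather than of the individual margin.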

4. Background Sequences

5. Parameter settings for benchmark experiments and the minimal-parameter-tuning guideline



 

 

Contact Information:

 

Lilly Bld. B235

Department of Biological Sciences

Purdue University

West Lafayette, IN, 47906

Tel: 765-494-2744

Email: hujianju@purdue.edu

          dkihara@purdue.edu

          yang41@purdue.edu