Design of the custom whole-genome malaria oligonucleotide array

Serge Batalov1, Elizabeth A. Winzeler2
1batalov@gnf.org, GNF; 2winzeler@scripps.edu, TSRI/GNF

To study the transcriptome of the malaria parasite, we designed a custom no-mismatch high-density oligonucleotide array. The design was based on the P. falciparum sequence data released in July 2001. The custom array contains probes to at least 5159 P. falciparum genes. This includes 260,596 25-mer probes corresponding to P. falciparum predicted coding sequence and 106,630 probes from P. falciparum non-coding sequence (intergenic regions, introns, and gene sequence corresponding to the non-coding strand). In addition, 124,957 probes from Plasmodium yoelii contigs are included on the array. A further 6000 control sequences corresponding to human and mouse genes that are reported to be highly expressed in blood cells, 3602 traditional Affymetrix controls, and 2397 background probes were selected. We also included P. falciparum mitochondrion and plastid genome sequences. Because the P. falciparum genome is relatively small (26 Mb), this design allowed the 367,226 coding and non-coding P. falciparum probes to be placed, on average, every 150 bases on both strands. Although the opposite strand of a predicted transcript was often predicted to be non-coding, resulting in fewer probes on that strand, coverage was still substantial.
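
The probe counts and the average spacing quoted above can be checked with a few lines of arithmetic. This is only a rough sketch: it uses the rounded 26 Mb genome size from the text, assumes the probes are spread evenly over both strands, and the variable names are illustrative.

```python
# Probe counts quoted in the text.
coding_probes = 260_596
noncoding_probes = 106_630
falciparum_probes = coding_probes + noncoding_probes

# Rounded genome size from the text (26 Mb); both strands are tiled.
genome_bases = 26_000_000
strands = 2

# Average distance between probes if they were spread evenly (an assumption).
average_spacing = genome_bases * strands / falciparum_probes

print(falciparum_probes)          # 367226
print(round(average_spacing, 1))  # 141.6
```

The result, about 142 bases per probe, is in line with the approximate figure of one probe every 150 bases.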

Probes were selected using non-proprietary rules as described earlier [1]. No mismatch sequences were used in the design of the array. One of the main challenges in selecting specific and effective probes was the high A+T content and low complexity of the P. falciparum genome. Following the production and release of the complete P. falciparum genome sequence and annotations in October 2002, we validated the design of the array by mapping probe nucleotide sequences to the sense nucleotide sequence of predicted coding regions. Of the 5409 published coding sequences downloaded on October 11, 2002, 18 were duplicates and 11 more were subsequences of others. As a consequence, we would not be able, for example, to resolve the expression of duplicated genes that all share the same sequence. The BLAST analysis showed that, of the 5409 predicted coding sequences, 203 were not represented on the array. In some of these cases the predicted coding sequence was short (48 proteins were less than 100 amino acids in length). Some other sequences were repetitive and of low complexity. Finally, some nucleotide sequences from chromosomes 6 and 7 were unavailable at the time the array was designed.
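
The validation logic described above can be illustrated on toy data. The sketch below is not the BLAST-based procedure used for the array: it substitutes exact substring matching for BLAST, and the function names, gene identifiers, and sequences are all hypothetical. It flags duplicated coding sequences, coding sequences contained within others, and coding sequences whose sense strand matches no probe.

```python
def find_redundant(cds):
    """Return (duplicates, subsequences) as lists of IDs.

    duplicates: IDs whose sequence exactly repeats an earlier entry.
    subsequences: IDs whose sequence lies strictly inside another sequence.
    """
    first_id_for_seq = {}
    duplicates = []
    for gene_id, seq in cds.items():
        if seq in first_id_for_seq:
            duplicates.append(gene_id)
        else:
            first_id_for_seq[seq] = gene_id

    subsequences = [
        gene_id
        for seq, gene_id in first_id_for_seq.items()
        if any(seq != other and seq in other for other in first_id_for_seq)
    ]
    return duplicates, subsequences


def unrepresented(cds, probes):
    """IDs of coding sequences whose sense strand contains no probe exactly."""
    return [
        gene_id
        for gene_id, seq in cds.items()
        if not any(probe in seq for probe in probes)
    ]


# Toy data (hypothetical sequences, not taken from the P. falciparum genome).
gene_a = "ATGGCTAGCTAGGATCCGATTACAGGCTTAACGGATCCATGA"
cds = {
    "gene_a": gene_a,
    "gene_b": gene_a,        # exact duplicate of gene_a
    "gene_c": gene_a[5:35],  # contained within gene_a
    "gene_d": "ATGAAATAA",   # too short to contain any 25-mer probe
}
probes = [gene_a[8:33]]      # a single 25-mer probe

duplicates, subsequences = find_redundant(cds)
missing = unrepresented(cds, probes)

print(duplicates)    # ['gene_b']
print(subsequences)  # ['gene_c']
print(missing)       # ['gene_d']
```

In this toy case the single probe matches gene_a, gene_b, and gene_c alike, which shows why expression of duplicated or nested sequences cannot be resolved separately, while the very short gene_d has no probe at all.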

Reference:

1. K.G. Le Roch, Y. Zhou, S. Batalov, and E.A. Winzeler. Monitoring the chromosome 2 intraerythrocytic transcriptome of Plasmodium falciparum using oligonucleotide arrays. Am. J. Trop. Med. Hyg. 67(3), 2002, pp. 233-243.