Supplementary information 1: Perl scripts used to search the Pfam sequence database for multi-domain proteins of a specific domain architecture.

This set of scripts identifies multi-domain proteins consistent with the input domains and inter-domain linker lengths. Two Perl scripts are provided below.

Script 1

Sample input

a) domain_arch_in.txt

% 85 PF01369 36 PF00169 39
286-114 126-31
2

The first line of the file "domain_arch_in.txt" gives the maximum cut-off lengths for the N- and C-terminal extensions and the domain linkers, alternating with the Pfam ids of the domains. In the sample, 85 is the maximum N-terminal extension, 36 the maximum linker between PF01369 and PF00169, and 39 the maximum C-terminal extension. The second line gives the window range for the domain boundaries: for example, "286-114" means that the domain size can range from a minimum of 172 (286 - 114) to a maximum of 400 (286 + 114). The third line indicates the position of the domain of interest.

b) Pfam alignment files

Separate Pfam domain alignment files for the required Pfam domain families (Pfam ids), each a subset of the database file "Pfam-A.full.uniprot.gz".

Script 1 generates intermediate files.

Script 2

Input files

a) domain_arch_in.txt

b) The intermediate files generated by script 1.
c) UniProt sequences for all the UniProt entries in the intermediate files from script 1, downloaded and supplied in the following format (one header line followed by the complete sequence on a single line):

>tr|Q3T9T9|Q3T9T9_MOUSE Uncharacterized protein OS=Mus musculus OX=10090 GN=Cyth1 PE=2 SV=1
MEDDDSYVPSDLTAEERQELENIRRRKQELLADIQRLKEEIAEVANEIESLGSTEERKNMQRNKQVAMGRKKFNMDPKKGIQFLIENGLLKNTCEDIAQFLYKGEGLNKTAIGDYLGERDEFSIQVLHAFVELHEFTDLNLVQALRQFLWSFRLPGEAQKIDRMMEAFAQRYCQCNTGVFQSTDTCYVLSFAIIMLNTSLHNPNVKDKPTVERFIAMNRGINDGGDLPEELLRNLYESIKNEPFKIPEDDGNDLTHTFFNPDREGWLLKLGGGRVKTWKRRWFILTDNCLYYFEYTTDKEPRGIIPLENLSIREVEDSKKPNCFELYIPDNKDQVIKACKTEADGRVVEGNHTVYRISAPTPEEKEDWIKCIKAAISRDPFYEMLAARKKEVSSTKRH
>tr|Q3TZ02|Q3TZ02_MOUSE Cytohesin-1 OS=Mus musculus OX=10090 GN=Cyth1 PE=1 SV=1
MVLKEEGEDVPSDLTAEERQELENIRRRKQELLADIQRLKEEIAEVANEIESLGSTEERKNMQRNKQVAMGRKKFNMDPKKGIQFLIENGLLKNTCEDIAQFLYKGEGLNKTAIGDYLGERDEFSIQVLHAFVELHEFTDLNLVQALRQFLWSFRLPGEAQKIDRMMEAFAQRYCQCNTGVFQSTDTCYVLSFAIIMLNTSLHNPNVKDKPTVERFIAMNRGINDGGDLPEELLRNLYESIKNEPFKIPEDDGNDLTHTFFNPDREGWLLKLGGGRVKTWKRRWFILTDNCLYYFEYTTDKEPRGIIPLENLSIREVEDSKKPNCFELYIPDNKDQVIKACKTEADGRVVEGNHTVYRISAPTPEEKEDWIKCIKAAISRDPFYEMLAARKKKVSSTKRH
>tr|Q5ZM97|Q5ZM97_CHICK Cytohesin 1 OS=Gallus gallus OX=9031 GN=CYTH1 PE=2 SV=1
MDEEGGYVPSDLTPEECQELENIRRRKQELLADIQRLKDEIAEVTNEIENLGSTEERKNMQRNKQVAMGRKKFNMDPKKGIQFLIENDLLKNTCEDIAQFLYKGEGLNKTAIGDYLGERDEFNIQVLHAFVELHEFTDLNLVQALRQFLWSFRLPGEAQKIDRMMEAFAQRYCQCNPGVFQSTDTCYVLSFAIIMLNTSLHNPNVKDKPTAERFIAMNRGINDGGDLPEELLRNLYESIKNEPFKIPEDDGNDLTHTFFNPDREGWLLKLGGGRVKTWKRRWFILTDNCLYYFEYTTDKEPRGIIPLENLSIREVEDSKKPNCFELYIPDNKDQVIKACKTEADGRVVEGNHTVYRISAPTPEEKEEWIKCIKAAISRDPFYEMLAARKKKVSSTKRH

Expected output: UniProt entries of sequences containing the desired domain architecture. After running these scripts, fragments of UniProt sequences and multi-domain proteins with incomplete domains or with long insertions are removed manually.
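To illustrate the input layout described above, the following minimal sketch parses the sample architecture line into Pfam ids and cut-offs, and expands a window such as "286-114" into its minimum and maximum domain sizes. The subroutine names parse_arch_line and window_range are hypothetical and are not part of the scripts in this supplement. The sketch assumes the '%'-style line, in which cut-offs and Pfam ids alternate.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: split a '%'-style architecture line into the
# Pfam ids and the cut-offs that surround them.
sub parse_arch_line {
    my ($line) = @_;
    my @fields = split ' ', $line;
    shift @fields;                      # drop the leading '%'
    my (@domains, @cutoffs);
    for my $i (0 .. $#fields) {
        if ($i % 2) { push @domains, $fields[$i]; }   # odd index: Pfam id
        else        { push @cutoffs, $fields[$i]; }   # even index: cut-off
    }
    return (\@domains, \@cutoffs);
}

# Hypothetical helper: "286-114" -> (172, 400), i.e. centre -/+ tolerance.
sub window_range {
    my ($window) = @_;
    my ($centre, $tol) = split /-/, $window;
    return ($centre - $tol, $centre + $tol);
}

my ($domains, $cutoffs) = parse_arch_line('% 85 PF01369 36 PF00169 39');
print "domains: @$domains\n";           # PF01369 PF00169
print "cut-offs: @$cutoffs\n";          # 85 36 39

my ($min, $max) = window_range('286-114');
print "size range: $min to $max\n";     # 172 to 400
```

For n domains the line carries n + 1 cut-offs: the first and last bound the N- and C-terminal extensions, and the ones in between bound the linkers.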
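script 1:

#!/usr/bin/perl
# Script 1: for each domain of the architecture, keep the Pfam hits that
# satisfy the N-terminal extension and inter-domain linker cut-offs.

# Architecture file (named "domain_arch_in.txt" in the description above).
open (ID, "dom_arch_in.txt") or die $!;
@ID = <ID>;
$line = $ID[0];
chomp($line);

# A leading '#' means only Pfam ids are listed and a default cut-off of 40
# is used throughout; a leading '%' means cut-offs and Pfam ids alternate.
if ($line =~ /^\#.*/)    { $x = 1; $d_dlinker = 40; }
elsif ($line =~ /^\%.*/) { $x = 2; }
@DA = split('\s', $line);
shift(@DA);    # drop the leading '#' or '%'
if ($x == 1) {
    $k = 0;
    for ($d = 0; $d <= (@DA); $d++) {
        $k++;
        push(@DL, $d_dlinker);
    }
}
elsif ($x > 1) {
    # Odd positions hold Pfam ids, even positions hold cut-offs.
    for ($i = 1; $i <= (@DA); $i = $i + 2) {
        push @AA, $DA[$i] if $DA[$i] ne '';
        push(@DL, $DA[$i-1]);
    }
    @DA = @AA;
}
# @DA: Pfam ids. @DL: N-terminal, linker(s) and C-terminal cut-offs.
$len1 = (@DA);
$len2 = (@DL);
print "@DA\t$len1\n";
print "@DL\t$len2\n";
$len3 = $len2 - 1;

for ($p = 0; $p < (@DA); $p++) {
    $q = $p + 1;
    print "$DL[$p]++\t$p\n";
    print "**$DA[$p]**\t$p\t$len3\n";
    open (FH, "/All_PfamA_uniprot_pfam_families/$DA[$p]") or die $!;
    @FH = <FH>;
    foreach (@FH) {
        # Skip annotation lines ('#') and the terminator ('//').
        if ($_ !~ m/^\#/ and $_ !~ m/^\//) {
            # Alignment lines have the form <id>/<start>-<end> <sequence>.
            $val_line = $_;
            @array = split('/', $val_line);
            if ($array[0] =~ m/\./) { @arr = split('\.', $val_line); $uid_A = $arr[0]; }
            else                    { $uid_A = $array[0]; }
            @b_array = split('\s+', $array[1]);
            $bndry = $b_array[0];
            @sp_bndry = split('\-', $bndry);
            $bndry_N = $sp_bndry[0];
            $bndry_C = $sp_bndry[1];
            if ($p == 0) {
                # First domain: its start must lie within the N-terminal cut-off.
                $f = $p + 1;
                if ($bndry_N <= $DL[$p]) { push(@init_ar, "$uid_A|$bndry\n"); }
                open (OUT1, ">$DA[$p]_file$f.txt") or die $!;
                print OUT1 @init_ar;
                close OUT1;
            }
            if ($p > 0) {
                # Later domains: same protein as a hit kept for the previous
                # domain, and linker length within the cut-off.
                open (FH1, "$DA[$p-1]_file$p.txt") or die $!;
                @FH1 = <FH1>;
                open (OUT2, ">>$DA[$p]_file$q.txt") or die $!;
                foreach (@FH1) {
                    chomp($_);
                    @new = split('\|', $_);
                    @new_bndry = split('\-', $new[1]);
                    $bndry_C1 = $new_bndry[1];
                    $diff = $bndry_N - $bndry_C1;
                    if ($uid_A eq $new[0] and $diff <= $DL[$p]) {
                        print "$array[0]\t$bndry\t$_\t**$bndry_N-$bndry_C1\t$diff\t$p\n";
                        print OUT2 "$uid_A|$bndry\n";
                        push(@next_ar, "$uid_A|$bndry\n");
                    }
                }
                close OUT2;
            }
        }
    }
}

script 2:

#!/usr/bin/perl
# Script 2: keep the proteins from the last intermediate file whose
# C-terminal extension is within the cut-off.

# Architecture file, parsed exactly as in script 1.
open (ID, "dom_arch_in.txt") or die $!;
@ID = <ID>;
$line = $ID[0];
chomp($line);
if ($line =~ /^\#.*/)    { $x = 1; $d_dlinker = 40; }
elsif ($line =~ /^\%.*/) { $x = 2; }
@DA = split('\s', $line);
shift(@DA);
if ($x == 1) {
    $k = 0;
    for ($d = 0; $d <= (@DA); $d++) {
        $k++;
        push(@DL, $d_dlinker);
    }
}
elsif ($x > 1) {
    for ($i = 1; $i <= (@DA); $i = $i + 2) {
        push @AA, $DA[$i] if $DA[$i] ne '';
        push(@DL, $DA[$i-1]);
    }
    @DA = @AA;
}
$len1 = (@DA);
$len2 = (@DL);
$DAJ = join('_', @DA);
$bndry_C = $DL[$len2-1];    # C-terminal cut-off: last element of @DL

# FASTA file with one header line and one sequence line per entry.
open (SEQ, "$DAJ.fasta") or die $!;
@seq = <SEQ>;
# Hits for the last domain, written by script 1.
open (FH, "$DA[$len1-1]_file$len1.txt") or die $!;
@FH = <FH>;
open (DAP, ">$DAJ" . "_id_non_final.txt") or die $!;

for ($j = 0; $j < (@seq); $j = $j + 2) {
    $sq = $seq[$j+1];
    chomp($sq);
    $sqlen = length($sq);
    $count = 0;
    @array1 = split('\|', $seq[$j]);
    $id1 = $array1[1];    # accession from >tr|ACCESSION|NAME
    foreach (@FH) {
        @array2 = split('\|', $_);
        $id2 = $array2[0];
        $bndry = $array2[1];
        chomp($bndry);
        @bn = split('\-', $bndry);
        $diff = $sqlen - $bn[1];    # residues after the last domain
        if ($id1 eq $id2 and $diff <= $bndry_C) {
            $count++;
            print "$id1\t$id2\t$bndry\t$count\t$bn[0]\t$bn[1]\t$sqlen\t$diff\n";
            print DAP "$id2\n";
        }
    }
}
close DAP;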
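Script 2 steps through the FASTA file two lines at a time, so it relies on every sequence being on a single line. FASTA files are often wrapped over several lines; a minimal sketch of a reader that tolerates wrapped records and returns the sequence length per accession is given below. The subroutine name read_fasta_lengths is hypothetical, the record used is a truncated toy example, and the sketch is not part of the original scripts.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return { accession => sequence length } from a
# filehandle in UniProt FASTA format, with sequences on one or many lines.
sub read_fasta_lengths {
    my ($fh) = @_;
    my (%len, $acc);
    while (my $line = <$fh>) {
        chomp $line;
        next if $line eq '';
        if ($line =~ /^>/) {
            # Header has the form >tr|ACCESSION|ENTRY_NAME description
            $acc = (split /\|/, $line)[1];
            $len{$acc} = 0;
        }
        elsif (defined $acc) {
            # Sequence line: add its residues to the current record.
            $len{$acc} += length $line;
        }
    }
    return \%len;
}

# Usage with an in-memory file holding one wrapped toy record.
my $fasta = ">tr|Q3T9T9|Q3T9T9_MOUSE example\nMEDDDSYVPS\nDLTAEERQEL\n";
open(my $fh, '<', \$fasta) or die $!;
my $lengths = read_fasta_lengths($fh);
close $fh;
print "Q3T9T9\t$lengths->{Q3T9T9}\n";   # Q3T9T9, 20 residues
```

The resulting lengths could stand in for $sqlen when computing the C-terminal extension (sequence length minus domain end).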