# Tutorial

The following tutorial shows how to carry out KODAMA analysis using the freely available software R.

1) To install the R software, go to http://www.r-project.org

2) After installing R, you need to install the R package "KODAMA".
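The two steps above can be carried out from the R console; the `install.packages` call assumes an internet connection to a CRAN mirror:

```r
# Install the KODAMA package from CRAN (requires an internet connection).
install.packages("KODAMA")

# Load the package; this also makes the MetRef example data set available.
library(KODAMA)
```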

In the example below, KODAMA is applied to the MetRef data set, which contains NMR spectra of urine samples. The data belong to a cohort of 22 healthy donors (11 male and 11 female), each of whom provided about 40 urine samples over a time course of approximately 2 months, for a total of 873 samples and 416 variables.

```r
library(KODAMA)

# Loading of the MetRef dataset.
data(MetRef)

# Zero values are removed.
MetRef$data = MetRef$data[, -which(colSums(MetRef$data) == 0)]

# Normalization of the data set with the Probabilistic Quotient Normalization method.
MetRef$data = normalization(MetRef$data)$newXtrain

# Centers the mean to zero and scales data by dividing each variable by the variance.
MetRef$data = scaling(MetRef$data)$newXtrain

donor = as.numeric(as.factor(MetRef$donor))
gender = as.numeric(as.factor(MetRef$gender))

# Principal Component Analysis is computed.
pca = prcomp(MetRef$data)$x
plot(pca, pch = 20 + gender, bg = rainbow(25)[donor],
     xlab = "First Principal Component", ylab = "Second Principal Component")
```

The Principal Component Analysis of this data set is shown in the figure. Colors indicate the donor and shapes indicate the donor's gender.

## Principal Component Analysis

Here, KODAMA is performed on the data set using the *k*-NN classifier (*k* = 10), and the results are visualized by computing Multidimensional Scaling (MDS) on the KODAMA dissimilarity matrix.

```r
# KODAMA is computed using k-NN as supervised classifier.
kk0 = KODAMA(MetRef$data, FUN = KNN.CV)

# Multidimensional Scaling is computed on the KODAMA dissimilarity matrix.
mds_kodama_kk0 = cmdscale(kk0$dissimilarity)
plot(mds_kodama_kk0, pch = 20 + gender, bg = rainbow(25)[donor],
     xlab = "First Component", ylab = "Second Component")
```

## KODAMA with *k*-NN classifier

KODAMA analysis can be performed using different classifiers (*e.g.*, *k*-NN, PCA-CA-*k*NN, SVM). In the next example, KODAMA is performed using the PCA-CA-*k*NN classifier. By default, the starting vector *W* is initialized with each sample labeled as a different class. Here, *W* is instead initialized with the output of *k*-means clustering. In the KODAMA function, *W* can be either a vector or a function.

```r
# KODAMA is computed using PCA-CA-kNN as supervised classifier.
# The initialization vector W is set up with the result of k-means.
kk1 = KODAMA(MetRef$data, FUN = PCA.CA.KNN.CV,
             W = function(x) as.numeric(kmeans(x, 50)$cluster))

# MDS is computed on the KODAMA dissimilarity matrix.
mds_kodama_kk1 = cmdscale(kk1$dissimilarity)
plot(mds_kodama_kk1, pch = 20 + gender, bg = rainbow(25)[donor],
     xlab = "First Component", ylab = "Second Component")
```

## KODAMA with PCA-CA-*k*NN classifier

Multidimensional Scaling is not the only method that can be used to visualize the KODAMA dissimilarity matrix. t-SNE or Tree Preserving Embedding (TPE) can be used as well, often with better results.

```r
# Tree Preserving Embedding provides a better visualization than MDS.
# install.packages("tpe")
library("tpe")
tpe_kodama_kk1 = tpe2d(kk1$dissimilarity)
plot(tpe_kodama_kk1, pch = 20 + gender, bg = rainbow(25)[donor],
     xlab = "First Component", ylab = "Second Component")
```

## KODAMA & TPE
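The t-SNE alternative mentioned above can be sketched with the Rtsne package; this example is an assumption on my part (the tutorial does not show t-SNE code), using `Rtsne` with `is_distance = TRUE` to pass a precomputed dissimilarity matrix:

```r
# t-SNE on the KODAMA dissimilarity matrix (sketch; assumes the Rtsne package).
# install.packages("Rtsne")
library(Rtsne)

# is_distance = TRUE tells Rtsne that the input is a precomputed distance matrix.
tsne_kodama_kk1 = Rtsne(as.dist(kk1$dissimilarity), is_distance = TRUE)

# The embedded coordinates are returned in the Y component.
plot(tsne_kodama_kk1$Y, pch = 20 + gender, bg = rainbow(25)[donor],
     xlab = "First Component", ylab = "Second Component")
```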

KODAMA can use external information to work in a semisupervised way. The procedure can be started from different initializations of the vector *W*. Supervised constraints can be imposed by linking samples so that, if one of them changes, the linked samples must change in the same way (*i.e.*, they are forced to belong to the same class). This produces solutions in which linked samples have the lowest values in the KODAMA dissimilarity matrix.

In the next example, urine samples from the same donor are forced to stay together. This semisupervised approach can highlight hidden features; in this case, a clear gender separation emerges in the figure.

```r
# KODAMA is computed using PCA-CA-kNN as supervised classifier.
# The initialization vector W is set up with the classification to each donor.
kk3 = KODAMA(MetRef$data, FUN = PCA.CA.KNN.CV, W = donor, constrain = donor)

# MDS is computed on the KODAMA dissimilarity matrix.
mds_kodama_kk3 = cmdscale(kk3$dissimilarity)
plot(mds_kodama_kk3, pch = 20 + gender, bg = rainbow(25)[donor],
     xlab = "First Component", ylab = "Second Component")
```

## KODAMA Semisupervised Type-I

Information can also be provided by forcing KODAMA to maintain the labels of the vector *W*. Each element of the vector `fix` must be TRUE or FALSE; by default, all elements are FALSE. Samples with a TRUE `fix` value keep the class label defined in *W* during the maximization of the cross-validation accuracy.

Here, gender information is provided for the first ten donors. Color coding indicates gender; square data points are samples with supervised information, and circle data points are samples without supervised information.

```r
# Gender information is provided for the first ten donors only.
FIX = donor <= 10
inform = gender
inform[!FIX] = NA
kk4 = KODAMA(MetRef$data, FUN = PCA.CA.KNN.CV, W = inform, fix = FIX)

# MDS is computed on the KODAMA dissimilarity matrix.
mds_kodama_kk4 = cmdscale(kk4$dissimilarity)
color = c("#2c7ac8", "#e3b80f", "#7b7979", "#333333")
plot(mds_kodama_kk4, pch = 21 + as.numeric(FIX), bg = color[gender],
     xlab = "First Component", ylab = "Second Component")
```

## KODAMA Semisupervised Type-II

Here, the information from the last two examples is provided together.

```r
kk5 = KODAMA(MetRef$data, FUN = PCA.CA.KNN.CV, W = inform,
             constrain = donor, fix = FIX)

# MDS is computed on the KODAMA dissimilarity matrix.
mds_kodama_kk5 = cmdscale(kk5$dissimilarity)
color = c("#2c7ac8", "#e3b80f", "#7b7979", "#333333")
plot(mds_kodama_kk5, pch = 21 + as.numeric(FIX), bg = color[gender],
     xlab = "First Component", ylab = "Second Component")
```

## KODAMA Semisupervised Type-I & Type-II

Clustering can be performed on the KODAMA dissimilarity matrix, providing better results than clustering on the Euclidean distance matrix. Here, a heatmap ordered by hierarchical clustering based on the KODAMA dissimilarity matrix is shown. Bar colors indicate the donor.

```r
# KODAMA is also computed on the transposed data to cluster the columns.
kkcol = KODAMA(t(MetRef$data), FUN = PCA.CA.KNN.CV,
               W = function(x) as.numeric(kmeans(x, 50)$cluster))

# hclust requires a dist object, so the dissimilarity matrices are converted.
row_hclust = hclust(as.dist(kk1$dissimilarity), method = "ward")
col_hclust = hclust(as.dist(kkcol$dissimilarity), method = "ward")

classes = unique(donor)
x = MetRef$data

# The intensity range is discretized into 20 levels for the heatmap colors.
vv = sort(c(sort(x[x > 0], decreasing = TRUE)[1 + (0:9) * round(length(x[x > 0]) / 11)],
            0,
            sort(x[x < 0])[1 + (0:9) * round(length(x[x < 0]) / 11)]))
temp = x
for (i in 1:20) temp[x >= vv[i] & x <= vv[i + 1]] = i
x = temp

cool = colorRampPalette(c("#ffff00", "#000000", "#aaffff"))
heatmap(x,
        RowSideColors = rainbow(length(classes))[donor],
        col = cool(20),
        reorderfun = function(d, w) d,
        hclustfun = function(x) if (nrow(as.matrix(x)) == length(donor)) row_hclust else col_hclust,
        labRow = NA, labCol = NA)
```