\documentstyle[11pt,psfig,epsf]{article}

 \textwidth 7.0in
 \textheight 9.0in
 \oddsidemargin -0.2in
 \evensidemargin -0.2in

 \topmargin -0.5in


\title{ \vspace*{-1.2cm}  \Large\bf
CS 5623 Simulation Techniques \\
Fall 2003\\
{\large How to model data using matlab  }}


\author{Instructor Dr. Turgay Korkmaz}
%\date{}


\begin{document}
\maketitle

This tutorial along with matlab's statistical toolbox manual will
be useful for HW 4.

\section{Data collection}
I got  npd-routes-1 data from {\tt  http://ita.ee.lbl.gov/html/contrib/NPD-Routes.html}\\
Basically, they collected data using {\tt traceroute}.
After un-tar'ing the data, I used
\begin{verbatim}
 > cat r*/* | awk '{if (NF ==8)  print $1, $3, $5, $7 }' > tall
\end{verbatim}
The file {\tt tall} has the following format
\begin{verbatim}
         hopcount time1 time2 time3
\end{verbatim}
where {\tt time1, time2, and time3} denote the round-trip-times
between the source and the router which is {\tt hopcount} hops
away from the source node.

After cleaning some invalid rows manually, I selected the rows
with hopcount=15 and saved them into {\tt t15} using
\begin{verbatim}
 > cat tall | awk '{if (($1==15))  print $1, $2, $3, $4 }' > t15
\end{verbatim}

Now we have some data to analyze!

We can load our data in matlab as follows
\begin{verbatim}
 load t15
\end{verbatim}

Assume that we consider only the {\tt time1} data which can be
accessed using t15(:,2)

We can compute sample mean, variance, median, min, max for {\tt time1} data
\begin{verbatim}
 min(t15(:,2))
 max(t15(:,2))
 median(t15(:,2))
 mean(t15(:,2))
 var(t15(:,2))
\end{verbatim}

To get help about any matlab function use
\begin{verbatim}
 help func-name
 help mean
\end{verbatim}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\newpage
\section{Identify the prob dist family}
In this part, we need to have a frequency or histogram diagram.
For this, matlab has a function called {\tt hist} which (by
default) puts data points into 10 equally spaced intervals.
\begin{verbatim}
     hist(t15(:,2));
     ylabel('frequency');
     xlabel('intervals: values of data points');
\end{verbatim}
These three commands will give the histogram in Figure~\ref{fig:fig1}.
%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure}[htbp]
 \centerline{
 \hbox{\psfig{figure=fig/fig1.eps,height=4in}}} % ,width=7.0in}}}
\caption{Histogram within 10 intervals (default).}
\label{fig:fig1}
\end{figure}
%%%%%%%%%%%%%%%
This  histogram shows that the distribution of our data looks like
an exponential distribution, right! But this will not be a good
fit because default {\tt hist} hides  so much valuable data.

Remember the recommended number of intervals is $\sqrt{n}$, where
$n$ is the number of data points. To determine the number of data
points and the number of intervals, we can use the followings in
matlab:
\begin{verbatim}
    n=size(t15(:,2));
    ninterval=ceil(sqrt(n(1)));
\end{verbatim}
Then, we can draw the histogram using
\begin{verbatim}
    hist(t15(:,2),ninterval);
\end{verbatim}
These three commands will give the histogram in Figure~\ref{fig:fig2}.
%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure}[htbp]
 \centerline{
 \hbox{\psfig{figure=fig/fig2.eps,height=4in}}} % ,width=7.0in}}}
\caption{ Histogram within $\sqrt{n}$ intervals (recommended).}
\label{fig:fig2}
\end{figure}
%%%%%%%%%%%%%%%

Now if you run {\tt disttool} in matlab  and consider various
distributions and their pdfs, you will see that histogram looks
like gamma (Weibull also seems good), as shown in Figure~\ref{fig:fig3}.
%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure}[htbp]
 \centerline{
 \hbox{\psfig{figure=fig/fig3.eps,height=4in}}} % ,width=7.0in}}}
\caption{The {\tt disttool} showing the pdf of gamma distribution.}
\label{fig:fig3}
\end{figure}
%%%%%%%%%%%%%%%
Accordingly, we can select family of gamma dist in this step as the best
possible model for our data.

Actually, matlab also provides another nice function to make the
first guess accurately. For example, we can also draw the
empirical pdf to have better idea in deciding which distribution
family to choose. For this, we can use {\tt ksdensity} function as
follows.
\begin{verbatim}
     [f,x, u ]= ksdensity(t15(:,2));
     plot(x,f,'r')
     hold on
     [f,x ] = ksdensity(t15(:,2),'width',2*u);
     plot(x,f,'b')
     [f,x ] = ksdensity(t15(:,2),'width',4*u);
     plot(x,f,'g')
     [f,x ] = ksdensity(t15(:,2),'width',6*u);
     plot(x,f,'y')
\end{verbatim}
This function produces the smoothed forms of the empirical
distribution, as shown in Figure~\ref{fig:fig4}.
%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure}[htbp]
 \centerline{
 \hbox{\psfig{figure=fig/fig4.eps,height=3in}}} % ,width=7.0in}}}
\caption{Empirical distribution.}
\label{fig:fig4}
\end{figure}
%%%%%%%%%%%%%%%
This figure also shows that the empirical pdf of our data looks
like gamma distribution, right!.

So, we assume that our data has gamma distribution.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\newpage
\section{Determine parameters for the chosen distribution}

Matlab has several functions to estimate parameters of several distributions
and the 95\% confidence intervals for those estimations.

For our data,
        if we chose exp, we would use {\tt [muhat, cihat]=expfit(t15(:,2))}

But, since we selected gamma, we will use
\begin{verbatim}
     [phat, pcihat]=gamfit(t15(:,2))
phat =
    2.1757   87.6674   % means A and B of gamma distribution, respectively.

pcihat =
    2.0682   84.9318
    2.2833   90.4030
\end{verbatim}


So, we assume that our data has gamma distribution with \\
A($\theta$ in our book)=phat(1)=2.1757  and \\
B($\beta$ in our book)=phat(2)=87.6674).

Theoretical pdf of gamma distribution with these parameters can be
drawn as follows
\begin{verbatim}
     x=0:0.1:2500;
     y = gampdf(x,phat(1), phat(2));
     plot(x,y)
\end{verbatim}
As a result, we get the pdf of gamma with the given parameters  as
shown in Figure~\ref{fig:fig5}.
%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure}[htbp]
 \centerline{
 \hbox{\psfig{figure=fig/fig5.eps,height=4in}}} % ,width=7.0in}}}
\caption{Theoretical pdf of gamma distribution.}
\label{fig:fig5}
\end{figure}
%%%%%%%%%%%%%%%

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\newpage
\section{Evaluate how good the chosen distribution}

In matlab, we can do several graphical comparisons between the
theoretically assumed distribution and the empirical distribution.

So let's first do graphical comparisons. We can then talk about
formal statistical tests.

%%%%%%%%%%%%
\subsection{Graphical (heuristic) tests}
Using matlab, we can  draw the theoretical pdf and cdf of gamma dist
 with estimated parameters A=phat(1) and B=phat(2). We can also draw
empirical pdf and cdf. We can then compare them, respectively.

To compare pdfs, we can use
\begin{verbatim}
     [f,x,u ]= ksdensity(t15(:,2));
     [f,x ]= ksdensity(t15(:,2),'width',5*u);
     plot(x,f,'r')
     hold on
     x=0:0.1:2000;
     y = gampdf(x,phat(1), phat(2));
     plot(x,y)
\end{verbatim}
%%%%%%%%%%%%%%%%%%%%%%%
As a result, we will get Figure~\ref{fig:fig6a}.
\begin{figure}[htbp]
 \centerline{
 \hbox{\psfig{figure=fig/fig6a.eps,height=4in}}} % ,width=7.0in}}}
\caption{Compare theoretical and empirical pdfs in case of gamma
distribution.} \label{fig:fig6a}
\end{figure}


To compare cdfs, we can use
\begin{verbatim}
     hold off
     [F,x]=ecdf(t15(:,2));
     plot(x,F,'r');
     hold on
     x=0:0.1:2000;
     y=gamcdf(x,phat(1), phat(2));
     plot(x,y)
\end{verbatim}
As a result, we will get Figure~\ref{fig:fig6b}.
\begin{figure}[htbp]
 \centerline{
 \hbox{\psfig{figure=fig/fig6b.eps,height=4in}}} % ,width=7.0in}}}
\caption{Compare theoretical and empirical cdfs in case of gamma
distribution.} \label{fig:fig6b}
\end{figure}
%%%%%%%%%%%%%%%

How do they look? They seem similar, right! But we cannot trust
these similarities. So we need to look at the qqplot, ppplot,
scater, to check if the assumed and empirical distributions are
really similar

To draw qqplot, you need to use followings.
\begin{verbatim}
    clear
    load t15
    n=size(t15(:,2));
    ninterval=ceil(sqrt(n(1)));
    [phat, pcihat]=gamfit(t15(:,2))

    x=[];
    a=sort(t15(:,2));
    prev=-1;
    j=0;
    for i=1:n(1),
      if prev~=a(i)
        j = j+1;
        x(j) = a(i);
        prev=a(i);
      end
    end
    for i=1:j,
      y(i)=gaminv(((i-.5)/j), phat(1), phat(2));
    end
    plot(x,y,'.');
    hold on
\end{verbatim}

The above procedure might take too much time.
Another simple approach could be the following.
Simply, generate some random data from the given theoretical gamma
distribution and then use the {\tt qqplot} in matlab to draw the qqplot
between the randomly generated data and the empirical data.
\begin{verbatim}
    y = gamrnd(phat(1), phat(2),1,n(1));
    qqplot(t15(:,2),y);
    legend('Actual procedure','Simple procedure');
\end{verbatim}
As a result, we get approximately the same qqplot when we use the
actual and simple approach, as shown in Figure~\ref{fig:fig6}.
%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure}[htbp]
 \centerline{
 \hbox{\psfig{figure=fig/fig6.eps,height=4in}}} % ,width=7.0in}}}
\caption{qqplot using actual and a simple procedures.}
\label{fig:fig6}
\end{figure}
%%%%%%%%%%%%%%%
As shown in qqplot, the assumed gamma is not a good fit at the tail
of the distribution while it is a good fit in the main part of the
distribution.


We can also look at the pp-plot. For pp-plot we can use the followings
\begin{verbatim}
     [eF,x]=ecdf(t15(:,2));
     tF =  gamcdf(x, phat(1), phat(2));
     plot(eF,tF,'.');
\end{verbatim}
As a result, we get the pp-plot
in Figure~\ref{fig:fig7}.
%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure}[htbp]
 \centerline{
 \hbox{\psfig{figure=fig/fig7.eps,height=4in}}} % ,width=7.0in}}}
\caption{pp-plot.}
\label{fig:fig7}
\end{figure}
%%%%%%%%%%%%%%%

pp-plot also shows that gamma is a better fit in the middle of
dist than the tail. But still the fit is not exact.


At this point, there is no need to use chi-square or KS
statistical tests. We need to consider another distribution like
Weibull or simply use empirical distribution.

\subsection{Goodness-of-fit tests}
  left as an exercise.

\section{Learn by doing it}

Download {\tt tall} and then do the followings.

1. For the same data used in this tutorial, consider the Weibull
distribution and repeat the above operations.

2. Consider {\tt time1} data when hopcount=10 and repeat the above
operations.


\end{document}
