2 Commits

Author  SHA1  Message  Date
Pitchaya Boonsarngsuk  bb01fc8eba  Fixed figure position  2018-03-23 13:45:16 +00:00
Pitchaya Boonsarngsuk  57af88b97c  Minor fixes, haha  2018-03-23 13:29:04 +00:00


@@ -157,11 +157,11 @@ Because strain assumes Euclidean distances, making it incompatible with other di
This project focuses on several non-linear MDS algorithms using force-directed layout. The idea is to attach each pair of data points with a spring whose equilibrium length is proportional to the high-dimensional distance between the two points, although the spring model we know today does not necessarily use Hooke's law to calculate the spring force\cite{Eades}. Several improvements have been introduced to the idea over the past decades. For example, the concept of `temperature' proposed by Fruchterman and Reingold\cite{SpringTemp} solves the problem where the system is unable to reach an equilibrium state and improves execution time. The project focuses on an iterative spring-model-based algorithm introduced by Chalmers\cite{Algo1996} and the Hybrid approach, which will be detailed in subsequent sections of this chapter.
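As a concrete illustration, here is a minimal sketch (in TypeScript) of a single spring-model iteration; all names are illustrative, and the Hooke's-law-style force and simple displacement cap are simplifying assumptions rather than the exact formulation of the cited algorithms.
\begin{verbatim}
// Minimal sketch of one spring-model iteration.
// pos: current 2D layout; hiDist(i, j): precomputed high-dimensional distance.
type Point = { x: number; y: number };

function springIteration(pos: Point[],
                         hiDist: (i: number, j: number) => number,
                         temperature: number): void {
  const n = pos.length;
  const force: Point[] = pos.map(() => ({ x: 0, y: 0 }));
  for (let i = 0; i < n; i++) {
    for (let j = i + 1; j < n; j++) {
      const dx = pos[j].x - pos[i].x;
      const dy = pos[j].y - pos[i].y;
      const lowDist = Math.sqrt(dx * dx + dy * dy) || 1e-9;
      // Spring force proportional to how far the 2D distance is from
      // its equilibrium length (the high-dimensional distance).
      const f = (lowDist - hiDist(i, j)) / lowDist;
      force[i].x += f * dx; force[i].y += f * dy;
      force[j].x -= f * dx; force[j].y -= f * dy;
    }
  }
  // The 'temperature' caps the displacement per iteration, in the spirit
  // of Fruchterman and Reingold, so the system can settle.
  for (let i = 0; i < n; i++) {
    const len = Math.sqrt(force[i].x ** 2 + force[i].y ** 2) || 1e-9;
    const step = Math.min(len, temperature);
    pos[i].x += (force[i].x / len) * step;
    pos[i].y += (force[i].y / len) * step;
  }
}
\end{verbatim}
Note that the all-pairs loop above costs $O(N^2)$ per iteration, which is why iterative algorithms such as Chalmers' restrict the set of pairs considered in each pass.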
There are a number of other non-linear MDS algorithms available. t-distributed Stochastic Neighbour Embedding (t-SNE)\cite{tSNE}, for example, is very popular in the field of machine learning. It is based on SNE\cite{SNE}, where a probability distribution is constructed over each pair of data points in a way that more similar objects have a higher probability of being picked. The distributions derived from the high-dimensional and low-dimensional distances are compared using the Kullback-Leibler divergence, a metric for measuring the similarity between two probability distributions. The 2D position of each data point is then iteratively adjusted to maximise the similarity. The biggest downside is that it has both time and memory complexity of $O(N^2)$ per iteration. In 2017, Bartasius\cite{LastYear} implemented t-SNE in D3 and found that not only is it the slowest algorithm in his test, the produced layout is also many times worse in terms of Stress, a metric which will be introduced in section \ref{sec:bg_metrics}. However, comparing the Stress of a t-SNE layout is unfair, as t-SNE is designed to optimise the Kullback-Leibler divergence and not Stress.
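To make the objective concrete, the sketch below computes the Kullback-Leibler divergence between two discrete probability distributions; the function name and array representation are illustrative.
\begin{verbatim}
// Kullback-Leibler divergence KL(P || Q) between two discrete
// probability distributions over the same support. t-SNE iteratively
// adjusts the 2D positions to reduce this quantity between the
// high- and low-dimensional neighbour distributions.
function klDivergence(p: number[], q: number[]): number {
  let kl = 0;
  for (let i = 0; i < p.length; i++) {
    // Terms with p[i] == 0 contribute nothing by convention.
    if (p[i] > 0 && q[i] > 0) {
      kl += p[i] * Math.log(p[i] / q[i]);
    }
  }
  return kl;
}
\end{verbatim}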
Other algorithms use different approaches. Kernel PCA tricks classical MDS (PCA) into being non-linear by using kernels\cite{kPCA}. Simply put, kernel functions are used to create new dimensions from the existing ones. These kernels can be non-linear; hence, PCA can use these new dimensions to form a non-linear combination of the original dimensions. The limitation is that the kernels are user-defined; thus, it is up to the user to select appropriate kernels to produce a good layout.
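For illustration, two commonly used kernels, polynomial and Gaussian, are sketched below; the names and parameters are illustrative and not taken from the cited work.
\begin{verbatim}
// A kernel maps a pair of high-dimensional points to a similarity
// score; the kernel matrix built from all pairs replaces the plain
// inner products used by classical PCA.
type Vec = number[];

const dot = (x: Vec, y: Vec): number =>
  x.reduce((sum, xi, i) => sum + xi * y[i], 0);

// Polynomial kernel of degree d: an implicit non-linear combination
// of the original dimensions.
const polyKernel = (d: number) => (x: Vec, y: Vec): number =>
  Math.pow(dot(x, y) + 1, d);

// Gaussian (RBF) kernel with bandwidth sigma.
const rbfKernel = (sigma: number) => (x: Vec, y: Vec): number => {
  const sqDist = x.reduce((sum, xi, i) => sum + (xi - y[i]) ** 2, 0);
  return Math.exp(-sqDist / (2 * sigma * sigma));
};
\end{verbatim}
The choice of kernel and its parameters (the degree or bandwidth above) is exactly the user-defined part that determines the quality of the resulting layout.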
Local MDS\cite{LMDS} performs a different trick on MDS by applying MDS only in local regions and stitching the regions together using convex optimisation. While it focuses on Trustworthiness and Continuity, the error metrics concerning each data point's neighbourhood, its overall layouts fail to form any meaningful clusters.
Sammon's mapping\cite{Sammon}, on the other hand, finds a good position for each data point by using gradient descent to minimise Sammon's error, another function similar to Stress (section \ref{sec:bg_metrics}). However, gradient descent can only find a local minimum, and the solution is not guaranteed to converge.
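As a sketch of what gradient descent is minimising here, Sammon's error can be computed from precomputed distance matrices as follows (the representation is illustrative).
\begin{verbatim}
// Sammon's error: a Stress-like sum of squared differences between
// high-dimensional (hi) and 2D (lo) distances, where each term is
// weighted by 1/hi[i][j] so that small, local distances dominate.
function sammonError(hi: number[][], lo: number[][]): number {
  let scale = 0;
  let error = 0;
  const n = hi.length;
  for (let i = 0; i < n; i++) {
    for (let j = i + 1; j < n; j++) {
      scale += hi[i][j];
      error += ((hi[i][j] - lo[i][j]) ** 2) / hi[i][j];
    }
  }
  return error / scale;
}
\end{verbatim}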
The rest of this chapter describes each of the algorithms and the performance metrics used in this project in detail.
@@ -215,18 +215,21 @@ Previous evaluations show that this method is faster than the Chalmers' 1996 alg
\section{Hybrid MDS with Pivot-Based Searching algorithm}
\label{sec:bg_hybridPivot}
The bottleneck of the Hybrid Layout Algorithm is the nearest-neighbour searching process during the interpolation. The previous brute-force method results in a time complexity of $O(N\sqrt{N})$. This improvement introduces pivot-based searching to approximate a near neighbour, reducing the time complexity to $O(N^{\frac{5}{4}})$\cite{Algo2003}.
The main improvement is gained by pre-processing the set $S$ ($\sqrt{N}$ samples) so that each of the $N-\sqrt{N}$ remaining points can find its parent faster. To begin, $k$ points are selected from $S$ as `pivots'. Each of the $k$ pivots $p$ has a number of buckets. Every other point in $S-\{p\}$ is assigned a bucket number based on its distance to $p$, as illustrated in figure \ref{fig:bg_pivotBuckets}.
To find the parent of an object, a distance calculation is first performed against each pivot to determine which of each pivot's buckets the object falls into. The contents of these buckets are then searched for the nearest neighbour.
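A minimal sketch of this bucket scheme follows; the fixed band width, the known maximum distance, and all names are assumptions made for illustration and may differ from the cited algorithm's exact bucketing.
\begin{verbatim}
type Vec = number[];
type Dist = (a: Vec, b: Vec) => number;

// Pre-processing: for each pivot, partition the indices of S into
// numBuckets concentric distance bands (buckets) around that pivot.
function buildBuckets(S: Vec[], pivots: Vec[], numBuckets: number,
                      dist: Dist, maxDist: number): number[][][] {
  const bandWidth = maxDist / numBuckets;
  return pivots.map(pivot => {
    const buckets: number[][] =
      Array.from({ length: numBuckets }, () => []);
    S.forEach((point, i) => {
      const b = Math.min(numBuckets - 1,
                         Math.floor(dist(point, pivot) / bandWidth));
      buckets[b].push(i);
    });
    return buckets;
  });
}

// Query: approximate the nearest neighbour of q in S by searching,
// for each pivot, only the bucket that q itself falls into.
function approxNearest(q: Vec, S: Vec[], pivots: Vec[],
                       buckets: number[][][], numBuckets: number,
                       dist: Dist, maxDist: number): number {
  const bandWidth = maxDist / numBuckets;
  let best = -1;
  let bestDist = Infinity;
  pivots.forEach((pivot, p) => {
    const b = Math.min(numBuckets - 1,
                       Math.floor(dist(q, pivot) / bandWidth));
    for (const i of buckets[p][b]) {
      const d = dist(q, S[i]);
      if (d < bestDist) { bestDist = d; best = i; }
    }
  });
  return best; // index into S; -1 if every searched bucket was empty
}
\end{verbatim}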
\break
\begin{wrapfigure}{R}{0.3\textwidth}
\vspace{-230pt}
\centering
\includegraphics[width=0.28\textwidth]{images/pivotBucketsIllust.png}
\caption{Diagram of a pivot (dark shaded point) with five buckets, illustrated as discs between dotted circles. Each of the other points in $S$ is classified into a bucket by its distance to the pivot.}
\label{fig:bg_pivotBuckets}
\end{wrapfigure}
\begin{algorithmic}
\item Pre-processing: