4 Commits

Author SHA1 Message Date
Pitchaya Boonsarngsuk
e4acdcdae4 Fix figure position 2.... because of LaTeX 2018-03-23 14:03:35 +00:00
Pitchaya Boonsarngsuk
bb01fc8eba Fix figure position 2018-03-23 13:45:16 +00:00
Pitchaya Boonsarngsuk
57af88b97c Minor fixes lol 2018-03-23 13:29:04 +00:00
Pitchaya Boonsarngsuk
4fad3f36ab Revise draft 3 2018-03-22 17:51:56 +00:00
2 changed files with 26 additions and 18 deletions


@@ -254,4 +254,9 @@ doi = "https://doi.org/10.1016/j.neunet.2006.05.014",
 url = "http://www.sciencedirect.com/science/article/pii/S0893608006000724",
 author = "Jarkko Venna and Samuel Kaski",
 keywords = "Information visualization, Manifold extraction, Multi-dimensional scaling (MDS), Nonlinear dimensionality reduction, Non-linear projection, Gene expression"
 }
+@misc{eslint,
+title={ESLint - Pluggable JavaScript linter}, url={https://eslint.org/}, journal={ESLint - Pluggable JavaScript linter}}
+@misc{js-beautify, title={js-beautify}, url={https://www.npmjs.com/package/js-beautify}, journal={npm}}


@@ -157,11 +157,11 @@ Because strain assumes Euclidean distances, making it incompatible with other di
 This project focuses on several non-linear MDS algorithms using force-directed layout. The idea is to attach each pair of data points with a spring whose equilibrium length is proportional to the high-dimensional distance between the two points, although the spring model we know today does not necessarily use Hooke's law to calculate the spring force\cite{Eades}. Several improvements have been introduced to the idea over the past decade. For example, the concept of 'temperature' proposed by Fruchterman and Reingold\cite{SpringTemp} solves the problem where the system is unable to reach an equilibrium state and improves execution time. The project focuses on an iterative spring-model-based algorithm introduced by Chalmers\cite{Algo1996} and the Hybrid approach, which will be detailed in subsequent sections of this chapter.
-There are a number of other non-linear MDS algorithms. t-distributed Stochastic Neighbour Embedding (t-SNE)\cite{tSNE}, for example, is very popular in the field of machine learning. It is based on SNE\cite{SNE} where probability distributions are constructed over each pair of data point in a way that the more similar objects have a higher probability of being picked. The distributions derived from both high-dimensional and low-dimensional distances are compared using the KullbackLeibler divergence, a metric to measure the similarity between two probability distributions. Then, the 2D position of each data points are then iteratively adjusted to maximize the similarity. The biggest downside is that it has both time and memory complexity of $O(N^2)$ per iteration. In 2017, Bartasius\cite{LastYear} implemented t-SNE in D3 and found that not only is it the slowest algorithm in his test, the produced layout is also many times worse in terms of Stress, a metric which will be introduced in section \ref{sec:bg_metrics}. However, comparing the Stress of a t-SNE layout is unfair as t-SNE is designed to optimise the KullbackLeibler divergence and not Stress.
+There are a number of other non-linear MDS algorithms available. t-distributed Stochastic Neighbour Embedding (t-SNE)\cite{tSNE}, for example, is very popular in the field of machine learning. It is based on SNE\cite{SNE} where probability distributions are constructed over each data point in a way that the other similar objects have a higher probability of being picked. The distributions derived from both high-dimensional and low-dimensional distances are compared using the Kullback-Leibler divergence, a metric for measuring the similarity between two probability distributions. Then, the 2D position of each data point is iteratively adjusted to maximise the similarity. The biggest downside is that it has both time and memory complexity of $O(N^2)$ per iteration.
+In 2017, Bartasius\cite{LastYear} implemented t-SNE in D3 and found that not only is it the slowest algorithm in his test, the produced layout is also many times worse in terms of Stress, a metric which will be introduced in section \ref{sec:bg_metrics}. However, comparing the Stress of a t-SNE layout is unfair as t-SNE is designed to optimise the distribution divergence and not Stress.
-Other algorithms use different approaches. Kernel PCA tricks classical MDS (PCA) into being non-linear by using the kernels\cite{kPCA}. Simply put, kernel functions are used to create new dimensions from the existing ones. These kernels can be non-linear. Hence, PCA can use these new dimensions to create a non-linear combination of the original dimensions. The limitation is that the kernels are user-defined; thus, it is up to the user to define good kernels to create a good layout.
+Other algorithms use different approaches. Kernel PCA tricks the classical MDS (PCA) into being non-linear by using the kernels\cite{kPCA}. Simply put, kernel functions are used to create new dimensions from the existing ones. These kernels can be non-linear. Hence, PCA can use these new dimensions to form a non-linear combination of the original dimensions. The limitation is that the kernels are user-defined; thus, it is up to the user to select appropriate kernels to produce a good layout.
-Local MDS\cite{LMDS} performs a different trick on MDS by using MDS in local regions and stitching them together, using convex optimization. While it focuses on Trustworthiness and Continuity, the errors concerning each data points' neighbourhood, its overall layouts fail to form any visible clusters.
+Local MDS\cite{LMDS} performs a different trick on MDS by only using MDS in local regions and stitching each region together, using convex optimisation. While it focuses on Trustworthiness and Continuity, the error metrics concerning each data point's neighbourhood, its overall layouts fail to form any meaningful clusters.
-Sammon's mapping\cite{Sammon}, on the other hand, finds a good position for each data point by using gradient descent to minimise Sammon's error, a function similar to Stress (section \ref{sec:bg_metrics}). However, gradient descent can only find a local minimum and the solution is not guaranteed to converge.
+Sammon's mapping\cite{Sammon}, on the other hand, finds a good position for each data point by using gradient descent to minimise Sammon's error, another function similar to Stress (section \ref{sec:bg_metrics}). However, gradient descent can only find a local minimum and the solution is not guaranteed to ever converge.
 The rest of this chapter will describe each of the algorithms and performance metrics used in this project in detail.
@@ -169,16 +169,16 @@ The rest of this chapter will describes each of the algorithm and performance me
 \label{sec:linkbg}
 The D3 library, which will be described in section \ref{sec:des_d3}, has several different force models implemented for creating a force-directed graph. One of them is Link Force. In this brute-force method, a force is applied between the two nodes at the end of each link. The force pushes the nodes together or apart with varying strength, proportional to the error between the desired and current distance on the graph. Essentially, it is the spring model with a custom spring-force calculation formula. An example of a graph produced by the D3 link force is shown in figure \ref{fig:bg_linkForce}. In MDS where the high-dimensional distance between every pair of nodes can be calculated, a link will be created to represent each pair, resulting in a complete graph.
-\begin{figure}[h]
+The Link Force algorithm is inefficient. In each time step (iteration), a calculation has to be done for each pair of nodes connected with a link. This means that for MDS with $N$ nodes, the algorithm will have to perform $N(N-1)$ force calculations per iteration, essentially $O(N^2)$. It is also believed that the number of iterations required to create a good layout is proportional to the size of the data set, hence the total time complexity of $O(N^3)$.
+The model also caches the desired distance of each link in memory to improve speed across multiple iterations. While this greatly reduces the number of calls to the distance-calculating function, the memory complexity also increases to $O(N^2)$. Because the JavaScript memory heap is limited, it runs out of memory when trying to process a complete graph of more than around three thousand points, depending on the features of the data.
+\begin{figure}[ht]
 \centering
 \includegraphics[height=9.2cm]{d3-samples/d3-force.png}
 \caption{An example of a graph produced by D3 Link Force.}
 \label{fig:bg_linkForce}
 \end{figure}
-The Link Force algorithm is inefficient. In each time step (iteration), a calculation has to be done for each pair of nodes connected with a link. This means that for MDS with $N$ nodes, the algorithm will have to perform $N(N-1)$ force calculations per iteration, essentially $O(N^2)$. It is also believed that the number of iterations required to create a good layout is proportional to the size of the data set, hence the total time complexity of $O(N^3)$.
-The model also caches the desired distance of each link in memory to improve speed across multiple iterations. While this greatly reduces the number of calls to the distance-calculating function, the memory complexity also increases to $O(N^2)$. Because the JavaScript memory heap is limited, it runs out of memory when trying to process a complete graph of more than around three thousand data points, depending on the features of the data.
 \section{Chalmers' 1996 algorithm}
 In 1996, Matthew Chalmers proposed a technique to reduce the time complexity down to $O(N^2)$, which is a massive improvement over link force's $O(N^3)$, potentially at the cost of accuracy. This is done by reducing the number of spring force calculations per iteration, using random samples\cite{Algo1996}.
@@ -215,18 +215,21 @@ Previous evaluations show that this method is faster than the Chalmers' 1996 alg
 \section{Hybrid MDS with Pivot-Based Searching algorithm}
 \label{sec:bg_hybridPivot}
-\begin{wrapfigure}{rh}{0.3\textwidth}
-\centering
-\includegraphics[width=0.3\textwidth]{images/pivotBucketsIllust.png}
-\caption{Diagram of a pivot (dark shaded point) with five buckets, illustrated as discs between dotted circles. Each of the other points in $S$ is classified into buckets by the distances to the pivot.}
-\label{fig:bg_pivotBuckets}
-\end{wrapfigure}
 The bottleneck of the Hybrid Layout Algorithm is the nearest-neighbour searching process during the interpolation. The previous brute-force method results in a time complexity of $O(N\sqrt{N})$. This improvement introduces pivot-based searching to approximate a near-neighbour and reduces the time complexity to $O(N^\frac{5}{4})$\cite{Algo2003}.
 The main improvement is gained by pre-processing the set $S$ ($\sqrt{N}$ samples) so that each of the $N-\sqrt{N}$ other points can find its parent faster. To begin, $k$ points are selected from $S$ as `pivots'. Each pivot $p\in{k}$ has a number of buckets. Every other point in $S-\{p\}$ is assigned a bucket number, based on its distance to $p$, as illustrated in figure \ref{fig:bg_pivotBuckets}.
-To find a parent of an object, a distance calculation is first performed against each pivot to determine which bucket of each pivot the object is in. From this, the content of each bucket is searched for the nearest neighbor.
+To find a parent of an object, a distance calculation is first performed against each pivot to determine which bucket of each pivot the object is in. From this, the content of each bucket is searched for the nearest neighbour.
+\break
+\begin{wrapfigure}{Rh}{0.3\textwidth}
+\vspace{-230pt}
+\centering
+\includegraphics[width=0.28\textwidth]{images/pivotBucketsIllust.png}
+\caption{Diagram of a pivot (dark shaded point) with five buckets, illustrated as discs between dotted circles. Each of the other points in $S$ is classified into buckets by the distances to the pivot.}
+\label{fig:bg_pivotBuckets}
+\end{wrapfigure}
 \begin{algorithmic}
 \item Pre-processing:
@@ -350,7 +353,7 @@ Figure \ref{fig:des_gui} shows the modified GUI used in this project. At the top
 %============================
 \section{Summary}
-In this chapter, several technologies and alternatives were discussed. In the end, the project is set out to reuse Bartasius's repository, using D3.js with standard JavaScript, HTML, CSS and SVG for their learning resources, with ESLint tool to format the JavaScript code.
+In this chapter, several technologies and alternatives were discussed. In the end, the project is set out to reuse Bartasius's repository, running on D3.js with standard JavaScript, HTML, CSS and SVG for their learning resources. The ESLint tool is also set up to format the JavaScript code and check for possible errors.
 %==============================================================================