Fix draft 2
50
l4proj.bib
@@ -204,4 +204,54 @@ publisher={Springer},
author={Borg, Ingwer and Groenen, Patrick J. F.},
year={1997},
pages={207--212},
}

@article{kPCA,
author = {Bernhard Schölkopf and Alexander Smola and Klaus-Robert Müller},
title = {Nonlinear Component Analysis as a Kernel Eigenvalue Problem},
journal = {Neural Computation},
volume = {10},
number = {5},
pages = {1299--1319},
year = {1998},
doi = {10.1162/089976698300017467},
url = {https://doi.org/10.1162/089976698300017467},
abstract = {A new method for performing a nonlinear form of principal component analysis is proposed. By the use of integral operator kernel functions, one can efficiently compute principal components in high-dimensional feature spaces, related to input space by some nonlinear map—for instance, the space of all possible five-pixel products in 16 × 16 images. We give the derivation of the method and present experimental results on polynomial feature extraction for pattern recognition.}
}

@article{Sammon,
author={J. W. Sammon},
journal={IEEE Transactions on Computers},
title={A Nonlinear Mapping for Data Structure Analysis},
year={1969},
volume={C-18},
number={5},
pages={401--409},
keywords={Clustering, dimensionality reduction, mappings, multidimensional scaling, multivariate data analysis, nonparametric, pattern recognition, statistics},
doi={10.1109/T-C.1969.222678},
issn={0018-9340},
month={May},
}

@article{LMDS,
title = {Local multidimensional scaling},
journal = {Neural Networks},
volume = {19},
number = {6},
pages = {889--899},
year = {2006},
note = {Advances in Self Organising Maps - WSOM’05},
issn = {0893-6080},
doi = {10.1016/j.neunet.2006.05.014},
url = {http://www.sciencedirect.com/science/article/pii/S0893608006000724},
author = {Jarkko Venna and Samuel Kaski},
keywords = {Information visualization, Manifold extraction, Multi-dimensional scaling (MDS), Nonlinear dimensionality reduction, Non-linear projection, Gene expression}
}

12
l4proj.tex
@@ -152,13 +152,17 @@ With the emergence of more complex data, each having many features, the need of
\end{figure}

Unlike the two previously mentioned techniques, multidimensional scaling (MDS) takes another approach, aiming to reduce data dimensionality by preserving the level of similarity between points rather than their values.

Classical MDS (also known as Principal Coordinates Analysis or PCoA)\cite{cMDS} achieves this goal by creating new dimensions for scatter-plotting, each made up of a linear combination of the original dimensions, while minimising a loss function called strain. For simple cases, it can be thought of as finding a camera angle from which to project the high-dimensional scatterplot onto a 2D image.

Because strain assumes Euclidean distances, it is incompatible with other dissimilarity ratings. Metric MDS improves upon classical MDS by generalising the solution to support a variety of loss functions\cite{mcMDS}. However, the disadvantage of $O(N^3)$ time complexity remains, and a linear combination may not be expressive enough for some data sets.

This project focuses on several non-linear MDS algorithms using force-directed layout. The idea is to attach each pair of data points with a spring whose equilibrium length is proportional to the high-dimensional distance between the two points, although the spring model we know today does not necessarily use Hooke's law to calculate the spring force\cite{Eades}. Several improvements have been introduced to the idea over the past decades. For example, the concept of 'temperature' proposed by Fruchterman and Reingold\cite{SpringTemp} solves the problem where the system is unable to reach an equilibrium state, and improves execution time. The project focuses on an iterative spring-model-based algorithm introduced by Chalmers\cite{Algo1996} and the Hybrid approach, both of which will be detailed in subsequent sections of this chapter.

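The basic pairwise spring idea can be sketched as a naive $O(N^2)$-per-iteration toy. This is an illustration only, not the project's D3 implementation or Chalmers' algorithm; the function names, step size, and iteration count are all assumptions made for the sketch:

```python
import math
import random

def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def spring_layout(data, iterations=200, step=0.05):
    """Naive spring-model MDS: every pair of points is joined by a spring
    whose rest length is their high-dimensional distance; each iteration
    nudges points along the net spring force."""
    random.seed(0)
    pos = [[random.random(), random.random()] for _ in data]
    n = len(data)
    hd = [[dist(data[i], data[j]) for j in range(n)] for i in range(n)]
    for _ in range(iterations):
        for i in range(n):
            fx = fy = 0.0
            for j in range(n):
                if i == j:
                    continue
                d = dist(pos[i], pos[j]) or 1e-9
                # Hooke-style: stretched springs (d > rest) pull i toward j,
                # compressed springs push it away.
                f = (d - hd[i][j]) / d
                fx += f * (pos[j][0] - pos[i][0])
                fy += f * (pos[j][1] - pos[i][1])
            pos[i][0] += step * fx
            pos[i][1] += step * fy
    return pos

def stress(data, pos):
    """Sum of squared differences between high- and low-dimensional distances."""
    n = len(data)
    return sum((dist(data[i], data[j]) - dist(pos[i], pos[j])) ** 2
               for i in range(n) for j in range(i + 1, n))
```

Running the layout on a small data set should reduce the Stress-like quantity relative to the random starting positions.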
There are a number of other non-linear MDS algorithms. t-distributed Stochastic Neighbour Embedding (t-SNE)\cite{tSNE}, for example, is very popular in the field of machine learning. It is based on SNE\cite{SNE}, where a probability distribution is constructed over pairs of data points such that more similar objects have a higher probability of being picked. The distributions derived from the high-dimensional and low-dimensional distances are compared using the Kullback–Leibler divergence, a metric measuring the similarity between two probability distributions. The 2D positions of the data points are then iteratively adjusted to maximise the similarity. The biggest downside is that it has both time and memory complexity of $O(N^2)$ per iteration. In 2017, Bartasius\cite{LastYear} implemented t-SNE in D3 and found that not only is it the slowest algorithm in his test, the produced layout is also many times worse in terms of Stress, a metric which will be introduced in section \ref{sec:bg_metrics}. However, comparing the Stress of a t-SNE layout is unfair, as t-SNE is designed to optimise the Kullback–Leibler divergence and not Stress.

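For discrete distributions, the Kullback–Leibler divergence is straightforward to compute; a minimal sketch (the function name is an assumption, and this is the textbook definition rather than the project's code):

```python
import math

def kl_divergence(p, q):
    """KL divergence D(P || Q) between two discrete distributions given as
    lists of probabilities. It is zero only when the distributions are
    identical and grows as they diverge, which is the quantity t-SNE
    minimises between the high- and low-dimensional similarity
    distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

Identical distributions give a divergence of zero, while any mismatch gives a positive value.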
Other algorithms use different approaches. Kernel PCA tricks classical MDS (PCA) into being non-linear by using kernels\cite{kPCA}. Simply put, kernel functions are used to create new dimensions from the existing ones. These kernels can be non-linear, hence PCA can use the new dimensions to create a non-linear combination of the original dimensions. The limitation is that the kernels are user-defined; thus, it is up to the user to define good kernels to create a good layout.

Local MDS\cite{LMDS} performs a different trick on MDS: it applies MDS in local regions and stitches them together using convex optimisation. While it focuses on Trustworthiness and Continuity, the errors concerning each data point's neighbourhood, its overall layouts fail to form any visible clusters.

Sammon's mapping\cite{Sammon}, on the other hand, finds a good position for each data point by using gradient descent to minimise Sammon's error, a function similar to Stress (section \ref{sec:bg_metrics}). However, gradient descent can only find a local minimum, and the solution is not guaranteed to converge.

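Sammon's error, as commonly stated, is like Stress but weights each pair's squared distance difference by the reciprocal of the original distance, emphasising small (local) distances. A sketch of the error function only (names assumed; the gradient descent step is omitted):

```python
import math

def euclid(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def sammon_error(data, pos):
    """Sammon's error between high-dimensional points `data` and their 2D
    positions `pos`: (1 / sum d*_ij) * sum (d*_ij - d_ij)^2 / d*_ij,
    where d* is the original distance and d the layout distance."""
    pairs = [(i, j) for i in range(len(data)) for j in range(i + 1, len(data))]
    total = sum(euclid(data[i], data[j]) for i, j in pairs)
    return sum((euclid(data[i], data[j]) - euclid(pos[i], pos[j])) ** 2
               / euclid(data[i], data[j])
               for i, j in pairs) / total
```

A layout that reproduces every original distance exactly scores zero.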
The rest of this chapter describes each of the algorithms and the performance metrics used in this project in detail.

\section{Link Force}
@@ -342,7 +346,7 @@ Figure \ref{fig:des_gui} shows the modified GUI used in this project. At the top
%============================

\section{Summary}
In this chapter, several technologies and alternatives were discussed. In the end, the project set out to build on Bartasius's repository, using D3.js with standard JavaScript, HTML, CSS and SVG, all of which have plentiful learning resources.

%==============================================================================
@@ -355,9 +359,9 @@ In this chapter, several technologies and alternatives were discussed. In the en
\label{ch:imp}
\section{Outline}
The D3 library is modular, and D3-force is the most relevant module for this project. It provides a simplified Simulation object to simulate various physical force calculations. Each Simulation contains a list of nodes and a list of Force objects. Interfaces were defined to allow each Force to access the node list. To keep track of positions, each node is assigned values representing its current location and velocity vector. These values can then be used by the application to draw a graph. In each constant unit time step (iteration), the Simulation triggers a function in each Force object, allowing it to calculate and add values to each particle's velocity vector, which the Simulation then adds to the particle's position. For MDS, each data point is represented as a particle in the simulation.

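The tick cycle described above can be mirrored in a minimal sketch. This is a Python analogue of the described control flow, not D3's actual API; every class and method name here is an assumption made for illustration:

```python
class Simulation:
    """Minimal analogue of the described loop: on each tick, every Force
    adds to node velocities, then velocities are added to positions."""

    def __init__(self, nodes):
        self.nodes = nodes        # each node: dict with x, y, vx, vy
        self.forces = []

    def add_force(self, force):
        force.initialize(self.nodes)  # give the force access to the node list
        self.forces.append(force)

    def tick(self):
        for force in self.forces:
            force.apply()             # each force adds to node velocities
        for n in self.nodes:          # simulation adds velocity to position
            n["x"] += n["vx"]
            n["y"] += n["vy"]


class CentringForce:
    """Toy force pulling every particle toward the origin."""

    def initialize(self, nodes):
        self.nodes = nodes

    def apply(self):
        for n in self.nodes:
            n["vx"] += -0.1 * n["x"]
            n["vy"] += -0.1 * n["y"]
```

A spring force for MDS would follow the same shape, adding per-pair contributions to the velocities inside `apply`.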
Because D3 is a library to be built into other web applications, the algorithms implemented cannot be used on their own. Fortunately, as part of Bartasius' level 4 project in 2017, a web application for testing and evaluation had already been created, with a graphical user interface designed to allow the user to easily select an algorithm, data set, and parameter values. Various distance functions, including one specifically created for the Poker Hands data set\cite{UCL_Data} which will be used for evaluation (section \ref{sec:EvalDataSet}), are also in place and fully functional.

The CSV-formatted data file can be loaded locally. It is then parsed by the Papa Parse JavaScript library\cite{PapaParse} and loaded into the Simulation.

Depending on the distance function, per-dimension mean, variance, and other attributes may also be calculated. These values are used in several general distance functions to scale the values of each feature. The D3 simulation layout is shown on an SVG canvas with zoom functionality to allow graph investigation. The distance function scaling was tweaked to only affect rendering and not the force calculation.