Various data mining algorithms have been applied by astronomers in much the same way as most other techniques in astronomy. However, long-term research and several mining projects have also been undertaken by experts in the field of data mining using astronomical data, because astronomy has produced many large, high-quality datasets amenable to the approach, along with other areas such as medicine and high-energy physics. Examples of such projects include SKICAT, the Sky Image Cataloging and Analysis System, for catalog production and analysis of catalogs from digitized sky surveys, in particular the scans of the second Palomar Observatory Sky Survey; JARTool, the Jet Propulsion Laboratory Adaptive Recognition Tool, used for the recognition of volcanoes in the over 30,000 images of Venus returned by the Magellan mission; the subsequent and more general Diamond Eye; and the Lawrence Livermore National Laboratory Sapphire project.
Object classification

Classification is a crucial preliminary step in the scientific method, as it provides a way of arranging information that can be used to make hypotheses and to compare with models. Two of the most useful concepts in object classification are completeness and efficiency, also known as recall and precision. They are generally defined in terms of true and false positives (TP and FP) and true and false negatives (TN and FN). The completeness is the fraction of objects genuinely of a given type that are classified as that type, and the efficiency is the fraction of objects classified as a given type that are genuinely of that type:
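completeness = TP / (TP + FN),        efficiency = TP / (TP + FP).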
These two quantities are of astrophysical interest because, while one would like both high completeness and high efficiency, there is generally a tradeoff involved. The relative importance of each depends on the application: for instance, a search for rare objects generally requires high completeness while allowing some contamination (lower efficiency), but statistical clustering of cosmological objects requires high efficiency, even at the expense of completeness.
Star-Galaxy Separation

Due to their physical size in comparison to their distance from us, almost all stars are unresolved in photometric datasets and therefore appear as point sources. Galaxies, despite being further away, generally subtend a larger angle and appear as extended sources. However, other astrophysical objects, such as quasars and supernovae, also appear as point sources. Thus, the separation of photometric catalogs into stars and galaxies, or more generally stars, galaxies, and other objects, is an important problem. The number of galaxies and stars in typical surveys (of order 10^8 or more) requires that such separation be automated. The problem is well studied, and automated approaches were employed even before current data mining algorithms became popular, for instance during the digitization by scanning of photographic plates by machines such as the APM and DPOSS. Several data mining algorithms have been applied, including ANN, DT, mixture modeling, and SOM, with most algorithms achieving efficiencies of over 95%. Typically, the classification uses a set of measured morphological parameters derived from the survey photometry, perhaps supplemented by colors or other information such as the seeing. The advantage of the data mining approach is that all such information about each object is easily incorporated.
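As a concrete sketch of such a classifier, the following fragment trains a decision tree, one of the algorithms listed above, on a toy catalog; the two morphological parameters (concentration and ellipticity) and their distributions are invented for illustration rather than taken from any survey, and the final lines report the efficiency and completeness defined earlier.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(0)
n = 5000
# Toy labels: 1 = galaxy, 0 = star.
is_galaxy = rng.integers(0, 2, n)
# Invented morphological parameters: in this toy model galaxies are less
# concentrated and more elliptical than stars.
concentration = rng.normal(2.0 + 0.8 * is_galaxy, 0.3)
ellipticity = rng.normal(0.05 + 0.25 * is_galaxy, 0.10)
X = np.column_stack([concentration, ellipticity])

X_train, X_test, y_train, y_test = train_test_split(X, is_galaxy, random_state=0)
clf = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_train, y_train)
pred = clf.predict(X_test)

# Efficiency and completeness for the galaxy class (precision and recall).
print("efficiency  :", precision_score(y_test, pred))
print("completeness:", recall_score(y_test, pred))
```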
Galaxy Morphology

Galaxies come in a range of sizes and shapes or, more collectively, morphologies. The most well-known system for the morphological classification of galaxies is the Hubble sequence of elliptical, spiral, barred spiral, and irregular, along with various subclasses. This system correlates with many physical properties known to be important in the formation and evolution of galaxies. Because galaxy morphology is a complex phenomenon that correlates with the underlying physics but is not unique to any one given process, the Hubble sequence has endured, despite being rather subjective and based on visible-light morphology originally defined from blue-biased photographic plates. The sequence has been extended in various ways, and for data mining purposes the T system has been used extensively. This system maps the categorical Hubble types E, S0, Sa, Sb, Sc, Sd, and Irr onto the numerical values -5 to 10. One can train a supervised algorithm to assign T types to images for which measured parameters are available. Such parameters can be purely morphological, or can include other information such as color. A series of papers by Lahav and collaborators do exactly this, applying ANNs to predict the T types of galaxies at low redshift and finding accuracy comparable to that of human experts. ANNs have also been applied to higher-redshift data to distinguish between normal and peculiar galaxies, and the fundamentally topological and unsupervised SOM ANN has been used to classify galaxies from Hubble Space Telescope images, where the initial distribution of classes is unknown. Likewise, ANNs have been used to obtain morphological types from galaxy spectra.
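A minimal sketch of this kind of supervised T-type prediction is given below, using a small neural network in the spirit of the ANN studies cited above. The intermediate T-type assignments (only the endpoints -5 and 10 are given in the text), the parameter names, and the synthetic training set are illustrative assumptions, not values from the papers discussed.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# One possible numerical assignment of T types; the endpoints follow the
# text, the intermediate values are assumed for illustration.
T_TYPE = {"E": -5, "S0": -2, "Sa": 1, "Sb": 3, "Sc": 5, "Sd": 7, "Irr": 10}

rng = np.random.default_rng(1)
n = 2000
# Synthetic training set: a latent T type drives two invented parameters.
t_true = rng.uniform(-5, 10, n)
concentration = 3.0 - 0.1 * t_true + rng.normal(0.0, 0.2, n)
color = 1.0 - 0.05 * t_true + rng.normal(0.0, 0.1, n)
X = np.column_stack([concentration, color])

# Train an ANN to regress T type from the measured parameters.
ann = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
ann.fit(X, t_true)
print(ann.predict(X[:3]), t_true[:3])
```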
Photometric redshifts

An area of astrophysics that has greatly increased in popularity in the last few years is the estimation of redshifts from photometric data (photo-zs). This is because, although the distances are less accurate than those obtained from spectra, the sheer number of objects with photometric measurements can often make up for the reduction in individual accuracy by suppressing the statistical noise of an ensemble calculation. The two most common approaches to photo-zs are the template method and the empirical training-set method. The template approach has many difficult issues, including calibration, zero-points, priors, multi-wavelength performance (e.g., poor in the mid-infrared), and difficulty handling missing or incomplete training data. We focus in this review on the empirical approach, as it is an implementation of supervised learning.
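The empirical approach can be sketched as follows: a regression model is trained on objects with known spectroscopic redshifts and then applied to photometry-only objects. The example below uses kNN, one of the methods mentioned below, on synthetic magnitudes; the filter names and the linear color-redshift trend are illustrative only.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(2)
n = 10000
z = rng.uniform(0.0, 0.4, n)  # "spectroscopic" redshifts for train/test
# Invented magnitudes whose colors drift smoothly with redshift, mimicking
# the 4000 A break moving through the filters.
u = 20.0 + 2.5 * z + rng.normal(0.0, 0.05, n)
g = 19.5 + 1.5 * z + rng.normal(0.0, 0.05, n)
r = 19.0 + 0.8 * z + rng.normal(0.0, 0.05, n)
colors = np.column_stack([u - g, g - r])

# Train on objects with known redshifts, predict for the rest.
photoz = KNeighborsRegressor(n_neighbors=10).fit(colors[:8000], z[:8000])
pred = photoz.predict(colors[8000:])
print("rms deviation:", np.sqrt(np.mean((pred - z[8000:]) ** 2)))
```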
3.2.1. Galaxies

At low redshifts, the calculation of photometric redshifts for normal galaxies is quite straightforward due to the break in the typical galaxy spectrum at 4000 Å. Thus, as a galaxy is redshifted with increasing distance, the color (measured as a difference in magnitudes) changes relatively smoothly. As a result, both template and empirical photo-z approaches obtain similar results, a root-mean-square deviation of ~0.02 in redshift, which is close to the best possible result given the intrinsic spread in the properties. This has been shown with ANNs, SVM, DT, kNN, empirical polynomial relations, numerous template-based studies, and several other methods. At higher redshifts, achieving accurate results becomes more difficult because the 4000 Å break is shifted redward of the optical, galaxies are fainter and thus spectral data are sparser, and galaxies intrinsically evolve over time. While supervised learning has been successfully used, beyond the spectral regime the obvious limitation arises that in order to reach the limiting magnitude of the photometric portions of surveys, extrapolation would be required. In this regime, or where only small training sets are available, template-based results can be used, but without spectral information the templates themselves are being extrapolated; however, the extrapolation of the templates is done in a more physically motivated manner. It is likely that the more general hybrid approach of using empirical data to iteratively improve the templates, or the semi-supervised procedure described in, will ultimately provide a more elegant solution. Another issue at higher redshift is that the available numbers of objects can become quite small (in the hundreds or fewer), thus reintroducing the curse of dimensionality through a simple lack of objects in comparison to the number of measured wavebands. Methods of dimension reduction can help to mitigate this effect.
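As an illustration, the sketch below applies principal component analysis, one possible dimension-reduction method, to a synthetic photometric catalog with many bands and few objects; the band count and magnitudes are invented for the example.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
n_objects, n_bands = 300, 11  # few objects relative to many bands
mags = rng.normal(20.0, 1.0, (n_objects, n_bands))  # synthetic photometry

# Project the 11-band measurements onto 3 principal components, reducing
# the dimensionality of the feature space before further analysis.
pca = PCA(n_components=3)
reduced = pca.fit_transform(mags)
print(reduced.shape, pca.explained_variance_ratio_)
```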