Course Project
Overview
Each student must complete a course project. There are three types of projects.
-
Research Project. The student must choose a topic in
a current data mining research area, formulate a new research problem and
propose one or more solutions to the problem, or propose new solutions
to an existing research problem. A close interaction with an instructor
is expected. This should usually be a one-person project (a two-person team
may be allowed by the instructors). The final product is a technical report.
-
Implementation Project. The student will apply a data
mining approach to an application area and implement a complete solution.
Substantial programming activity is involved. The project team may have
up to two members. The final product is usually an operational software
package complete with a manual. A demo is required.
-
Data Analysis Project. The student will obtain a source of
data, load the data into a database, and analyze the data using data mining
algorithms. The data should have a significant size (at least 1M numbers
and strings), the data should be from a publicly available source, the
database should have a minimal complexity (at least 5 tables), the database
should be implemented on our DB2 server, and the data mining algorithms
should be from DB2MINER and/or WEKA. Any deviations from these guidelines
must be approved. This can be a one- or two-person project. Ask your instructors
if you need ideas about where to obtain data. The final product is a technical
report.
For all three types of projects, students are required to submit a proposal,
a progress report, and a final product.
Project Team
A typical project team has one or two people.
Grading
Separate grades will be given for different parts of the project. For two-person
teams, both members will be given the same grade.
Description of Research Project
-
Choose a topic in an area of current data mining research. Some suggestions
are given below. You may choose your own topic as long as it is related
to knowledge discovery and data mining.
-
Given the time limit, implementation is not required. You should concentrate
on the conceptual framework and the development of good ideas. Some questions
you need to keep in mind include:
-
What is the current status of research?
-
What is the problem? Why is it interesting/important?
-
What method/technique/approach can be used to solve the problem, and
how?
-
What are existing solutions? Why are they not good/not enough?
-
How good is your solution? Why is your solution better than others?
-
You should work towards convincing arguments/examples/demonstrations that
support your idea/framework/design.
-
Write up a technical paper that clearly describes your ideas, arguments,
and solutions.
Some Suggestions for Projects
-
Predicting prices of initial public offerings. Determine how much
you should pay for an initial public offering on the first day of offering.
See here
for further description. [Data: ipo1,
ipo2].
For more info, talk to Dr. Kwek.
-
[Taken]
Predicting selling price of houses. This is a list describing the
characteristics and prices of the houses sold in Houston in the year 1999.
The task here is to predict the selling price of a house given its description.
There are some issues involving the "distribution" of the instances. If
you are interested in analyzing this data set, please talk to Dr. Kwek
before performing any experiments. This project has the potential of developing
into an interesting M.S. project. Click here
for data.
-
Determining whether the pulse rate of a (single actual live) neuron has
any embedded regular pattern. This is a somewhat difficult problem, as the
techniques described in class do not fit in well in this framework. You
may want to take this on ONLY IF you know principal component analysis
and are comfortable with linear algebra. If you are looking for an idea
to develop into an M.S. thesis, this is a good project. Data: test,
regular,
irregular
-
[Taken] Neuropsychological data. Due to confidentiality, Dr. Kwek
is not posting the data on the web. Briefly, the data contains about 458
instances of neuropsychological data of about 58 numeric measurements and
other attributes such as age, sex, income, race, etc. This project is
essentially a clustering project. This can be extended to an M.S. thesis
if the results look good.
-
Sports statistics. Statistics for several popular sports can be obtained
on the internet, including baseball, basketball,
football,
and hockey. It will take some effort to
download all the web pages for the players and the teams (implement or
find a web robot) and to extract all the information (implement a simple
wrapper). This data is already in tables, so constructing a database should
be straightforward.
-
Weather data. There is an abundance of weather data online. In particular,
the National Climatic Data Center
has some free
datasets online. One of the data mining tasks you might try is to predict
the weather at a given time period from previous time periods. Another
task you might try is clustering to partition a region into different climates.
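As a starting point for the prediction task, a time series can be turned into training instances by pairing each value with the values that precede it. The sketch below is only an illustration: the function name make_windows and the parameter n_lags are hypothetical, and the readings are made-up numbers, not real weather data.

```python
def make_windows(series, n_lags):
    """Turn a time series into (previous n_lags values, next value) pairs.

    Each pair can serve as one training instance: the window is the
    description and the following value is the label to predict.
    """
    return [(series[i - n_lags:i], series[i])
            for i in range(n_lags, len(series))]


# Made-up daily temperatures, for illustration only.
temperatures = [10, 12, 11, 13, 14]
for window, target in make_windows(temperatures, n_lags=2):
    print(window, "->", target)
```

With n_lags=2, the five readings above yield three instances: [10, 12] -> 11, [12, 11] -> 13, and [11, 13] -> 14.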
-
More datasets. There are many large datasets available from the UCI
KDD Archive and from KDnuggets
web site.
Research Projects:
-
Missing Attribute Values[Taken]: Most techniques for dealing with
missing attribute values tend to be very simplistic, such as filling in the
mean value or the most frequently occurring value (mode). Could an EM
(expectation maximization)-like approach provide a better solution? In the
first iteration, we can fill in the missing values with some simple default
values and then build a collection of predictors, one for each attribute and
one for the class label. We then use these predictors to refine (hopefully)
our predictions of the missing values, and hence hopefully increase our
accuracy in predicting the class labels. We repeat this process until the
class label prediction can no longer be improved or starts to deteriorate.
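The loop described above might be sketched as follows. This is a minimal illustration under several assumptions that are not part of the project description: all attributes are numeric, the per-attribute predictor is a simple k-nearest-neighbour average, and the loop stops when the imputed values stop changing rather than when class-label accuracy stops improving. All function names are hypothetical.

```python
def column_means(rows):
    """Mean of the observed (non-None) values in each column."""
    means = []
    for j in range(len(rows[0])):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed))
    return means


def knn_estimate(filled, i, j, k):
    """Predict cell (i, j) as the mean of column j over the k rows
    closest to row i, measuring distance on all other columns."""
    def dist(a, b):
        return sum((a[c] - b[c]) ** 2 for c in range(len(a)) if c != j)

    others = [r for idx, r in enumerate(filled) if idx != i]
    others.sort(key=lambda r: dist(filled[i], r))
    nearest = others[:k]
    return sum(r[j] for r in nearest) / len(nearest)


def iterative_impute(rows, k=2, max_iter=20, tol=1e-6):
    """Fill None cells with column means, then repeatedly re-predict them."""
    missing = [(i, j) for i, r in enumerate(rows)
               for j, v in enumerate(r) if v is None]
    means = column_means(rows)
    # First iteration: simple default values (column means).
    filled = [[means[j] if v is None else v for j, v in enumerate(r)]
              for r in rows]
    for _ in range(max_iter):
        # Re-predict every missing cell from the current filled table.
        updates = {(i, j): knn_estimate(filled, i, j, k) for i, j in missing}
        change = max((abs(updates[c] - filled[c[0]][c[1]]) for c in missing),
                     default=0.0)
        for (i, j), v in updates.items():
            filled[i][j] = v
        if change < tol:  # assumed stopping rule: imputations have settled
            break
    return filled


data = [[1.0, 2.0], [2.0, 4.0], [3.0, None], [4.0, 8.0]]
print(iterative_impute(data))
```

A project along these lines would replace the stopping rule with the one in the description: measure class-label accuracy on held-out data after each pass, and stop when it no longer improves.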
-
Association Rules and Missing Attribute Values: Another way of filling
in missing attribute values is to use association rules. First, we generate
association rules (A = a1) and (B = b1) => (C = C1), .... If the value for
C is missing for a training instance, we can fill it in with C1, but with a
probability computed from the support and confidence. If there are multiple
rules that have consequent C = C1, then we can use the support and confidence
to compute the combined probability. Does this improve the prediction
accuracy? I STRONGLY ENCOURAGE SOMEONE TO TAKE THIS UP.
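One possible reading of this idea is sketched below. The description does not say how support and confidence should be combined, so the rule used here (each matching rule votes with weight support * confidence, and the votes are normalized into a distribution) is purely an assumption, as are the dictionary layout of a rule and all function names.

```python
import random


def matching_rules(instance, rules, target):
    """Rules for `target` whose antecedent is fully satisfied by the instance."""
    return [r for r in rules
            if r["target"] == target
            and all(instance.get(a) == v for a, v in r["antecedent"].items())]


def value_distribution(instance, rules, target):
    """Combine matching rules into a probability distribution over values.

    Assumption: each rule votes with weight support * confidence.
    """
    weights = {}
    for r in matching_rules(instance, rules, target):
        vote = r["support"] * r["confidence"]
        weights[r["value"]] = weights.get(r["value"], 0.0) + vote
    total = sum(weights.values())
    return {v: w / total for v, w in weights.items()} if total > 0 else {}


def fill_missing(instance, rules, target, rng=random):
    """Return a copy of the instance with `target` sampled from the rules."""
    dist = value_distribution(instance, rules, target)
    if instance.get(target) is not None or not dist:
        return dict(instance)  # nothing to fill, or no rule applies
    values = list(dist)
    choice = rng.choices(values, weights=[dist[v] for v in values], k=1)[0]
    filled = dict(instance)
    filled[target] = choice
    return filled


# Hypothetical rules; the support and confidence values are made up.
rules = [
    {"antecedent": {"A": "a1", "B": "b1"}, "target": "C", "value": "C1",
     "support": 0.2, "confidence": 0.9},
    {"antecedent": {"A": "a1"}, "target": "C", "value": "C2",
     "support": 0.1, "confidence": 0.6},
]
print(fill_missing({"A": "a1", "B": "b1", "C": None}, rules, "C"))
```

Sampling from the distribution, rather than always taking the most likely value, follows the "fill it in with C1, but with a probability" wording; a deterministic variant would simply pick the value with the largest weight.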
-
Enhancing Ensemble Methods[Taken]: In ensemble methods, we create
a collection of predictors. Suppose in addition, we create for each predictor
h, a secondary predictor h' that predicts whether h
is likely to predict an unlabeled instance x correctly. How do we
use h'(x) to adjust the weight of h in predicting
x's label? Does this approach help enhance classification accuracy?
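How h'(x) should adjust the weight of h is the open question of this project. The sketch below shows only one simple candidate, assumed here for illustration: multiply each predictor's base weight by h'(x), read as an estimate in [0, 1] of the chance that h is right on x, and take a weighted vote. The function name and the toy predictors are hypothetical.

```python
from collections import defaultdict


def weighted_vote(x, ensemble):
    """Combine predictions using per-instance reliability estimates.

    ensemble: list of (h, h_prime, base_weight) triples, where h(x) returns
    a label and h_prime(x) returns an estimate in [0, 1] of the chance that
    h predicts x correctly. Assumption: the two weights are multiplied.
    """
    scores = defaultdict(float)
    for h, h_prime, base_weight in ensemble:
        scores[h(x)] += base_weight * h_prime(x)
    return max(scores, key=scores.get)


# Toy ensemble: one predictor trusted on this instance, two that are not.
ensemble = [
    (lambda x: "pos" if x > 0 else "neg", lambda x: 0.9, 1.0),
    (lambda x: "neg", lambda x: 0.2, 1.0),
    (lambda x: "neg", lambda x: 0.3, 1.0),
]
print(weighted_vote(1.0, ensemble))
```

In this toy example, a plain majority vote on x = 1.0 would return "neg" (two votes to one), while the reliability-weighted vote returns "pos" (0.9 against 0.5).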
-
Learning Intermediate Concepts[Maybe taken, talk to Dr. Kwek]: Almost all classification
problems investigated assume that each instance x has a single class
label f(x). What if the target concept f is somehow
related to other `intermediate' concepts? How do we incorporate these intermediate
concepts to enhance our learning of the target concept? Here, a labeled
instance x has labels f1(x), ..., fk(x),
f(x), and a test instance does not have any of these labels.
-
Ensemble Methods and Radial Basis Functions (RBF): Currently, the training
of RBF involves two stages. First, the number of hidden units is determined
(quite arbitrarily) and then each hidden unit is assigned a kernel function.
The second stage is to assign weights between the output node and the hidden
layer (a simple gradient descent will do here, assuming that the kernel
functions are chosen correctly). The question here is whether the idea
of boosting provides a better way of training RBF networks.
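For reference, the two-stage training described above might look like the following in its simplest form. This sketch assumes one-dimensional inputs, Gaussian kernels with a shared width, hand-picked centers for stage one, and stochastic gradient descent on squared error for stage two. The function names, the learning rate, and the toy data are illustrative choices, not part of the project description, and no boosting is involved.

```python
import math


def gaussian(x, center, width):
    """Gaussian kernel activation of one hidden unit."""
    return math.exp(-((x - center) ** 2) / (2 * width ** 2))


def hidden_layer(x, centers, width):
    """Bias term plus one Gaussian activation per hidden unit."""
    return [1.0] + [gaussian(x, c, width) for c in centers]


def train_rbf(xs, ys, centers, width, lr=0.1, epochs=2000):
    """Stage two: fit output weights by gradient descent on squared error.

    Stage one (choosing the centers and width) is assumed to be done already.
    """
    weights = [0.0] * (len(centers) + 1)
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            phi = hidden_layer(x, centers, width)
            error = sum(w * p for w, p in zip(weights, phi)) - y
            weights = [w - lr * error * p for w, p in zip(weights, phi)]
    return weights


def predict(x, weights, centers, width):
    """Output of the trained network for input x."""
    phi = hidden_layer(x, centers, width)
    return sum(w * p for w, p in zip(weights, phi))


# Toy data and arbitrarily chosen centers, for illustration only.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.0, 0.8, 0.9, 0.1, -0.8]
centers = [0.0, 2.0, 4.0]
weights = train_rbf(xs, ys, centers, width=1.0)
print([round(predict(x, weights, centers, 1.0), 2) for x in xs])
```

The arbitrary choice of centers in stage one is exactly the step that a boosting-style procedure might replace, for example by adding hidden units one at a time where the current network errs most.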
-
Almost all learning systems assume that the instances in the training data
share the same probability distribution as the test data. Certainly, if
training and test data come from two totally different distributions, then
we cannot learn to predict well. However, if the two distributions share
some commonality, though they are not exactly the same, then we may be
able to exploit this commonality to achieve learning.
Requirements of the Project
-
Proposal.
-
Progress Report.
-
Final Report.