Department of Computer Science
University of Texas at San Antonio
Schedule:
Abstract:
Modern machine learning techniques provide robust approaches for data-driven modeling and critical information extraction, while human experts hold the advantage of possessing high-level intelligence and domain-specific expertise. We combine the power of the two for anomaly detection in GPS data by integrating them through a visualization and human-computer interaction interface. In this paper we introduce GPSvas (GPS Visual Analytics System), a system that detects anomalies in GPS data using a visual analytics approach: a conditional random field (CRF) model serves as the machine learning component for anomaly detection in streaming GPS traces. A visualization component and an interactive user interface visualize the data stream and display significant analysis results (i.e., anomalies or uncertain predictions) and hidden information extracted by the anomaly detection model, which enables human experts to observe real-time data behavior and gain insights into the data flow. Human experts further provide guidance to the machine learning model through the interaction tools; the learning model is then incrementally improved through an active learning procedure.
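The step of surfacing uncertain predictions to a human expert can be illustrated with a minimal sketch. It assumes the model exposes a posterior distribution over labels for each GPS point; the function names and the example posteriors are hypothetical and are not taken from GPSvas.

```python
import math

def entropy(probs):
    """Shannon entropy (in bits) of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def select_uncertain(posteriors, k=2):
    """Return the indices of the k points whose label posterior has the highest entropy."""
    ranked = sorted(range(len(posteriors)),
                    key=lambda i: entropy(posteriors[i]),
                    reverse=True)
    return ranked[:k]

# Hypothetical posteriors (P(normal), P(anomaly)) for four points of a trace.
posteriors = [(0.98, 0.02), (0.55, 0.45), (0.10, 0.90), (0.50, 0.50)]

# The points the expert would be asked to label first.
queries = select_uncertain(posteriors, k=2)
print(queries)
```

Labels collected for the selected points would then be fed back as additional training data, which is the general idea of an active learning loop.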
References:
Abstract:
Recently, a new image representation has been proposed for both image classification and retrieval. It performs a non-linear
feature transformation on descriptors and then aggregates the results to form image-level representations. This new method,
called Super-Vector coding, is a simple extension of Vector Quantization. Experiments on image classification achieve state-of-the-art
accuracy on the well-known PASCAL benchmarks. To utilize Super-Vector coding in large-scale image search, a joint optimization of dimension
reduction and indexing is proposed. The evaluation also shows significant improvement.
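A simplified sketch of the idea is given here, assuming hard assignment to the nearest codeword and plain averaging as the aggregation step. The constant `s`, the function names, and the toy codebook are assumptions for illustration; the published method may weight and normalize the blocks differently.

```python
def nearest(x, codebook):
    """Index of the codeword closest to descriptor x (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda k: sum((a - b) ** 2 for a, b in zip(x, codebook[k])))

def super_vector(descriptors, codebook, s=1.0):
    """Encode each descriptor as [s, x - v_k] in the block of its nearest
    codeword v_k, then average the codes into one image-level vector."""
    K, d = len(codebook), len(codebook[0])
    image_vec = [0.0] * (K * (d + 1))
    for x in descriptors:
        k = nearest(x, codebook)
        base = k * (d + 1)
        image_vec[base] += s                       # vector-quantization part
        for j in range(d):
            image_vec[base + 1 + j] += x[j] - codebook[k][j]  # residual part
    n = len(descriptors)
    return [v / n for v in image_vec]

# Toy example: two codewords, two 2-D descriptors.
codebook = [(0.0, 0.0), (10.0, 10.0)]
descriptors = [(1.0, 0.0), (9.0, 10.0)]
vec = super_vector(descriptors, codebook)
print(vec)
```

Dropping the residual part of each block leaves a plain vector-quantization histogram, which is the sense in which the coding extends Vector Quantization.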
References:
Abstract:
The large number of social images significantly facilitates web image search. However, the tags associated with images are
not in order of importance or relevance. In this paper, the authors propose a tag ranking scheme that ranks the tags of an image based on its
content. They estimate initial relevance scores for the tags by kernel density estimation and then adopt a random walk to refine
the relevance scores. The approach is tested on a collection of 50,000 Flickr photos.
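The two steps can be sketched roughly as follows. This is an illustration under assumptions, not the authors' implementation: the Gaussian kernel, the damping factor `alpha`, the row-normalized transition matrix, and all example numbers are hypothetical.

```python
import math

def kde_score(x, neighbors, sigma=1.0):
    """Gaussian kernel density of image feature x among images carrying the same tag."""
    total = 0.0
    for y in neighbors:
        d2 = sum((a - b) ** 2 for a, b in zip(x, y))
        total += math.exp(-d2 / (2 * sigma ** 2))
    return total / len(neighbors)

def random_walk(initial, similarity, alpha=0.5, iters=50):
    """Refine scores by iterating r <- alpha * P r + (1 - alpha) * v,
    where P is the row-normalized tag similarity matrix and v the initial scores."""
    n = len(initial)
    P = [[s / sum(row) for s in row] for row in similarity]
    r = list(initial)
    for _ in range(iters):
        r = [alpha * sum(P[i][j] * r[j] for j in range(n)) + (1 - alpha) * initial[i]
             for i in range(n)]
    return r

# Hypothetical tags of one photo, with made-up initial scores and tag similarities.
tags = ["sky", "dog", "cat"]
initial = [0.2, 0.9, 0.5]
similarity = [[1.0, 0.1, 0.1],
              [0.1, 1.0, 0.8],
              [0.1, 0.8, 1.0]]

refined = random_walk(initial, similarity)
ranked = [t for _, t in sorted(zip(refined, tags), reverse=True)]
print(ranked)
```

In a full pipeline, `kde_score` would supply the `initial` vector from image features; it is kept separate here so that each step can be read on its own.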
References:
Abstract:
GPU (graphics processing unit) cards have evolved rapidly over the last decade, with constant improvements in
capacity, speed, and flexibility, and they are now cheap and widespread. GPUs have an intrinsically parallel architecture
that is useful for executing many calculations at the same time or for splitting a complex problem into several
simple iterations. Many researchers and developers have become interested in harnessing the power of
graphics hardware for general-purpose computing. Recent years have seen an explosion of interest in
such research efforts, collectively known as GPGPU computing, and GPUs are now more flexible and
programmable and are widely used in scientific and engineering research.
CUDA is a technology developed by NVIDIA for GPGPU computing on recent models of its graphics cards. With
extensive programming-language support and hundreds of completed research projects, it is one of the most
important and popular GPU computing technologies. With millions of CUDA-enabled GPUs sold to date,
software developers, scientists and researchers are finding broad-ranging uses for CUDA, including image and video processing, computational biology and chemistry, fluid dynamics simulation, CT image reconstruction, seismic analysis, ray tracing, and much more.
Links of interest: http://www.nvidia.com/object/GPU_Computing.html About GPU computing
http://www.nvidia.com/object/cuda_home_new.html NVIDIA's CUDA zone
References:
Abstract:
In this paper, the authors report a system that disambiguates person names in Web search results.
The system uses named entities, compound key words, and URLs as features for document similarity calculation;
these features typically yield clustering results with high precision but low recall. To improve the low recall, the authors propose a two-stage
clustering algorithm based on bootstrapping, in which the clustering
results of the first stage are used to extract the features used in the second-stage clustering.
Experimental results show that the algorithm yields better scores than the best systems at the
latest WePS workshop.
References:
Abstract:
Today, the publication of microdata poses a privacy threat: anonymous personal records can be re-identified
using third data sources. Past research has tried to develop a concept of privacy guarantee that an anonymized
data set should satisfy before publication, culminating in the notion of t-closeness. To satisfy t-closeness,
the records in a data set need to be grouped into Equivalence Classes (ECs), such that each EC contains records
of indistinguishable quasi-identifier values, and its local distribution of sensitive attribute (SA) values
conforms to the global table distribution of SA values. However, despite this progress, previous research has
not offered an anonymization algorithm tailored for t-closeness. This paper covers this gap with SABRE, a SA
Bucketization and REdistribution framework for t-closeness. SABRE first greedily partitions a table into buckets
of similar SA values and then redistributes the tuples of each bucket into dynamically determined ECs. This
approach is facilitated by a property of the Earth Mover's Distance (EMD) that is employed as a measure of
distribution closeness: If the tuples in an EC are picked proportionally to the sizes of the buckets they hail
from, then the EMD of that EC is tightly upper-bounded using localized upper bounds derived for each bucket.
This paper includes the proof that if the t-closeness constraint is properly obeyed during partitioning, then
it is obeyed by the derived ECs too. Two instantiations of SABRE are developed and extended to a streaming
environment. The extensive experimental evaluation demonstrates that SABRE achieves information quality superior
to schemes that merely applied algorithms tailored for other models to t-closeness, and can be much faster as well.
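The t-closeness condition itself can be illustrated with a short sketch that compares the SA distribution of one EC with that of the whole table. This only checks the constraint and is not the SABRE algorithm. The two EMD formulas are the closed forms commonly used for the equal and ordered ground distances; which ground distance applies is an assumption here, and the function names and example data are hypothetical.

```python
from collections import Counter

def distribution(values, domain):
    """Relative frequency of each domain value in a list of SA values."""
    counts = Counter(values)
    n = len(values)
    return [counts[v] / n for v in domain]

def emd_equal(p, q):
    """EMD under the equal ground distance: half the L1 distance."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def emd_ordered(p, q):
    """EMD under the ordered ground distance: normalized sum of the
    absolute cumulative differences between the two distributions."""
    running, total = 0.0, 0.0
    for a, b in zip(p, q):
        running += a - b
        total += abs(running)
    return total / (len(p) - 1)

def satisfies_t_closeness(ec_values, table_values, domain, t, emd=emd_equal):
    """True if the EC's SA distribution is within EMD t of the table's."""
    p = distribution(ec_values, domain)
    q = distribution(table_values, domain)
    return emd(p, q) <= t

# Hypothetical table with one categorical sensitive attribute.
domain = ["cancer", "flu"]
table = ["flu", "flu", "cancer", "cancer"]
balanced_ec = ["flu", "cancer"]   # mirrors the table distribution
skewed_ec = ["flu", "flu"]        # reveals that every member has flu

print(satisfies_t_closeness(balanced_ec, table, domain, t=0.2))
print(satisfies_t_closeness(skewed_ec, table, domain, t=0.2))
```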
References:
Abstract:
The popularity of location-aware applications (e.g., Restaurant Finder on
the iPhone) is growing with the ubiquitous availability of GPS. Besides
GPS, those applications require digital street maps. Small portions of
maps are freely available from Google Maps or MapQuest, but entire maps
are very expensive and only available from proprietary vendors. Thus, the
lack of availability of free digital maps limits the development of such
applications. Considering the need for affordable digital maps, in July
2004, OpenStreetMap was founded with the objective of creating a free
editable map of the world. The maps are created using data from portable
GPS devices, aerial photography, and other free sources or simply from
local knowledge. The crowd-sourced approach allows anyone to update maps
with his or her own uploaded trajectories, and anyone can download maps
into a GPS device for free. However, the map updates and map
construction from GPS traces are performed manually, due to the lack of
algorithms to support these tasks. The objective of this research is to
develop geometric algorithms with quality and performance guarantees
for map construction and map updates from geo-referenced trajectory data.
References:
Abstract:
In recent years, fast and reliable databases for large-scale (semi-)structured data have
been needed in multiple fields. Relational databases have been found expensive and sometimes even
difficult to map to the structures at hand for fast queries. Bigtable is a distributed
storage system for managing structured data that is designed to scale to a very large size:
petabytes of data across many commodity servers. Many projects at Google store data in Bigtable,
including web indexing, Google Earth, and Google Finance. These applications place very different
demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery)
and latency requirements (from backend bulk processing to real-time data serving). Despite these
varied demands, Bigtable has successfully provided a flexible, high-performance solution for all
of these Google products. In this paper we describe the simple data model provided by Bigtable,
which gives clients dynamic control over data layout and format, and we describe the design and
implementation of Bigtable. For our research work, we need this kind of solution to store and retrieve
existing MATLAB structures, which are very hard to map to relational databases due to the non-uniformity
of their attribute fields and types.
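To make the data model concrete, here is a toy in-memory sketch of a map keyed by row, column, and timestamp, which is how the Bigtable paper describes its model. The class `ToyTable`, its methods, and the example rows are hypothetical; this is not the Bigtable client API, and it omits distribution, persistence, and column families.

```python
class ToyTable:
    """In-memory stand-in that maps (row, column, timestamp) to a value."""

    def __init__(self):
        # (row, column) -> {timestamp: value}
        self._cells = {}

    def put(self, row, column, timestamp, value):
        """Store one version of a cell."""
        self._cells.setdefault((row, column), {})[timestamp] = value

    def get(self, row, column, timestamp=None):
        """Return the newest version at or before timestamp (newest overall if None)."""
        versions = self._cells.get((row, column), {})
        candidates = [ts for ts in versions if timestamp is None or ts <= timestamp]
        return versions[max(candidates)] if candidates else None

    def columns(self, row):
        """List the columns present for a row; rows need not share columns."""
        return sorted(c for (r, c) in self._cells if r == row)

# Hypothetical records with non-uniform fields, as in heterogeneous MATLAB structs.
table = ToyTable()
table.put("subject01", "meta:age", 1, "34")
table.put("subject01", "signal:eeg", 1, "<binary blob v1>")
table.put("subject01", "signal:eeg", 2, "<binary blob v2>")
table.put("subject02", "meta:notes", 1, "left-handed")

print(table.columns("subject01"))
print(table.get("subject01", "signal:eeg"))
```

Because each row carries only the columns it actually has, records with different fields and types can sit side by side without a fixed schema, which is the property the paragraph above is after.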
References:
Abstract:
Background: It has been suggested that, in the human protein-protein interaction network, changes
of co-expression between highly connected proteins ("hubs") and their interaction neighbours might
have important roles in cancer metastasis and be predictive disease signatures for patient outcome.
However, for a cancer, such disease signatures identified from different studies have little overlap.
Results: Here, we propose a systemic approach to evaluate the reproducibility of disease signatures
at multiple levels, on the basis of some statistically testable biological models. Using two datasets
for breast cancer metastasis, we showed that different signature hubs identified from different
studies were highly consistent in terms of significantly sharing interaction neighbours and displaying consistent co-expression changes with their overlapping neighbours, whereas the shared interaction neighbours were significantly over-represented with known cancer genes and enriched in pathways deregulated in breast cancer pathogenesis. Then, we showed that the signature hubs identified from the two datasets were highly reproducible at the protein interaction and pathway levels in three other independent datasets.
Conclusions: Our results provide a possible biological model that different signature hubs altered
in different patient cohorts could disturb the same pathways associated with cancer metastasis through their interaction neighbours.
References:
Abstract:
We present the first results showing that the Frechet distance
between non-flat surfaces can be approximated within a constant factor in
polynomial time. Computing the Frechet distance for surfaces is a
surprisingly hard problem. It is not known whether it is computable, it has
been shown to be NP-hard, and the only known algorithm computes the Frechet
distance for flat surfaces (Buchin et al.). We adapt this algorithm to create
one for computing the Frechet distance for a class of surfaces which we call
folded polygons. Unfortunately, if extended directly the original algorithm
no longer guarantees that a homeomorphism exists between the surfaces. We
present three different methods to address this problem. The first
is a fixed-parameter tractable algorithm. The second is a polynomial-time
approximation algorithm which approximates the optimum mapping. Finally, we
present a restricted class of folded polygons for which we can compute the
Frechet distance in polynomial time.
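For readers unfamiliar with the measure, the following sketch computes the discrete Frechet distance between two polygonal curves with the standard dynamic program. This is the much simpler curve setting, not the surface or folded-polygon algorithms of the talk, and the example curves are made up.

```python
import math

def discrete_frechet(P, Q):
    """Discrete Frechet distance between point sequences P and Q.

    ca[i][j] is the smallest leash length needed to traverse P[0..i] and
    Q[0..j] when both walkers may only stay put or step forward."""
    n, m = len(P), len(Q)
    ca = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = math.dist(P[i], Q[j])
            if i == 0 and j == 0:
                ca[i][j] = d
            elif i == 0:
                ca[i][j] = max(ca[0][j - 1], d)
            elif j == 0:
                ca[i][j] = max(ca[i - 1][0], d)
            else:
                ca[i][j] = max(min(ca[i - 1][j], ca[i - 1][j - 1], ca[i][j - 1]), d)
    return ca[n - 1][m - 1]

# Two parallel segments one unit apart.
P = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
Q = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
print(discrete_frechet(P, Q))
```

For curves the order-preserving matching is one-dimensional, which is what makes this dynamic program possible; for surfaces the matching must be a homeomorphism between two-dimensional domains, which is the source of the difficulty described above.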
References:
Please send emails to qitian@cs.utsa.edu, or to the seminar co-organizers: Kay Robbins, Weining Zhang, Yufei Huang, Carola Wenk, Jianhua Ruan, and Qi Tian.