Selected Projects


Active Projects

1. Efficient Data Processing in Molecular Dynamics Simulation: Framework and Algorithms

In this project, we will design and develop a molecular simulation data management system, we call P2DMS, to analyze large distributed molecular dynamics (MD) simulation data. The salient features of the system include: (1) a push-based local query engine design that handles data in a batch processing manner and processes many queries at the same time; (2) optimized MD analytics tools using modern many-core hardware such as GPUs; and (3) efficient management and access to distributed data over wide area networks, which is quite common for large scale MD simulations. This will be done by building a data analysis layer on top of state-of-the-art distributed big data management systems. The outcome of this project will not only improve the efficiency of MD data processing, but also enable new knowledge discovery that is currently regarded difficult or infeasible. In particular, we will integrate the P2DMS program into existing MD simulation packages, and validate the new design with important real-world biological and MD methodological problems. In particular, we will (1) model the structure-function relationships of how the spike protein of SARS-CoV-2 interacts with the human angiotensin converting enzyme 2 (ACE2) receptor; and (2) enhance the performance of a recently developed parameter optimization software for active control of MD simulations. This project is supported by a National Institutes of Health (NIH) R01 grant (1R01GM140316). Most of the aims of this project can be viewed as a continuation of that of a previous NIH R01 grant 1R01GM086707.




Past Projects

1. II-New: A Research Platform for Heterogeneous, Massive Parallel Computing

The aim of this project is to build a new computing research infrastructure that will accelerate existing research and enable ground-breaking new research that shares the common theme of massive parallel computing at USF. The requested infrastructure is a computer cluster with about 120 many-core co-processors including GPUs and Intel Phi cards. This acquisition will provide the much needed parallel computing capacity that is currently lacking in the existing USF IT infrastructure. This project will bring together eight investigators with research projects in several core disciplines of computer science and engineering: big data management, scientific computing, system security, hardware design, data mining, computer vision and pattern recognition. The proposed new resource will enable or enhance research in design, implementation, and evaluation of parallel programs at various levels of the software stack, ranging from cluster-level systems to individual application components and domain-specific languages. The infrastructure will also enhance learning experience of students as well as course content, curriculum development, and laboratory environment. The team of investigators will form a “USF Parallel Computing Workgroup” to coordinate related research, education, and outreach activities. This ensures the project team act in an integrated way to maximize the impact of the new infrastructure on the entire USF community. This project is supported by an NSF CRI award (CNS-1513126).

2. G-SDMS: A High-Throughput Scientific Data Management System

Many scientific domains have entered a data-driven era, in which scientific discovery depends heavily on effective and efficient analysis of large-scale data generated by wet-bench experiments or computer simulations. Current database management systems (DBMSs), while being very popular in the business world, fall short in high-throughput data processing required by scientific applications. The goal of this project is to design and implement a novel data management software architecture that enables high-throughput data management services for general scientific communities. The project achieves this goal via (1) a novel one-scan-fits-all data processing framework based on repetitive scans of large data sources; (2) a query engine that leverages the massive computing power of modern Graphics Processing Units (GPU) hardware; and (3) design and implementation of algorithms for popular analytics in three scientific domains on top of the query engine to demonstrate the effectiveness and efficiency of the proposed architecture. The project also aims at building a software prototype and evaluating this prototype with real-world scientific datasets and query workloads. The project is expected to provide a highly efficient solution to satisfy the data management needs of a wide range of scientific fields. To deliver comparable performance, the proposed architecture requires only a fraction of the hardware and energy costs needed by existing systems. As a result, it will make scientific studies that are regarded as difficult or infeasible a reality. The proposed research will be integrated into educational endeavors that contribute to broadening the influence of computer science, nurturing the next generation of multidisciplinary scientists, and boosting the success of minority and women students in the computer science and engineering field. This project is supported by an NSF CAREER award (IIS-1253980). Sample publications: [SSDBM13] [TKDE14] [BigData14] [SIGMOD17]

3. Efficient Data Processing in Molecular Dynamics Simulation: Framework and Algorithms

Atomistic level Molecular Dynamics can provide insights into the structure, dynamics, and thermodynamic characteristics of biological structures. Studies on molecular dynamics thus attracted a great deal of attention from the research communities of biology, physics, and chemistry. Computer simulation is an important method for molecular dynamics research. The outcome of an MD simulation is a “pseudo trajectory” of positions and velocities of all of the atoms in the simulation system. Due to the large number of atoms/molecules and time instances to simulate, such simulations generate tremendous amounts of data. How to effectively and efficiently reason about such data is a critical problem that directly affects the success of any molecular simulation study. Current strategies to store, access, and query such data are all based on computer files. This renders some serious efficiency problems in querying and sharing the simulation data. In this project, we propose the idea of storing simulation data in databases and develop novel indexing strategies to help optimize the processing of a wide range of queries. This project is supported by a National Institutes of Health (NIH) R01 grant (R01-GM086707). Sample publications: [ICDE09][VLDBJ10]EDBT12[TKDE13][Code download]

4. Power-Aware Database Systems

Maintaining a sustainable society via technological innovations has been a major challenge to scientists and engineers in the 21st century. With the total energy consumption of computing equipment increasing at a steep rate, much attention has been paid to the design of energy-effecient computing systems. Power-aware computing research at the applications level has been found to be synergistic to that at the system level (e.g., hardware and OS) because it can provide more opportunities of energy reduction for the underlying systems. In this project, we focus on database management systems (DBMS), which often consume a majority of the computing resources (and power) in modern data centers. We aim the design and implementation of a power-aware DBMS that can significantly reduce energy use with graceful performance degradation. Our preliminary results have demonstrated great potential of this line of research. This project is supported by an NSF grant via its CISE core program (IIS-001117699) and a Florida Energy Systems Consortium (FESC) seed grant. Sample publications: [ICDE10][SIGMODREC14][ICDCS13]

5. Load Shedding in Data Stream Management Systems

The continuous feature of both data and queries in data stream management systems (DSMSs) place great demand on system resources. However, queries can be processed with different levels of quality such as timeliness, reliability, and uncertainty. In this project, we study the problem of how to maintain tuple processing delays in query processing in DSMSs. A widely-used approach to maintain QoS (especially tuple delays) in DSMS query processing is load shedding, i.e., dropping data. Current load shedding solutions utilize simple, intuitive ideas in determining the time and amount of load to be discarded and do not work well in the presence of system/environmental disturbances. We propose a solution based on feedback control theory with significantly improved long-term performance. Currently, we are exploring the strategy of semantic load shedding: selectively shedding data according to their importance to the query results. Publications: [ICDE07][VLDB06][DEXA05a][Code download]