Research
Supporting drug development in biotech/pharma
At Genentech, I provide statistical and computational support across several therapeutic areas. My past and current studies include phase Ib, Ib/II, and III clinical trials of the Mosunetuzumab, Glofitamab, and Satralizumab molecules in hematology, pediatrics, and ophthalmology. My main responsibilities include SDTM mapping, ADaM derivation, and TLG generation following CDISC standards, delivered to clinical science teams across Genentech/Roche Product Development. I also build R-Shiny applications for exploratory clinical analyses, e.g., to examine cytokine release syndrome in patients with B-cell non-Hodgkin lymphoma or patient characteristics in thyroid eye disease.
Beyond these therapeutic molecules, I co-develop admiraldiscovery, which documents the functionality of the admiral family of packages as part of the Pharmaverse movement for clinical reporting across the biotech/pharma industry, and I support trial design through real-world evidence. On the side, I take on different roles to facilitate knowledge sharing and improve internal workflows across Genentech/Roche Data Sciences:
- Co-lead Git/GitLab training session on internal workflow for Analytical Data Scientists
- Onboard study teams migrating their molecules to new R-based tools and systems
- Co-host internal North America Data Sciences Forum (NADF) meetings for Genentech SSF/Roche Mississauga
Estimating causal effects of organ quality and health policies in kidney transplants
I worked as a research assistant to Dr. Douglas Schaubel at the University of Pennsylvania, conducting research at the intersection of survival analysis and causal inference, with applications to kidney disease. In these projects, I processed data and conducted statistical analyses to measure the causal effects of transplant centers, investigate the impact of multiple waitlisting on survival, and evaluate a novel prognostic-score-based weighting method for transplant center evaluation.
One project evaluated the impact of receiving organs from Hepatitis C virus-positive (HCV+) deceased donors on post-transplant survival (Schaubel et al., 2022). Patients in the HCV- group are unweighted. To address confounding in the HCV+ group, we derived a two-dimensional prognostic score, with one dimension for donor risk and one for recipient risk. The donor-associated risk score is the continuous kidney donor risk index, while the recipient-associated risk score is the prognostic score estimated from a center-stratified Cox model with recipient characteristics. The two-dimensional score is then used to classify patients into risk classes, from which we obtained individual weights to generate weighted Nelson-Aalen survival curves and log-rank tests.
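To make the last step concrete, here is a minimal sketch of a weighted Nelson-Aalen estimator. It is an illustration only, not the procedure from Schaubel et al. (2022): the function name `weighted_nelson_aalen`, the toy data, and the weights are hypothetical, and the survival curve is obtained from the cumulative hazard as exp(-H(t)), which is one common convention.

```python
import numpy as np

def weighted_nelson_aalen(time, event, weight):
    """Weighted Nelson-Aalen cumulative hazard at each distinct event time.

    Hypothetical helper for illustration; the weights are taken as given.
    """
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    weight = np.asarray(weight, dtype=float)

    event_times = np.unique(time[event])
    cum_hazard = np.empty(len(event_times))
    running = 0.0
    for i, t in enumerate(event_times):
        d = weight[(time == t) & event].sum()  # weighted number of events at t
        r = weight[time >= t].sum()            # weighted number at risk at t
        running += d / r
        cum_hazard[i] = running

    survival = np.exp(-cum_hazard)  # survival curve implied by the cumulative hazard
    return event_times, cum_hazard, survival

# Toy data (made up): follow-up times, event indicators, individual weights
time = [2, 3, 3, 5, 8, 9]
event = [1, 1, 0, 1, 0, 1]
weight = [1.0, 0.5, 2.0, 1.5, 1.0, 1.0]
t, H, S = weighted_nelson_aalen(time, event, weight)
```

With all weights equal to 1, this reduces to the usual unweighted Nelson-Aalen estimator, which matches the treatment of the HCV- group described above.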
Quantifying physical activity through accelerometry data from wearable devices
During my internship at Regeneron, I worked with accelerometry data from wearable devices under the mentorship of Dr. Jacek Urbanek and Debra McIntyre. My project focused on quantifying physical activity characteristics from minute-level accelerometry data in the National Health and Nutrition Examination Survey (NHANES) and Regeneron clinical trials, using the arctools package. We used NHANES as a study sample, with the goal of creating a pipeline for digital biomarker development for the national population.
Wearable devices provide objective measurements of physical activity through an accelerometer that records acceleration in gravitational units. Raw data are collected at the sub-second level as a three-dimensional time series along three orthogonal axes, corresponding to the device's up-down, left-right, and forward-backward reference frame. These raw data are then aggregated to the minute level and summarized in non-overlapping time windows by open-source, reproducible metrics such as Monitor Independent Movement Summary (MIMS), Euclidean Norm Minus One (ENMO), and Vector Magnitude Count (VMC). In this project, I quantified minute-level accelerometry data from NHANES on the MIMS scale for physical activity summaries and performed harmonization mapping on internal clinical data for comparison purposes.
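As an illustration of how raw tri-axial data become a minute-level summary, the sketch below computes ENMO, one of the metrics named above, and averages it over non-overlapping one-minute windows. This is not the arctools or MIMS pipeline used in the project; the function names, the 10 Hz sampling rate, and the synthetic signal are assumptions, and truncating negative values at zero and averaging within the window are common conventions rather than the only ones.

```python
import numpy as np

def enmo(xyz):
    """Euclidean Norm Minus One per sample, in g, with negatives set to zero."""
    xyz = np.asarray(xyz, dtype=float)
    return np.maximum(np.linalg.norm(xyz, axis=1) - 1.0, 0.0)

def minute_means(values, hz):
    """Average a per-sample series over non-overlapping one-minute windows."""
    values = np.asarray(values, dtype=float)
    per_minute = 60 * hz
    n_full = len(values) // per_minute  # drop any incomplete trailing minute
    return values[: n_full * per_minute].reshape(n_full, per_minute).mean(axis=1)

hz = 10  # assumed sampling rate (samples per second)
n = 60 * hz

# Synthetic signal: one minute at rest (1 g on the vertical axis),
# then one minute with a constant 1.5 g on the same axis
rest = np.column_stack([np.zeros(n), np.zeros(n), np.ones(n)])
active = np.column_stack([np.zeros(n), np.zeros(n), np.full(n, 1.5)])
raw = np.vstack([rest, active])

minute_enmo = minute_means(enmo(raw), hz)
```

Subtracting 1 g removes the gravity component that a stationary device always records, so a device at rest yields an ENMO near zero.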
Simulating wave propagation with physics-informed neural networks models
In summer 2021, I had the opportunity to participate in the RIPS program at the Institute for Pure and Applied Mathematics, an NSF Math Institute at UCLA. I was assigned to a team of five working under the supervision of Dr. Laurent White and Dr. Kyung Ha to develop physics-informed neural networks to simulate wave propagation. Our project was sponsored by Advanced Micro Devices Inc. (AMD).
Machine learning surrogate models are widely used in engineering applications thanks to their computational efficiency. However, these models suffer from a lack of extrapolation accuracy. To design an optimal network architecture for simulating wave propagation, we embedded physics constraints, i.e., the PDE of the wave equation and its initial/boundary conditions, into the loss function as regularization. In addition, we sampled unlabeled input values for model training to reduce the cost of data acquisition, and extrapolated in time for acoustic waves and in space from different source locations. We presented our work at the RIPS symposium, the AMD headquarters in Santa Clara, and the Joint Mathematics Meeting 2021.
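The idea of a physics-informed loss can be sketched without a neural network. The example below evaluates the 1D wave equation residual u_tt - c^2 u_xx at randomly sampled, unlabeled collocation points and adds an initial-condition misfit. It rests on several assumptions that are not from the project: analytic candidate functions stand in for the network, central finite differences stand in for automatic differentiation, boundary terms are omitted, and all names and values (`pde_residual`, `physics_informed_loss`, `C`, `lam`) are hypothetical.

```python
import numpy as np

C = 1.0  # assumed wave speed

def pde_residual(u, x, t, c=C, h=1e-3):
    """Residual u_tt - c^2 * u_xx, approximated with central finite differences."""
    u_tt = (u(x, t + h) - 2 * u(x, t) + u(x, t - h)) / h**2
    u_xx = (u(x + h, t) - 2 * u(x, t) + u(x - h, t)) / h**2
    return u_tt - c**2 * u_xx

def physics_informed_loss(u, x_col, t_col, x_ic, u0, lam=1.0):
    """Mean squared PDE residual plus weighted initial-condition misfit."""
    pde_term = np.mean(pde_residual(u, x_col, t_col) ** 2)
    ic_term = np.mean((u(x_ic, np.zeros_like(x_ic)) - u0(x_ic)) ** 2)
    return pde_term + lam * ic_term

rng = np.random.default_rng(0)
x_col = rng.uniform(0.0, 1.0, 200)  # unlabeled collocation points in space
t_col = rng.uniform(0.0, 1.0, 200)  # and in time
x_ic = np.linspace(0.0, 1.0, 50)    # points where the initial condition is checked

def u0(x):
    """Initial displacement at t = 0."""
    return np.sin(2 * np.pi * x)

def travelling(x, t):
    """Travelling wave: satisfies both the PDE and the initial condition."""
    return np.sin(2 * np.pi * (x - C * t))

def decaying(x, t):
    """Matches the initial condition but violates the wave equation."""
    return np.sin(2 * np.pi * x) * np.exp(-t)

loss_good = physics_informed_loss(travelling, x_col, t_col, x_ic, u0)
loss_bad = physics_informed_loss(decaying, x_col, t_col, x_ic, u0)
```

Both candidates fit the initial condition, so only the PDE term separates them. In this sketch, that term is what penalizes physically implausible solutions at points where no labeled data exist.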
Evaluating the performance of joint model for longitudinal and survival data
In summer 2020, I had the opportunity to conduct research through the QSURE program at Memorial Sloan Kettering Cancer Center under the mentorship of Dr. Audrey Mauguen. Our goal was to investigate the association between the biomarker serum bilirubin and overall survival in Primary Biliary Cirrhosis using the Cox Proportional Hazards model, the time-dependent Cox model, and the Joint Model for longitudinal and time-to-event data. We then compared the estimated hazard ratios from these approaches and evaluated the benefits and drawbacks of the Joint Model.
Intuitively, the differences in the estimated hazard ratios are due to the different levels of information each model considers: the Cox PH model uses the baseline values of bilirubin; the time-dependent Cox model uses the current values of bilirubin, accounting for its changes over time; and the Joint Model captures the underlying progression of bilirubin while accounting for measurement error. In this project, I performed data manipulation in R, produced data visualizations, and conducted statistical analyses for survival comparisons. I presented my work at the MSK departmental symposium, the MHC Learning through Applications symposium, and the Electronic Undergraduate Statistics Research Conference, where I was awarded Best Video Presentation.
During my senior year, I completed this project as my honors thesis under the supervision of Dr. Marie Ozanne, for which I was awarded the highest honors.
Developing hierarchical Archimedean copula models for dependent data
In summer 2019, I conducted my first statistical research project, on copulas, under the supervision of Dr. Evan Ray at Mount Holyoke College. A copula is a joint distribution function that describes the dependence between random variables and is widely used in time series modeling. The goal of the project was to construct Archimedean copula trees with different nesting structures in order to develop an Archimedean random forest.
Intuitively, the more strongly correlated covariates are grouped closer to the bottom of a nested copula tree. However, we can introduce flexibility into the tree structure by varying the number of covariates in each node or by assigning a different copula family (e.g., Frank, Gumbel, or Clayton) to each node depending on its correlation properties. In this project, I developed the ncopula R package, which calculates the probability density function and cumulative distribution function used to estimate the parameters of nested Archimedean copulas by maximum likelihood. Additionally, I provided supplementary functions for mathematical transformations, designed unit tests to check estimation stability, and collaborated on GitHub.
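For a sense of what density-based estimation looks like in the simplest case, the sketch below fits a single bivariate Clayton copula by maximum likelihood. It does not reproduce the ncopula package or its API: it is written in Python rather than R, covers one non-nested node only, and uses a plain grid search in place of a numerical optimizer. All function names, the grid, and the simulated data are hypothetical.

```python
import numpy as np

def clayton_cdf(u, v, theta):
    """Bivariate Clayton copula CDF, theta > 0."""
    return (u ** -theta + v ** -theta - 1.0) ** (-1.0 / theta)

def clayton_logpdf(u, v, theta):
    """Log-density of the bivariate Clayton copula, theta > 0."""
    s = u ** -theta + v ** -theta - 1.0
    return (np.log1p(theta)
            - (theta + 1.0) * (np.log(u) + np.log(v))
            - (2.0 + 1.0 / theta) * np.log(s))

def sample_clayton(n, theta, rng):
    """Draw n pairs by inverting the conditional distribution of v given u."""
    u = rng.uniform(size=n)
    w = rng.uniform(size=n)
    v = (u ** -theta * (w ** (-theta / (1.0 + theta)) - 1.0) + 1.0) ** (-1.0 / theta)
    return u, v

def fit_clayton(u, v, grid=np.linspace(0.05, 10.0, 400)):
    """Grid-search maximum likelihood estimate of theta (illustrative only)."""
    loglik = [clayton_logpdf(u, v, th).sum() for th in grid]
    return grid[int(np.argmax(loglik))]

rng = np.random.default_rng(0)
u, v = sample_clayton(2000, theta=2.0, rng=rng)  # simulated data, true theta = 2
theta_hat = fit_clayton(u, v)
```

A nested model would apply the same likelihood logic at each node of the tree, with the copula family and parameter allowed to differ from node to node.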