Data Scientist at Certilytics
MS in Data Science (Artificial Intelligence specialization), completed through Northwestern University in June 2022.
View My LinkedIn Profile
View My GitHub Profile
This project applied generative AI to model complex treatment pathways and surface actionable insights for improving patient outcomes and provider performance while reducing costs. Using a GPT-2 language model, I engineered a novel approach to learning underlying disease pathways from extensive claims data.
A key innovation was a custom tokenizer with a specialized vocabulary derived directly from healthcare claims elements, ensuring precise representation of medical events. I also integrated a Rotary Position Embedding (RoPE) mechanism into the GPT-2 architecture before RoPE reached mainstream availability, improving model performance by nearly 2x. The resulting generative model was then applied to critical healthcare tasks: high-cost claimant classification, enabling early identification of at-risk individuals; provider rank ordering for optimized resource utilization; and in-depth at-risk population analysis to inform targeted interventions.
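The production implementation is proprietary, but the core of RoPE is compact: rotate each query/key channel pair by a position-dependent angle so that attention scores depend on tokens' relative offsets rather than their absolute positions. Below is a minimal sketch of the split-half formulation; the `apply_rope` helper and all shapes are illustrative, not the project code.

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply Rotary Position Embeddings to a (batch, seq, dim) tensor.

    Channel pairs are rotated by an angle that grows with position, so
    dot products between rotated queries and keys depend only on the
    relative offset between their positions.
    """
    _, seq_len, dim = x.shape
    half = dim // 2
    # Per-pair inverse frequencies: theta_i = base^(-2i/dim)
    inv_freq = base ** (-torch.arange(half, dtype=torch.float32) * 2 / dim)
    angles = torch.outer(torch.arange(seq_len, dtype=torch.float32), inv_freq)
    cos, sin = angles.cos(), angles.sin()          # each (seq, half)
    x1, x2 = x[..., :half], x[..., half:]
    # 2-D rotation of each (x1, x2) channel pair
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# Rotated queries/keys drop into GPT-2's attention in place of its learned
# absolute position embeddings (shapes here are illustrative):
q, k = torch.randn(1, 16, 64), torch.randn(1, 16, 64)
scores = apply_rope(q) @ apply_rope(k).transpose(-2, -1)
```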
This project addressed a critical challenge in healthcare operations: the efficient and accurate prediction of health insurance claim denials and their underlying reasons.
I designed and developed custom transformer-based neural network architectures tailored to this task, moving beyond off-the-shelf solutions to capture the complex relationships within claims data. These models achieved strong performance on both binary classification (denial vs. approval) and multi-class prediction (specific denial reason codes).
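The production architectures are proprietary, but the general shape of such a model can be sketched: a transformer encoder over claim-event token sequences feeding two heads, one for the binary denial decision and one for the denial reason code. Everything below, including the dimensions and vocabulary size, is illustrative.

```python
import torch
import torch.nn as nn

class ClaimDenialModel(nn.Module):
    """Transformer encoder over claim-event sequences with two heads:
    binary denial prediction and multi-class denial-reason prediction."""

    def __init__(self, vocab_size=5000, d_model=128, n_heads=4,
                 n_layers=2, n_reason_codes=50, max_len=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.denial_head = nn.Linear(d_model, 2)           # approve vs. deny
        self.reason_head = nn.Linear(d_model, n_reason_codes)

    def forward(self, token_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        hidden = self.encoder(self.embed(token_ids) + self.pos(positions))
        pooled = hidden.mean(dim=1)                        # mean-pool sequence
        return self.denial_head(pooled), self.reason_head(pooled)

model = ClaimDenialModel()
denial_logits, reason_logits = model(torch.randint(0, 5000, (8, 64)))
```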
I moved both models into production for daily batch inference on incoming client claims, leveraging MLflow for comprehensive model lifecycle management (experiment tracking, versioning, and deployment) and AWS for scalable, reliable cloud infrastructure. This solution empowers healthcare organizations to proactively address potential denials, streamline claims processing, and ultimately improve financial efficiency and the patient experience.
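For a flavor of what that lifecycle looks like, here is a hedged sketch of MLflow tracking and model registration feeding a daily batch scorer. The experiment and registered-model names are made up, the metric value is a placeholder, and `model` refers to a trained PyTorch model such as the sketch above; the real pipeline differs in its details.

```python
import mlflow
import mlflow.pytorch

# Training side: log parameters, metrics, and the model artifact, then
# register a new model version for downstream consumption.
mlflow.set_experiment("claim-denial-prediction")
with mlflow.start_run():
    mlflow.log_params({"d_model": 128, "n_layers": 2})
    mlflow.log_metric("val_auc", 0.91)  # placeholder value
    mlflow.pytorch.log_model(model, "model",
                             registered_model_name="claim_denial_binary")

# Batch-inference side: each day, pull the latest registered version
# from the model registry and score the incoming claims.
scorer = mlflow.pytorch.load_model("models:/claim_denial_binary/latest")
```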
I retrained a word embedding deep representation model on newly acquired data containing patient-level medical-system utilization sequences and ran a hyperparameter tuning study. I tested a hyperparameter grid of 22 model configurations; the optimal setting ultimately selected improved average AUC or R² (for classification and regression applications, respectively) by 5% across the entire model suite.
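The actual 22 configurations are internal, but the pattern of generating a configuration grid is straightforward; the sketch below uses gensim's Word2Vec as a stand-in for the embedding model, with hypothetical parameter names and values.

```python
from itertools import product

from gensim.models import Word2Vec

# Hypothetical grid over common embedding hyperparameters.
grid = {
    "vector_size": [100, 200],
    "window": [5, 10],
    "negative": [5, 10, 15],
}
configs = [dict(zip(grid, values)) for values in product(*grid.values())]

def train_embedding(sequences, config):
    """Train one candidate embedding over patient utilization sequences,
    where each sequence is a list of medical-code tokens."""
    return Word2Vec(sentences=sequences, workers=4, epochs=5, **config)
```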
To evaluate each embedding extrinsically, I trained and scored four end-to-end models per embedding (one in each of the four model pipelines), enabling cross-model comparisons. After selecting the optimal final configuration from these extrinsic evaluations, I partnered with Certilytics' internal clinical expert on an intrinsic evaluation: a custom clustering challenge over hand-selected medical codes expected to cluster with similar codes and sit apart from dissimilar ones.
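The intrinsic evaluation boils down to a sanity check: clinically similar codes should sit closer together in the embedding space than dissimilar codes. A minimal sketch, with hypothetical code groups and a gensim-style keyed-vector lookup (`model.wv` follows the interface of the Word2Vec sketch above):

```python
import numpy as np

def mean_cosine(emb, codes_a, codes_b):
    """Average pairwise cosine similarity between two groups of codes,
    skipping identical pairs."""
    sims = []
    for a in codes_a:
        for b in codes_b:
            if a == b:
                continue
            va, vb = emb[a], emb[b]
            sims.append(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))
    return float(np.mean(sims))

# Hypothetical hand-selected groups: related diabetes codes should score
# higher against each other than against unrelated orthopedic codes.
diabetes = ["E11.9", "E11.65"]
orthopedic = ["S52.5", "M17.11"]
within = mean_cosine(model.wv, diabetes, diabetes)
across = mean_cosine(model.wv, diabetes, orthopedic)
print(f"within-group: {within:.3f}, across-group: {across:.3f}")
```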
The final embedding now sits at the center of most model pipelines in Certilytics' model suite.
I was tasked with building a program to create projection scenarios for future leases across AIR Communities' apartment property portfolio. The projections informed the organization's budgeting and forecasting process, and I was initially approached to own this project after a single-property projection (one of ~100 owned properties) built in Excel could not handle the complete unit-level output and took close to an hour to run.
I built the logic into a Python program that wrote results to a new SQL table consumed by the Decision Support team for forecasting. While the logic and calculations feeding the forecast are highly proprietary, the video below shows the GUI application I built on top of the program and bundled into an executable, enabling Decision Support staff to independently rerun the program while tweaking model inputs. I used Tkinter to develop the GUI and PyInstaller to create the executable.
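While the projection logic itself stays proprietary, the GUI follows a standard Tkinter pattern: an input form wired to a run callback. A stripped-down sketch with an illustrative input field and a placeholder callback:

```python
import tkinter as tk
from tkinter import ttk

def run_projection():
    """Placeholder for the proprietary projection logic, which wrote
    unit-level results to a SQL table for the Decision Support team."""
    growth = float(growth_entry.get())
    status_label.config(text=f"Projection run at {growth:.1%} rent growth")

root = tk.Tk()
root.title("Lease Projection Scenarios")

ttk.Label(root, text="Assumed rent growth (e.g. 0.03):").grid(row=0, column=0)
growth_entry = ttk.Entry(root)
growth_entry.grid(row=0, column=1)
ttk.Button(root, text="Run projection",
           command=run_projection).grid(row=1, column=0)
status_label = ttk.Label(root, text="")
status_label.grid(row=1, column=1)

root.mainloop()
```

Bundling a script like this with PyInstaller (`pyinstaller --onefile app.py`) produces the kind of standalone executable that let Decision Support staff rerun scenarios without a Python installation.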
This project transformed the forecasting process at AIR Communities, extending the forecast horizon and letting the organization model its financial future across many more scenarios thanks to the fast, user-friendly deliverable.
This project collected over forty attributes for more than 850 competitor multi-family apartment properties from the CoStar property research platform. The program completed data collection, cleansing, and loading into storage in under eight minutes end to end. CoStar has since updated its Terms of Use to explicitly prohibit the web scraping and reverse-engineering techniques this program relied on, so I led the project in an alternate direction to acquire similar data while keeping the business in compliance, and I have shared the original program as proof of work.
I created a tutorial and video demonstration of the automatic machine learning (AutoML) tool DataRobot. The tutorial provides a simple demonstration of DataRobot integration into a project applying sentiment analysis to daily chatbot message data to rank order prospect follow-up outreach conducted the following day. The final application can be viewed in the separate Prospect Ranked Follow-up Application repository.
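The sentiment scoring itself happens inside DataRobot; the downstream rank-ordering step is simple. A hypothetical sketch with made-up columns and scores, sorted most-positive-first (the actual prioritization rule lives in the application repository):

```python
import pandas as pd

# Hypothetical daily export: one row per prospect conversation, with a
# sentiment score in [-1, 1] produced by the deployed sentiment model.
messages = pd.DataFrame({
    "prospect_id": [101, 102, 103],
    "sentiment": [0.82, -0.40, 0.15],
})

# Rank order the next day's follow-up outreach by sentiment.
follow_up_queue = messages.sort_values("sentiment", ascending=False)
print(follow_up_queue)
```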
This project is a case study in developing NLP applications in a low-resource corporate environment operating a client-centric, service-based business model. I pretrained miniature BERT masked language models on a domain-adapted vocabulary sourced from client-facing research documents, demonstrating modest improvements over baseline when fine-tuned to categorize client consultation requests by topic.
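For a sense of scale, a "miniature" BERT can be pretrained on the masked language modeling objective with the Hugging Face transformers library in a few dozen lines. The sketch below is illustrative rather than the project code: the vocabulary file, corpus, and model dimensions are all placeholders.

```python
import torch
from transformers import (BertConfig, BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# Tokenizer built from a domain-adapted WordPiece vocabulary (placeholder path).
tokenizer = BertTokenizerFast(vocab_file="domain_vocab.txt")

# "Miniature" BERT: far fewer layers and heads than bert-base.
config = BertConfig(vocab_size=tokenizer.vocab_size, hidden_size=128,
                    num_hidden_layers=2, num_attention_heads=2,
                    intermediate_size=512)
model = BertForMaskedLM(config)

texts = ["placeholder client-facing research sentence"]  # stand-in corpus
encodings = tokenizer(texts, truncation=True, padding=True, max_length=128)

class LineDataset(torch.utils.data.Dataset):
    def __len__(self):
        return len(encodings["input_ids"])
    def __getitem__(self, i):
        return {k: torch.tensor(v[i]) for k, v in encodings.items()}

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mini-bert-mlm", num_train_epochs=1),
    train_dataset=LineDataset(),
    # Randomly masks 15% of tokens: the standard MLM pretraining objective.
    data_collator=DataCollatorForLanguageModeling(tokenizer,
                                                  mlm_probability=0.15),
)
trainer.train()
```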
I developed this toolkit to automate the collection of video recordings, recording metadata, and transcripts from a variety of video conferencing, video hosting, and transcription service platforms. I used the tools myself during four years of remote client relationship management supporting a territory of hundreds of clients.
As the lucky husband of the founder of The Beverly Collective, a Colorado-based art collective, I built this program to reduce the manual workload of sending biweekly sales report emails to the 30+ artists and makers vending through the collective. I completed the coding in under 5 hours, cutting the monthly workload from 10 hours to just 2 hours focused on email validation, payment processing, and vendor support. I leveraged the Gmail API to gather user permissions and create email drafts within the user's account, and loaded Excel files into the Python program using the openpyxl library.
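A minimal sketch of the two integrations is below. The workbook layout, file names, and OAuth token path are hypothetical, and the Gmail call creates a draft rather than sending, matching the review-first workflow.

```python
import base64
from email.message import EmailMessage

from google.oauth2.credentials import Credentials
from googleapiclient.discovery import build
from openpyxl import load_workbook

# Assumes a prior OAuth consent flow saved user credentials to token.json.
creds = Credentials.from_authorized_user_file("token.json")
service = build("gmail", "v1", credentials=creds)

def create_draft(vendor_email: str, body: str):
    """Create a Gmail draft (not a send) so each report can be reviewed."""
    msg = EmailMessage()
    msg["To"] = vendor_email
    msg["Subject"] = "Biweekly sales report - The Beverly Collective"
    msg.set_content(body)
    raw = base64.urlsafe_b64encode(msg.as_bytes()).decode()
    return service.users().drafts().create(
        userId="me", body={"message": {"raw": raw}}).execute()

# Hypothetical workbook layout: column A = vendor email, column B = sales.
wb = load_workbook("biweekly_sales.xlsx")
for email, total in wb.active.iter_rows(min_row=2, values_only=True):
    create_draft(email, f"Your sales this period totaled ${total:.2f}.")
```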
This project extends personal diabetes data and insights into real-time streaming, IoT integrations, and predictive modeling. It builds on a foundation of diabetes data democratization facilitated by Nightscout, an open-source cloud application that people with diabetes, providers, and caretakers use to visualize, store, and share data from Continuous Glucose Monitoring sensors in real time.
Having recently established sensor data accessibility via a web-hosted MongoDB database, I am actively pursuing two aims with this project: real-time streaming and IoT integrations, and predictive modeling of the glucose data.
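As a starting point for both aims, here is a sketch of pulling recent readings with pymongo. The connection string is a placeholder; the `entries` collection with `sgv` (sensor glucose value, mg/dL) and epoch-millisecond `date` fields follows Nightscout's standard schema.

```python
from datetime import datetime, timezone

from pymongo import MongoClient

client = MongoClient("mongodb+srv://user:pass@cluster.example.net/nightscout")
entries = client["nightscout"]["entries"]

# Pull the last hour of CGM readings, newest first.
cutoff_ms = (datetime.now(timezone.utc).timestamp() - 3600) * 1000
for reading in entries.find({"date": {"$gte": cutoff_ms}}).sort("date", -1):
    print(reading["dateString"], reading["sgv"])
```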
I co-authored a blog series hosted on Ultiworld Disc Golf predicting disc golf player performance at elite series events. I contributed web scraping of player performance data and GIS data collection, cleaned and preprocessed data, and edited post content. The scripts in this repository demonstrate some of the larger data collection efforts feeding parts of the model. This was my first project in Python, and I am revisiting the files to spruce up the content.
The blog posts are available on the Ultiworld Disc Golf website.