National Data Bank for Rheumatic Diseases

A case study on the use of Stata.

Stata at the National Data Bank for Rheumatic Diseases

Fred Wolfe, MD, leads the National Data Bank for Rheumatic Diseases (NDB). The NDB collects self-reported information directly from patients using 28-page questionnaires mailed at six-month intervals, gathering information on use of services, medical costs, financial status, functional ability, quality of life, psychological status, treatments received and their side effects, and long-term outcomes pertaining to illness, work, and death. Patients are typically referred to the NDB by their rheumatologist, and critical medical events are validated by obtaining medical records.

"Stata allowed us to do data management in a flexible, useful, cost-effective way that we couldn’t do otherwise." — Fred Wolfe

Unlike other data banks, which collect data primarily from administrative sources such as Medicare or insurance companies, or from physicians, hospitals, and laboratories, the NDB collects its data directly from patients. This allows researchers to answer questions that are most germane to patients but that cannot be answered from other databases. These questions include how effective treatments are in the community rather than in randomized clinical trials, whether patients use the treatments, whether patients report less pain, and how the disease affects patients’ daily lives.

NDB staff enter the data, which span nearly 10,000 variables, into a Microsoft SQL database. Wolfe and his team then use Stata programs to build and update the dataset, a step that runs nightly and takes about six hours. Further Stata programs then check the data for consistency. “Immortalized” datasets are created upon the completion of each six-month survey phase. The current immortalized dataset contains nearly 600,000 observations on 89,000 patients. Auxiliary programs allow database managers to apply value labels and account for missing values, and allow users to extract, manipulate, and process variables of interest. All told, the NDB uses over 1,000 programs and do-files, mostly written by Wolfe or his colleague, Kaleb Michaud, a senior analyst at NDB and assistant professor of medicine at the University of Nebraska Medical Center.
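The consistency-checking and labeling steps described above might look something like the following do-file. This is only a minimal sketch on toy data, not the NDB's actual code: the variable names (patid, phase, pain), the sentinel code -9 for an unanswered item, and the 0 to 10 pain scale are all assumptions made for illustration.

```stata
* Hypothetical toy data: one row per patient per survey phase.
clear
input patid phase age pain
1 1 54 3
1 2 54 -9
2 1 61 11
end

* Consistency check: each patient-phase pair should appear only once.
* isid stops with an error if the key is not unique.
isid patid phase

* Missing values: convert the assumed sentinel code -9 to the
* extended missing value .a so it is excluded from analyses.
mvdecode pain, mv(-9 = .a)

* Value labels: describe the scale endpoints and the missing code.
label define painlbl 0 "None" 10 "Worst possible" .a "Not answered"
label values pain painlbl

* Range check: flag nonmissing values outside the assumed 0-10 scale.
generate byte bad_pain = !inrange(pain, 0, 10) & !missing(pain)
list patid phase pain if bad_pain
```

Using an extended missing value such as .a, rather than plain missing, keeps a record of why the value is absent while still excluding it from estimation commands.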

The database is used to publish research about rheumatic diseases in peer-reviewed journals. Roughly 125 papers that rely on this data have been published. Because the data bank is complex and accessed by Stata commands, researchers using the NDB data typically work with a member of Wolfe’s staff. Michaud adds, “When research collaborators want to work with the data, we highly recommend that they use Stata for the analysis; serious medical students, residents, and fellows who take on research projects with me all use it.”

The NDB also maintains safety registries, long-term observational studies that monitor adverse events among patients receiving new drugs.

Getting started with Stata

Learning that many colleagues were switching to Stata, Wolfe was tempted when he discovered that data management in Stata would free him from the arduous task of writing SAS loops. Finding SAS to be expensive and bloated, and S-Plus and (later) R to be hard to learn and unwieldy for data management, he quickly fell for Stata in 1995. Wolfe writes, “Stata programs were one of the things that made Stata great for us.” Stata’s flexible programming language allowed for all sorts of contingencies in the data and facilitated reporting. Wolfe continues, “In a data bank that was always changing, Stata allowed us to do data management in a flexible, useful, cost-effective way that we couldn’t do otherwise.”

Wolfe also credits the Stata community, including the Statalist email group, the Stata Journal and Stata Technical Bulletin, and the Statistical Software Components (SSC) archive. He says that postings written by Nick Cox, a long-time member of the Stata community, taught him more about programming than he could have learned anywhere else. Wolfe summarizes the Stata community: “Today, between the manuals, the archives, and Googling Stata issues, Stata is a continuing teacher. I certainly learned what not to do.”

Stata’s role — data management, analysis, and reporting

The NDB relies heavily on Stata’s data-management facilities, including its support for ODBC connectivity, extensive macro manipulation features, and commands like egen and merge. Statistical techniques such as linear regression, logit modeling, fixed- and random-effects estimation, mixed modeling, and survival analysis are all carried out using Stata. Stata’s comprehensive graphics capabilities are a vital component of the NDB’s data-management and reporting tasks. Wolfe asks, “How could I live without Stata and the Statalist?”
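For readers unfamiliar with the commands named above, here is a small illustrative do-file that uses egen, merge, and a local macro. It is a sketch under stated assumptions and not taken from the NDB's programs: the datasets and variable names (patid, phase, haq, female) are hypothetical. The ODBC step is left out because it needs a data source configured on the reader's machine.

```stata
* Hypothetical visit-level data: a functional score by patient and phase.
clear
input patid phase haq
1 1 1.25
1 2 1.00
2 1 0.50
2 2 .
3 1 2.00
end

* egen: per-patient mean of the score (missing values are ignored).
bysort patid: egen mean_haq = mean(haq)

* Hold the visit data in a temporary file for merging.
tempfile visits
save `visits'

* Hypothetical patient-level data: one row per patient.
clear
input patid byte female
1 1
2 0
3 1
end

* merge: attach patient characteristics to every visit record.
* 1:m because patid is unique here but repeats in the visit file.
merge 1:m patid using `visits'
assert _merge == 3
drop _merge

* Macros: keep a variable list in a local macro and loop over it.
local vars haq mean_haq
foreach v of local vars {
    summarize `v'
}
```

The assert after merge is a cheap safeguard: the do-file stops at once if any patient or visit fails to match, rather than carrying unmatched records into later analyses.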


Research produced with NDB data has led to important insights and real changes in recommended treatments. Among their many findings, NDB researchers were the first to show that methotrexate reduced mortality in rheumatoid arthritis (RA) patients. They demonstrated or confirmed the association between RA and heart attacks, stroke, skin cancer, and lymphomas, and they showed that biologic therapy for RA does not increase cancer or cardiovascular risk. The NDB documented rates of work disability among RA patients and identified predictors of that disability. They published the first longitudinal study on joint replacement in RA patients, and they published the classic and definitive papers on the erythrocyte sedimentation rate in rheumatic diseases. Recently, they determined the rate of retinal toxicity of the common arthritis and lupus drug hydroxychloroquine, and their findings are now being turned into recommendations for treatment monitoring.

The NDB also used Stata to develop clinical assessments, including the HAQ-II functional questionnaire and the fibromyalgia diagnostic questionnaire. They showed that patient questionnaires could be used in clinical settings and were important in predicting outcomes. Their database has also allowed them to learn lessons that are more broadly applicable. For example, the NDB’s extensive questionnaires have led them to understand rates of nonresponse and what makes a survey question good or bad. As Wolfe describes the process, “The data bank is an epidemiologic textbook on data collection and errors, missing data, biases, causality, and on and on. Stata made it easy to learn these things.”


Stata plays a crucial role at the NDB under the direction of Fred Wolfe. From managing raw survey data to integrity checking, to advanced statistical analyses and report generation, Stata provides the tools the NDB needs to get the job done.

Brian Poi, Executive Editor and Senior Economist

Reproduced with permission from The Stata News Vol 25, No 1, March 2010