Course Content
Clinical Research Data Management Course

Variable names are the machine-readable identifiers used in databases, exports, and analysis scripts. Good variable names are short enough to be practical, but descriptive enough to be understood. They should follow a consistent naming convention across the study. Poor variable names create confusion during database development, data cleaning, analysis, and sharing.
A recommended convention is to use lowercase letters, underscores between words, and units where appropriate. Examples include age_months, weight_kg, height_cm, temp_c, visit_date, and malaria_rdt_result. Names such as Age, AGE1, wt, newweight, or patient status are less useful because they are inconsistent, unclear, or difficult to use in programming.
Variable labels complement variable names. The name temp_c maybe suitable for analysis, but the label should be more descriptive, such as “Measured axillary temperature at enrollment in degrees Celsius.” The label helps users understand the field, especially during data entry or review. A good data dictionary includes both names and labels.
Coding standards define how categorical responses are represented. For example, sex may be coded as 1 = Male, 2 = Female, and 3 = Unknown. A Yes/No variable may use 1 =Yes and 0 = No. Coding should be consistent across all forms. If one part of the database uses 1 = Yes, 0 = No and another uses 1 = No, 2 = Yes, errors are likely during analysis.
Coding should also consider missing and special values. It is usually better to distinguish true missing values from meaningful categories such as “not applicable,””unknown,””not done,”
refused,” or “unable to assess.” For example, if pregnancy status is missing because the question was not applicable to a male participant, that is different from missing because the field was accidentally skipped. The CRF and data dictionary should make such distinctions clear.
Standardized coding supports interoperability. If a research network uses similar codes across studies, data become easier to combine, compare, and reuse. This is particularly important for multicenter networks, disease registries, and long-term surveillance systems. Coding standards should be documented before database build and reviewed before data collection
begins.