The following is a collection of tips and guidelines for preparing datasets for analysis.

BEEP analysts address these issues throughout the course of a project, as required by various steps in the analysis. For a typical project, 75% to 90% of the total time is spent on data preparation. We are happy to work with datasets in any format, but recognize that investigators working on a tight budget may want to focus their BEEP time on analysis, rather than processing.

We have identified several places in the following non-sequential steps (denoted by "TIMESAVER") where investigators can substantially reduce BEEP data processing time in one of three ways:

  • Providing additional information
  • Making specific choices during the data collection process
  • Conducting pre-processing before handing the data to BEEP.

Tips for Data in Excel Spreadsheets

  • Ensure that no text or numbers are entered in any cells outside the data rows and columns (i.e., no headers or footnotes).
  • Ensure that variable labels are entered on Row 1 only. Do not repeat the label row elsewhere on the worksheet.
  • Do not skip rows or columns.
  • "Unhide" any hidden columns.
  • Ensure that no essential information is conveyed using text color, cell shading, or highlighting.
  • Ensure that no essential information is conveyed only on a spreadsheet filename or tab label. For example, if you are providing one file named "intervention.xls" and one file named "control.xls", also include a variable called "studyarm" in each file that identifies the subjects as control or intervention.
  • Do not include footnotes. If you have additional explanation of irregularities, data errors, data collection methods or other important information, please include in a separate Word document.
  • TIMESAVER: Scan your dataset to ensure that all cells are empty outside of your data rows and columns.
  • TIMESAVER: Separate text and numeric data into separate columns wherever possible (e.g., for recording a value of 200 mcg, put the "200" in one column and the label "mcg" in another).

Tips for Working with Multiple Datasets and Merging Datasets

  • If you have different datasets for different groups of subjects (e.g., control and intervention), please ensure that the order of variables, as well as variable names and formats, are identical between datasets.
  • TIMESAVER: For data provided in multiple datasets, ensure that identical variables are given identical names. From our point of view, "StudyID" and "Study_ID" are two distinct variables.
  • TIMESAVER: If you have collected the same information on study subjects over time, give these longitudinal variables similar names. If subject response to the question "Do you have a pet?" is entered in a variable named "Q24_baseline" in your baseline dataset, please enter subject response to the same question in your followup survey as "Q24_followup", not "Q44_followup", even if the order of survey questions changed.

Variable Labeling

  • For example, variable named "Gender" has 2 values, 1 and 2, which are labeled 1 = male and 2 = female.
  • Datasets received in other formats will be labeled, as needed, after conversion into Stata format. This step fully describes key variables, and minimizes error.
  • TIMESAVER: Provide a clearly labeled data dictionary (i.e., key to variable names and values) in a Word document. For survey data, providing a copy of the survey tool is helpful as well.

Variable Naming

  • No spaces in variable names
  • Variable names must not begin with a number or a symbol.
  • Variable names < 32 characters
  • TIMESAVER: Edit variable names to remove spaces.

Tips for ID Variables

  • Use numbers and letters only (preferably numbers only).
  • No spaces or special characters (hyphens, commas, etc.) allowed
  • Structure ID variables identically for all subjects. If the first subject in your dataset is assigned ID #12345A, the second subject in your dataset should not be assigned ID #431997.
  • IDs must be unique for each subject. Do not reuse ID numbers.
  • TIMESAVER: If your patient IDs contain special characters, create a new ID variable in the dataset that is numeric only. It's OK to leave the original ID variable in the dataset.

Tips for Numeric Variables

  • If needed, variable is converted from string (i.e., text format) into numeric format.
  • Indicate missing data by leaving the field blank. Do not enter "missing" or "NA."
  • If needed, text entered in numeric fields is changed to numeric or deleted.
  • Characters and symbols such as not allowed.
  • TIMESAVER: Scan your dataset to make sure no notes are entered in a field that should only contain numbers. Create a second variable to contain those notes if necessary.
  • TIMESAVER: If you want to indicate two values for a numeric variable, create a second variable rather than putting two values in a single field. It is not OK to type "4 and 7" in a field that otherwise contains single-digit numbers.

Tips for Date Variables

  • If needed, entries are standardized (e.g., if year is entered both as four-digit and two-digit number, convert to consistent number of digits to enable analysis)
  • TIMESAVER: Scan your dataset to ensure that dates are entered consistently. Consistency is more important than the actual format.

Tips for String (Text) Variables

  • If needed, correct misspelled entries to enable analysis.
  • Indicate missing data by leaving the field blank. Do not enter "missing" or "NA."
  • Avoid using commas! We may need to convert your data into a comma-delimited format, and extraneous commas disrupt this process.
  • If needed, edit text entries to create consistency, and to enable analysis.
  • TIMESAVER: Scan entries in text fields to find and fix misspellings. If variable is going to be used to group subjects into categories, ensure consistency throughout entries. From Stata's point of view, "Male," "male," "M," "m" and "boy" are all different categories. One space is treated differently than two spaces.

Tips for ICD and Other Medical Code Variables

  • Inspect variable to ensure that it only includes ICD codes. Delete or edit any notes, or text descriptions of conditions/procedures. Convert text descriptions into appropriate ICD codes.
  • If needed, standardize entries so that all codes are either three-digit or five-digit ICDs.
  • If needed, standardize use of "." within ICD codes.
  • If needed, look up text labels corresponding to ICD codes and label variable appropriately.
  • TIMESAVER: Provide a list of the conditions and/or procedures that correspond to the ICD codes in your dataset. Ideally, a spreadsheet with the ICD code in one column, and the description in an adjacent column.
  • TIMESAVER: If your analysis will group similar ICDs together, provide a list of how these groups should be defined.

Tips for Longitudinal Data or Datasets with More Than One Row per Subject

  • Ensure that patient/subject ID exists on every row, not just the first row in clustered data.
  • Ensure that there is a set of variables that uniquely identifies each entry (e.g., MRN and admission date).
  • Fill in values of static variables (e.g., gender, race) for all rows corresponding to a subject.
  • TIMESAVER: Fill in patient/subject ID on every row.
  • TIMESAVER: Ensure that there are no duplicate rows in the dataset.

Considerations for Providing BEEP with Data Updates or Corrections After the Analysis Process Has Begun

  • Ensure that updated/corrected data is otherwise identical to data previously provided to the BEEP Core, as all the above checks and corrections will be implemented using programming code. Changing variable contents and/or formatting will add extra time, as cleaning steps will have to be re-programmed to match the updated dataset.