IRB Guidance: Identifiability

This page addresses what makes data identifiable and what needs to be stripped from the data to make it de-identified.

Study teams often have questions about what makes data identifiable or what is considered truly de-identified data. This guidance discusses what it means for data to be identifiable under the Common Rule (45 CFR 46) and what needs to be stripped from the data to make it truly de-identified. The guidance also describes what it means for a data set to be coded, de-identified, or anonymous.

Identifiability under the Common Rule

An identifier includes any information that could be used to link research data with an individual subject.
  • The Common Rule defines "individually identifiable" to mean that the identity of the subject is, or may be, readily ascertained by the investigator or associated with the information.
  • A data set may be identifiable under the Common Rule if it contains: initials, address, zip code, phone number, gender, age, birth date, occupation, employer, racial or ethnic group, type of biopsy performed, date sample taken, diagnosis, primary care physician, referring physician, and genealogy.
  • Age, ethnicity/race, and gender may be identifiers under the Common Rule if fewer than 5 individuals share a particular cluster of traits.
  • Data may be identifiable if any combination of variables could potentially identify a subject.
  • Some of the identifiers listed above become less problematic if the sample size is large enough so that the potential identifiers could describe several individuals and thus cannot be linked to only one person. Conversely, if the sample size is small, the potential to identify an individual may increase, even in the absence of direct identifiers.
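
The small-group concern in the bullets above can be checked before sharing a data set. The following is a minimal sketch, not an IRB-endorsed procedure: it counts how many records share each combination of traits and flags combinations held by fewer than 5 individuals. The function name `small_cells`, the field names, and the sample records are hypothetical and used only for illustration.

```python
from collections import Counter

# Threshold taken from the "fewer than 5 individuals" bullet; treat as an assumption.
MIN_GROUP_SIZE = 5

def small_cells(records, traits, min_size=MIN_GROUP_SIZE):
    """Return each trait combination shared by fewer than min_size records."""
    counts = Counter(tuple(rec[t] for t in traits) for rec in records)
    return {combo: n for combo, n in counts.items() if n < min_size}

# Illustrative records only; not real data.
records = (
    [{"age_group": "30-39", "race": "A", "gender": "F"}] * 6
    + [{"age_group": "40-49", "race": "B", "gender": "M"}] * 2
)

flagged = small_cells(records, ["age_group", "race", "gender"])
print(flagged)  # {('40-49', 'B', 'M'): 2}
```

A non-empty result suggests that the listed combination of variables could point to very few people, so those variables may need to be coarsened or removed.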

What needs to be stripped from the data to be considered a de-identified data set?

The 18 identifiers listed below must be stripped from the data for it to be considered a de-identified data set. Inclusion of even one of the following identifiers makes a data set identifiable. However, there are levels of identifiability. The following are considered limited identifiers: geographic area smaller than a state, elements of dates (date of birth, date of death, dates of clinical service), and age over 89. The remaining identifiers in the list below are considered direct identifiers. If the data set contains any limited identifiers, but none of the direct identifiers, it is considered a limited data set and not a de-identified data set.

  • Names
  • All geographic subdivisions smaller than a state
  • All elements of dates (except year) for dates that are directly related to an individual, including birth date, admission date, discharge date, death date, and all ages over 89 and all elements of dates (including year) indicative of such age, except that such ages and elements may be aggregated into a single category of age 90 or older
  • Telephone numbers
  • Vehicle identifiers and serial numbers, including license plate numbers
  • Fax numbers
  • Device identifiers and serial numbers
  • Email addresses
  • Web Universal Resource Locators (URLs)
  • Social security numbers
  • Internet Protocol (IP) addresses
  • Medical record numbers
  • Biometric identifiers, including finger and voice prints
  • Health plan beneficiary numbers
  • Full-face photographs and any comparable images (including video)
  • Account numbers
  • Any other unique identifying number, characteristic, or code 
  • Certificate/license numbers
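
The date and age item in the list above is the one most often handled by transforming values rather than deleting them. The sketch below is illustrative only and assumes ages are recorded as whole years. It keeps only the year of a date, groups ages over 89 into one category, and suppresses a birth year when the year itself would reveal an age over 89. All function names are hypothetical.

```python
from datetime import date

def generalize_date(d):
    """Keep only the year of a date that is directly related to an individual."""
    return d.year

def generalize_age(age):
    """Aggregate all ages over 89 into a single '90+' category."""
    return "90+" if age > 89 else str(age)

def generalize_birth_year(birth_date, age):
    """Return the birth year, or None when the year would indicate an age over 89."""
    return None if age > 89 else birth_date.year

# Illustrative values only.
print(generalize_date(date(2018, 3, 14)))
print(generalize_age(45), generalize_age(92))
print(generalize_birth_year(date(1925, 1, 1), 93))
```

Whether a year alone is acceptable for a given study still depends on sample size and the other variables in the data set.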

Be mindful of indirect identifiers. These are not direct identifiers, but depending on the size of your population, they may become identifiers either on their own or when combined with each other. The IRB will take indirect identifiers and sample size into consideration when determining whether a data set is truly de-identified:

  • Names or other identifiers of the individual’s relatives, employers or household members 
  • Direct quotes taken from websites or social media sites, as these can be searched and traced back to their original setting, and direct quotes that, if published, could identify or be traced back to a participant
  • Medical conditions (e.g., the 40-year-old male who had a brain tumor), hospitalizations, accidents
  • Job titles, number of years with an employer, education, income
  • Gender, race, ethnicity, age, marital status, household composition, number of children, place of birth, etc.
  • Dates: marriage, divorce, graduation, arrest, crime, trial, or conviction
  • Non-randomly assigned ID numbers: assigning the first participant ID #1, or making codes based on personal characteristics such as birthdates

Coded data

This refers to data which have been stripped of all direct subject identifiers, but in which each record has its own study ID or code that is linked to identifiable information such as name or medical record number. The linking file must be kept separate from the coded data set. This linking file may be held by someone on the study team (e.g. the PI) or by someone outside of the study team (e.g. a researcher at another institution). A coded data set may include limited identifiers. Of note, the code itself must not contain identifiers such as subject initials or medical record number.
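
One way to produce a coded data set is sketched below. This is a hedged illustration, not a required method: it assigns each record a random study ID, moves the direct identifiers into a separate linking table, and leaves only the study ID in the coded records. The function `code_records` and the field names are hypothetical; `secrets.token_hex` from the Python standard library is used so that the code carries no information about enrollment order or personal characteristics.

```python
import secrets

def code_records(records, direct_fields):
    """Split records into a coded data set and a separate linking table."""
    coded, linking = [], {}
    for rec in records:
        # Random 8-character hex code; regenerate in the unlikely event of a collision.
        study_id = secrets.token_hex(4)
        while study_id in linking:
            study_id = secrets.token_hex(4)
        # The linking table holds the direct identifiers, keyed by study ID.
        linking[study_id] = {f: rec[f] for f in direct_fields}
        # The coded record keeps everything else plus the study ID.
        row = {k: v for k, v in rec.items() if k not in direct_fields}
        row["study_id"] = study_id
        coded.append(row)
    return coded, linking

# Illustrative records only; not real data.
records = [
    {"name": "Participant One", "mrn": "000001", "score": 12},
    {"name": "Participant Two", "mrn": "000002", "score": 17},
]
coded, linking = code_records(records, direct_fields=["name", "mrn"])
```

In practice the two outputs would be saved to different locations with different access controls, since the linking table is what makes the coded data re-identifiable.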

De-identified data

This refers to data which have been stripped of all subject identifiers, including all 18 identifiers listed above. This means that there can be no data points that are considered limited identifiers, i.e. geographic area smaller than a state, elements of dates (date of birth, date of death, dates of clinical service), and age over 89. If the data set contains any limited identifiers, it is considered a limited data set. If the data include an indirect link to subject identifiers (e.g. via coded ID numbers), then the data are considered by the IRB to be coded, not de-identified.
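
The distinctions among identifiable, coded, limited, and de-identified data can be summarized as a small decision function. This is a rough sketch under stated assumptions, not an official determination tool: the field labels are hypothetical shorthand for the 18 identifiers, and the choice to report a data set with both a code link and limited identifiers as "coded" is an assumption based on the statement that a coded data set may include limited identifiers. Indirect identifiers and sample size, which the IRB also considers, are not modeled here.

```python
# Hypothetical labels for the direct identifiers in the list of 18.
DIRECT = {
    "name", "telephone", "fax", "email", "ssn", "medical_record_number",
    "health_plan_number", "account_number", "certificate_license_number",
    "vehicle_identifier", "device_identifier", "url", "ip_address",
    "biometric", "full_face_photo", "other_unique_identifier",
}
# Hypothetical labels for the limited identifiers.
LIMITED = {"geographic_subdivision", "date_element", "age_over_89"}

def classify(fields_present, has_code_link=False):
    """Return a rough category for a data set based on the fields it contains."""
    fields = set(fields_present)
    if fields & DIRECT:
        return "identifiable"
    if has_code_link:
        return "coded"  # assumption: a code link takes precedence over limited identifiers
    if fields & LIMITED:
        return "limited data set"
    return "de-identified"

print(classify({"diagnosis_category"}))
print(classify({"date_element"}))
print(classify({"date_element"}, has_code_link=True))
print(classify({"name", "date_element"}))
```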

Anonymous data

Anonymous data are essentially the same as de-identified data: data which have been stripped of all subject identifiers and which have no indirect links to subject identifiers. There should be no limited identifiers in an anonymous data set.

Keywords: anonymous, de-identified, identifiable, coded
Doc ID: 76643
Owner: Casey P.
Group: Education and Social/Behavioral Science IRB
Created: 2017-09-19 07:58 CDT
Updated: 2019-06-10 10:48 CDT
Sites: Education and Social/Behavioral Science IRB, VCRGE and Graduate School