Overview

What is “private” or “public” data in the context of Internet-based research and privacy?

As the use of online and social media platforms has expanded over the years, so has the interest of researchers in leveraging online data for research purposes. In the context of human subjects research, however, it is important to determine whether online information is considered “public” or “private.”

Private information is defined as information which has been provided for specific purposes by an individual and which the individual can reasonably expect will not be made public (e.g., a medical or school record).

Data are publicly available when the general public can obtain them and they are readily available to anyone (without special permission/application) regardless of occupation, purpose, or affiliation. This excludes data accessed through the ‘Deep Web’, which contains online information that is not searchable by standard search engines. If data will be accessed through the ‘Deep Web’, the study will likely need IRB review.

Public sites generally fall into one or more of the following categories.

  • Sites containing information that, by law, is considered "public." In most cases information from these sites will be available without restriction, although access to the information may require payment of a fee. Many federal, state, and local government sites are included in this category: property tax records, birth and death records, real estate transactions, certain court records, voter registration and voting history records, etc.
  • News, entertainment, classified, and other information-based sites where information is posted for the purpose of sharing with the public.
  • Open access data repositories, where information has been legally obtained (with IRB approval if necessary) and is made available with minimal or no restriction.
  • Discussion forums that are freely accessible to any individual with Internet access and that do not involve terms of access or terms of service that would restrict research use of the information.

If individuals voluntarily post or otherwise provide information on the Internet, such information should be considered public unless existing law and the privacy policies and/or terms of service of the entity/entities receiving or hosting the information indicate that the information should be considered “private.” If it is determined that the data were not intended for public use, even if the data are technically available to the public, the data should be considered private.

Note: Information posted on the Internet might be regularly mined and shared with authorized third parties (e.g., vendors or service providers). Third-party sites have their own terms of service, and their level of access to the data (e.g., identifiable data) and their security measures may not match those of the first-party website.

Data accessible only through special permission are generally not considered public. However, if steps are required to access data (e.g., registration/login, payment, etc.), but access is not restricted beyond these steps (e.g., anyone who creates a username and password can access the data), the data may qualify as publicly available. Regardless of privacy controls, researchers generally cannot assume that online data will remain private.

Examples

Below are general examples of online information with varying degrees of access, along with considerations for CPHS review.

Scenario #1: “Public” data

Access to online data does not require joining or logging into a platform (i.e., with a username and password), nor does it require users to affirm that they have any specific characteristics. Examples might be publicly accessible forums or comments sections where users have no expectation of privacy (e.g., New York Times, YouTube, etc.), Twitter posts not set to private, and Facebook public profiles found through Google searches. This category has the strongest claim to be considered “public.” Projects that fall under this category are unlikely to be human subjects research and thus will not require CPHS review. Researchers should contact OPHS (ophs@berkeley.edu) for consultation.

Scenario #2: “Might be private” data

Access to online data requires joining or logging into a platform but does not require the user to affirm any specific characteristics beyond this. An example might be Facebook posts available to any Facebook account holder, but not available to individuals without a Facebook account. Projects that fall under this category might be human subjects research. Researchers should contact OPHS (ophs@berkeley.edu) for consultation.

Scenario #3: “Private” data

Access to online data requires both joining or logging into a platform and affirming that the user has specific characteristics (e.g., certain health conditions). An example might be forums, chats, or support groups where users must register as belonging to a certain group (e.g., cancer survivor). This category has the strongest claim to be considered “private,” because the individuals whom the data are about have an expectation of privacy. Projects that fall under this category are likely to be human subjects research. Human subjects research must be reviewed and either determined to be exempt (45 CFR 46.104(d)(4)(i)) or obtain CPHS approval before the research begins.

Note: There may be instances where “publicly available” datasets are combined with other datasets containing identifiable private information, and in turn result in the identification of subjects. Projects of this nature will meet the threshold of human subjects research and require CPHS review.
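
For researchers who screen many data sources at once, the three scenarios above can be sketched as a simple decision rule. The following Python snippet is an illustrative sketch only, not an official CPHS tool: the names AccessConditions, Scenario, and classify are hypothetical, and the rule considers only the two access conditions named in the scenarios. It does not account for terms of service, applicable law, or linkage with identifiable datasets, all of which can change the outcome.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Scenario(Enum):
    """Labels mirroring the three scenarios in this guidance (hypothetical names)."""
    PUBLIC = "Scenario #1: public"
    MIGHT_BE_PRIVATE = "Scenario #2: might be private"
    PRIVATE = "Scenario #3: private"


@dataclass(frozen=True)
class AccessConditions:
    """The two access conditions the scenarios distinguish (assumed field names)."""
    requires_login: bool                      # must join or log into a platform
    requires_affirming_characteristics: bool  # must affirm e.g. a health condition


def classify(access: AccessConditions) -> Optional[Scenario]:
    """Map access conditions to a scenario; None means no scenario describes the case."""
    if access.requires_login and access.requires_affirming_characteristics:
        return Scenario.PRIVATE
    if access.requires_login:
        return Scenario.MIGHT_BE_PRIVATE
    if not access.requires_affirming_characteristics:
        return Scenario.PUBLIC
    # Affirmation without login is not described by any of the three scenarios.
    return None


# Example: posts visible to any account holder, with no group affirmation.
print(classify(AccessConditions(True, False)).value)
```

A result from a sketch like this is a starting point for a consultation with OPHS, not a determination of whether a project is human subjects research.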

References

SACHRP. (2013, March 13). Considerations and Recommendations Concerning Internet Research and Human Subjects Research Regulations, with Revisions.

Gelinas, L. (2020, April 22). Differentiating “Public” and “Private” Internet Spaces in IRB Review. [Blog].

Stanford Encyclopedia of Philosophy. (Rev. 2021, January 12). Internet Research Ethics.

Williams, L. (2021, October 7). Deep Web vs Dark Web: Must Know Differences.