# Privacy and Redaction

## Release Principle

BenSyc preserves conversational language needed for research while reducing avoidable identification risk. Public social-media text can still contain sensitive personal information, allegations, and contextual details that enable indirect identification.

## Automatically Redacted

The release builder replaces the following identifiers when it detects them:

- phone numbers;
- URLs;
- email addresses;
- Reddit usernames and explicitly labeled usernames;
- explicitly labeled private-person names;
- explicitly labeled addresses;
- ride and vehicle identifiers.

Typed placeholders preserve conversational structure without retaining the identifier.
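
As an illustration of typed-placeholder replacement, the following is a minimal sketch and not the release builder's actual code. The regular expressions, the `[LABEL]` placeholder format, and the names `PATTERNS` and `redact` are all assumptions made for this example; the real rules and placeholder spelling are not specified here.

```python
import re

# Hypothetical patterns and labels, for illustration only.
# Order matters: URLs and emails are replaced before usernames and phones.
PATTERNS = [
    ("URL", re.compile(r"https?://\S+|www\.\S+")),
    ("EMAIL", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),
    ("USERNAME", re.compile(r"(?<![\w/])/?u/[A-Za-z0-9_-]{3,20}")),
    ("PHONE", re.compile(r"(?<!\w)\+?\d[\d\s-]{7,}\d(?!\w)")),
]


def redact(text):
    """Replace each detected identifier with a typed placeholder.

    Returns the redacted text and a dict of replacement counts per type.
    """
    counts = {}
    for label, pattern in PATTERNS:
        text, n = pattern.subn(f"[{label}]", text)
        if n:
            counts[label] = n
    return text, counts


redacted, counts = redact(
    "Call +880 1711-123456 or mail rahim@example.com, ask u/some_user."
)
```

Because each placeholder carries its type, the sentence keeps its shape ("Call [PHONE] or mail [EMAIL], ...") while the identifier itself is dropped. Simple patterns like these will miss many cases, which is why the manual review step exists.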

## Manual Review

`audits/manual_review_queue.csv` identifies examples that need human review because they contain sensitive-context indicators, automatic redactions, evidence issues, or ambiguous annotations.

Automatic scanning cannot reliably identify every Bengali/Banglish name, address, institution, allegation, or combination of quasi-identifiers. The authors chose public access for this release, but reviewers should keep working through the manual queue and publish corrections when needed.

## Prohibited Uses

Users must not attempt to:

- identify, contact, locate, or investigate source users or people mentioned in the text;
- reconstruct removed identifiers;
- link examples to Reddit posts or other datasets;
- use allegations in the dataset as factual claims about individuals.

## Audit Artifacts

- `audits/redaction_report.csv`: count and type of automatic replacements per example.
- `audits/manual_review_queue.csv`: examples requiring human review.
- `audits/release_statistics.json`: aggregate release and privacy statistics.
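
A per-example report such as `audits/redaction_report.csv` can be aggregated into totals per replacement type. The sketch below assumes hypothetical column names (`example_id`, `redaction_type`, `count`); the actual schema is not documented here, so check the file header and adjust the names before use.

```python
import csv
import io
from collections import Counter


def totals_by_type(rows):
    """Sum replacement counts per redaction type.

    `rows` is any iterable of dicts, e.g. a csv.DictReader.
    The column names are assumptions, not the documented schema.
    """
    totals = Counter()
    for row in rows:
        totals[row["redaction_type"]] += int(row["count"])
    return totals


# Inline sample standing in for the real file; values are made up.
sample = io.StringIO(
    "example_id,redaction_type,count\n"
    "ex1,URL,2\n"
    "ex1,EMAIL,1\n"
    "ex2,URL,1\n"
)
totals = totals_by_type(csv.DictReader(sample))
```

To run it on the real report, replace `sample` with `open("audits/redaction_report.csv", newline="", encoding="utf-8")`.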
