"Ah," you think to yourself. "I probably shouldn't be seeing this."
How often in your career have you muttered these words? With so much personal data flowing through applications, it is all too easy to stumble upon sensitive information while debugging an issue. To prevent privacy leaks while retaining useful information, developers need a system that finds and redacts only the sensitive parts of each data sample. Recent breakthroughs in natural language processing (NLP) have made it feasible to detect and redact personally identifiable information (PII) in previously unseen datasets.
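To make "finds and redacts only the sensitive parts" concrete, here's a minimal sketch of entity-based redaction using spaCy's off-the-shelf English pipeline. This is purely illustrative and not the model described in this post, which is custom-trained and covers far more PII types than a generic named-entity recognizer:

```python
import spacy

# Requires: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def redact(text: str) -> str:
    """Replace each detected entity span with a tag like <PERSON>."""
    doc = nlp(text)
    # Walk entities right-to-left so earlier character offsets stay valid.
    for ent in reversed(doc.ents):
        text = text[:ent.start_char] + f"<{ent.label_}>" + text[ent.end_char:]
    return text

print(redact("User Jane Doe reported an error from Berlin at 10 p.m."))
# e.g. -> "User <PERSON> reported an error from <GPE> at <TIME>."
```

The key property is that only the flagged spans are removed; the surrounding context, which is what makes a log line useful for debugging, stays intact.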
What was my role in this?
While working on a Kubernetes observability tool called Pixie, I developed a machine learning architecture that detects and redacts 60 types of PII in logs and network data, helping developers track the flow of PII through their applications, detect data exfiltration, and better comply with privacy regulations. To help train the model, I built a synthetic data generator that draws from over 3,800 APIs to produce synthetic network traffic.
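As a rough illustration of the generator's shape, here is a hedged sketch: hypothetical request templates containing PII placeholders are filled with fake values (via the Faker library), while ground-truth labels are recorded for training. The templates and field names below are invented for this example; the real generator samples from the schemas of 3,800+ APIs:

```python
import json
import random
from faker import Faker  # pip install faker

fake = Faker()

# Hypothetical request templates with PII placeholders (illustrative only).
TEMPLATES = [
    {"method": "POST", "path": "/v1/users", "body": {"name": "{name}", "email": "{email}"}},
    {"method": "GET", "path": "/v1/sessions?ip={ipv4}", "body": {}},
]

# Map each placeholder to a Faker generator for that PII type.
FILLERS = {"name": fake.name, "email": fake.email, "ipv4": fake.ipv4}

def synthesize_request():
    """Fill a random template with fake PII and record ground-truth labels."""
    raw = json.dumps(random.choice(TEMPLATES))
    labels = []
    for key, gen in FILLERS.items():
        placeholder = "{" + key + "}"
        while placeholder in raw:
            value = gen()
            raw = raw.replace(placeholder, value, 1)
            labels.append((key, value))  # (PII type, value) pairs for supervision
    return raw, labels

request, labels = synthesize_request()
print(request)
print(labels)
```

Because every injected value is known at generation time, each synthetic sample comes with exact labels for free, which is what makes this approach practical for training a detector at scale.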
It's 10 p.m. and you're on call. A few minutes ago, you received a Slack message about performance issues affecting users of your application. You sigh…