
Knowing your data

At Teleskope, we talk to a lot of security leaders. And we always ask them how they rank their priorities across the numerous areas of security – appsec, DLP, endpoint, cloud, zero trust… What we hear a lot is that data security has recently become a big focus for organizations, and often that’s because they haven’t invested much in data security to date. There are a lot of reasons this might be: it’s a newer field with products that have generally been sub-par, scrutiny has increased with the rise of AI and the prevalence of data leaks, or small security teams simply have to pick and choose where to focus. But above all, I think teams recognize that they have collected, processed, stored, and transmitted enormous amounts of data, including personal and sensitive data, and that this in and of itself creates substantial risk.

Figure: Volume of data/information created, captured, copied, and consumed worldwide. Source: Statista

Data Sprawl

As companies collect more and more data, it naturally proliferates across systems, products, and devices. People tend to copy and transmit data because it's often easier than modifying what's already available. You might think you know all the data you have, but in reality, that's rarely the case. Your data science team might have created multiple copies of the same table for analysis, your marketing team could have downloaded data into Google Sheets, and your engineering team might be sending data to third parties. So, how can you get a full picture of your data footprint, and ensure that you understand what data you have, where it lives, and how it’s being used?

Data Discovery

The first step to understanding your data is creating a comprehensive inventory of all the data that exists across your different systems. Effective data discovery requires two things:

  1. Discover All Tools and Data Stores: Begin by identifying all the tools and data stores your organization uses. This can involve reviewing bills to see what services you pay for, examining network logs, inspecting code, and even manual checks. This can be painful at first, and does require some ongoing maintenance.
  2. Automate Connection and Crawling: Once you’ve identified the tools and data stores, connect to them automatically. Crawl and gather data, then store this information in a centralized metadata system.

This approach ensures you have a complete and up-to-date inventory of your data landscape, making it easier to manage and protect. And creating processes and controls around onboarding and integrating new people, tools, and systems can make this relatively painless to maintain.
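To make the automated crawling step concrete, here is a minimal sketch of what crawling a single AWS account might look like, using boto3 to enumerate S3 buckets and RDS instances and record them in a central metadata file. The file name and the fields captured are illustrative; a real inventory would cover far more services and feed a proper catalog.

```python
"""Minimal sketch of step 2: connect to a known AWS account and record
which data stores exist in a central metadata inventory. Assumes boto3
credentials are already configured; resource names are illustrative."""
import json
import boto3

def crawl_aws_account() -> list[dict]:
    inventory = []

    # S3: every bucket is a potential data store worth classifying later.
    s3 = boto3.client("s3")
    for bucket in s3.list_buckets()["Buckets"]:
        inventory.append({
            "type": "s3_bucket",
            "name": bucket["Name"],
            "created": bucket["CreationDate"].isoformat(),
        })

    # RDS: relational stores that will need table- and column-level crawling.
    rds = boto3.client("rds")
    for db in rds.describe_db_instances()["DBInstances"]:
        inventory.append({
            "type": "rds_instance",
            "name": db["DBInstanceIdentifier"],
            "engine": db["Engine"],
        })

    return inventory

if __name__ == "__main__":
    # Persist to a central metadata file; in practice this would feed a catalog.
    with open("data_inventory.json", "w") as f:
        json.dump(crawl_aws_account(), f, indent=2, default=str)
```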

Data Classification

Once you have a comprehensive inventory of your data, the next step is to identify and understand where your crown jewels reside: your sensitive, proprietary, or customer-related data. If you have billions of files, you need to be able to find the proverbial needle in the haystack to ensure sensitive data is properly secured. This is essential because while some data might be safe to share or make public, sensitive data must be locked down to reduce attack surface and potential exposure.

But how do you tackle classification across billions of files, hundreds of thousands of tables, and dozens of third-party services? You could take the manual approach, which involves using table metadata and high-level folder names to infer the contents of each data bucket. But this method doesn't scale and lacks accuracy. When new tables or files are added, teams must manually classify them. Moreover, ambiguous names like "data" or "folder1" provide no insight into the type of data they contain, making manual inspection necessary, but impractical.

The only scalable solution is to automatically scan your data and classify it in two key ways:

  1. Data Elements: Determine what types of personal and sensitive data reside in each file, table, or any other piece of data living in your cloud or on-prem data stores, or with third-party vendors. This can be accomplished using a combination of strategies, such as regex, machine learning, and validation algorithms. It might sound straightforward, but in reality it's a very difficult problem, as you must balance accuracy and throughput (a minimal sketch of this kind of check follows this list).
  2. Data Subjects: Quickly and accurately identifying a phone number is essential, but not always enough. To reduce false positives, a classification engine must also determine who this data is about. Is that phone number associated with a customer, an internal employee, or someone else? Is the country name “Canada” tied to a user or is it related to server locations? Understanding data subjects is crucial to figuring out whether the data is truly PII, or whether it can be accessed and used freely.
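As a small, simplified illustration of both ideas: regexes propose candidate matches, a validation step (here a Luhn check) filters false positives for card numbers, and neighboring column names give a crude data-subject hint. The patterns, labels, and subject rules below are toy examples, not a production rule set.

```python
"""Illustrative sketch of element- and subject-level classification."""
import re

PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{8,14}\d"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def luhn_valid(number: str) -> bool:
    digits = [int(d) for d in re.sub(r"\D", "", number)][::-1]
    total = sum(digits[0::2]) + sum(sum(divmod(d * 2, 10)) for d in digits[1::2])
    return total % 10 == 0

def classify_value(value: str) -> list[str]:
    labels = []
    for label, pattern in PATTERNS.items():
        if pattern.search(value):
            # Validation step: a 13-16 digit string is only a card if Luhn passes.
            if label == "credit_card" and not luhn_valid(value):
                continue
            labels.append(label)
    return labels

def guess_subject(column_names: list[str]) -> str:
    # Crude subject inference from the surrounding schema context.
    joined = " ".join(column_names).lower()
    if "employee" in joined or "staff" in joined:
        return "employee"
    if "customer" in joined or "user" in joined:
        return "customer"
    return "unknown"

# Example: a sampled value and its neighboring columns from one table.
print(classify_value("+1 (415) 555-0137"))               # ['phone']
print(guess_subject(["customer_id", "phone", "email"]))  # 'customer'
```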

As you build or evaluate classification tools, it's important to measure:

  1. Accuracy: An accurate classification engine detects all the PII and sensitive data that’s out there, but also doesn’t produce more false positives than true positives. For example, is a column called "number" a phone number, or does it represent some other number that isn’t PII? The same goes for files: is a variable named DL in your Go file a driver’s license number, or just a random variable name? This is the hardest part, and is where most classification systems begin to break down (a sketch of measuring precision and recall follows this list).
  2. Throughput: A performant classification engine must scale to scan terabytes or even petabytes of data across cloud, on-premises, and third-party systems, all in a reasonable amount of time and with an acceptable compute cost. This is also a huge obstacle: while you can employ many different sampling strategies, you still might have hundreds of thousands of tables to classify, and billions of files that have no relation to one another.
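One way to keep the accuracy question honest is to score the engine against a hand-labeled sample and track precision and recall per entity type. The sketch below assumes a labeled sample and a classify() function exist elsewhere; both are placeholders for whatever engine you build or buy.

```python
"""Sketch of measuring classifier accuracy on a hand-labeled sample."""
from collections import Counter

def evaluate(samples: list[tuple[str, set[str]]], classify) -> dict[str, dict[str, float]]:
    tp, fp, fn = Counter(), Counter(), Counter()
    for value, expected_labels in samples:
        predicted = set(classify(value))
        for label in predicted & expected_labels:
            tp[label] += 1
        for label in predicted - expected_labels:
            fp[label] += 1          # false positive: flagged but not actually PII
        for label in expected_labels - predicted:
            fn[label] += 1          # false negative: real PII that was missed
    metrics = {}
    for label in set(tp) | set(fp) | set(fn):
        precision = tp[label] / (tp[label] + fp[label]) if tp[label] + fp[label] else 0.0
        recall = tp[label] / (tp[label] + fn[label]) if tp[label] + fn[label] else 0.0
        metrics[label] = {"precision": precision, "recall": recall}
    return metrics
```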

Configuration and Access Risk

Once you’ve identified where your sensitive data resides, whether in a data store or a third-party system, it’s crucial to understand the configurations surrounding that data. Simply knowing the location of your sensitive data is not enough; you need to assess the environment in which it lives to determine its risk level, and figure out what the remediation steps need to be.

Key Configuration Questions:

  1. Encryption: Is the data encrypted at rest? Are the contents encrypted? Encryption is a fundamental security measure that ensures data remains protected even if unauthorized access occurs. Without encryption, sensitive data is vulnerable to exposure, making it imperative to verify and enforce robust encryption standards.
  2. Network Configurations: Are the network configurations overly permissive? Overly permissive network settings can expose sensitive data to unauthorized access. Assessing and tightening network configurations helps prevent potential breaches by ensuring that only authorized personnel and systems can access the data.
  3. Access Configurations: Is the data publicly shared? Can anyone with a link view the information? Is the bucket policy public? Ensuring that data is not inadvertently shared publicly is necessary to prevent easily avoidable breaches (a minimal check for this and for encryption is sketched after this list).
  4. Access: Who has access to the data? Are the users internal or external? When did they last read or write to the data? Is this data store still being accessed and therefore needed? This not only helps enforce least privilege principles to minimize access to sensitive data, but it can also uncover entire data stores that no longer need to exist. Why store data that you don’t use, incurring unnecessary costs and risks?
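As one hedged example of automating the encryption and access-configuration questions for S3, the sketch below uses boto3 to flag buckets without default encryption and buckets whose policy makes them public. It only reports findings; remediation and checks for other services and the network and access questions above are out of scope here.

```python
"""Sketch of basic S3 configuration checks. Assumes boto3 credentials."""
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def check_bucket(name: str) -> dict:
    finding = {"bucket": name, "encrypted": True, "public": False}

    # Question 1: is default encryption configured on the bucket?
    try:
        s3.get_bucket_encryption(Bucket=name)
    except ClientError as e:
        if e.response["Error"]["Code"] == "ServerSideEncryptionConfigurationNotFoundError":
            finding["encrypted"] = False

    # Question 3: does the bucket policy make the data public?
    try:
        status = s3.get_bucket_policy_status(Bucket=name)
        finding["public"] = status["PolicyStatus"]["IsPublic"]
    except ClientError:
        pass  # no bucket policy at all, so the policy cannot make it public

    return finding

if __name__ == "__main__":
    for bucket in s3.list_buckets()["Buckets"]:
        print(check_bucket(bucket["Name"]))
```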

Data Lineage

Data discovery and classification will give you a solid grasp of your data landscape. It’s likely that those exercises led you to uncover PII in a location where it shouldn’t be, and you can take action to address those issues. But who’s to say that more PII won’t end up there again? Data lineage provides visibility into data flows across your systems, mapping where data originated and where it ended up. Knowing how data flows allows you to identify and address unauthorized or unintentional data movements, and can also help you take real corrective action. This means you can "kill it at the source," preventing future occurrences of the same issue. By tracing the data back to its origin, you can address the root cause, ensuring similar data doesn't end up in the wrong place again.
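As a simplified illustration, lineage can be modeled as a directed graph from each dataset to the datasets it was derived from; walking upstream from the offending location surfaces the producers you actually need to fix. The edges below are made up for the example; in practice they would come from query logs, ETL metadata, or pipeline instrumentation.

```python
"""Sketch of 'kill it at the source' via a toy lineage graph."""

# dataset -> datasets it was derived from (illustrative names)
LINEAGE = {
    "s3://analytics/exports/users.csv": ["warehouse.analytics.users_wide"],
    "warehouse.analytics.users_wide": ["warehouse.raw.users", "warehouse.raw.phones"],
    "warehouse.raw.users": [],
    "warehouse.raw.phones": [],
}

def upstream_sources(dataset: str) -> set[str]:
    """Return the root datasets that ultimately feed `dataset`."""
    parents = LINEAGE.get(dataset, [])
    if not parents:
        return {dataset}
    roots = set()
    for parent in parents:
        roots |= upstream_sources(parent)
    return roots

# PII was found in the exported CSV; fix the pipelines that produce it.
print(upstream_sources("s3://analytics/exports/users.csv"))
# {'warehouse.raw.users', 'warehouse.raw.phones'}
```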

Data Ownership

Let’s say you identify PII in a place where it shouldn’t be. Who should you ask or notify to gain clarity? This is where data ownership becomes crucial. Without knowing which team is responsible for the data, it’s impossible to understand why the data is there or who to contact when something goes wrong. Data ownership is often assigned to specific teams rather than individuals; those teams are responsible for the data's lifecycle, including its security and compliance, provide the necessary context to manage the data properly, and can take action to resolve any issues. Knowing the data owner ensures accountability and helps maintain data integrity and security.
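One common way to make ownership queryable, assuming your organization enforces a tagging convention, is to require an owner tag on every data store and look it up when an issue surfaces. The sketch below does this for S3; the "owner" tag name is a convention of this example, not an AWS requirement.

```python
"""Sketch of looking up a data store's owning team from resource tags."""
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def bucket_owner(name: str) -> str | None:
    try:
        tags = s3.get_bucket_tagging(Bucket=name)["TagSet"]
    except ClientError:
        return None  # untagged bucket: no owner on record, itself a finding
    return next((t["Value"] for t in tags if t["Key"].lower() == "owner"), None)
```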

Is Understanding Your Data Enough?

Understanding your data landscape and gaining actionable insights into its security and privacy risks is crucial for any company that collects, stores and transmits PII and sensitive information. However, insights without remediation and actions are not enough; they are just the beginning. To truly and continuously protect your data, you need to take concrete, automated steps to remediate identified issues, and implement necessary security measures. You can’t have data security without understanding your data, but insights are just the first piece of the data security puzzle.

About Our Blog

We’re on a mission to build not only the best-in-class data protection platform, but also a transparent one. Stay tuned for more in-depth, technical blogs about all things data security and governance.

Introduction

Kyte unlocks the freedom to go places by delivering cars for any trip longer than a rideshare. As part of its goal to re-invent the car rental experience, Kyte collects sensitive customer data, including driver’s licenses, delivery and return locations, and payment information. As Kyte continues to expand its customer base and implement new technologies to streamline operations, the challenge of ensuring data security becomes more intricate. Data is distributed across both internal cloud hosting and third-party systems, making compliance with privacy regulations and data security paramount. Kyte initially attempted to address data labeling and customer data deletion manually, but this quickly became an untenable solution that could not scale with their business. Building such solutions in-house didn’t make sense either, as they would require constant updates to accommodate growing data volumes, which would distract their engineers from their primary focus of transforming the rental car experience.


Continuous Data Discovery and Classification

In order to protect sensitive information, you first need to understand it, so one of Kyte’s primary objectives was to continuously discover and classify their data at scale. To meet this need, Teleskope deployed a single-tenant environment for Kyte and integrated their third-party SaaS providers and multiple AWS accounts. Teleskope discovered and crawled Kyte’s entire data footprint, encompassing hundreds of terabytes in their AWS accounts across a variety of data stores. Teleskope instantly classified Kyte’s entire data footprint, identifying over 100 distinct data entity types across hundreds of thousands of columns and objects. Beyond classifying data entity types, Teleskope also surfaced the data subjects associated with the entities, enabling Kyte to categorize customer, employee, surfer, and business metadata separately. This automated approach ensures that Kyte maintains an up-to-date data map detailing the personal and sensitive data throughout their environment, helping them keep that environment structured and secure.

Securing Data Storage and Infrastructure

Another critical aspect of Kyte’s Teleskope deployment was ensuring the secure storage of data and maintaining proper infrastructure configuration, especially as engineers spun up new instances or made modifications to the underlying infrastructure. While crawling Kyte’s cloud environment, Teleskope continuously analyzed their infrastructure configurations to ensure their data was secure and aligned with various privacy regulations and security frameworks, including CCPA and SOC 2. Teleskope helped Kyte identify and fortify unencrypted data stores, correct overly permissive access, and clean up stale data stores that were no longer being accessed. With Teleskope deployed, Kyte’s team will be alerted in real time if one of these issues surfaces again.

End-to-End Automation of Data Subject Rights Requests

Kyte was also focused on streamlining data subject rights (DSR) requests. Whereas their team previously handled these requests manually with workflows and forms, Kyte now uses Teleskope to automate data deletion and access requests across various data sources, including internal data stores like RDS and their numerous third-party vendors such as Stripe, Rockerbox, Braze, and more. When a new DSR request is received, Teleskope seamlessly maps and identifies the user’s data across internal tables containing personal information, and triggers the necessary access or deletion query for that specific data store. Teleskope also ensures compliance by automatically enforcing the request with third-party vendors, either via API integration or via email in cases where third parties don’t expose an API endpoint.
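As a generic illustration of this pattern (not Teleskope's implementation), a deletion request can fan out over a registry that maps each internal table to the column identifying the data subject, with a separate hook for vendors that only accept requests by email. The table names, registry, and notify_vendor helper below are hypothetical.

```python
"""Generic sketch of fanning a DSR deletion request out over internal tables."""
import sqlite3  # stand-in for any SQL data store

# table -> column holding the subject identifier (illustrative registry)
DSR_REGISTRY = {
    "users": "user_id",
    "payments": "user_id",
    "support_tickets": "requester_id",
}

def delete_subject(conn: sqlite3.Connection, user_id: str) -> None:
    for table, id_column in DSR_REGISTRY.items():
        # Parameterized delete per table; an access request would SELECT instead.
        conn.execute(f"DELETE FROM {table} WHERE {id_column} = ?", (user_id,))
    conn.commit()

def notify_vendor(vendor_email: str, user_id: str) -> None:
    """Placeholder for vendors without a deletion API: request erasure by email."""
    print(f"Would email {vendor_email} asking to erase subject {user_id}")
```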

Conclusion

With Teleskope, Kyte has been able to effectively mitigate risks and ensure compliance with evolving regulations as their data footprint expands. Teleskope reduced operational overhead related to security and compliance by 80% by automating manual processes and replacing outdated, ad-hoc scripts. Teleskope allows Kyte’s engineering team to focus on unlocking the freedom to go places through a tech-enabled car rental experience, and helps them build systems and software with a privacy-first mindset. These tangible outcomes allow Kyte to streamline their operations, enhance data security, and focus on building a great, secure product for their customers.
