DLP and Security Service Edge (SSE)

Why you should just do DLP with SSE

Dec 03, 2022

TL;DR

Security Service Edge (SSE) technologies allow you to control access from users (and machines if you are good) to Internet accessible and private systems. Not only do they provide a way to remove expensive and clunky remote access technologies, they are also a much better way of doing Data Loss Prevention (DLP). If you don’t have any DLP, do SSE instead. If you have existing DLP, see if you can integrate it and use SSE as an enforcement channel.

Introduction

After bitching about Gartner and analysts in general I am back talking about another buzz word bingo and hot topic: Security Service Edge (SSE) and why it is the best way to do Data Loss Prevention (DLP) which is a key part of implementing a Zero Trust Architecture (3 buzz words in one sentence! Drink!). A more complete post on Zero Trust in a #futurePost.

Sidenote: Interestingly Gartner originally called it Secure Access Service Edge (SASE) (pronounced “sassy”). Then they decided to focus on the security part, dropping SD-WAN (Software Defined Wide Area Network) as it was … more saxxy </dadjoke>?

SSE is a set of technologies that basically allow you to connect users working anywhere and resources running anywhere to other resources running anywhere while improving security. It replaces VPNs and Virtual Desktops and makes it more secure to access SaaS services and other Internet accessible resources. Unlike everything badged as “Zero Trust”, and similar to Cloud Native Application Protection Platforms (CNAPP), this is one of the buzz words/hype cycles that I think is practically very useful.

Why do you need SSE

I think there is no question covid changed the world. Remote working, once frowned upon by pointy haired bosses everywhere, suddenly became the mandate. Companies rushed to get remote access solutions working and to scale overloaded VPNs, and suddenly remote workers needed to be treated as first class citizens. Fast forward almost 3 years at time of writing (can you believe covid started in March 2020?) and almost no one wants to go back to full time in the office. The new reality is hybrid working. So basically some of your users will be working remotely some of the time, and you need to offer hybrid or remote working to stay competitive. So why not make that experience as great and as secure as possible?

At the same time, and this trend had started well before covid, maybe as far back as my favorite Jericho Forum, your resources are no longer in a nice perimeter in your data centers.

Your resources now run:

Still in your datacenters because you are smart (#futurePost)
In your offices (still…): network equipment, whiteboards, some NAS etc.
In cloud as IaaS, PaaS, FaaS e.g. AWS, GCP, Azure
As SaaS (what we used to call a website…)

What you have is a many to many problem…

Wouldn’t it be nice if:

It didn’t matter where your user was: office or remote or where your source resource was running
What device they were using: managed or device choice
Where the destination resource they are trying to access is hosted/running

They can access what they need to:

Easily, with no additional software like Virtual Desktops or VPNs. Low friction and improved user experience.
Lower costs and complexity - vs VDI and multiple proxies and remote access technologies
More securely - they only access what they need, and the functions they can perform take into account their device and overall security posture. Adopting the zero trust principles that authentication and authorization are continuous and down to the transactional level.

SSE Architecture

The good news is unlike when the Jericho commandments were written, the technology to do this relatively cheaply is now available. High level architecture:

Key components of SSE:

Agent - no getting away from it unfortunately. You need one on all managed devices to steer traffic, to verify the device, to perform ongoing posture monitoring, for end to end user experience monitoring, sign-on without passwords (!!!). Embrace the agent sprawl…. muhahahahaha. :)

Agentless for unmanaged devices - most solutions offer agentless access for web based resources, with a few like Palo promising a full SSL VPN in the browser.
SaaS - Basically the control plane like in SD-WAN. Configure the solution, set policies. Some solutions require all traffic to go via the SaaS even when the user is in an office accessing private resources. Avoid those solutions :).
Private edge - I was so disappointed when the new fancy SD-WAN solution still required us to plug a bunch of hardware appliances into the datacenters. A very smart Network Architect told me: well, at the end of the day the packets need somewhere to actually travel… Good point! SSE is no different, you need some compute where your private resources live to funnel traffic. No getting around it, but it can be virtualized and hopefully soon it will be a container you can deploy on Fargate and equivalent.

The high level sequence diagram is:

User on a managed device or machine anywhere needs to connect to a private or public resource.
- Users on unmanaged devices authenticate to the SSE via your IDP on the browser first.
The agent authenticates the user against your IDP. Most solutions if your IDP also has an agent (e.g. Okta Fastpass, Azure AD with Windows Hello) can do this without needing the user to ever enter a username and password. Trusted device with a separate MFA prompt if risky (e.g. new device, new location, new IP).
The agent talks to the SaaS and gets the routing policy:
- Public Internet - provide access via the cloud proxy service. You can do some fun things to improve security here because SSE solutions understand the transactions for the popular applications e.g. cloud storage, Office365 and Google Workspace:
  - If they are on an unmanaged device: don’t allow them to download sensitive data. You can block access of course but don’t be that guy… :) Be risk based.
  - Allow upload to your corporate Google Drive tenant but not the user’s personal Google Drive. Works for a large number of popular services that have public and corporate tenants e.g. M365.
- Private resources - without changing the IPs, DNS or anything in your private applications, users can access them just like if they were in your office. When the user is remote the traffic is routed via the SaaS. When the user is in an office location or has access to an SD-WAN connection (e.g. a branch location, or an executive with an SD-WAN edge at their home), the better solutions will route traffic directly.
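
To make the transactional policy idea above a bit more concrete, here is a minimal sketch in Python of the two example rules (corporate tenant uploads only, no sensitive downloads to unmanaged devices). Everything in it is an assumption for illustration: the `Transaction` fields and the `decide` function are hypothetical names, not any vendor's policy language, and a real SSE evaluates far more signals (user, posture, location, risk).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transaction:
    """One user action as a (hypothetical) SSE proxy might see it."""
    device_managed: bool   # True if the SSE agent vouches for the device
    action: str            # "upload" or "download"
    tenant: str            # "corporate" or "personal"
    data_sensitive: bool   # True if the content matched a DLP rule or label


def decide(tx: Transaction) -> str:
    """Return "allow" or "block" for a single transaction."""
    # Uploads may only go to the corporate tenant, never a personal one.
    if tx.action == "upload" and tx.tenant != "corporate":
        return "block"
    # Unmanaged devices can still work, but cannot pull down sensitive data.
    if tx.action == "download" and tx.data_sensitive and not tx.device_managed:
        return "block"
    # Risk based: everything else is allowed rather than blocked outright.
    return "allow"
```

The point of the sketch is that the decision is made per transaction (upload vs download, which tenant, which device) rather than per site, which is what lets you avoid being "that guy" who blocks the whole service.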

What has this got to do with DLP?

I remember when I worked on my first Data Loss Prevention (DLP) project. It was in 2008 (wow that is 14 years ago now at time of writing). DLP seemed like the holy grail:

Forget networks, forget anti-malware. Finally here was a security technology that got directly at the problem: how to protect the Confidentiality, Integrity and Availability (the CIA triad of security) of your most sensitive data as close to the data as possible.

It did this via:

Digital Rights Management (DRM) - ironically the least deployed component of modern DLP solutions. But encrypt all your sensitive data, link it to your IDP. The protection travels with your data. Doesn’t matter how it is shared. If a user doesn’t need access anymore (e.g. leaver, mover, 3rd party that no longer requires access) then access is immediately lost. It doesn’t matter if the document was on a personal or managed device. Access revoked at the IDP == no more access to the sensitive information.

Direct security - rather than vulnerability scanning, anti-malware, network firewalls etc, DLP is far more direct. Find your most sensitive information (both structured and unstructured), DLP will help you search for it also, and protect it. Simples.

Well that was the promise anyway. What was the reality:

“You promise unicorns as the architect, it is my job to deliver the donkey” - My favorite project manager

Unfortunately I’m convinced that the true promise of DLP requires basically a general artificial intelligence. Or at least a much better version of GPT-3 natural language processing. The current tools such as regex and keyword search are almost totally useless.

Bottom line: DLP does not currently work very well.

What DLP is good at:

Highly structured data that has quite a unique pattern - e.g. credit card numbers. You would think: do a Luhn check, 16 digit number, should be low false positives, right?

Even for credit cards DLP only works well when you combine it with other “dictionaries” i.e. key words. So you accept a higher rate of false negatives by also requiring data around the “card number” such as Visa/MasterCard, an expiry date match, etc. If you are willing to connect your DLP engine to your actual card number database, and it is not tokenized (FAIL!!!), you can take a hash of the actual cards you care about e.g. customer card numbers, and then the accuracy is very good. Not many companies do this though. This is called “exact data match” in most DLP tools. If you want to do anything like Personally Identifiable Information (PII), or anything else that the business considers highly sensitive and that a human could easily spot… forget about it for now.

Highly templated data - this is the reality of the AI/ML promise of all the DLP engines at the moment. If you have a watermarked template or a common structure to all your contracts, invoices etc. then the engine can “learn” these and you get a pretty high signal to noise ratio.
DLP that uses metadata - this is how many companies are trying to get around the problems of getting a computer to accurately read and make sense of human information. They also consume information such as: the location e.g. SharePoint library, who is accessing this information and what department / line of business they belong to, when this information was last updated, whether the user normally accesses this information, whether this information is normally sent to these email addresses by people within the organization or uploaded to this website etc. Of course you can do all this in your SIEM but who doesn’t love out of the box. Of course this means a hell of a lot more integrations for you but in theory it works.
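
To illustrate the credit card case above, here is a rough Python sketch of the three layers: a Luhn check, a keyword “dictionary” near the match, and an exact data match against hashes of known cards. It is a toy under stated assumptions, not how any particular DLP product works: the function names, the keyword list and the 40 character context window are all made up for the example, and a plain SHA-256 of a card number is only used here for simplicity (a real deployment would want a keyed or salted hash).

```python
import hashlib
import re

# 16 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"\b(?:\d[ -]?){15}\d\b")
# Illustrative "dictionary" of words expected near a real card number.
KEYWORDS = ("visa", "mastercard", "expiry", "card number")


def luhn_ok(digits: str) -> bool:
    """Luhn checksum: double every second digit from the right."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def fingerprint(card: str) -> str:
    """Hash used for exact data match (unsalted here, for brevity only)."""
    return hashlib.sha256(card.encode()).hexdigest()


def find_cards(text, known_hashes=None, window=40):
    """Return card numbers found in text.

    With known_hashes, only cards whose hash is in the set count
    (exact data match). Without it, a keyword must appear nearby.
    """
    hits = []
    lowered = text.lower()
    for m in CARD_RE.finditer(text):
        digits = re.sub(r"\D", "", m.group())
        if not luhn_ok(digits):
            continue
        if known_hashes is not None:
            if fingerprint(digits) in known_hashes:
                hits.append(digits)
            continue
        context = lowered[max(0, m.start() - window): m.end() + window]
        if any(k in context for k in KEYWORDS):
            hits.append(digits)
    return hits
```

Note the trade-off the sketch makes visible: without the keyword check any Luhn-valid 16 digit number (order IDs, tracking numbers) would fire, and with it a bare card number pasted with no context is missed. Exact data match sidesteps both, at the cost of wiring the engine up to your real card data.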

Ok so where does DLP struggle at the moment:

Common sense - basically any adult human can look at a document and see if it is sensitive or not, a computer cannot currently. Full stop. Don’t believe the hype. Microsoft (bless their heart), don’t even at time of writing allow you to block an email as an action that breaches a DLP policy. You can only get a human to review it e.g. the user, the user’s line manager, a compliance or risk department or your poor line 1 cyber defense / SOC analysts.
Anyone determined - DLP is still all about low hanging fruit. Don’t look at the data exfiltration techniques in the MITRE ATT&CK framework and expect to catch these. A determined attacker will encrypt the data first, move it out via an open SFTP server, stream it out slowly via DNS etc. A determined user will take a photo of their screen with their phone. DLP isn’t stopping this. It is great at the basic and common communication and collaboration channels:
- Email. Yes 2022 at time of writing and email is still used… RIP Google Wave… or Slack you promised so much…
- Sharing e.g. OneDrive, SharePoint, Dropbox, Google drive.
- Collaboration e.g. Teams, Zoom, Slack.
- SaaS with an API e.g. ServiceNow, M365, Google Workspace.
- Internet uploads anywhere without an API - everyone except Microsoft has a forward proxy… They are building this into the Defender for Endpoint integration with Defender for Cloud Apps but it does not enforce at time of writing. If your SFTP / batch transfers use your Internet egress it can also cover that, assuming you rely on TLS (or inspect the data before it is encrypted) rather than PGP or some other layer of app encryption without an Additional Decryption Key. Yes unfortunately most corporates still use batch transfers.
- Endpoint e.g. write to USB or other storage, copy to mapped drives, encrypt on the device, print, copy/paste etc.
- Mobile - ok nothing does this very well but you can use MDM/MAM tools like Intune to block export or copy paste of data to non corporate apps. Very much a sledgehammer rather than the DLP scalpel at the moment.
- At rest - easy to cover things like OneDrive and SharePoint, Google Drive etc, can cover NAS, S3 buckets, databases etc with some additional compute or agents for most solutions.
Structured data - DLP solutions can scan structured data such as in databases but they are not data lineage or data governance or knowledge management tools.
Reporting - don’t put DLP in so you can just “know where your data is and where it is going”. DLP solutions report on exceptions to your policy, they suck at giving you movement and storage data.

The problem is you are trying to stop every possible way data can get out in your organization. Remember resources are everywhere and so are your users and machines..

That is a losing battle.

So why do DLP at all?

Good question… when in doubt: compliance!! You may have regulations or contracts that require you to mitigate against data loss. That good old 18 year old auditor has it on his checklist…

That is a fair question. Always do a threat model. Is this control a good fit for what you are actually worried about e.g.

Large amounts of sensitive data that is unstructured - ok this is basically every company that uses Excel and who doesn’t do that. I love how Workday allows you to export every report to Excel at the click of a button.. Build an amazing matrix security model and then export all your sensitive HR data!
Cover the basics - all those channels I listed above are the easy ways your staff in particular and 3rd parties with access will steal data. It is worth the investment to prevent and detect that.

What has SSE got to do with DLP?… Again

Basically it is a better way to do DLP and the SSE solutions are better than most of the DLP solutions on the market.

You want to put in SSE because to recap:

Your resources now run everywhere.
Your users and machines need to connect from anywhere on any device.
You want to support both of these for low cost, low friction and in a highly secure manner.

But think about what an SSE solution has, again to recap:

An agent on managed devices, a browser to access for unmanaged devices.
A SaaS to enable you to configure policy (the control plane for SDN fans).
If you do it right: the SSE is the way at least your users, and ideally your machines also, access all resources public or private. Remember Zero Trust, your internal network is already compromised. You don’t need point solutions like Okta Access Gateway, Twingate VPN replacement, AWS Systems Manager and definitely no jump hosts or bastions.
SSE already understands things at a transactional level e.g. allow uploads of documents to corporate Google Drive but not personal. Allow downloads from Workday to managed devices only.

Now think about your sensitive information. Turns out most SSE solutions can do all that basic regex and keyword stuff. The good ones can also read your existing labels, e.g. Azure Information Protection labels if you already use that, or allow/mandate users to label their own unstructured data.

So…..

Don’t do DLP standalone. Implement SSE and use that to provide you DLP features for both private and public resources.

Identity Revive
