
Data Classification as Data Loss Prevention Necessity

Businesses hold more data than ever before, and that abundance brings significant management problems. According to a global study by Seagate and IDC, only 32% of the data available to enterprises is put to work. The remaining 68% sits unused in storage, where sensitive information gets lost in the mass and becomes vulnerable. No wonder two-thirds of survey respondents report insufficient data security.

Data Classification Use Scenarios

Data Classification is directly related to risk management, compliance, and protection against internal threats. Organizing structured and unstructured data into appropriate categories ensures that data is used efficiently and protected across the company network. Without data classification in place, any data protection program will fail. Here’s what Data Classification can help you with:

  • Mitigate risks. Gain control over where sensitive information is stored and who can access it, reducing exposure to possible threats.

Zecurion DLP doesn’t rely on document tags because this technology is unreliable: labels may not reflect the actual content.

  • Optimize business processes. Give approved users efficient access to the data they need for work, discover and eliminate redundant data, and streamline business activities.

Zecurion DLP checks the content of documents in motion each time a policy is applied. This approach is more complex but more reliable. Our next-generation technologies ensure these checks run instantly without slowing down the system.

  • Comply with regulations. After identifying data governed by applicable regulations, you can put additional tracking or control measures in place. You can also enable quarantine, archiving, or other actions the regulations require.

Zecurion DLP has its own pre-installed data classification tools. You will not need third-party solutions to keep your data under control.

Data Sensitivity Levels

The United States government uses seven levels, ranging from the least sensitive, “Controlled Unclassified Information (CUI),” to the most sensitive, “Restricted Data/Formerly Restricted Data.” That many levels is excessive for most organizations, whereas a system with three or four levels is less complex and easier to maintain. We suggest keeping it simple.

  • High sensitivity data: information that, if made public, could cause significant harm to an individual or the organization, lead to financial and reputational losses, or pose a risk to company operations.
  • Medium sensitivity data: information intended for internal use only but not highly sensitive, such as construction plans and marketing strategies.
  • Low sensitivity data: data created for public access, such as product datasheets, webpages, and blogs.

Data Classification Methods by Zecurion Next Generation DLP

Zecurion Next Generation DLP includes content-based classification and uses 10+ content detection technologies to inspect files. Here’s an overview of the key ones.

Dictionary-based analysis

A dictionary consists of text strings, optionally with wildcards, that contain words on a particular topic (financial documents, spam messages, job search-related materials, and others). This technique looks for exact matches of the designated words. To detect all grammatical forms of a word, use dictionary search with morphology or stemming. You can also use word combinations to reduce false positives. For example, the word “console” can appear in several dictionaries at once, while “game console” points specifically to the Computer Games dictionary.

You can create a dictionary for any subject or category and populate it with words that should be flagged. There are 30+ predefined dictionaries included in the system by default.
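
The word-combination idea can be sketched in a few lines of Python. This is a minimal illustration, not Zecurion's implementation: the dictionary contents, the `*` wildcard syntax, and the function names `compile_entry` and `match_dictionaries` are all assumptions made for the example.

```python
import re

# Hypothetical dictionaries; entries and the '*' wildcard syntax are assumptions.
DICTIONARIES = {
    "Computer Games": ["game console", "gamepad*"],
    "Finance": ["invoic*", "balance sheet"],
}

def compile_entry(entry):
    """Turn a dictionary entry into a regex; '*' matches any run of word characters."""
    words = entry.lower().split()
    parts = [re.escape(word).replace(r"\*", r"\w*") for word in words]
    # Words of a combination must appear next to each other, separated by whitespace.
    return re.compile(r"\b" + r"\s+".join(parts) + r"\b")

def match_dictionaries(text, dictionaries):
    """Return {dictionary name: [entries found]} for every dictionary with a hit."""
    text = text.lower()
    hits = {}
    for name, entries in dictionaries.items():
        found = [entry for entry in entries if compile_entry(entry).search(text)]
        if found:
            hits[name] = found
    return hits

print(match_dictionaries("Open the admin console and attach the invoices.", DICTIONARIES))
print(match_dictionaries("He bought a new game console.", DICTIONARIES))
```

In this sketch the bare word "console" triggers nothing, while the combination "game console" flags only the Computer Games dictionary. The trailing wildcard is a crude substitute for the morphology or stemming support mentioned above.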

Templates and regular expressions

A regular expression describes a set of character strings, which makes it well suited to searching for structured data. Credit card numbers, Social Security numbers, IBANs, URLs, email addresses, and other similar information can be detected with this technique.
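
As a rough illustration, the sketch below scans text for three kinds of structured data with Python's `re` module. The patterns are simplified assumptions, not the templates shipped with Zecurion DLP, and the Luhn checksum is an extra validation step added here to show how card-number false positives can be filtered out.

```python
import re

# Simplified, illustrative patterns (assumptions, not production-grade templates).
PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    # 13 to 19 digits, optionally separated by single spaces or hyphens.
    "card": re.compile(r"\b(?:\d[ -]?){12,18}\d\b"),
}

def luhn_ok(number):
    """Validate a card number with the Luhn checksum, ignoring separators."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:  # double every second digit, counting from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0

def find_structured_data(text):
    """Return (label, matched text) pairs; card candidates must pass the Luhn check."""
    findings = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            value = match.group()
            if label == "card" and not luhn_ok(value):
                continue
            findings.append((label, value))
    return findings

sample = "Card 4111 1111 1111 1111, SSN 123-45-6789, mail bob@example.com"
for label, value in find_structured_data(sample):
    print(label, value)
```

The card number in the sample is a widely used test value, not a real account.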

Digital fingerprints

You collect several documents of a specific type or category and provide them as input, and Zecurion DLP creates a digital fingerprint that lets it detect the actual documents by their parts. Once the digital fingerprint is complete, Zecurion DLP can identify any document from the collection, any part of one, or any combination of elements from the collection. You can add new documents to the collection, and Zecurion DLP will automatically update the digital fingerprints.

Zecurion uses shingle algorithms and the Bayesian method to prevent data loss. A fingerprint created with the shingles method stores information about sequences of words found in the reviewed documents. This protects information when a user copies parts of a document, changes the order of phrases, or inserts extra phrases from other texts. The algorithm is effective as long as the text hasn’t been significantly changed by deletions and insertions, but it gives low search accuracy on short texts (fewer than 50 words). You can also create shingles based on sentences instead of words. In this case, the system will track only the deletion or movement of entire phrases. The size of such fingerprints is significantly smaller, and the processing speed is much higher.
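
A word-level shingle fingerprint can be sketched as follows. This is a generic illustration of the technique under stated assumptions: the shingle length `k=3`, the containment score, and the function names are choices made for the example, not details of Zecurion DLP.

```python
import re

def shingles(text, k=3):
    """Return the set of overlapping k-word sequences (shingles) in the text."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def containment(fragment, fingerprint, k=3):
    """Share of the fragment's shingles that also occur in the fingerprint (0.0 to 1.0)."""
    fragment_shingles = shingles(fragment, k)
    if not fragment_shingles:  # fragment shorter than k words
        return 0.0
    return len(fragment_shingles & fingerprint) / len(fragment_shingles)

protected = ("The quarterly revenue forecast for the northern region "
             "must not be shared outside the finance department")
fingerprint = shingles(protected)

copied = "must not be shared outside the finance department"
edited = "As agreed, the quarterly revenue forecast for the northern region stays internal"
unrelated = "Lunch menu for Friday includes soup and salad"

for text in (copied, edited, unrelated):
    print(round(containment(text, fingerprint), 2), text)
```

A verbatim excerpt scores 1.0, a partly rewritten one scores somewhere in between, and unrelated text scores 0.0. A policy would then compare the score against a threshold of its choosing.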

A fingerprint created using the Bayesian method contains a dictionary built from a set of files of a specific category. Each dictionary word is assigned a category weight, which determines the probability (0 to 100%) that a text containing that word belongs to the category. A fingerprint created with the Bayesian method can include texts of up to 5000 words.
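
The per-word weights can be illustrated with a small sketch. The source does not say how the weights are estimated or combined, so everything here is an assumption: weights are estimated from document frequencies in a category set versus a contrast set, and combined with a standard naive Bayes formula. The function names and sample texts are hypothetical.

```python
import re
from math import prod

def tokenize(text):
    """Return the set of lowercase words in the text."""
    return set(re.findall(r"\w+", text.lower()))

def build_fingerprint(category_docs, other_docs):
    """Assign each category word a weight: the estimated probability that
    a text containing the word belongs to the category."""
    in_cat = [tokenize(doc) for doc in category_docs]
    out_cat = [tokenize(doc) for doc in other_docs]
    weights = {}
    for word in set().union(*in_cat):
        a = sum(word in doc for doc in in_cat) / len(in_cat)    # frequency in category
        b = sum(word in doc for doc in out_cat) / len(out_cat)  # frequency elsewhere
        # Clamp so a single word can never force exactly 0% or 100%.
        weights[word] = min(0.99, max(0.01, a / (a + b)))
    return weights

def category_probability(text, weights):
    """Combine the weights of known words (naive Bayes, equal priors assumed)."""
    known = [weights[word] for word in tokenize(text) if word in weights]
    if not known:
        return 0.0
    yes = prod(known)
    no = prod(1 - p for p in known)
    return yes / (yes + no)

contracts = ["This agreement is between the supplier and the customer",
             "The customer shall pay the supplier under this agreement"]
other = ["The team lunch is on Friday",
         "The customer survey is on the website"]

weights = build_fingerprint(contracts, other)
print(category_probability("Draft agreement with the supplier", weights))
print(category_probability("Lunch is on Friday", weights))
```

Words that appear equally often in both sets, such as "the", end up with a weight of 0.5 and do not move the result in either direction.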

Machine learning

Machine learning is another technique similar to digital fingerprints. The initial setup is identical: you provide a collection of files for Zecurion DLP to analyze. Where digital fingerprints detect exact content matches, machine learning can detect documents similar to the submitted group based on keywords and/or semantic indicators.
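
To make the contrast with exact matching concrete, here is a deliberately simple stand-in: a bag-of-words profile learned from sample documents and compared to new text by cosine similarity. It is not the model Zecurion DLP uses, which the source does not describe, and the function names and sample texts are hypothetical.

```python
import math
import re
from collections import Counter

def vectorize(text):
    """Count how often each lowercase word occurs in the text."""
    return Counter(re.findall(r"\w+", text.lower()))

def train_profile(documents):
    """Sum the word counts of all sample documents into one category profile."""
    profile = Counter()
    for doc in documents:
        profile.update(vectorize(doc))
    return profile

def cosine(a, b):
    """Cosine similarity between two word-count vectors (0.0 when either is empty)."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def similarity(text, profile):
    """Score how close a new text is to the learned profile."""
    return cosine(vectorize(text), profile)

profile = train_profile([
    "salary review for the engineering staff",
    "staff salary bands and bonus review",
])
print(similarity("bonus and salary review", profile))
print(similarity("parking lot closed on monday", profile))
```

The first text shares no full sentence with the samples, so an exact-match fingerprint would miss it, yet it still scores well above the unrelated one.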

AI-based image templates

Image templates effectively detect signatures, stamps, letterhead, or documents with a defined structure like passports or driver’s licenses. This method is also similar to digital fingerprints, but rather than detecting specific text, it detects image patterns. Like digital fingerprints and machine learning, the initial setup requires a collection of files that Zecurion DLP can analyze to learn the patterns it will need to recognize later.

For instance, suppose all financial reporting documents are stamped with the company seal. After the graphic fingerprint is created, any image containing the seal (in any color, at any rotation angle, with or without overlapping elements) will be recognized as confidential.

OCR (Optical Character Recognition)

This technique is valuable for identifying sensitive or confidential data that has been scanned or photographed in an attempt to bypass other detection methods. Zecurion DLP relies on third-party optical character recognition engines, integrating with ABBYY FineReader and Google Tesseract to extract and identify text from images.

Support Vector Machine

The Support Vector Machine method lets you create a classifier that recognizes texts on a particular topic. During fingerprint creation, you prepare two sets of documents: documents on the selected topic and documents that contain similar language but are not related to the chosen topic. The more documents there are in each set, the more accurate the classifier.

Comparing the two sets identifies the defining characteristics of the controlled category and excludes attributes that are common to both sets.

After the initial setup, perform a series of test runs on documents that belong and don’t belong to the selected topic. The classifier performs satisfactorily if the calculated match probabilities for the texts that belong to the topic are greater than 90%.
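
The acceptance check for a test run can be written down as a small helper. This is a sketch under assumptions: the result format, the function name `review_test_run`, and the reading that every on-topic text must exceed the threshold are ours, and the probabilities in the example are made up for illustration.

```python
def review_test_run(results, threshold=0.90):
    """Check a test run against the 90% rule.

    `results` maps a test document name to a pair
    (belongs_to_topic, match_probability). Only on-topic documents
    are held to the threshold, as described above.
    """
    weak = [name for name, (on_topic, probability) in results.items()
            if on_topic and probability <= threshold]
    return {"satisfactory": not weak, "needs_review": weak}

# Made-up probabilities for illustration only.
test_run = {
    "contract_a.docx": (True, 0.97),
    "contract_b.docx": (True, 0.84),
    "newsletter.docx": (False, 0.12),
}
print(review_test_run(test_run))
```

Documents listed under `needs_review` point to the kind of on-topic text the training sets should cover better before the next test run.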

TITUS Data Classification support

Integrating with a third-party solution can be helpful for deeper interpretation, even when the pre-installed software does the majority of the work.

For this purpose, Zecurion DLP supports TITUS Data Classification with its context- and user-based inspection.
