How microsoft is attuning business intelligence to the GDPR with SQL server
With the SQL Server technology, Microsoft is offering well thought-out solutions for the implementation of the new data regulations of the European Union.
// Business Intelligence, IT Processes and Organisation
The EU has developed a new privacy / data protection law, the “EU General Data Protection Regulation”. It has been in full force since May 2018 and there is still uncertainty among a lot of companies and their data protection officers as to what is actually required by the new regulations and what kind of impact they will specifically have on their data processing processes. Especially in the big data and business intelligence environment, fundamental conflicts of interest arise and widely spread paradigms regarding data retention and analysis will potentially have to be put into question.
The following article provides an overview regarding the fundamental requirements of the GDPR and the obligations arising for companies and for the data protection officers, and outlines a possible approach to comply with these requirements. In addition, the impacts of the regulation are discussed in the business intelligence context and it is shown, based on the example of the Microsoft SQL Server platform, how the requirements of the new EU Directive can be covered by features of modern database management systems.
CONCEPTS AND REQUIREMENTS OF THE GDPR
The EU General Data Protection Regulation (GDPR), which originally came into force in May 2016, has as such become legally fully enforceable since May 2018. Several years passed before the EU Commission, the EU Parliament and the EU Council of Ministers were able to agree in late 2015 to get a comprehensive reform for the strengthening and standardisation of the data protection for all Member States of the EU on its way. A lot of companies already utilised the 2-year transition period intensively and invested significant expenditures to become familiar with the new legal situation and to adjust their data processing processes from an organisational as well as from a technical perspective to the new legal framework conditions. Even though these legal framework conditions quite often were not completely new in comparison to the German Federal Data Protection Act, a completely new motivation for the implementation results due to the increase of the announced fines which can amount to up to 20 million euros or 4 % of annual revenues worldwide.
A series of requirements and obligations result from the EU General Data Protection Regulation for companies that process personal data. Here, data is considered personal if it can be attributed unambiguously to a natural person (e.g. name, address, phone number). But this also encompasses data that can be attributed indirectly and which are therefore often referred to as person-relatable data (e. g. IP address, vehicle license plate). Some categories of personal data (e.g. data regarding racial/ethnic origin, political opinions, religious convictions, union memberships, health, sexual orientation, etc.) are emphasised as requiring particular protection and their processing is, on principle, prohibited, except in the exceptional cases described in Article 9 of the GDPR.
IMPLEMENTATION OF THE GDPR
A first step towards compliance with the regulation is to identify the personal data and the corresponding data processing processes and systems within the company. In light of the new statutory framework conditions, the identified processes may need to be adapted or redefined. Subsequently, the further steps deal with the taking of measures for the correct administration and protection of the personal data. Finally, internal data protection regulations must also be established to be able to perform adequate reporting regarding the compliance, or also non-compliance, with the regulation.
Derived from the regulation, the essential obligations that have to be taken into consideration in the context of the approach described above will be explained in more detail below.
The “Principles relating to processing of personal data” are defined in Article 5 GDPR as follows:
1. Personal data shall be:
a. processed […] in a transparent manner in relation to the data subject (“lawfulness, fairness and transparency”);
b. collected for specified, explicit and legitimate purposes […] (“purpose limitation”);
c. […] limited to what is necessary in relation to the purposes for which they are processed (“data minimisation”);
d. accurate […] and […] up to date; […] personal data that are incorrect, having regarding to the purposes for which they are processed, are erased or rectified without delay (“accuracy”);
e. kept in a form which permits identification of the data subjects for not longer than is necessary for the purposes for which the personal data are processed; […] (“storage limitation”);
f. processed in a manner that ensures appropriate security of the personal data, including protection against unauthorised or unlawful processing and against accidental loss, destruction or damage, using appropriate technical or organisational measures (“integrity and confidentiality”);
2. The controller shall be responsible for, and be able to demonstrate compliance with, paragraph 1 (“accountability”).
DOCUMENTATION OBLIGATIONS OF THE GDPR
In accordance with the principles relating to the processing of personal data in Article 5, companies have to critically analyse and document their data processing processes (lawfulness, fairness and transparency) and in this question for all personal data to be processed whether a clear purpose relation can be proven (purpose limitation) and the accuracy of the data for the purpose applies (accuracy), whether the data might already fulfil the purpose in anonymised form and as such no longer have to be declared as personal data, whether the data are needed at that scope (data minimisation) and whether the duration of storage can be justified accordingly (storage limitation). Additionally, measures must be undertaken to be able to prove, at any time, compliance with the aforementioned principles (accountability) – a challenging task for all data protection officers.
OBLIGATIONS TO INFORM AND DISCLOSE
In addition to the detailed documentation of these considerations, affected groups of persons must, furthermore, be informed about the type and manner of the storage and processing of their data and accordingly “any information […] relating to the processing” must be provided “in a concise, transparent, intelligible and easily accessible form, using clear and plain language” (Article 12). This information must also be provided upon request in accordance with the right of access by the data subject (Article 15).
OBLIGATIONS TO PROTECT THE RIGHTS OF DATA SUBJECTS
Since May 2018, companies in the EU must maintain processes that are suitable as reaction to the assertion of the rights of data subjects defined in Chapter 3 (Art. 12 - 23). This refers to the right to rectification, the right to erasure, the right to restriction of processing, the right to data portability and the right to object, which can lead to adjustments to the data processing processes defined and implemented within the company. The erasure of personal data in particular can pose a challenge to data protection officers that should not be underestimated since the data must not only be erased primarily from the databases of the processing systems, but also secondarily from any existing backups, log files, and existing reports.
DATA PROTECTION OBLIGATIONS
Once the personal data have been identified and their processing in compliance with the aforementioned principles has been defined and documented (regarding this also see Art. 30 Records of processing activities), the data protection officer has to ensure that these principles in accordance with Article 25 Data protection by design and by default are implemented through “appropriate technical and organisational measures” and, in particular, in accordance with the principle of integrity and confidentiality, the “confidentiality, integrity, availability and resilience of the processing systems and services”. Which measures can be referred to as “suitable” is not specified clearly and is, in accordance with Article 25, dependent on “the state of the art, the cost of implementation and the nature, scope, context and purposes of processing as well as the risks of varying likelihood and severity […] posed by the processing”. Regarding this, Article 32 Security of processing only gets marginally more specific and requires that “account shall be taken in particular of the risks that are presented by processing, in particular from accidental or unlawful destruction, loss, alteration, unauthorised disclosure of, or access to personal data transmitted, stored or otherwise processed”. Explicitly mentioned as suitable measures are pseudonymisation and encryption. In the context of the data protection impact assessment stipulated in Article 35, the assessments of the data protection principles and the respective risk in the processing of personal data, the selection of the protection level associated therewith, and the planned remedial measures must be considered and documented.
ACCOUNTABILITY AND NOTIFICATION OBLIGATIONS
If violations with respect to the protection of the personal data occur, and if an increased risk to the rights and freedoms of natural persons results therefrom, the accountability anchored in the principles as well as the notification obligation regulated in Article 33 are triggered, according to which companies “shall without undue delay and, where feasible, not later than 72 hours after having become aware of it, notify […] the breach to the supervisory authority”. “Any personal data breaches, comprising the facts relating to the personal data breach, its effects and the remedial action taken” must be documented and made available. This also includes the description of “the nature of the personal data breach including where possible, the categories and approximate number of data subjects concerned and the categories and approximate number of personal data records concerned”.
IMPACT OF THE GDPR ON BUSINESS INTELLIGENCE
In the business intelligence context, significant potential for conflicts may arise upon review of the General Data Protection Regulation. Especially since it is here in particular that it is being strived for, through the utilisation of data warehouse solutions, to create an integrated and central data storage that attempts to hold all relevant data long-term, in part reaching quite far back into the past, and which can be analysed in a multitude of ways. Above all, the data protection principles regarding purpose limitation, data minimisation and storage limitation can turn out to be problematic in this. In a lot of data warehouse and big data environments, the principle of “rather too much than too little” applies and data are rather held based on the principle of data maximisation than that of data minimisation and are, in part, held without clear purpose relation, with the anticipation that these data may be of future relevance and that as such a future purpose relation might be established.
Once they have been loaded into the data warehouse, data are deleted from it only in the rarest of cases. Rather, mechanisms are utilised to offload and archive data that has lost in relevance in order to be able to fall back on it in case of doubt. Master data, such as customer, supplier or employee data, often exist in historicised form to be able to cover as many analysis scenarios as possible.
There is no need to question these paradigms on principle, but particular caution is advised with respect to personal data.
It must be critically questioned whether personal data has to be kept at the respective level of detail in order to ultimately fulfil the analysis purposes. Most of the time, these data are viewed in aggregate form along different dimensions or perspectives, and a detailed analysis based on individual persons is not compulsorily required. As such, in a lot of situations the personal relation can be removed through application of data reduction and anonymisation so that the data no longer has to be taken into consideration with respect to the GDPR. For this, for example, the transferring of personal data such as names, ID numbers, e-mail addresses, date of birth, addresses, etc. into the data warehouse can be dispensed with, and it can be restricted to importing characteristic attributes such as age or year of birth, place of residence, customer group, occupational group, etc. that are utilised for categorisation and aggregation.
If the purpose relation for the processing of personal data exists and anonymisation is not an option, the scope and retention period of the data in the data warehouse must, in accordance with the principles of data minimisation and storage limitation, be restricted to a minimum and measures must be undertaken to erase the data again after the defined retention period. Similarly, it can be advisable for the data protection officers to take precautions in advance regarding the protection of the rights of data subjects and to, for example, define an erasure concept and, where applicable, implement corresponding procedures.
Last, but not least, the different protection levels for the personal data and the protective measures associate therewith, such as access rights, pseudonymisation and encryption have to be evaluated. In the following, the options offered by leading database systems to map the required protective measures on the database side will be introduced based on the example of Microsoft SQL Server.
IMPLEMENTATION OF THE GDPR WITH MICROSOFT SQL SERVER
In recent years, Microsoft has continuously expanded the range of security functions provided with SQL Server. As such, with the publication of SQL Server 2016, in particular, a whole series of innovative functions such as Always Encrypted, Dynamic Data Masking, and Row Level Security were added to the already existing functions such as Authentication, Authorisation, Transparent Data Encryption, Secured Database Connections, SQL Server Audit and Policies.
Before elaborating on the individual features of Microsoft SQL Server, we would like to point out here that a secure database solution is first and foremost considerably dependent on the underlying configurations of the operating system and of Microsoft SQL Server. Therefore, the hardening of the operating system and of the database management system should constitute the first step towards creating a secure infrastructure. For this, best practices (e.g. the SQL_Server_2012_Security_Best_Practice_Whitepaper_Apr2012.docx published by Microsoft) and standards can be relied on to, for example, restrict the rights for service accounts, turn off features and/or services that are not needed, install patches, configure ports, and define policies (e.g. password policies).
AUTHENTICATION & AUTHORISATION
With respect to the security of the processing systems, one of the most fundamental requirements of the EU GDPR is access control. In Microsoft SQL Server, access control is effected via the SQL Server authentication which supports two modes. On the one hand, this is the Windows authentication so that the authentication mechanisms of Windows Server can be utilised, and Windows users and Windows groups can be relied upon. For this, the use of an Active Directory provides for significant advantages in the sense of a centralised administration of users, groups, and policies. On the other hand, in Mixed Mode dedicated SQL logins that allow for an authentication via a user name and a password can be managed within SQL Server in addition to the Windows authentication.
The authorisation is performed via the authorisations on an object level assigned to the logins and users within SQL Server. For simplification of the administration of access rights, a role concept can be fallen back on to encapsulate authorisations with server-wide and database-specific roles. The roles predefined in SQL Server as well as the user-defined roles and assigned rights on the object level should be assigned to users such that they have merely the minimum rights available that are necessary for them to carry out their tasks (principle of least privilege).
DYNAMIC DATA MASKING
Dynamic Data Masking is a feature that has been available for all editions ever since the release of SQL Server 2016. It allows for a further restriction of read access to table columns with sensitive data. For privileged users with UNMASK authorisation, no difference is identifiable and they continue to receive full access to the original data. For non-privileged users with a conventional SELECT authorisation, the database provides the queried data only in masked form (e.g. “firstname.lastname@example.org” instead of “email@example.com”) and the user does not gain insight into the sensitive data. The patterns utilised for the masking must be defined in advance via a masking function.
Dynamic Data Masking supports the pseudonymisation of sensitive data explicitly mentioned in the General Data Protection Regulation. The feature is frequently being used in the data warehouse context when production data is to be used for development and test purposes, but the sensitive data should not be provided to the developers and testers and merely be displayed in pseudonymised form.
The usability of Dynamic Data Masking for the restriction of access by end users of a data warehouse is highly dependent on the architecture selected. If queries by end users are posed to the database directly or indirectly under the respective user context (e. g. via Microsoft Power BI and Microsoft SQL Server Analysis Services in DirectQuery mode), the user context can be analysed accordingly by Microsoft SQL Server and the feature can be utilised. If, instead, the data are loaded, processed and held unmasked by an application placed in between with the help of a dedicated privileged account, and if the user queries are responded to directly by the application, without falling back on the data base (e. g. Microsoft SQL Server Analysis Services in In-Memory mode), this security feature of Microsoft SQL Servers turns out to be without a purpose and the corresponding security concepts of the respective application have to be relied upon.
ROW LEVEL SECURITY
Another feature that can be used with all editions, ever since SQL Server 2016, to further restrict the access to sensitive data is Row Level Security. In this, authorisations can be defined on a row level so that users only receive access to data records in a table that are relevant to them. To implement this, a function (CREATE FUNCTION) and a security policy (CREATE SECURITY POLICY) must be created with tSQL. The function contains the filter logic and defines based on which table column the table should be filtered in which type and manner in accordance with the user context of the current database connection. Finally, the security policy ties the filter function to the table and as such restricts access row level-based.
Row Level Security is a valuable feature for being able to define and control access rights on the database side down to the row level. In the data warehouse context, the feature can be used sensibly as long as the users’ accesses via an application (e. g. Microsoft Power BI and Microsoft SQL Server Analysis Services in DirectQuery mode) are actually targeted towards the Microsoft SQL Server database and the current user context is passed on and can be analysed in the processing of the query. However, if the user queries are not processed by the database itself, but rather by an application placed in between (e. g. Microsoft SQL Server Analysis Services in In-Memory mode), which potentially processes the data multi-dimensionally and retains it in cache, the access control concepts of the respective application also have to be taken into consideration accordingly, and the Row Level Security of the SQL Server can no longer be trusted.
TRANSPORT LAYER SECURITY
In addition to access control, the encryption of data is also an explicitly mentioned measure that can be utilised for the protection of sensitive data. With respect to the securing of the communication paths, the encryption of database connections with Transport Layer Security is a common security feature which has or should have established itself as the standard at companies for quite some time already. Microsoft SQL Server supports secure connection over TLS 1.2 and as such helps to reduce the risk of unauthorised access (e. g. Man-in-the-Middle attacks) to sensitive data during transport via incoming or outgoing database connections. The utilisation of secured connections can be ensured and forced via the SQL Server configuration (setup of a certificate and activation of the ForceEncryption flag).
TRANSPARENT DATA ENCRYPTION
While TLS only effects encrypted communication, with the data still held unencrypted in the database, the Transparent Data Encryption (TDE) feature of the Enterprise Edition provides the option to encrypt the data on the physical level. With TDE, the risk is reduced that unauthorised parties may gain access to database, backup or transaction log files and recreate a database therewith in another location and as a result gain access to sensitive data. For this, the data are encrypted when writing them to the storage media and are then decrypted again when loading the data into memory. When TDE is activated for a user database, this also applies equally to the tempdb system database. Ultimately, the data is present again in memory in unencrypted form. For the applications accessing the database, there is no difference to non-encrypted databases.
The configuration of TDE encompasses, in essence, the creating of a Database Master Key based on the Service Master Key, which is generated during installation of the SQL Server instance, the generating of a certificate with the help of the Database Master Key, the generating of a Database Encryption Key utilising the certificate, and activation of the encryption for the database to be protected. The utmost diligence is warranted in the management of the certificate and its private key in order to be able to restore and decrypt encrypted databases in case of potentially required maintenance work.
With Always Encrypted, another innovative security feature was released with SQL Server 2016 (as of SQL Server 2016 SP1 for all editions) that allows for an even stricter separation between functionally privileged and technically privileged users and can be applied on a column level. Thus, it is being prevented that users, such as database administrators, who receive comprehensive privileges on the database level for technical and less so for functional or content reasons, gain access to unencrypted sensitive data. In this, the encryption and decryption of sensitive data with the Always Encrypted feature activated is carried out already via the Always Encrypted-capable database drivers used by applications. The database driver provides for an automatic translation of the original database query and encrypts in the query all values (e.g. filters) of the columns to be protected. Similarly, the result of the database query is decrypted again by the database driver and provided to the application and/or the privileged end user. As such, the database itself and also the privileged database users are only responsible for the correct processing of the encrypted data and have at no point in time access to the unencrypted data.
For the configuration of Always Encrypted, a certificate is needed that is used for generating a Column Master Key. The certificate is then installed on those target systems from which applications access the database via the database drivers. This can be the end user’s PC if the application has been installed locally. Otherwise, this is the server on which the application is being run. In the database, the Column Master Key is used to generate the Column Encryption Keys which are ultimately utilised for setting up the encryption of individual columns. For the encryption, there is a choice between deterministic encryption (the same input values lead to the same encrypted values) and random encryption (the same input values lead to different encrypted values). Both variants result in restriction with respect to the ability to query the encrypted columns. While random encryption constitutes the more secure variant, it prevents, among other things, filtering, grouping, sorting, joining and indexing based on the encrypted columns. Even though deterministic encryption leads to restrictions as well, it still allows for the use of filters based on equivalence, joining, grouping, sorting and indexing.
Similar to the assessment regarding Dynamic Data Masking and Row Level Security, the sensibility of using Always Encrypted in a data warehouse scenario is highly dependent on the architecture selected. As such, its use makes sense if queries are directed to the database, directly or indirectly, via a compatible database driver and are processed by said database. But if the data are loaded, decrypted, and temporarily stored by a downstream application, the problem of the holding of unencrypted data has not been solved but rather only shifted.
In addition to confidentiality and integrity, another explicitly required protection objective of the GDPR is the availability of the personal data. A fundamental protection against the loss of data, and the ability to restore data in a timely fashion after a breakdown, is afforded by the use of a well-defined backup and restore strategy. Additionally, a higher protection in the sense of high availability solutions can be achieved with SQL Server via the AlwaysOn Availability Groups or AlwaysOn Failover Cluster Instances features.
AlwaysOn Availability Groups, usable with the Enterprise Edition, allow for the configuring of a failover environment at the database level. One or more primary databases can be added to an AlwaysOn Availability Group and the latter can be replicated up to 8 times as secondary databases. A failover covers all databases, each, of an AlwaysOn Availability Group. Depending on the synchronisation mode selected (synchronous v asynchronous commit mode) and on the failover strategy dependent thereon (Automatic, Planned Manual, Forced Manual), a risk of data loss can be excluded or may still remain to a small extent. In this, the choice is, in essence, based on the performance differences in the synchronisation of primary and secondary databases. Other advantages of the feature are that the secondary databases can be used for Read-Only access or for creating backups, and thus provide for an unburdening of the primary databases.
AlwaysOn Failover Cluster Instances are based on the Windows Server Failover Clustering (WSFC) technology and provide failover functionality at the instance level for the Standard and Enterprise editions. While the Failover Cluster Instance is installed on multiple WSFC nodes, it is visible externally only as a single instance. If a breakdown of the active node occurs, an automatic failover to another WSFC node takes place.
SQL SERVER AUDIT
SQL Server Audit is a powerful feature that allows for the monitoring of events in SQL Server. For a SQL Server Audit, server-wide (for all editions) or database-wide (for all editions as of SQL Server 2016 SP1, prior to that only Enterprise Edition) specifications can be created. The specifications describe which actions should be monitored for which database objects. The creating of a specification can be carried out at differing degree of detail based on atomic actions and database objects or also based on predefined groups of actions. The information collected in case the monitoring has been activated can be written to Windows Application Event Log, the Windows Security Event Log or, instead, to a dedicated audit file, and subsequently analysed. As such, SQL Server Audit can take on an important role for the implementation of the EU GDPR, to, for example, monitor access to personal data, and can, through the logging of the events determined to be relevant during configuration, significantly contribute to compliance with the accountability obligation.
The requirements defined in the European Union regarding the protection of person data result in a multitude of obligations for data processing companies. Especially with respect to big data solutions and data warehouse solutions, these challenges may seem to be insurmountable at first glance. While the underlying considerations are by no means new, now the announced drastically increased fines provide a completely new motivation for companies to actually also implement them. For data warehouse solutions it is recommended to first identify personal data within the system and to check, in accordance with the Principles relating to processing of personal data, whether these data ultimately have to actually be processed in the intended way and manner. For this, it must, above all, be possible to present a clear purpose relation. Data reductions and anonymisation can help to remove the person relation from the data and thus escape from the area of application of the GDPR. If the processing of personal data is unavoidable, measures must be defined and implemented to be able to ensure and prove the protection of this data. Modern database management systems, such as Microsoft SQL Server, provide a multitude of functions to control access to data, encrypt data and communication paths, and guarantee high availability. The GDPR does not pose any specific requirements as to which protective measure are mandatorily required in which context and only stipulates that appropriate measures be undertaken corresponding to the protection level of the personal data. In this, the choice and interaction of the “appropriate” security functions must be decided on a case-by-case basis and often also depends on other factors such as data warehouse architecture, functional limitations, and performance.