Choosing the Right Extract Model for ETL
The traditional terminology is ETL – Extract, Transform, Load. Whilst that is the terminology we use, we don’t subscribe to that approach; we prefer ELT – Extract, Load, Transform. The reasons are explained in our next post, “When to Transform Data”.
This post focuses on the “E” – Extract – covering some practical approaches for getting data from your source systems to a destination system, typically one used for reporting and analysis.
I.T. is usually the internal organisation that applies governance over the operation of, and access to, data. A request for access to data will often be met with barriers that are difficult to counter:
What do you want the data for? [This is the easy part to work through]
Where is the data going? [This can be a political hot potato if it is going to a destination that is not an accepted part of the IT architecture. For example, if it is going to an Azure Data Warehouse but you are an AWS shop]
Who is using it? [This can be the trickiest part if some of your data is sensitive and will be shared with a wide range of roles, as opposed to a smaller, finite group of business analysts. For example, if you are extracting finance, procurement and HR data but that data is going to be made available to finance and procurement people, then the challenge will be securing the HR data, as well as securing the procurement data from finance for confidentiality purposes]
How will you be accessing the data? [This is the main topic of this post]
How do you work with IT on these points?
Engage business stakeholders to understand what is really required, what really matters, and what does not. This concerns what data and at what frequency. If something is not important, be prepared to drop it or move it to a later phase. The end result will be a focused set of requirements that have meaningful (ideally measurable) positive impact on the business.
Engage with IT and build up the approach and justifications before formally submitting it as a proposed project. They are more likely to work with you if you have shown you will work with them.
Key Considerations for Extracting Data:
Separate the extraction of sensitive data from the remaining data (which may not be sensitive). Using the above example, rather than extracting all finance, procurement and HR data together, consider four extracts: procurement, finance, non-sensitive HR and sensitive HR. This allows separation of duties – for example, different teams of people responsible for reviewing the extracted data for monitoring purposes (and hence having access to the data itself before it arrives at its destination)
Ensure that access to the data does not introduce new security vulnerabilities to the source systems – after all, a system is only as strong as its weakest link
Establish the frequency of updates, which then drives estimates of extract sizes and timings. A solution will be far more expensive and difficult to maintain if complete data sets are extracted every night, as opposed to incremental updates. Understanding your data will let you minimise the data sets that require a complete extract each time.
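The incremental approach above is commonly implemented with a “high-water mark”: each run pulls only rows changed since the previous run. A minimal sketch, assuming an illustrative `orders` table with an `updated_at` column (both names are hypothetical, not from any particular system):

```python
# Minimal sketch of incremental extraction using a high-water mark.
# The orders table and updated_at column are illustrative assumptions.
import sqlite3

def extract_incremental(conn, last_watermark):
    """Pull only rows changed since the previous extract."""
    rows = conn.execute(
        "SELECT id, amount, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_watermark,),
    ).fetchall()
    # The new watermark is the latest change seen; persist it for the next run.
    new_watermark = rows[-1][2] if rows else last_watermark
    return rows, new_watermark

# Demo with an in-memory stand-in for the source system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL, updated_at TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, 10.0, "2024-01-01"), (2, 20.0, "2024-01-02"), (3, 30.0, "2024-01-03"),
])
rows, wm = extract_incremental(conn, "2024-01-01")
print(len(rows), wm)  # only the two rows changed after the watermark
```

In practice the watermark must be stored durably between runs, and the change column must be reliably maintained by the source system for this to be safe.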
Extracting From Highly Secure Environments
Requests for access to data are often denied or delayed due to the perception that allowing extracts of its precious data will compromise the source system. The security and governance of the source system and its stakeholders need to be respected. Making security a key requirement of every step of the process will maximise your ability to find a workable solution that is agreeable to all stakeholders.
Firstly, if the data being extracted is the reason the source system is so tightly secured, you need to ensure that the extracted data will be just as tightly secured as the source system itself.
Consider a ‘pull and push’ approach. This involves an extract program that is local to the source system (and therefore within its security perimeter) that ‘pulls’ the data out of the source system into a file that remains within that security perimeter. The files are then ‘pushed’ out of the security zone to another secured safe zone. The key feature is that no new ‘hole’ is created that a rogue process could exploit to pull the data out. The push goes through a one-way gate only, leaving the source system as secure as it was before.
Once in the new safe zone, the data should then be immediately processed and erased so that it cannot be recovered. This concept is illustrated in the diagram below:
Seems too simple? You want simple, because simple works, simple is maintainable, and simple results in a low cost of ownership.
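The ‘pull and push’ flow can be sketched end to end. This is a hedged illustration only: local temporary directories stand in for the source perimeter and the safe zone, and a file move stands in for the one-way push, which in practice would be an outbound transfer such as SFTP or SSH initiated from inside the perimeter:

```python
# Sketch of pull -> push -> process-and-erase. Directories stand in for the
# security zones; the real push would be an outbound SFTP/SSH transfer.
import csv, os, shutil, sqlite3, tempfile

perimeter = tempfile.mkdtemp(prefix="source_zone_")
safe_zone = tempfile.mkdtemp(prefix="safe_zone_")

# 1. Pull: a program local to the source system writes the data to a file
#    that, at this stage, never leaves the source's security perimeter.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE invoices (id INTEGER, total REAL)")
src.executemany("INSERT INTO invoices VALUES (?, ?)", [(1, 99.5), (2, 45.0)])
staged = os.path.join(perimeter, "invoices.csv")
with open(staged, "w", newline="") as f:
    csv.writer(f).writerows(src.execute("SELECT id, total FROM invoices"))

# 2. Push: the file is sent outward through the one-way gate; nothing
#    outside ever reaches in to pull it.
pushed = shutil.move(staged, os.path.join(safe_zone, "invoices.csv"))

# 3. Process immediately in the safe zone, then erase so it cannot linger.
with open(pushed) as f:
    row_count = sum(1 for _ in f)
os.remove(pushed)

print(row_count, os.path.exists(pushed))
```

The shape of the flow is the point: pull inside the perimeter, push outward only, then process and erase in the safe zone.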
This model is useful when the owners of the source system do not want an external program to access the data directly. As the extract programs are local to the source system, they are difficult to compromise and can be controlled by the owners of the source system. For example, if they don’t want extracts at a particular time, they can manage this themselves, rather than asking a different team to suspend direct access.
All pulling of data should be performed through read-only accounts that only have access to the database objects and rows required for the destination system, nothing else. For example, if the source system contains data on the local legal entity as well as the consolidated international legal entity which is not the target of the destination, then the read-only account should (if possible) restrict the access to only the local legal entity. This means that the source system owners can sleep peacefully at night knowing that the destination system cannot inadvertently over-reach. Instead, requests for additional data will need to be made through a formal process that has governance to ensure the correct data is being delivered to authorised users/roles/systems.
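The row restriction described above can be enforced in the database itself. A minimal sketch of the idea, using a SQLite view to mimic the restriction (SQLite has no account system, so this is illustrative only – in a real warehouse you would combine such a view with a read-only account granted SELECT on the view and nothing else; the table and entity names are hypothetical):

```python
# Sketch: restrict an extract account to local-entity rows via a view.
# In a real database the read-only account would be granted access to
# local_ledger only, never to the underlying ledger table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ledger (entity TEXT, amount REAL)")
conn.executemany("INSERT INTO ledger VALUES (?, ?)", [
    ("LOCAL", 100.0), ("INTL", 999.0),
])
# The view exposes only the local legal entity, so the destination
# system cannot inadvertently over-reach into consolidated data.
conn.execute(
    "CREATE VIEW local_ledger AS SELECT * FROM ledger WHERE entity = 'LOCAL'"
)
rows = conn.execute("SELECT entity, amount FROM local_ledger").fetchall()
print(rows)  # the international entity's rows are invisible
```

Widening access then becomes a deliberate change to the view or grants, which naturally flows through the formal governance process described above.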
Trusted External Program Access
This is a program that is external to the source system (i.e. resides outside of its security perimeter) that performs both the extract of the data and the posting of it to the destination system.
The same considerations apply to this model as to the previous. The process is more direct as all steps are performed by a single dedicated program. This concept is illustrated in the diagram below:
There are some additional complexities that need to be considered as part of this model:
It is paramount to understand whether the tools used in this model store data in intermediate locations. Ideally there should be none, as any intermediate store introduces another attack surface and further points of vulnerability.
There should also be no logging of data. Modern tools are capable of significant logging; an unintended consequence of an overzealous monitoring design could be your payroll data residing in Splunk.
Failure behaviour needs to be understood. Failure to get data from the source system is not serious because, by definition, there is no data to compromise. But what if the data is obtained and the posting to the destination fails? What happens then? The data must be dropped with no trace. Whilst metadata on successes and failures must be kept so that the same data can be re-extracted later, no remnants of the failure should be left behind that violate the privacy of the extracted data.
Whilst there are fewer steps, this is not necessarily more secure. This model enables an external program with the correct credentials to extract data. This does not mean that the source system is automatically vulnerable, but it does mean that special consideration and design are needed to ensure this does not result in increased vulnerability.
Counter this by having dedicated accounts for access at both the server and database levels, and monitor technical fingerprints to ensure it is the correct external program that is using these accounts. A common scenario is that the administrators who create these accounts then use them for their own access to perform particular tasks, with management completely unaware that the technical team has full access to the sensitive data that great effort has gone into securing.
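The fail-safe behaviour described above can be sketched as follows. This is a hedged illustration, not a production implementation: on any failure the extracted payload is erased with no trace, while non-sensitive metadata is kept so the run can be retried later (the function and variable names are all hypothetical):

```python
# Sketch of fail-safe extract-and-post: on destination failure the payload
# is erased with no trace; only non-sensitive metadata survives for retry.
import os, tempfile

run_log = []  # metadata only: statuses and counts, never payload data

def extract_and_post(data, post):
    tmp = tempfile.NamedTemporaryFile(mode="w", delete=False, suffix=".csv")
    try:
        tmp.write("\n".join(data))
        tmp.close()
        post(tmp.name)  # may raise if the destination is unreachable
        run_log.append({"status": "ok", "rows": len(data)})
    except Exception:
        run_log.append({"status": "failed", "rows": len(data)})  # retry later
        raise
    finally:
        if os.path.exists(tmp.name):
            os.remove(tmp.name)  # no remnants are left behind either way

def failing_post(path):
    raise ConnectionError("destination unreachable")

try:
    extract_and_post(["row1", "row2"], failing_post)
except ConnectionError:
    pass

print(run_log[0]["status"], len(run_log))
```

Note that the log records how many rows were involved and whether the run succeeded – enough to drive a re-extract – but never the data itself.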
When should you use this model? The main drivers are a high frequency of updates across a wide range of data.
This is also often the model used for performing a transform of data as part of a traditional ETL (Extract, Transform, Load) process. This is covered in our next post, “When to Transform Data”.
There are other hybrid approaches beyond the two conceptual models shown; however, these are the main approaches adopted by organisations of all sizes. The ‘pull and push’ model can be built bespoke using technologies like bash scripts, PowerShell scripts, SQL, Python, Secure FTP and SSH, whereas the ‘trusted external program’ model is often the domain of third-party tools that have a specific way of extracting and loading data – and may present both advantages and disadvantages depending upon your requirements.
There is no right or wrong model to follow. However, the security of the data must be a key consideration at every step of the ETL process, regardless of which model you adopt. We hope that this introduction to extracting data gives you food for thought about what your requirements are and how you could address them in as simple a fashion as possible, to maximise the chances of success.
If you want to discuss your needs further and pick our brains on how you might approach extracting your data, please reach out to us at our contact page or email@example.com