Guidance

Choose tools and infrastructure to make better use of your data

How to choose data tools and infrastructure that are flexible, scalable, sustainable and secure.

This guidance will help you choose the software and standards for working with data. Read this together with the Data Ethics Framework and the Technology Code of Practice.

There are 4 main areas to consider:

  1. Choose analytical tools that keep pace with user needs.
  2. Use the cloud for the whole development cycle.
  3. Use appropriate security when using data in the cloud.
  4. Choose open data standards for better interoperability.

1. Choose analytical tools that keep pace with user needs

As data analysis and data science evolve, you should choose tools and techniques that can adapt and support best practices, including:

  • keeping up to date with innovations that can improve statistics and data
  • improving data presentation for users
  • testing and releasing new official statistics

Government workers responsible for providing or procuring software for data analysis should choose a system that is flexible enough to use with a variety of tools and connect to a range of data sources and architectures.

You should work iteratively, adding value with each change. If you make up-front long-term decisions on one type of software, you risk being unable to meet evolving user needs.

Choosing open source languages

Data scientists and analysts often use common open source languages such as Python and R. The benefits of using these languages include:

  • good support and open training - which means reduced training costs
  • new data analytics methods
  • the ability to create your own methods using open source languages

The R and Python communities develop large collections of packages for data analysis. Package repositories such as CRAN for R and PyPI for Python provide extensive analytical functionality.
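
For example, a few lines of Python using the pandas package from PyPI can load and summarise a dataset. This is a minimal sketch; the file name is hypothetical:

    # summarise.py - a minimal sketch using the pandas package from PyPI
    import pandas as pd

    # Load a tabular dataset (the file name is hypothetical)
    df = pd.read_csv("survey_results.csv")

    # Print summary statistics for every numeric column
    print(df.describe())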

Choosing tools that work with open technology

Choosing tools which work with open technology supports robust and appropriate data analysis, as set out in Principle 5 of the Data Ethics Framework.

Tools which work with open technology, such as Docker or Apache Spark, give your team the flexibility to meet your users' needs. Open tools are usually designed to work together and across vendors. Benefits include the ability to:

  • script a data pipeline using the best software for each task, as sketched after these lists
  • run your code anywhere - using commodity container platforms, platform as a service or a Hadoop cluster

Other benefits include better:

  • collaboration
  • support for software engineering practices
  • capabilities in big data and machine learning
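
As an illustration of scripting a pipeline in this way, a short Python script can act as the glue between tools: reading raw data, transforming it, and writing an open columnar format that Spark and other tools can consume. The file and column names below are hypothetical:

    # pipeline.py - a minimal pipeline sketch; file and column names are hypothetical
    import pandas as pd

    def run_pipeline(source: str, destination: str) -> None:
        # Extract: read the raw data
        df = pd.read_csv(source)

        # Transform: drop incomplete rows and aggregate by a hypothetical 'region' column
        summary = df.dropna().groupby("region", as_index=False).sum(numeric_only=True)

        # Load: write Parquet, an open columnar format (requires the pyarrow package)
        summary.to_parquet(destination)

    if __name__ == "__main__":
        run_pipeline("raw_data.csv", "summary.parquet")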

You can achieve better quality assurance in your software development with continuous integration and unit tests.
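
For example, a unit test that a continuous integration service runs on every change might look like the following sketch; the analysis function is a hypothetical stand-in:

    # test_analysis.py - a minimal unit test sketch, runnable with pytest
    def proportion_missing(values):
        # Hypothetical analysis helper: the share of missing (None) entries
        return sum(v is None for v in values) / len(values)

    def test_proportion_missing():
        assert proportion_missing([1, None, 3, None]) == 0.5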

If you use spreadsheets and business intelligence software you should be aware they:

  • often do not scale well to large datasets or intensive computation
  • often do not integrate well into automated pipelines, or support best practices in quality assurance
  • often need paid-for licences, making them expensive to deploy in the cloud

Case study - using data science with the Ministry of Justice analytical platform

The Ministry of Justice Analytical Platform supports the latest data analysis tools. It allows easy integration of new open source software and leading cloud services into a platform for 300 staff in the data analysis professions.

The platform is a flexible and secure environment, where:

  • analysts use a web browser to sign in once and then develop code in tools such as R and Python
  • you can access data and create live charts and dashboards that are accessible to end users in a web browser with no special software or licences
  • it runs software using standardised containers on a container orchestration platform, which allows the platform to run any of the latest open source tools, data stores and custom-built data analysis components
  • you can add innovative services such as a new graph database or machine learning framework
  • you can process datasets of almost unlimited size at low cost

The platform has helped the Ministry of Justice produce several national statistics more reliably and efficiently by using reproducible analytical pipelines.

2. Use the cloud for the whole development cycle

In most circumstances, you should store data in the cloud and code in cloud-based version control repositories. You should run live data products in the cloud, as set out in the government's Cloud First policy, and you should use the cloud throughout the whole development cycle.

Keeping your data in the cloud

You can use cloud services for data analysis work. It's usually more efficient to use software-as-a-service and only pay for what you use, rather than setting up and running your own cluster for data. With cloud services it's important to be alert to supplier 'lock-in' and always consider the cost of switching to another supplier.

The benefits of storing your data in the cloud are that:

  • it scales well to large quantities of data that would not comfortably fit on a user's machine
  • you can take advantage of cloud-scale databases to process complicated queries in a reasonable time-frame
  • you can use it for all stages from exploration through to production systems
  • it's simpler to combine different datasets from your organisation
  • it's usually the cheapest option, due to commoditisation and pay-as-you-go pricing, but evaluate this against your own needs

Rather than sharing data files by email, the cloud enables data sharing by sending a link. This is a better practice because it helps you:

  • control and monitor access to the data
  • maintain connection to the original source data so you can avoid duplication and poor version control
  • get reports with live updates
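
For example, on Amazon Web Services you could share a time-limited link to a file held in S3 rather than emailing a copy. This sketch uses boto3, the AWS SDK for Python, and assumes configured credentials; the bucket and key names are hypothetical:

    # share_link.py - a minimal sketch using boto3 (the AWS SDK for Python)
    import boto3

    s3 = boto3.client("s3")

    # Generate a link that expires after one hour; bucket and key are hypothetical
    url = s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": "analysis-data", "Key": "reports/2019-06.csv"},
        ExpiresIn=3600,
    )
    print(url)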

When using data in the cloud make sure that your data is accessible through a stable URL or other endpoint, as this will help you to make reproducible analysis code.
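
For instance, pandas can read directly from a stable URL, so the same script produces the same analysis wherever it runs. The URL below is hypothetical:

    # reproducible_read.py - reading from a stable endpoint (the URL is hypothetical)
    import pandas as pd

    DATA_URL = "https://data.example.gov.uk/population.csv"
    df = pd.read_csv(DATA_URL)
    print(df.head())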

Maintaining cloud-based version control

To maintain cloud-based version control and support collaboration, you should:

  • use a cloud-hosted repository, such as GitHub, to create pull requests
  • peer review code on a regular basis, to make sure you maintain the appropriate quality and keep all stakeholders up to date with any changes
  • share code outside your team and organisation
  • manage a list of issues
  • encourage reviews and invite comment

Reproducibility with the cloud

Cloud-based version control allows you to run automatic tests, which help you to make data analysis 'reproducible'.

You should aim to make your data analysis reproducible so it's easy for someone else to work with. For example, you could share your code and data so that someone else can run your model on another computer at a different time and get the same results. This is important because someone can:

  • check how your data analysis works
  • test your data analysis with different queries
  • run the analysis on a different dataset, or build on the analysis

Data analysts can make their data analysis reproducible by:

  • writing code that runs their analysis, rather than doing analysis through a series of manual steps, such as manual clicks in a graphical user interface
  • using the cloud for storing their data
  • setting up continuous integration and automated testing on all users' platforms
  • specifying the library dependencies as well as their version numbers

It's standard practice to specify dependencies and automate testing throughout software development. Unless you're doing quick, throw-away experiments you should aim to make all your code reproducible.
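
For example, a Python project might record its dependencies in a requirements file; the packages and version numbers below are illustrative:

    # requirements.txt - illustrative pinned dependencies
    pandas==1.0.3
    numpy==1.18.2
    matplotlib==3.2.1

You can generate this file from a working environment with 'pip freeze > requirements.txt' and recreate the environment elsewhere with 'pip install -r requirements.txt'.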

Using a cloud development environment

Teams using a data development environment running on a shared cloud platform benefit from a more streamlined process to onboard new users. This is because the team:

  • does not need to install software on individual machines that would demand maintenance and updates (a benefit of software as a service)
  • can install code libraries across all users' environments, which makes the code easier to share and reproduce
  • is not tied to a particular corporate network, enabling users and collaboration from outside the organisation

Using a cloud environment for data development also means that you:

  • often have easy access to other cloud data services and cloud-hosted data due to the platform's built-in credentials
  • can decide which software to install
  • can decide who is best placed to install software, such as using a platform team who understand analysts' needs
  • are less likely to see users keeping local copies of the data on their laptop for development, especially if the data is also in the cloud

Teams using a development environment on local machines might risk:

  • not having administrator access for security reasons, which will prevent installation of development software
  • having longer installation times for Python or R and their libraries
  • having less access to cloud-hosted data which might cause users to create workarounds, such as using email to circulate data

Sometimes, your data analyst may prefer to use their own custom environment. Where practical, you should aim to be flexible and try to replicate the essential elements of the cloud environment on their local machine.

The cloud environment offers a baseline of libraries, but as soon as you need more libraries you should specify all the dependencies and their version numbers, to make sure work is still reproducible.

3. Use appropriate security when using data in the cloud

The government's approach to security in the cloud is set out in guidance from the National Cyber Security Centre (NCSC), including its cloud security principles. NCSC states that the commercial public cloud is an acceptable place to put OFFICIAL data.

NCSC considers the cloud to have acceptable security because:

  • there is less information on end user devices
  • the supplier applies regular upgrades and security patches
  • the supplier often has rigorous methods to audit data, and control access and monitoring

Whether you're procuring SaaS or developing your own solution for a platform of tools and services, you should put in place mitigations such as:

  • data encryption - see the sketch after this list
  • single sign-on
  • two-factor authentication (2FA)
  • fine-grained access control
  • usage monitoring and alerts
  • timely patching
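
As a simple illustration of one of these mitigations, encryption at rest, the cryptography package from PyPI can encrypt a file before it is stored. This is a sketch only; in practice the key would be held in a cloud provider's key management service, and the file name is hypothetical:

    # encrypt_file.py - a minimal sketch using the 'cryptography' package from PyPI
    from cryptography.fernet import Fernet

    # Generate a key; in production this would come from a key management service
    key = Fernet.generate_key()
    fernet = Fernet(key)

    # Encrypt a hypothetical data extract before storing it
    with open("extract.csv", "rb") as f:
        ciphertext = fernet.encrypt(f.read())

    with open("extract.csv.enc", "wb") as f:
        f.write(ciphertext)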

Other security challenges for data analysts include developing code on a platform with:

  • real data
  • internet access

When platforms have internet access and hold real data, threat actors or attackers may try to steal or alter the data. Also, there is a greater risk of an accidental real data leak.

You should integrate security controls and monitoring with the data and network flows. This should be proportionate to the risks faced in experimental, collaborative and production environments.

Balance security choices with user needs

Security should protect data, but not stop users from accessing the data they need for their work. The Service Manual has guidance on securing information for government services.

You should build security into a system so it's as invisible to the user as possible. Adding complicated login procedures, and restricting access to the tools users need, does not make your security better. Restrictive security makes workarounds more likely, with users avoiding security measures and finding their own, less secure solutions.

Case study - using Ministry of Justice data in the public cloud

There is a government policy supporting the use of the cloud for personal and sensitive data. Most UK departments have assessed the risks, put in appropriate safeguards and moved sensitive data into the public cloud.

An example of this is from the Ministry of Justice who moved their prisoner data into the public cloud. This data has an OFFICIAL classification and often the 'SENSITIVE' handling caveat. It includes information such as health records and the security arrangements for prisoners.

The project team makes sure the appropriate security is used, such as:

  • careful isolation between elements using cloud sub-accounts, Virtual Private Clouds (VPCs) and firewall rules
  • fine-grained user and role permissions
  • users logging in with two-factor authentication (2FA)
  • being able to quickly revoke or rotate secrets, encryption keys and certificates
  • frequent and reliable updates using peer-review and continuous deployment
  • extensive audit trails

Hosting the data in the cloud has enabled the Ministry of Justice to perform additional analysis using modern open source tools and scalable computing resources through its Analytical Platform.

It's possible to achieve this level of security and functionality with a private data centre, but it would require a huge investment in hardware, software and expert staff to design and maintain. You can reduce these issues by using the public cloud and taking advantage of the continuous investment and developments made by the suppliers.

4. Choose open data standards for better interoperability

An open data standard specifies a way of formatting and storing data. This can make data compatible with a wide range of tools in a predictable fashion, and prevents lock-in to proprietary tools. Open standards allow organisations to:

  • share information even when they do not have access to the same tools
  • replace their tools and still have access to their data
  • make a strategic decision to provide an agile environment that changes with the needs and capabilities of the users
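
For example, writing results in open formats such as CSV and JSON keeps them readable by any tool; the data below is hypothetical:

    # open_formats.py - writing hypothetical results in open formats with pandas
    import pandas as pd

    df = pd.DataFrame({"region": ["North", "South"], "count": [120, 95]})

    # Both formats are open standards, readable without proprietary software
    df.to_csv("results.csv", index=False)
    df.to_json("results.json", orient="records")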

The Open Standards Board selects open standards for use by government.

Examples of open standards include CSV for tabular data and JSON for exchanging structured data.

Published 1 July 2019