Data Gap Identification and Exploratory Analysis

Research question: Is there a correlation or causal relationship between school pupil withdrawals and school transport pickup points, or the distance travelled?


The school under research is an Independent Boarding and Day school of 820 pupils who are aged between 11-18, located in Gloucestershire. They operate in a very busy market that has not only five competing similar schools within the local area, but also several accomplished state schools and other colleges. In this increasingly competitive marketplace, we are observing an increasing trend in those leaving post GCSEs, and heightened competitiveness in the day pupil market.


The organisation collects and holds vast volumes of data, both financial and non-financial, in various sizes and formats; however, this is highly fragmented and has made the implementation of a robust data management framework extremely difficult - this is extremely common, and traditionally there has always been a lack of joined-up interaction between key business units. For this organisation, this includes finance, admissions, alumni, and academic staff – none of which are effectively coordinated.

They have both internal (Admissions, social media, financial) and external (Independent Schools Council, league tables) data sources at their disposal.  Financial reports and datasets from the Finance Office can help identify, for example, a pupil’s ‘lifetime value’, which in turn, can identify the profit generated from a successful retention drive. However, most CLV calculations do not factor in the cost of acquisition. It is clear from the previous section and figure 1, that whilst there are vast amounts of stored data, some of this is not linked and, in some cases, does not exist.

Mapping Between Business Functions and Data Sources

Figure 1 - Mapping Between Business Functions and Data Sources

At present, Admissions data (parental and pupil) that is captured from the school website and via other means is inputted manually into the Admissions software, which in turn is fed to the Finance department and the school MIS. For some time, it has been recognised that the school's data is somewhat 'siloed.' Siloing happens in companies when data is acquired by different teams and merely kept without regard for its potential utility to another team.

It is apparent from the data ecosystem that it is not fully understood why day pupils choose to withdraw from the Admissions process, or why they leave at the end of year 11 (post-GCSE). Table 1 outlines the key data sources identified:

Identified Data Gaps

Table 1 - Identified Data Gaps

A data gap analysis is similar to a functional behavioural analysis used to determine the underlying causes and triggering elements of a person's behaviour, enabling an organisation to understand the nature of its data usage. With the assistance of this research, we are able to ascertain the reason(s) for creating and utilising data that is either insufficient for the school’s demands, or where potential gaps might exist.


Some of the key tasks for anyone responsible for an organisation’s data strategy are ensuring data performance, quality, integrity, assurance, and provenance, plus data security (with GDPR in mind). Protection against data hazards such as privacy and bias lies in comprehensive data governance policies.

Developing a Big Data Infrastructure is not trivial. It requires a significant amount of science and technology, but the appropriate application also requires art. However, as is the case with many established crafts, there is a well-understood technique for giving the problem structure, allowing for acceptable consistency, reproducibility, and objectivity. The Cross Industry Standard Process for Data Mining provides an excellent codification of this process.

It is argued that CRISP-DM has not been updated or adjusted for the challenges of Big Data and modern data analytics. The 6 overarching aspects of CRISP-DM, however, are regarded as an excellent model for the analytics and data mining process.

CRISP-DM Methodology (Provost and Fawcett 2013, p.242)

Figure 2 - CRISP-DM Methodology (Provost and Fawcett 2013, p.242)

Having very specific objectives in mind enables an organisation to get the maximum value from data. That is why the data-driven decision-making process always begins with the same step: defining the critical business questions that drive business decisions. An organisation should first ask ‘what do you need to know to be able to achieve your strategic goals’? Whilst there will always be some overlap in the questions asked, for us, it is apparent that ours will be one of service optimisation. The following business decision and the related question in table 2 has been chosen based on the available data sources.

Business Decision Selected for Analysis

Table 2 - Business Decision Selected for Analysis

Data Preparation

The initial sets of data are ‘reasonably’ structured, in the sense that they reside in an organised fashion within their respective databases. Even though the data is reasonably structured but seemingly unrelated, we need to take an interim step, toward a contingency. To do this we will create pivot tables in Excel, which conveniently summarises even massive data sets into readable, digestible tables. This is not the only reason. The summarised data sets make the application and troubleshooting of future models far simpler. The initial export from the Admissions database, contains approximately 4500 records, across 23 columns.

Data Export

Table 3 - Initial Data Export

Filtered Data

Table 4 - Filtered Data

Pivoted Data Tables

Figure 3 - Pivoted Data Tables

Data Modelling

Having compared six different models, it was decided to conduct our inferential analysis using linear regression. This will be completed in Excel, using the Data Analysis tool-Pak, which is an add-on feature. The justification for this decision is as follows:

  • Since the filtration process, we have an arguably small dataset
  • Our dataset is quantitative in nature
  • Our dataset is measured at a continuous level (pickups/withdrawals)
  • Regression gives us an understanding of the strength of relationships
  • Regression offers some statistical significance
  • Regression can offer confidence levels for each coefficient under analysis

The  R2 number is always between 0 and 1, or 0 and 100%; 0 indicates that the model describes no variability around the mean, or essentially no relationship. The higher the number, the stronger the relationship. Taking a step back, if we were to run the data surrounding the average distance travelled and the withdrawal rate, we observe that R2 is a comparatively low number. 

Distance vs Withdrawals

Figure 4 - Distance vs Withdrawals

By extreme contrast, the visualisation of Pickups by Withdrawals paints a very different picture compared to Distance vs Withdrawals. We can see, clearly, the indicated trend. R2 in this case is 0.86, which can also be interpreted as 86%. Another statistic to consider is Significance F. This indicates the probability that the chosen model is incorrect; in this case, the number is 0.34 (extremely low).

Pickups vs Withdrawals

Figure 5 - Pickups vs Withdrawals

Descriptive Statistics

Table 5 - Descriptive Statistics


We have ascertained through this report that ultimately there is no real relationship between the distance of travel to school, and the number of withdrawals across any year group. Despite the vague nature of the reasons for withdrawal, which is a separate business issue that will need to be addressed, we have clearly demonstrated that there is an actionable connection and relationship between the number of pickup points on our transport routes and the rate of withdrawal.

It is important to recognise the Customer Lifetime Value of each individual pupil, as this will have an influence on Senior Management decisions about strategic and operational spending. This can be calculated using the following formula:

  • m = constant net margin (profits – retention costs)
  • k = discount rate
  • d = constant defection rate

We are able to calculate the pupil/parent lifetime value (CLV) of an individual as the sum of all future term payments less the total cost of facilitation. The initial computation does not take into account the cost of recruiting a pupil, which we would include to improve the model's accuracy. Additionally, the fundamental CLV computation is limited in a number of ways. It makes a lot of assumptions about the customer's lifespan. Margins and retention rates differ.

Accordingly, a decision was taken to improve and enhance the school transport fleet in order to accommodate more pickup points. We have demonstrated that whilst this will have an influence on lowering the withdrawal rate of day pupils, it is wholly justifiable when considering the pupil/parent lifetime value.


In conclusion, pupil withdrawal is a complex issue that impacts both students and teachers, and understanding the factors contributing to this phenomenon is crucial for improving student outcomes. The exploratory analysis of pupil withdrawal conducted in this study provides valuable insights into the potential 'operational' causes of pupil withdrawal, but further research is needed to validate these findings and develop effective interventions to address this issue. Overall, continued research in this area could lead to improved retention, increased student engagement, and better educational outcomes.

Recommendations for Further Study

Recommendations for further study in the area of pupil withdrawal exploratory analysis could include expanding the study's sample size to increase the generalisability of the findings. Further research could also investigate the impact of additional transport routes on pupil withdrawal. Additionally, it would be valuable to examine the role of pupil characteristics, such as age and gender, in pupil withdrawal. Further studies could also analyse the effectiveness of operational interventions to reduce pupil withdrawal and promote student retention.