Very recently I was tasked with analyzing leaves taken by employees at a client’s company. In particular, I need to understand whether an employee has taken leave in a given period, ultimately setting a benchmark for measuring employees’ compliance with return to office policies.
I was provided with the following two leaves datasets:
- Time Off Data (“Dataset A”) which sets out shorter term leaves such as annual leave or sick leave taken by an employee. These leaves were unique at a per employee per date level (i.e. each row in the dataset represents a day of leave taken by a particular employee).
- Leave of Absence Data (“Dataset B”) which sets out the from and to dates of longer term leaves taken by an employee. Some examples of these leaves are parental leave, maternity leave, leave without paid and career break. This dataset is on a ‘leave-as-you-go’ basis for employees who have taken longer term leaves, in that each row represents a date range for an employee, who may appear in multiple rows in the dataset with multiple date ranges (e.g. employees may prefer to take parental leave in batches of 3 days a week for 30 weeks, which would appear as 30 date ranges across 30 rows in the dataset).
The two datasets augment each other as an employee could take both shorter and longer term leaves in a given period.
For my analysis, I would like to merge the two datasets into a common format so that all leaves by a particular employee are accounted for. Since Dataset A is already of a structured tabular format at a per date per employee level, I needed to convert Dataset B into a similar format as shown in the image below (for a dataset I created for the purpose of demonstration).
The transformation illustrated in image 1 presented a number of challenges given it’s not ‘one-to-one’ and it involves dates. To achieve the transformation, I…