ptrail.preprocessing package
Submodules
ptrail.preprocessing.filters module
The filters module contains several data filtering functions, such as filtering the data by time, by date, by proximity to a point, and several others.
- class ptrail.preprocessing.filters.Filters[source]
Bases:
object
- static filter_by_bounding_box(dataframe: PTRAILDataFrame, bounding_box: tuple, inside: bool = True)[source]
Given a bounding box, select all the points that lie inside (or outside) the bounding box and return a dataframe containing the selected points.
- Parameters:
dataframe (PTRAILDataFrame) – The dataframe from which the data is to be filtered out.
bounding_box (tuple) – The bounding box which is to be used to filter the data.
inside (bool) – Indicate whether the data inside the bounding box (True, the default) or the data outside it (False) is required.
- Returns:
The filtered dataframe.
- Return type:
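The inside/outside selection described above can be sketched in plain Python. This is only an illustration, not the library's implementation: the function name `filter_points_by_bbox` is hypothetical, and the tuple order `(min_lat, min_lon, max_lat, max_lon)` is an assumption, since the documentation does not state the order of the bounding box values.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (latitude, longitude)


def filter_points_by_bbox(points: List[Point],
                          bounding_box: Tuple[float, float, float, float],
                          inside: bool = True) -> List[Point]:
    """Keep the points inside the box (inside=True) or outside it (inside=False)."""
    # ASSUMPTION: the box is ordered (min_lat, min_lon, max_lat, max_lon).
    min_lat, min_lon, max_lat, max_lon = bounding_box

    def in_box(point: Point) -> bool:
        lat, lon = point
        return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon

    # A point is kept when its membership matches the requested side.
    return [p for p in points if in_box(p) == inside]
```

Calling the sketch with `inside=False` returns the complement of the default result, so the two calls together always partition the input points.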
- static filter_by_date(dataframe: PTRAILDataFrame, start_date: Optional[str] = None, end_date: Optional[str] = None)[source]
Filter the dataset by a user-given date range.
Note
The following options are to be noted for filtering the data:
1. If neither start_date nor end_date is given, then the entire dataset is returned.
2. If only start_date is given, then the trajectory data on or after the start date is returned.
3. If only end_date is given, then the trajectory data on or before the end date is returned.
4. If both start_date and end_date are given, then the data between start_date and end_date (both included) is returned.
- Parameters:
dataframe (PTRAILDataFrame) – The dataframe that is to be filtered.
start_date (Optional[Text]) – The start date from which the points are to be filtered.
end_date (Optional[Text]) – The end date before which the points are to be filtered.
- Returns:
The filtered dataframe containing the resultant data.
- Return type:
- Raises:
ValueError: – When the start date is later than the end date.
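The four cases in the note, together with the ValueError condition, can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's code: the helper `filter_dates` is hypothetical, it works on a plain list of `datetime.date` values, and it assumes the date strings are in ISO format (YYYY-MM-DD), which the documentation does not specify.

```python
from datetime import date
from typing import List, Optional


def filter_dates(dates: List[date],
                 start_date: Optional[str] = None,
                 end_date: Optional[str] = None) -> List[date]:
    """Keep the dates that fall in the inclusive range [start_date, end_date]."""
    # ASSUMPTION: the bounds are ISO-formatted strings such as "2008-10-23".
    start = date.fromisoformat(start_date) if start_date else None
    end = date.fromisoformat(end_date) if end_date else None

    if start is not None and end is not None and start > end:
        raise ValueError("start_date is later than end_date")

    # A missing bound places no restriction on that side of the range.
    return [d for d in dates
            if (start is None or d >= start) and (end is None or d <= end)]
```

With both bounds omitted the comprehension keeps every element, which matches case 1 of the note.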
- static filter_by_datetime(dataframe: PTRAILDataFrame, start_dateTime: Optional[str] = None, end_dateTime: Optional[str] = None)[source]
Filter the dataset by a user-given time range.
Note
The following options are to be noted for filtering the data:
1. If neither start_dateTime nor end_dateTime is given, then the entire dataset is returned.
2. If only start_dateTime is given, then the trajectory data at or after the start datetime is returned.
3. If only end_dateTime is given, then the trajectory data at or before the end datetime is returned.
4. If both start_dateTime and end_dateTime are given, then the data between start_dateTime and end_dateTime (both included) is returned.
- Parameters:
dataframe (PTRAILDataFrame) – The dataframe that is to be filtered.
start_dateTime (Optional[Text]) – The start dateTime from which the points are to be filtered.
end_dateTime (Optional[Text]) – The end dateTime before which the points are to be filtered.
- Returns:
The filtered dataframe containing the resultant data.
- Return type:
- Raises:
ValueError: – When the start datetime is later than the end datetime.
- static filter_by_max_consecutive_distance(dataframe, max_distance: float)[source]
Remove the points for which the distance between 2 consecutive points is greater than a user-specified value.
Note
max_distance is given in metres.
- Parameters:
dataframe (PTRAILDataFrame) – The dataframe which is to be filtered.
max_distance (float) – The consecutive distance threshold above which the points are to be removed.
- Returns:
The filtered dataframe.
- Return type:
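A consecutive-distance filter of this kind can be sketched with the haversine formula. The sketch below is an assumption-laden illustration rather than the library's method: the names `haversine_m` and `drop_far_points` are hypothetical, a spherical Earth of radius 6,371,000 m is assumed, and a point is dropped when its distance from the preceding point in the original sequence exceeds the threshold (the first point, which has no predecessor, is kept).

```python
import math
from typing import List, Tuple

EARTH_RADIUS_M = 6_371_000.0  # ASSUMPTION: mean radius of a spherical Earth


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def drop_far_points(points: List[Tuple[float, float]],
                    max_distance: float) -> List[Tuple[float, float]]:
    """Drop every point that lies more than max_distance metres from its predecessor."""
    kept = points[:1]  # the first point has no predecessor, so it is kept
    for prev, cur in zip(points, points[1:]):
        if haversine_m(*prev, *cur) <= max_distance:
            kept.append(cur)
    return kept
```

The minimum-distance variant documented further below would follow the same pattern with the comparison reversed.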
- static filter_by_max_distance_and_speed(dataframe, max_distance: float, max_speed: float)[source]
Filter out values that have a distance between consecutive points greater than a user-given distance and a speed between consecutive points greater than a user-given speed.
Note
The max_distance is given in metres.
Note
The max_speed is given in metres/second (m/s).
- Parameters:
dataframe (PTRAILDataFrame) – The dataframe which is to be filtered.
max_distance (float) – The maximum distance between 2 consecutive points.
max_speed (float) – The maximum speed between 2 consecutive points.
- Returns:
The filtered dataframe.
- Return type:
pandas.DataFrame
- static filter_by_max_speed(dataframe: PTRAILDataFrame, max_speed: float)[source]
Remove the data points which have a speed greater than a user-given speed.
Note
The max_speed is given in the units m/s (metres per second).
- Parameters:
dataframe (PTRAILDataFrame) – The dataframe which is to be filtered.
max_speed (float) – The speed threshold above which the points are to be removed.
- Returns:
The PTRAILDataFrame containing the filtered data.
- Return type:
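The speed threshold can be illustrated with a small sketch in which speed is the distance to the previous point divided by the elapsed time. Everything here is hypothetical and not taken from the library: the helper names, the convention that `distances_m[i]` holds the distance from point `i - 1` to point `i`, and the choice to keep points whose speed is undefined (the first point, or a zero time gap).

```python
from datetime import datetime
from typing import List, Optional


def speeds_mps(times: List[datetime],
               distances_m: List[float]) -> List[Optional[float]]:
    """Speed in m/s at each point; None where it cannot be computed."""
    if not times:
        return []
    speeds: List[Optional[float]] = [None]  # first point has no predecessor
    for i in range(1, len(times)):
        elapsed = (times[i] - times[i - 1]).total_seconds()
        speeds.append(distances_m[i] / elapsed if elapsed > 0 else None)
    return speeds


def keep_at_or_below_max_speed(rows: list,
                               speeds: List[Optional[float]],
                               max_speed: float) -> list:
    """Drop rows whose speed exceeds max_speed (m/s)."""
    # ASSUMPTION: rows with an undefined speed are kept.
    return [row for row, speed in zip(rows, speeds)
            if speed is None or speed <= max_speed]
```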
- static filter_by_min_consecutive_distance(dataframe, min_distance: float)[source]
Remove the points for which the distance between 2 consecutive points is less than a user-specified value.
Note
min_distance is given in metres.
- Parameters:
dataframe (PTRAILDataFrame) – The dataframe which is to be filtered.
min_distance (float) – The consecutive distance threshold below which the points are to be removed.
- Returns:
The filtered dataframe.
- Return type:
- static filter_by_min_distance_and_speed(dataframe, min_distance: float, min_speed: float)[source]
Filter out values that have a distance between consecutive points less than a user-given distance and a speed between consecutive points less than a user-given speed.
Note
The min_distance is given in metres.
Note
The min_speed is given in metres/second (m/s).
- Parameters:
dataframe (PTRAILDataFrame) – The dataframe which is to be filtered.
min_distance (float) – The minimum distance between 2 consecutive points.
min_speed (float) – The minimum speed between 2 consecutive points.
- Returns:
The filtered dataframe.
- Return type:
- static filter_by_min_speed(dataframe, min_speed: float)[source]
Remove the data points which have a speed less than a user-given speed.
Note
The min_speed is given in the units m/s (metres per second).
- Parameters:
dataframe (PTRAILDataFrame) – The dataframe which is to be filtered.
min_speed (float) – The speed threshold below which the points are to be removed.
- Returns:
The PTRAILDataFrame containing the filtered data.
- Return type:
- static filter_by_traj_id(dataframe: PTRAILDataFrame, traj_id: str)[source]
Extract all the trajectory points of a particular trajectory specified by the trajectory’s ID.
- Parameters:
dataframe (PTRAILDataFrame) – The dataframe on which the filtering by ID is to be done.
traj_id (Text) – The ID of the trajectory which is to be extracted.
- Returns:
The dataframe containing all the trajectory points of the specified trajectory.
- Return type:
pandas.core.dataframe.DataFrame
- Raises:
MissingTrajIDException: – This exception is raised when the Trajectory ID given by the user does not exist in the dataset.
- static filter_outliers_by_consecutive_distance(dataframe: PTRAILDataFrame)[source]
Check the outlier points based on distance between 2 consecutive points. Outlier formula:
Lower outlier = Q1 - (1.5*IQR)
Higher outlier = Q3 + (1.5*IQR)
IQR = Inter quartile range = Q3 - Q1
We need to find points between lower and higher outlier
- Parameters:
dataframe (PTRAILDataFrame) – The dataframe which is to be filtered.
- Returns:
The dataframe which has been filtered.
- Return type:
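The outlier formula above can be worked through with the standard library. This is a sketch of the general IQR rule, not the library's code: the function names are hypothetical, and the quartiles are computed with `statistics.quantiles(..., method="inclusive")`, which is an assumption about the quantile definition since the documentation does not say how Q1 and Q3 are obtained.

```python
import statistics
from typing import List, Tuple


def iqr_bounds(values: List[float]) -> Tuple[float, float]:
    """Return (lower outlier bound, higher outlier bound) using the 1.5*IQR rule."""
    # ASSUMPTION: quartiles use the "inclusive" interpolation method.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def keep_within_iqr(values: List[float]) -> List[float]:
    """Keep only the values lying between the lower and higher outlier bounds."""
    lower, higher = iqr_bounds(values)
    return [v for v in values if lower <= v <= higher]
```

The same two helpers apply unchanged whether the input values are consecutive distances or consecutive speeds.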
- static filter_outliers_by_consecutive_speed(dataframe)[source]
Check the outlier points based on speed between 2 consecutive points. Outlier formula:
Lower outlier = Q1 - (1.5*IQR)
Higher outlier = Q3 + (1.5*IQR)
IQR = Inter quartile range = Q3 - Q1
We need to find points between lower and higher outlier
- Parameters:
dataframe (PTRAILDataFrame) – The dataframe which is to be filtered.
- Returns:
The dataframe which has been filtered.
- Return type:
- static get_bounding_box_by_radius(lat: float, lon: float, radius: float)[source]
Calculates bounding box from a point according to the given radius.
- Parameters:
lat (float) – The latitude of centroid point of the bounding box.
lon (float) – The longitude of centroid point of the bounding box.
radius (float) – The max radius of the bounding box. The radius is given in metres.
- Returns:
The bounding box of the user specified size.
- Return type:
tuple
References
https://mathmesquita.dev/2017/01/16/filtrando-localizacao-em-um-raio.html
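One common way to derive such a box is to convert the radius into an angular distance on a sphere. The sketch below is an approximation under explicit assumptions and is not confirmed to match the library or the referenced article: the function name is reused only for illustration, a spherical Earth of radius 6,371,000 m is assumed, the result order `(min_lat, min_lon, max_lat, max_lon)` is a guess, and the formula is not valid near the poles or across the antimeridian.

```python
import math
from typing import Tuple

EARTH_RADIUS_M = 6_371_000.0  # ASSUMPTION: mean radius of a spherical Earth


def bounding_box_by_radius(lat: float, lon: float,
                           radius: float) -> Tuple[float, float, float, float]:
    """Approximate box around (lat, lon) that contains a circle of `radius` metres."""
    angular = radius / EARTH_RADIUS_M          # angular radius in radians
    d_lat = math.degrees(angular)              # latitude half-height in degrees
    # Longitude half-width grows as the meridians converge away from the equator.
    d_lon = math.degrees(math.asin(math.sin(angular) / math.cos(math.radians(lat))))
    # ASSUMPTION: output order is (min_lat, min_lon, max_lat, max_lon).
    return lat - d_lat, lon - d_lon, lat + d_lat, lon + d_lon
```

A tuple produced this way could then be passed to a bounding-box filter, provided both sides agree on the order of the four values.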
- static hampel_outlier_detection(dataframe, column_name: str)[source]
Use the Hampel filter to remove outliers from the dataset on the basis of the column specified by the user.
Warning
Do not use Hampel filter outlier detection on the DateTime column. It will raise a NotImplementedError because DateTime support has not yet been implemented by the original author of the Hampel filter.
- Parameters:
dataframe (PTRAILDataFrame) – The dataframe from which the outliers are to be removed.
column_name (Text) – The column on the basis of which the outliers are to be detected.
- Returns:
The dataframe with the outliers removed.
- Return type:
- Raises:
KeyError: – The user-specified column is not present in the dataset.
References
Pedrido, M.O., “Hampel”, (2020), GitHub repository, https://github.com/MichaelisTrofficus/hampel_filter
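For readers unfamiliar with the technique, a generic Hampel filter flags a value when it deviates from the median of a sliding window by more than a multiple of the scaled median absolute deviation (MAD). The sketch below shows that general idea only; it is not the referenced package's code, and the function name, the window size of 3 neighbours per side, and the 3-sigma threshold are all illustrative assumptions.

```python
import statistics
from typing import List

MAD_SCALE = 1.4826  # makes the MAD comparable to a standard deviation for Gaussian data


def hampel_outlier_indices(values: List[float],
                           window_size: int = 3,
                           n_sigmas: float = 3.0) -> List[int]:
    """Return the indices of values flagged as outliers by a Hampel filter."""
    outliers = []
    for i, value in enumerate(values):
        # Window of up to window_size neighbours on each side, clipped at the ends.
        window = values[max(0, i - window_size): i + window_size + 1]
        median = statistics.median(window)
        mad = statistics.median([abs(v - median) for v in window])
        if abs(value - median) > n_sigmas * MAD_SCALE * mad:
            outliers.append(i)
    return outliers
```

Removing the flagged indices from the original rows gives the "outliers removed" behaviour described above; note that when the MAD of a window is zero, any value differing from the median is flagged.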
- static remove_duplicates(dataframe: PTRAILDataFrame)[source]
- Drop duplicates based on the four following columns:
Trajectory ID
DateTime
Latitude
Longitude
Duplicates will be dropped only when the values in all four of the above-mentioned columns are the same.
- Returns:
The dataframe with dropped duplicates.
- Return type:
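The four-column duplicate rule can be sketched on plain dictionaries. This is an illustration only: the helper name and the dictionary keys `traj_id`, `DateTime`, `lat` and `lon` are hypothetical stand-ins for the four columns, and keeping the first occurrence of each duplicate group is an assumption.

```python
from typing import Dict, List


def drop_duplicate_points(rows: List[Dict]) -> List[Dict]:
    """Drop rows whose trajectory ID, datetime, latitude and longitude all repeat."""
    seen = set()
    unique = []
    for row in rows:
        # ASSUMPTION: these key names stand in for the four mandatory columns.
        key = (row["traj_id"], row["DateTime"], row["lat"], row["lon"])
        if key not in seen:  # keep only the first occurrence of each key
            seen.add(key)
            unique.append(row)
    return unique
```

Rows that share three of the four values but differ in the fourth produce different keys and are therefore all retained.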
- static remove_trajectories_with_less_points(dataframe, num_min_points: Optional[int] = 3)[source]
Remove from the dataframe the trajectories which have too few points.
- Parameters:
dataframe (PTRAILDataFrame) – The dataframe from which trajectories with few points are to be removed.
num_min_points (Optional[int], default = 3) – The minimum number of points that a trajectory should have if it is to be retained in the dataset.
- Returns:
The filtered dataframe which does not contain the trajectories with few points anymore.
- Return type:
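The retention rule can be sketched by counting points per trajectory ID. The helper below is hypothetical and works on `(traj_id, payload)` pairs rather than a dataframe; it assumes a trajectory is retained when it has at least `num_min_points` points, which follows the parameter description above.

```python
from collections import Counter
from typing import Any, List, Tuple


def remove_short_trajectories(points: List[Tuple[str, Any]],
                              num_min_points: int = 3) -> List[Tuple[str, Any]]:
    """Keep only the points of trajectories having at least num_min_points points."""
    counts = Counter(traj_id for traj_id, _payload in points)
    return [(traj_id, payload) for traj_id, payload in points
            if counts[traj_id] >= num_min_points]
```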
ptrail.preprocessing.helpers module
Warning
The helpers class provides the functionality that interpolates a point based on the data given by the user. The class contains the following 4 interpolation calculators:
Linear Interpolation
Cubic Interpolation
Random-Walk Interpolation
Kinematic Interpolation
Besides the interpolation helpers, there are also general utilities which are used in splitting up dataframes for running the code in parallel.
- class ptrail.preprocessing.helpers.Helpers[source]
Bases:
object
- static cubic_help(df: Union[pandas.DataFrame, PTRAILDataFrame], id_: str, sampling_rate: float, class_label_col)[source]
This method takes a dataframe and uses cubic interpolation to determine the coordinates of a location at a Datetime wherever the time difference between 2 consecutive points exceeds the user-specified sampling_rate, and inserts the interpolated point between those 2 points.
Warning
This method should not be used for dataframes with multiple trajectory ids as it will yield wrong results and there might be a significant drop in performance.
- Parameters:
df (Union[pd.DataFrame, PTRAILDataFrame]) – The dataframe containing the original trajectory data.
id_ (Text) – The Trajectory ID of the points in the dataframe.
sampling_rate (float) – The maximum time difference between 2 consecutive points; when it is exceeded, a point is inserted between the 2 points.
- Returns:
The dataframe containing the trajectory enhanced with interpolated points.
- Return type:
pandas.core.dataframe.DataFrame
- static hampel_help(df, column_name)[source]
This function is the helper function for the hampel_outlier_detection() function present in the filters module. The purpose of the function is to run the hampel filter on a single trajectory ID, remove the outliers and return the smaller dataframe.
Warning
This function should not be used directly as it will result in a slower execution of the function and might result in removal of points that are actually not outliers.
Warning
Do not use Hampel filter outlier detection on the DateTime column. It will raise a NotImplementedError because DateTime support has not yet been implemented by the original author of the Hampel filter.
- Parameters:
df (PTRAILDataFrame/pd.core.dataframe.DataFrame) – The dataframe from which the outliers are to be removed.
column_name (Text) – The column based on which the outliers are to be removed.
- Returns:
The dataframe where the outlier points are removed.
- Return type:
pd.core.dataframe.DataFrame
- static kinematic_help(dataframe: Union[pandas.DataFrame, PTRAILDataFrame], id_: str, sampling_rate: float, class_label_col)[source]
This method takes a dataframe and uses kinematic interpolation to determine the coordinates of a location at a Datetime wherever the time difference between 2 consecutive points exceeds the user-specified sampling_rate, and inserts the interpolated point between those 2 points.
Warning
This method should not be used for dataframes with multiple trajectory ids as it will yield wrong results and there might be a significant drop in performance.
- Parameters:
dataframe (Union[pd.DataFrame, PTRAILDataFrame]) – The dataframe containing the original trajectory data.
id_ (Text) – The Trajectory ID of the points in the dataframe.
sampling_rate (float) – The maximum time difference between 2 consecutive points; when it is exceeded, a point is inserted between the 2 points.
- Returns:
The dataframe containing the trajectory enhanced with interpolated points.
- Return type:
pandas.core.dataframe.DataFrame
References
Nogueira, T.O., “kinematic_interpolation.py”, (2016), GitHub repository, https://gist.github.com/talespaiva/128980e3608f9bc5083b.js
- static linear_help(dataframe: Union[pandas.DataFrame, PTRAILDataFrame], id_: str, sampling_rate: float, class_label_col)[source]
This method takes a dataframe and uses linear interpolation to determine the coordinates of a location at a Datetime wherever the time difference between 2 consecutive points exceeds the user-specified sampling_rate, and inserts the interpolated point between those 2 points.
Warning
This method should not be used for dataframes with multiple trajectory ids as it will yield wrong results and there might be a significant drop in performance.
- Parameters:
dataframe (Union[pd.DataFrame, PTRAILDataFrame]) – The dataframe containing the original trajectory data.
id_ (Text) – The Trajectory ID of the points in the dataframe.
sampling_rate (float) – The maximum time difference between 2 consecutive points; when it is exceeded, a point is inserted between the 2 points.
- Returns:
The dataframe containing the trajectory enhanced with interpolated points.
- Return type:
pandas.core.dataframe.DataFrame
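The gap-filling idea behind linear interpolation can be sketched for a single trajectory. This is an illustration under stated assumptions, not the library's implementation: the helper `linear_fill` is hypothetical, points are `(datetime, lat, lon)` tuples sorted by time, `sampling_rate` is taken to be in seconds, and exactly one point is inserted per oversized gap, placed `sampling_rate` seconds after the earlier point.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

TrajPoint = Tuple[datetime, float, float]  # (time, latitude, longitude)


def linear_fill(points: List[TrajPoint], sampling_rate: float) -> List[TrajPoint]:
    """Insert one linearly interpolated point into every gap longer than sampling_rate."""
    if not points:
        return []
    filled = [points[0]]
    for (t0, lat0, lon0), (t1, lat1, lon1) in zip(points, points[1:]):
        gap = (t1 - t0).total_seconds()
        if gap > sampling_rate:
            # Fraction of the way from the earlier point to the later one.
            fraction = sampling_rate / gap
            filled.append((t0 + timedelta(seconds=sampling_rate),
                           lat0 + fraction * (lat1 - lat0),
                           lon0 + fraction * (lon1 - lon0)))
        filled.append((t1, lat1, lon1))
    return filled
```

Because only one point is added per gap, a very long gap can still exceed the sampling rate after a single pass.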
- static random_walk_help(dataframe: PTRAILDataFrame, id_: str, sampling_rate: float, class_label_col)[source]
This method takes a dataframe and uses random-walk interpolation to determine the coordinates of a location at a Datetime wherever the time difference between 2 consecutive points exceeds the user-specified sampling_rate, and inserts the interpolated point between those 2 points.
Warning
This method should not be used for dataframes with multiple trajectory ids as it will yield wrong results and there might be a significant drop in performance.
- Parameters:
dataframe (Union[pd.DataFrame, PTRAILDataFrame]) – The dataframe containing the original trajectory data.
id_ (Text) – The Trajectory ID of the points in the dataframe.
sampling_rate (float) – The maximum time difference between 2 consecutive points; when it is exceeded, a point is inserted between the 2 points.
- Returns:
The dataframe containing the trajectory enhanced with interpolated points.
- Return type:
pandas.core.dataframe.DataFrame
References
Etemad, M., Soares, A., Etemad, E. et al. SWS: an unsupervised trajectory segmentation algorithm based on change detection with interpolation kernels. Geoinformatica (2020)
- static stats_helper(df, target_col_name, segmented)[source]
Generate the stats of the kinematic features present in the Dataframe.
- Parameters:
df (pandas.core.dataframe.DataFrame) – The dataframe containing the trajectory data and their features.
target_col_name (str) – The ‘y’ value that is used for ML tasks. It is required so that the target column can be appended back at the end.
segmented (Optional[bool]) – Indicate whether the trajectory has segments or not.
- Returns:
A dataframe containing the stats of the given trajectory.
- Return type:
pd.core.dataframe.DataFrame
ptrail.preprocessing.interpolation module
This class interpolates dataframe positions based on Datetime. It gives the user the flexibility to choose the interpolation method, for example linear or cubic interpolation. In general, the user passes the dataframe, the time jump and the interpolation type, and the appropriate function is selected based on the type. If the time difference between 2 consecutive points exceeds the time jump, an interpolated point is added at the position of the large jump, with its time increased by the time jump. This interpolated row is added to the dataframe.
- class ptrail.preprocessing.interpolation.Interpolation[source]
Bases:
object
- static interpolate_position(dataframe: PTRAILDataFrame, sampling_rate: float, ip_type: Optional[str] = 'linear', class_label_col: Optional[str] = '')[source]
Interpolate the position of an object and create new points using one of the interpolation methods provided by the Library. Currently, the library supports the following 4 interpolation methods:
Linear Interpolation
Cubic-Spline Interpolation
Kinematic Interpolation
Random Walk Interpolation
Warning
The Interpolation methods will only return the 4 mandatory library columns because it is not possible to interpolate other data that may or may not be present in the dataset apart from latitude, longitude and datetime. As a result, other columns are dropped.
Note
The time-jump parameter (sampling_rate) specifies where the new points are to be inserted, based on the time difference between 2 consecutive points. However, it does not guarantee that every time difference between 2 consecutive points in the resulting dataset will be equal to or less than the user-specified time jump.
Note
The time jump is specified in seconds. If the user-specified time jump is not sensible for the dataset, the execution of the method will take a very long time.
- Parameters:
dataframe (PTRAILDataFrame) – The dataframe containing the original dataset.
sampling_rate (float) – The maximum time difference between 2 consecutive points.
ip_type (Optional[Text], default = linear) – The type of interpolation that is to be used.
class_label_col (Optional[Text], default = '') – The column header which contains the class label of the point.
- Returns:
The dataframe containing the interpolated trajectory points.
- Return type:
ptrail.preprocessing.statistics module
The statistics module provides several functions that calculate kinematic statistics of the trajectory, split trajectories, pivot dataframes, etc. The main purpose of this module is to get the dataframe ready for Machine Learning tasks such as clustering, classification, etc.
- class ptrail.preprocessing.statistics.Statistics[source]
Bases:
object
- static generate_kinematic_stats(dataframe: PTRAILDataFrame, target_col_name: str, segmented: Optional[bool] = False)[source]
Generate the statistics of kinematic features for each unique trajectory in the dataframe.
- Parameters:
dataframe (PTRAILDataFrame) – The dataframe containing the trajectory data.
target_col_name (str) – The ‘y’ value that is used for ML tasks. It is required so that the target_col can be appended back at the end.
segmented (Optional[bool]) – Indicate whether the trajectory has segments or not.
- Returns:
A pandas dataframe containing stats for all kinematic features for each unique trajectory in the dataframe.
- Return type:
pandas.core.dataframe.DataFrame
- static pivot_stats_df(dataframe, target_col_name: str, segmented: Optional[bool] = False)[source]
Given a dataframe with stats present in it, melt the dataframe to make it ready for ML tasks. This is specifically for melting the type of dataframe generated by the generate_kinematic_stats() function.
Check the documentation of generate_kinematic_stats() for further details about the dataframe expected.
- Parameters:
dataframe (pd.core.dataframe.DataFrame) – The dataframe containing stats.
target_col_name (str) – The ‘y’ value that is used for ML tasks. It is required so that the target_col can be appended back at the end.
segmented (Optional[bool]) – Indicate whether the trajectory has segments or not.
- Returns:
The pivoted dataframe, in which rows have been converted to columns.
- Return type:
pd.core.dataframe.DataFrame
- static segment_traj_by_days(dataframe: PTRAILDataFrame, num_days)[source]
Given a dataframe containing trajectory data, segment all the trajectories into segments of num_days days each.
- Parameters:
dataframe (PTRAILDataFrame) – The dataframe containing trajectory data.
num_days (int) – The number of days that each segment is supposed to have.
- Returns:
The dataframe containing segmented trajectories with a new column added called segment_id.
- Return type:
pandas.core.dataframe.DataFrame
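One way to assign segment IDs by a fixed number of days is integer division of each point's elapsed days by `num_days`. The sketch below is only an assumption about how such segmentation could work, not the library's algorithm: the helper name is hypothetical, it handles the timestamps of a single trajectory, and segments are counted from that trajectory's earliest timestamp rather than from calendar boundaries.

```python
from datetime import datetime
from typing import List


def assign_segment_ids(times: List[datetime], num_days: int) -> List[int]:
    """Return a segment_id for each timestamp, one segment per num_days days."""
    if not times:
        return []
    # ASSUMPTION: segments start at the earliest timestamp of the trajectory.
    start = min(times)
    return [(t - start).days // num_days for t in times]
```

With `num_days=7` this reduces to weekly segments measured from the first point of the trajectory.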