cvfe.data package#
Submodules#
cvfe.data.constant module#
- cvfe.data.constant.CANADA_5257E_KEY_ABBREVIATION = {'Address': 'Addr', 'BGI2.VisaChoice1': 'noAuthStay', 'BGI2.VisaChoice2': 'refuseDeport', 'BGI3.Choice': 'criminalRec', 'BackgroundInfo': 'BGI', 'Contact': 'cntct', 'ContactInformation': 'CI', 'CountryWhereApplying': 'CWA', 'Current': 'Curr', 'Details.VisaChoice3': 'PrevApply', 'DetailsOfVisit': 'DOV', 'Education': 'Edu', 'GovPosition.Choice': 'witnessIllTreat', 'HowLongStay': 'HLS', 'Language': 'Lang', 'MaritalStatus': 'MS', 'Marriage': 'Marr', 'Married': 'Marr', 'Number': 'Num', 'Occ.Choice': 'politicViol', 'Occupation': 'Occ', 'Page': 'P', 'PageWrapper': 'PW', 'Passport': 'Psprt', 'PersonalDetails': 'PD', 'Phone': 'Phn', 'Previous': 'Prev', 'Previously': 'Prev', 'Purpose': 'Prps', 'Resident': 'Resi', 'Section': 'Sec', 'Signature': 'Sign', 'backgroundInfoCalc': 'otherThanMedic', 'contact': 'cntct'}#
Dict of abbreviations used to shorten the length of KEYS in XML to CSV conversion
- cvfe.data.constant.CANADA_5645E_KEY_ABBREVIATION = {'Address': 'Addr', 'Applicant': 'App', 'Child': 'Chd', 'Father': 'Fa', 'Mother': 'Mo', 'Occupation': 'Occ', 'Relationship': 'Rel', 'Section': 'Sec', 'Spouse': 'Sps', 'Yes': 'Accomp', 'page': 'p'}#
Dict of abbreviations used to shorten the length of KEYS in XML to CSV conversion
- cvfe.data.constant.CANADA_5257E_VALUE_ABBREVIATION = {'045': 'TURKEY', '223': 'IRAN', 'BIOMETRIC ENROLMENT': 'Bio'}#
Dict of abbreviations used to shorten the length of VALUES in XML to CSV conversion
- cvfe.data.constant.CANADA_5257E_DROP_COLUMNS = ['ns0:datasets.@xmlns:ns0', 'P1.Header.CRCNum', 'P1.FormVersion', 'P1.PD.UCIClientID', 'P1.PD.SecHeader.@ns0:dataNode', 'P1.PD.CurrCOR.Row1.@ns0:dataNode', 'P1.PD.PrevCOR.Row1.@ns0:dataNode', 'P1.PD.CWA.Row1.@ns0:dataNode', 'P1.PD.ApplicationValidatedFlag', 'P2.MS.SecA.SecHeader.@ns0:dataNode', 'P2.MS.SecA.PsprtSecHeader.@ns0:dataNode', 'P2.MS.SecA.Langs.languagesHeader.@ns0:dataNode', 'P2.natID.SecHeader.@ns0:dataNode', 'P2.USCard.SecHeader.@ns0:dataNode', 'P2.USCard.SecHeader.@ns0:dataNode', 'P2.CI.cntct.cntctInfoSecHeader.@ns0:dataNode', 'P3.SecHeader_DOV.@ns0:dataNode', 'P3.Edu.Edu_SecHeader.@ns0:dataNode', 'P3.Occ.SecHeader_CurrOcc.@ns0:dataNode', 'P3.BGI_SecHeader.@ns0:dataNode', 'P3.Sign.Consent0.Choice', 'P3.Sign.hand.@ns0:dataNode', 'P3.Sign.TextField2', 'P3.Disclosure.@ns0:dataNode', 'P3.ReaderInfo', 'Barcodes.@ns0:dataNode']#
List of columns to be dropped before doing any preprocessing
Note
This list has been determined manually.
- cvfe.data.constant.CANADA_5645E_DROP_COLUMNS = {'formNum', 'p1.SecA.SecAdate', 'p1.SecA.SecAsignature', 'p1.SecA.Title.@xfa:dataNode', 'p1.SecB.SecBdate', 'p1.SecB.SecBsignature', 'p1.SecB.Title.@xfa:dataNode', 'p1.SecC.SecCsignature', 'p1.SecC.Subform2.@xfa:dataNode', 'p1.SecC.Title.@xfa:dataNode', 'xfa:datasets.@xmlns:xfa'}#
Set of columns to be dropped before doing any preprocessing
Note
This list has been determined manually.
- class cvfe.data.constant.DocTypes(value)[source]#
Bases: Enum
Contains all document types, which can be used to customize ETL steps for each document type.
Members follow the <country_name>_<document_type> naming convention. The values and their order are meaningless.
- CANADA = 1#
- CANADA_5257E = 2#
- CANADA_5645E = 3#
- CANADA_LABEL = 4#
- class cvfe.data.constant.CanadaCutoffTerms[source]#
Bases: object
Dict of cutoff terms for different files, which can be used with vizard.data.functional.dict_summarizer()
- CA5645E = 'IMM_5645'#
- CA5257E = 'form1'#
- class cvfe.data.constant.CanadaFillna[source]#
Bases: object
Values used to fill Nones depending on the form structure
Members follow the <field_name>_<form_name> naming convention. The values have been extracted by manually inspecting the documents. Hence, for each form, the user must find and set these values manually.
Note
We do not use any heuristics here; we just follow what the form uses and only add another option that should be used as the None state, i.e. None as a separate feature in categorical mode.
- COUNTRY_CODE_5257E = 'Unknown'#
- VISA_TYPE_5257E = 'OTHER'#
- PLACE_BIRTH_CITY_5257E = 'OTHER'#
- COUNTRY_5257E = 'IRAN'#
- CITIZENSHIP_5257E = 'IRAN'#
- RESIDENCY_STATUS_5257E = 6#
- OTHER_DESCRIPTION_INDICATOR_5257E = False#
- PREVIOUS_COUNTRY_5257E = 'OTHER'#
- COUNTRY_WHERE_APPLYING_5257E = 'OTHER'#
- MARRIAGE_TYPE_5257E = 'OTHER'#
- PASSPORT_COUNTRY_5257E = 'OTHER'#
- NATIVE_LANG_5257E = 'IRAN'#
- LANGUAGES_ABLE_TO_COMMUNICATE_5257E = 'NEITHER'#
- ID_COUNTRY_5257E = 'IRAN'#
- PURPOSE_OF_VISIT_5257E = 7#
- CONTACT_TYPE_5257E = 'OTHER'#
- OCCUPATION_5257E = 'OTHER'#
- INDICATOR_FIELD_5257E = False#
- VISA_APPLICATION_TYPE_5645E = '0'#
- CHILD_MARRIAGE_STATUS_5645E = 9#
- CHILD_RELATION_5645E = 'OTHER'#
- VISA_RESULT = 0#
- cvfe.data.constant.DATEUTIL_DEFAULT_DATETIME = {'day': 1, 'month': 1, 'year': 1}#
A default date for the dateutil.parser.parse function when some part of the date is not provided
- class cvfe.data.constant.CustomNamingEnum(value)[source]#
Bases: Enum
Extends base enum.Enum to support custom naming for members
Note
Class attribute name has been overridden to return a name (e.g. of a marital status) that matches the dataset and not the Enum naming convention of Python. For instance, COMMON_LAW -> common-law in the case of Canada forms.
Note
Devs should subclass this class and add their desired members in the newly created classes. E.g. see CanadaMarriageStatus.
Note
Classes that subclass this should use enum.auto for the values of their members to show that the chosen value is not domain-specific. Otherwise, any explicit value given to a member should imply a domain-specific value (e.g. one extracted from the dataset). Explicitly provided values are the values used in the original data. Hence, they should not be modified by any means, as they are tied to the dataset, transformation, and other domain-specific values. E.g. compare the values in CanadaMarriageStatus and SiblingRelation.
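The naming behaviour described in the notes above can be sketched as follows. This is a minimal illustration, not the package's implementation: the class names DatasetNamedEnum, MarriageStatus and Relation are hypothetical, and overriding name with a plain property is an assumption about how such an override could be written.

```python
from enum import Enum, auto


class DatasetNamedEnum(Enum):
    """Hypothetical base: `name` follows the dataset's naming, not Python's."""

    @property
    def name(self):
        # COMMON_LAW -> common-law
        return self._name_.lower().replace("_", "-")


class MarriageStatus(DatasetNamedEnum):
    # Explicit values: assumed to come from the original forms, so fixed.
    COMMON_LAW = 2
    MARRIED = 5


class Relation(DatasetNamedEnum):
    # auto(): the value carries no domain meaning.
    BROTHER = auto()
    SISTER = auto()


status = MarriageStatus(2)   # look up a member by the value found in a form
print(status.name)           # dataset-style name of that member
```

Lookup by value still works as with any Enum, so a value read from a form can be mapped to its dataset-style name in one step.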
- class cvfe.data.constant.CanadaMarriageStatus(value)[source]#
Bases: CustomNamingEnum
States of marriage in Canada forms
Note
Values for the members are the values used in the original Canada forms. Hence, they should not be modified by any means, as they are tied to the dataset, transformation, and other domain-specific values.
- COMMON_LAW = 2#
- DIVORCED = 3#
- SEPARATED = 4#
- MARRIED = 5#
- SINGLE = 7#
- WIDOWED = 8#
- UNKNOWN = 9#
- class cvfe.data.constant.CanadaContactRelation(value)[source]#
Bases: CustomNamingEnum
Contact relation in Canada data
- F1 = 1#
- F2 = 2#
- HOTEL = 3#
- WORK = 4#
- FRIEND = 5#
- UKN = 6#
- class cvfe.data.constant.CanadaResidencyStatus(value)[source]#
Bases: CustomNamingEnum
Residency status in a country in Canada data
- CITIZEN = 1#
- VISITOR = 3#
- OTHER = 6#
- class cvfe.data.constant.Sex(value)[source]#
Bases: CustomNamingEnum
Sex types in general
- FEMALE = 1#
- MALE = 2#
cvfe.data.functional module#
- cvfe.data.functional.dict_summarizer(d, cutoff_term, KEY_ABBREVIATION_DICT=None, VALUE_ABBREVIATION_DICT=None)[source]#
Takes a flattened dictionary and shortens its keys
- Parameters:
d (dict) – The dictionary to be shortened
cutoff_term (str) – The string to search for in keys; anything behind it is removed
KEY_ABBREVIATION_DICT (dict, optional) – A dictionary containing abbreviation mapping for keys. Defaults to None.
VALUE_ABBREVIATION_DICT (dict, optional) – A dictionary containing abbreviation mapping for values. Defaults to None.
- Returns:
A dict with shortened keys, produced by throwing away some part of each key and using an abbreviation dictionary for both keys and values.
- Return type:
dict
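A minimal sketch of the shortening that dict_summarizer describes. The function name summarize_keys is hypothetical. Two points are assumptions rather than confirmed behaviour: that "behind" the cutoff term means the prefix up to and including it, and that abbreviations are applied by plain substring replacement.

```python
def summarize_keys(d, cutoff_term, key_abbreviation=None, value_abbreviation=None):
    """Hypothetical stand-in for dict_summarizer; not the package's code."""
    key_abbreviation = key_abbreviation or {}
    value_abbreviation = value_abbreviation or {}
    result = {}
    for key, value in d.items():
        # Assumption: drop everything up to and including the cutoff term.
        if cutoff_term in key:
            key = key.split(cutoff_term, 1)[1].lstrip(".")
        # Replace each long term in the key with its abbreviation.
        for long_term, short_term in key_abbreviation.items():
            key = key.replace(long_term, short_term)
        # Values are swapped only on an exact match (e.g. country codes).
        if isinstance(value, str):
            value = value_abbreviation.get(value, value)
        result[key] = value
    return result


flat = {"xfa:data.form1.Page1.PersonalDetails.Citizenship": "223"}
short = summarize_keys(
    flat,
    cutoff_term="form1",
    key_abbreviation={"Page": "P", "PersonalDetails": "PD"},
    value_abbreviation={"223": "IRAN"},
)
print(short)
```

Because replacement is substring-based in this sketch, the order of entries in the abbreviation dictionary matters when one term is a prefix of another (for example Page and PageWrapper).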
- cvfe.data.functional.dict_to_csv(d, path)[source]#
Takes a flattened dictionary and writes it to a CSV file.
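A minimal sketch of writing a flattened dictionary to CSV with the standard library. The helper name write_flat_dict_to_csv and the layout (one header row of keys followed by one row of values) are assumptions for illustration; the package's actual layout is not stated here.

```python
import csv


def write_flat_dict_to_csv(d, path):
    """Hypothetical stand-in for dict_to_csv: one header row, one data row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(d.keys()))
        writer.writeheader()   # keys become the column names
        writer.writerow(d)     # values become the single data row
```

With this layout, each processed document yields a one-row table whose columns are the shortened keys.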
- cvfe.data.functional.column_dropper(dataframe, string, exclude=None, regex=False, inplace=True)[source]#
Takes a Pandas Dataframe and drops columns matching a pattern
- Parameters:
dataframe (pandas.DataFrame) – Pandas dataframe to be processed
string (str) – string to look for in dataframe columns
exclude (Optional[str], optional) – string to exclude a subset of columns from being dropped. Defaults to None.
regex (bool, optional) – compile string as regex. Defaults to False.
inplace (bool, optional) – whether or not to use an inplace operation. Defaults to True.
- Returns:
Takes a Pandas Dataframe and searches for columns containing string, either as a raw string or as a regex (in the latter case, use regex=True), and after excluding a subset of them, drops the remaining in-place.
- Return type:
Optional[pandas.DataFrame]
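The column selection rule described above can be sketched on a plain list of column names, without pandas. The helper name select_columns_to_drop is hypothetical, and treating exclude as a substring test is an assumption.

```python
import re


def select_columns_to_drop(columns, string, exclude=None, regex=False):
    """Hypothetical helper mirroring the selection rule described above."""
    if regex:
        pattern = re.compile(string)
        matched = [c for c in columns if pattern.search(c)]
    else:
        matched = [c for c in columns if string in c]
    if exclude is not None:
        # Columns containing `exclude` are kept, even though they matched.
        matched = [c for c in matched if exclude not in c]
    return matched


columns = [
    "P1.PD.Name",
    "P1.PD.SecHeader.@ns0:dataNode",
    "P3.Sign.hand.@ns0:dataNode",
]
to_drop = select_columns_to_drop(columns, "@ns0:dataNode", exclude="Sign")
print(to_drop)
```

With a real dataframe, the resulting list could then be passed to dataframe.drop(columns=to_drop, inplace=True).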
- cvfe.data.functional.fillna_datetime(dataframe, col_base_name, date, type, one_sided=False, inplace=False)[source]#
Takes names of two columns of dates (start, end) and fills them with a predefined value
- Parameters:
dataframe (pandas.DataFrame) – Pandas Dataframe to be processed
col_base_name (str) – Base column name that accepts 'From' and 'To' for extracting dates of the same category
date (str) – The desired date
type (DocTypes) – DocTypes used to select rules for matching tags and filling appropriately.
one_sided (str | bool, optional) – Different ways of filling empty date columns. Defaults to False.
'right': Uses the current_date as the final time
'left': Uses the reference_date as the starting time
inplace (bool, optional) – whether or not to use an inplace operation. Defaults to False.
Note
In transformation operations such as the aggregate_datetime() function, this would be converted to a period of zero. It is useful for filling periods of non-existing items (e.g. age of children for a single person).
- Returns:
A Pandas Dataframe in which the two date columns that had no value (None) have been filled with the same date via date.
- Return type:
- cvfe.data.functional.aggregate_datetime(dataframe, col_base_name, new_col_name, type, if_nan='skip', one_sided=None, reference_date=None, current_date=None, **kwargs)[source]#
Takes two columns of dates in string form and calculates the period between them
- Parameters:
dataframe (pandas.DataFrame) – Pandas dataframe to be processed
col_base_name (str) – Base column name that accepts 'From' and 'To' for extracting dates of the same category
new_col_name (str) – The column name that extends col_base_name and will be the final column containing the period.
type (DocTypes) – document type used to select rules for matching tags and filling appropriately. See DocTypes.
if_nan (Optional[str | Callable], optional) – What to do with Nones (NaN). Could be a function or one of the predefined states as follows. Defaults to 'skip'.
'skip': do nothing (i.e. ignore Nones).
one_sided (Optional[str], optional) – Different ways of filling empty date columns. Defaults to None. Could be one of the following:
'right': Uses the current_date as the final time
'left': Uses the reference_date as the starting time
reference_date (Optional[str], optional) – Assumed reference_date (t0<t1). Defaults to None.
current_date (Optional[str], optional) – Assumed current_date (t1>t0). Defaults to None.
default_datetime – accepts datetime.datetime to set default date for dateutil.parser.parse.
- Returns:
A Pandas Dataframe in which the period between the two date columns has been calculated and represented in integer form. The two columns used will be dropped.
- Return type:
pandas.DataFrame
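A minimal sketch of the period calculation for a single pair of dates. The helper name period_in_days is hypothetical, and several points are assumptions: dates are ISO formatted strings (the package parses them with dateutil), the integer unit is days, and one_sided fills only the side that is missing.

```python
from datetime import date


def period_in_days(start, end, one_sided=None, reference_date=None, current_date=None):
    """Hypothetical helper: days between two ISO dates, filling one missing side."""
    if one_sided == "left" and start is None:
        start = reference_date   # missing start -> assumed reference date
    if one_sided == "right" and end is None:
        end = current_date       # missing end -> assumed current date
    if start is None or end is None:
        return None              # comparable to if_nan='skip'
    return (date.fromisoformat(end) - date.fromisoformat(start)).days


print(period_in_days("2020-01-01", "2020-01-31"))
print(period_in_days("2020-01-01", None, one_sided="right", current_date="2020-01-11"))
```

When both sides carry the same date, as after fillna_datetime, the period is zero.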
- cvfe.data.functional.tag_to_regex_compatible(string, type)[source]#
Takes a string and makes it regex compatible for XML parsed string
Note
This is a specialized method, and it may be better to override it for your own case.
- cvfe.data.functional.change_dtype(dataframe, col_name, dtype, if_nan='skip', **kwargs)[source]#
Changes the data type of a column with the ability to fill Nones
- Parameters:
dataframe (pandas.DataFrame) – Dataframe in which col_name will be searched
col_name (str) – Desired column name of the dataframe
dtype (Callable) – target data type as a function, e.g. np.float32
if_nan (str, Callable, optional) – What to do with Nones (NaN). Defaults to 'skip'. Could be a function or one of the predefined states as follows:
'skip': do nothing (i.e. ignore Nones)
'value': fill the None with the value argument via kwargs
default_datetime (optional) – accepts datetime.datetime to set default date for dateutil.parser.parse
- Raises:
ValueError – if the string mode passed to if_nan does not exist. It won’t raise if if_nan is Callable.
- Returns:
A Pandas Dataframe in which the data type of col_name has been changed to dtype.
- Return type:
- cvfe.data.functional.flatten_dict(d)[source]#
Takes a (nested) multilevel dictionary and flattens it
- Parameters:
d (dict) – A dictionary (could be multilevel)
- Returns:
Flattened dictionary where keys and values of the returned dict are:
new_keys[i] = f'{old_keys[level]}.{old_keys[level+1]}.[...].{old_keys[level+n]}'
new_value = old_value
- Return type:
- cvfe.data.functional.xml_to_flattened_dict(xml)[source]#
Takes a (nested) XML and flattens it to a dict via flatten_dict()
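The flattening that flatten_dict() describes, and that xml_to_flattened_dict() relies on, can be sketched as a short recursive function. The name flatten and its parent_key and sep parameters are hypothetical, and only nested dicts are handled here; how the package treats lists is not covered by this sketch.

```python
def flatten(d, parent_key="", sep="."):
    """Hypothetical stand-in for flatten_dict; nested keys are joined by `sep`."""
    items = {}
    for key, value in d.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else str(key)
        if isinstance(value, dict):
            # Recurse, carrying the joined key down one level.
            items.update(flatten(value, new_key, sep))
        else:
            items[new_key] = value   # leaf values are kept unchanged
    return items


nested = {"form1": {"Page1": {"PersonalDetails": {"Name": "Jane"}}, "Version": "1"}}
flat = flatten(nested)
print(flat)
```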
- cvfe.data.functional.create_directory_structure_tree(src, shallow=False)[source]#
Takes a path to directory and creates a dictionary of its directory structure tree
- Parameters:
- Returns:
Dictionary of all dirs (and subdirs) where keys are paths and values are 0
- Return type:
- cvfe.data.functional.dump_directory_structure_csv(src, shallow=True)[source]#
Saves a tree structure of a directory in a csv file
Takes a src directory path, creates a tree of the dir structure and writes it down to a csv file with the name 'label.csv' with a default value of '0' for each path
Note
This has been used to manually extract and record labels.
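A minimal sketch of building the directory tree and dumping it as a label file, using only the standard library. The function names are hypothetical, and three points are assumptions: shallow limits the walk to top-level directories, the CSV has one path,label row per directory, and label.csv is written inside src.

```python
import csv
import os


def directory_tree(src, shallow=False):
    """Hypothetical stand-in: map each directory path under `src` to '0'."""
    tree = {}
    for root, dirs, _files in os.walk(src):
        for name in sorted(dirs):
            tree[os.path.join(root, name)] = "0"
        if shallow:
            break  # assumption: shallow means top-level directories only
    return tree


def dump_tree_csv(src, shallow=True, filename="label.csv"):
    """Write one `path,label` row per directory and return the CSV path."""
    out_path = os.path.join(src, filename)
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for path, label in directory_tree(src, shallow).items():
            writer.writerow([path, label])
    return out_path
```

The default label of '0' is a placeholder; the resulting file can then be edited by hand to record the real label of each directory.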
- cvfe.data.functional.process_directory(src_dir, dst_dir, compose, file_pattern='*')[source]#
Transforms all files that match a pattern in the given dir and saves the new files, preserving the dir structure
Note
A method used for handling files from the manually processed dataset to the raw dataset. See FileTransform for more information.
- Parameters:
src_dir (str) – Source directory to be processed
dst_dir (str) – Destination directory to write processed files
compose (FileTransformCompose) – An instance of transform composer. See Compose.
file_pattern (str, optional) – pattern to match files. Defaults to '*' for all files.
- Return type:
- cvfe.data.functional.search_dict(string, dic, if_nan)[source]#
Converts a string to another given a dictionary to search for
Note
This could be used to convert non-standard country codes to their names
- cvfe.data.functional.config_csv_to_dict(path)[source]#
Takes a config CSV and returns a dictionary of keys and values
Note
Configs of our use case can be found in cvfe.configs
cvfe.data.pdf module#
- class cvfe.data.pdf.PDFIO[source]#
Bases:
objectBase class for dealing with PDF files
For each type of PDF, let’s say XFA files, one needs to extend this class and abstract methods like
extract_raw_content()to generate a string of the content of the PDF in a format that can be used by the other classes (e.g. XML). For instance, seeXFAPDFfor the extension of this class.
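The extension pattern described above can be sketched with Python's abc module. The class names PDFReader and FakeXFAReader are hypothetical, and whether PDFIO itself is declared as an abstract base class is an assumption; the subclass returns placeholder XML so the sketch runs without any PDF library.

```python
from abc import ABC, abstractmethod


class PDFReader(ABC):
    """Hypothetical counterpart of PDFIO: one subclass per kind of PDF."""

    @abstractmethod
    def extract_raw_content(self, pdf_path: str) -> str:
        """Return the content of the PDF as a string (e.g. XML)."""


class FakeXFAReader(PDFReader):
    """Illustrative subclass; a real one would read the XFA data from the file."""

    def extract_raw_content(self, pdf_path: str) -> str:
        # Placeholder content instead of actually opening the PDF.
        return f"<form1><source>{pdf_path}</source></form1>"


reader = FakeXFAReader()
print(reader.extract_raw_content("5257E.pdf"))
```

Declaring the method as abstract makes it impossible to instantiate the base class directly, so a missing implementation is caught early.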
- class cvfe.data.pdf.XFAPDF[source]#
Bases: PDFIO
Contains functions and utility tools for dealing with XFA PDF documents.
- extract_raw_content(pdf_path)[source]#
Extracts RAW content of XFA PDF files which are in XML format
- Parameters:
pdf_path (str) – path to the pdf file
- Returns:
XFA object of the pdf file in XML format
- Return type:
- clean_xml_for_csv(xml, type)[source]#
Cleans the XML file extracted from XFA forms
Since each form has its own format and issues, this method needs to be implemented uniquely for each unique file/form, which needs to be specified using the argument type, which can be populated from DocTypes.
- flatten_dict_basic(d)[source]#
Takes a (nested) dictionary and flattens it
ref: https://stackoverflow.com/questions/38852822/how-to-flatten-xml-file-in-python
- Parameters:
d (dict) – A dictionary
- Returns:
An ordered dict
- Return type:
cvfe.data.preprocessor module#
- class cvfe.data.preprocessor.DataframePreprocessor(dataframe=None)[source]#
Bases: object
A wrapper around builtin Pandas functions to make it easier for our data values
A class that contains methods for dealing with dataframes regarding transformation of data such as filling missing values, dropping columns, or aggregating multiple columns into a single more meaningful one.
This class needs to be extended for file-specific preprocessing where tags are unique and need to be handled entirely manually. In this case, file_specific_basic_transform() needs to be implemented.
- __init__(dataframe=None)[source]#
- Parameters:
dataframe (pandas.DataFrame, optional) – Main dataframe to be preprocessed. Defaults to None.
- column_dropper(string, exclude=None, regex=False, inplace=True)[source]#
See cvfe.data.functional.column_dropper() for more information
- fillna_datetime(col_base_name, type, one_sided, date=None, inplace=False)[source]#
See cvfe.data.functional.fillna_datetime() for more details
- aggregate_datetime(col_base_name, new_col_name, type, if_nan=None, one_sided=None, reference_date=None, current_date=None)[source]#
See cvfe.data.functional.aggregate_datetime() for more details
- Return type:
- file_specific_basic_transform(type, path)[source]#
Takes a specific file then does data type fixing, missing value filling, discretization, etc.
Note
Since each file has its own unique tags and requirements, all of these transformations are expected to be hardcoded for each file. Hence, this method exists only to improve readability, without any generalization to other problems or even other files.
- change_dtype(col_name, dtype, if_nan='skip', **kwargs)[source]#
See cvfe.data.functional.change_dtype() for more details
- class cvfe.data.preprocessor.CanadaDataframePreprocessor(dataframe=None)[source]#
Bases: DataframePreprocessor
- __init__(dataframe=None)[source]#
- Parameters:
dataframe (pandas.DataFrame, optional) – Main dataframe to be preprocessed. Defaults to None.
- convert_country_code_to_name(string)[source]#
Converts the (custom and non-standard) code of a country to its name given the XFA docs LOV section.
- file_specific_basic_transform(type, path)[source]#
Takes a specific file then does data type fixing, missing value filling, discretization, etc.
Note
Since each file has its own unique tags and requirements, all of these transformations are expected to be hardcoded for each file. Hence, this method exists only to improve readability, without any generalization to other problems or even other files.
- class cvfe.data.preprocessor.FileTransformCompose(transforms)[source]#
Bases: object
Composes several transforms operating on files together
The transforms should be tied to files with a keyword; using a dictionary, each transform is applied only to the files that match its keyword
Transformation dictionary over files in the following structure:
{ FileTransform: 'filter_str', ..., }
Note
Transforms will be applied in order of the keys in the dictionary
- __init__(transforms)[source]#
- Parameters:
transforms (dict[FileTransform, str]) – a dictionary of transforms, where the key is the instance of FileTransform and the value is the keyword that the transform will be applied to
- Raises:
ValueError – if the keyword is not a string
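A minimal sketch of such a composer. The class names Compose and RecordingTransform are hypothetical, and two points are assumptions: a transform is a callable taking source and destination paths, and a keyword matches when it is a substring of the source path.

```python
class RecordingTransform:
    """Hypothetical FileTransform: records the files it was applied to."""

    def __init__(self, label):
        self.label = label
        self.seen = []

    def __call__(self, src, dst):
        self.seen.append((src, dst))


class Compose:
    """Hypothetical stand-in for FileTransformCompose."""

    def __init__(self, transforms):
        for keyword in transforms.values():
            if not isinstance(keyword, str):
                raise ValueError(f"Keyword must be a string, got {type(keyword)}")
        self.transforms = transforms

    def __call__(self, src, dst):
        applied = []
        # Dicts preserve insertion order, so transforms run in key order.
        for transform, keyword in self.transforms.items():
            if keyword in src:   # assumption: substring match on the path
                transform(src, dst)
                applied.append(transform.label)
        return applied


copy = RecordingTransform("copy")
unlock = RecordingTransform("unlock")
compose = Compose({copy: "5645", unlock: "5257"})
print(compose("in/5257E.pdf", "out/5257E.pdf"))
```

Instances are hashable by identity by default, which is what allows them to serve as dictionary keys here.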
- class cvfe.data.preprocessor.FileTransform[source]#
Bases: object
A base class for applying transforms as a composable object over files.
Any behavior over the files themselves (not the content of files) must extend this class.
- class cvfe.data.preprocessor.CopyFile(mode)[source]#
Bases: FileTransform
Only copies a file; a wrapper around shutil’s copying methods
Default is set to ‘cf’, i.e. shutil.copyfile. For more info see the shutil documentation.
- class cvfe.data.preprocessor.MakeContentCopyProtectedMachineReadable[source]#
Bases: FileTransform
Reads a ‘content-copy’ protected PDF and removes this restriction
Removing the protection is done by saving a “printed” version of the PDF via pikepdf