Initial loading of registry data
The initial data load uses a PL/pgSQL procedure. PL/pgSQL is a procedural language embedded in PostgreSQL for writing advanced procedures and functions.
Before executing the initial data load procedure, ensure that all prerequisites described in the section Preparing data for migration are met. This ensures a smooth and uninterrupted data-loading process.
1. PL/pgSQL procedure specification for the Platform’s database
This section details the PL/pgSQL procedure that allows data to be loaded from CSV files into the Platform database tables.
CREATE OR REPLACE PROCEDURE p_load_table_from_csv(
p_table_name text,
p_file_name text,
p_table_columns text[],
p_target_table_columns text[] DEFAULT NULL::text[]
);
- Parameter description:

  1. `p_table_name` — the name of the target table where data will be loaded.
  2. `p_file_name` — the full path to the CSV file from which data will be loaded.
  3. `p_table_columns` — an array of field names in the data file, corresponding to the first row of the CSV file.
  4. `p_target_table_columns` — an array of target table fields. This parameter is optional if the list of fields in the data file matches the fields in the target table.

  Value formats for the `p_target_table_columns` parameter:

  - `<col_name>`: the field in the target table receives the value of the corresponding field in the CSV file.
  - `<col_name>::<some_calculation_based_on_source_cols>`: the field receives a value calculated from source fields. For more information on this approach, see the section Complex Scenario 1.
  - `<col_name>::ref(lookup_col:<lookup_col_name>, ref_table:<ref_table_name>, ref_col:<ref_col_name>, ref_id:<ref_id_col_name>)`: used to look up an `id` in a reference table when the input value is a descriptive field rather than an identifier; the value of this field must be unique. The lookup is equivalent to:

    SELECT <ref_id_col_name> INTO <col_name> FROM <ref_table_name> WHERE <ref_col_name> = <lookup_col_name>;

    For more details on this usage, see the section Complex Scenario 2.
  - `<col_name>::ref_array(lookup_col:<lookup_col_name>, ref_table:<ref_table_name>, ref_col:<ref_col_name>, ref_id:<ref_id_col_name>, delim:<delimiter_character>)`: similar to the previous format, but for cases where the `<lookup_col_name>` field in the CSV file contains an array of descriptive values; an array of the corresponding `id`s is returned.
2. Procedure call examples
2.1. Basic scenario
This scenario demonstrates the primary use of the procedure for importing data from a CSV file into a database table. A notable feature is the ability to import the entity ID value that uniquely identifies table records.
CALL p_load_table_from_csv(
    'research', 'D:\PostgreSQL\csv\uuid\01-Typy_doslidgen_uuid.csv',
    array['uuid', 'research_type'],
    array['research_type']
);
- Parameter description:

  1. `uuid` — a static (fixed) name that does not depend on the entity ID name. It imports a unique identifier (UUID) from the file. When importing, the `primaryKey` of the table being imported must be of type `UUID`.
  2. The ID column must be the first in the import file and have a UUID format to ensure proper data processing.
This example illustrates data import from the file `01-Typy_doslidgen_uuid.csv`, located at `D:\PostgreSQL\csv\uuid\`, into the `research` table. The imported fields include `uuid` as a unique identifier and `research_type` as the type of research.
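For orientation, the primary-key requirement above implies a target table shaped roughly as follows. This is a hypothetical DDL sketch: on the Platform, registry tables are generated from the data model rather than created by hand, and the actual primary-key column name may differ.

CREATE TABLE research (
    research_id uuid PRIMARY KEY, -- receives the value of the file's uuid column
    research_type text            -- imported directly from the CSV
);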
2.2. Complex Scenario 1
This scenario demonstrates extended use of the procedure: the imported data is checked against specific conditions, and additional fields are formed using calculations based on the data from the CSV file. This approach addresses specific data processing and classification needs.
CALL p_load_table_from_csv(
'koatuu', '${dataLoadPath}03-KOATUU_simple.csv',
array[ 'code', 'category', 'name' ],
array[ 'code', 'category', 'name',
'level1::substring(code,1,2)||''00000000''',
'level2::substring(code,1,5)||''00000''',
'type::CASE WHEN code ~ ''[0-9]{2}0{8}'' AND code !~ ''(80|85)0{8}'' THEN ''О''
WHEN code ~ ''[0-9]{2}2[0-9]{2}0{5}'' AND code !~ ''[0-9]{2}20{7}'' THEN ''Р''
WHEN coalesce(category, ''Р'') != ''Р''
OR code IN (SELECT DISTINCT substring(code,1,5)||''00000'' FROM koatuu_csv k2 WHERE category = ''Р'') AND category IS NULL
OR code ~ ''(80|85)0{8}'' THEN ''НП''
ELSE NULL END' ]
);
- Parameter and logic description:

  1. `code`, `category`, `name` — standard fields imported directly from the CSV file.
  2. `level1` — creates a new field by taking the first two digits of the `code` field and appending eight zeros, forming a unique first-level identifier.
  3. `level2` — similar to `level1`, but takes the first five digits and appends five zeros, creating a unique second-level identifier.
  4. `type` — determines the object type based on a complex conditional expression. Using regular expressions and conditional logic, this field classifies records based on their codes and categories.
This scenario is beneficial for processing and classifying geographical data according to KOATUU (Classifier of Objects of the Administrative-Territorial Unit of Ukraine), providing automated creation of hierarchical levels and types of objects based on their codes.
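To see what the `level1` and `level2` expressions produce, they can be evaluated standalone. The sketch below uses an illustrative literal instead of a real `code` value and requires no tables:

SELECT
    substring('0110136600', 1, 2) || '00000000' AS level1, -- '0100000000': first-level identifier
    substring('0110136600', 1, 5) || '00000'    AS level2; -- '0110100000': second-level identifier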
2.3. Complex Scenario 2
This scenario shows how to import data for a complex staff structure, using additional transformations to match data with foreign keys and identifiers in the database. The scenario includes searching reference tables by descriptive values to obtain the corresponding identifiers, as well as processing arrays of data.
CALL p_load_table_from_csv('staff','D:\PostgreSQL\csv\staff.csv'
,array['uuid','laboratory','staff_status','researches','full_name','hygienist_flag','full_time_flag','salary','fixed_term_contract_flag','contract_end_date','specialization_date','specialization_end_date','dismissal_flag','dismissal_date']
,array['laboratory_id::ref(lookup_col:laboratory,ref_table:laboratory,ref_col:name,ref_id:laboratory_id)'
,'staff_status_id::ref(lookup_col:staff_status,ref_table:staff_status,ref_col:name,ref_id:staff_status_id)'
,'researches::ref_array(lookup_col:researches,ref_table:research,ref_col:research_type,ref_id:research_id,delim:#)'
,'full_name','hygienist_flag','full_time_flag','salary','fixed_term_contract_flag','contract_end_date','specialization_date','specialization_end_date','dismissal_flag','dismissal_date']
);
- Parameter and logic description:

  1. `uuid`, `full_name`, `hygienist_flag`, `full_time_flag`, `salary`, `fixed_term_contract_flag`, `contract_end_date`, `specialization_date`, `specialization_end_date`, `dismissal_flag`, `dismissal_date` — standard fields imported directly from the CSV file without changes.
  2. `laboratory_id` — the `laboratory` field in the CSV file is transformed into `laboratory_id` by looking up the corresponding identifier in the `laboratory` table by laboratory name.
  3. `staff_status_id` — similarly, the `staff_status` field is transformed into `staff_status_id` by looking up the staff status identifier in the `staff_status` table.
  4. `researches` — this field contains an array of research types that are transformed into the corresponding `research_id`s through a lookup in the `research` table. A special delimiter `#` separates the individual elements of the array.
This scenario shows how to effectively import complex data sets requiring additional transformations and searches in reference tables to ensure data integrity. It is particularly beneficial for systems where it is crucial to match imported data with existing entities in the database.
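Conceptually, the `ref_array` transformation resolves each delimited value to its identifier. The standalone sketch below is an equivalent SQL expression, not the procedure's actual implementation; the input string 'Bacteriological#Chemical' is a hypothetical value for the `researches` column:

SELECT array(
    SELECT r.research_id
    FROM unnest(string_to_array('Bacteriological#Chemical', '#')) AS v(research_type)
    JOIN research r ON r.research_type = v.research_type
) AS research_ids;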
3. Data loading process
For loading data into the database, it is recommended to use the standard functionality of Liquibase, a database change management tool that supports schema versioning and change management through XML changelogs, SQL scripts, and other formats.
Refer to the detailed practical guide on modeling and loading data structures into the registry: Task 1. Modeling registry database structures.
<databaseChangeLog
xmlns="http://www.liquibase.org/xml/ns/dbchangelog"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:ext="http://www.liquibase.org/xml/ns/dbchangelog-ext"
xsi:schemaLocation="http://www.liquibase.org/xml/ns/dbchangelog http://www.liquibase.org/xml/ns/dbchangelog/dbchangelog-4.2.xsd
http://www.liquibase.org/xml/ns/dbchangelog-ext http://www.liquibase.org/xml/ns/dbchangelog-ext.xsd">
<property name="dataLoadPath" value="D:\PostgreSQL\csv\"/>
<changeSet author="registry owner" id="load data to dictionaries">
<sql dbms="postgresql" endDelimiter=";" splitStatements="true" stripComments="true">
CALL p_load_table_from_csv('research','${dataLoadPath}01-Typy doslidgen.csv', array['code','research_type'], array['research_type']);
CALL p_load_table_from_csv('refusal_reason','${dataLoadPath}02-Prichiny vidmov.csv', array['code','document_type','name','constant_code'], array['document_type','name','constant_code']);
CALL p_load_table_from_csv('kopfg','${dataLoadPath}05-kopfg.csv', array['code','name']);
CALL p_load_table_from_csv('solution_type','${dataLoadPath}06-Typy rishen.csv', array['id','name','constant_code'], array['name','constant_code']);
CALL p_load_table_from_csv('factor','${dataLoadPath}14-Himichni_factory_OBRV.csv', array['code','name','col3','col4'], array['name','factor_type::''Хімічний: ОБРВ''']);
CALL p_load_table_from_csv('factor','${dataLoadPath}15-Himichni factory dovilni.csv', array['code','name'], array['name','factor_type::''Хімічний: довільні''']);
</sql>
</changeSet>
</databaseChangeLog>
- Description of the XML template and its components:

  1. `databaseChangeLog` — the root element that defines the database change log. It contains one or more `changeSet` elements, each representing a group of changes to be applied to the database.
  2. `property` — defines the `dataLoadPath` variable, which contains the path to the directory with the CSV files to be imported.
  3. `changeSet` — describes the set of changes to be applied. The `author` and `id` attributes help identify and version each change set.
  4. `sql` — used to execute SQL commands. The `dbms`, `endDelimiter`, `splitStatements`, and `stripComments` attributes control how the commands are executed in the context of a specific DBMS (in our case, PostgreSQL).
This example demonstrates using Liquibase to automate loading data from external files into specified database tables, ensuring efficient and controlled data management.
Also, read about creating the physical data model of the registry on the page Liquibase extensions for data modeling.
4. Requirements for CSV data import to the database
4.1. Import CSV requirements
CSV files used to load data into the database must meet specific structural and formatting requirements. Below is a detailed description of these requirements:

| Requirement | Description | Example |
|---|---|---|
| File format | Use only the CSV format. | |
| File name | File names should be in Latin characters. | `01-Typy_doslidgen_uuid.csv` |
| Encoding | Files must be in `UTF-8` encoding. | File saved as `UTF-8`. |
| File structure | The first line contains field names. Data follows from the second line. | |
| Field value separator | Use a comma (`,`) as the separator between field values. | |
| Absence of value (`null`) | A skip between two separators represents the absence of a value. | |
| Uniqueness by business field | Ensure the uniqueness of records by a key business field. | |
| Array of descriptive values notation | Use curly braces for arrays of descriptive values. | `{Data Analysis, Software Development}` |
These requirements ensure clarity and consistency of the imported data, facilitating efficient loading of information into the database without errors or discrepancies.
4.2. CSV Example
Consider an example of a CSV file that contains four columns, `name`, `email`, `age`, and `specializations`, along with four data entries. This file meets the specified requirements for structure and formatting.
name,email,age,specializations
John Doe,johndoe@example.com,30,{Data Analysis, Software Development}
Jane Smith,janesmith@example.com,28,{Project Management, User Experience Design}
Richard Roe,richardroe@example.com,35,{Network Administration, Technical Support}
Emily Davis,emilydavis@example.com,40,{Financial Planning, Market Research}
- In this example:

  1. `name` — the person’s name.
  2. `email` — the person’s email address.
  3. `age` — the person’s age.
  4. `specializations` — the person’s specializations, presented as an array.
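Such a file could be loaded with the procedure described in section 1. The call below is a sketch: the `person` table name and the file path are illustrative assumptions, and the `specializations` column in the target table is assumed to be an array type.

CALL p_load_table_from_csv(
    'person', '${dataLoadPath}person.csv',
    array['name', 'email', 'age', 'specializations']
);

Because the field names in the file match the fields of the target table, the optional fourth parameter is omitted.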
5. Recommendations for loading large amounts of data
When the need arises to load large volumes of data (tens or hundreds of megabytes) through CSV files, it is recommended to use direct SQL code instead of the standard procedure. Using SQL allows for control over the loading process and facilitates the creation of historical records in the `_hst` table and the filling of metadata in fields starting with `ddm_`. This approach is crucial for maintaining the integrity and proper operation of the registry, as it mimics actions typically performed automatically during procedure calls.
-- Creating a temporary staging table that matches the CSV file structure
CREATE TABLE account_csv_stage (username text, bank_number text);
-- Loading data from the CSV file into the staging table
COPY account_csv_stage (username, bank_number)
FROM '${dataLoadPath}account.csv' WITH (HEADER, FORMAT CSV);
-- Inserting data from the staging table into the main and historical tables while simultaneously filling in metadata
WITH main_table_cte AS (
INSERT INTO account (
username, bank_number, ddm_created_by,
ddm_updated_by
)
SELECT
username,
bank_number,
'admin',
'admin'
FROM
account_csv_stage RETURNING *
) INSERT INTO account_hst (
id, username, bank_number, ddm_created_by,
ddm_created_at, ddm_dml_op, ddm_system_id,
ddm_application_id, ddm_business_process_id
)
SELECT
id,
username,
bank_number,
ddm_created_by,
CURRENT_TIMESTAMP,
'I' as ddm_dml_op,
(
SELECT
ss.system_id
FROM
ddm_source_system ss
WHERE
ss.system_name = 'initial load'
) ddm_system_id,
(
SELECT
sa.application_id
FROM
ddm_source_application sa
WHERE
sa.application_name = 'initial load'
) ddm_application_id,
(
SELECT
sb.business_process_id
FROM
ddm_source_business_process sb
WHERE
sb.business_process_name = 'initial load process'
) ddm_business_process_id
FROM
main_table_cte;
-- Dropping the temporary staging table after operations are completed
DROP TABLE account_csv_stage;
This method can be an alternative to calling the procedure in the XML template for data loading. It is advisable to create a separate `changeSet` for each file load using this approach to ensure organized version and change management in the database.