Initial loading of registry data

The initial data load is performed with a PL/pgSQL procedure (PL/pgSQL is the procedural language embedded in PostgreSQL for writing advanced stored procedures and functions).

Before executing the initial data load procedure, make sure that all prerequisites described in the section Preparing data for migration are met. This ensures a smooth and uninterrupted data-loading process.

1. PL/pgSQL procedure specification for the Platform’s database

This section details the PL/pgSQL procedure that allows data to be loaded from CSV files into the Platform database tables.

Data loading procedure
CREATE OR REPLACE PROCEDURE p_load_table_from_csv(
  p_table_name text,
  p_file_name text,
  p_table_columns text[],
  p_target_table_columns text[] DEFAULT NULL::text[]
);
Parameter description:
1 p_table_name — the name of the target table where data will be loaded.
2 p_file_name — the full path to the CSV file from which data will be loaded.
3 p_table_columns — an array of column names in the data file, corresponding to the header (first) row of the CSV file.
4 p_target_table_columns — an array of target table fields. This parameter is optional if the list of fields in the data file matches the fields in the target table.
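
For example, when the columns in the CSV file match the target table columns one-to-one, the fourth parameter can simply be omitted. The call below repeats the kopfg load from the Liquibase template in section 3; the file path is illustrative:

Example. Procedure call without the optional parameter
CALL p_load_table_from_csv(
  'kopfg',
  'D:\PostgreSQL\csv\05-kopfg.csv',
  array['code','name']
);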

Value formats for the p_target_table_columns parameter:

  • <col_name>: The field in the target table receives the value of the corresponding field in the CSV file.

  • <col_name>::<some_calculation_based_on_source_cols>: The field receives a value from a calculation based on source fields.

    For more information on this approach, see the section Complex Scenario 1.
  • <col_name>::ref(lookup_col:<lookup_col_name>, ref_table:<ref_table_name>, ref_col:<ref_col_name>, ref_id:<ref_id_col_name>): Used to look up an id in a reference table when the input value is a descriptive field rather than an identifier; the value of this descriptive field must be unique in the reference table. Conceptually, the lookup performed is:

    SELECT <ref_id_col_name> INTO <col_name> FROM <ref_table_name> WHERE <ref_col_name> = <lookup_col_name>;
    For more details on this usage method, see the section Complex Scenario 2.
  • <col_name>::ref_array(lookup_col:<lookup_col_name>, ref_table:<ref_table_name>, ref_col:<ref_col_name>, ref_id:<ref_id_col_name>, delim:<delimiter_character>): Similar to the previous notation, but for cases where the <lookup_col_name> field in the CSV file contains an array of descriptive values; an array of the corresponding identifiers is returned (see the illustrative queries below).
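
For orientation only, the lookups implied by the ref and ref_array notations are roughly equivalent to the queries below. Table and column names are taken from the staff example in Complex Scenario 2, the literal values are hypothetical, and the procedure's internal implementation may differ:

Example. Conceptual ref and ref_array lookups (illustrative)
-- ref: resolve a single descriptive value to its identifier
SELECT l.laboratory_id
FROM laboratory l
WHERE l.name = 'Central laboratory';   -- hypothetical value from the CSV column

-- ref_array: split a #-delimited list of descriptive values and resolve each element
SELECT array_agg(r.research_id)
FROM unnest(string_to_array('Bacteriological#Chemical', '#')) AS v(research_type)
JOIN research r ON r.research_type = v.research_type;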

2. Procedure call examples

2.1. Basic scenario

This scenario demonstrates the primary use of the procedure for importing data from a CSV file into a database table. Its distinctive feature is the ability to import the entity ID value that uniquely identifies table records.

Procedure call. Basic scenario
CALL p_load_table_from_csv(
  'research', 'D:\PostgreSQL\csv\uuid\01-Typy_doslidgen_uuid.csv',
  array [ 'uuid', 'research_type' ],
  array[ 'research_type' ]
);
Parameter description:
1 uuid — a static (fixed) name that does not depend on the entity ID name. It imports a unique identifier (UUID) from the file. When importing, the primary key of the target table must be of type UUID.
2 The ID column must be the first in the import file and have a UUID format to ensure proper data processing.

This example illustrates data import from the file 01-Typy_doslidgen_uuid.csv, located at D:\PostgreSQL\csv\uuid\, into the research table. The imported fields include uuid as a unique identifier and research_type as the type of research.
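
For this scenario to work, the primary key of the research table must be of type UUID, as noted above. A minimal table definition consistent with this call might look as follows (illustrative only; a real registry table is generated from the data model and also contains ddm_ metadata columns):

Example. Minimal research table definition (illustrative)
CREATE TABLE research (
  research_id uuid PRIMARY KEY,  -- UUID primary key; receives the values of the uuid column from the CSV file
  research_type text             -- type of research, loaded from the research_type column
);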

2.2. Complex Scenario 1

This scenario demonstrates the extended use of the procedure for importing data that includes processing specific conditions and forming additional fields using calculations based on the data imported from the CSV file. This approach is used to address specific data processing and classification needs.

Procedure call. Complex scenario 1
CALL p_load_table_from_csv(
  'koatuu', '${dataLoadPath}03-KOATUU_simple.csv',
  array[ 'code', 'category', 'name' ],
  array[ 'code', 'category', 'name',
  'level1::substring(code,1,2)||''00000000''',
  'level2::substring(code,1,5)||''00000''',
  'type::CASE WHEN code ~ ''[0-9]{2}0{8}'' AND code !~ ''(80|85)0{8}'' THEN ''О''
            WHEN code ~ ''[0-9]{2}2[0-9]{2}0{5}'' AND code !~ ''[0-9]{2}20{7}'' THEN ''Р''
            WHEN coalesce(category, ''Р'') != ''Р''
            OR code IN (SELECT DISTINCT substring(code,1,5)||''00000'' FROM koatuu_csv k2 WHERE category = ''Р'') AND category IS NULL
            OR code ~ ''(80|85)0{8}'' THEN ''НП''
            ELSE NULL END' ]
);
Parameter and logic description:
1 code, category, name — standard fields imported directly from the CSV file.
2 level1 — creates a new field by taking the first two digits of the code field and appending eight zeros, forming a ten-character first-level identifier (see the illustrative query after this list).
3 level2 — similar to level1, but takes the first five digits and appends five zeros, creating a ten-character second-level identifier.
4 type — determines the object type based on a complex conditional expression. Using regular expressions and conditional logic, this field classifies records based on their codes and categories.
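
The level1 and level2 expressions can be checked in isolation with an ordinary query. The KOATUU code below is hypothetical:

Example. Evaluating the level1 and level2 expressions (illustrative)
SELECT substring('0510136300', 1, 2) || '00000000' AS level1,  -- '0500000000' (first level)
       substring('0510136300', 1, 5) || '00000'    AS level2;  -- '0510100000' (second level)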

This scenario is beneficial for processing and classifying geographical data according to KOATUU (Classifier of Objects of the Administrative-Territorial Unit of Ukraine), providing automated creation of hierarchical levels and types of objects based on their codes.

2.3. Complex Scenario 2

This scenario reveals the possibilities of importing data for a complex staff structure using additional transformations to match data with external keys and identifiers in the database. The scenario includes searching in reference tables using descriptive values to obtain corresponding identifiers and processing arrays of data.

Procedure call. Complex scenario 2
CALL p_load_table_from_csv(
  'staff', 'D:\PostgreSQL\csv\staff.csv',
  array['uuid','laboratory','staff_status','researches','full_name','hygienist_flag','full_time_flag','salary','fixed_term_contract_flag','contract_end_date','specialization_date','specialization_end_date','dismissal_flag','dismissal_date'],
  array['laboratory_id::ref(lookup_col:laboratory,ref_table:laboratory,ref_col:name,ref_id:laboratory_id)',
        'staff_status_id::ref(lookup_col:staff_status,ref_table:staff_status,ref_col:name,ref_id:staff_status_id)',
        'researches::ref_array(lookup_col:researches,ref_table:research,ref_col:research_type,ref_id:research_id,delim:#)',
        'full_name','hygienist_flag','full_time_flag','salary','fixed_term_contract_flag','contract_end_date','specialization_date','specialization_end_date','dismissal_flag','dismissal_date']
);
Parameter and logic description:
1 uuid, full_name, hygienist_flag, full_time_flag, salary, fixed_term_contract_flag, contract_end_date, specialization_date, specialization_end_date, dismissal_flag, dismissal_date — standard fields imported directly from the CSV file without changes.
2 laboratory_id — the laboratory field in the CSV file is transformed into laboratory_id by searching for the corresponding identifier in the laboratory table by the laboratory name.
3 staff_status_id — similarly, the staff_status field is transformed into staff_status_id by searching for the staff status identifier in the staff_status table.
4 researches — this field contains an array of research types that are transformed into the corresponding research_id values through a lookup in the research table. A special delimiter # is used to separate individual elements of the array.

This scenario shows how to effectively import complex data sets requiring additional transformations and searches in reference tables to ensure data integrity. It is particularly beneficial for systems where it is crucial to match imported data with existing entities in the database.
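
Because the ref lookup requires the descriptive field to be unique (see section 1), it can be useful to verify the reference tables before running the load. A hypothetical pre-check for the laboratory table:

Example. Uniqueness pre-check for a reference table (illustrative)
-- duplicate laboratory names would make the ref lookup ambiguous
SELECT name, count(*)
FROM laboratory
GROUP BY name
HAVING count(*) > 1;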

3. Data loading process

For loading data into the database, it is recommended to use the standard functionality of Liquibase, a database management tool that supports database schema versioning and change management through XML files, SQL scripts, and other formats.

Refer to the detailed practical guide on modeling and loading data structures into the registry for guidance: Task 1. Modeling registry database structures.
Sample XML template for data loading
<databaseChangeLog
    xmlns="http://www.liquibase.org/xml/ns/dbchangelog"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:ext="http://www.liquibase.org/xml/ns/dbchangelog-ext"
        xsi:schemaLocation="http://www.liquibase.org/xml/ns/dbchangelog http://www.liquibase.org/xml/ns/dbchangelog/dbchangelog-4.2.xsd
        http://www.liquibase.org/xml/ns/dbchangelog-ext http://www.liquibase.org/xml/ns/dbchangelog-ext.xsd">
    <property  name="dataLoadPath"  value="D:\PostgreSQL\csv\"/>
    <changeSet author="registry owner" id="load data to dictionaries">
       <sql dbms="postgresql" endDelimiter=";" splitStatements="true" stripComments="true">
            CALL p_load_table_from_csv('research','${dataLoadPath}01-Typy doslidgen.csv', array['code','research_type'], array['research_type']);
            CALL p_load_table_from_csv('refusal_reason','${dataLoadPath}02-Prichiny vidmov.csv', array['code','document_type','name','constant_code'], array['document_type','name','constant_code']);
            CALL p_load_table_from_csv('kopfg','${dataLoadPath}05-kopfg.csv', array['code','name']);
            CALL p_load_table_from_csv('solution_type','${dataLoadPath}06-Typy rishen.csv', array['id','name','constant_code'], array['name','constant_code']);
            CALL p_load_table_from_csv('factor','${dataLoadPath}14-Himichni_factory_OBRV.csv', array['code','name','col3','col4'], array['name','factor_type::''Хімічний: ОБРВ''']);
            CALL p_load_table_from_csv('factor','${dataLoadPath}15-Himichni factory dovilni.csv', array['code','name'], array['name','factor_type::''Хімічний: довільні''']);
        </sql>
    </changeSet>
</databaseChangeLog>
Description of the XML template and its components:
1 databaseChangeLog — the root element that defines the database change log. It contains one or more changeSet elements, each representing a group of changes to be applied to the DB.
2 property — defines the variable dataLoadPath, which contains the path to the directory with CSV files to be imported.
3 changeSet — describes the set of changes to be applied. The author and id attributes help to identify and version each set of changes.
4 sql — used to execute SQL commands. The attributes dbms, endDelimiter, splitStatements, and stripComments control how the commands are executed in the context of a specific DBMS (in our case, PostgreSQL).

This example demonstrates using Liquibase to automate loading data from external files into specified database tables, ensuring efficient and controlled data management.

Also, read about creating the physical data model of the registry on the page Liquibase extensions for data modeling.

4. Requirements for CSV data import to the database

4.1. Import CSV requirements

CSV files used for loading data into the database must meet specific structural and formatting requirements. Below is a detailed description of these requirements with examples:

  • File format: use only the CSV format. Example: data.csv
  • File name: file names must use Latin characters. Example: employees.csv
  • Encoding: files must be in UTF-8 encoding. Example: a file saved as UTF-8 without BOM.
  • File structure: the first line contains the field names; data starts from the second line. Example: name,email\nJohn Doe,johndoe@example.com
  • Field value separator: use , or ; as the separator. Example: name,age\nJohn Doe,30 or name;age\nJohn Doe;30
  • Absence of value (NULL): an empty value between two separators represents a missing value. Example: name,age,address\nJohn Doe,30, (address is missing)
  • Uniqueness by business field: ensure the uniqueness of records by a key business field. Example: in the departments.csv file, each department name (name) is unique.
  • Array of descriptive values notation: use curly braces for arrays of descriptive values. Example: specializations\n{Data Analysis, Software Development, Project Management}

These requirements ensure clarity and consistency of the imported data, facilitating efficient loading of information into the database without errors or discrepancies.

4.2. CSV Example

Consider an example of a CSV file that contains four columns: name, email, age, and specializations, along with four data entries. This file meets the specified requirements for structure and formatting.

Example. Content of test data in CSV format
name,email,age,specializations
John Doe,johndoe@example.com,30,{Data Analysis, Software Development}
Jane Smith,janesmith@example.com,28,{Project Management, User Experience Design}
Richard Roe,richardroe@example.com,35,{Network Administration, Technical Support}
Emily Davis,emilydavis@example.com,40,{Financial Planning, Market Research}
In this example:
1 name — the person’s name.
2 email — the person’s email address.
3 age — the person’s age.
4 specializations — the person’s specializations, presented as an array.
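
A target table consistent with this file could declare specializations as a PostgreSQL array, so that the curly-brace notation from the requirements above maps onto it directly. The definition below is illustrative, not an existing registry table:

Example. Target table with an array column (illustrative)
CREATE TABLE person_example (
  person_id uuid DEFAULT gen_random_uuid() PRIMARY KEY,  -- hypothetical generated primary key (PostgreSQL 13+)
  name text,
  email text,
  age integer,
  specializations text[]                                 -- receives the curly-brace array values from the CSV file
);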

5. Recommendations for loading large amounts of data

When the need arises to load large volumes of data (tens or hundreds of megabytes) from CSV files, it is recommended to use direct SQL code instead of the standard procedure. Using SQL gives full control over the loading process and makes it possible to create historical records in the corresponding _hst table and to fill in the metadata fields starting with ddm_. This approach is crucial for maintaining the integrity and proper operation of the registry, as it reproduces the actions normally performed automatically when the procedure is called.

Sample SQL code for data loading
-- Creating a temporary staging table that matches the CSV file structure
CREATE TABLE account_csv_stage (username text, bank_number text);
-- Loading data from the CSV file into the staging table
COPY account_csv_stage (username, bank_number)
FROM
  '${dataLoadPath}account.csv' WITH (HEADER, FORMAT CSV);
-- Inserting data from the staging table into the main and historical tables while simultaneously filling in metadata
WITH main_table_cte AS (
  INSERT INTO account (
    username, bank_number, ddm_created_by,
    ddm_updated_by
  )
  SELECT
    username,
    bank_number,
    'admin',
    'admin'
  FROM
    account_csv_stage RETURNING *
) INSERT INTO account_hst (
  id, username, bank_number, ddm_created_by,
  ddm_created_at, ddm_dml_op, ddm_system_id,
  ddm_application_id, ddm_business_process_id
)
SELECT
  id,
  username,
  bank_number,
  ddm_created_by,
  CURRENT_TIMESTAMP,
  'I' as ddm_dml_op,
  (
    SELECT
      ss.system_id
    FROM
      ddm_source_system ss
    WHERE
      ss.system_name = 'initial load'
  ) ddm_system_id,
  (
    SELECT
      sa.application_id
    FROM
      ddm_source_application sa
    WHERE
      sa.application_name = 'initial load'
  ) ddm_application_id,
  (
    SELECT
      sb.business_process_id
    FROM
      ddm_source_business_process sb
    WHERE
      sb.business_process_name = 'initial load process'
  ) ddm_business_process_id
FROM
  main_table_cte;
-- Dropping the staging table after operations are completed
DROP TABLE account_csv_stage;

This method can be an alternative to calling the procedure in the XML template for data loading. It is advisable to create a separate changeSet for each file load using this approach to ensure organized version and change management in the database.