InsightFlow Part 4: Data Exploration & Understanding the Datasets

InsightFlow GitHub Repo Before diving into building any data pipeline, a crucial first step is Data Exploration and Understanding. This involves getting familiar with the source datasets, their structure, the meaning of different fields, their frequency, and any potential quirks or challenges they might present. For the InsightFlow project, which aims to correlate Malaysian retail trends with fuel prices, we focused on three key datasets available from Malaysia's open data portal, data.gov.my. Let's explore each one: a) Headline Wholesale & Retail Trade (iowrt) Source: Department of Statistics Malaysia (DOSM) Purpose: Provides a high-level overview of the overall performance of the wholesale and retail trade sector in Malaysia. Frequency: Monthly (Data is typically reported for the first day of the month, e.g., 2023-01-01 represents January 2023). Key Variables: date: The month of the record (YYYY-MM-DD format, DD=01). series: Indicates the type of data in the row. We are primarily interested in 'abs' (absolute values), but it also includes 'growth_yoy' (year-on-year growth) and 'growth_mom' (month-on-month growth). sales: The total sales value generated by the sector for the month, reported in RM millions. This is a key metric for tracking overall economic activity in this sector. volume: An index representing the quantity of items sold (base year 2015 = 100). This helps understand if sales changes are due to price changes or actual volume changes. volume_sa: A seasonally adjusted version of the volume index, useful for identifying underlying trends by removing predictable seasonal patterns. Format: Available as CSV and Parquet. The pipeline ingests this via the data.gov.my API, which returns JSON, requiring conversion to Parquet for our raw layer. b) Wholesale & Retail Trade by Group (3-digit) (iowrt_3d) Source: Department of Statistics Malaysia (DOSM) Purpose: Breaks down the wholesale and retail trade performance into more granular sub-sectors based on the Malaysia Standard Industrial Classification (MSIC). This allows for a deeper analysis of which specific retail areas are driving overall trends. Frequency: Monthly (Same as the headline data). Key Variables: date: The month of the record. series: Same as iowrt ('abs', 'growth_yoy', 'growth_mom'). Again, we focus on 'abs'. group: A 3-digit code representing the specific MSIC group (e.g., '451' for Sale of motor vehicles, '471' for Retail sale in non-specialized stores). This is the key field for segmenting the data. sales: Sales value in RM millions for that specific group in that month. volume: Volume index for that specific group in that month. Format: Available as CSV and Parquet. The pipeline ingests the full Parquet file directly via its URL. Dependency: Understanding the group code requires joining with a separate MSIC Lookup Table (which we've incorporated as a seed file in our dbt project) to get meaningful descriptions like "Retail sale of food, beverages & tobacco in specialised stores". c) Price of Petroleum & Diesel (fuelprice) Source: Ministry of Finance, Malaysia Purpose: Provides the official weekly retail prices for different fuel types in Malaysia. This is the core economic indicator we want to correlate with retail sales. Frequency: Weekly (Unlike the monthly trade data). Key Variables: date: The date the price comes into effect (YYYY-MM-DD). series_type: Indicates if the row represents the price level ('level') or the weekly change ('change_weekly'). We focus on 'level'. ron95: Price per litre for RON95 petrol (in RM). ron97: Price per litre for RON97 petrol (in RM). diesel: Price per litre for Diesel in Peninsular Malaysia (in RM). diesel_eastmsia: Price per litre for Diesel in East Malaysia (in RM). Format: Available as CSV and Parquet. Ingested via the API (JSON) and converted to Parquet. Challenge: The primary challenge here is the frequency mismatch. The fuel prices are weekly, while the sales data is monthly. Our transformation process (specifically in dbt) will need to aggregate the weekly fuel prices into a meaningful monthly average to allow for direct comparison and joining with the sales data. Conclusion Understanding these three datasets – their purpose, key fields, frequency, and relationships (like the need for the MSIC lookup and handling the weekly-to-monthly fuel price aggregation) – is fundamental. This exploration directly informs the design of our ingestion scripts, the structure of our raw data lake storage (partitioning by date), and most importantly, the logic within our dbt models to clean, integrate, and transform this data into a cohesive format ready for analysis and visualization.

Apr 29, 2025 - 03:38
 0
InsightFlow Part 4: Data Exploration & Understanding the Datasets

InsightFlow GitHub Repo

Before diving into building any data pipeline, a crucial first step is Data Exploration and Understanding. This involves getting familiar with the source datasets, their structure, the meaning of different fields, their frequency, and any potential quirks or challenges they might present. For the InsightFlow project, which aims to correlate Malaysian retail trends with fuel prices, we focused on three key datasets available from Malaysia's open data portal, data.gov.my.

Let's explore each one:

a) Headline Wholesale & Retail Trade (iowrt)

  • Source: Department of Statistics Malaysia (DOSM)
  • Purpose: Provides a high-level overview of the overall performance of the wholesale and retail trade sector in Malaysia.
  • Frequency: Monthly (Data is typically reported for the first day of the month, e.g., 2023-01-01 represents January 2023).
  • Key Variables:
    • date: The month of the record (YYYY-MM-DD format, DD=01).
    • series: Indicates the type of data in the row. We are primarily interested in 'abs' (absolute values), but it also includes 'growth_yoy' (year-on-year growth) and 'growth_mom' (month-on-month growth).
    • sales: The total sales value generated by the sector for the month, reported in RM millions. This is a key metric for tracking overall economic activity in this sector.
    • volume: An index representing the quantity of items sold (base year 2015 = 100). This helps understand if sales changes are due to price changes or actual volume changes.
    • volume_sa: A seasonally adjusted version of the volume index, useful for identifying underlying trends by removing predictable seasonal patterns.
  • Format: Available as CSV and Parquet. The pipeline ingests this via the data.gov.my API, which returns JSON, requiring conversion to Parquet for our raw layer.

b) Wholesale & Retail Trade by Group (3-digit) (iowrt_3d)

  • Source: Department of Statistics Malaysia (DOSM)
  • Purpose: Breaks down the wholesale and retail trade performance into more granular sub-sectors based on the Malaysia Standard Industrial Classification (MSIC). This allows for a deeper analysis of which specific retail areas are driving overall trends.
  • Frequency: Monthly (Same as the headline data).
  • Key Variables:
    • date: The month of the record.
    • series: Same as iowrt ('abs', 'growth_yoy', 'growth_mom'). Again, we focus on 'abs'.
    • group: A 3-digit code representing the specific MSIC group (e.g., '451' for Sale of motor vehicles, '471' for Retail sale in non-specialized stores). This is the key field for segmenting the data.
    • sales: Sales value in RM millions for that specific group in that month.
    • volume: Volume index for that specific group in that month.
  • Format: Available as CSV and Parquet. The pipeline ingests the full Parquet file directly via its URL.
  • Dependency: Understanding the group code requires joining with a separate MSIC Lookup Table (which we've incorporated as a seed file in our dbt project) to get meaningful descriptions like "Retail sale of food, beverages & tobacco in specialised stores".

c) Price of Petroleum & Diesel (fuelprice)

  • Source: Ministry of Finance, Malaysia
  • Purpose: Provides the official weekly retail prices for different fuel types in Malaysia. This is the core economic indicator we want to correlate with retail sales.
  • Frequency: Weekly (Unlike the monthly trade data).
  • Key Variables:
    • date: The date the price comes into effect (YYYY-MM-DD).
    • series_type: Indicates if the row represents the price level ('level') or the weekly change ('change_weekly'). We focus on 'level'.
    • ron95: Price per litre for RON95 petrol (in RM).
    • ron97: Price per litre for RON97 petrol (in RM).
    • diesel: Price per litre for Diesel in Peninsular Malaysia (in RM).
    • diesel_eastmsia: Price per litre for Diesel in East Malaysia (in RM).
  • Format: Available as CSV and Parquet. Ingested via the API (JSON) and converted to Parquet.
  • Challenge: The primary challenge here is the frequency mismatch. The fuel prices are weekly, while the sales data is monthly. Our transformation process (specifically in dbt) will need to aggregate the weekly fuel prices into a meaningful monthly average to allow for direct comparison and joining with the sales data.

Conclusion

Understanding these three datasets – their purpose, key fields, frequency, and relationships (like the need for the MSIC lookup and handling the weekly-to-monthly fuel price aggregation) – is fundamental. This exploration directly informs the design of our ingestion scripts, the structure of our raw data lake storage (partitioning by date), and most importantly, the logic within our dbt models to clean, integrate, and transform this data into a cohesive format ready for analysis and visualization.