HOME/🔍 BigQuery/

🔍 100本ノック

Article Outline

|| データサイエンス100本ノック(構造化データ加工編) SQL編

| S-041 ★★

レシート明細テーブル(receipt)の売上金額(amount)を日付(sales_ymd)ごとに集計し、前日からの売上金額増減を計算せよ。なお、計算結果は10件表示すればよい。

with ttl_amount as (
    select
        sales_ymd
        , sum(amount) as amount
    from 
        `100knocks.receipt`
    group by sales_ymd
    order by sales_ymd
)
select
    sales_ymd
    -- , amount
    -- , lag(amount) over(order by sales_ymd) as lag
    , lag(amount) over(order by sales_ymd) - amount as diff
from 
    ttl_amount
order by 
    sales_ymd
limit 10
;
%%sql
WITH sales_amount_by_date AS (
    SELECT sales_ymd, SUM(amount) as amount FROM receipt
    GROUP BY sales_ymd
    ORDER BY sales_ymd
)
SELECT sales_ymd, LAG(sales_ymd, 1) OVER(ORDER BY sales_ymd) lag_ymd,
    amount,
    LAG(amount, 1) OVER(ORDER BY sales_ymd) as lag_amount,
    amount - LAG(amount, 1) OVER(ORDER BY sales_ymd) as diff_amount
FROM sales_amount_by_date
LIMIT 10;

Cf.

| S-042 ★★

レシート明細テーブル(receipt)の売上金額(amount)を日付(sales_ymd)ごとに集 計し、各日付のデータに対し、1日前、2日前、3日前のデータを結合せよ。結果は10件 表示すればよい。

with ttl_amount as (
    select
        sales_ymd
        , sum(amount) as amount
    from 
        `100knocks.receipt`
    group by sales_ymd
    order by sales_ymd
)
select
    sales_ymd
    , amount
    , lag(amount, 1) over w as amount_lag_1
    , lag(amount, 2) over w as amount_lag_2
    , lag(amount, 3) over w as amount_lag_3
from 
    ttl_amount
window
    w as (order by sales_ymd)
order by 
    sales_ymd
limit 10
;

| S-043 ★

レシート明細テーブル(receipt)と顧客テーブル(customer)を結合し、性別 (gender)と年代(ageから計算)ごとに売上金額(amount)を合計した売上サマリテー ブル(sales_summary)を作成せよ。性別は0が男性、1が女性、9が不明を表すものとす る。ただし、項目構成は年代、女性の売上金額、男性の売上金額、性別不明の売上金額の4項 目とすること(縦に年代、横に性別のクロス集計)。また、年代は10歳ごとの階級とすること。

with
    tb as (
        select
            sales_ymd
            , gender
            , age
            , case  
                  when age < 20 then "10's"
                  when age between 20 and 29 then "20's"
                  when age between 30 and 39 then "30's"
                  when age between 40 and 49 then "40's"
                  when age between 50 and 59 then "50's"
                  when age between 60 and 69 then "60's"
                  when age between 70 and 79 then "70's"
                  else "Over80's"
              end as age_category
            , amount
        from
            `100knocks.receipt`
            left join `100knocks.customer` using(customer_id)
        where 
            gender is not null
    )
select * from
    (select age_category, sum(amount) as male
     from tb
     where gender = "男性"
     group by age_category) as m

    left join (
        select
            age_category
            , sum(amount) as female
        from tb
        where gender = "女性"
        group by age_category) as f using(age_category)

    left join (
        select
            age_category
            , sum(amount) as unknown
        from tb
        where gender = "不明"
        group by age_category) as u using(age_category)

order by 
    age_category
;
%%sql

-- SQL向きではないため、やや強引に記載する(カテゴリ数が多いときはとても長いSQLとなってしまう点に注意)

DROP TABLE IF EXISTS sales_summary;

CREATE TABLE sales_summary AS
    WITH gender_era_amount AS (
        SELECT c.gender_cd,
        TRUNC(age/ 10) * 10 AS era,
        SUM(r.amount) AS amount
        FROM customer c
        JOIN receipt r
        ON c.customer_id = r.customer_id
        GROUP BY c.gender_cd, era
    )
    select era,
        MAX(CASE gender_cd WHEN '0' THEN amount ELSE 0 END) AS male ,
        MAX(CASE gender_cd WHEN '1' THEN amount ELSE 0 END) AS female,
        MAX(CASE gender_cd WHEN '9' THEN amount ELSE 0 END) AS unknown
    FROM gender_era_amount
    GROUP BY era
    ORDER BY era
;
SELECT * FROM sales_summary;

メモ - Gist

| S-044 ★

前設問で作成した売上サマリテーブル(sales_summary)は性別の売上を横持ちさせたも のであった。このテーブルから性別を縦持ちさせ、年代、性別コード、売上金額の3項目 に変換せよ。ただし、性別コードは男性を"00"、女性を"01"、不明を"99"とする。

with
    tb as (
        select
            sales_ymd
            , gender
            , age
            , case  
                  when age < 20              then "10's"
                  when age between 20 and 29 then "20's"
                  when age between 30 and 39 then "30's"
                  when age between 40 and 49 then "40's"
                  when age between 50 and 59 then "50's"
                  when age between 60 and 69 then "60's"
                  when age between 70 and 79 then "70's"
                  else "Over80's"
              end as age_category
            , amount
        from
            `prj-test3.100knocks.receipt`
            left join `prj-test3.100knocks.customer` using(customer_id)
        where gender is not null
    )
select * from
    (select
        '00' as gender_cd
        , age_category
        , sum(amount) as amount
    from tb
    where gender = "男性"
    group by age_category)
    union all
    (select
        '01' as gender_cd
        , age_category
        , sum(amount) as amount
    from tb
    where gender = "女性"
    group by age_category)
    union all
    (select
        '99' as gender_cd
        , age_category
        , sum(amount) as amount
    from tb
    where gender = "不明"
    group by age_category)
order by
    gender_cd
    , age_category
;

| S-045 ★

顧客テーブル(customer)の生年月日(birth_day)は日付型でデータを保有している。 これをYYYYMMDD形式の文字列に変換し、顧客ID(customer_id)とともに抽出せよ。 データは10件を抽出すれば良い。

select
    customer_id
    , replace(cast(birth_day as string), '-', '') as birth_day
from `prj-test3.100knocks.customer`
limit 10
;

Cf.【SQL】空白やスペース、0、特定の文字を削除する方法 - SE日記

# PostgreSQL
%%sql
SELECT customer_id, TO_CHAR(birth_day, 'YYYYMMDD') FROM customer LIMIT 10;

| S-046 ★

顧客テーブル(customer)の申し込み日(application_date)はYYYYMMDD形式の文字 列型でデータを保有している。これを日付型に変換し、顧客ID(customer_id)とともに 抽出せよ。データは10件を抽出すれば良い。

select
    customer_id
    , parse_date('%Y%m%d', cast(application_date as string)) as application_date
from `prj-test3.100knocks.customer`
limit 10
;

Cf.

# PostgreSQL
%%sql
SELECT customer_id, TO_DATE(application_date, 'YYYYMMDD')
FROM customer LIMIT 10;

| S-047 ★

レシート明細テーブル(receipt)の売上日(sales_ymd)はYYYYMMDD形式の数値型で データを保有している。これを日付型に変換し、レシート番号(receipt_no)、レシートサ ブ番号(receipt_sub_no)とともに抽出せよ。データは10件を抽出すれば良い。

select
    receipt_no
    , receipt_sub_no
    , parse_date('%Y%m%d', cast(sales_ymd as string)) as sales_ymd
from `prj-test3.100knocks.receipt`
limit10
;

| S-048 ★

レシート明細テーブル(receipt)の売上エポック秒(sales_epoch)は数値型のUNIX秒で データを保有している。これを日付型に変換し、レシート番号(receipt_no)、レシートサ ブ番号(receipt_sub_no)とともに抽出せよ。データは10件を抽出すれば良い。

select
    receipt_no
    , receipt_sub_no
    , format_timestamp('%Y-%m-%d %H:%M:%S', timestamp_seconds(sales_epoch), 'Asia/Tokyo')as sales_epoch
from `prj-test3.100knocks.receipt`
limit10
;

Cf.

# PostgreSQL
%%sql

SELECT
    TO_TIMESTAMP(sales_epoch) as sales_date,
    receipt_no, receipt_sub_no
FROM receipt
LIMIT 10;

| S-049 ★

レシート明細テーブル(receipt)の売上エポック秒(sales_epoch)を日付型に変換し、 「年」だけ取り出してレシート番号(receipt_no)、レシートサブ番号(receipt_sub_no) とともに抽出せよ。データは10件を抽出すれば良い。

select
    receipt_no
    , receipt_sub_no
    , format_date('%Y', cast(format_timestamp('%Y-%m-%d', timestamp_seconds(sales_epoch), 'Asia/Tokyo') as date)) as sales_epoch
from `prj-test3.100knocks.receipt`
limit10
;

Cf. BIGQUERY】分析入門 - SECTION6 - .tk

%%sql

SELECT
    TO_CHAR(EXTRACT(YEAR FROM TO_TIMESTAMP(sales_epoch)),'FM9999') as sales_year,
    receipt_no,
    receipt_sub_no
FROM receipt
LIMIT 10;

| S-050 ★

レシート明細テーブル(receipt)の売上エポック秒(sales_epoch)を日付型に変換し、 「月」だけ取り出してレシート番号(receipt_no)、レシートサブ番号(receipt_sub_no) とともに抽出せよ。なお、「月」は0埋め2桁で取り出すこと。データは10件を抽出すれば 良い。

select
    format_date('%m', cast(format_timestamp('%Y-%m-%d', timestamp_seconds(sales_epoch), 'Asia/Tokyo') as date)) as sales_epoch
    , receipt_no
    , receipt_sub_no
from `prj-test3.100knocks.receipt`
limit 10
;