HOME/🔍 BigQuery/

🔍 100本ノック

Article Outline

|| データサイエンス100本ノック(構造化データ加工編) SQL編

| S-021 ★

レシート明細テーブル(receipt)に対し、件数をカウントせよ。

select count(*) from `100knocks.receipt`;

| S-022 ★

レシート明細テーブル(receipt)の顧客ID(customer_id)に対し、ユニーク件数をカウントせよ。

select
    count(distinct customer_id) as unique_cnt
from 
    `100knocks.receipt`
;

| S-023 ★

レシート明細テーブル(receipt)に対し、店舗コード(store_cd)ごとに売上金額 (amount)と売上数量(quantity)を合計せよ。

select
    store_cd
    , sum(amount) as ttl_amount
    , sum(quantity) as ttl_quantity
from 
    `100knocks.receipt`
group by 
    store_cd
;

| S-024 ★

レシート明細テーブル(receipt)に対し、顧客ID(customer_id)ごとに最も新しい売上日(sales_ymd)を求め、10件表示せよ。

select
    customer_id
    , max(recently_sales_ymd) as recently_sales_ymd
from(
    select
        customer_id
        , last_value(sales_ymd) over(
                partition by 
                    customer_id 
                order by 
                    sales_ymd
                rows 
                    between UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING
          ) as recently_sales_ymd
    from 
        `100knocks.receipt`
)
group by 
    customer_id
limit 10
;
select
    customer_id
    , max(sales_ymd) as recently_sales_ymd
from 
    `100knocks.receipt`
group by 
    customer_id
limit 10
;

| S-025 ★

レシート明細テーブル(receipt)に対し、顧客ID(customer_id)ごとに最も古い売上日 (sales_ymd)を求め、10件表示せよ。

select
    customer_id
    , min(sales_ymd)
from 
    `100knocks.receipt`
group by 
    customer_id
;

| S-026 ★

レシート明細テーブル(receipt)に対し、顧客ID(customer_id)ごとに最も新しい売上 日(sales_ymd)と古い売上日を求め、両者が異なるデータを10件表示せよ。

select
    customer_id
    , max(sales_ymd) as recently
    , min(sales_ymd) as formely
from 
    `100knocks.receipt`
group by 
    customer_id
having 
    recently <> formely
limit 10
;

| S-027 ★

レシート明細テーブル(receipt)に対し、店舗コード(store_cd)ごとに売上金額 (amount)の平均を計算し、降順でTOP5を表示せよ。

select
    store_cd
    , avg(amount)as mean_amount
from
    `100knocks.receipt`
group by 
    store_cd
order by 
    mean_amount desc
limit 5
;

| S-028 ★

レシート明細テーブル(receipt)に対し、店舗コード(store_cd)ごとに売上金額 (amount)の中央値を計算し、降順でTOP5を表示せよ。

# BigQuery
select
    store_cd
    , max(median_amount) as median_amount
from
    (select
         store_cd
         , percentile_cont(amount, 0.5) over(partition by store_cd) as median_amount
     from `100knocks.receipt`) as stcd
group by 
    store_cd
order by 
    median_amount desc
limit 5
;
# PostgreSQL
select
    store_cd
    , percentile_cont(0.5)within group(order by amount) as median_amount
from `100knocks.receipt`
group by store_cd
order by median_amount desc
limit 5
;

| S-029 ★★

レシート明細テーブル(receipt)に対し、店舗コード(store_cd)ごとに商品コード (product_cd)の最頻値を求めよ。

# BigQuery
select
    store_cd
    , approx_top_count(product_cd, 1) as product_cd_mod
from 
    `100knocks.receipt`
group by 
    store_cd
;

Cf.

# PostgreSQL(解法1)
WITH product_mode AS (
    SELECT store_cd,product_cd, COUNT(1) as mode_cnt,
        RANK() OVER(PARTITION BY store_cd ORDER BY COUNT(1) DESC) AS rnk
    FROM receipt
    GROUP BY store_cd,product_cd
)
SELECT store_cd,product_cd, mode_cnt
FROM product_mode
WHERE rnk = 1
ORDER BY store_cd,product_cd;

# PostgreSQL(解法2)
SELECT store_cd, mode() WITHIN GROUP(ORDER BY product_cd)
FROM receipt
GROUP BY store_cd
ORDER BY store_cd

| S-030 ★

レシート明細テーブル(receipt)に対し、店舗コード(store_cd)ごとに売上金額 (amount)の標本分散を計算し、降順でTOP5を表示せよ。

# BigQuery 解法1(スクラッチ)
with 
    mean_tb as (
        select
            store_cd
            , avg(amount) as mean
        from 
            `100knocks.receipt`
        group by 
            store_cd
    )
    , diff_tb as (
        select
            store_cd
            , pow(cast(amount as float64) - m.mean, 2) as deviation_square
        from 
            `100knocks.receipt`
            left join mean_tb m using(store_cd)
    )
select
    store_cd
    , avg(deviation_square) as variance
from  diff_tb
group by store_cd
order by variance desc
limit 5
;
# BigQuery 解法2
select
    store_cd
    , variance(amount) as variance
from 
    `100knocks.receipt`
group by 
    store_cd
order by 
    variance desc
limit 5
;

Cf. BigQueryで統計量を出す時に使うクエリメモ - Qiita