susinlee 님의 블로그

코호트 분석 - SQL 쿼리로 구현하기 본문

코드카타/SQL, Pandas

코호트 분석 - SQL 쿼리로 구현하기

susinlee 2025. 3. 16. 11:11

customer_orders 테이블

컬럼명 타입 설명
ORDER_ID INT 주문 ID (PK)
CUSTOMER_ID INT 고객 ID
ORDER_DATE DATE 주문 날짜
ORDER_AMOUNT DECIMAL(10, 2) 주문 금액

 

고객의 첫 주문 월을 기준으로 코호트 그룹을 만들고,

각 코호트 그룹에서 시간이 지남에 따라 활성 사용자 수를 계산하는 쿼리를 짜보자.

각 코호트 그룹에 대해 1개월 후부터 12개월 후까지의 활성 사용자 수를 추적한다.

 

1. 고객별로 가장 처음 주문한 날짜를 구한다.

SELECT customer_id,
       DATE_FORMAT(MIN(order_date), '%Y-%m-01') AS first_order_month
FROM customer_orders
GROUP BY customer_id

 

2. 위의 테이블을 customer_orders 테이블과 join 하고, order_date의 월과 첫 주문 월의 차를 구해준다.

WITH cohort AS (
    SELECT customer_id,
           DATE_FORMAT(MIN(order_date), '%Y-%m-01') AS first_order_month
    FROM customer_orders
    GROUP BY customer_id
)
SELECT 
    co.customer_id,
    first_order_month,
    DATE_FORMAT(order_date, '%Y-%m-01') AS order_date,
    TIMESTAMPDIFF(
        MONTH, 
        first_order_month, 
        DATE_FORMAT(order_date, '%Y-%m-01')
    ) AS month_number
FROM customer_orders co
JOIN cohort c ON co.customer_id = c.customer_id

 

3. 그렇게 만든 테이블을 통해 각 경우(CASE)에 대해서 고유 customer_id 수를 집계해준다.

WITH cohort AS (
    SELECT customer_id,
           DATE_FORMAT(MIN(order_date), '%Y-%m-01') AS first_order_month
    FROM customer_orders
    GROUP BY customer_id
), actvie_order AS (
    SELECT 
        co.customer_id,
        first_order_month,
        DATE_FORMAT(order_date, '%Y-%m-01') AS order_date,
        TIMESTAMPDIFF(
            MONTH, 
            first_order_month, 
            DATE_FORMAT(order_date, '%Y-%m-01')
        ) AS month_number
    FROM customer_orders co
    JOIN cohort c ON co.customer_id = c.customer_id
)
SELECT 
    first_order_month ,
    COUNT(DISTINCT CASE WHEN month_number = 0 THEN customer_id END) AS cohort_user_count,
    COUNT(DISTINCT CASE WHEN month_number = 1 THEN customer_id END) AS user_count_1_month_later,
    COUNT(DISTINCT CASE WHEN month_number = 2 THEN customer_id END) AS user_count_2_month_later,
    COUNT(DISTINCT CASE WHEN month_number = 3 THEN customer_id END) AS user_count_3_month_later,
    COUNT(DISTINCT CASE WHEN month_number = 4 THEN customer_id END) AS user_count_4_month_later,
    COUNT(DISTINCT CASE WHEN month_number = 5 THEN customer_id END) AS user_count_5_month_later,
    COUNT(DISTINCT CASE WHEN month_number = 6 THEN customer_id END) AS user_count_6_month_later,
    COUNT(DISTINCT CASE WHEN month_number = 7 THEN customer_id END) AS user_count_7_month_later,
    COUNT(DISTINCT CASE WHEN month_number = 8 THEN customer_id END) AS user_count_8_month_later,
    COUNT(DISTINCT CASE WHEN month_number = 9 THEN customer_id END) AS user_count_9_month_later,
    COUNT(DISTINCT CASE WHEN month_number = 10 THEN customer_id END) AS user_count_10_month_later,
    COUNT(DISTINCT CASE WHEN month_number = 11 THEN customer_id END) AS user_count_11_month_later,
    COUNT(DISTINCT CASE WHEN month_number = 12 THEN customer_id END) AS user_count_12_month_later
FROM actvie_order
GROUP BY first_order_month

 


미리 중간 집계를 수행하여 쿼리의 효율성을 높이자

WITH cohort AS (
    SELECT
        CUSTOMER_ID,
        DATE(DATE_FORMAT(MIN(ORDER_DATE), '%Y-%m-01')) AS first_order_month
    FROM customer_orders
    GROUP BY CUSTOMER_ID
), active_orders AS (
    SELECT
        o.CUSTOMER_ID,
        c.first_order_month,
        DATE(DATE_FORMAT(o.ORDER_DATE, '%Y-%m-01')) AS active_month
    FROM customer_orders o
    JOIN cohort c
        ON o.CUSTOMER_ID = c.CUSTOMER_ID
), cohort_counts AS (
    SELECT
        first_order_month,
        active_month,
        COUNT(DISTINCT CUSTOMER_ID) AS user_count
    FROM active_orders
    GROUP BY first_order_month, active_month
)
SELECT
    DATE_FORMAT(first_order_month, '%Y-%m') FIRST_ORDER_MONTH,
    SUM(CASE WHEN active_month = first_order_month THEN user_count ELSE 0 END) AS COHORT_USER_COUNT,
    SUM(CASE WHEN active_month = DATE_ADD(first_order_month, INTERVAL 1 MONTH) THEN user_count ELSE 0 END) AS USER_COUNT_1_MONTH_LATER,
    SUM(CASE WHEN active_month = DATE_ADD(first_order_month, INTERVAL 2 MONTH) THEN user_count ELSE 0 END) AS USER_COUNT_2_MONTH_LATER, 
    SUM(CASE WHEN active_month = DATE_ADD(first_order_month, INTERVAL 3 MONTH) THEN user_count ELSE 0 END) AS USER_COUNT_3_MONTH_LATER,
    SUM(CASE WHEN active_month = DATE_ADD(first_order_month, INTERVAL 4 MONTH) THEN user_count ELSE 0 END) AS USER_COUNT_4_MONTH_LATER,
    SUM(CASE WHEN active_month = DATE_ADD(first_order_month, INTERVAL 5 MONTH) THEN user_count ELSE 0 END) AS USER_COUNT_5_MONTH_LATER,
    SUM(CASE WHEN active_month = DATE_ADD(first_order_month, INTERVAL 6 MONTH) THEN user_count ELSE 0 END) AS USER_COUNT_6_MONTH_LATER,
    SUM(CASE WHEN active_month = DATE_ADD(first_order_month, INTERVAL 7 MONTH) THEN user_count ELSE 0 END) AS USER_COUNT_7_MONTH_LATER,
    SUM(CASE WHEN active_month = DATE_ADD(first_order_month, INTERVAL 8 MONTH) THEN user_count ELSE 0 END) AS USER_COUNT_8_MONTH_LATER,
    SUM(CASE WHEN active_month = DATE_ADD(first_order_month, INTERVAL 9 MONTH) THEN user_count ELSE 0 END) AS USER_COUNT_9_MONTH_LATER,
    SUM(CASE WHEN active_month = DATE_ADD(first_order_month, INTERVAL 10 MONTH) THEN user_count ELSE 0 END) AS USER_COUNT_10_MONTH_LATER,
    SUM(CASE WHEN active_month = DATE_ADD(first_order_month, INTERVAL 11 MONTH) THEN user_count ELSE 0 END) AS USER_COUNT_11_MONTH_LATER,
    SUM(CASE WHEN active_month = DATE_ADD(first_order_month, INTERVAL 12 MONTH) THEN user_count ELSE 0 END) AS USER_COUNT_12_MONTH_LATER
FROM cohort_counts
GROUP BY first_order_month
ORDER BY first_order_month;