Analysis of Kaggle's Outbrain Click Prediction Competition

https://yq.aliyun.com/articles/293596

https://www.kaggle.com/c/outbrain-click-prediction

https://www.kaggle.com/anokas/outbrain-eda

Personalized click-through rate (CTR) prediction for users

Basic scenario:

A user (uuid) views a document (document_id) and is shown a set of ads (ad_id); the task is to predict which of those ads gets clicked.

Raw data:

page_views.csv: the log of users visiting documents

  • uuid
  • document_id
  • timestamp (ms since 1970-01-01, minus 1465876799998; add 1465876799998 to recover the epoch time in ms)
  • platform (desktop = 1, mobile = 2, tablet = 3)
  • geo_location (country>state>DMA)
  • traffic_source (internal = 1, search = 2, social = 3)
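
The timestamp offset above can be undone as follows. This is a minimal sketch on made-up rows standing in for page_views.csv; the constant name TIMESTAMP_OFFSET_MS and the toy values are illustrative assumptions, only the offset itself comes from the field description.

```python
import pandas as pd

# Offset from the field description: stored timestamps are
# milliseconds since 1970-01-01 minus this constant.
TIMESTAMP_OFFSET_MS = 1465876799998

# Toy rows standing in for page_views.csv (values are made up).
page_views = pd.DataFrame({
    "uuid": ["u1", "u2"],
    "document_id": [10, 11],
    "timestamp": [0, 86_400_000],  # relative milliseconds
})

# Add the offset back, then interpret the result as epoch milliseconds (UTC).
page_views["datetime"] = pd.to_datetime(
    page_views["timestamp"] + TIMESTAMP_OFFSET_MS, unit="ms"
)
print(page_views[["timestamp", "datetime"]])
```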

clicks_train.csv:

  • display_id
  • ad_id
  • clicked (1 if clicked, 0 otherwise)

events.csv: (information on the display_id context)

  • display_id
  • uuid
  • document_id
  • timestamp
  • platform
  • geo_location
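
The geo_location field can be split into its parts. The sketch below assumes events.csv uses the same country>state>DMA format listed for page_views.csv, and that some rows carry only a country or a country and state; the toy values and the output column names are illustrative.

```python
import pandas as pd

# Toy rows standing in for events.csv (values are made up).
events = pd.DataFrame({
    "display_id": [1, 2, 3],
    "geo_location": ["US>CA>807", "CA>ON", "GB"],
})

# Split on '>' into separate columns; shorter values are padded with missing entries.
geo_parts = events["geo_location"].str.split(">", expand=True)
# Guarantee exactly three columns even if no row has all three levels.
geo_parts = geo_parts.reindex(columns=range(3))
geo_parts.columns = ["country", "state", "dma"]

events = pd.concat([events, geo_parts], axis=1)
print(events)
```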

promoted_content.csv: details on the ads.

  • ad_id
  • document_id
  • campaign_id
  • advertiser_id
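
The tables listed so far share keys: display_id links clicks_train.csv to events.csv, and ad_id links it to promoted_content.csv. A minimal sketch of joining them on toy data follows. The suffixes "_page" and "_ad" are illustrative, and reading the document_id in promoted_content.csv as the document the ad points to is an assumption.

```python
import pandas as pd

# Toy rows standing in for the three CSV files (values are made up).
clicks = pd.DataFrame({
    "display_id": [1, 1, 2],
    "ad_id": [100, 101, 100],
    "clicked": [0, 1, 1],
})
events = pd.DataFrame({
    "display_id": [1, 2],
    "uuid": ["u1", "u2"],
    "document_id": [10, 11],
    "platform": [1, 2],
})
promoted = pd.DataFrame({
    "ad_id": [100, 101],
    "document_id": [900, 901],
    "campaign_id": [7, 8],
    "advertiser_id": [70, 80],
})

# Attach the display context, then the ad details.
# Both sides carry a document_id column, so suffixes keep them apart.
merged = (
    clicks
    .merge(events, on="display_id", how="left")
    .merge(promoted, on="ad_id", how="left", suffixes=("_page", "_ad"))
)
print(merged)
```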

documents_meta.csv: details on the documents.

  • document_id
  • source_id (the part of the site on which the document is displayed, e.g. edition.cnn.com)
  • publisher_id
  • publish_time

documents_topics.csv, documents_entities.csv, and documents_categories.csv all provide information about the content in a document, as well as Outbrain's confidence in each respective relationship. 
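
One way to use the confidence values is to keep only the most confident topic per document. The sketch below runs on toy data; the column names topic_id and confidence_level are assumptions about the file layout and should be checked against the actual CSV header.

```python
import pandas as pd

# Toy rows standing in for documents_topics.csv (values and column names assumed).
documents_topics = pd.DataFrame({
    "document_id": [1, 1, 2],
    "topic_id": [5, 7, 5],
    "confidence_level": [0.2, 0.9, 0.4],
})

# For each document, find the row index with the highest confidence and keep that row.
best_rows = documents_topics.groupby("document_id")["confidence_level"].idxmax()
top_topics = documents_topics.loc[best_rows].reset_index(drop=True)
print(top_topics)
```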

Data analysis

import pandas as pd
import numpy as np  # needed for np.sum below
import os
import gc # We're gonna be clearing memory a lot
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

df_train = pd.read_csv('./outbrain-click-prediction/clicks_train.csv')
df_test = pd.read_csv('./outbrain-click-prediction/clicks_test.csv')

# Distribution of the number of ads per display
size_train = df_train.groupby('display_id')['ad_id'].count().value_counts()
size_train = size_train / np.sum(size_train)
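
To check what this computes, here is the same calculation on a toy table, written with value_counts(normalize=True) as an equivalent shortcut for dividing by the sum. The toy values are made up.

```python
import pandas as pd

# Toy rows standing in for clicks_train.csv: displays 1 and 3 show two ads, display 2 shows three.
clicks = pd.DataFrame({
    "display_id": [1, 1, 2, 2, 2, 3, 3],
    "ad_id": [10, 11, 12, 13, 14, 15, 16],
})

# Ads per display, then the share of displays having each ad count.
size_toy = (
    clicks.groupby("display_id")["ad_id"].count()
    .value_counts(normalize=True)
    .sort_index()
)
print(size_toy)
```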

Histogram:

plt.figure(figsize=(12,4))
p = sns.color_palette()
# Pass x and y by keyword; recent seaborn versions reject positional data arguments
sns.barplot(x=size_train.index, y=size_train.values, alpha=0.8, color=p[0], label='train')
plt.legend()
plt.xlabel('Number of Ads in display', fontsize=12)
plt.ylabel('Proportion of set', fontsize=12)

Count how many times each ad appears:

# Either of the following two lines works
df_train.groupby('ad_id')['ad_id'].count()
df_train.groupby('ad_id').size()  # one count per ad, returned as a Series
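
A natural follow-up to the per-ad counts is to ask what share of ads appear only rarely. This is a minimal sketch on toy data; the thresholds and the variable name rare_share are illustrative choices.

```python
import pandas as pd

# Toy rows standing in for clicks_train.csv: ad 1 appears 3 times, ad 2 once, ad 3 twice.
clicks = pd.DataFrame({"ad_id": [1, 1, 1, 2, 3, 3]})

ad_counts = clicks.groupby("ad_id")["ad_id"].count()

# Share of distinct ads that appear fewer than k times, for a few thresholds.
rare_share = {k: float((ad_counts < k).mean()) for k in (2, 3)}
for k, share in rare_share.items():
    print(f"Ads that appear fewer than {k} times: {share:.0%}")
```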

Fraction of ads in the test set that also appear in the training set:

len(set(df_test.ad_id.unique()).intersection(df_train.ad_id.unique())) / len(df_test.ad_id.unique())
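
The same overlap can be computed with Series.isin instead of Python sets. The sketch below uses toy frames so the result can be verified by hand; the frame contents are made up.

```python
import pandas as pd

# Toy stand-ins for clicks_train.csv and clicks_test.csv.
train = pd.DataFrame({"ad_id": [1, 2, 2, 3]})
test = pd.DataFrame({"ad_id": [2, 3, 3, 4, 5]})

# Distinct test ads, and the fraction of them that also occur in train.
test_ads = pd.Series(test["ad_id"].unique())
overlap = float(test_ads.isin(train["ad_id"]).mean())
print(overlap)  # ads 2 and 3 out of {2, 3, 4, 5}
```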

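Statistics on events.csv:

# Load events.csv (assumed to sit in the same directory as the click files)
events = pd.read_csv('./outbrain-click-prediction/events.csv')

print(events.columns.to_list())
print(events.head())
print(events.platform.value_counts())
# Cast platform to str so that values read as numbers and as strings are counted together
events.platform = events.platform.astype(str)
print(events.platform.value_counts())

print(events.groupby('uuid')['uuid'].count().sort_values())  # number of events per user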
Original post: https://www.cnblogs.com/ljygoodgoodstudydaydayup/p/10456935.html