How to resample time-series data in Spark, like pandas resample

1. The naive approach

A pandas DataFrame makes resampling time-series data easy: a single resample call aggregates the data at whatever frequency you want. Spark has no direct equivalent, because a Spark DataFrame carries no notion of row order, so the time buckets have to be built by hand. The example below shows one way to do this in Spark.
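For comparison, this is roughly what the pandas side looks like (a minimal sketch; the column names dt and value are made up for illustration):

import pandas as pd

pdf = pd.DataFrame({
    'dt': pd.date_range('2020-01-01', periods=6, freq='30min'),
    'value': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# One line: bucket into 1-hour intervals and take the mean of each bucket.
hourly = pdf.set_index('dt').resample('1H').mean()
print(hourly)

The Spark version below has to reproduce that bucketing step manually.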

  

from pyspark.sql import functions as F

def resample(column, agg_interval=900, time_format='yyyy-MM-dd HH:mm:ss'):
    if isinstance(column, str):
        column = F.col(column)

    # Convert the timestamp to unix timestamp format.
    # Unix timestamp = number of seconds since 00:00:00 UTC, 1 January 1970.
    col_ut = F.unix_timestamp(column, format=time_format)

    # Divide the time into discrete intervals by rounding down
    # to the nearest multiple of agg_interval (in seconds).
    col_ut_agg = F.floor(col_ut / agg_interval) * agg_interval

    # Convert back to, and return, a human-readable timestamp.
    return F.from_unixtime(col_ut_agg)

df = df.withColumn('dt_resampled', resample(df.dt, agg_interval=3600)) # 1h resample
df.show()
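resample() only adds the bucket column; to finish the pandas-style resample you still group by that column and aggregate. A minimal sketch, assuming the DataFrame has a numeric column named value (a hypothetical name, substitute your own metric):

resampled = (
    df.groupBy('dt_resampled')
      .agg(F.avg('value').alias('value_avg'))  # mean of each 1-hour bucket
      .orderBy('dt_resampled')
)
resampled.show()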

2. The newer approach

  

   Use groupBy together with Spark SQL's window() function. window() buckets a timestamp column into fixed-length intervals directly, which is much closer to how pandas resample works; see the sketch below.
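A minimal sketch of the groupBy + window approach, assuming df has a string timestamp column dt and a numeric column value (hypothetical names):

from pyspark.sql import functions as F

# Parse the string timestamps, then bucket them into 1-hour tumbling windows.
resampled = (
    df.withColumn('ts', F.to_timestamp('dt', 'yyyy-MM-dd HH:mm:ss'))
      .groupBy(F.window('ts', '1 hour'))
      .agg(F.avg('value').alias('value_avg'))
      .select(F.col('window.start').alias('dt_resampled'), 'value_avg')
      .orderBy('dt_resampled')
)
resampled.show()

window() returns a struct with start and end fields, so window.start gives the left edge of each bucket, the same value the floor trick in section 1 produces.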

   

Ref:

https://mihevc.org/2016/09/28/spark-resampling.html (the naive approach, but easy to understand)

https://rsandstroem.github.io/sparkdataframes.html#Resampling-time-series-with-Spark (the newer groupBy + window approach)

Original post: https://www.cnblogs.com/mashuai-191/p/12915906.html