Techshack Weekly - A 10-Minute Introduction to Pandas


This post is a 10-minute introduction to pandas.

Create a Series: pd.Series(given_list). Use np.nan for missing values.
Create a DataFrame: pd.DataFrame(given_numpy_array, index=index, columns=given_list). This one is slightly more involved: the first argument is a NumPy array, index is a pandas Index instance, and columns is the list of column names.
Create a DataFrame from a dict: pd.DataFrame({"a": 1., "B": pd_series, ...})
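
The three constructors above can be sketched as follows. The sample values, the date range, and the column names are assumptions chosen for illustration only.

```python
import numpy as np
import pandas as pd

# A Series built from a list; np.nan marks the missing entry.
s = pd.Series([1, 3, 5, np.nan, 6, 8])

# A DataFrame built from a NumPy array, with a date index and named columns.
dates = pd.date_range("20130101", periods=6)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((6, 4)), index=dates, columns=list("ABCD"))

# A DataFrame built from a dict; the scalar 1.0 is broadcast to every row,
# and the row index is taken from the Series.
df2 = pd.DataFrame({
    "a": 1.0,
    "B": pd.Series([10, 20, 30]),
    "C": ["x", "y", "z"],
})
print(df2.dtypes)
```
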
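Check the dtype of each column: df.dtypes
View the first and last rows: df.head(), df.tail(3). head() returns 5 rows by default.
View the index: df.index
View the columns: df.columns
View the values: df.values
Quick summary statistics: df.describe()
Transpose: df.T
Sort by axis labels: df.sort_index(axis=1, ascending=False)
Sort by values: df.sort_values(by='B')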
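
A minimal sketch of the inspection and sorting calls above, run on a small made-up frame (the data and labels are assumptions):

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list("ABCD"))

first_rows = df.head()      # first 5 rows by default
last_rows = df.tail(3)      # last 3 rows
summary = df.describe()     # count, mean, std, min, quartiles, max per column
transposed = df.T           # rows become columns and vice versa

# Sort the columns by label in descending order: D, C, B, A.
by_columns = df.sort_index(axis=1, ascending=False)

# Sort the rows by the values in column B, largest first.
by_values = df.sort_values(by="B", ascending=False)
```
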
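Select a column: df['A']
Slice rows: df[0:3], df['20130102':'20130104']
Select a row by label: df.loc[label]
Select by label on both axes: df.loc[:, ['a', 'b']] returns columns a and b for every row. Notably, : selects all labels, and 'a':'b' selects a contiguous range that includes both endpoints.
Get a scalar value: df.loc[date, 'A'], which amounts to giving the row label and the column label. An alternative syntax is df.at[date, 'A'].
Select a row by position: df.iloc[3]
Select multiple rows and columns by position, using slices or lists: df.iloc[3:5,0:2], df.iloc[[1,2,3],[0,2]]
Get a scalar value by position: df.iloc[1,1], or df.iat[1,1]
Select rows with a boolean condition: df[df.A > 0], df[df.A.isin(['a', 'b', 'c'])]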
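
The selection styles above (plain brackets, label-based, position-based, boolean) can be compared side by side. This is only a sketch: the frame, its dates, and the thresholds are assumed for illustration.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list("ABCD"))

col_a = df["A"]                          # one column, returned as a Series
first_three = df[0:3]                    # rows 0, 1, 2 (end excluded)
by_date = df["20130102":"20130104"]      # label slice, both ends included

row = df.loc[dates[0]]                   # one row by label
cols = df.loc[:, ["A", "B"]]             # all rows, two columns by label
value = df.at[dates[0], "A"]             # scalar by label

row_pos = df.iloc[3]                     # fourth row by position
block = df.iloc[3:5, 0:2]                # rows 3-4, columns 0-1
picked = df.iloc[[1, 2, 3], [0, 2]]      # explicit row and column positions
value_pos = df.iat[1, 1]                 # scalar by position

large = df[df["A"] > 8]                  # boolean mask on column A
members = df[df["A"].isin([0.0, 4.0])]   # rows whose A value is in the list
```

Note the asymmetry: positional slices exclude the end, while label slices include it.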
Add a new column: df['F'] = pd.Series([1,2,3,4,5,6], index=df.index)
Set a value: df.at[X, Y] = Z or df.iat[x, y] = z. The former works by label, the latter by position.
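
A short sketch of adding a column and setting single cells, assuming a small zero-filled frame with a date index:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.zeros((6, 2)), index=dates, columns=["A", "B"])

# The new Series is aligned on the index, so it should share df's labels.
df["F"] = pd.Series([1, 2, 3, 4, 5, 6], index=df.index)

df.at[dates[0], "A"] = 10.0   # set by label (row label, column label)
df.iat[1, 1] = 20.0           # set by position (row 1, column 1)
```
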
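np.nan represents missing data and is excluded from computations by default. reindex returns a new copy with the given labels: df.reindex(index=index, columns=columns)
Drop rows containing nan: df.dropna(how='any')
Fill nan: df.fillna(value=5)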
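
The missing-data calls above might be exercised like this. The frame and the extra column name E are assumptions made for the example.

```python
import pandas as pd

dates = pd.date_range("20130101", periods=4)
df = pd.DataFrame({"A": [1.0, 2.0, 3.0, 4.0]}, index=dates)

# reindex returns a copy; column E has no source data, so it starts as all NaN.
df1 = df.reindex(index=dates[0:3], columns=["A", "E"])
df1.loc[dates[0], "E"] = 1.0

dropped = df1.dropna(how="any")   # keep only rows with no NaN at all
filled = df1.fillna(value=5)      # replace every NaN with 5
mean_e = df1["E"].mean()          # NaN entries are skipped by default
```
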
Mean of each column: df.mean()
Mean of each row: df.mean(axis=1)
Apply a function to each column: df.apply(f)
Count occurrences of each value: series.value_counts()
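
To make the axis convention concrete, here is a small sketch with made-up numbers. The lambda passed to apply is just one illustrative choice of f.

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [10.0, 20.0, 30.0]})

col_means = df.mean()          # one mean per column (axis=0 is the default)
row_means = df.mean(axis=1)    # one mean per row

# apply passes each column to the function as a Series.
ranges = df.apply(lambda col: col.max() - col.min())

# value_counts tallies how often each distinct value occurs.
counts = pd.Series(["a", "b", "a", "c", "a"]).value_counts()
```
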
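Concatenate DataFrames: pd.concat([df1, df2, ...])
Join DataFrames: pd.merge(df1, df2, on='key')
Append a row: df.append(df.iloc[3], ignore_index=True). DataFrame.append was removed in pandas 2.0; on current versions use pd.concat([df, df.iloc[[3]]], ignore_index=True).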
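
A minimal sketch of concatenating, merging, and appending a row. The frames left and right and their columns are hypothetical. The row append uses pd.concat so that it also runs on pandas 2.x, where DataFrame.append no longer exists.

```python
import pandas as pd

left = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})
right = pd.DataFrame({"key": ["foo", "bar"], "rval": [4, 5]})

# Stack frames vertically and renumber the rows.
stacked = pd.concat([left, left], ignore_index=True)

# SQL-style join on the shared "key" column.
joined = pd.merge(left, right, on="key")

# Append row 1 of `left` to itself. The double brackets in iloc[[1]] keep the
# row as a one-row DataFrame, which pd.concat can stack directly.
appended = pd.concat([left, left.iloc[[1]]], ignore_index=True)
```
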
groupby generally involves three steps: split, apply, combine. For example, df.groupby('A').sum()
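
The split, apply, combine sequence can be seen on a tiny frame. The column names and values here are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar"],
    "C": [1, 2, 3, 4],
    "D": [10.0, 20.0, 30.0, 40.0],
})

# Split the rows by the value of A, sum each group, and combine the results
# into one frame indexed by the group keys.
totals = df.groupby("A").sum()

# The same pattern restricted to a single column, with a different aggregation.
means = df.groupby("A")["D"].mean()
```
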
Working with time series data: df.index can be a DatetimeIndex
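
As a rough sketch of what a DatetimeIndex provides, the example below builds a daily index and selects a window by date strings. The dates and the value column are made up.

```python
import pandas as pd

idx = pd.date_range("2013-01-01", periods=6, freq="D")
ts = pd.DataFrame({"value": range(6)}, index=idx)

is_datetime = isinstance(ts.index, pd.DatetimeIndex)

# Date strings work as labels; the slice includes both endpoints.
window = ts.loc["2013-01-02":"2013-01-04"]

# Calendar attributes are available directly on the index (Monday is 0).
weekdays = ts.index.dayofweek
```
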
Categoricals: df['grade'] = df['raw_grade'].astype('category')
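
A small sketch of the categorical conversion, assuming a frame with a raw_grade column of letter grades (the data is invented):

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "raw_grade": ["a", "b", "b", "a", "a", "e"],
})

# Convert the string column to a categorical dtype.
df["grade"] = df["raw_grade"].astype("category")

# The distinct values become the categories, stored once and sorted.
categories = list(df["grade"].cat.categories)

# Each row stores a small integer code that points into the categories.
codes = df["grade"].cat.codes.tolist()
```
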