Convert variable to multiple columns with Pandas

Computarch
1#
Computarch Published in 2018-02-13 23:45:26Z

I have a legacy datafile that contains data in the following format:

SURVEY  NUM TEMPORAL
WHS 1   Byz
WHS 1   Byz_Um
WHS 1   IAII
WHS 1   L_Isl
WHS 1   L_Rom
WHS 1   Mod
WHS 1   Nab
WHS 2   Byz
WHS 2   Mod
WHS 2   Unk
WHS 2   MP
WHS 3   Byz
WHS 3   Nab
WHS 3   LMP
WHS 3   UP
WHS 4   LMP
WHS 4   MP
WHS 4   UP
WHS 5   Byz
WHS 5   Unk
WHS 5   LMP

etc.

Essentially, the column "NUM" is a unique identifier for a specific site, and the column "TEMPORAL" is a value associated with that site. For whatever reason, the original file repeats this over several lines for sites with multiple temporal occupations (this is archaeological data). I would like to use Pandas to convert this to something like this:

SURVEY  NUM  Byz  Byz_Um  IAII  L_Isl  LMP  L_Rom  Nab  MP  Mod  Unk  UP
WHS     1    1    1       1     1      0    1      1    0   1    0    0
WHS     2    1    0       0     0      0    0      0    1   1    1    0
WHS     3    1    0       0     0      1    0      1    0   0    0    1
WHS     4    0    0       0     0      1    0      0    1   0    0    1
WHS     5    1    0       0     0      1    0      0    0   0    1    0

Here a 1 is placed in a column if that TEMPORAL period exists for the site, and a 0 otherwise. I tried using df.pivot with "NUM" as the index and "TEMPORAL" as the columns, but that did not work. There are several thousand sites in this database, so doing it manually is not a possibility. Any ideas?
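For anyone who wants to reproduce this, here is a minimal sketch of loading the sample rows above into a three-column DataFrame. It assumes the file is whitespace-separated; the `raw` string stands in for the real datafile, whose name is not given here.

```python
import io
import pandas as pd

# The sample rows from the question; `raw` is a stand-in for the real file.
raw = """SURVEY NUM TEMPORAL
WHS 1 Byz
WHS 1 Byz_Um
WHS 1 IAII
WHS 1 L_Isl
WHS 1 L_Rom
WHS 1 Mod
WHS 1 Nab
WHS 2 Byz
WHS 2 Mod
WHS 2 Unk
WHS 2 MP
WHS 3 Byz
WHS 3 Nab
WHS 3 LMP
WHS 3 UP
WHS 4 LMP
WHS 4 MP
WHS 4 UP
WHS 5 Byz
WHS 5 Unk
WHS 5 LMP
"""

# sep=r"\s+" splits on any run of whitespace (assumed delimiter),
# which yields the three columns SURVEY, NUM and TEMPORAL.
df = pd.read_csv(io.StringIO(raw), sep=r"\s+")

# Long format: one row per (site, period) pair.
print(df.head())
```

Each site appears once per period, so the reshaping problem is really a matter of counting (NUM, TEMPORAL) pairs.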

piRSquared
2#
piRSquared Reply to 2018-02-14 00:08:07Z

np.bincount and pd.factorize

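import numpy as np
import pandas as pd

# df has two columns here: 'SURVEY' (e.g. 'WHS 1') and 'NUM TEMPORAL' (e.g. 'Byz')
i, r = pd.factorize(df['SURVEY'])        # row codes and unique row labels
j, c = pd.factorize(df['NUM TEMPORAL'])  # column codes and unique column labels
n, m = len(r), len(c)

# Count each (row, column) pair via its flattened position i * m + j
b = np.bincount(i * m + j, minlength=n * m).reshape(n, m)

pd.DataFrame(b, r, c)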

       Byz  Byz_Um  IAII  L_Isl  L_Rom  Mod  Nab  Unk  MP  LMP  UP
WHS 1    1       1     1      1      1    1    1    0   0    0   0
WHS 2    1       0     0      0      0    1    0    1   1    0   0
WHS 3    1       0     0      0      0    0    1    0   0    1   1
WHS 4    0       0     0      0      0    0    0    0   1    1   1
WHS 5    1       0     0      0      0    0    0    1   0    1   0
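
The bincount line may look cryptic, so here is a small sketch of the indexing trick on its own. The labels below are made up purely for illustration (they are not from the question's file): each (row, column) pair is mapped to the single integer i * m + j, which is its position in an n x m grid flattened row by row.

```python
import numpy as np
import pandas as pd

# Hypothetical toy labels, chosen only to illustrate the trick.
rows = pd.Series(["a", "a", "b", "b", "b"])
cols = pd.Series(["x", "y", "x", "x", "z"])

i, r = pd.factorize(rows)  # integer code per row label, plus the unique labels
j, c = pd.factorize(cols)  # integer code per column label, plus the unique labels
n, m = len(r), len(c)

# Cell (i, j) of an n x m grid sits at position i * m + j once the grid
# is flattened row by row.
flat = i * m + j

# Count how often each flat position occurs, then fold back into n x m.
counts = np.bincount(flat, minlength=n * m).reshape(n, m)
table = pd.DataFrame(counts, index=r, columns=c)
print(table)
```

Note that the result holds counts rather than flags: the pair ("b", "x") appears twice in the toy data, so its cell is 2.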

pd.get_dummies and pd.DataFrame.dot

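# dtype=int keeps 0/1 integers so the dot product yields counts
# (recent pandas versions return booleans from get_dummies by default)
pd.get_dummies(df['SURVEY'], dtype=int).T.dot(pd.get_dummies(df['NUM TEMPORAL'], dtype=int))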

       Byz  Byz_Um  IAII  LMP  L_Isl  L_Rom  MP  Mod  Nab  UP  Unk
WHS 1    1       1     1    0      1      1   0    1    1   0    0
WHS 2    1       0     0    0      0      0   1    1    0   0    1
WHS 3    1       0     0    1      0      0   0    0    1   1    0
WHS 4    0       0     0    1      0      0   1    0    0   1    0
WHS 5    1       0     0    1      0      0   0    0    0   0    1
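
Why the dot product works, as a rough sketch on made-up data (the `site` and `period` column names are hypothetical): each get_dummies call gives one 0/1 indicator column per label, and multiplying the transposed site indicators by the period indicators sums, over all input rows, the product of the two indicators. That product is 1 only when a row has both that site and that period.

```python
import pandas as pd

# Hypothetical toy data: one row per (site, period) observation.
df = pd.DataFrame({
    "site":   ["a", "a", "b", "b"],
    "period": ["x", "y", "x", "z"],
})

# One indicator column per label; dtype=int asks for 0/1 integers.
S = pd.get_dummies(df["site"], dtype=int)    # shape: rows x sites
P = pd.get_dummies(df["period"], dtype=int)  # shape: rows x periods

# Entry (s, p) of S.T.dot(P) is the sum over rows k of S[k, s] * P[k, p],
# i.e. the number of rows that have site s together with period p.
table = S.T.dot(P)
print(table)
```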

pd.crosstab

pd.crosstab(*df.values.T)

col_0  Byz  Byz_Um  IAII  LMP  L_Isl  L_Rom  MP  Mod  Nab  UP  Unk
row_0                                                             
WHS 1    1       1     1    0      1      1   0    1    1   0    0
WHS 2    1       0     0    0      0      0   1    1    0   0    1
WHS 3    1       0     0    1      0      0   0    0    1   1    0
WHS 4    0       0     0    1      0      0   1    0    0   1    0
WHS 5    1       0     0    1      0      0   0    0    0   0    1

Or

pd.crosstab(df['SURVEY'], df['NUM TEMPORAL'])

NUM TEMPORAL  Byz  Byz_Um  IAII  LMP  L_Isl  L_Rom  MP  Mod  Nab  UP  Unk
SURVEY                                                                   
WHS 1           1       1     1    0      1      1   0    1    1   0    0
WHS 2           1       0     0    0      0      0   1    1    0   0    1
WHS 3           1       0     0    1      0      0   0    0    1   1    0
WHS 4           0       0     0    1      0      0   1    0    0   1    0
WHS 5           1       0     0    1      0      0   0    0    0   0    1

reconstruction

pd.Series(1, df.values.T.tolist()).unstack(fill_value=0)

       Byz  Byz_Um  IAII  LMP  L_Isl  L_Rom  MP  Mod  Nab  UP  Unk
WHS 1    1       1     1    0      1      1   0    1    1   0    0
WHS 2    1       0     0    0      0      0   1    1    0   0    1
WHS 3    1       0     0    1      0      0   0    0    1   1    0
WHS 4    0       0     0    1      0      0   1    0    0   1    0
WHS 5    1       0     0    1      0      0   0    0    0   0    1

Explanation

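  • df.values.T.tolist() produces a list of two lists, one per column. Passed as the index, they are interpreted as the two levels of a MultiIndex.
  • The scalar 1 is broadcast as the value for every entry of that index.
  • unstack moves the inner level to the columns, and fill_value=0 fills the combinations that never occur.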
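
The intermediate steps can be sketched on a tiny made-up frame (the `site` and `period` names are hypothetical). Here the MultiIndex is built explicitly with pd.MultiIndex.from_arrays, which is meant as the spelled-out form of what the one-liner above relies on pandas doing implicitly.

```python
import pandas as pd

# Hypothetical two-column frame, one row per (site, period) pair.
df = pd.DataFrame({
    "site":   ["a", "a", "b"],
    "period": ["x", "y", "x"],
})

# Transposing gives one list per column: [sites, periods].
levels = df.values.T.tolist()

# Build the two-level index explicitly from those two lists.
idx = pd.MultiIndex.from_arrays(levels)

# Every (site, period) pair gets the value 1.
s = pd.Series(1, index=idx)

# Move the inner level (period) to the columns; absent pairs become 0.
wide = s.unstack(fill_value=0)
print(wide)
```

One caveat, as far as I know: unstack raises an error if the same (site, period) pair occurs more than once, so this route assumes the pairs are unique.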

Wen
3#
Wen Reply to 2018-02-13 23:59:10Z

By using get_dummies

# One indicator column per period, then sum the rows that share a NUM (index level 1)
df.set_index(['SURVEY','NUM']).TEMPORAL.str.get_dummies().groupby(level=1).sum()
Out[127]: 
     Byz  Byz_Um  IAII  LMP  L_Isl  L_Rom  MP  Mod  Nab  UP  Unk
NUM                                                             
1      1       1     1    0      1      1   0    1    1   0    0
2      1       0     0    0      0      0   1    1    0   0    1
3      1       0     0    1      0      0   0    0    1   1    0
4      0       0     0    1      0      0   1    0    0   1    0
5      1       0     0    1      0      0   0    0    0   0    1

Computarch
4#
Computarch Reply to 2018-02-14 03:40:29Z

Thanks to piRSquared for leading me to the correct answer. The following code did exactly what I wanted:

df2 = pandas.crosstab(df['NUM'], [df['SURVEY'], df['TEMPORAL']])
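
A possible variation, sketched under assumptions: crosstab returns counts, so if the file ever repeats a (NUM, TEMPORAL) pair the cell would be greater than 1. The toy rows below use the three-column layout from the question, and the duplicated last row is hypothetical, added only to show the effect. Putting SURVEY and NUM in the index instead of the columns gives something closer to the layout requested in the question.

```python
import pandas as pd

# Toy rows in the question's layout; the repeated last row is hypothetical.
df = pd.DataFrame({
    "SURVEY":   ["WHS", "WHS", "WHS", "WHS", "WHS"],
    "NUM":      [1, 1, 2, 2, 2],
    "TEMPORAL": ["Byz", "Mod", "Byz", "Unk", "Unk"],
})

# Rows keyed by (SURVEY, NUM), one column per period, cells are counts.
counts = pd.crosstab([df["SURVEY"], df["NUM"]], df["TEMPORAL"])

# Turn counts into presence flags: any positive count becomes 1.
flags = counts.gt(0).astype(int)

# Move SURVEY and NUM back into ordinary columns.
flat = flags.reset_index()
print(flat)
```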