[Data analysis] Meta data

1 분 소요

Meta data

데이터를 Dictionary 형태로 가공하여 Data indexing 시 편리하게 사용하기 위한 데이터 활용 구조를 의미한다.
예시로, 공부를 진행했던 Kaggle의 ‘Porto Seguro’s Safe Driver Prediction’ Competition의 경우를 사용하겠다.
해당 Competition은 정보의 보호를 목적으로 (추정사항) 각 Feature의 이름을 공개해두지 않았다.
따라서, Dataset을 meta data의 형태로 바꾸기 위해서 아래와 같은 특징들을 중심으로 분류하였다.
Role (데이터의 역할), Level (데이터의 형태), Keep (데이터의 활용 가능성), dtype (데이터의 자료형)
코드는 아래와 같다.

  data = []
  for f in train.columns:
      # Defining the role
      if f == 'target':
          role = 'target'
      elif f == 'id':
          role = 'id'
      else:
          role = 'input' ## For using input feature data of model
          
      # Defining the level
      if 'bin' in f or f == 'target':
          level = 'binary' # bin=binary, target is 0 or 1
      elif 'cat' in f or f == 'id':
          level = 'nominal' # categorical type
      elif train[f].dtype == float:
          level = 'interval' 
      elif train[f].dtype == int:
          level = 'ordinal'
          
      # Initialize keep to True for all variables except for id
      keep = True
      if f == 'id':
          keep = False
      
      # Defining the data type 
      dtype = train[f].dtype
      
      # Creating a Dict that contains all the metadata for the variable
      f_dict = {
          'varname': f,
          'role': role,
          'level': level,
          'keep': keep,
          'dtype': dtype
      }
      data.append(f_dict)
      
  meta = pd.DataFrame(data, columns=['varname', 'role', 'level', 'keep', 'dtype'])
  meta.set_index('varname', inplace=True)

Meta data 활용

Interval variables에 대한 정보를 확인하고 싶을 경우

v = meta[(meta.level == 'interval') & (meta.keep)].index
train[v].describe()

Model의 Input으로 지정되어 있는 데이터에 관한 정보를 확인하고 싶을 경우
```
v = meta[(meta.keep)].index
train[v].describe()
```

Model의 Input으로 지정되어 있는 데이터 중 특정 Column을 Drop하고 싶을 경우

vars_to_drop = ['ps_car_03_cat', 'ps_car_05_cat']
train.drop(vars_to_drop, inplace=True, axis=1)
meta.loc[(vars_to_drop),'keep'] = False  # Updating the meta

Reference

https://www.kaggle.com/bertcarremans/data-preparation-exploration (Kernel)
https://www.kaggle.com/bertcarremans (Author information of kernel)

Twitter Facebook LinkedIn

[Data analysis] Meta data

Meta data

Meta data 활용

Reference

공유하기

댓글남기기

참고

[Concept summary] Contrastive learning

[Concept summary] Maximum likelihood estimation

[Linear algebra] 개념 정리

[Tip] Windows에서 Docker 가지고 놀기