Binning or Discretization

Real-world data tend to be noisy. Noisy data contain a large amount of additional, meaningless information, called noise. Data cleaning (or data cleansing) routines attempt to smooth out noise while identifying outliers in the data.

There are three common data smoothing techniques:

1. Binning : Binning methods smooth a sorted data value by consulting its “neighborhood”, that is, the values around it.

2. Regression : Regression smooths data by fitting the values to a function. Linear regression involves finding the “best” line to fit two attributes (or variables) so that one attribute can be used to predict the other.

3. Outlier analysis : Outliers may be detected by clustering, for example, where similar values are organized into groups, or “clusters”. Intuitively, values that fall outside the set of clusters may be considered outliers.
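The regression technique in item 2 can be sketched in a few lines. This is a minimal illustration using ordinary least squares for two attributes, not a definitive implementation; the function names `fit_line` and `smooth_by_regression` and the sample data are hypothetical and chosen only for this example.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def smooth_by_regression(xs, ys):
    """Replace each observed y with the value predicted by the fitted line."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]


# Hypothetical noisy measurements of two attributes.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]
slope, intercept = fit_line(xs, ys)
smoothed = smooth_by_regression(xs, ys)
print(slope, intercept)
print(smoothed)
```

The smoothed values all lie on the fitted line, so the noise in the second attribute is replaced by the trend predicted from the first.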

Binning method for data smoothing : In this method, the data are first sorted, and the sorted values are then distributed into a number of buckets, or bins. Because binning methods consult the neighborhood of values, they perform local smoothing.
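The sort-then-partition procedure can be sketched as follows. This is a minimal example, assuming bins that each hold the same number of values and smoothing by bin means; the helper names and the sample values are hypothetical.

```python
def equal_frequency_bins(values, bin_size):
    """Sort the values and split them into consecutive bins of bin_size items.

    The last bin may be shorter if len(values) is not a multiple of bin_size.
    """
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]


def smooth_by_bin_means(bins):
    """Replace every value in a bin with the mean of that bin."""
    smoothed = []
    for b in bins:
        mean = sum(b) / len(b)
        smoothed.append([mean] * len(b))
    return smoothed


# Hypothetical unsorted attribute values.
prices = [8, 21, 4, 15, 28, 21, 34, 24, 25]
bins = equal_frequency_bins(prices, 3)
print(bins)                       # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smooth_by_bin_means(bins))  # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
```

Each value is replaced using only the values in its own bin, which is why the smoothing is local.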

Binning can also be used as a discretization technique. Here, discretization refers to the process of converting or partitioning continuous attributes (also called features or variables) into discretized or nominal ones, typically a set of intervals.

For example, attribute values can be discretized by applying equal-width binning or equal-frequency binning, and then replacing each value in a bin by the bin mean or median, as in smoothing by bin means or smoothing by bin medians, respectively. Each continuous value is thereby converted to a nominal or discretized value, which is the same as the value of its corresponding bin.
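As an illustration of discretization, the sketch below applies equal-width binning, which splits the range between the minimum and maximum value into intervals of the same width, and then smooths by bin medians. The function names, the choice of three bins, and the sample values are assumptions made for this example only.

```python
from statistics import median


def discretize_equal_width(values, n_bins):
    """Return the 0-based bin index of each value under equal-width binning."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    if width == 0:
        # All values are identical, so everything falls into one bin.
        return [0] * len(values)
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - low) / width), n_bins - 1) for v in values]


def smooth_by_bin_medians(values, bin_indices):
    """Replace each value with the median of the values sharing its bin."""
    groups = {}
    for v, b in zip(values, bin_indices):
        groups.setdefault(b, []).append(v)
    medians = {b: median(members) for b, members in groups.items()}
    return [medians[b] for b in bin_indices]


# Hypothetical continuous attribute values.
values = [4, 8, 15, 21, 21, 24, 25, 28, 34]
indices = discretize_equal_width(values, 3)
print(indices)                                 # [0, 0, 1, 1, 1, 2, 2, 2, 2]
print(smooth_by_bin_medians(values, indices))  # [6.0, 6.0, 21, 21, 21, 26.5, 26.5, 26.5, 26.5]
```

With a range of 4 to 34 and three bins, the intervals are [4, 14), [14, 24), and [24, 34]. Unlike equal-frequency binning, the bins here hold different numbers of values (2, 3, and 4), and either the bin index or the bin median can serve as the discretized value.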
