Discretising a dataset - Solution

ronnie34 · February 1, 2018, 10:38pm

Hello, I want to write a function, say disc(), to discretise a dataset using equal-width binning. It should take a data frame “dataset” and the number of bins as arguments and return “dataset” with non-ordinal attributes categorized. Could anyone help me do this without the use of R libraries?
Doing it without an R library is the part I am finding difficult.

martin.R · February 2, 2018, 8:59am

Have you checked cut?
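
For reference, a minimal sketch of what cut() does for equal-width binning, using only base R. The vector `income` is made-up illustration data; passing a single number as `breaks` asks cut() to split the range of the data into that many equal-width intervals.

```r
# Made-up illustration data (not taken from the question)
income <- c(12, 18, 24, 33, 46, 53)

# A single number for `breaks` means "this many equal-width intervals"
binned <- cut(income, breaks = 4)

# The result is a factor with one level per interval
nlevels(binned)
table(binned)
```

The labels of the factor levels are the interval boundaries, so the binned column stays readable.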

ronnie34 · February 2, 2018, 9:05am

I have tried using cut, but I'm not having any luck with it, unfortunately.

tbradley · February 2, 2018, 6:59pm

Can you post an example of the code that you have tried so far, a small toy data set and your desired output from said data set? Check out reprex

ronnie34 · February 2, 2018, 8:58pm

I'm struggling to get started. An example of the dataset I'm trying to discretise is the following:

Income	Loan
12	Y
13	Y
14	Y
12	N
14	Y
16	Y
18	N
33	Y
22	N
24	N
46	N
53	N
24	N
19	N
25	N
32	Y
33	Y
37	N
21	N
25	Y

danr · February 2, 2018, 9:34pm

Is what is below the sort of thing you are trying to do?

When you ask a question it's good to have an overall prose description of what you want to do, as you have done, but you should also include both example input and output too.

You did provide input, but not in a form we can easily use to work on the question you are asking. You should include code that generates some input for us to work on. As it is, we have to build your table by hand.
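
As a sketch of what that could look like in base R, here is the table from the earlier post typed in as a data frame. The object name `loan` is just an assumption for illustration.

```r
# The Income/Loan table from the earlier post as a base R data frame
loan <- data.frame(
    Income = c(12, 13, 14, 12, 14, 16, 18, 33, 22, 24,
               46, 53, 24, 19, 25, 32, 33, 37, 21, 25),
    Loan   = c("Y", "Y", "Y", "N", "Y", "Y", "N", "Y", "N", "N",
               "N", "N", "N", "N", "N", "Y", "Y", "N", "N", "Y"),
    stringsAsFactors = FALSE
)

# dput() prints R code that recreates an object, ready to paste into a post
dput(head(loan, 3))
```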

Also take the time to learn about reprex, as @tbradley mentioned. Reprexes are not only useful for asking questions; they are also a quick way to try things out reproducibly as you develop your code.

Also, what do you mean by "not using libraries"? Do you mean not using packages, or not using packages that are part of the R download, or what?

And finally is this what you are trying to do?

ds <- tibble::tribble(
    ~income, ~loan,
    1, TRUE,
    2, FALSE,
    3, TRUE,
    4, TRUE,
    5, FALSE
)

uloan <- unique(ds$loan)

bins <- data.frame()

for(i in 1 : length(uloan)) {
    key <- as.character(uloan[[i]])
    count <- nrow(ds[ds$loan == key,])
    bins[key, "count"] <- count
}
bins
#>       count
#> TRUE      3
#> FALSE     2
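
For what it's worth, base R's table() gives the same counts without a loop. A small sketch, with the toy data rebuilt as a plain data frame so that no packages are needed:

```r
# Same toy data as a plain data frame
ds <- data.frame(
    income = 1:5,
    loan   = c(TRUE, FALSE, TRUE, TRUE, FALSE)
)

# table() counts how often each distinct value occurs
counts <- table(ds$loan)
counts[["TRUE"]]
counts[["FALSE"]]
```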

ronnie34 · February 2, 2018, 10:02pm

Sorry for any confusion I may have caused earlier.
Yes, that is kind of what I want; however, I want a function that can be applied to any dataset, with loan just being a dummy dataset.
Is there a way the above could be adjusted to make it into a function that is applicable to all datasets?

ronnie34 · February 2, 2018, 11:19pm

In response to the R libraries question, I meant not using any R libraries that require an additional download.

danr · February 2, 2018, 11:50pm

You can turn the example code into a function as below. Is that what you are looking for?

What we need is at least a signature of the function you are looking for, i.e. the function without any body, and some example input (as R script so we can build it) and the output data you expect. As it is, I'm just trying to guess what you are trying to do.

Everyone here wants to help you learn how to make R give you the results you want. Giving us as much info as possible, in a form we can easily use, just makes it easier for us to help and quicker for you to get an answer.

ds <- tibble::tribble(
    ~income, ~loan,
    1, TRUE,
    2, FALSE,
    3, TRUE,
    4, TRUE,
    5, FALSE
)


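discritize <- function(tbl, key_column)
{
    # Distinct values of the chosen column (works for data frames and tibbles)
    keys <- unique(tbl[[key_column]])

    bins <- data.frame()

    for (key in as.character(keys)) {
        # Count matching rows using key_column rather than a hard-coded "loan"
        count <- sum(as.character(tbl[[key_column]]) == key)
        bins[key, "count"] <- count
    }
    bins
}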

discritize(ds, "loan")
#>       count
#> TRUE      3
#> FALSE     2

ronnie34 · February 3, 2018, 8:40am

Thanks for that, but would the above function work for the loan dataset only?
I notice that in the first part of the R code ‘ds’ is set to read only the loan dataset and not any others.
Would this mean the user would need to change this manually every time they wanted to discretise a different dataset with the function you’ve given?

ronnie34 · February 3, 2018, 8:50am

Also, with this I'm still not getting the end result, which should be the initial dataset with the non-ordinal attributes categorized.
What I wanted was a function disc() to discretise a dataset using equal-width binning. It should be able to take a data frame "dataset" and the number of bins as arguments, and return "dataset" with non-ordinal attributes categorized.
So I want this function to take "dataset" and the number of bins as arguments, so that the user can change them.
The "Loan" dataset would be just one of many datasets that can be passed to the function.
I hope that this makes things clearer.

tbradley · February 3, 2018, 2:38pm

Can you please post the desired output your function would give you from either the data set you posted earlier or the toy one that @danr put together? It is much harder to answer your question without knowing exactly what you want.

ronnie34 · February 3, 2018, 5:30pm

The desired output for the dataset I posted above is the following:

Say for the dataset Loan, with 4 bins, I should be able to type

disc(loan, 4)
Income Loan
1 Y
2 (12,22.2] Y
3 (12,22.2] Y
4 N
5 (12,22.2] Y
6 (12,22.2] Y
7 (12,22.2] N
8 (32.5,42.8] Y
9 (12,22.2] N
10 (22.2,32.5] N
11 (42.8,53] N
12 (42.8,53] N
13 (22.2,32.5] N
14 (12,22.2] N
15 (22.2,32.5] N
16 (22.2,32.5] Y
17 (32.5,42.8] Y
18 (32.5,42.8] N
19 (12,22.2] N
20 (22.2,32.5] Y
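
The empty Income cells in rows 1 and 4 both belong to the minimum income, 12. One plausible explanation (an assumption, since the code that produced this table was not posted) is that intervals of the form (a,b] exclude their left endpoint, so the minimum falls outside the first bin and becomes NA. A small sketch of that behaviour in base R's cut(), and of the `include.lowest` argument that closes the first interval:

```r
income <- c(12, 13, 14, 12, 53)

# Four equal-width bins spanning exactly min..max
breaks <- seq(min(income), max(income), length.out = 5)

# Default: (a,b] intervals, so a value equal to the lowest break gets NA
open_left <- cut(income, breaks = breaks)

# include.lowest = TRUE turns the first interval into [a,b]
closed_left <- cut(income, breaks = breaks, include.lowest = TRUE)

is.na(open_left)
is.na(closed_left)
```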

I hope this helps. Thank you again for your consistent help with this; it’s very much appreciated.

danr · February 3, 2018, 7:31pm

Example 1 is not exactly what you want but it should be enough to get you started with making the function you want.

However, discretize() is a function in the arules package that appears to do exactly what you want. It's shown in Example 2. You shouldn't build your own statistics functions; there are just too many odds and ends to trip over. arules is a package built by statisticians.

If you feel you need to build your own discretize, then you should use arules to check that your function is generating the correct output.

Example 1

tbl <- tibble::tribble(
    ~Income,    ~Loan,
    12, T,
    13, T,
    14, T,
    12, F,
    14, T,
    16, T,
    18, F,
    33, T,
    22, F,
    24, F,
    46, F,
    53, F,
    24, F,
    19, F,
    25, F,
    32, T,
    33, T,
    37, F,
    21, F,
    25, T
)

disc <- function(tbl, bin_count) {
    r <- range(tbl$Income)
    bin_size <- (r[2] - r[1]) / bin_count
    range_starts <- vector("list", bin_count)
    for (i in 1:bin_count) {
        range_starts[[i]] <- r[1] + (i - 1) * bin_size
    }
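    for (i in 1:nrow(tbl)) {
        for (j in length(range_starts):1) {
            if (tbl[[i, "Income"]] >= range_starts[[j]]) {
                tbl[[i, "Income"]] <- j
                # Stop at the first (highest) matching bin so the stored bin
                # number is not compared against the remaining range starts
                break
            }
        }
    }
    # Convert the whole column to character so it can hold interval labels
    tbl$Income <- as.character(tbl$Income)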
    
    for (i in 1:nrow(tbl)) {
        start <- r[1] + (as.integer(tbl[[i, "Income"]]) - 1) * bin_size
        tbl[[i, "Income"]] <-
            paste("(", start, ",", start + bin_size, "]")
    }
    tbl
}

disc(tbl, 4)
#> # A tibble: 20 x 2
#>    Income           Loan 
#>    <chr>            <lgl>
#>  1 ( 12 , 22.25 ]   T    
#>  2 ( 12 , 22.25 ]   T    
#>  3 ( 12 , 22.25 ]   T    
#>  4 ( 12 , 22.25 ]   F    
#>  5 ( 12 , 22.25 ]   T    
#>  6 ( 12 , 22.25 ]   T    
#>  7 ( 12 , 22.25 ]   F    
#>  8 ( 32.5 , 42.75 ] T    
#>  9 ( 12 , 22.25 ]   F    
#> 10 ( 22.25 , 32.5 ] F    
#> 11 ( 42.75 , 53 ]   F    
#> 12 ( 42.75 , 53 ]   F    
#> 13 ( 22.25 , 32.5 ] F    
#> 14 ( 12 , 22.25 ]   F    
#> 15 ( 22.25 , 32.5 ] F    
#> 16 ( 22.25 , 32.5 ] T    
#> 17 ( 32.5 , 42.75 ] T    
#> 18 ( 32.5 , 42.75 ] F    
#> 19 ( 12 , 22.25 ]   F    
#> 20 ( 22.25 , 32.5 ] T
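
The loops in disc() work out, for each value, which equal-width interval it falls into. A sketch of the same arithmetic in vectorised form is below. The helper name `bin_index` is hypothetical, and like the loop it treats each bin as including its lower edge; it also assumes the values are not all identical, since the bin width would then be zero.

```r
# Hypothetical helper: equal-width bin number (1..bin_count) for each value
bin_index <- function(x, bin_count) {
    lo <- min(x)
    width <- (max(x) - lo) / bin_count
    # floor() gives a 0-based bin number; + 1 makes it 1-based
    idx <- floor((x - lo) / width) + 1
    # The maximum would land in bin_count + 1, so clamp it into the last bin
    pmin(idx, bin_count)
}

income <- c(12, 22, 24, 33, 46, 53)
bin_index(income, 4)
```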

Example 2

tbl <- tibble::tribble(
    ~Income,    ~Loan,
    12, T,
    13, T,
    14, T,
    12, F,
    14, T,
    16, T,
    18, F,
    33, T,
    22, F,
    24, F,
    46, F,
    53, F,
    24, F,
    19, F,
    25, F,
    32, T,
    33, T,
    37, F,
    21, F,
    25, T
)


intervals <-  tibble::as_tibble(arules::discretize(tbl$Income, categories = 4))
intervals
#> # A tibble: 20 x 1
#>    value      
#>    <fct>      
#>  1 [12.0,22.2)
#>  2 [12.0,22.2)
#>  3 [12.0,22.2)
#>  4 [12.0,22.2)
#>  5 [12.0,22.2)
#>  6 [12.0,22.2)
#>  7 [12.0,22.2)
#>  8 [32.5,42.8)
#>  9 [12.0,22.2)
#> 10 [22.2,32.5)
#> 11 [42.8,53.0]
#> 12 [42.8,53.0]
#> 13 [22.2,32.5)
#> 14 [12.0,22.2)
#> 15 [22.2,32.5)
#> 16 [22.2,32.5)
#> 17 [32.5,42.8)
#> 18 [32.5,42.8)
#> 19 [12.0,22.2)
#> 20 [22.2,32.5)

ronnie34 · February 3, 2018, 9:02pm

Example 1 is exactly what I wanted, but I'm still struggling to generalise this function so that it can be used with any dataset. Is there something that can be done to overcome this?
Thank you for your help above.

ronnie34 · February 3, 2018, 9:24pm

I tried making the function generic by doing the following:

disc <- function(tbl, bin_count) {
tbl <- tibble::tribble(loan)
  r <- range(tbl$colnames(dataset)[[1]])
  bin_size <- (r[2] - r[1]) / bin_count
  range_starts <- vector("list", bin_count)
  for (i in 1:bin_count) {
    range_starts[[i]] <- r[1] + (i - 1) * bin_size
  }
  for (i in 1:nrow(tbl)) {
    for (j in length(range_starts):1) {
      if (tbl[i, "colnames(dataset)[[1]]"] >= range_starts[[j]]) {
        tbl[[i, "colnames(dataset)[[1]]"]] <- j
      }
    }
  }
  tbl[1, "colnames(dataset)[[1]]"] <- as.character(tbl[1, "colnames(dataset)[[1]]"])
  
  for (i in 1:nrow(tbl)) {
    start <- r[1] + (as.integer(tbl[[i, "colnames(dataset)[[1]]"]]) - 1) * bin_size
    tbl[[i, "colnames(dataset)[[1]]"]] <-
      paste("(", start, ",", start + bin_size, "]")
  }
  tbl
}

However, I'm seeing that this isn't working.
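
Some likely reasons, judging from reading the attempt rather than running it: `tbl$colnames(dataset)[[1]]` looks for something called `colnames` inside `tbl` instead of calling colnames(), the quoted string "colnames(dataset)[[1]]" is taken as a literal column name, `dataset` is never defined inside the function, and the first line overwrites the `tbl` argument. A minimal sketch of looking up a column whose name is held in a variable, using a made-up data frame:

```r
# Made-up data frame for illustration
df <- data.frame(Income = c(12, 33, 53), Loan = c("Y", "N", "N"))

# Evaluate colnames() first and keep the result in a variable
col <- colnames(df)[[1]]

# [[ ]] accepts a variable that holds the column name
range(df[[col]])

# A quoted expression is just a string: no column has this literal name
is.null(df[["colnames(df)[[1]]"]])
```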

ronnie34 · February 3, 2018, 9:35pm

I’ve tried playing around with this but haven't got anywhere in the past 2 hours. Is there much that can be done with this?

ronnie34 · February 3, 2018, 10:22pm

I've also tried playing around with the tibble function, but I'm not getting anywhere with that either, unfortunately.
I'm struggling to make what you have created above generic, so that it can be applied to all datasets, if that makes sense?

ronnie34 · February 4, 2018, 7:07am

Although this gives the output I want, I'm still having trouble making this code generic so that it can be applied to any dataset. Currently it is tailored to the "loan" dataset, whereas I want a function that can be applied to any dataset.
It would be very much appreciated if someone could help me with this task, since I'm not getting anywhere with it myself despite spending 8 hours on it. Sorry for any inconvenience this may cause.
Ronnie

ronnie34 · February 4, 2018, 11:21am

I’ve tried again but I'm not getting anywhere, unfortunately. Would it be possible for you to generalise the function you gave in Example 1 so that it can be applied to any dataset, please?
I only ask since, despite my efforts, I’m not getting anywhere, and I believe this would be a good learning opportunity for me to see the outcome and learn from any mistakes made.
Thank you for your help in all this.
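
One possible way to generalise this with base R only is sketched below. It is not taken from any reply in the thread, and it rests on assumptions: that "non-ordinal attributes" means the numeric columns, that every numeric column has at least two distinct values, and that interval labels in cut()'s style are acceptable.

```r
# Sketch: equal-width binning of every numeric column, base R only
disc <- function(dataset, bins) {
    for (col in names(dataset)) {
        x <- dataset[[col]]
        # Only numeric columns are binned; other columns are left untouched
        if (is.numeric(x)) {
            breaks <- seq(min(x, na.rm = TRUE), max(x, na.rm = TRUE),
                          length.out = bins + 1)
            # include.lowest = TRUE keeps the minimum inside the first interval
            dataset[[col]] <- cut(x, breaks = breaks, include.lowest = TRUE)
        }
    }
    dataset
}

loan <- data.frame(
    Income = c(12, 13, 14, 12, 14, 16, 18, 33, 22, 24,
               46, 53, 24, 19, 25, 32, 33, 37, 21, 25),
    Loan   = c("Y", "Y", "Y", "N", "Y", "Y", "N", "Y", "N", "N",
               "N", "N", "N", "N", "N", "Y", "Y", "N", "N", "Y")
)

result <- disc(loan, 4)
table(result$Income)

# Any other data frame can be passed in the same way, e.g. a built-in one
head(disc(iris, 3))
```

Because the data frame and the number of bins are both arguments, nothing inside the function refers to the loan data; the column names are discovered with names(dataset) at call time.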
