Help to visualise data

kolaborek · July 30, 2021, 7:09pm

Hello,
I'm analysing leaked passwords and want to visualise how often uppercase letters occur versus password length.
I thought of a dot chart where the X axis is the password length and the Y axis shows the positions where uppercase letters occur.

So far I have extracted the data in PowerShell, and it looks like this:

Password_Length : Place_where_uppercase_letter_occurs
8 : 1 3 (in this example the uppercase letters were in the first and third positions of the password)
9 : 0 (in this example the password had no uppercase letters)
13 : 1 5 9 13
5 : 1
etc.

As I'm completely new to R, I don't know how I should import this data or into what type of structure (array, list, etc.).

Best regards, and thank you for any help

FJCC · July 30, 2021, 9:25pm

Here is one way to do it. I bet there are more elegant solutions. I put in a lot of print statements to help illustrate what the code does. All of the packages used are from the tidyverse. You will have a lot of studying to do to understand how all of this works!

#Read the data into a vector. Each line of the file is one element of the vector
DATA <- readLines("~/R/Play/Dummy.csv")
DATA
#> [1] "8 : 1 3"       "9 : 0"         "13 : 1 5 9 13" "5 : 1"
library(stringr)
#delete the :
DATA2 <- str_replace_all(DATA, pattern = ":", replacement = "")
DATA2
#> [1] "8  1 3"       "9  0"         "13  1 5 9 13" "5  1"

#extract the characters representing the lengths of the passwords
LENGTHS <- str_extract(DATA2, "^\\d+")
LENGTHS
#> [1] "8"  "9"  "13" "5"

#Delete the lengths and the spaces after them from the original data
POSITIONS <- str_replace(DATA2, "^\\d+\\s+", "")
POSITIONS
#> [1] "1 3"      "0"        "1 5 9 13" "1"

#Make a "list" of all the positions. Each element of the list is all
#of the positions in one password
POSITIONS <- str_split(POSITIONS," ")
POSITIONS
#> [[1]]
#> [1] "1" "3"
#> 
#> [[2]]
#> [1] "0"
#> 
#> [[3]]
#> [1] "1"  "5"  "9"  "13"
#> 
#> [[4]]
#> [1] "1"

#Iterate over the lengths and positions to make data frames showing
#all of the positions for each password labeled with its length
library(purrr)
#.x is the first argument (LENGTHS) and .y is the second (POSITIONS)
DFs <- map2(LENGTHS, POSITIONS, ~data.frame(LENG = .x, POS = .y))
DFs
#> [[1]]
#>   LENG POS
#> 1    8   1
#> 2    8   3
#> 
#> [[2]]
#>   LENG POS
#> 1    9   0
#> 
#> [[3]]
#>   LENG POS
#> 1   13   1
#> 2   13   5
#> 3   13   9
#> 4   13  13
#> 
#> [[4]]
#>   LENG POS
#> 1    5   1

#Combine the data frames
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
AllDat <- bind_rows(DFs)

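#convert all of the characters to numbers
#(everything() selects all columns explicitly; newer dplyr versions
#deprecate calling across() without a column selection)

AllDat <- AllDat %>% mutate(across(everything(), as.numeric))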
AllDat
#>   LENG POS
#> 1    8   1
#> 2    8   3
#> 3    9   0
#> 4   13   1
#> 5   13   5
#> 6   13   9
#> 7   13  13
#> 8    5   1
#Count the Length/Position combinations
SUMMARY <- AllDat %>% group_by(LENG, POS) %>% count()
SUMMARY
#> # A tibble: 8 x 3
#> # Groups:   LENG, POS [8]
#>    LENG   POS     n
#>   <dbl> <dbl> <int>
#> 1     5     1     1
#> 2     8     1     1
#> 3     8     3     1
#> 4     9     0     1
#> 5    13     1     1
#> 6    13     5     1
#> 7    13     9     1
#> 8    13    13     1

#Plot the result. All counts are one, so it is not very exciting.
library(ggplot2)
#> Warning: package 'ggplot2' was built under R version 4.0.5
ggplot(SUMMARY, aes(x = LENG, y = POS, fill = n)) + geom_tile()

Created on 2021-07-30 by the reprex package (v0.3.0)
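For comparison, the same reshaping can be sketched in base R without any packages. This is only an illustration: the object names `lines`, `len`, `pos` and `long` are made up for the example, and the input is typed in directly instead of being read from a file.

```r
# Sample lines in the "length : positions" format from the question
lines <- c("8 : 1 3", "9 : 0", "13 : 1 5 9 13", "5 : 1")

# Split each line at the colon into a length part and a positions part
parts <- strsplit(lines, ":", fixed = TRUE)
len <- as.numeric(trimws(vapply(parts, `[`, character(1), 1)))
pos <- strsplit(trimws(vapply(parts, `[`, character(1), 2)), "\\s+")

# Repeat each length once per position so the two columns line up
long <- data.frame(
  LENG = rep(len, times = lengths(pos)),
  POS  = as.numeric(unlist(pos))
)
long
```

The result has one row per length/position pair, the same shape as AllDat above, so it can be counted and plotted in the same way.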

kolaborek · August 4, 2021, 4:47pm

Thank you very much for your time and work, I really appreciate it!

I changed my PowerShell code to produce output in the form you showed, then imported it into a data.frame and made a plot from it using ggplot.

X is Position of uppercase letter
Y is Password Length

Below is the code I use to make the plot (titles are in Polish). It was made from over 100,000 passwords.

ggplot(data=ramka_danych, aes(x=pozycja, y=dlugosc)) + geom_tile(alpha = 0.05) + 
     labs(title = "Pozycje dużych liter w hasłach", x = "pozycja dużej litery", y = "długosc hasła") + 
     scale_x_continuous(limits = c(-1, 40), breaks = seq(0, 40, by = 1)) + 
     scale_y_continuous(limits = c(-1, 40), breaks = seq(0, 40, by = 1)) +
     theme(panel.grid.minor = element_line (color = "lightgrey", linetype = 3), panel.grid.major = element_line (color = "grey", linetype = 0))

Thank you once again. It was a really big lesson for me :)

PS. To visualise intensity (like a heatmap) I use "alpha = 0.05", but maybe there is something better in R for this task?

FJCC · August 4, 2021, 6:16pm

Your method of using alpha = 0.05 and overplotting to show the frequency of each data point seems to work, but I expect that the signal saturates at some point. Once the tile looks black, adding more tiles cannot make it look blacker.
If you calculate how many times each combination of dlugosc and pozycja occurs, you can use that number to color the tiles. ggplot will ensure that the color scale covers the whole range of n. The following plot might give a different impression of the data than the plot you made.

library(dplyr)
NewDF <- ramka_danych %>%  group_by(pozycja, dlugosc) %>% count()

ggplot(data=NewDF, aes(x=pozycja, y=dlugosc, fill = n)) + geom_tile() + 
     labs(title = "Pozycje dużych liter w hasłach", x = "pozycja dużej litery", y = "długosc hasła") + 
     scale_x_continuous(limits = c(-1, 40), breaks = seq(0, 40, by = 1)) + 
     scale_y_continuous(limits = c(-1, 40), breaks = seq(0, 40, by = 1)) +
     theme(panel.grid.minor = element_line (color = "lightgrey", linetype = 3), panel.grid.major = element_line (color = "grey", linetype = 0))
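The saturation mentioned above can be estimated with a little arithmetic. This is a rough sketch that assumes the graphics device uses ordinary alpha blending, where each extra layer with opacity alpha hides that fraction of whatever is still visible beneath it, so n stacked tiles have a combined opacity of 1 - (1 - alpha)^n. The names `alpha`, `n_tiles` and `opacity` are made up for the example.

```r
# Opacity of a single tile, as in geom_tile(alpha = 0.05)
alpha <- 0.05

# Number of tiles stacked on the same length/position combination
n_tiles <- c(1, 10, 50, 100, 500)

# Combined opacity under the blending assumption stated above
opacity <- 1 - (1 - alpha)^n_tiles

data.frame(n_tiles, opacity = round(opacity, 3))
```

Under that assumption the combined opacity is already above 0.99 at 100 overlapping tiles, so a combination that occurs 100 times and one that occurs 10,000 times would look practically the same.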

kolaborek · August 11, 2021, 6:55pm

OK. I tried the "fill" option, but could not find good values to show the relationship between "dlugosc" and "pozycja".

I think I'll finish with code like this:

library(ggplot2)
library(ggpointdensity)
library(viridis)
ggplot(data=ramka_danych, aes(x=pozycjaDuzejLitery, y=dlugoscHasla)) + geom_pointdensity (alpha = 0.02, size = 4, shape = 15) + 
  labs(title = "Pozycje dużych liter w hasłach", x = "pozycja dużej litery", y = "długosc hasła", color = "Liczba wystapien \nduzych liter") + 
  scale_x_continuous(limits = c(-1, 40), breaks = seq(0, 40, by = 1)) + 
  scale_y_continuous(limits = c(-1, 40), breaks = seq(0, 40, by = 1)) +
  theme(panel.grid.minor = element_line (color = "lightgrey", linetype = 3), panel.grid.major = element_line (color = "grey", linetype = 0)) + scale_color_viridis(option = "inferno")

It gives me a result like the screenshot below:

@FJCC thank you very much for your help
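On the difficulty with the fill option mentioned above: when a few combinations dominate the counts, a linear fill scale can leave most tiles nearly the same color. One possible workaround is to fill by the logarithm of the count. The sketch below uses simulated data as a stand-in for ramka_danych, since the real data are not shown in this thread; the column `log_n` is made up for the example, and the plot is only drawn if ggplot2 is installed.

```r
# Simulated stand-in for ramka_danych (NOT the real password data):
# most uppercase letters at position 1, the rest spread over the password
set.seed(1)
dlugosc <- sample(6:12, 600, replace = TRUE)
pozycja <- ifelse(runif(600) < 0.8, 1,
                  pmin(dlugosc, sample(2:12, 600, replace = TRUE)))
ramka_danych <- data.frame(pozycja = pozycja, dlugosc = dlugosc)

# Count each pozycja/dlugosc combination with base R
counts <- aggregate(n ~ pozycja + dlugosc,
                    data = transform(ramka_danych, n = 1L),
                    FUN = sum)

# Log-transform the counts so small values remain distinguishable
counts$log_n <- log10(counts$n)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(counts, aes(x = pozycja, y = dlugosc, fill = log_n)) +
    geom_tile() +
    labs(fill = "log10(n)")
  print(p)
}
```

The legend then reads in powers of ten, which may or may not suit the audience; whether this looks better than the point density plot depends on the real data.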
