Hello,
I'm analysing leaked passwords and want to visualise how often uppercase letters occur at each position, relative to password length.
I was thinking of a dot chart where the X axis is the password length and the Y axis shows the positions where uppercase letters occur.
So far I have extracted the data in PowerShell, and it looks like this:
Password_Length : Place_where_uppercase_letter_occurs
8 : 1 3 (in this example the uppercase letters were at the first and third positions in the password)
9 : 0 (in this example the password had no uppercase letters)
13 : 1 5 9 13
5 : 1
etc.
As I'm completely new to R, I don't know how I should import this data, or into what type of structure (array, list, etc.).
Here is one way to do it. I bet there are more elegant solutions. I put in a lot of print statements to help illustrate what the code does. All of the packages used are from the tidyverse. Expect to do a fair amount of studying to understand how all of this works!
#Read the data into a vector. Each line of the file is one element of the vector
DATA <- readLines("~/R/Play/Dummy.csv")
DATA
#> [1] "8 : 1 3" "9 : 0" "13 : 1 5 9 13" "5 : 1"
library(stringr)
#delete the :
DATA2 <- str_replace_all(DATA, pattern = ":", replacement = "")
DATA2
#> [1] "8 1 3" "9 0" "13 1 5 9 13" "5 1"
#extract the characters representing the lengths of the passwords
LENGTHS <- str_extract(DATA2, "^\\d+")
LENGTHS
#> [1] "8" "9" "13" "5"
#Delete the lengths and the spaces after them from the original data
POSITIONS <- str_replace(DATA2, "^\\d+\\s+", "")
POSITIONS
#> [1] "1 3" "0" "1 5 9 13" "1"
#Make a "list" of all the positions. Each element of the list is all
#of the positions in one password
POSITIONS <- str_split(POSITIONS," ")
POSITIONS
#> [[1]]
#> [1] "1" "3"
#>
#> [[2]]
#> [1] "0"
#>
#> [[3]]
#> [1] "1" "5" "9" "13"
#>
#> [[4]]
#> [1] "1"
#Iterate over the lengths and positions to make data frames showing
#all of the positions for each password, labeled with its length
library(purrr)
#.x is the first argument (LENGTHS) and .y is the second (POSITIONS)
DFs <- map2(LENGTHS, POSITIONS, ~data.frame(LENG = .x, POS = .y))
DFs
#> [[1]]
#> LENG POS
#> 1 8 1
#> 2 8 3
#>
#> [[2]]
#> LENG POS
#> 1 9 0
#>
#> [[3]]
#> LENG POS
#> 1 13 1
#> 2 13 5
#> 3 13 9
#> 4 13 13
#>
#> [[4]]
#> LENG POS
#> 1 5 1
#Combine the data frames
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
AllDat <- bind_rows(DFs)
#convert all of the characters to numbers
AllDat <- AllDat %>% mutate(across(everything(), as.numeric))
AllDat
#> LENG POS
#> 1 8 1
#> 2 8 3
#> 3 9 0
#> 4 13 1
#> 5 13 5
#> 6 13 9
#> 7 13 13
#> 8 5 1
#Count the Length/Position combinations
SUMMARY <- AllDat %>% group_by(LENG, POS) %>% count()
SUMMARY
#> # A tibble: 8 x 3
#> # Groups: LENG, POS [8]
#> LENG POS n
#> <dbl> <dbl> <int>
#> 1 5 1 1
#> 2 8 1 1
#> 3 8 3 1
#> 4 9 0 1
#> 5 13 1 1
#> 6 13 5 1
#> 7 13 9 1
#> 8 13 13 1
#Plot the result. All counts are one, so it is not very exciting.
library(ggplot2)
ggplot(SUMMARY, aes(x = LENG, y = POS, fill = n)) + geom_tile()
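For comparison, here is a minimal base R sketch of the same reshaping, without the tidyverse. It is not the approach used above, just an alternative under a few assumptions: the sample lines are typed in directly instead of being read with readLines(), and the names pw_length, pos_list, all_dat, with_upper and counts are made up for illustration. It also drops position 0, on the assumption (taken from the question) that 0 means the password had no uppercase letters.

```r
# Sample lines in the same "length : positions" format as the question
raw <- c("8 : 1 3", "9 : 0", "13 : 1 5 9 13", "5 : 1")

# Split each line at the colon into the length part and the positions part
parts <- strsplit(raw, "[[:space:]]*:[[:space:]]*")
pw_length <- as.integer(vapply(parts, function(p) p[1], character(1)))
pos_list <- lapply(parts, function(p) as.integer(strsplit(p[2], "[[:space:]]+")[[1]]))

# Build one row per (length, position) pair
all_dat <- data.frame(
  LENG = rep(pw_length, times = lengths(pos_list)),
  POS = unlist(pos_list)
)

# Assumption: position 0 marks "no uppercase letters", so leave it out
with_upper <- all_dat[all_dat$POS > 0, ]

# Count how often each length/position combination occurs
counts <- aggregate(n ~ LENG + POS, data = transform(with_upper, n = 1L), FUN = sum)
counts
```

Whether to drop the 0 rows depends on the question being asked: keeping them shows how many passwords have no uppercase letters at all, while dropping them keeps the plot limited to real positions.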
Your method of using alpha = 0.05 and overplotting to show the frequency of each data point seems to work, but I expect that the signal saturates at some point. Once a tile looks black, adding more tiles cannot make it look any blacker.
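To get a rough feel for where the saturation sets in, here is a small calculation. It assumes the usual "over" alpha compositing, where k identical layers with transparency alpha stack to a combined opacity of 1 - (1 - alpha)^k. How a particular graphics device blends layers may differ in detail, so treat the numbers as an approximation.

```r
# Combined opacity of k stacked tiles, each drawn with alpha = 0.05
# (assumes standard "over" compositing of identical layers)
alpha <- 0.05
k <- c(1, 10, 50, 100)
opacity <- 1 - (1 - alpha)^k
round(opacity, 3)
```

Under that assumption, 50 overlapping tiles are already about 92% opaque and 100 tiles are above 99%, so combinations that occur hundreds or thousands of times would all look the same.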
If you calculate how many times each combination of dlugosc and pozycja occurs, you can use that number to color the tiles. ggplot will ensure that the color scale covers the whole range of n. The following plot might give a different impression of the data than the plot you made.
library(dplyr)
library(ggplot2)
#Count how often each position/length combination occurs
NewDF <- ramka_danych %>% group_by(pozycja, dlugosc) %>% count()
ggplot(data = NewDF, aes(x = pozycja, y = dlugosc, fill = n)) + geom_tile() +
  labs(title = "Positions of uppercase letters in passwords", x = "position of uppercase letter", y = "password length") +
  scale_x_continuous(limits = c(-1, 40), breaks = seq(0, 40, by = 1)) +
  scale_y_continuous(limits = c(-1, 40), breaks = seq(0, 40, by = 1)) +
  theme(panel.grid.minor = element_line(color = "lightgrey", linetype = 3),
        panel.grid.major = element_line(color = "grey", linetype = 0))
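If you want to try the counting step without the real data, here is a small sketch with a made-up stand-in for ramka_danych. The values are invented, and I am assuming one row per uppercase letter with the columns dlugosc (password length) and pozycja (position). The share column is my own addition, not something from your code: it divides each count by the total for that password length, which may help if some lengths are far more common than others and dominate the color scale.

```r
library(dplyr)

# Hypothetical stand-in for ramka_danych: one row per uppercase letter found
ramka_danych <- data.frame(
  dlugosc = c(8, 8, 8, 8, 10, 10, 10, 12),
  pozycja = c(1, 1, 1, 3, 1, 1, 10, 1)
)

# count() with column names is shorthand for group_by() followed by count()
NewDF <- ramka_danych %>% count(pozycja, dlugosc)

# Optional: share of each position within its password length
NewDF <- NewDF %>%
  group_by(dlugosc) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()

NewDF
```

To plot the shares instead of the raw counts, use fill = share in place of fill = n in the ggplot call.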