ggplot: Beginner question about string data.

ggplot2
graphics
#1

Hi,

First, I'd like to apologize for my bad English! :neutral_face:

I'm new to RStudio and I'm trying to do something that looks hard.

I have a huge amount of data. My data is organized like this:

image

What I want to see, for this example, is 3 curves (one per game) with
X = State.
Y = Type.

This is because I need to see the order of the types:
Game 1: [State 1] A > B [State 2] A > C [State 3] C
Game 2: [State 1] A > B > C > C [State 2] C [State 3] C
Game 3: [State 1] B [State 2] B > A [State 3] A

I know that it's hard because they are only strings. It's possible for me to use the time as the X axis, but I still have a problem with the Y axis using types. (And to be honest, I don't care about time, since I only want to know the order of the types.)

Using ggplot2, the only thing that gives me any kind of graphical representation is geom_point(), which is not really what I want :roll_eyes:

The thing is, I know that it could probably work if I change my data like this:
Type A = 1
Type B = 2
Type C = 3

State 1 = 1
State 2 = 2
State 3 = 3

Or by using Time instead of State, because the strings would then be numbers. But I think there should be a way to keep strings on at least one axis (since the same values repeat).

I also don't know how to separate the games in my data. As you can see, there are 3 games in my example, but in reality I have thousands of games. I suppose there is a way (probably using a function?) to create the data game by game.

I hope someone can help me find the solution :pray:

Stalki.

0 Likes

#2

Could you please turn this into a self-contained reprex (short for reproducible example)? It will help us help you if we can be sure we're all working with/looking at the same stuff.

install.packages("reprex")

If you've never heard of a reprex before, you might want to start by reading the tidyverse.org help page. The reprex dos and don'ts are also useful.

There's also a nice FAQ on how to do a minimal reprex for beginners, below:

What to do if you run into clipboard problems

If you run into problems with access to your clipboard, you can specify an outfile for the reprex, and then copy and paste the contents into the forum.

reprex::reprex(input = "fruits_stringdist.R", outfile = "fruits_stringdist.md")

For pointers specific to the community site, check out the reprex FAQ.

1 Like

#3

Hi stalki

Welcome to the RStudio Community! Below you will find a reprex; it's just a starting point and one idea for a representation. If you would like to display any of your games by selecting it, a Shiny app would be helpful.

library(tidyverse)

#create reprex data based on data example
games <- tibble(Game=c(rep(1, times = 5), rep(2, times = 7)),
                Type=c('Type A','Type B','Type A','Type C','Type C',
                       'Type A','Type B','Type C','Type C','Type C','Type C','Type B'),
                State=c('State 1','State 1','State 2','State 2','State 3',
                        'State 1','State 1','State 1','State 1','State 2','State 3','State 3'),
                Time=c(1:5,1:7))

#prepare data for visualization
prep_games <- games %>% 
  mutate(Type=factor(Type, levels = c('Type A','Type B','Type C'))) %>% 
  mutate(State=factor(State, levels = c('State 1','State 2','State 3')))

filter(prep_games, Game==1) %>% 
  ggplot(aes(x = Time, y = Type)) + 
  geom_line(aes(group = State, color=State)) +
  geom_point(aes(group = State, color=State)) +
  theme_bw()


filter(prep_games, Game==2) %>% 
  ggplot(aes(x = Time, y = Type)) + 
  geom_line(aes(group = State, color=State)) +
  geom_point(aes(group = State, color=State)) +
  theme_bw()

Created on 2019-03-21 by the reprex package (v0.2.1)
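The two plotting calls above differ only in the game number, so they could be wrapped in a small helper. What follows is only a sketch and not part of the original reprex: `plot_game()` is a hypothetical name, the factor levels are assumed to be the three types and states from the example, and the data is re-created as a plain data frame so the block runs on its own.

```r
library(dplyr)
library(ggplot2)

# Same example data as above, re-created so this block is self-contained
games <- data.frame(
  Game  = c(rep(1, times = 5), rep(2, times = 7)),
  Type  = c("Type A", "Type B", "Type A", "Type C", "Type C",
            "Type A", "Type B", "Type C", "Type C", "Type C", "Type C", "Type B"),
  State = c("State 1", "State 1", "State 2", "State 2", "State 3",
            "State 1", "State 1", "State 1", "State 1", "State 2", "State 3", "State 3"),
  Time  = c(1:5, 1:7),
  stringsAsFactors = FALSE
)

# Hypothetical helper: draw one game, keeping Type and State as ordered factors
plot_game <- function(data, game_id) {
  data %>%
    filter(Game == game_id) %>%
    mutate(Type  = factor(Type,  levels = c("Type A", "Type B", "Type C")),
           State = factor(State, levels = c("State 1", "State 2", "State 3"))) %>%
    ggplot(aes(x = Time, y = Type, group = State, colour = State)) +
    geom_line() +
    geom_point() +
    theme_bw()
}

# One plot per game, collected in a list
plots <- lapply(sort(unique(games$Game)), function(g) plot_game(games, g))
```

Printing `plots[[1]]` should draw the same kind of picture as the first call above. With thousands of games this still produces thousands of plots, so it is mainly useful for spot-checking individual games.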

2 Likes

#4

Hi stalki,

Welcome to the community.

@mara has already suggested that you present your problem as a reprex, and maybe this will help you with that if not with your problem itself:

library(tidyverse)
# typed in, so didn't use exactly what was supplied in terms of naming etc
stalki <- tibble(id=1:16,
                 game = c(1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3),
                 type = c("A","B","A","C","C","A","B","C","C","C","C","B","B","B","A","A"),
                 state = c("s1","s1","s2","s2","s3","s1","s1","s1","s1","s2","s3","s3","s1","s2","s2","s3"),
                 time = c(1,2,3,4,5,1,2,3,4,5,6,7,1,2,3,4))
# My first impression is that what follows is what you might have been after,
# although it sounds as if you have too many games to consider, but
# also it doesn't show you when things have remained the same.
ggplot(stalki, aes(y = type, x = state, group = game, colour = factor(game))) + geom_path()

# ...so I wondered if something like this might be more informative, using shape to indicate type
ggplot(stalki, aes(y = state, x = time, group = game, shape = type, colour = factor(game))) + geom_path() + geom_point()

Created on 2019-03-21 by the reprex package (v0.2.1)

Are you really going to look at 1000s of individual graphs? And if so, to what end? Will you make the same decision the first time you observe a certain pattern as you will the 50th? Maybe thinking about how to summarise into manageable chunks might be useful (then it can be approached with filtering, nesting, ...).

Ron.

3 Likes

#5

Thank you all for answering!

Mara: I've tried to create a reprex, but I don't know why RStudio crashes every time I do it :thinking: I think I'm doing it wrong, but thanks adam83 & ron!

I'm glad to see that this is going to be possible!

The 2nd solution from @ron is really close to what I want. I think I'll be able to do something with that!
The only thing is that I'd like to have states instead of time on the X axis (as in your first plot), and types instead of states on the Y axis. But I think this is impossible. Since I want to track the order of types, I need to keep time, right?

The problem is, I really don't care about time, I just want the order. So I think I've found a solution: I'll just simulate a time that depends on the number of types in each state.

(Please tell me if this is stupid)

Let's say I want 1 state to represent 100 sec.

Example: in game 1, there are 5 types.
2 types in State 1: Type A.time = 50 sec / Type B.time = 100 sec.
2 types in State 2: Type A.time = 150 sec / Type C.time = 200 sec.
1 type in State 3: Type C.time = 300 sec.

So I'll have a simulated time for each type:
50
100
150
200
300

And I can say that 100 sec = state 1, 200 sec = state 2 and 300 sec = state 3. I think it'll work for every game.

I'm doing this because using the real time would cause a lot of problems: one person can take 5 sec to do something that someone else does in 100 sec, and this would distort the graphical representation :frowning:

To answer you, @ron: the purpose of this is to develop a little algorithm that finds all the different patterns and shows which are the most used (horrible English, but I'm sure you got it! :slight_smile:)

I want a graphical representation because I need a general idea of the patterns, so I can start to form hypotheses... And of course in my real data there are more than 3 types too!

Thanks again, you are really helpful!

(I can only mention 1 person since I'm new :roll_eyes:)

2 Likes

#6

Hi @stalki,

I believe you do need to keep time in there to track the changes between state and type when plotting, as otherwise the lines may overwrite and not be fully informative as in my first attempt.

I didn't really understand your suggestion with the times, so I've made something up which may or may not be helpful. You could summarise each game as a single string representing state and type (and time if that makes sense), then summarise your multiple games based on that:

library(tidyverse)
# typed in, so didn't use exactly what was supplied in terms of naming etc
stalki <- tibble(id=1:16,
                 game = c(1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3),
                 type = c("A","B","A","C","C","A","B","C","C","C","C","B","B","B","A","A"),
                 state = c("s1","s1","s2","s2","s3","s1","s1","s1","s1","s2","s3","s3","s1","s2","s2","s3"),
                 time = c(1,2,3,4,5,1,2,3,4,5,6,7,1,2,3,4))
stalki_seq <- stalki %>%
              group_by(game) %>%
              summarise(seq = paste(paste0(time, 't', type, state), collapse = '>'))
stalki_seq
#> # A tibble: 3 x 2
#>    game seq                                      
#>   <dbl> <chr>                                    
#> 1     1 1tAs1>2tBs1>3tAs2>4tCs2>5tCs3            
#> 2     2 1tAs1>2tBs1>3tCs1>4tCs1>5tCs2>6tCs3>7tBs3
#> 3     3 1tBs1>2tBs2>3tAs2>4tAs3

Created on 2019-03-22 by the reprex package (v0.2.1)

You may not be interested in the time, or it may be too 'variable' (cut it into groups?, round it?) for this purpose, or maybe time 1 isn't the same thing in each game. Or you may want to reduce it to times when either state or type actually changes from the previous time. But you could get to a point where you have a single string representing each game and you could then summarise all your games easily, eg:

stalki_seq %>% group_by(seq) %>% summarise(n = n())

And then you could do graphical representations of each observed game pattern, maybe with some summary info such as its frequency, rather than each individual game. Or groups of game patterns, if that would make sense and not be too busy, with colour or size representing frequency.
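As a rough sketch of the "reduce it to times when either state or type actually changes" idea combined with the pattern count, something like the following might work. It assumes the same example data as above (re-created here as a plain data frame), drops time from the string entirely, and uses the made-up names `pattern_counts` and `n_games`.

```r
library(dplyr)

# Same example games as in the reprex above
stalki <- data.frame(
  game  = c(1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3),
  type  = c("A","B","A","C","C","A","B","C","C","C","C","B","B","B","A","A"),
  state = c("s1","s1","s2","s2","s3","s1","s1","s1","s1","s2","s3","s3","s1","s2","s2","s3"),
  time  = c(1,2,3,4,5,1,2,3,4,5,6,7,1,2,3,4),
  stringsAsFactors = FALSE
)

pattern_counts <- stalki %>%
  arrange(game, time) %>%
  group_by(game) %>%
  # keep the first row of each game, plus rows where type or state changed
  filter(row_number() == 1 | type != lag(type) | state != lag(state)) %>%
  # one string per game, without the time component
  summarise(seq = paste(paste0(type, state), collapse = ">")) %>%
  # number of games sharing each pattern, most frequent first
  count(seq, sort = TRUE, name = "n_games")

pattern_counts
```

With the three example games every pattern is still unique, but the repeated `C`/`s1` row in game 2 is dropped, so games that differ only in how long they stay put would collapse onto the same string.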

@mara, @adam83, what do you reckon?

1 Like

#7

Hey @ron,

I could never have imagined that this was possible with RStudio! It seems like a really great idea.
The only problem is that the patterns won't be exactly the same from one game to another, but parts of the patterns will be the same.
Is it possible, with your approach, to "exploit" only parts of a sequence?

I don't know if that's clear enough...

Anyway, I've used your solution from yesterday, and with my idea of simulated time it works pretty well! It'll be easier to understand with this screenshot (screenshots > my English :roll_eyes:):

image

For all games, state 1 is at 100 sec, state 2 at 200 sec and state 3 at 300 sec.
And I have something like this, without colors:

[Image 2 in next post] I can only post 1 image per post because I'm new.

But now I think I understand you better, @ron. With many more games we won't be able to read anything. I suppose this is because it's too "static": it's only straight lines from point A to point B, not curves.
What I wanted to see is something like this:

[Image 3 in next post]

It gets darker where a lot of things overlap.

But wait, while writing my answer, I've just tried something that looks fine:

[Image 4 in next post]

This is exactly the kind of thing that I want. As I said, I'm really new to RStudio; I just tried geom_line() + geom_smooth() and I got this (using your approach):

ggplot(DataExample2, aes(y = Type, x = Time, group = Game)) + geom_line() + geom_smooth()

Now you have a better idea of the "direction" I want to take!

Thank you so much for your help,

Stalki.

0 Likes

#8

[Image 2]

0 Likes

#9

[Image 3]

0 Likes

#10

[Image 4]

(sorry moderators for this flood :pray:)

0 Likes

#11

If you can identify parts of a sequence you are interested in, say changes from typeA state1 to typeB state1, you could create a regular expression to identify that and use one of the grep family of functions to filter or mark the games that corresponded to that.
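A minimal sketch of that idea in base R, assuming the sequence strings have the `time`, `t`, `type`, `state` layout from the earlier reply (they are typed in here so the block runs on its own). The pattern and the names `a1_to_b1`, `has_a1_to_b1` and `matching_games` are only illustrative.

```r
# Sequence strings in the format built earlier: <time>t<type><state>, joined by ">"
stalki_seq <- data.frame(
  game = 1:3,
  seq  = c("1tAs1>2tBs1>3tAs2>4tCs2>5tCs3",
           "1tAs1>2tBs1>3tCs1>4tCs1>5tCs2>6tCs3>7tBs3",
           "1tBs1>2tBs2>3tAs2>4tAs3"),
  stringsAsFactors = FALSE
)

# "type A in state 1, immediately followed by type B in state 1";
# [0-9]+ skips over the time prefix of the second step
a1_to_b1 <- "tAs1>[0-9]+tBs1"

# Mark the games whose sequence contains that move, then list them
stalki_seq$has_a1_to_b1 <- grepl(a1_to_b1, stalki_seq$seq)
matching_games <- stalki_seq$game[stalki_seq$has_a1_to_b1]
matching_games
```

Here games 1 and 2 contain the move and game 3 does not. If time is dropped from the strings, the `[0-9]+t` parts of the pattern would need to go too.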

Looks like you're making some progress, I'll take a look at your other graphics at lunchtime and maybe make some more suggestions/comments.

1 Like

#12

Have you considered parallel sets for visualizing this?

library(tidyverse)
library(ggforce)
games <- tibble::tribble(
    ~id, ~Game,    ~Type,    ~State, ~Time,
    1L,     1, "Type A", "State 1",     1,
    2L,     1, "Type B", "State 1",     2,
    3L,     1, "Type A", "State 2",     3,
    4L,     1, "Type C", "State 2",     4,
    5L,     1, "Type C", "State 3",     5,
    6L,     2, "Type A", "State 1",     1,
    7L,     2, "Type B", "State 1",     2,
    8L,     2, "Type C", "State 1",     3,
    9L,     2, "Type C", "State 1",     4,
    10L,     2, "Type C", "State 2",     5,
    11L,     2, "Type C", "State 3",     6,
    12L,     2, "Type B", "State 3",     7,
    13L,     3, "Type B", "State 1",     1,
    14L,     3, "Type B", "State 2",     2,
    15L,     3, "Type A", "State 2",     3,
    16L,     3, "Type A", "State 3",     4
)

games %>% 
    select(Type, State, Game) %>%
    mutate(Game = as.factor(Game)) %>% 
    gather_set_data(1:2) %>% 
    add_count() %>% 
    ggplot(aes(x, id = id, split = y, value = n)) +
    geom_parallel_sets(aes(fill =  Game), alpha = 0.3, axis.width = 0.1) +
    geom_parallel_sets_axes(axis.width = 0.1) +
    geom_parallel_sets_labels(colour = 'white') +
    theme(axis.title.x=element_blank(),
          axis.text.y=element_blank(),
          axis.ticks.y=element_blank())

0 Likes

#13

I don't understand this from your screenshot. For example, in Game 1 there is State 2 at 150 sec and at 200 sec. And State 1 at 50 sec. I'm a bit confused.

I don't think using geom_smooth, as in your 4th image, is appropriate.

@andresrcs's suggestion is really attractive. But:

  1. I think @stalki is interested in the sequence of changes/movements not just the co-existence of states and types, and I think that is lost in this graphic (unless I'm misreading it). Am I right about the sequence being important @stalki?
  2. I can't imagine how that would scale across many games.
0 Likes

#14

Yep, it looks awesome, but I need to analyse way more than 3 games and, as you understood well, ron, keep the sequence!

Sorry @ron, it's hard for me to explain this in English.

The purpose of this is to line up the games' curves properly on a graph.

I'm using 100 sec for 1 state (but it could be 10 sec or 1 sec or anything else, it doesn't matter).

If game1.state1 has 3 types, I divide 100 / 3 = 33.33.
If game1.state2 has 4 types, I divide 100 / 4 = 25.

If game2.state1 has 5 types, I divide 100 / 5 = 20.
If game2.state2 has 2 types, I divide 100 / 2 = 50.

So,
game1.state1.type1 = 33
game1.state1.type2 = 66 (+33)
game1.state1.type3 = 100 (+33) <<<<< This equals the beginning of state 2.
game1.state2.type1 = 125 (+25)
game1.state2.type2 = 150 (+25)
game1.state2.type3 = 175 (+25)
game1.state2.type4 = 200 (+25) <<<<< This equals the beginning of state 3.

game2.state1.type1 = 20
game2.state1.type2 = 40 (+20)
game2.state1.type3 = 60 (+20)
game2.state1.type4 = 80 (+20)
game2.state1.type5 = 100 (+20) <<<<< This equals the beginning of state 2.
game2.state2.type1 = 150 (+50)
game2.state2.type2 = 200 (+50) <<<<< This equals the beginning of state 3.

Now if I create a graph with Game 1 and Game 2, I'll have all types from state 1 between 0 and 100 and all types from state 2 between 100 and 200, independently of the number of types in each state.
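The simulated time described here could be computed along these lines. This is only a sketch under some assumptions: it uses the lowercase example data from the earlier replies (`game`, `type`, `state`, `time`), assumes the states are always ordered `s1`, `s2`, `s3`, and the names `state_width`, `state_index` and `sim_time` are made up for the illustration.

```r
library(dplyr)

# Example games from the earlier replies, re-created so the block is self-contained
stalki <- data.frame(
  game  = c(1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3),
  type  = c("A","B","A","C","C","A","B","C","C","C","C","B","B","B","A","A"),
  state = c("s1","s1","s2","s2","s3","s1","s1","s1","s1","s2","s3","s3","s1","s2","s2","s3"),
  time  = c(1,2,3,4,5,1,2,3,4,5,6,7,1,2,3,4),
  stringsAsFactors = FALSE
)

state_width <- 100  # arbitrary width given to each state on the x axis

stalki_sim <- stalki %>%
  arrange(game, time) %>%
  # position of the state in the assumed order s1 < s2 < s3
  mutate(state_index = as.integer(factor(state, levels = c("s1", "s2", "s3")))) %>%
  group_by(game, state) %>%
  # spread the rows of each state evenly, ending exactly on the state boundary
  mutate(sim_time = (state_index - 1) * state_width +
                    row_number() * state_width / n()) %>%
  ungroup()

stalki_sim$sim_time[stalki_sim$game == 1]
```

For game 1 this gives 50, 100, 150, 200 and 300, matching the values worked out by hand earlier in the thread. The real time is used only to order the rows within each game.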

I thought about this since I want to keep the order (sequence?) of types and since I can't use the real in-game time.

I know it's like crafting (a French expression, I don't know if it works in English)! :slight_smile:

0 Likes

closed #15

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.

0 Likes