How to read Telegram chat JSON?

I have a personal chat exported from Telegram that I would like to get into a tidy format with three columns: name, timestamp, and message text. Optionally, it would be useful to tag each message as a text message, reply, forwarded message, or sticker. I tried following the guide on this page to no avail, perhaps because the structure of the JSON has changed.

Here is the head of the current structure:

{
  "name": "Grace 🧤",
  "type": "personal_chat",
  "id": 2730825451,
  "messages": [
    {
      "id": 1980499,
      "type": "message",
      "date": "2020-01-01T00:00:02",
      "from": "Henry",
      "from_id": 4325636679,
      "text": "It's 2020..."
    },
    {
      "id": 1980500,
      "type": "message",
      "date": "2020-01-01T00:00:04",
      "from": "Henry",
      "from_id": 4325636679,
      "text": "Fireworks!"
    },
    {
      "id": 1980501,
      "type": "message",
      "date": "2020-01-01T00:00:05",
      "from": "Grace 🧤 🍒",
      "from_id": 4720225552,
      "text": "You're a minute late!"
    },
    {
      
> str(tele.json)
List of 4
 $ name    : chr "Grace <U+0001F9E4>"
 $ type    : chr "personal_chat"
 $ id      : num 4.72e+09
 $ messages:List of 312397
  ..$ :List of 6
  .. ..$ id     : num 1980499
  .. ..$ type   : chr "message"
  .. ..$ date   : chr "2020-01-01T00:00:02"
  .. ..$ from   : chr "Henry"
  .. ..$ from_id: num 4.33e+09
  .. ..$ text   : chr "It's 2020.."
  ..$ :List of 6
  .. ..$ id     : num 1980500
  .. ..$ type   : chr "message"
  .. ..$ date   : chr "2020-01-01T00:00:04"
  .. ..$ from   : chr "Henry"
  .. ..$ from_id: num 4.33e+09
  .. ..$ text   : chr "Fireworks!"

I tried importing it as such:

library(rjson)
tele.json <- fromJSON(file = "twentytwenty.json")

# Replicating the example on the website given, I get NULL
rlist::list.filter(tele.json[["chats"]][["list"]], 
                   .[["name"]] == "Henry")

NULL
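
A likely reason for the NULL, judging only from the str() output above: this export is a single chat whose top-level keys are name, type, id, and messages, so there is no "chats" element to filter. The sketch below rebuilds a tiny, hypothetical copy of that structure by hand (the object name tele_json and its two messages are illustrative, not the real export) and filters the messages list directly with base R.

```r
# Hypothetical miniature of the single-chat export shown above
tele_json <- list(
  name = "Grace",
  type = "personal_chat",
  id = 2730825451,
  messages = list(
    list(id = 1980499, type = "message", date = "2020-01-01T00:00:02",
         from = "Henry", from_id = 4325636679, text = "It's 2020..."),
    list(id = 1980501, type = "message", date = "2020-01-01T00:00:05",
         from = "Grace", from_id = 4720225552, text = "You're a minute late!")
  )
)

# The top-level keys show there is no "chats" element in this layout
names(tele_json)

# Indexing a missing element yields NULL, so the filter has nothing to work on
no_chats <- tele_json[["chats"]][["list"]]

# Filtering the messages list directly works
from_henry <- Filter(function(m) identical(m[["from"]], "Henry"),
                     tele_json[["messages"]])
length(from_henry)
```

The same Filter() call should apply to the real object as tele.json[["messages"]], assuming it has the layout shown by str().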

Each message looks something like this:

$messages[[974]]
$messages[[974]]$id
[1] 1981527

$messages[[974]]$type
[1] "message"

$messages[[974]]$date
[1] "2020-01-01T21:39:51"

$messages[[974]]$from
[1] "Henry"

$messages[[974]]$from_id
[1] 4325636679

$messages[[974]]$text
[1] "The quick brown fox jumped over a lazy dog"

# Prints till [[1000]] then truncates
 [ reached getOption("max.print") -- omitted 311397 entries ]

Replies look like this (I show the str() output because I could not find one in the printed output):

 ..$ :List of 7
  .. ..$ id                 : num 1980589
  .. ..$ type               : chr "message"
  .. ..$ date               : chr "2020-01-01T00:13:43"
  .. ..$ from               : chr "Grace <U+0001F9E4> <U+0001F352>"
  .. ..$ from_id            : num 4.72e+09
  .. ..$ reply_to_message_id: num 1980585
  .. ..$ text               : chr "I like trains~"

I presume forwarded messages would be different too, but I was unable to locate an example.

Stickers look like this, which I would like to label so I can count or remove them:

$messages[[969]]
$messages[[969]]$id
[1] 1981522

$messages[[969]]$type
[1] "message"

$messages[[969]]$date
[1] "2020-01-01T21:39:24"

$messages[[969]]$from
[1] "Grace \U0001f9e4 \U0001f352"

$messages[[969]]$from_id
[1] 4720225552

$messages[[969]]$file
[1] "(File not included. Change data exporting settings to download.)"

$messages[[969]]$thumbnail
[1] "(File not included. Change data exporting settings to download.)"

$messages[[969]]$media_type
[1] "sticker"

$messages[[969]]$sticker_emoji
[1] "\U0001f60d"

$messages[[969]]$width
[1] 512

$messages[[969]]$height
[1] 512

$messages[[969]]$text
[1] ""
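
Since replies, forwards, and stickers differ from plain messages only in which fields are present, one way to label them is to test for those fields. This is a minimal sketch: tag_message is a hypothetical helper, the label names are made up, and the precedence (sticker, then forwarded, then reply, then other media) is an assumption rather than anything Telegram defines. The field names are the ones visible in the examples in this thread.

```r
# Hypothetical helper: label one message (a list) by the fields it carries
tag_message <- function(msg) {
  if (identical(msg[["media_type"]], "sticker")) return("sticker")
  if (!is.null(msg[["forwarded_from"]])) return("forwarded")
  if (!is.null(msg[["reply_to_message_id"]])) return("reply")
  if (!is.null(msg[["photo"]]) || !is.null(msg[["file"]])) return("media")
  "text"
}

# Cut-down versions of messages shown in this thread
sticker <- list(id = 1981522, type = "message", media_type = "sticker",
                sticker_emoji = "\U0001f60d", text = "")
reply <- list(id = 1980589, type = "message",
              reply_to_message_id = 1980585, text = "I like trains~")
plain <- list(id = 1980500, type = "message", text = "Fireworks!")

kinds <- vapply(list(sticker, reply, plain), tag_message, character(1))
table(kinds)
```

On the full export, the same vapply() call over the messages list would give one label per message, which can then be counted or used to drop stickers.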

Changes in the emojis attached to the saved name also seem like they might be an issue, though I suppose it would be possible to match the name case-insensitively with str_detect.
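
Two ways to handle the changing display names can be sketched in base R. The first is the case-insensitive pattern match mentioned above (grepl() here instead of str_detect()). The second relies on from_id, which stays the same in the examples even when the emojis change; whether it is stable across a whole export is an assumption. The data frame msgs and the column names is_grace and sender are illustrative.

```r
# Illustrative data: the same sender appears under two display names
msgs <- data.frame(
  from = c("Henry", "Grace \U0001f9e4", "Grace \U0001f9e4 \U0001f352"),
  from_id = c(4325636679, 4720225552, 4720225552),
  stringsAsFactors = FALSE
)

# Option 1: case-insensitive pattern match on the display name
msgs$is_grace <- grepl("grace", msgs$from, ignore.case = TRUE)

# Option 2: use the first display name seen for each from_id as the canonical one
# (match(x, x) returns the index of the first occurrence of each value)
msgs$sender <- msgs$from[match(msgs$from_id, msgs$from_id)]

msgs[, c("from_id", "is_grace", "sender")]
```

Option 2 avoids guessing at name patterns, at the cost of trusting that each person keeps one from_id.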

Appreciate any help!

It's still a bit hard for me to wrap my head around the structure of the JSON file. These are text messages, and I understand that you cannot share everything, but would it be possible to share a big enough chunk of the data (redacted if necessary)?

Yep. Here is a small example which I extracted from a less lengthy chat (this is the entirety of the exported JSON):

{
 "name": "Douglas",
 "type": "personal_chat",
 "id": 4908364846,
 "messages": [
  {
   "id": 1615952,
   "type": "message",
   "date": "2019-05-21T17:10:55",
   "from": "Grace",
   "from_id": 4325636679,
   "forwarded_from": "The Dodo",
   "file": "(File not included. Change data exporting settings to download.)",
   "thumbnail": "(File not included. Change data exporting settings to download.)",
   "media_type": "video_file",
   "mime_type": "video/mp4",
   "duration_seconds": 1181,
   "width": 1280,
   "height": 720,
   "text": "Cute cat goes crazy over catnip 👊"
  },
  {
   "id": 1615953,
   "type": "message",
   "date": "2019-05-21T17:11:40",
   "from": "Grace",
   "from_id": 4325636679,
   "text": [
    {
     "type": "link",
     "text": "https://youtube.com/"
    }
   ]
  },
  {
   "id": 2259979,
   "type": "message",
   "date": "2020-08-18T02:22:05",
   "from": "Douglas Quentin",
   "from_id": 4908364846,
   "photo": "(File not included. Change data exporting settings to download.)",
   "width": 591,
   "height": 1280,
   "text": ""
  },
  {
   "id": 2259981,
   "type": "message",
   "date": "2020-08-18T02:32:33",
   "from": "Douglas Quentin",
   "from_id": 4908364846,
   "photo": "(File not included. Change data exporting settings to download.)",
   "width": 941,
   "height": 1280,
   "text": ""
  },
  {
   "id": 2259982,
   "type": "message",
   "date": "2020-08-18T02:32:37",
   "from": "Douglas Quentin",
   "from_id": 4908364846,
   "text": "These are the cars I am looking at."
  },
  {
   "id": 2259984,
   "type": "message",
   "date": "2020-08-18T02:32:56",
   "from": "Douglas Quentin",
   "from_id": 4908364846,
   "text": "I am sorry"
  },
  {
   "id": 2259985,
   "type": "message",
   "date": "2020-08-18T02:33:03",
   "from": "Douglas Quentin",
   "from_id": 4908364846,
   "text": "Maybe SUVs are not for me after all."
  },
  {
   "id": 2259986,
   "type": "message",
   "date": "2020-08-18T02:33:04",
   "from": "Douglas Quentin",
   "from_id": 4908364846,
   "text": "But"
  },
  {
   "id": 2259987,
   "type": "message",
   "date": "2020-08-18T02:33:20",
   "from": "Grace",
   "from_id": 4325636679,
   "text": "Auto or manual?!"
  },
  {
   "id": 2259988,
   "type": "message",
   "date": "2020-08-18T02:33:22",
   "from": "Douglas Quentin",
   "from_id": 4908364846,
   "text": "Am I too lazy"
  },
  {
   "id": 2259989,
   "type": "message",
   "date": "2020-08-18T02:33:24",
   "from": "Douglas Quentin",
   "from_id": 4908364846,
   "text": "Hello"
  },
  {
   "id": 2259990,
   "type": "message",
   "date": "2020-08-18T02:33:29",
   "from": "Douglas Quentin",
   "from_id": 4908364846,
   "text": "Auto is more expensive"
  },
  {
   "id": 2259991,
   "type": "message",
   "date": "2020-08-18T02:33:39",
   "from": "Grace",
   "from_id": 4325636679,
   "text": "Yeah"
  },
  {
   "id": 2259992,
   "type": "message",
   "date": "2020-08-18T02:33:40",
   "from": "Douglas Quentin",
   "from_id": 4908364846,
   "text": "That car has a nice dashboard though"
  },
  {
   "id": 2259993,
   "type": "message",
   "date": "2020-08-18T02:33:43",
   "from": "Douglas Quentin",
   "from_id": 4908364846,
   "text": "Pulled the clutch"
  },
  {
   "id": 2259994,
   "type": "message",
   "date": "2020-08-18T02:33:55",
   "from": "Douglas Quentin",
   "from_id": 4908364846,
   "text": "\" I am a petrolhead\""
  },
  {
   "id": 2259995,
   "type": "message",
   "date": "2020-08-18T02:34:10",
   "from": "Douglas Quentin",
   "from_id": 4908364846,
   "reply_to_message_id": 2259987,
   "text": "No electric"
  },
  {
   "id": 2259996,
   "type": "message",
   "date": "2020-08-18T02:34:12",
   "from": "Douglas Quentin",
   "from_id": 4908364846,
   "text": "Very different engine sound"
  },
  {
   "id": 2259997,
   "type": "message",
   "date": "2020-08-18T02:34:29",
   "from": "Grace",
   "from_id": 4325636679,
   "text": "Looks the same"
  },
  {
   "id": 2259998,
   "type": "message",
   "date": "2020-08-18T02:34:32",
   "from": "Grace",
   "from_id": 4325636679,
   "text": "Ambiguous"
  },
  {
   "id": 2259999,
   "type": "message",
   "date": "2020-08-18T02:34:38",
   "from": "Douglas Quentin",
   "from_id": 4908364846,
   "photo": "(File not included. Change data exporting settings to download.)",
   "width": 909,
   "height": 1280,
   "text": ""
  },
  {
   "id": 2260000,
   "type": "message",
   "date": "2020-08-18T02:34:41",
   "from": "Grace",
   "from_id": 4325636679,
   "text": "Tell the salesman this."
  },
  {
   "id": 2260001,
   "type": "message",
   "date": "2020-08-18T02:34:42",
   "from": "Grace",
   "from_id": 4325636679,
   "text": "Haha"
  },
  {
   "id": 2260002,
   "type": "message",
   "date": "2020-08-18T02:34:48",
   "from": "Douglas Quentin",
   "from_id": 4908364846,
   "text": "Is this silver or off-white?"
  },
  {
   "id": 2260003,
   "type": "message",
   "date": "2020-08-18T02:35:15",
   "from": "Douglas Quentin",
   "from_id": 4908364846,
   "text": "I thought you were referencing this list of white unicorns"
  }
 ]
}

After redacting, I saved it as a .txt file. I am not sure whether .json and .txt differ in any way, but I managed to import the .txt with fromJSON like so:

tele.json <- rjson::fromJSON(file = "jsonderulo.txt")
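
On the .json versus .txt question: the parser reads the file contents as text, so the extension should not matter as long as the contents are valid JSON. A small check, using a made-up one-message JSON string written to two temporary files:

```r
library(rjson)

# Made-up JSON, written once with each extension
json_text <- '{"name": "Douglas", "messages": [{"id": 1, "text": "Hello"}]}'
path_json <- tempfile(fileext = ".json")
path_txt <- tempfile(fileext = ".txt")
writeLines(json_text, path_json)
writeLines(json_text, path_txt)

# Parse both files and compare the resulting R objects
from_json <- fromJSON(file = path_json)
from_txt <- fromJSON(file = path_txt)
identical(from_json, from_txt)
```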

If you want to import your dataset in a format that is more manageable, I suggest you use the jsonlite package instead. I saved the data you posted in a .txt file and imported it with the following code. This way you obtain a nice data frame/tibble that you can wrangle to your heart's content:

library(jsonlite)
library(tibble)

json_full <- fromJSON("telegram_text.txt")
dat <- as_tibble(json_full$messages)
dat

# A tibble: 25 x 16
        id type   date    from   from_id forwarded_from file        thumbnail      media_type mime_type duration_seconds width height text   photo       reply_to_messag~
     <int> <chr>  <chr>   <chr>    <dbl> <chr>          <chr>       <chr>          <chr>      <chr>                <int> <int>  <int> <list> <chr>                  <int>
 1 1615952 messa~ 2019-0~ Grace   4.33e9 The Dodo       (File not ~ (File not inc~ video_file video/mp4             1181  1280    720 <chr ~ NA                        NA
 2 1615953 messa~ 2019-0~ Grace   4.33e9 NA             NA          NA             NA         NA                      NA    NA     NA <df[,~ NA                        NA
 3 2259979 messa~ 2020-0~ Dougl~  4.91e9 NA             NA          NA             NA         NA                      NA   591   1280 <chr ~ (File not ~               NA
 4 2259981 messa~ 2020-0~ Dougl~  4.91e9 NA             NA          NA             NA         NA                      NA   941   1280 <chr ~ (File not ~               NA
 5 2259982 messa~ 2020-0~ Dougl~  4.91e9 NA             NA          NA             NA         NA                      NA    NA     NA <chr ~ NA                        NA
 6 2259984 messa~ 2020-0~ Dougl~  4.91e9 NA             NA          NA             NA         NA                      NA    NA     NA <chr ~ NA                        NA
 7 2259985 messa~ 2020-0~ Dougl~  4.91e9 NA             NA          NA             NA         NA                      NA    NA     NA <chr ~ NA                        NA
 8 2259986 messa~ 2020-0~ Dougl~  4.91e9 NA             NA          NA             NA         NA                      NA    NA     NA <chr ~ NA                        NA
 9 2259987 messa~ 2020-0~ Grace   4.33e9 NA             NA          NA             NA         NA                      NA    NA     NA <chr ~ NA                        NA
10 2259988 messa~ 2020-0~ Dougl~  4.91e9 NA             NA          NA             NA         NA                      NA    NA     NA <chr ~ NA                        NA
# ... with 15 more rows
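
From a data frame like this, the three columns originally asked for (name, timestamp, message text) can be sketched as follows. The example parses a short inline JSON string instead of the real file so that it runs on its own. Two assumptions to flag: the time zone of the export is not stated anywhere in the thread, so tz = "UTC" is a placeholder, and the text column is a plain character vector here only because every message in this cut-down example is a simple string.

```r
library(jsonlite)

# Cut-down inline copy of two messages from the export above
json_text <- '{
  "name": "Douglas",
  "messages": [
    {"id": 2259987, "date": "2020-08-18T02:33:20", "from": "Grace",
     "text": "Auto or manual?!"},
    {"id": 2259995, "date": "2020-08-18T02:34:10", "from": "Douglas Quentin",
     "reply_to_message_id": 2259987, "text": "No electric"}
  ]
}'

msgs <- fromJSON(json_text)$messages

tidy <- data.frame(
  name = msgs$from,
  # tz = "UTC" is an assumption; the export does not say which zone it uses
  timestamp = as.POSIXct(msgs$date, format = "%Y-%m-%dT%H:%M:%S", tz = "UTC"),
  text = msgs$text,
  # Fields missing from a message come back as NA, which marks non-replies
  is_reply = !is.na(msgs$reply_to_message_id),
  stringsAsFactors = FALSE
)
tidy
```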

Thanks for getting me off the ground with this. I managed to wrangle most of the text from there. I will mark your response as solution.

Though if you could help me further, one small issue I have is that forwarded messages, replies with links, and other messages that are not text-only are difficult to extract consistently because they are nested in different list or data frame structures. For example,

# to extract mentions a.k.a. replies
dat$text[[1]][[2]]$text

# hyperlink with comment (to extract the comment)
dat$text[[2]][[2]]

# hyperlink only messages (to extract the link)
dat$text[[3]]$text
# Reprex
structure(list(from = c("Grace <U+0001F9E4> <U+0001F352>", "Grace <U+0001F9E4> <U+0001F352>", 
"Henry", "Henry"), text = list(list("Rekt ", list(type = "mention", 
    text = "at myself")), list(list(type = "link", text = "https://youtu.be/abcdefg"), 
    "\n\nThis is so funny"), structure(list(type = "phone", text = "RIP my phone 2017 - 2019"), class = "data.frame", row.names = 1L), 
    list(list(type = "link", text = "https://www.youtube.com/abcdegfgsdfjshgf"), 
        " Feel better"))), row.names = c(NA, 
-4L), class = c("tbl_df", "tbl", "data.frame"))

Might there be a way to extract these easily despite their different structures? An if-else implementation that checks is.list, is.data.frame, and is.character did not work because the list structures themselves differ.

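# msgs holds one extracted text per message
msgs <- data.frame(text = rep(NA_character_, nrow(dat)), stringsAsFactors = FALSE)

for (i in seq_len(nrow(dat))) {

  if (is.character(dat$text[[i]])) {
    msgs[i, 1] <- dat$text[[i]]               # for character
  } else if (is.data.frame(dat$text[[i]])) {
    msgs[i, 1] <- dat$text[[i]]$text          # for df
  } else {
    # for list: assumes the second element is a typed fragment with a text field,
    # which fails when that element is a plain string
    msgs[i, 1] <- dat$text[[i]][[2]]$text
  }
}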
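
One way around the differing structures is a small recursive helper that reduces every variant to a single string. This is a sketch, not a tested solution for the full export: flatten_text is a hypothetical name, and the code assumes each typed fragment carries its content in a text field, as in the reprex above. The data frame check has to come before the list check because a data frame is also a list.

```r
# Hypothetical helper: collapse one entry of the text column to a single string
flatten_text <- function(x) {
  if (is.null(x)) return(NA_character_)
  if (is.character(x)) return(paste(x, collapse = ""))
  # A data frame of fragments: join its text column
  if (is.data.frame(x)) return(paste(x[["text"]], collapse = ""))
  if (is.list(x)) {
    # A single typed fragment such as list(type = "link", text = "...")
    if (!is.null(x[["text"]])) return(paste(x[["text"]], collapse = ""))
    # A mixed list of strings and fragments: flatten each part, then join
    parts <- vapply(x, flatten_text, character(1))
    return(paste(parts, collapse = ""))
  }
  NA_character_
}

# The text column from the reprex above
text_col <- list(
  list("Rekt ", list(type = "mention", text = "at myself")),
  list(list(type = "link", text = "https://youtu.be/abcdefg"),
       "\n\nThis is so funny"),
  structure(list(type = "phone", text = "RIP my phone 2017 - 2019"),
            class = "data.frame", row.names = 1L),
  list(list(type = "link", text = "https://www.youtube.com/abcdegfgsdfjshgf"),
       " Feel better")
)

flat <- vapply(text_col, flatten_text, character(1))
flat
```

Applied to the imported data this would be vapply(dat$text, flatten_text, character(1)). It discards the fragment types, so if links or mentions need to be kept apart, the type field would have to be extracted in the same pass.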

I'm definitely willing to help you further.

I am not quite sure what kind of wrangling you did, but would it not be easier to work directly with dat, the data frame? It seems to me that you nested dat by the from variable?

Yes, I am working with dat as in your solution. The issue I am having can be seen in the text column. While most messages (text only) have the structure <chr [1]>, links, links with comments, replies, etc. have a different structure. For example, in the second row the message was a link

[[1]]
  type                 text
1 link https://youtube.com/

and is in the structure <df[,2] [1 x 2]>. But because they are quite varied in their structure, I have not found a way to tell R to differentiate/automate the extraction of the link or text in such types of messages.

Okay, I understand now. I'll look into it and get back to you 🙂

@bayesian

I just looked at it, and it turns out that you can easily get a workable format with tidyr::unnest(). Using the data you provided above:

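library(jsonlite)
library(tibble)
library(dplyr)   # for %>% and select()
library(tidyr)   # for unnest()

json_full <- fromJSON("telegram_text.txt")

# Keep the sender and text, then expand the nested text column
dat <- as_tibble(json_full$messages) %>%
  select(from, text) %>%
  unnest(cols = text)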

dat

# A tibble: 25 x 3
   from            text                                         type 
   <chr>           <chr>                                        <chr>
 1 Grace           "Cute cat goes crazy over catnip \U0001f44a" NA   
 2 Grace           "https://youtube.com/"                       link 
 3 Douglas Quentin ""                                           NA   
 4 Douglas Quentin ""                                           NA   
 5 Douglas Quentin "These are the cars I am looking at."        NA   
 6 Douglas Quentin "I am sorry"                                 NA   
 7 Douglas Quentin "Maybe SUVs are not for me after all."       NA   
 8 Douglas Quentin "But"                                        NA   
 9 Grace           "Auto or manual?!"                           NA   
10 Douglas Quentin "Am I too lazy"                              NA   
# ... with 15 more rows
