[00:00.000 --> 02:29.800]
Hi everyone, it's a big pleasure to be here. My name is Armin Pournaki, I am a PhD candidate in applied mathematics, and I work on building tools for discourse analysis based on methods from graph theory, network science and natural language processing. Today I want to present a tool called the Twitter Explorer. It is already a bit older and was built at the Max Planck Institute for Mathematics in the Sciences, in my previous group. The idea was to build a tool that allows researchers who don't necessarily have programming skills to collect Twitter data, visualize them using graphs, explore the data, and maybe generate hypotheses in their pipelines.

This kind of tool building and research happens in the field called computational social science. When I was preparing my slides two days ago, I thought it would be good to give a little overview of computational social science, then say why we built the Twitter Explorer and where we saw the need for a new tool, introduce the features of the tool and its architecture, since this is a talk about programming, and maybe give some insights into its usage. But when I sat down to make the slides two days ago, I was confronted with this: the announcement that free access to the Twitter API is being shut down. Since the tool is essentially an entry point into the free API, and part of it also uses the research API, this leads directly to the question: what happens to the research API? That is also not entirely clear, right?

So instead of only giving the talk the way I had planned (I will still give it), I first want to throw out a few questions that we might take up in the discussion. I think there is even something planned later, some kind of panel discussion. These are questions that I think are really pressing right now, especially in the research field.
[02:29.800 --> 05:45.640]
How serious is this? By that I don't mean the implications (I know a few people whose thesis is now in jeopardy because they can no longer collect data), but how serious is it in the sense: will it actually happen, or is it some scare tactic? I think this is hard to predict. Then there are questions we can discuss here: is there a way for us as users, and not only as researchers, to claim our data, the digital traces that we leave on these platforms, and how can things like the Digital Services Act play a role in this? And the last question is very broad: how do we move on? How can we see this as some kind of wake-up call, and how can we use this new development on the one hand to move to different platforms, and on the other hand to think about how we do computational social science in the future?

With these questions, which we will discuss later, I am still going to give my original talk. In computational social science, a typical pipeline for a project is: you have a research question, you collect data related to it, in this case data from online social platforms, and you analyze it, and ideally you generate more insight into the research question you had in the beginning. Sometimes the exploration and analysis of the data also helps you refine the questions you started with, so you can see it as a kind of loop. The tool I am going to present, the Twitter Explorer, is made precisely for this second part: it facilitates both the collection and the exploration of such data.

The pipeline is that we start with text, in our case tweets, annotated with some metadata. On Twitter we have different types of interactions: you can mention someone, reply to someone, or retweet. We choose one type of metadata and cast it into an interaction network. Then we want to find, for instance, the most significant clusters or correlations in this data by using 2D spatializations. Typically these are done using force layouts; today in the graph devroom there were also some talks about new methods of node embedding, so this is something we can also discuss in the question section. One reason why I think force layouts are good is that, especially when you work with social science researchers who do not necessarily know much about the latest machine learning algorithms, they are quite straightforward to explain: you have a spring system, and nodes that are strongly connected tend to attract each other.
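To make this step concrete, here is a minimal sketch (not the Twitter Explorer's own code) of how retweet metadata can be cast into a directed interaction network and laid out with a force-directed algorithm using networkx; the tweet records and field names are hypothetical.

    # Minimal sketch: cast retweet metadata into a directed interaction network
    # and compute a force-directed layout. Not the Twitter Explorer's internal
    # code; the tweet records and field names are hypothetical.
    import networkx as nx

    tweets = [
        {"user": "alice", "retweeted_user": "carol", "text": "RT @carol: ..."},
        {"user": "bob",   "retweeted_user": "carol", "text": "RT @carol: ..."},
        {"user": "carol", "retweeted_user": None,    "text": "an original tweet"},
    ]

    G = nx.DiGraph()
    for tweet in tweets:
        if tweet["retweeted_user"]:
            # Draw a link from A to B if A retweets B; repeated retweets add weight.
            a, b = tweet["user"], tweet["retweeted_user"]
            if G.has_edge(a, b):
                G[a][b]["weight"] += 1
            else:
                G.add_edge(a, b, weight=1)

    # Fruchterman-Reingold spring layout: strongly connected nodes attract each
    # other, so densely interacting groups end up close together in 2D.
    positions = nx.spring_layout(G, weight="weight", seed=42)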
[05:45.640 --> 08:52.880]
Especially if you look at interaction networks on Twitter: since retweeting can be considered endorsement, clusters in such 2D spatializations can then correspond to something like opinion clusters, and there is a lot of research being done in that direction. But one question we always had when looking at these networks is: how do we actually go back to the data that generated them? This is something we try to tackle by building these tools.

So, why we built it: first, to provide an interface for researchers without programming skills to collect and visualize the data. We were working a lot with social scientists who did not have these programming skills but had a lot of hypotheses about the data that they could not test. Then, of course, to facilitate the exploration of controversial issues on social media. Next, and this is the point I was making before, to add a layer of interpretability to these 2D spatializations by providing access, from within the interface, to the actual data that created the node positions. And finally, we see it in the context of a larger scientific scope: using the network paradigm as something like a sampling mechanism for the data. If you are confronted with a large number of tweets, everyone knows you cannot read all of them manually, so you need some way to get to the tweets that are relevant for you to read, and this is essentially what we use the network for. When we look at retweet networks, we can immediately identify, for instance, the most influential actors in a debate and then read precisely those tweets that they made to influence other actors. We call this guided close reading: if you do only close reading, you have to read all the text; with distant reading you look only at the network on a structural level; this is something in between.

So, what can the tool do? It collects tweets; I think we have about one week left for the v2 and the v1, and so far the v2 Academic access is safe, but we don't know that for certain. You can search tweets from the past seven days using the API. In the second part, the visualizer, you can display a simple time series of the tweets, to see whether there was some special activity on a given day. You can build these interaction networks and co-hashtag networks (we divide them into two types, which we call semantic networks and interaction networks), and then you can compute the typical measures people compute on networks, and in particular compute clusters, for example with modularity-based algorithms.
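As a rough illustration of that clustering step, this is how modularity-based (Louvain) communities could be computed with networkx 2.8 or newer; a generic sketch, not the tool's own code path.

    # Sketch of modularity-based community detection (requires networkx >= 2.8
    # for louvain_communities). Generic example, not the Twitter Explorer's code.
    import networkx as nx
    from networkx.algorithms import community

    G = nx.karate_club_graph()  # stand-in for a retweet network
    # For a directed retweet graph one would typically cluster the undirected
    # projection; Louvain greedily maximises modularity.
    partition = community.louvain_communities(G.to_undirected(), seed=0)

    for i, nodes in enumerate(partition):
        print(f"community {i}: {sorted(nodes)}")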
[08:52.880 --> 13:06.920]
All of this happens in an interactive interface using JavaScript and d3.js, and this is essentially the part where it gets interesting, because everything so far you can also do with a lot of other tools, especially Gephi (I think you can even collect tweets with some plugins), so all of that is not new. This is where it gets interesting, and I think this is time for a quick demo.

I don't know how much... okay, I have plenty of time, I think I talk too fast. Okay. I have prepared some Python environments that already have the Twitter Explorer installed, but usually you would simply install it with pip (the package is called twitterexplorer). Then all you need to do to fire up this interactive interface is type twitterexplorer-collector, and this will open a browser window from which you can choose your API access, choose the path to which the tweets will be downloaded, and insert your search query, maybe adding some advanced settings and saving options.

So, this is a question to the audience now: what should we search for? That one is easy, and you are looking into the future there: I already have that network prepared for the last slide, sorry. We could, but what would we look for then? "API"? Is there maybe a hashtag, something like "API shutdown"? Maybe we need to go to Twitter itself... "API", something like this. Ideally we would find some kind of hashtag... no. Okay, let's just use this as the search query. Okay, now it is collecting in the background.

Then we can open another browser window and fire up the visualizer. Now we see that, while this is still collecting, we can already access the data. Oh, there were only 400 tweets, so there seems to be not that much. So, we can look at a time series of the tweets, and then we can choose different types of networks to create. We can filter them by language if we want; this is the language that the Twitter API returns, so there is no language detection going on here. We can apply some network reduction methods, like taking only the largest connected component of the graph. Then we have this option here to remove the metadata of nodes that are not what we call public figures. If you want to publish explorable networks, it is advisable to do so. As far as I know there is no very distinct or clear rule for the point at which someone is considered such a public figure, but within our consortium we decided that it is 5,000 followers. This is also something we could discuss, but since Twitter is public by default, anything you post is potentially going to be used and displayed somewhere. Then you can export the graph to all sorts of formats.
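The public-figure rule just described can be sketched as a simple filter over node metadata before publishing an explorable network; the 5,000-follower cut-off is the consortium's choice mentioned above, and the attribute names below are hypothetical.

    # Sketch of the "public figure" rule: strip identifying metadata from
    # accounts below a follower threshold before publishing the network.
    # Threshold and attribute names are illustrative only.
    import networkx as nx

    PUBLIC_FIGURE_THRESHOLD = 5000

    def pseudonymise(graph: nx.DiGraph) -> nx.DiGraph:
        anonymised = graph.copy()
        for node, data in anonymised.nodes(data=True):
            if data.get("followers_count", 0) < PUBLIC_FIGURE_THRESHOLD:
                data["screen_name"] = "anonymous"  # keep only the structural position
                data.pop("description", None)
        return anonymised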
[13:06.920 --> 17:33.720]
Then you can aggregate nodes. This means removing nodes based on how many retweets they received or how many retweets they made themselves, for instance removing nodes that only retweeted one single person. Is there maybe a chalk somewhere? If you have a graph and there are some nodes that only retweet this one person (I don't know if everyone can see that, actually), they tend to clutter the force-directed layout, and structurally they do not necessarily add anything to the network. So if you have very, very large graphs it makes sense to remove them and merge them into a supernode. Then you can run traditional community detection, and the result will be saved as an HTML file that you can then open.

We see here, this is again a retweet network, that every node is a user and a link is drawn from A to B if A retweets B. Now we can look at this user, t-chambers, and look at the actual tweets that were made for them to end up at this part of the visualization.

Okay, the data we just collected is kind of sparse, so this network does not look that interesting, but I have prepared a fallback option. What we did in a case study a few months ago was to look at the repercussions of some discussions in the US about red flag laws. Red flag laws are specific gun-control laws that allow state-level judges to temporarily confiscate guns from people who are deemed to be a threat to themselves or to the public. These laws created very big repercussions, especially on social media and especially in the conservative camp. This is one typical example where people analyze on Twitter whether there is something like echo chambers, whether people only retweet others from similar camps, and then very quick conclusions are drawn very fast. What we want to do with this tool is to show that maybe things are not as simple as they seem.

So I have prepared these networks; I think I will make it a bit smaller. This is now a bit bigger than what we had before: roughly 25,000 nodes and 90,000 links. And this is already one limitation of the tool that I would like to discuss at the end: you cannot display arbitrarily huge graphs (around 100,000 links is roughly the limit), and I think this is where integrating it with other tools such as sigma.js or Gephi might actually make a lot of sense. Now I can colour the nodes by their Louvain community, we can turn off the light too, and we can wonder what these two communities are. Right now the node size is proportional to the in-degree, meaning how often a given node was retweeted, so these nodes may be considered something like the opinion leaders of the given camps.
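This reading of node size connects back to the guided close reading idea from earlier: rank accounts by weighted in-degree (how often they were retweeted) and read exactly their tweets. A small sketch, reusing the hypothetical graph and tweet records from the first example:

    # Sketch of "guided close reading": list the most-retweeted accounts and
    # print only their tweets. Assumes the hypothetical graph G and tweets
    # list from the earlier sketch.
    import networkx as nx

    def opinion_leaders(graph: nx.DiGraph, tweets, top_n=10):
        ranking = sorted(graph.in_degree(weight="weight"),
                         key=lambda pair: pair[1], reverse=True)
        for user, times_retweeted in ranking[:top_n]:
            print(f"{user}: retweeted {times_retweeted} times")
            for tweet in tweets:
                if tweet["user"] == user:
                    print("   ", tweet["text"])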
[17:33.720 --> 21:14.520]
If we go here, we see for instance Donald Trump Jr. on this side, and we can then look at exactly the tweets that led the visualization to put him where he is. Okay, we don't need to go into the details of what he said, but you see the point. We can also change the node size to the number of followers, and then we get an immediate view of who the main actors are that are also influential on Twitter in general; we have the New York Times here, and the Wall Street Journal.

So we can see that we have something like a more liberal versus a more conservative camp, and if we looked only at the retweet behaviour we might think that these are separated echo chambers and that people do not talk to each other. But what is interesting is to look at other types of networks in this example. We can look at the replies (I think I will make it a bit smaller), and all of a sudden we do not see this very strongly segregated clustering anymore that we saw before; maybe it is easier if I put it like this. We see something more like a hairball layout, and when we look at the nodes, we see that the path from, say, Donald Trump to Hillary Clinton or the New York Times, accounts that were very far apart in the retweet network, is maybe not that long in the reply network. That means these opposing camps actually do talk to each other, and it might be more interesting to see how they talk to each other and what they say. This is something you can do when you use this interface and look at the tweets and the actual replies: it allows you to go to the parts of the platform that generated this data and that then generated these networks.

Finally, as a small example of the semantic networks, we can look at the hashtags that are used. You see, for instance, that there is one kind of hateful, conservative hashtag cluster. And, okay, maybe I should have said that in the hashtag networks every node is a hashtag, and two hashtags are connected if they appear together in the same tweet. This is a very low-level way of seeing what is going on in the data: you don't need topic modelling or other complicated techniques; literally just by looking at the hashtags you already get a hint of how the different camps speak about the same topic. If you go to this area, it is about gun confiscation laws. "Marxism" is also a good example in this case: right now we don't really know how it is used, it can be used either by conservatives or by liberals, and it is important to look at it in the context of the data. So then we would have to... okay, five minutes left, good. I will go back to the slides.
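For reference, the co-hashtag construction just described fits in a few lines: every node is a hashtag, and an edge is added whenever two hashtags co-occur in one tweet. The records are again hypothetical.

    # Sketch of a co-hashtag ("semantic") network: nodes are hashtags, and two
    # hashtags are linked if they appear together in the same tweet.
    from itertools import combinations
    import networkx as nx

    tweets = [
        {"hashtags": ["redflaglaws", "guncontrol"]},
        {"hashtags": ["redflaglaws", "2a", "marxism"]},
    ]

    H = nx.Graph()
    for tweet in tweets:
        for h1, h2 in combinations(sorted(set(tweet["hashtags"])), 2):
            if H.has_edge(h1, h2):
                H[h1][h2]["weight"] += 1
            else:
                H.add_edge(h1, h2, weight=1)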
[21:14.520 --> 24:01.160]
So, under the hood, the whole backend of the collector and the visualizer is written in Python, and it uses the Streamlit Python library to serve it on a local front end. This is actually a very convenient library, and I guess a lot of people know it: you write your code in Python, and it essentially serves it as interfaces that look like this. The explorer itself is written in HTML and JavaScript; it uses d3 and draws the graph on canvas, which is also why it is probably not as fast as sigma.js, but it has some other nice features, especially thanks to the force-graph library. If anyone has questions, I can go into the details in the question session.

This is how you install it; it is fairly simple if you have a running Python 3.7 or newer. There is also a Python API; here, especially, people will probably not be so interested in using the Streamlit interface, but you may want to include it in an existing code pipeline, and this is essentially the API for semantic networks and interaction networks. So I invite you to try it out yourself while you still can; you have five days. Of course, if you have the research API you might be able to use it for a bit longer, but otherwise go to these websites fast.

I will end the talk with some questions. I actually came here with more questions than answers, and I am really hoping for a lively discussion, because I am not originally a developer (I wrote this largely on my own) and I wonder whether this integration of Python and JavaScript is actually a good idea, because in theory it would probably also be possible to do everything in JavaScript, maybe on the client side, so you would not have to install all these libraries. Okay, and maybe one thing that I would like to show is that I experimented with temporal networks. Doing temporal force layouts is of course a non-trivial task, but we can look a little bit at the temporality of these networks by at least displaying only the links that are active during a given day. This is also kind of nice, I think, but I would like to discuss other visualization paradigms for this kind of network.
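That day-by-day view can be sketched as a simple edge filter over timestamps: a minimal illustration that assumes each edge carries the datetimes of its underlying retweets, which is not necessarily how the tool stores them internally.

    # Sketch of the temporal view: keep only the links active during a given
    # day. The per-edge "timestamps" attribute is a hypothetical stand-in.
    from datetime import date
    import networkx as nx

    def links_active_on(graph: nx.DiGraph, day: date) -> nx.DiGraph:
        snapshot = nx.DiGraph()
        snapshot.add_nodes_from(graph.nodes(data=True))
        for source, target, data in graph.edges(data=True):
            timestamps = data.get("timestamps", [])  # datetimes of the retweets
            if any(ts.date() == day for ts in timestamps):
                snapshot.add_edge(source, target, **data)
        return snapshot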
[24:01.160 --> 29:16.600]
Then one thing that would be really interesting, I think, is to dig deeper into a visualization paradigm for the hierarchical structure of communities. In theory I can run stochastic block models or Louvain community detection and stop it at a certain level, and then I have some kind of hierarchical node structure, but how to visualize that is another question. I think it would be very interesting, especially for very large graphs. And then another question: force layouts. Should we still use them, now that everyone is doing node2vec and all these other embedding methods? I think yes, but maybe there are good arguments against it.

On a deeper conceptual level, and this first one is a question for people who already have much more experience in building tools for the social sciences: how do you further integrate these kinds of methods into existing, maybe also more qualitative, social science pipelines? That is an open question. And how can we devise something like a research protocol for these kinds of interactive network visualizations? Because, as you saw in my demo, we look at the big nodes, we look at the tweets they made, and it gives us some intuition of what is going on in the debate, but how can we formalize this kind of visual network analysis? I know there are people in the audience who actually work on this, so it will be very interesting for me to talk about it.

And finally, to end on a somewhat nicer note: there is the retweet network of FOSDEM, as I already mentioned in the beginning, on this website. It is updated every 15 minutes thanks to a data collection done by my colleague Beatrice. Thank you very much. If you go to this website you can see the retweet network of FOSDEM, and if you tweet, you can also find yourself in the network. So, what do we have here in the middle? Okay, FOSDEM itself, then there was Ubuntu, Debian somewhere... Okay, time's up. Thank you.

[Audience question, off-mic.] Yes. So the question is: I mentioned that you can only collect tweets from the last seven days with the free API. This is a limitation, but the tool itself just writes into an existing CSV, so it depends: if you do the same keyword search multiple times, it will simply append to that CSV.

[Audience question, off-mic.] Yes, I mean, that is the question right now. It depends, because the question is what happens on Mastodon. If you want to look at political controversies and that kind of discussion, I don't know whether Mastodon is mature enough yet, or adopted widely enough yet. If you want to look at the free-software community, it is great, so for that, yes.

[Audience question, largely unintelligible in the recording.] Yeah... I don't know what to think about that. Well, I don't know which point exactly I should address, because you raised a lot. So, is it okay if I rephrase? You are concerned about this kind of research because it can be used to track users across political camps, right? Yes? Okay, I see.
[29:16.600 --> 29:52.280]
So I think it is more about the representativity of Twitter data for the wider population, and there you are of course totally right: it is a subset of highly politicized, maybe also somewhat more educated than average, people, so you cannot generalize from it. But that is also not what we are trying to do; we are not trying to infer, I don't know, election results from Twitter data. So, yeah, I don't know whether I addressed the point; maybe we can talk more about it afterwards.