[00:00.000 --> 02:29.800]
Hi everyone, it's a big pleasure to be here. My name is Armin Pournaki, I am a PhD candidate in applied mathematics, and I work on building tools for discourse analysis based on methods from graph theory, network science and natural language processing. Today I want to present a tool called the Twitter Explorer. It is already a bit older and was built at the Max Planck Institute for Mathematics in the Sciences, in my previous group. The idea was to build a tool that allows researchers who don't necessarily have programming skills to collect Twitter data, visualize them using graphs, explore the data, and maybe generate hypotheses in their pipelines.

This kind of tool building and research happens in the field called computational social science. When I was preparing my slides two days ago, I thought it would be good to give a little overview of computational social science, then say why we built the Twitter Explorer and where we saw the need for a new tool, introduce the features of the tool and its architecture, since this is a talk about programming, and maybe give some insights into its usage. But when I sat down to make the slides two days ago, I was confronted with this: the announcement that free access to the Twitter API is being shut down. Since the tool is essentially an entry point into the free API, and part of it also uses the research API, this leads directly to the question: what happens to the research API? That is also not entirely clear, right?

So instead of only giving the talk the way I had planned (I will still give it), I first want to throw out a few questions that we might take up in the discussion. I think there is even something planned later, some kind of panel discussion. These are questions that I think are really pressing right now, especially in the research field.
[02:29.800 --> 05:45.640]
How serious is this? By that I don't mean the implications (I know a few people whose thesis is now in jeopardy because they can no longer collect data), but how serious is it in the sense: will it actually happen, or is it some scare tactic? I think this is hard to predict. Then there are questions we can discuss here: is there a way for us as users, and not only as researchers, to claim our data, the digital traces that we leave on these platforms, and how can things like the Digital Services Act play a role in this? And the last question is very broad: how do we move on? How can we see this as some kind of wake-up call, and how can we use this new development on the one hand to move to different platforms, and on the other hand to think about how we do computational social science in the future?

With these questions, which we will discuss later, I am still going to give my original talk. In computational social science, a typical pipeline for a project is: you have a research question, you collect data related to it, in this case data from online social platforms, and you analyze it, and ideally you generate more insight into the research question you had in the beginning. Sometimes the exploration and analysis of the data also helps you refine the questions you started with, so you can see it as a kind of loop. The tool I am going to present, the Twitter Explorer, is made precisely for this second part: it facilitates both the collection and the exploration of such data.

The pipeline is that we start with text, in our case tweets, annotated with some metadata. On Twitter we have different types of interactions: you can mention someone, reply to someone, or retweet. We choose one type of metadata and cast it into an interaction network. Then we want to find, for instance, the most significant clusters or correlations in this data by using 2D spatializations. Typically these are done using force layouts; today in the graph devroom there were also some talks about new methods of node embedding, so this is something we can also discuss in the question section. One reason why I think force layouts are good is that, especially when you work with social science researchers who do not necessarily know much about the latest machine learning algorithms, they are quite straightforward to explain: you have a spring system, and nodes that are strongly connected tend to attract each other.
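To make this step concrete, here is a minimal sketch (not the Twitter Explorer's own code) of how retweet metadata can be cast into a directed interaction network and laid out with a force-directed algorithm using networkx; the tweet records and field names are hypothetical.

    # Minimal sketch: cast retweet metadata into a directed interaction network
    # and compute a force-directed layout. Not the Twitter Explorer's internal
    # code; the tweet records and field names are hypothetical.
    import networkx as nx

    tweets = [
        {"user": "alice", "retweeted_user": "carol", "text": "RT @carol: ..."},
        {"user": "bob",   "retweeted_user": "carol", "text": "RT @carol: ..."},
        {"user": "carol", "retweeted_user": None,    "text": "an original tweet"},
    ]

    G = nx.DiGraph()
    for tweet in tweets:
        if tweet["retweeted_user"]:
            # Draw a link from A to B if A retweets B; repeated retweets add weight.
            a, b = tweet["user"], tweet["retweeted_user"]
            if G.has_edge(a, b):
                G[a][b]["weight"] += 1
            else:
                G.add_edge(a, b, weight=1)

    # Fruchterman-Reingold spring layout: strongly connected nodes attract each
    # other, so densely interacting groups end up close together in 2D.
    positions = nx.spring_layout(G, weight="weight", seed=42)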
[05:45.640 --> 08:52.880]
Especially if you look at interaction networks on Twitter: since retweeting can be considered endorsement, clusters in such 2D spatializations can then correspond to something like opinion clusters, and there is a lot of research being done in that direction. But one question we always had when looking at these networks is: how do we actually go back to the data that generated them? This is something we try to tackle by building these tools.

So, why we built it: first, to provide an interface for researchers without programming skills to collect and visualize the data. We were working a lot with social scientists who did not have these programming skills but had a lot of hypotheses about the data that they could not test. Then, of course, to facilitate the exploration of controversial issues on social media. Next, and this is the point I was making before, to add a layer of interpretability to these 2D spatializations by providing access, from within the interface, to the actual data that created the node positions. And finally, we see it in the context of a larger scientific scope: using the network paradigm as something like a sampling mechanism for the data. If you are confronted with a large number of tweets, everyone knows you cannot read all of them manually, so you need some way to get to the tweets that are relevant for you to read, and this is essentially what we use the network for. When we look at retweet networks, we can immediately identify, for instance, the most influential actors in a debate and then read precisely those tweets that they made to influence other actors. We call this guided close reading: if you do only close reading, you have to read all the text; with distant reading you look only at the network on a structural level; this is something in between.

So, what can the tool do? It collects tweets; I think we have about one week left for the v2 and the v1, and so far the v2 Academic access is safe, but we don't know that for certain. You can search tweets from the past seven days using the API. In the second part, the visualizer, you can display a simple time series of the tweets, to see whether there was some special activity on a given day. You can build these interaction networks and co-hashtag networks (we divide them into two types, which we call semantic networks and interaction networks), and then you can compute the typical measures people compute on networks, and in particular compute clusters, for example with modularity-based algorithms.
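As a rough illustration of that clustering step, this is how modularity-based (Louvain) communities could be computed with networkx 2.8 or newer; a generic sketch, not the tool's own code path.

    # Sketch of modularity-based community detection (requires networkx >= 2.8
    # for louvain_communities). Generic example, not the Twitter Explorer's code.
    import networkx as nx
    from networkx.algorithms import community

    G = nx.karate_club_graph()  # stand-in for a retweet network
    # For a directed retweet graph one would typically cluster the undirected
    # projection; Louvain greedily maximises modularity.
    partition = community.louvain_communities(G.to_undirected(), seed=0)

    for i, nodes in enumerate(partition):
        print(f"community {i}: {sorted(nodes)}")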
[08:52.880 --> 13:06.920]
All of this happens in an interactive interface using JavaScript and d3.js, and this is essentially the part where it gets interesting, because everything so far you can also do with a lot of other tools, especially Gephi (I think you can even collect tweets with some plugins), so all of that is not new. This is where it gets interesting, and I think this is time for a quick demo.

I don't know how much... okay, I have plenty of time, I think I talk too fast. Okay. I have prepared some Python environments that already have the Twitter Explorer installed, but usually you would simply install it with pip (the package is called twitterexplorer). Then all you need to do to fire up this interactive interface is type twitterexplorer-collector, and this will open a browser window from which you can choose your API access, choose the path to which the tweets will be downloaded, and insert your search query, maybe adding some advanced settings and saving options.

So, this is a question to the audience now: what should we search for? That one is easy, and you are looking into the future there: I already have that network prepared for the last slide, sorry. We could, but what would we look for then? "API"? Is there maybe a hashtag, something like "API shutdown"? Maybe we need to go to Twitter itself... "API", something like this. Ideally we would find some kind of hashtag... no. Okay, let's just use this as the search query. Okay, now it is collecting in the background.

Then we can open another browser window and fire up the visualizer. Now we see that, while this is still collecting, we can already access the data. Oh, there were only 400 tweets, so there seems to be not that much. So, we can look at a time series of the tweets, and then we can choose different types of networks to create. We can filter them by language if we want; this is the language that the Twitter API returns, so there is no language detection going on here. We can apply some network reduction methods, like taking only the largest connected component of the graph. Then we have this option here to remove the metadata of nodes that are not what we call public figures. If you want to publish explorable networks, it is advisable to do so. As far as I know there is no very distinct or clear rule for the point at which someone is considered such a public figure, but within our consortium we decided that it is 5,000 followers. This is also something we could discuss, but since Twitter is public by default, anything you post is potentially going to be used and displayed somewhere. Then you can export the graph to all sorts of formats.
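The public-figure rule just described can be sketched as a simple filter over node metadata before publishing an explorable network; the 5,000-follower cut-off is the consortium's choice mentioned above, and the attribute names below are hypothetical.

    # Sketch of the "public figure" rule: strip identifying metadata from
    # accounts below a follower threshold before publishing the network.
    # Threshold and attribute names are illustrative only.
    import networkx as nx

    PUBLIC_FIGURE_THRESHOLD = 5000

    def pseudonymise(graph: nx.DiGraph) -> nx.DiGraph:
        anonymised = graph.copy()
        for node, data in anonymised.nodes(data=True):
            if data.get("followers_count", 0) < PUBLIC_FIGURE_THRESHOLD:
                data["screen_name"] = "anonymous"  # keep only the structural position
                data.pop("description", None)
        return anonymised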
[13:06.920 --> 17:33.720]
Then you can aggregate nodes. This means removing nodes based on how many retweets they received or how many retweets they made themselves, for instance removing nodes that only retweeted one single person. Is there maybe a chalk somewhere? If you have a graph and there are some nodes that only retweet this one person (I don't know if everyone can see that, actually), they tend to clutter the force-directed layout, and structurally they do not necessarily add anything to the network. So if you have very, very large graphs it makes sense to remove them and merge them into a supernode. Then you can run traditional community detection, and the result will be saved as an HTML file that you can then open.

We see here, this is again a retweet network, that every node is a user and a link is drawn from A to B if A retweets B. Now we can look at this user, t-chambers, and look at the actual tweets that were made for them to end up at this part of the visualization.

Okay, the data we just collected is kind of sparse, so this network does not look that interesting, but I have prepared a fallback option. What we did in a case study a few months ago was to look at the repercussions of some discussions in the US about red flag laws. Red flag laws are specific gun-control laws that allow state-level judges to temporarily confiscate guns from people who are deemed to be a threat to themselves or to the public. These laws created very big repercussions, especially on social media and especially in the conservative camp. This is one typical example where people analyze on Twitter whether there is something like echo chambers, whether people only retweet others from similar camps, and then very quick conclusions are drawn very fast. What we want to do with this tool is to show that maybe things are not as simple as they seem.

So I have prepared these networks; I think I will make it a bit smaller. This is now a bit bigger than what we had before: roughly 25,000 nodes and 90,000 links. And this is already one limitation of the tool that I would like to discuss at the end: you cannot display arbitrarily huge graphs (around 100,000 links is roughly the limit), and I think this is where integrating it with other tools such as sigma.js or Gephi might actually make a lot of sense. Now I can colour the nodes by their Louvain community, we can turn off the light too, and we can wonder what these two communities are. Right now the node size is proportional to the in-degree, meaning how often a given node was retweeted, so these nodes may be considered something like the opinion leaders of the given camps.
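This reading of node size connects back to the guided close reading idea from earlier: rank accounts by weighted in-degree (how often they were retweeted) and read exactly their tweets. A small sketch, reusing the hypothetical graph and tweet records from the first example:

    # Sketch of "guided close reading": list the most-retweeted accounts and
    # print only their tweets. Assumes the hypothetical graph G and tweets
    # list from the earlier sketch.
    import networkx as nx

    def opinion_leaders(graph: nx.DiGraph, tweets, top_n=10):
        ranking = sorted(graph.in_degree(weight="weight"),
                         key=lambda pair: pair[1], reverse=True)
        for user, times_retweeted in ranking[:top_n]:
            print(f"{user}: retweeted {times_retweeted} times")
            for tweet in tweets:
                if tweet["user"] == user:
                    print("   ", tweet["text"])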
[17:33.720 --> 21:14.520]
If we go here, we see for instance Donald Trump Jr. on this side, and we can then look at exactly the tweets that led the visualization to put him where he is. Okay, we don't need to go into the details of what he said, but you see the point. We can also change the node size to the number of followers, and then we get an immediate view of who the main actors are that are also influential on Twitter in general; we have the New York Times here, and the Wall Street Journal.

So we can see that we have something like a more liberal versus a more conservative camp, and if we looked only at the retweet behaviour we might think that these are separated echo chambers and that people do not talk to each other. But what is interesting is to look at other types of networks in this example. We can look at the replies (I think I will make it a bit smaller), and all of a sudden we do not see this very strongly segregated clustering anymore that we saw before; maybe it is easier if I put it like this. We see something more like a hairball layout, and when we look at the nodes, we see that the path from, say, Donald Trump to Hillary Clinton or the New York Times, accounts that were very far apart in the retweet network, is maybe not that long in the reply network. That means these opposing camps actually do talk to each other, and it might be more interesting to see how they talk to each other and what they say. This is something you can do when you use this interface and look at the tweets and the actual replies: it allows you to go to the parts of the platform that generated this data and that then generated these networks.

Finally, as a small example of the semantic networks, we can look at the hashtags that are used. You see, for instance, that there is one kind of hateful, conservative hashtag cluster. And, okay, maybe I should have said that in the hashtag networks every node is a hashtag, and two hashtags are connected if they appear together in the same tweet. This is a very low-level way of seeing what is going on in the data: you don't need topic modelling or other complicated techniques; literally just by looking at the hashtags you already get a hint of how the different camps speak about the same topic. If you go to this area, it is about gun confiscation laws. "Marxism" is also a good example in this case: right now we don't really know how it is used, it can be used either by conservatives or by liberals, and it is important to look at it in the context of the data. So then we would have to... okay, five minutes left, good. I will go back to the slides.
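For reference, the co-hashtag construction just described fits in a few lines: every node is a hashtag, and an edge is added whenever two hashtags co-occur in one tweet. The records are again hypothetical.

    # Sketch of a co-hashtag ("semantic") network: nodes are hashtags, and two
    # hashtags are linked if they appear together in the same tweet.
    from itertools import combinations
    import networkx as nx

    tweets = [
        {"hashtags": ["redflaglaws", "guncontrol"]},
        {"hashtags": ["redflaglaws", "2a", "marxism"]},
    ]

    H = nx.Graph()
    for tweet in tweets:
        for h1, h2 in combinations(sorted(set(tweet["hashtags"])), 2):
            if H.has_edge(h1, h2):
                H[h1][h2]["weight"] += 1
            else:
                H.add_edge(h1, h2, weight=1)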
[21:14.520 --> 24:01.160]
So, under the hood, the whole backend of the collector and the visualizer is written in Python, and it uses the Streamlit Python library to serve it on a local front end. This is actually a very convenient library, and I guess a lot of people know it: you write your code in Python, and it essentially serves it as interfaces that look like this. The explorer itself is written in HTML and JavaScript; it uses d3 and draws the graph on canvas, which is also why it is probably not as fast as sigma.js, but it has some other nice features, especially thanks to the force-graph library. If anyone has questions, I can go into the details in the question session.

This is how you install it; it is fairly simple if you have a running Python 3.7 or newer. There is also a Python API; here, especially, people will probably not be so interested in using the Streamlit interface, but you may want to include it in an existing code pipeline, and this is essentially the API for semantic networks and interaction networks. So I invite you to try it out yourself while you still can; you have five days. Of course, if you have the research API you might be able to use it for a bit longer, but otherwise go to these websites fast.

I will end the talk with some questions. I actually came here with more questions than answers, and I am really hoping for a lively discussion, because I am not originally a developer (I wrote this largely on my own) and I wonder whether this integration of Python and JavaScript is actually a good idea, because in theory it would probably also be possible to do everything in JavaScript, maybe on the client side, so you would not have to install all these libraries. Okay, and maybe one thing that I would like to show is that I experimented with temporal networks. Doing temporal force layouts is of course a non-trivial task, but we can look a little bit at the temporality of these networks by at least displaying only the links that are active during a given day. This is also kind of nice, I think, but I would like to discuss other visualization paradigms for this kind of network.
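That day-by-day view can be sketched as a simple edge filter over timestamps: a minimal illustration that assumes each edge carries the datetimes of its underlying retweets, which is not necessarily how the tool stores them internally.

    # Sketch of the temporal view: keep only the links active during a given
    # day. The per-edge "timestamps" attribute is a hypothetical stand-in.
    from datetime import date
    import networkx as nx

    def links_active_on(graph: nx.DiGraph, day: date) -> nx.DiGraph:
        snapshot = nx.DiGraph()
        snapshot.add_nodes_from(graph.nodes(data=True))
        for source, target, data in graph.edges(data=True):
            timestamps = data.get("timestamps", [])  # datetimes of the retweets
            if any(ts.date() == day for ts in timestamps):
                snapshot.add_edge(source, target, **data)
        return snapshot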
[24:01.160 --> 29:16.600]
Then one thing that would be really interesting, I think, is to dig deeper into a visualization paradigm for the hierarchical structure of communities. In theory I can run stochastic block models or Louvain community detection and stop it at a certain level, and then I have some kind of hierarchical node structure, but how to visualize that is another question. I think it would be very interesting, especially for very large graphs. And then another question: force layouts. Should we still use them, now that everyone is doing node2vec and all these other embedding methods? I think yes, but maybe there are good arguments against it.

On a deeper conceptual level, and this first one is a question for people who already have much more experience in building tools for the social sciences: how do you further integrate these kinds of methods into existing, maybe also more qualitative, social science pipelines? That is an open question. And how can we devise something like a research protocol for these kinds of interactive network visualizations? Because, as you saw in my demo, we look at the big nodes, we look at the tweets they made, and it gives us some intuition of what is going on in the debate, but how can we formalize this kind of visual network analysis? I know there are people in the audience who actually work on this, so it will be very interesting for me to talk about it.

And finally, to end on a somewhat nicer note: there is the retweet network of FOSDEM, as I already mentioned in the beginning, on this website. It is updated every 15 minutes thanks to a data collection done by my colleague Beatrice. Thank you very much. If you go to this website you can see the retweet network of FOSDEM, and if you tweet, you can also find yourself in the network. So, what do we have here in the middle? Okay, FOSDEM itself, then there was Ubuntu, Debian somewhere... Okay, time's up. Thank you.

[Audience question, off-mic.] Yes. So the question is: I mentioned that you can only collect tweets from the last seven days with the free API. This is a limitation, but the tool itself just writes into an existing CSV, so it depends: if you do the same keyword search multiple times, it will simply append to that CSV.

[Audience question, off-mic.] Yes, I mean, that is the question right now. It depends, because the question is what happens on Mastodon. If you want to look at political controversies and that kind of discussion, I don't know whether Mastodon is mature enough yet, or adopted widely enough yet. If you want to look at the free-software community, it is great, so for that, yes.

[Audience question, largely unintelligible in the recording.] Yeah... I don't know what to think about that. Well, I don't know which point exactly I should address, because you raised a lot. So, is it okay if I rephrase? You are concerned about this kind of research because it can be used to track users across political camps, right? Yes? Okay, I see.
[29:16.600 --> 29:52.280]
So I think it is more about the representativity of Twitter data for the wider population, and there you are of course totally right: it is a subset of highly politicized, maybe also somewhat more educated than average, people, so you cannot generalize from it. But that is also not what we are trying to do; we are not trying to infer, I don't know, election results from Twitter data. So, yeah, I don't know whether I addressed the point; maybe we can talk more about it afterwards.