Thank you.
So well, this is an open source conference and I'm going to talk to you about ads on
Facebook ads of all things and then the Meta API.
So please let me explain.
Well, so I'm Jean-Glennard, I'm working at Viginum and Viginum is a French operational
department, technical department aiming at analyzing foreign digital interference.
So we have three big missions.
The first one is to detect and analyze this foreign information manipulation and interferences
and the social media and online.
Then we also lead and coordinate the French protection mechanism against these attacks.
And finally, we also contribute to international works and in particular European works to
study and understand these foreign digital interferences.
So that's why I'm here today.
And why do we care about ads?
Well, of course, it's hard to talk about ads on Facebook without talking about Cambridge
Analytica.
So it was a micro targeting scandal, very well known because of the involvement in the
Trump campaign in 2016.
And then there was a high suspicion of involvement in Brexit and possibly in other campaigns,
other elections.
And there might be other companies like Cambridge Analytica.
So historically, this is really a big thing in this information world.
But obviously, after these events, Facebook made some steps to provide some transparency
in that advertisement empire.
And they did open a web portal to at least let people know what were the ads that could
target people.
And then we had other operations.
And this is a long running operation, the Doppelganger.
This operation was already fully reported in September 2022 by EU in the fallout.
And it was linked by Meta at the time to Russia.
So this is actually a screenshot at the left of the web portal where you can see the ads.
And those are free fake pages sending the same ad with the aim of polarizing opinion.
And basically trying to manipulate the opinion.
And then the campaign re-suffaced with another shape.
And we made a report on this in June 2023.
And this time, it was like a big network of typosquad websites and media, like big media
in Europe, also governmental sites.
There were like thousands and thousands of Twitter bots and still Facebook ads.
And you see here a screenshot at the top.
And then there was this report in October 2023 by Resets focusing really on the Facebook
ads part of things.
And they found what they estimate to be hundreds of thousands of sleeper pages, pages that
could be mobilized to send fake advertisements to people.
That was in like a few months ago.
So this thing is really going on and on and on.
And this is actually a screenshot from last week when I was making this presentation.
So you see it's always the same thing, disseminating advertisements with the aim to polarize and
provide fake information to people.
So that's why we care about ads.
All right.
So the aim of the talk is to see how open are Facebook ads.
And the first half is about how to, a practical guide, how we can actually access this data.
And then I'm going to show you some analysis of what we can infer from this data.
So there are three official ways to access the data.
The first one is the daily report.
So basically you don't load a CSV file from a website, like there's a screenshot at the
top right here.
And it's very limited because it's only ads that are labeled as political.
So it's really a tiny fraction of the ads.
Then you have the web portal.
On the web portal you can see, you can target like specific country, a specific ad category,
enter some filters, filter by date, and then you see the ads.
And you have to scroll through all the ads.
So it's nice, but it's hard to do data science with this because it's slow and it's finicky.
It's not usable like for real analysis.
You also have the API.
So I'm going to talk a lot about that, where you can actually plug into the data and access
it.
But then it requires registration and coding skills.
And finally, there is this open source project.
And I really wanted to talk about it because it's great.
It's actually an ongoing mirror of the Facebook ad library.
And you will see the code online.
And you can kind of understand how you can access the API just by looking at the code.
And then there are data dumps hosted on Kaggle and other platforms.
So it's really a great project.
It's not official, but it's a great source of information.
All right.
So the how to path.
To access the data with the API, you have to verify as a meta developer.
And for that, you need to provide your cell phone or your debit or credit card information.
And then you have to go through another step.
And you have to verify your government ID.
In my case, I had to send both my ID card on my passport.
Because the first one was rejected.
And it's kind of ironic because if you want to just send an ad, they just ask you for
the debit card.
They don't ask you for anything else.
All right.
So now you have access to the API.
What can you do?
Well, there is pretty good documentation, but it doesn't document the most important
thing, which is the wild card.
It's not just wild card thing.
It's double quote, wild card, double quote.
And with this, you don't have to specify a topic of interest.
You can actually get all the ads.
And then you can work on it.
So it's quite interesting.
And then there are two things that I found a bit counterintuitive.
You have somewhat unspecified quota when you're associated with each token that you have.
And it's like a daily refresh of the quota of ads that you can get.
But then you don't have a limit on the number of tokens that you can create.
So you can just create as many tokens as you want and then get all the ads that you want.
That's the first thing.
And the second thing is that when you ask, like in your request for the ads, you have
to specify how many ads you want back.
So there is a pagination system.
And by default, it's something like 20 or 50.
So it's kind of very low.
But if you ask for a pagination that is too high, then you're just going to have an error.
And the error depends on the size of the text of the ads.
So then you have to try again with a different limit.
So actually, it requires coding to really get the data.
So that's two things I wanted to mention.
All right.
So go through the verification process.
You get your code working.
And this is what you see when you look at the ads in French language and targeting friends.
So kind of a limit is called.
One third of the ads are sent on Facebook, one third on Instagram.
And then the remaining ads are split evenly between Messenger and Audient's network.
Audient's network is kind of, they provide some kind of software thingy so that people
creating mobile apps can send ads and build upon meta like ad system.
So yeah, when you open your mobile app and there is some ad on it, it might use this
Audient's network thing.
So most of it is Facebook and Instagram.
And then when you look at the volume of ads through time, well, you see there's almost
nothing up to mid-2023.
And then you have data.
When people see this thing first, they think, oh, Facebook, they removed data.
And actually it's not that simple.
Actually it's the opposite.
They made a change in the API sometimes in July of 2023 and they really started to make
the data available.
Because prior to that point, only the social issues, elections or politics ads were available.
And then starting in July of 2023, you have housing and employment, credit and most importantly,
the bulk of the ads, like more than 95% of the ads is non-labelled.
So it's actually a good thing that happened a bit more than six months ago because now
we have the data to analyze.
And well, so we've been talking a bit more, we have been talking a bit about the social
issues, elections or politics.
I'm going to give you a bit more information on that.
So it's completely self-declared category.
So this is actually a screenshot.
I went through the process to publish an ad and stopped at the last moment.
And you can check the box if you want.
But then there is absolutely no enhancement mechanism if you don't want to check that
box.
So it's purely self-declared category.
Of course, if you check the box, then you're going to probably have your ID, maybe there
will be more scrutiny of your ad, maybe more moderation.
So obviously, people that want to do this information campaign are not going to check
the ad.
But can we make a model that would tell if an ad should actually have checked the box,
should have declared itself as political?
That model could be used to identify the doppelganger ads I've been showing you.
And it could also be used to estimate the volume of ads that should be labeled as social
issues, elections or politics.
So actually, we can.
And I'm going to show you how I built one.
But just at this point, how many French ads we should expect to be social issues, elections
or politics?
We do have some ideas from past work.
So in Brazil, this work, academic work, found 2.2% mislabeling.
So non-labelled ads that should be social issues, elections or politics.
And that other academic work found 2% to 4% mislabeling in the US.
So the same thing.
Non-labelled ads that should have been labelled.
Okay, so to build the model, what we need is some data and then some technical way to
build a classifier.
So the data I used is from this academic paper.
And it's great because it's in French language, but then you have other data sets in English
and in other languages.
So that's my data sets containing real political ads.
And then we also need a classifier.
So I'm using a large language model because everyone is doing that these days.
And I took specifically the Mistro AI 7B model and I optimized it with the Unclosed Library
because it's very fast and it's open source.
And just to give you another magnitude, it takes one hour on a free to use Google collab
instance.
So it's actually quite quick.
Anybody can do it.
Like resources and GPU or whatnot.
All right.
To dig a bit more in the process.
So to be really clear about what was done.
So we have this data set of 4,000 something political ads.
And it was collected in 2022 during the French presidential campaign.
And it actually categorized the data into nine topics.
Could be energy, social issues, these kind of categories.
And then we need some control.
We need to separate political from non-political.
And for the control, I just collected random data in the French scope in late 2023 and
targeting the non-labeled ads.
So I expect some contamination because some of these non-labeled ads should be political.
But I expect it to be kind of marginal, not so important in my data set and it's not going
to kind of prevent the model from learning.
But anyway, I will have a little bit of performance hits.
Then I split into train and test and I finetune my Mistro AI model.
And to finetune the Mistro AI model, I wanted to predict between one of these nine categories
or non-political.
And I do that to get a bit better results than if I just wanted to instruct the model
to predict between political and non-political.
Because then the model has to kind of think a bit harder about why it's political.
So it's kind of to improve the overall accuracy.
And actually, the results were good because I got more than 90% precision.
It was a very good recall as well.
So I was very happy at this point.
And then how can we actually use it?
Well, the first test is on the doppelganger ads.
So that's another example here on the top of the ads, top right.
And I got actually perfect classification on this because the text is so obviously political
that there is no doubt about it.
So all these ads, as I said, are non-labelled by Facebook.
And then the model says that they should all be political with focus on international
affairs.
So yeah, it's working well.
And in case you're interested, that's the prompt that I use.
So prompt in English asking to categorize into one of the nine categories or other.
So non-political.
That's the advertisement text in French.
And you can see that some words actually, they have spaces inside.
So here, chef, it's leader, UE, European Union, there is an underscore.
And negotiate here, there is an underscore.
So it's as if the text is aiming to kind of bypass some keyword detection.
And those ads were not labelled as political.
So maybe this is the kind of moderation, of automatic moderation that Facebook is doing.
Well, I'm happy because my model is able to identify all these ads.
But what if I want to really estimate how many ads beyond those ones should be labelled
as social issues, elections, or politics?
Well, for this, I took a random sample of like general non-labelled ads.
And I ran them through my model.
And then I wanted to really double check that the outcome of the model could be trusted.
So I asked two human annotators trained on MetaGuard lines to really make sure that these
ads should be, according to MetaGuard lines, political, labelled as political.
And I find that at least 1.9% of the non-labelled ads should be labelled as political.
So it's conservative estimate because some political ads might not be flagged by my model.
So it's really at least.
And it's aligned with previous research showing 2 to 4%.
So I'm not so surprised.
It might not seem much, but then if you take 2% of more than 95%, then you see that most
of the political ads are actually non-labelled, right?
Because the labelled political ads is 0.3%, roughly, in the French scope.
So this means that about 15% only of the political ads are labelled as such.
Well, that's about it.
So what we saw is that Facebook gave access to actually data and more and more data.
It can be used.
It's a bit tricky, but at least the data is there.
And here I talked about foreign interference, but this is relevant much more broadly.
Because you see many things in these ads that are not linked to foreign interference at all.
And I think the civil society should kind of look into that.
You have six months of fully open data, and you have this open source mirror that makes
it really easy to at least make sure that you don't miss any ad.
And my belief is that this self-labelling mechanism of social issues, electional politics, is
problematic in itself.
I think for this to work, it needs to be really coupled with a very strong moderation.
And as of now, I don't see this very strong moderation.
So maybe there are other ways that we can really make sure that political ads are not somehow
they don't go unnoticed.
I don't know.
That's about it.
I'm super happy to answer any questions.
Thanks a lot.
Now we can take some questions.
Paul has a question.
The first question is like, you have any idea why this change happened in obvious class
degrees in the NTI?
And the second question is with your model, what do you do with that at the end for the
visionary and the third developer?
When do you use this?
Okay.
So the first question was why didn't Meta kind of gave access to the NTI?
So the first question was why didn't Meta kind of gave access to more data in mid-2023?
And the second part of the question is how do we use it in practice?
Right.
So the first part, I don't know.
It might be linked to the DSA, the Digital Service Act, that is going to come into action
kind of soon.
It might not be linked at all.
I have no insight about that.
And the second part, of course, we try to automatize and to be much more reactive to
really understand more how Facebook ads can be used and easier for these information
purposes.
Any other questions?
Yeah.
Just on the paper you cited there, I can look into it and they use like a more traditional
mass language model.
Yeah.
So why did you choose to use this rather than the probably lighter model?
I don't remember this one.
It was a nice buy.
Sorry.
Yeah.
So the question was, so this paper used a somewhat traditional model, language model, and why
did I use a large language model?
Why go through all those troubles?
If I remember correctly, this one used naive bias, something, or maybe another one.
Well, so for me, large language models are perfect for this task because they embed some
knowledge of the world.
So it's not like it's not thinking only about, oh, in my training dataset, it's all about
Ukraine and Ukraine and Ukraine.
So I'm going to say everything linked to Ukraine is actually political.
It's more broad and it can predict things like this is kind of political and it's about
gender issues.
And then this will be generally valid beyond this or that specific issue that is in the
training set.
So I think the generalization abilities are much better.
I didn't yet.
I tried to do another approach which had other problems.
So I tried an urban approach and with KNN and the data that was also at all.
So I don't have a proper benchmark.
But the results are good.
So why not?
Another question.
After you find a mislabeled X, do you do something about it?
Is there a way to report it?
So the question is if someone finds a mislabeled add, what can they do about it?
Well, it's a bit disappointing.
There is a system and you can click and say this is wrong.
I want to report this add.
But then you have to indicate precisely the law that has been infringed to say why it's
illegal.
So you can easily report something about guns or something outright illegal.
These texts are just information manipulation campaigns.
So I think the reporting mechanism is lacking as of now.
Another question.
If I understood correctly, you are fine training misdraft on text only.
Would it be profitable to use the images because on all your examples there are images.
Can you process that in the future maybe to gain more precise?
Yeah.
So the question is can we also train on the image to have a much better classifier?
Obviously, yes.
Obviously, it's much harder because then you have different other magnitudes of data.
So yeah, we don't do it yet.
I thought it was very interesting.
You mentioned that Facebook needs more moderation.
Where do you foresee that?
What do you see the future of that if it was done well?
Do you see that as something humans do?
Do you see that as something like the models that you've just used or some combination?
All right.
So the question is how could Facebook enhance the moderation?
More hands or more models?
Both.
Both?
Well, both are possible if the hunch is right and if people can really escape moderation just by putting spaces,
then there is obvious improvement in having better models.
Yeah.
It might not be enough.
Can you talk about the content on the propaganda you see?
Can you teach some group that is trying to do some manipulation on people?
So the question is can I go into more details about what was the content of this information?
So I could, but it would take a long time.
I advise you to read all these reports and there is actually a using for lab page where there are at least 10 or so reports
and you will see it's extensive.
It goes in many directions and the information is perfectly available.
I have a question in the back.
From your experience of using the APIs in general,
you know that there is a way to actually associate the ads with the contents that were displayed when they were displayed.
Okay.
So the question is the ad are linked?
I'm not sure I got the question.
Do you have a way with the APIs to actually link the ads that you were studying with the contents that were displayed around the ads?
Oh, okay.
No, no, no.
So the question was can we infer the context?
Why an ad was shown to a user?
Was it because there was calling on some, for example, Ukrainian topic?
Was it something that they did that was not in relationship to Ukraine or whatnot?
No, we don't have this data available at all.
And that would be very precious.
Maybe one last question.
In your experience, from what you've seen, the posts that are displayed for ads are self-contained or they contain links to other platforms.
Yeah, so this report.
So the question was, was it self-contained?
Like was this information only in the ads or was it just a way to have people click and go somewhere else?
So this report in June 2023 really goes in details in what happened.
So actually there was a system of several redirections through different websites.
And then it was JavaScript redirections and then typosquad redirections and then like several layers of redirections to, at the end, bring people to malicious websites where there was more disinformation.
That's the idea.
And technically it was quite interesting to investigate.
So again, if you have time, these reports are fun to read.
Okay, let's thank again Jan.
Thank you.