[00:00.000 --> 00:09.520] All right, so thanks for coming to this talk. [00:09.520 --> 00:11.880] I know there are a lot of interesting talks [00:11.880 --> 00:16.280] going on, so thanks for choosing this one. [00:16.280 --> 00:19.360] So I'm going to talk about search, more specifically [00:19.360 --> 00:25.280] about how to deploy a search system within an organization. [00:25.280 --> 00:29.000] So in today's talk, I'm going to cover these topics. [00:29.000 --> 00:32.680] It all started with engineers using multiple content [00:32.680 --> 00:34.520] management tools. [00:34.520 --> 00:38.920] I'm going to talk about how searching became a problem [00:38.920 --> 00:43.480] and how we used an OSS tool to address that problem; [00:43.480 --> 00:45.960] the name of the tool is Fess. [00:45.960 --> 00:50.640] And how we overcame some of the problems with Fess. [00:50.640 --> 00:53.440] I'm also going to touch on the contribution side [00:53.440 --> 00:54.400] of the project. [00:54.400 --> 00:59.680] And I'm going to share some insights and observations. [00:59.680 --> 01:07.320] OK, so here's the list of chapters, so let's go. [01:07.320 --> 01:10.920] So just quickly, me and the team. [01:10.920 --> 01:13.360] I'm an engineer at Toshiba. [01:13.360 --> 01:16.360] I've recently been maintaining the company's cloud [01:16.360 --> 01:20.360] infrastructure and trying to increase the scope [01:20.360 --> 01:23.520] and range of automation through these activities, [01:23.520 --> 01:27.000] because otherwise, without increasing the automation, [01:27.000 --> 01:28.200] we are doomed. [01:28.200 --> 01:30.480] But more importantly, here's the list [01:30.480 --> 01:33.600] of very capable, hard-working engineers who [01:33.600 --> 01:35.160] made this project possible. [01:35.160 --> 01:38.480] I have huge respect for them. [01:38.480 --> 01:41.120] So now onto the background. [01:44.200 --> 01:48.520] In large corporations like ours, [01:48.520 --> 01:51.080] we typically have lots and lots of group companies. [01:51.080 --> 01:54.520] And our team's job, as a software engineering center, [01:54.520 --> 01:59.400] is to deploy and provide tools to other Toshiba companies, [01:59.400 --> 02:04.800] like product development departments and R&D departments. [02:04.800 --> 02:08.280] Now, we have about 200 units of deployment. [02:08.280 --> 02:15.800] And we have these tools, which are basically heavily [02:15.800 --> 02:17.920] OSS-based. [02:17.920 --> 02:24.200] Now let me quickly turn on the laser pointer here. [02:24.200 --> 02:27.200] We have a few more, but these four [02:27.200 --> 02:32.320] are the core of our tools. [02:32.320 --> 02:36.320] Now, as we diversified the tools, [02:36.320 --> 02:40.160] searching increasingly became a problem. [02:40.160 --> 02:42.760] And there were two major reasons. [02:42.760 --> 02:46.960] For one thing, there is no easy way [02:46.960 --> 02:49.240] to search laterally. [02:49.240 --> 02:54.240] That is, sometimes we want to search all of these places [02:54.240 --> 02:58.200] exhaustively to make sure that we're not missing anything. [02:58.200 --> 03:00.400] But there's no easy way to do that.
[03:00.400 --> 03:03.200] And one more thing, one more problem, [03:03.200 --> 03:08.960] is that, as we found out, these tools are not quite cut out [03:08.960 --> 03:13.160] for searching inside certain binary files, like PDF files [03:13.160 --> 03:17.640] and Office document files, which are really something [03:17.640 --> 03:19.440] we use quite often. [03:19.440 --> 03:23.880] So what all of this leads to is that, ideally, what we want [03:23.880 --> 03:29.840] is a single search box that, given a query, searches [03:29.840 --> 03:34.880] all the places, really no matter where the documents are [03:34.880 --> 03:38.800] and what the formats are. [03:38.800 --> 03:43.800] So that's what we are going for. [03:43.800 --> 03:47.640] However, this is going to be a daunting task, [03:47.640 --> 03:52.720] because such a tool would not only have to [03:52.720 --> 03:57.000] solve the two problems that I talked about, [03:57.000 --> 04:05.160] but it would also have to come with all the essential features, [04:05.160 --> 04:08.360] both on the user side and on the admin side. [04:08.360 --> 04:13.000] That is, we have to be able to easily set up crawlers [04:13.000 --> 04:18.600] and all those things, and run them and maintain them easily. [04:18.600 --> 04:23.600] But there is a tool specifically designed for this task. [04:23.600 --> 04:25.800] And the name of the tool is Fess. [04:25.800 --> 04:28.240] So next, I'm going to quickly talk [04:28.240 --> 04:33.160] about what this tool is and what it's like to use it, [04:33.160 --> 04:37.240] as I don't think it is a particularly well-known tool. [04:37.240 --> 04:41.320] So Fess is, as the README on the GitHub repo says, [04:41.320 --> 04:47.000] a powerful but easily deployable enterprise search server. [04:47.000 --> 04:50.160] Enterprise search here describes software [04:50.160 --> 04:53.720] for searching information within an enterprise, [04:53.720 --> 04:56.360] as opposed to web search, like Google and DuckDuckGo. [04:59.280 --> 05:03.160] Now, Fess uses Elasticsearch as its search engine, [05:03.160 --> 05:07.600] meaning that indexing certain binary files, like Office files [05:07.600 --> 05:12.080] and PDF files, is more or less automatic. [05:12.080 --> 05:14.760] And one notable feature of this tool [05:14.760 --> 05:18.520] is that it comes with several types of crawlers. [05:18.520 --> 05:21.200] There's one for web pages, and there's also [05:21.200 --> 05:25.640] one for file systems, like directory hierarchies. [05:25.640 --> 05:28.560] And there's one for databases as well. [05:28.560 --> 05:31.440] All of this is to get data from many different kinds [05:31.440 --> 05:33.320] of sources. [05:33.320 --> 05:35.600] And if you look at the screenshots, [05:35.600 --> 05:37.400] there is a search box. [05:37.400 --> 05:40.040] And they also have an admin console. [05:40.040 --> 05:42.000] And the search engine results page [05:42.000 --> 05:46.800] will look familiar to many, I think. [05:46.800 --> 05:51.120] This tool is developed by a company named CodeLibs, [05:51.120 --> 05:54.480] which is a company that develops and open-sources tools. [05:54.480 --> 05:56.200] And they have a lot of experience [05:56.200 --> 05:59.240] engaging with the OSS community. [05:59.240 --> 06:05.800] Now let's take a quick look at how this tool works [06:05.800 --> 06:09.880] by looking at one of its core features, which is the web crawler.
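To give a feel for that single search box from the programmatic side, here is a minimal sketch of querying a Fess instance's JSON search API from Python. It assumes a test instance on localhost:8080; the endpoint path and the response field names vary between Fess versions, so treat them as assumptions rather than a reference.

# Minimal sketch: query a local Fess instance's JSON search API.
# The endpoint path and response fields are assumptions and differ across Fess versions.
import json
import urllib.parse
import urllib.request

FESS_BASE = "http://localhost:8080"  # assumed local test instance

def search(query, size=10):
    params = urllib.parse.urlencode({"q": query, "num": size})
    url = f"{FESS_BASE}/json/?{params}"  # older-style JSON endpoint (assumption)
    with urllib.request.urlopen(url) as resp:
        body = json.load(resp)
    # Drill into the result list defensively, since the schema is version-dependent.
    data = body.get("response", body)
    for doc in data.get("result", data.get("data", [])):
        print(doc.get("title"), "->", doc.get("url"))

if __name__ == "__main__":
    search("deployment guide")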
[06:09.880 --> 06:11.320] So, the web crawler. [06:11.320 --> 06:15.400] It's basically the backbone of this whole system. [06:15.400 --> 06:19.400] And I think the concept is familiar to everyone. [06:19.400 --> 06:24.960] It basically crawls and indexes web pages, web page contents, [06:24.960 --> 06:27.840] and uploaded files. [06:27.840 --> 06:32.000] So the way you create web crawlers on Fess [06:32.000 --> 06:35.480] is you go to the admin console and then [06:35.480 --> 06:38.280] set these parameters for the web crawler. [06:38.280 --> 06:40.200] Now, there are quite a few parameters here, [06:40.200 --> 06:44.000] but I'm going to focus on a few important ones. [06:44.000 --> 06:47.920] First, there is, of course, the URLs field. [06:47.920 --> 06:52.440] And then you can include and exclude certain URLs. [06:52.440 --> 06:57.800] Fess respects robots.txt, but sometimes a robots.txt [06:57.800 --> 07:03.080] file doesn't disallow certain not-so-relevant pages. [07:03.080 --> 07:07.480] In that kind of case, this comes in handy. [07:07.480 --> 07:11.640] There are also fields like depth and max access count, [07:11.640 --> 07:17.640] which you probably want to set to a very large value [07:17.640 --> 07:21.840] so that the crawler is not going to stop too early. [07:21.840 --> 07:26.520] And then we come to the permissions parameter. [07:26.520 --> 07:29.840] I think this one needs a little bit more explanation. [07:29.840 --> 07:35.320] This parameter is where you can implement per-user access [07:35.320 --> 07:36.120] control. [07:36.120 --> 07:40.680] That is, I hope the font is large enough, [07:40.680 --> 07:45.160] but when you list users like that, [07:45.160 --> 07:49.440] and let's say the crawlers index everything [07:49.440 --> 07:54.600] and search is ready, then when users search something, [07:54.600 --> 07:58.880] only the users listed there see the results. [07:58.880 --> 08:04.320] But notice that this setting is on a per-web-crawler basis, [08:04.320 --> 08:07.760] meaning that if you have 100 projects on GitLab, [08:07.760 --> 08:10.280] you're going to need 100 web crawlers, which is a lot. [08:10.280 --> 08:12.840] So clearly, some kind of automation is necessary. [08:12.840 --> 08:15.200] I'll get back to this point later. [08:15.200 --> 08:20.960] One more thing to mention here is that the user names here [08:20.960 --> 08:27.600] can be users on Fess, as Fess has its own users and groups, [08:27.600 --> 08:32.120] but they can also be users authenticated [08:32.120 --> 08:34.840] by an LDAP directory service. [08:34.840 --> 08:40.280] There's an option to configure this on Fess. [08:40.280 --> 08:46.440] So I hope that gave you some feel for how things work on Fess. [08:46.440 --> 08:52.440] Now let's move on to the customization part of the talk. [08:52.440 --> 08:55.280] No tool is perfect, and Fess is no exception. [08:55.280 --> 09:01.120] So we had to customize and patch Fess in a few ways. [09:01.120 --> 09:04.960] Just quickly, here is a list of patches. [09:04.960 --> 09:07.080] Our dev team engineers over time [09:07.080 --> 09:10.960] wrote more than a few patches. [09:10.960 --> 09:15.560] The general quality improvement patches and bug [09:15.560 --> 09:18.200] fix patches have been merged upstream. [09:18.200 --> 09:24.680] But there are also more experimental patches [09:24.680 --> 09:27.240] that are very specific to our problem. [09:27.240 --> 09:29.280] And those are kept proprietary.
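Going back to the crawler settings for a moment, here is a rough sketch of one such per-crawler configuration the way an automation script might hold it before pushing it to Fess. The field names simply mirror the admin console labels, and the "{user}name" permission syntax is an assumption about how Fess expresses per-user permissions; both should be checked against the Fess version in use.

# Rough sketch of one per-crawler configuration, as a script might hold it.
# Field names mirror the admin console labels; the exact API field names and the
# "{user}name" permission syntax are assumptions that may differ by Fess version.
web_crawler_config = {
    "name": "gitlab-project-foo",                     # hypothetical crawler name
    "urls": "https://gitlab.example.com/group/foo/",  # starting URLs
    "included_urls": "https://gitlab.example.com/group/foo/.*",
    "excluded_urls": ".*/commits/.*",                 # skip noisy, not-so-relevant pages
    "depth": 100,                                     # keep these large so the crawl
    "max_access_count": 100000,                       # doesn't stop too early
    # Per-user access control: only these users see hits from this crawler.
    # They can be local Fess users or LDAP-authenticated accounts.
    "permissions": "{user}alice\n{user}bob",
}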
[09:29.280 --> 09:33.000] And I'm going to talk about two of those patches. [09:33.000 --> 09:35.800] The first one is authentication for web crawlers. [09:35.800 --> 09:38.600] Now, most of the pages of GitLab and Redmine [09:38.600 --> 09:42.480] are behind login pages. [09:42.480 --> 09:47.960] So the web crawler has to authenticate itself [09:47.960 --> 09:49.400] as it makes its way. [09:49.400 --> 09:52.200] Now, Fess has a mechanism for this. [09:52.200 --> 09:56.000] That is, you can create a web authentication object [09:56.000 --> 09:59.560] and attach it to each web crawler. [09:59.560 --> 10:04.720] This works out of the box in some cases, [10:04.720 --> 10:09.800] if the login form page is fairly standard. [10:09.800 --> 10:17.000] But our GitLab uses SAML. [10:17.000 --> 10:19.600] And as it turns out, Fess does not support this. [10:19.600 --> 10:22.920] So we had to do some patching. [10:22.920 --> 10:26.600] To just go over how the patching works [10:26.600 --> 10:28.520] at the conceptual level, what we did [10:28.520 --> 10:31.960] is we defined extra optional parameters [10:31.960 --> 10:36.480] that the admin can write in the console. [10:36.480 --> 10:41.040] That is, if there are parameters [10:41.040 --> 10:44.240] starting with "saml_", the patched parser [10:44.240 --> 10:46.920] of this form is going to pick them up and store them. [10:46.920 --> 10:52.160] And later, the web crawler is going to see these parameters, [10:52.160 --> 10:57.920] recognize that SAML authentication, a SAML login, [10:57.920 --> 11:05.160] needs to be attempted, and run extra SAML-specific logic. [11:05.160 --> 11:09.080] So that's patch one. [11:09.080 --> 11:15.440] And then the second one is about repository contents. [11:15.440 --> 11:18.440] Many of the repositories we have on GitLab [11:18.440 --> 11:22.200] are several gigabytes in size. [11:22.200 --> 11:27.280] And both GitLab and Redmine have pages [11:27.280 --> 11:29.720] to view repository files. [11:29.720 --> 11:32.200] So in theory, if you wait long enough, [11:32.200 --> 11:37.920] the web crawler is going to index all these contents [11:37.920 --> 11:39.560] through those pages, in theory. [11:39.560 --> 11:42.040] But this turned out to be a complete non-starter [11:42.040 --> 11:47.840] because it's too slow, and quite understandably so. [11:47.840 --> 11:52.600] The reason is that the web crawler is [11:52.600 --> 11:54.760] going to make HTTP requests, [11:54.760 --> 11:58.280] and GitLab fetches the file, just one file, from the repo, [11:58.280 --> 12:00.240] and then renders it to the web page. [12:00.240 --> 12:05.320] So there are just too many steps to just get [12:05.320 --> 12:07.000] the content of one file. [12:07.000 --> 12:13.160] So what we did is first clone the repository contents [12:13.160 --> 12:17.720] to a local file system, and then run [12:17.720 --> 12:22.720] the file crawler, which is a crawler for directory hierarchies, [12:22.720 --> 12:25.760] and do everything locally. [12:25.760 --> 12:30.560] Now, this more or less solved the problem of speed. [12:30.560 --> 12:36.520] But one problem is that since everything is done locally, [12:36.520 --> 12:38.480] what gets stored in the search indices [12:38.480 --> 12:41.720] are the filesystem paths.
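To make the idea of prefixed optional parameters a little more concrete, here is a small Python sketch of what the patched parser conceptually does: it scans the free-form config parameters text from the admin console and collects the entries whose keys carry a given prefix. The real patch is Java code inside Fess, and parameter names like saml_username are hypothetical; this only illustrates the mechanism described above.

# Conceptual sketch only: the real patch is Java code inside Fess,
# and parameter names such as "saml_username" are hypothetical.

def parse_prefixed_params(config_text: str, prefix: str) -> dict:
    """Collect key=value lines whose keys start with the given prefix."""
    params = {}
    for line in config_text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        if key.startswith(prefix):
            params[key[len(prefix):]] = value
    return params

# What an admin might type into the crawler's config parameters field.
config_text = """
saml_idp_url=https://idp.example.com/sso
saml_username=crawler-bot
saml_password=secret
"""

saml_params = parse_prefixed_params(config_text, "saml_")
if saml_params:
    # The patched crawler would see these and attempt a SAML login,
    # running its extra SAML-specific logic before crawling.
    print("SAML login needed:", sorted(saml_params))

The second patch uses the same pattern for its prefix URL and map URL parameters, which address the filesystem-path problem just mentioned.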
[12:41.720 --> 12:45.120] So we had to remap these to the URL, [12:45.120 --> 12:48.680] so that later, when the user clicks a link, [12:48.680 --> 12:54.680] it takes the user to the repository file page [12:54.680 --> 12:56.960] on GitLab. [12:56.960 --> 13:01.240] What we did is, again, we defined [13:01.240 --> 13:06.360] custom optional parameters that the admin can [13:06.360 --> 13:10.640] write in the console's config parameters field, [13:10.640 --> 13:15.640] specifically these prefix URL and map URL parameters. [13:15.640 --> 13:19.440] When these parameters are present, [13:19.440 --> 13:22.040] the parser is going to pick them up, [13:22.040 --> 13:27.080] and then later the crawler is going to see these parameters. [13:27.080 --> 13:32.560] And if they are present, it will perform the remapping. [13:32.560 --> 13:35.440] So this is the conceptual overview [13:35.440 --> 13:38.320] of how this patching works. [13:41.080 --> 13:44.520] Now, most of these parameters are [13:44.520 --> 13:47.800] related to the web driver client. [13:47.800 --> 13:51.760] And there is one thing to note about this web driver [13:51.760 --> 13:58.120] client and Fess 14: the web driver client [13:58.120 --> 14:03.120] is discontinued in Fess 14, which is the latest version. [14:03.120 --> 14:09.760] Fess 14 has a Playwright-based crawler, [14:09.760 --> 14:12.600] which is still in development; there is more information [14:12.600 --> 14:14.040] in the appendix. [14:14.040 --> 14:17.960] Now I'm going to talk about another important subject, [14:17.960 --> 14:20.120] which is automation. [14:20.120 --> 14:25.280] As you might have guessed, our configuration [14:25.280 --> 14:29.960] grew more and more complicated, as it always does. [14:29.960 --> 14:35.720] For instance, we have quite a few configuration tweaks [14:35.720 --> 14:38.080] for each Fess instance. [14:38.080 --> 14:42.080] So there are lots of manual edits of config files. [14:42.080 --> 14:45.360] But these are taken care of by Ansible and Dockerfiles, [14:45.360 --> 14:47.080] and I think that's standard. [14:47.080 --> 14:50.480] But perhaps a more interesting case [14:50.480 --> 14:53.240] is that we have to create several hundred web [14:53.240 --> 14:56.480] crawlers per Fess instance. [14:56.480 --> 14:58.800] The reason is, typically on GitLab, [14:58.800 --> 15:00.320] you have projects. [15:00.320 --> 15:03.400] And for each project, you have members. [15:03.400 --> 15:07.040] And what you want to make sure is that, when [15:07.040 --> 15:10.600] a user searches something, they get [15:10.600 --> 15:16.880] to see only the resources they have access to. [15:16.880 --> 15:28.240] To automate the creation of web crawlers in such a case, [15:28.240 --> 15:33.360] Fess has APIs, just like GitLab has APIs. [15:33.360 --> 15:39.120] To explain how this is handled, [15:39.120 --> 15:42.440] let's look at a sample script; you [15:42.440 --> 15:46.680] can combine the GitLab APIs and the Fess APIs. [15:46.680 --> 15:49.800] First, you can use the GitLab APIs to get all the projects. [15:49.800 --> 15:53.760] Then, for each project, you can get the list of members. [15:53.760 --> 15:57.360] And then, using that list of members, [15:57.360 --> 16:02.040] you can create a web crawler. [16:02.040 --> 16:06.360] This is where the Fess API comes in, to create the web crawler.
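As a rough illustration of the kind of sample script being described, here is a minimal sketch that combines the GitLab REST API with a Fess admin API call: list the projects, list each project's members, and create one web crawler config per project, with permissions limited to those members. The GitLab endpoints are the standard /api/v4 ones; the Fess endpoint path, the payload field names, and the "{user}name" permission syntax are assumptions to be checked against the Fess version in use.

# Sketch only: the Fess-side endpoint path and field names are assumptions.
import requests

GITLAB = "https://gitlab.example.com"    # hypothetical GitLab instance
FESS = "http://fess.example.com:8080"    # hypothetical Fess instance
GITLAB_HEADERS = {"PRIVATE-TOKEN": "glpat-..."}        # GitLab personal access token
FESS_HEADERS = {"Authorization": "fess-admin-token"}   # Fess admin API token (assumption)

def gitlab_get(path, **params):
    r = requests.get(f"{GITLAB}/api/v4{path}", headers=GITLAB_HEADERS,
                     params={"per_page": 100, **params})
    r.raise_for_status()
    return r.json()

# One web crawler config per GitLab project, restricted to that project's members.
for project in gitlab_get("/projects"):
    members = gitlab_get(f"/projects/{project['id']}/members/all")
    permissions = "\n".join(f"{{user}}{m['username']}" for m in members)

    config = {
        "name": f"gitlab-{project['path_with_namespace']}",
        "urls": project["web_url"] + "/",
        "included_urls": project["web_url"] + "/.*",
        "depth": 100,
        "max_access_count": 100000,
        "permissions": permissions,
    }
    # Hypothetical Fess admin API call to create the web crawler config.
    r = requests.put(f"{FESS}/api/admin/webconfig/setting",
                     headers=FESS_HEADERS, json=config)
    r.raise_for_status()
    print("created crawler for", project["path_with_namespace"])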
[16:06.360 --> 16:10.720] And then you can also create a web authentication object [16:10.720 --> 16:13.040] and attach it to that web crawler. [16:13.040 --> 16:18.720] So the Fess APIs are mostly like the GitLab APIs. [16:18.720 --> 16:21.080] And for those who have used them, [16:21.080 --> 16:23.920] I think they'll be fairly intuitive. [16:28.640 --> 16:32.320] So that's the quick intro to the Fess APIs. [16:32.320 --> 16:38.280] Now I'm going to share some insights and observations [16:38.280 --> 16:40.400] that I can make. [16:40.400 --> 16:44.120] The first is: did Fess solve our problems? [16:44.120 --> 16:48.200] And the answer would be definitely yes. [16:48.200 --> 16:53.840] Users can now search across tools [16:53.840 --> 16:56.800] and inside binary files. [16:56.800 --> 17:00.080] And this turned out to be quite powerful, as, for instance, [17:00.080 --> 17:03.000] even if the file is a docx file, [17:03.000 --> 17:06.760] or even if it's in the legacy doc format, [17:06.760 --> 17:10.240] and even if it's in a very obscure location [17:10.240 --> 17:16.040] in a very old repository, in deeply nested places, [17:16.040 --> 17:20.560] users can actually find text inside the file. [17:20.560 --> 17:23.400] So this turned out to be quite powerful. [17:23.400 --> 17:27.360] But it's not without problems, though. [17:27.360 --> 17:30.560] One problem is performance. [17:30.560 --> 17:34.560] If you want to index the contents of a GitLab instance that [17:34.560 --> 17:39.240] has, say, several hundred projects and several thousand [17:39.240 --> 17:42.640] issues, using a Fess instance running [17:42.640 --> 17:46.560] on this level of computing resources, [17:46.560 --> 17:51.600] it takes us about a couple of days to index everything. [17:51.600 --> 17:56.440] And this is something we are trying to improve on, [17:56.440 --> 18:01.640] like how to incrementally index contents, [18:01.640 --> 18:04.520] and using some other techniques. [18:04.520 --> 18:09.800] So for now, that's it for this talk. [18:09.800 --> 18:11.760] Thank you so much for listening. [18:11.760 --> 18:14.720] And I want to open this up to the Q&A session. [18:14.720 --> 18:35.080] So you have different types of resources, right? [18:35.080 --> 18:38.760] And every resource has different properties, I guess. [18:38.760 --> 18:43.520] So do you index the same properties for every source? [18:43.520 --> 18:46.680] You have a set of properties to index. [18:46.680 --> 18:50.200] Are you only indexing content? [18:50.200 --> 18:52.800] So I'm trying to understand the question. [18:52.800 --> 18:55.160] When you're indexing documents, you [18:55.160 --> 18:58.840] are indexing the content of the document, the text of the document? [18:58.840 --> 19:01.480] Or are you also indexing some of the properties, [19:01.480 --> 19:06.840] like the name, the title, the author? [19:06.840 --> 19:08.720] I would say yes. [19:08.720 --> 19:10.360] So let me repeat the question. [19:10.360 --> 19:16.960] The question is whether Fess indexes only the contents, [19:16.960 --> 19:24.640] like the text inside the file, or whether it also indexes things [19:24.640 --> 19:29.080] such as metadata, like the title field of a PowerPoint file; [19:29.080 --> 19:32.840] they have several metadata fields. [19:32.840 --> 19:33.840] Yes, it does. [19:33.840 --> 19:37.360] It indexes that metadata as well. [19:37.360 --> 19:42.840] And that is handled by, most likely, Elasticsearch. [19:42.840 --> 19:51.120] But yes, that metadata is indexed.
[19:51.120 --> 19:56.200] And actually, it shows the metadata title [19:56.200 --> 20:01.080] on the search results page, if present. [20:01.080 --> 20:03.360] So yeah, I guess that's fine. [20:03.360 --> 20:07.360] The second one is a quick one: are you accepting contributions [20:07.360 --> 20:09.360] to the project? [20:09.360 --> 20:11.280] It's an open project, and you are accepting [20:11.280 --> 20:14.760] contributions and new plugins? [20:14.760 --> 20:16.760] I'm sorry. [20:16.760 --> 20:22.440] So your question is whether we are accepting contributions [20:22.440 --> 20:24.840] to the...? [20:24.840 --> 20:32.960] On our side of it, I don't quite think so. [20:32.960 --> 20:39.560] So you have a catalogue of connectors to all of them, [20:39.560 --> 20:42.560] and so on, to Elasticsearch, and so on. [20:42.560 --> 20:44.520] So for instance, I was thinking of CMIS, [20:44.520 --> 20:47.560] which is a standard for content management. [20:47.560 --> 20:50.560] I was thinking of trying to contribute this new connector [20:50.560 --> 20:54.680] to the Fess project. [20:54.680 --> 20:56.480] Is that an option? [20:56.480 --> 20:57.280] I'm sorry. [20:57.280 --> 21:00.720] I'm trying to understand the question. [21:00.720 --> 21:04.080] You said something about connectors, right? [21:04.080 --> 21:07.240] I'm calling it a connector, but I don't know if connector [21:07.240 --> 21:08.960] is the right word for you. [21:08.960 --> 21:11.840] I mean, you have a specific crawler [21:11.840 --> 21:15.200] for every different system, right? [21:15.200 --> 21:16.920] You have a crawler for Elasticsearch, [21:16.920 --> 21:25.720] a crawler for Office, a crawler for certain other systems, right? [21:25.720 --> 21:32.400] So we could add new crawlers for different systems. [21:32.400 --> 21:33.280] Yes, yes. [21:33.280 --> 21:36.680] I mean, Fess has several different crawlers [21:36.680 --> 21:40.520] for different types of resources. [21:40.520 --> 21:46.640] And so the question is, are we [21:46.640 --> 21:54.920] going to be accepting new types of crawlers as contributions? [21:54.920 --> 21:57.160] Is that the question? [21:57.160 --> 21:57.640] Yes. [21:57.640 --> 21:59.640] Right. [21:59.640 --> 22:03.080] We're kind of, sort of, like a corporation, [22:03.080 --> 22:06.120] working on our side, working [22:06.120 --> 22:09.480] as a project inside a corporation. [22:09.480 --> 22:14.440] So we ourselves are not, at the moment, [22:14.440 --> 22:18.960] accepting contributions to the project, but yeah. [22:18.960 --> 22:19.960] So yeah. [22:19.960 --> 22:21.960] Perfect, thank you. [22:21.960 --> 22:22.960] Another question? [22:22.960 --> 22:23.460] Yes. [22:23.460 --> 22:26.440] Are you going to publish the slides that you presented? [22:26.440 --> 22:27.440] OK. [22:27.440 --> 22:28.920] The slides that you presented, [22:28.920 --> 22:31.440] are you going to publish them somewhere? [22:31.440 --> 22:33.160] Or are they on their own page? [22:33.160 --> 22:34.520] Yes. [22:34.520 --> 22:37.640] Yeah, they are on the conference website. [22:37.640 --> 22:41.520] So you should be able to download them. [22:41.520 --> 22:43.200] I'm sure I'll be able to. [22:43.200 --> 22:44.320] The next one is a short one. [22:44.320 --> 22:47.640] You said indexing takes about several days [22:47.640 --> 22:50.320] for this project you have. [22:50.320 --> 22:50.820] Yes. [22:50.820 --> 22:52.800] It's about re-indexing.
[22:52.800 --> 22:57.960] So if content changes, do you re-index everything each time, [22:57.960 --> 23:01.800] making the whole index new, or is it fast? [23:01.800 --> 23:04.320] Does it just re-index certain documents? [23:04.320 --> 23:08.760] So the question is, let's say after indexing everything, [23:08.760 --> 23:13.000] then in the subsequent runs of the web crawlers [23:13.000 --> 23:20.520] and all kinds of crawlers, are they updated incrementally, [23:20.520 --> 23:23.840] faster, or do we have to re-index everything? [23:27.680 --> 23:37.960] So Fess tries to crawl efficiently. [23:37.960 --> 23:41.360] It tries to ignore the contents of web pages [23:41.360 --> 23:42.760] that haven't been updated. [23:42.760 --> 23:47.200] It checks the last-modified field. [23:47.200 --> 23:51.400] But the mechanism of incremental crawling [23:51.400 --> 23:56.840] is not ideal. [23:56.840 --> 24:02.200] For instance, the last-modified field is not quite [24:02.200 --> 24:03.960] well enforced. [24:03.960 --> 24:12.120] Only certain types of static web pages use it. [24:12.120 --> 24:18.200] And so there is a lot of unnecessary re-indexing [24:18.200 --> 24:19.400] happening. [24:19.400 --> 24:23.720] So there are some mechanisms to only index things [24:23.720 --> 24:25.000] that have been updated. [24:25.000 --> 24:30.880] But I've got to say that that mechanism doesn't work very well. [24:30.880 --> 24:32.440] And there is a lot of re-indexing. [24:32.440 --> 24:45.240] So the subsequent crawling is not as efficient [24:45.240 --> 24:46.240] as we want. [24:46.240 --> 24:48.240] That's how it is. [24:48.240 --> 24:48.740] Yeah. [24:51.720 --> 24:52.720] Yes. [24:52.720 --> 24:53.220] OK. [24:53.220 --> 24:54.080] Thank you very much. [24:54.080 --> 25:02.680] Thank you very much.