How to remove the index.php part from the URLs of our Drupal-based CMS

Published by botond on Sat, 2020/09/26 - 13:43

Introduction

Drupal is a free, open-source CMS (content management system) that can be used to build a wide variety of websites, such as this one. It has many features, including support for SEO-friendly URLs, which are known to be beneficial for search engine rankings.

About a year and a half ago I noticed a bug on this site that affected these SEO URLs. Back then I had to hunt for a solution, and I fixed the problem. Recently the phenomenon appeared again, or perhaps I only just noticed it. It is possible that during one of the core upgrades I accidentally overwrote the previously modified files with the "factory" versions, so I had to make these changes again. Luckily, with the help of my earlier notes I solved it in a fraction of the time, so I thought I'd write a short piece about it, in case it proves useful to someone else, and so that it's at hand if I ever have to set it up again.

So in this short troubleshooting guide we'll look at how to get rid of the unnecessary index.php part in the URLs of our pages.

 

The symptom

When navigating a Drupal-based website, the "index.php" part suddenly appears in the address bar, at the beginning of the path that points to the subpage. Some examples from this site:

https://www.linuxportal.info/index.php/
https://www.linuxportal.info/index.php/leirasok
https://www.linuxportal.info/index.php/cikkek
https://www.linuxportal.info/index.php/leirasok/web-hoszting
https://www.linuxportal.info/index.php/enciklopedia

(I deliberately didn't make these URLs clickable links, so that search engines don't pick up the bad addresses unnecessarily; and by the time you read this the bug will have been fixed, so they would simply redirect back to the correct URL versions anyway.)

Another interesting thing: if we delete the "index.php/" part in the address bar, the page loads without it, just as it should. But after clicking any internal link on the page, the URL of the next page will again contain the unwanted "index.php/" part.

The problem

On the one hand this is annoying, because where SEO URLs are concerned there is no place for php file names or other parameters; on the other hand it causes more serious problems: if a search engine, Google for example, finds these incorrect URLs, it starts indexing them, which raises several issues:

  • New URLs appear in Google's index that start their SEO career from scratch: being new, no external links point to them, they have no SEO value yet, and so on.
  • Because the pages work both with and without the unnecessary index.php part, duplicate content appears in Google's index, which is also bad for ranking.

Here is a fresh example of all this from today, as seen in this site's Apache log:

URLs crawled by Google's crawler

Here, in the command, I first filtered for the "66.249" IP addresses, which are known to belong to the address range of Google's crawlers, then filtered for "index.php" so that only those lines are shown, and for the sake of the example I took the first 20 lines, which are still from the morning, before the fix.
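The command was roughly of the following form (the exact log path varies by distribution and vhost, so treat it as an assumption):

```shell
# Keep only requests coming from Google's 66.249.* crawler range,
# then only those containing "index.php", and show the first 20 lines.
# The log path is an assumption; it varies by distribution and vhost.
grep '66\.249' /var/log/apache2/access.log | grep 'index.php' | head -n 20
```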

What is clear here is that the requests all start with "index.php/", and that the server returned 200 response codes, i.e. it served the page content to the requester without any problem.

So the problem is that Google has started to "grind through" these incorrect URLs as well, and if we don't notice it in time, it will crawl a great many of them and our search engine ranking could deteriorate considerably.

How can we fix all this?

 

The solution

The solution isn't as complicated as you might first think: you only need to modify two files in the web root.

Modify the .htaccess file

.htaccess files can be used to control Apache's behavior locally, in a given directory or directory tree, within the limits allowed by the server configuration.
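Those limits are set server-side: .htaccess overrides only take effect if the main Apache configuration allows them. A minimal sketch of the relevant setting (the directory path is an assumption, adjust it to your own web root):

```apache
# Allow .htaccess files under the web root to override settings,
# including mod_rewrite rules. With "AllowOverride None" the
# .htaccess file would be silently ignored.
<Directory /var/www/html>
    AllowOverride All
</Directory>
```

On a server where Drupal already runs with clean URLs, this is typically configured already.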

So first, open the .htaccess file in the root directory of the webpage with your favorite editor:

nano .htaccess

Then insert the following snippet right at the beginning of the file and save it:

<IfModule mod_rewrite.c>
  RewriteEngine on

  # Redirect URLs beginning with "index.php/" back to the correct address with a 301:
  RewriteRule ^index\.php/$ / [R=301,L]
  RewriteRule ^index\.php/(.*) /$1 [R=301,L]
</IfModule>

All that happens here is that an Apache directive first checks whether the rewrite module responsible for redirections is available; then we turn on the rewrite engine and, with a 301 ("moved permanently") status code, redirect every request beginning with "index.php/" to the URL version without it. The "L" flag at the end of each rewrite rule ensures that Apache applies no further rewrite rules to the request, should there be any elsewhere in the .htaccess file.

Actually, we wrap the whole thing in the conditional directive only for good form: if a Drupal site is already running on the server, this module is certainly available.

This will prevent the "index.php/" part from appearing in our URLs from now on.
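The effect of the two rewrite rules can be sketched outside Apache as a plain string transformation. The following shell function is illustrative only, not part of the actual fix; it mimics what the rules do to the request path:

```shell
#!/bin/sh
# Illustrative sketch: mimic the two RewriteRule lines as a string
# transformation. A leading "index.php/" is stripped and the rest of
# the path is kept, just as /$1 does in the rewrite rule.
rewrite_path() {
  printf '%s\n' "$1" | sed -E 's#^index\.php/(.*)$#/\1#'
}

rewrite_path 'index.php/leirasok'   # → /leirasok
rewrite_path 'index.php/'           # → /
rewrite_path 'cikkek'               # → cikkek (no match, left unchanged)
```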

Afterwards it doesn't hurt to flush the site's cache, because incorrect URLs may still be stored in it. You can do this from the admin menu, or, if you use drush, with the following command:

drush cr

Modify the robots.txt file

The robots.txt file placed in the web root is a file intended for robots, in which we can place various directives and rules instructing crawlers how to behave. Many robots ignore this file, however; such robots are often called "bad robots". Fortunately Google's robots, which is what matters to us now, are not among them: they do take the robots.txt file into account. So now let's open this file for editing as well:

nano robots.txt

Add the following single line at the end and save it:

Disallow: /index.php/*

Or, if you prefer, you can do the same with a single command:

echo "Disallow: /index.php/*" >> robots.txt

This line instructs robots (those that honor the file) not to crawl URLs that begin with "/index.php/".
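For context, a minimal sketch of how the file might look afterwards. A real Drupal robots.txt contains many more rules; the point is only that the Disallow line belongs under a User-agent group:

```
User-agent: *
Disallow: /index.php/*
```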

This also corrects the error "retroactively": incorrect URLs that have already been indexed will gradually be removed from the search engines' indexes. This is of course a longer process whose duration depends on many factors, for example how often the robots visit our site.

 

The result

The fruits of our work can easily be verified by looking at the Apache log file again, only this time, instead of the first 20 lines, we check the last 20 lines, created after the fix:

Redirected URLs with 301 response codes

Here, too, we can see that Google's robots are still requesting the wrong URLs containing index.php, but now Apache no longer returns the previous 200 response code with the content; instead it returns 301 response codes.

And if we "expand" these rows with grep's "-C" option, we can also see one extra line before and after each match, so we can see what happens after a given request. In addition, I added the index.php filter once more at the end of the pipeline, so the matches are now highlighted in color and easier to review:
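A possible reconstruction of the command described (the log path is an assumption; the trick in the last grep is that the pattern "index.php|$" matches every line, so the context lines survive while index.php itself gets the color highlight):

```shell
# Filter for Googlebot addresses, take the last 20 lines, add one line
# of context around each index.php hit (-C 1), then re-grep with a
# pattern that matches every line ("|$") so context lines survive
# while index.php itself is highlighted in color.
# The log path is an assumption.
grep '66\.249' /var/log/apache2/access.log | tail -n 20 \
  | grep -C 1 'index.php' | grep --color=always -E 'index.php|$'
```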

URLs redirected to 301

Here you can see that the lines containing index.php, highlighted in red, receive a 301 response code, and each is followed by a request for the same URL without index.php, which already returns the correct 200 response code together with the content itself.

So Google is already in the process of correcting the incorrectly indexed URLs in its index.

You can also try this out with the broken links shown at the beginning of this article.

 

Conclusion

Fortunately, I noticed the problem in time, and I have not seen a drop in search traffic so far. In any case, when upgrading the core system, if you overwrite the old .htaccess and robots.txt files with fresh ones, don't forget to re-apply your previous modifications.