Hi, I have the following rewrite rule in place on one
of our staging sites to redirect bots and malicious
scripts to our corporate page:
RewriteCond %{HTTP_USER_AGENT} ^$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*(<|>|'|%0A|%0D|%27|%3C|%3E|%00).* [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^.*(HTTrack|clshttp|archiver|loader|email|nikto|miner|python).* [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^.*(winhttp|libwww\-perl|curl|wget|harvest|scan|grab|extract).* [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^.*(Googlebot|SemrushBot|PetalBot|Bytespider|bingbot).* [NC]
RewriteRule (.*) https://guardiandigital.com$1 [L,R=301]
However, it doesn't appear to always work properly:
66.249.68.6 - - [08/Jul/2024:11:43:41 -0400] "GET /robots.txt HTTP/1.1" 200 343 r:"-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 0/5493 1145/6615/343 H:HTTP/1.1 U:/robots.txt s:200
Instead of changing my rules and then waiting for the condition
to be met again (Googlebot crawling the site), I'd like to
simulate the above request against my ruleset to see whether it
matches. Is this possible?
For the user agent, just install a browser extension that lets
you "fake" the value, then make an HTTP request. Alternatively,
you can use curl.
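If it helps, here's an offline sketch (the `ua_matches` helper is my own, not anything from mod_rewrite) that tests a user-agent string against the same alternations your RewriteCond lines use; `grep -E` understands the same ERE syntax, and `-i` mimics the [NC] flag:

```shell
# Hypothetical helper: does a user-agent string match any of the
# RewriteCond patterns from the ruleset above?
ua_matches() {
  ua="$1"
  # the ^$ condition: an empty user agent always matches
  [ -z "$ua" ] && return 0
  # the remaining conditions, joined into one case-insensitive ERE
  printf '%s' "$ua" | grep -Eiq \
    "(<|>|'|%0A|%0D|%27|%3C|%3E|%00)|HTTrack|clshttp|archiver|loader|email|nikto|miner|python|winhttp|libwww-perl|curl|wget|harvest|scan|grab|extract|Googlebot|SemrushBot|PetalBot|Bytespider|bingbot"
}

# The user agent from the log line above:
if ua_matches 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'; then
  echo "would be redirected"
else
  echo "would pass through"
fi
```

To exercise the live ruleset instead, something like `curl -sI -A 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)' https://your-staging-host/robots.txt` (hostname is a placeholder) should show the 301 and its Location header if the rule fires.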
I should have mentioned that this was part of a larger effort to
redirect bots while also blocking some clients outright and
letting authorized users through. Here's what I've come up with,
which seems to work quite well. It all has to live in .htaccess,
because .htaccess is processed after the virtual-host config and
overrides any RequireAll/RequireAny entries that already appear
there. I also learned that RequireAny denies by default: access
is granted only if at least one of its Require directives
succeeds.
RewriteCond %{HTTP_USER_AGENT} ^$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*(<|>|'|%0A|%0D|%27|%3C|%3E|%00).* [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^.*(HTTrack|clshttp|archiver|loader|email|nikto|miner|python).* [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^.*(winhttp|libwww\-perl|curl|wget|harvest|scan|grab|extract).* [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^.*(Googlebot|SemrushBot|PetalBot|Bytespider|bingbot).* [NC]
RewriteRule (.*) https://guardiandigital.com/$1 [L,R=301]
SetEnvIf user-agent "(?i:GoogleBot)" googlebot=1
SetEnvIf user-agent "(?i:SemrushBot)" googlebot=1
SetEnvIf user-agent "(?i:PetalBot)" googlebot=1
SetEnvIf user-agent "(?i:Bytespider)" googlebot=1
SetEnvIf user-agent "(?i:bingbot)" googlebot=1
<RequireAny>
Require ip 1.2.3.4
Require env googlebot
</RequireAny>
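Not part of the config, just a sanity check I found handy: the (?i:...) in the SetEnvIf patterns makes the match case-insensitive, which you can mimic offline with grep -i to confirm the patterns catch the crawlers however they capitalize themselves:

```shell
# Sketch: (?i:...) in the SetEnvIf lines means case-insensitive;
# grep -Ei mimics that for a quick offline check of the patterns.
# The sample user agents are stand-ins, not real logged values.
for ua in 'Googlebot/2.1' 'SEMRUSHBOT' 'petalbot' 'Bytespider' 'BINGBOT/2.0'; do
  if printf '%s' "$ua" | grep -Eiq 'googlebot|semrushbot|petalbot|bytespider|bingbot'; then
    echo "$ua -> env googlebot set (allowed by RequireAny)"
  else
    echo "$ua -> no env (denied unless the IP matches)"
  fi
done
```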
I was also originally trying to tie the RewriteRules to the
RequireAny block using <If>, but then realized I didn't even
need to: each is processed independently anyway. It looks so
simple now, but it took me a while to make it this simple.