Re: Optimizing grep, sort, uniq for speed




Thank you, Mark and Gordon.  The hostnames I needed to collect are
always in the same field, at least in the lines of the file that
matter, so I ended up using suggestions from both of you.  The code
looks like this now.  The egrep is there to make sure whatever is in
the 9th field looks like a domain name.

# The dot before "com" is escaped so it matches only a literal dot.
for host in $(awk '{ print $9 }' "${TMPDIR}"/* |
              egrep '[-.0-9a-z][-.0-9a-z]*\.com' | sort -u); do
    HOSTS+=("$host")
done
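
For anyone reading this later, the same idea can probably be folded
into fewer processes by letting awk do the pattern test itself.  This
is only a sketch, not what I actually ran: it assumes bash 4 or newer
(for mapfile), LOGDIR is a made-up stand-in for my ${TMPDIR}, the
sample lines are invented, and the pattern is anchored to the whole
field, which is stricter than the egrep above.

```shell
#!/usr/bin/env bash
# Hypothetical sample data standing in for the real log files.
LOGDIR=$(mktemp -d)
printf '%s\n' \
  'a b c d e f g h www.example.com tail' \
  'a b c d e f g h mail.example.com tail' \
  'a b c d e f g h www.example.com tail' \
  'a b c d e f g h not-a-host tail' > "${LOGDIR}/sample.log"

# Test field 9 inside awk, then de-duplicate; both run in the C locale.
# mapfile (bash 4+) reads one array element per output line.
mapfile -t HOSTS < <(
  LC_ALL=C awk '$9 ~ /^[-.0-9a-z]+\.com$/ { print $9 }' "${LOGDIR}"/* |
    LC_ALL=C sort -u
)

printf '%s\n' "${HOSTS[@]}"
rm -rf "${LOGDIR}"
```

Reading the array with mapfile also avoids the word splitting and
glob expansion that the for loop over $(...) is subject to.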

Original script:
real    28m11.488s
user    26m57.043s
sys     0m30.634s

Using awk instead of grepping the entire batch:
real    6m14.949s
user    5m0.629s
sys     0m26.914s

Using awk with export LANG=C:
real    2m50.611s
user    1m20.849s
sys     0m27.366s
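
One caveat on the locale setting: LC_ALL takes precedence over LANG,
so if LC_ALL is already set in the environment, export LANG=C alone
changes nothing.  A small sketch of scoping the C locale to a single
command instead (the three hostnames are invented sample input):

```shell
#!/usr/bin/env bash
# LC_ALL=C applies only to this one sort invocation, not the whole shell.
sorted=$(printf '%s\n' b.com A.com a.com a.com | LC_ALL=C sort -u)

# In the C locale, sort compares raw bytes, so "A" sorts before "a".
printf '%s\n' "$sorted"
```

Setting it per command keeps the rest of the script in its normal
locale, which matters if anything else in the script depends on it.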

Awesome, thanks for the tips!



> For one, do the sort in one step: sort -u. For another, are the hostnames
> always the same field? For example, if they're all /var/log/messages, I'd
> do awk '{print $4;}' | sort -u

> You have two major performance problems in this script.  First, UTF-8
> processing is slow.  Second, wildcards are EXTREMELY SLOW!

> You'll get a HUGE performance boost from prefixing your search with some
> known prefix to your regex.
_______________________________________________
CentOS mailing list
CentOS@xxxxxxxxxx
http://lists.centos.org/mailman/listinfo/centos


