robbat2 | [Gentoo] Upgrading/using nss_ldap/nss_mysql/nss_nis/nss... and not breaking your system

So lately there have been a lot of complaints about nss_ldap-249+ breaking systems on boot. The source of this is not actually a breakage, but a change in behavior that exposed something that was always broken. Many of the comments below apply to all NSS backends where the actual data source might not be available during the early phases of booting (because the LDAP server may not have started yet, or the network may not be up).

In your /etc/nsswitch.conf file, you may have lines like:
passwd: files ldap
group: files ldap
If you have it the other way around, that's the first cause of breakage. Sources that are always available (such as files) need to be listed first, so that they are the ones consulted first at system boot time.
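A quick way to check the ordering is to scan nsswitch.conf for databases whose first source is not a local one. The following is only a sketch: the function name `remote_first_databases` and the choice of `files` and `compat` as "local" sources are my assumptions, and it works on text you pass in rather than reading /etc/nsswitch.conf itself.

```python
import re


def remote_first_databases(nsswitch_text, local_sources=("files", "compat")):
    """Return the databases whose first listed source is not a local one."""
    flagged = []
    for raw in nsswitch_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line or ":" not in line:
            continue
        database, _, rest = line.partition(":")
        # Remove action items such as [NOTFOUND=return]; keep source names only.
        rest = re.sub(r"\[[^\]]*\]", " ", rest)
        sources = rest.split()
        if sources and sources[0] not in local_sources:
            flagged.append(database.strip())
    return flagged


sample = """\
passwd: ldap files
group:  files ldap
shadow: files [NOTFOUND=continue] ldap
"""
print(remote_first_databases(sample))
```

To check a real system, pass in the contents of /etc/nsswitch.conf instead of `sample`.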

During boot, nearly every init script causes at least one lookup. Some, like udev, cause a lot of lookups. If a lookup can be answered by the files NSS backend, it never needs to go to LDAP (or any other unavailable backend). In the case of udev, this rule has been around for a very long time:
/etc/udev/rules.d/50-udev.rules:KERNEL=="tpm*", NAME="%k", OWNER="tss", GROUP="tss", MODE="0600"
This causes udev to look up the user and group 'tss' (that's two lookups). Does your system have a 'tss' user and group? Unless you have the app-crypt/trousers package installed, you probably don't.
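To find other rules with the same problem, you can compare the names used in udev rules against the local account files. This is a rough sketch, not a udev parser: the function name is made up, it only matches plain `OWNER="..."` and `GROUP="..."` assignments, and it assumes that purely numeric values do not need a name lookup.

```python
import re


def missing_local_accounts(rules_text, passwd_text, group_text):
    """Return (users, groups) named in udev rules but absent from local files."""

    def names(db_text):
        # First colon-separated field of each non-empty, non-comment line.
        return {
            line.split(":", 1)[0]
            for line in db_text.splitlines()
            if line.strip() and not line.startswith("#")
        }

    owners = set(re.findall(r'OWNER="([^"]+)"', rules_text))
    groups = set(re.findall(r'GROUP="([^"]+)"', rules_text))
    # Assumption: numeric IDs are used as-is and are skipped here.
    owners = {o for o in owners if not o.isdigit()}
    groups = {g for g in groups if not g.isdigit()}
    return sorted(owners - names(passwd_text)), sorted(groups - names(group_text))


rules = 'KERNEL=="tpm*", NAME="%k", OWNER="tss", GROUP="tss", MODE="0600"\n'
passwd = "root:x:0:0:root:/root:/bin/bash\n"
group = "root:x:0:\n"
print(missing_local_accounts(rules, passwd, group))
```

Anything this reports is a name that would fall through from files to the next NSS backend during boot.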

Ok, so if this has always been a problem, why did it suddenly turn up now? nss_ldap-249 has a change of behavior (badly documented by upstream, unfortunately). It changed from hardcoded timeout numbers to configurable timeout numbers, and greatly increased the timeout values. Previously, if the server was not available or otherwise had issues, nss_ldap failed out after at most 30 seconds (and a lot less if the server IP/port were actually unreachable). As of 249, it takes 124 seconds. It tries twice, then waits 4 seconds, then another 8 seconds, another 16 seconds, another 32 seconds, and finally another 64 seconds, with an attempt between each of the waits. Unfortunately this behavior is serial, and happens for every lookup. udev tries to look up user 'tss', then group 'tss', etc. On some systems, this made the boot-up unbearably slow, as there were 30+ lookups that went to nss_ldap, at 2 minutes each, leading to an hour of waiting before the actual login prompt came up.

How do we fix this?
The proper way: For every Gentoo init script, we need to make sure that every value looked up is actually in the system files, so that no requests go to nss_ldap or any other remote backend. In the case of udev, this is a known flaw of udev, that it looks up stuff it doesn't need to. If somebody has enough time to look at the udev code, upstream would greatly appreciate it - they don't have enough time to do it. You can comment out the tss line temporarily as well if you want.
The temporary hack: I've committed nss_ldap-250-r1, which changes the default timeouts in the header files. It also documents them in /etc/ldap.conf, along with the old values and some even faster (read: more dangerous) ones.

Side note: It does seem there is something that changed with regards to SSL behaviour in either openldap-2.3.* or nss_ldap between 239 and 249. In some setups, 'ssl on' no longer works, but specifying a plain ldap:// URL instead of ldaps://, and using 'ssl start_tls' works perfectly fine. If you run into this, move to TLS!

-#define LDAP_NSS_TRIES 5 /* number of sleeping reconnect attempts */
-#define LDAP_NSS_SLEEPTIME 4 /* seconds to sleep; doubled until max */
-#define LDAP_NSS_MAXSLEEPTIME 64 /* maximum seconds to sleep */
+#define LDAP_NSS_TRIES 4 /* number of sleeping reconnect attempts */
+#define LDAP_NSS_SLEEPTIME 1 /* seconds to sleep; doubled until max */
+#define LDAP_NSS_MAXSLEEPTIME 16 /* maximum seconds to sleep */
 #define LDAP_NSS_MAXCONNTRIES 2 /* reconnect attempts before sleeping */
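The arithmetic behind those numbers can be sketched as follows. This is a model of the behavior described in the post (sleep, double, cap at the maximum), not a copy of the nss_ldap code; the function name is made up, and it counts only the sleeps, ignoring the time taken by the connection attempts themselves.

```python
def reconnect_sleep_schedule(tries, sleeptime, max_sleeptime):
    """Seconds slept before each reconnect attempt, doubling up to a cap."""
    schedule = []
    delay = sleeptime
    for _ in range(tries):
        schedule.append(delay)
        delay = min(delay * 2, max_sleeptime)
    return schedule


# Values taken from the old and new #define lines in the patch.
old = reconnect_sleep_schedule(tries=5, sleeptime=4, max_sleeptime=64)
new = reconnect_sleep_schedule(tries=4, sleeptime=1, max_sleeptime=16)

print(old, sum(old))  # [4, 8, 16, 32, 64] 124
print(new, sum(new))  # [1, 2, 4, 8] 15

# 30 serial lookups at the old defaults, in minutes.
print(30 * sum(old) / 60)  # 62.0
```

Under this model, the new defaults cut the worst case per failed lookup from 124 seconds of sleeping to 15.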

From: (Anonymous)

hi there,

for my part, i worked around this problem with a hack in the config file. I found an undocumented feature of the new nss_ldap that can be used in ldap.conf:

nss_reconnect_tries 2

with bind_policy hard it still takes time to boot, but not so much (it tries only 2 times, instead of ... nearly forever). Moreover, i recently found that you can change the timeout period with another (also undocumented) parameter. I haven't had the time to test it yet, but i think we can use it as a workaround?

bye bye

mRyOuNg

From: robbat2.livejournal.com

Please re-read the above post.
Specifically, I said:

The temporary hack: I've committed nss_ldap-250-r1, which changes the default timeouts in the header files. It also documents them in /etc/ldap.conf, along with the old values and some even faster (read: more dangerous) ones.

If you emerge 250-r1, you will see that nss_reconnect_tries is one of those settings that I have changed. From my patch:

ok, i should give a try to 250-r1 then ;)

When is this ebuild planned to be in the x86 tree?

bye bye
mRyOuNg

I too ran into this annoying bug. robbat2, THANK-YOU for the new ebuild. I ran into one little problem when I attempted to upgrade to nss_ldap-250-r1 - portage would just hang and not do anything at all.

In /etc/nsswitch.conf I *temporarily* changed

passwd: files ldap
shadow: files ldap
group: files ldap

to

passwd: compat
shadow: compat
group: compat

and then portage worked fine. I was booted to the live cd and chrooted, if that makes any difference at all. Then I switched my nsswitch.conf back to

passwd: files ldap
shadow: files ldap
group: files ldap

The system boots much better now.

I don't think this means what you think it means. I think you'll find the rest of the information you need to properly configure things in the libc info pages.

info libc "NSS Configuration File" "Notes on NSS Configuration File"

hydrian.livejournal.com

All I did was start to emerge the app-crypt/trousers package, and once the package had created the user/group I sent a break to it. That should keep udev from going to LDAP when looking up the tss user/group.

Or just run the command

# adduser tss

There is actually a user and group for trousers. Also, Gentoo tends to want certain users/groups to have certain UIDs/GIDs. If that ID is already in use, it will use the next available UID/GID instead. Sometimes this becomes an issue in later upgrades, usually because of badly written ebuilds. That is why I let the script create it.

in /etc/nsswitch.conf

passwd: files ldap
shadow: files ldap
group: files ldap

in /etc/passwd
ldap:x:439:439:added by portage for openldap:/usr/lib/openldap:/bin/false

I use a command

/usr/lib/openldap/slapd -u ldap

and nss tries to connect to ldap

... and /usr/lib/openldap/slapd -u 439 -g 439 has the same problem

I just happened across this page after spending the last 2 days trying to figure this out. This is exactly the issue that I have, and I'd like to add a couple of additional things I've noticed:

1) Setting "bind_policy soft" DOES NOT WORK. It appears to work, as things like "getent passwd" return immediately (instead of blocking) after dumping /etc/passwd, but if you try to ssh in as an ldap user, sshd will bail out with:

nss_ldap: could not search LDAP server - Server is unavailable
fatal: login_get_lastlog: Cannot find account for uid 1000

2) Putting the user in /etc/passwd DOES NOT WORK. nss_ldap will go through the passwd file first, and will indeed find the user, but it will still attempt to do the ldap lookup. Once that fails, it will return the info it found in /etc/passwd. There needs to be a way to short-circuit this (see #4 below).

3) Your comments about init scripts are spot on, and affect /etc/init.d/slapd as well. If your ldap server is using ldap auth, restarting slapd will take the entire timeout period (about 3 minutes).

4) I tried messing with nsswitch.conf:

passwd: files [SUCCESS=return] ldap
shadow: files [SUCCESS=return] ldap
group: files [SUCCESS=return] ldap

This seems like the "right" way to fix this problem, but it does not appear to do anything.

IMHO, v249 should be masked. I've got servers running v239 that have run fine for months and handle this in a more sane way (if ldap isn't available, the lookup fails immediately instead of blocking). I just noticed this problem when setting up a new ldap infrastructure. Should I just revert to v239, or is this problem fixed in newer versions?

- Upgrade to 253! Make sure you use the new timeout settings!
- Try all combinations of the ssl setting in your /etc/ldap.conf.

In response to specific points above:
#2 - this indicates that something ELSE was being looked up in ldap. Turn on debugging to trace what it was.
#4 - no, [SUCCESS=return] is the default behavior.
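To illustrate point #4: glibc's documented default actions are return on SUCCESS and continue on NOTFOUND, UNAVAIL and TRYAGAIN, so writing [SUCCESS=return] changes nothing. The toy resolver below is only a model of that rule; the function names and the fake `files`/`ldap` backends are invented for the example.

```python
DEFAULT_ACTIONS = {
    "SUCCESS": "return",
    "NOTFOUND": "continue",
    "UNAVAIL": "continue",
    "TRYAGAIN": "continue",
}


def resolve(sources, key):
    """Walk the sources in order; return (value, names of sources consulted)."""
    consulted = []
    for name, lookup, overrides in sources:
        consulted.append(name)
        status, value = lookup(key)
        actions = {**DEFAULT_ACTIONS, **overrides}
        if actions[status] == "return":
            return (value if status == "SUCCESS" else None), consulted
    return None, consulted


def files_lookup(key):
    # Stand-in for /etc/passwd containing only root.
    local = {"root": "root:x:0:0"}
    return ("SUCCESS", local[key]) if key in local else ("NOTFOUND", None)


def ldap_lookup(key):
    # Stand-in for an LDAP server that is down.
    return ("UNAVAIL", None)


plain = [("files", files_lookup, {}), ("ldap", ldap_lookup, {})]
explicit = [("files", files_lookup, {"SUCCESS": "return"}), ("ldap", ldap_lookup, {})]

print(resolve(plain, "root"))     # ('root:x:0:0', ['files'])
print(resolve(explicit, "root"))  # ('root:x:0:0', ['files'])
print(resolve(plain, "tss"))      # (None, ['files', 'ldap'])
```

In both configurations a name found in files never reaches ldap; only names missing from files fall through to the unavailable backend.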

The source of #2/#3 is that coreutils and other things occasionally attempt to look up a numeric uid/gid as a string! (e.g. against the first column of /etc/{passwd,group}, and the related field in LDAP). This ALWAYS fails - it's just that LDAP failures take a long time.
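If you want to see which by-name lookups are slow on a given box, timing them directly is a simple first step. This sketch assumes a Unix system with Python's standard pwd module; the function name and the names in the loop are only examples, and the results depend entirely on your NSS setup.

```python
import pwd
import time


def timed_passwd_lookup(name):
    """Look up a user by name through NSS; return (found, seconds elapsed)."""
    start = time.monotonic()
    try:
        pwd.getpwnam(name)
        found = True
    except KeyError:
        found = False
    return found, time.monotonic() - start


# "1000" is a numeric uid passed as a name, the case described above.
for name in ("root", "tss", "1000"):
    found, elapsed = timed_passwd_lookup(name)
    print(f"{name!r}: found={found} elapsed={elapsed:.3f}s")
```

A name that is missing from /etc/passwd and takes a long time to come back as not found is one that is falling through to a slow or unreachable backend.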
